Workshop: Text Processing with Unix
Dusting off a powerful software toolbox

This workshop aims to give an overview of some basic Unix utilities. If you handle textual data digitally, you most likely use a general-purpose programming language like Python for it. The goal of this workshop is to draw attention to a more traditional, often overlooked, and sometimes surprisingly more efficient way to do it. We will work directly on the command line and use the GNU coreutils for demonstration. (Non-GNU implementations work similarly, but not always identically.) People who wish to code along need access to a Unix shell (for example, by bringing a laptop running Linux or macOS), but just watching is fine too. No programming experience is required.
"Textual data" can mean CSV files, corpus data, scraped articles, interview transcripts, or drafts of papers you are preparing. "Handling" that data can mean transforming it into some other format or checking some properties (like counting words, searching for strings, or spell-checking). The Unix way of approaching such problems is not to write a single new program that solves a high-level task. Instead, you chain together multiple already existing programs that each solve a very basic task. Such a chaining of programs is called a pipeline: conceptually, there is a stream of text, and the programs, so-called filters, manipulate that stream. This leads to very short programs that are easy to write on the spot. In fact, for many text-related tasks, solving the problem on the command line takes fewer characters than describing it in English.
Take the following story as an illustration. In 1986, Jon Bentley asked Donald Knuth (the inventor of TeX) to write a program to solve the following problem: "Given a text file and an integer k, print the k most common words in the file (and the number of their occurrences) in decreasing frequency." Knuth wrote a "literate program" spanning multiple pages (which, to be fair, also included its own explanation). Doug McIlroy of Bell Labs (the birthplace of Unix) replied with a program that did the job in less than 80 bytes. Instead of writing a high-level program, he simply chained together four existing Unix tools. (Find the original paper in the link list on the right.) We will see this and similar examples during the workshop.
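A pipeline in the spirit of McIlroy's solution looks roughly like this (a sketch from memory that may differ in detail from the published version; words.txt is a placeholder file name, and k is a shell variable set beforehand, e.g. k=10):

    # split the input into one word per line, fold everything to lower case,
    # sort, count identical words, sort by count, and keep the top k lines
    tr -cs 'A-Za-z' '\n' < words.txt | tr 'A-Z' 'a-z' | sort | uniq -c | sort -rn | sed "${k}q"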
If time permits, we might also touch on the AWK programming language and the roff typesetting system (Unix's alternative to TeX). AWK allows us to do arbitrary computations (including statistics and Markov chains) on the input (like Python), but is specialized for use as a filter (unlike Python). Roff, unlike TeX, can emit multiple output formats besides DVI/PS/PDF, including HTML and plain text, and does so in a single pass, without producing additional files.
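As a taste of AWK used as a filter, the following one-liner averages the values in the second column of whitespace-separated input (a sketch; the column choice and the file name data.txt are made up):

    # accumulate a sum and a line count, then print the mean at the end
    awk '{ sum += $2; n++ } END { if (n > 0) print sum / n }' data.txt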
Info
Day: 2025-05-17
Start time: 10:40
Duration: 01:00
Room: GWZ 2.115
Track: Other
Language: en
Speakers
Daniel Kalak