If you’ve gotten interested in picking up a programming language, Perl (and to be clear, we’ll be talking about Perl 5 here) is unlikely to be at the top of your list. If you look at rankings of the most popular or in-demand programming languages, Perl is way down towards the bottom, hanging out with less common languages like Haskell and Clojure. Web searches on Perl subjects often bring up pages (or mailing list results!) from years ago. At first glance, Perl can look like a curio from a different age.
But this appearance is deceiving. Perl has one of the best series of introductory books of any language, some great resources for present-day users (chromatic’s Modern Perl, in particular, is great), and a community that’s more vibrant than web searches might suggest. (And Perl also has a storied, fascinating past: for years, Perl scripts were the workhorses of the web, and Larry Wall, Perl’s creator, made the intriguing decision to draw upon the characteristics of natural languages when he designed the language.)
If you’re interested in the Digital Humanities, in particular, there are compelling reasons to look at Perl besides its historical interest: there’s a reason that a surprising number of DH people have a background in Perl. A simple example will illustrate why a little bit of Perl can be a powerful tool. If you’ve been using some form of Linux or Unix for any amount of time, you may have a sense that powerful magic can be wrought on the command line: you can get a list of the most frequent words in a file, for example, with a quick invocation of
uniq (well, a quick and dirty command would be
cat file | tr ' ' '\n' | sort | uniq -c | sort -rn). Fortunately, there are also good books on getting to know and mastering command line tools.
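A rough sketch of that pipeline on a made-up sample file (the file name and its contents are assumptions for illustration). The final sort -rn and head are what rank the counts and keep only the top few:

```shell
# Hypothetical sample text; any plain-text file works the same way.
printf 'the cat saw the dog\nthe dog ran\n' > sample_words.txt

# One word per line, group identical words, count them, rank by count.
tr -s ' ' '\n' < sample_words.txt | sort | uniq -c | sort -rn | head -n 3
# The top line should show a count of 3 for "the".
```

This is still quick and dirty: punctuation and capitalization are left alone, so "The" and "the," would be counted as different words.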
But let’s say you’ve dutifully digitized some modern novel or some interesting diplomatic transcription, and you’re ready to start exploring the text (with NLTK or something else). As soon as you start digging into it, however, you realize that your results are being contaminated by words that have been hyphenated across a line break: instead of ‘imagining’ as a single word, you have ‘imag-’ and ‘ining’, which is messing up your distant reading of imagination and cognition in VanderMeer’s Annihilation. What command line tool do you use to fix your text file so that you can get back to your analysis?
It’s actually a trick question: pore as you will through the venerable Unix Power Tools, you’ll find that *nix command line tools almost all operate on a single line of text by itself (and, in fact, many strip out the newline before processing it), and so the
sed command you have to run to cut out your hyphenations will end up being pretty gnarly. But there’s an easier way: the Perl one-liner
perl -p -i.bak -e 's/-\n//' your_file.txt (with your_file.txt standing in for whatever file you want to fix) will actually do it for you with far less trouble. With this command, Perl will both edit your file in place and save a backup of your original file in case you screw something up; if you’re feeling foolhardy, you can just use:
perl -p -i -e 's/-\n//' your_file.txt. In a single, easy line, Perl solves a problem that would take substantially more effort to solve another way, whether you try to use a command line tool or write a script in a different language.
This, in a nutshell, is the magic of Perl: Perl can do some extremely useful things in a remarkably concise way. Even more powerfully, Perl scripts also allow you to concisely roll your own tools for working with files on the command line. Have a few files that have been junked up with HTML
<br /> tags? A Perl script of about six lines can strip them all out for you:
$^I = ".bak";
(To be clear, though, parsing HTML with a quick regular-expression script like this is a Bad Idea: it’s the kind of thing that causes even the Elder Gods to shift in their troubled slumber. More robust ways to work with HTML are easily accessible over at CPAN, Perl’s module repository.)
Nor are you confined to such simple tools: Perl is absolutely brilliant for those cases when you’ve got a set of files you need to modify in some idiosyncratic way that would be complicated and time-consuming to do by hand, or even with command-line tools (say, converting a set of notes from one format to another while preserving the structure of your data, or restructuring a BibTeX or BibLaTeX file), and Perl allows you to do it in scripts that aren’t much longer than the one above.
These cases, for my money, are where Perl is absolutely unbeatable: when you’re doing something powerful in one hundred lines of code or less and you can keep all of the moving pieces of your script in your head at the same time. In my experience (and ymmv, of course), Perl scripts that are substantially longer than that can be a headache to use and extend. More than that, the concision and nifty little tricks that make Perl such a powerful tool can also make a script completely inscrutable if someone else (or, more likely, you in a few months) needs to modify or fix the damn thing. In the script above, for example,
$^I is the magical Perl variable you set to a) modify a file in place and b) specify what the extension of your backup file will be (it essentially works like a command-line switch). In many other scripts, you’ll see the variable
$_, which is often the variable Perl will supply if you don’t give a variable explicitly (and yes, you can do that; it’s part of what makes the script given above so concise). But you’re going to be left scratching your head if you haven’t seen these variables before, and these are the ones that you see a lot. In other cases, Perl can seem needlessly obscure: the index of the last item in the array
@array, for example, lives in
$#array; in many other languages, you’re obliged to express the same thing with less concision, but far, far more clarity.
But for all of its annoyances and flaws, there’s something thrilling about Perl. The first time you run a Perl script that does what you want, you get a taste of the wizardry that the omnipotent coders who populate TV shows and movies seem to have: with a few keystrokes, you’re able to command immense power.
A few quick pointers will make your early experiments with Perl easier. If you’ve installed Perl and written a script but are having trouble getting it to run, there are a few things to check.
1. Does your copy of Perl live where your script thinks it does? Perl often lives in
/usr/bin/perl, but not always, and it may be better for the first line of your script (entertainingly called the “shebang”), which tells your computer where to find the program to run it with, to read
#!/usr/bin/env perl. (You can check where the copy of Perl on your computer lives by running
which perl on the command line.)
2. Have you made your Perl script executable? Before you’ll be able to run your script, you need to make sure that you’ve set its permissions correctly. Run
ls -l on it; if the first characters that are printed contain no ‘x’ (looking like
-rw-r--r-- instead of
-rwxr-xr-x), you need to run
chmod +x or
chmod u+x on your script.
3. Does attempting to run your script result in ‘command not found’ instead of Perlish wizardry? You probably typed
your_script.pl. You need to make sure that you’re telling the shell to run the copy of the script in the current directory by running
./your_script.pl instead.
Like the language itself, Learning Perl (7th ed.), by Randal Schwartz and brian d foy, is a miracle of concision: whenever I turn back to it, I’m always surprised at how many powerful features of the language they manage to introduce in the first few chapters. Similarly, if you want to get the most out of Perl, you’ll want to delve more deeply into regular expressions, a powerful way of describing patterns that’s widely used in Perl, other languages, and on the command line. There are a number of good resources for learning regular expressions out there, but Friedl’s Mastering Regular Expressions (3rd ed.) is a classic, and, if you really want to get into the weeds, the most recent edition of Programming Perl goes into a lot of detail on how Perl implements things like character classes (
\w, etc.), which, in the Unicode era, can be trickier to use than explicit character classes (