JavaScript for the Slow-Moving

This’ll be brief, but I wanted to make a quick plug for Marijn Haverbeke’s Eloquent JavaScript for anyone who might be interested in learning or brushing up on JavaScript. JS is not something I work with all that much, but I’ve really been liking Haverbeke’s approach. He does a nice job of laying out the features and quirks of the language, weaving in CS concepts, thoughtful exercises, and even a sprinkling of advanced subjects along the way. I really admire that– as few programming language primers do– he manages to cover a huge range of the ways JS is actually used in the world (though, presumably for reasons of feasibility, not the vast forest of JS frameworks), and not just some idealized subset of the language. To be entirely frank, it’s not a book with a ton of hand-holding, but I’m finding it a great read.

And– even better– there’s a new edition, freely available now at https://eloquentjavascript.net/3rd_edition/, which incorporates discussion of what has changed in JS with ECMAScript 6.

Making a Start with d3.js

Over the weekend, I came to the late-night realization that a project I’ve been working on for several years does not, primarily, require me to write up more articles (though I do, indeed, plan to). Rather, I realized that, in this domain, merely writing up a selection of one’s results gives only an impressionistic– and sometimes even misleading– understanding, when what’s needed in many cases is a more global view of the whole. My past attempts to present such a global view, however, have been hampered by the difficulty of conveying its nuances and complexity in tabular or textual form. What needs to exist, I believe, is a tool for visualizing the many dimensions of this project– but such a tool emphatically does not yet exist (my life would be far easier if it did), so I’ll just have to build it.

The nice thing, however, is that this visualization tool will, in the long run, nicely stitch together several of my big projects in a way that shows why they’re so important. Moreover, because the tools for making data visualizations have matured in the last decade or so, I believe it’ll be possible to make a really dynamic and powerful visualization that is also robust and (hopefully) future-proof, and to provide a set of tools for scholars who want to do similar things in other areas as well.

Over the next couple of months, then, you’re going to see a bit more discussion of data visualization than is usual here, and a number of posts about me finding my feet in both d3.js and JavaScript more generally. I don’t imagine that I’ll be breaking much new ground in these early stages, but I suspect that my visualization will actually be pretty impressive if I can pull it off. To begin with, however, I’ll be pulling together some resources for working in d3.js into an annotated bibliography. (And I may even give some pointers on getting started with JavaScript, too.)

An Introduction to Text File Munging with Perl

If you’ve gotten interested in picking up a programming language, Perl (and to be clear, we’ll be talking about Perl 5 here) is unlikely to be at the top of your list. If you look at rankings of the most popular or in-demand programming languages, Perl is way down towards the bottom, hanging out with less common languages like Haskell and Clojure. Web searches on Perl subjects often bring up pages (or mailing list results!) from years ago. At first glance, Perl can look like a curio from a different age.

But this appearance is deceiving. Perl has one of the best book series for learning a language out there, some great resources for present-day users (chromatic’s Modern Perl, in particular, is great), and a community that’s more vibrant than web searches might suggest. (And Perl also has a storied, fascinating past: for years, Perl scripts were the workhorses of the web, and Larry Wall, Perl’s creator, made the intriguing decision to draw upon the characteristics of natural languages when he designed the language.)

If you’re interested in the Digital Humanities, in particular, there are compelling reasons to look at Perl beyond its historical interest– there’s a reason that a surprising number of DH people have a background in Perl. A simple example will illustrate why a little bit of Perl can be a powerful tool. If you’ve been using some form of Linux or Unix for any amount of time, you may have a sense that powerful magic can be wrought on the command line: you can get a list of the most frequent words in a file, for example, with a quick invocation of tr, sort, and uniq (well, a quick and dirty version would be cat file | tr ' ' '\n' | sort | uniq -c | sort -rn). Fortunately, as well, there are good books on getting to know and mastering command line tools.

But let’s say you’ve dutifully digitized some modern novel or some interesting diplomatic transcription, and you’re ready to start exploring the text (with NLTK or something else). As soon as you start digging into it, however, you realize that your results are being contaminated by words that have been hyphenated across a line break: instead of ‘imagining’ as a single word, you have ‘imag-’ and ‘ining’, which is messing up your distant reading of imagination and cognition in VanderMeer’s Annihilation. What command line tool do you use to fix your text file so that you can get back to your analysis?

It’s actually a trick question: pore as you will through the venerable Unix Power Tools, you’ll find that *nix command line tools almost all operate on a single line of text by itself (and, in fact, many strip out the newline before processing it), and so the sed command you have to run to cut out your hyphenations will end up being pretty gnarly. But there’s an easier way: the Perl one-liner perl -p -i.bak -e 's/-\n//' will actually do it for you with far less trouble. With this command, Perl will both edit your file in place and save a backup of your original file in case you screw something up; if you’re feeling foolhardy, you can just use: perl -p -i -e 's/-\n//'. In a single, easy line, Perl solves a problem that would take substantially more effort to solve another way, whether you try to use a command line tool or write a script in a different language.

This, in a nutshell, is the magic of Perl: Perl can do some extremely useful things in a remarkably concise way. Even more powerfully, Perl scripts also allow you to concisely roll your own tools for working with files on the command line. Have a few files that have been junked up with HTML <br /> tags? A Perl script of about six lines can strip them all out for you:

#!/usr/bin/env perl
# Edit each file named on the command line in place,
# keeping a backup of the original with a .bak extension.
$^I = ".bak";

while (<>) {       # read each line of each file into $_
    s/<br \/>//g;  # strip every <br /> tag from the line
    print;         # print $_, which goes back into the edited file
}

(To be clear, though, parsing HTML with regexes in a Perl script is a Bad Idea: it’s the kind of thing that causes even the Elder Gods to shift in their troubled slumber. More robust ways to work with HTML are easily accessible over at CPAN, Perl’s module repository.)
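To give a sense of what those more robust ways look like, here’s a minimal sketch using Mojo::DOM (one HTML parser among several on CPAN; it ships with the Mojolicious distribution, which you’d need to install first):

#!/usr/bin/env perl
use strict;
use warnings;
use Mojo::DOM;   # install the Mojolicious distribution from CPAN

# Read a whole HTML file at once and extract its text
# with a real parser instead of regexes.
local $/;                       # slurp mode: no input record separator
my $html = <>;
my $dom  = Mojo::DOM->new($html);
print $dom->all_text, "\n";

This is only one of many possible approaches (HTML::TreeBuilder and friends are also widely used), but it shows the shape of the thing: a few lines longer than the regex version, and far harder to fool.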

Nor are you confined to such simple tools: Perl is absolutely brilliant for those cases when you’ve got a set of files you need to modify in some idiosyncratic way that would be complicated and time-consuming to handle by hand, or even with command-line tools– say, converting a set of notes from one format to another while preserving the structure of your data, or restructuring a BibTeX or biblatex file– and it lets you do so in scripts that aren’t much longer than the one above.
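To give a (hypothetical) taste of what that can look like– a quick sketch, not a robust BibTeX parser– here’s a script that lists the entry type and citation key of every entry in a .bib file:

#!/usr/bin/env perl
use strict;
use warnings;

# For each line that opens a BibTeX entry, e.g. "@article{smith2010,",
# print the entry type and the citation key.
while (<>) {
    if (/^\s*\@(\w+)\s*\{\s*([^,\s]+)/) {
        print "$1: $2\n";
    }
}

Run it as ./list_keys.pl references.bib (the filenames are mine) and you have the skeleton of an inventory of your bibliography; swap the print for something more elaborate and you’re restructuring.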

These cases, for my money, are where Perl is absolutely unbeatable: when you’re doing something powerful in one hundred lines of code or less and you can keep all of the moving pieces of your script in your head at the same time. In my experience (and ymmv, of course), Perl scripts substantially longer than that can be a headache to use and extend. More than that, the concision and nifty little tricks that make Perl such a powerful tool can also make a script completely inscrutable if someone else (or, more likely, you in a few months) needs to modify or fix the damn thing. In the script above, for example, $^I is the magical Perl variable you set to a) modify a file in place and b) specify the extension of your backup file (it essentially works like a command-line switch). In many other scripts, you’ll see the variable $_, which is the variable Perl supplies if you don’t give one explicitly (and yes, you can do that– it’s part of what makes the script given above so concise). But you’re going to be left scratching your head if you haven’t seen these variables before, and these are ones you see a lot. In other cases, Perl can seem needlessly obscure: the index of the last item in the array @array, for example, lives in $#array; in many other languages, you’re obliged to express the same thing with less concision, but far, far more clarity.
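To make those variables concrete, here’s a tiny illustration (the word list is just an example):

#!/usr/bin/env perl
use strict;
use warnings;

my @words = qw(imagination cognition annihilation);

# $_ is Perl's implicit default: this print never names a variable.
print "$_\n" for @words;

# $#words is the index of the last element of @words...
print "last index: $#words\n";          # prints 2
# ...so this is the last element ($words[-1] gets you there too).
print "last word:  $words[$#words]\n";  # prints "annihilation"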

But for all of its annoyances and flaws, there’s something thrilling about Perl: the first time you run a Perl script that does what you want, you get a taste of the wizardry the omnipotent coders who populate TV shows and movies seem to have: with a few keystrokes, you’re able to command immense power.

Technical details:

A few quick pointers will make your early experiments with Perl easier. If you’ve installed Perl and written a script but are having trouble getting it to run, there are a few things to check.

1. Does your copy of Perl live where your script thinks it does? Perl often lives in /usr/bin/perl, but not always, and it may be better for the first line of your script– the line that tells your computer which program to run it with, entertainingly called the “shebang”– to read #!/usr/bin/env perl. (You can check where the copy of Perl on your computer lives by running which perl on the command line.)

2. Have you made your Perl script executable? Before you’ll be able to run your script, you need to make sure that you’ve set its permissions correctly. Run ls -l on it; if the first characters printed are ‘x’-free (looking like -rw-r--r-- instead of -rwxr-xr-x), you need to run chmod +x or chmod u+x on your script.

3. Does attempting to run your script result in ‘command not found’ instead of Perlish wizardry? You probably typed your_script.pl. You need to make sure that you’re telling the shell to run the copy of the script in the current directory by running ./your_script.pl.

Further Reading:

Like the language itself, Learning Perl (7th ed.), by Randal Schwartz, brian d foy, and Tom Phoenix, is a miracle of concision: whenever I turn back to it, I’m always surprised at how many powerful features of the language they manage to introduce in the first few chapters. Similarly, if you want to get the most out of Perl, you’ll want to delve more deeply into regular expressions, a powerful way of describing patterns that’s widely used in Perl, in other languages, and on the command line. There are a number of good resources for learning regular expressions out there, but Friedl’s Mastering Regular Expressions (3rd ed.) is a classic, and– if you really want to get into the weeds– the most recent edition of Programming Perl goes into a lot of detail on how Perl implements things like character classes (\d, \w, etc.), which, in the Unicode era, can be trickier to use than explicit character classes ([0-9] and [A-Za-z]).
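As a quick illustration of that Unicode wrinkle (the string here is my own contrived example):

#!/usr/bin/env perl
use strict;
use warnings;

# \d matches any Unicode digit, not just the ASCII 0-9.
my $s     = "42 \x{0663}";           # "42" plus ARABIC-INDIC DIGIT THREE
my $ascii = () = $s =~ /[0-9]/g;     # 2 matches
my $any   = () = $s =~ /\d/g;        # 3 matches
print "[0-9] matched $ascii times, \\d matched $any\n";

(If you want the old ASCII-only behavior, the /a modifier, available since Perl 5.14, restricts \d and friends to ASCII.)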

[edit, 03/11/2018: fixed a glaring typo in my one-liner!]

An Introduction to working with RTL languages and bidirectional text

Let’s say that you’re interested in the Digital Humanities, and you’re working on a project dealing with a textual corpus that’s written right-to-left (RTL), or where you’d like to incorporate some amount of RTL text into a text that’s predominantly written left-to-right (LTR)– say, for example, if you’re writing up a blog post in English with some Arabic in it. (The latter is called a bidirectional text.) We will return to some of the computational problems you might encounter in doing this in a later post, but for today let’s focus on the very simplest thing you might want to do– producing a document you might want someone else to read at some point– which we can quaintly call typesetting.

(Note here that these are separate problems from using transliterated, Romanized text: proper transliterations may require a bit more labor to type into a computer, but most transliterated characters can be rendered in LTR Unicode.)

What tools can you use? What will absolutely not work? And what can work if you’re willing to invest a little bit of time? This is a complicated, somewhat technical subject, and I’m happy to hear about any additions or corrections. The following is the result of a subjective (and very likely imperfect) attempt to draw more heavily on Arabic texts in DH and other work, and to quote from and work with Arabic sources in texts that have been predominantly written in English.

Before discussing specific tools, it may be helpful to give a quick overview of the way RTL and bidirectional texts work at a technical level. In the beginning, there was a wild profusion of encodings for different languages, and it is still possible (though much rarer) to run afoul of character encoding problems. (If you’ve ever opened a file and seen a bunch of little squares or question marks, you may have tried to read it in the wrong encoding.) In the past ten years or so, however, one Unicode encoding (UTF-8) has become the de facto standard for encoding a text: a huge percentage of webpages are encoded in UTF-8, for example, and UTF-8 support is baked into many programming languages in a way that makes working with many of the languages of the world much, much easier if you’re an end user. (If you’re a programmer, Unicode has made your life more complicated in some ways.) And if you’re a fan of emoji, they are widespread and interoperable because the Unicode Consortium specifies and describes their character encodings– there are a number of interesting technical wrinkles there, too.

But the varying directionalities of writing systems present a different set of problems. One technical solution for a text in a single language is to specify, for example, that the entire document renders RTL (CBA) even if the file itself is stored LTR (ABC). But if you have a dictionary with entries in a mix of Hebrew and English, for example, you can’t specify language direction at the outset; there has to be some way of indicating that these characters go RTL while those are LTR. (I don’t know if this helps, but you want the characters in a bidirectional string to be displayed ABC FED GHI and not ABC DEF GHI, where DEF are the characters in an RTL language.) There are a couple of different ways to solve this, but the central approach in Unicode is that characters have inherent directionality: Hebrew and Arabic characters are rendered RTL (unless they have a good reason not to be: there are actually varying degrees of directionality and invisible control characters that can force directionality… for the full story, both Korpela’s Unicode Explained, pp. 265ff. and the full Unicode bidirectionality spec provide more detail).
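If you’re curious what those invisible control characters look like in practice, here’s a tiny Perl sketch (my own illustration; the code points are from the Unicode standard, though whether your terminal honors them is another question):

#!/usr/bin/env perl
use strict;
use warnings;
binmode STDOUT, ':encoding(UTF-8)';

# Wrap a string in RIGHT-TO-LEFT EMBEDDING (U+202B) and
# POP DIRECTIONAL FORMATTING (U+202C) to force an RTL run.
sub force_rtl {
    my ($text) = @_;
    return "\x{202B}" . $text . "\x{202C}";
}

print force_rtl("text to render right-to-left"), "\n";

(The Unicode bidi spec now generally recommends the newer isolate characters, U+2066 through U+2069, over the older embeddings, but the principle is the same.)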

If your passion is not the minutiae of Unicode, however, you probably just want to know what works and what doesn’t. The good news is that a lot of things have some measure of bidirectionality support. The bad news is that this support can vary dramatically. Here are some options, with the degree of support indicated by emoji. This is a big subject, and I haven’t been able to revisit all of the tools (and their possible configurations) listed below, so again, don’t hesitate to get in touch with any advice or corrections.

  • MS Word/Office in Windows 🙂 Support for RTL languages in modern versions of Windows and Office is generally fine. I wouldn’t recommend doing a lot of it, but it’ll work in a pinch. Bidirectionality, on the other hand, can be a real headache, resulting in passages that won’t render correctly but where you can’t see what the problem is. If you need bidirectional support, it can be safer to cut and paste the opposite-direction passages in from another document at the end of your editing process.
  • MS Office in macOS 😠 For indefensible reasons, RTL and bidirectional support was left out of most versions of Office for Mac until the March 16, 2016 update to Office 2016 for Mac (iOS support came earlier). This means that sharing documents with anyone running an older version of Office will not work, that your documents will be garbled if you open them in an old version and accidentally save them, and that exporting Office documents to another format (say, an Excel file to a CSV) will not work either. Feel free to try it now– I haven’t– but be careful.
  • Markdown ❓ Markdown is a way of writing a plain text file so that it can be converted into HTML. Given that a number of flavors of Markdown support the use of HTML tags, you can use those tags to specify the directionality of a document, though bidirectionality appears to remain largely unimplemented. There’s more discussion here. (If you’re interested in a lightweight markup language for Arabic texts, however, Maxim Romanov’s Open Arabic mARkdown project is worth checking out.)
  • Pandoc 😐 Your handy document conversion tool Pandoc now has some support for bidirectional text, which is quite a bit better than it used to be. For years, Pandoc would garble RTL text, but you can now specify the directionality of a document on the command line (with pandoc -V dir=rtl) or at the beginning of a document. As in Markdown, directionality can also be specified at the paragraph level by using HTML div elements that specify directionality. To be honest, I still find it a bit fiddly, and I have not had great success consistently converting either RTL or bidirectional texts with Pandoc, but more information about Pandoc’s RTL and bidirectionality support can be found here (scroll down) and here. (I suspect that there are further refinements to be made: Pandoc does not appear to make allowance for characters with weak directionality, and I have not had enough success with Pandoc to see how well character mirroring works, either.)
  • LaTeX (TeX, ConTeXt, XeLaTeX, etc.) 🙂 Given that TeX and its descendants provide some of the most powerful tools for typesetting a document and producing something camera-ready, LaTeX has an enormous amount of potential for typesetting bidirectional documents. However, you will likely find that LaTeX is chiefly used to typeset documents in the sciences, and that documentation addressing the needs of humanities scholars is harder to come by. In working with LaTeX, much depends on hammering out your preamble (including the order of the packages you’re using) and getting the various parts of your LaTeX setup to play nicely with each other. (If this sounds complicated, well… it is.) If you do want to work with multiple, bidirectional languages in LaTeX, I have used polyglossia in the past. (The other major option is babel, which I have not used as extensively.) A clear discussion of multilingual LaTeX typesetting and some sage advice (including on RTL typesetting and working with bidirectional text) can be found in David J. Perry’s “Creating Scholarly Multilingual Documents” (though be advised that some things may have changed since it was written in 2010), and in the polyglossia documentation itself.
  • Text editors 🤔 A full discussion of RTL and bidirectional support in text editors is beyond the scope of this piece, but a brief listing can be found here. RTL languages have decent support in many popular and/or modern text editors (and Linux support for RTL languages is pretty good, actually). On the other hand, it’s important to point out that some of the claims made on the subject (and in that handy Wikipedia table) are overly optimistic. Vim, for example, is claimed to have bidirectional support, but a quick search makes clear that bidi is basically unsupported. More anecdotally, I’ve found Notepad++ to be pretty decent, bidirectional text in Atom can be a bit gnarly (in particular, when you’re dealing with characters that are mirrored or have weak directionality), and support in Sublime Text has been pretty bad (even for RTL documents, which other editors handle fine).
    Finally, though I’ll skip the longer case for GNU Emacs, RTL/bidi in Emacs is less terrible than a lot of the other options (and the manual goes into helpful detail about its bidi implementation). Emacs does not always put point (the cursor) in a visually helpful place in bidi documents, but the fact that you can always navigate the structure of your document with keyboard shortcuts can be very helpful.
  • XML, HTML, etc. 😢 Let’s be blunt: tag-based markup languages for RTL languages make baby Jesus cry. You know the situation is bad when even the W3C working group laconically notes the “lack of good editing environments” for RTL-language HTML authoring. Because the tags of these markup languages are LTR, any RTL text is a bidirectional text, and so you quickly run into all of the complexities of working with bidirectional text; and it can be very difficult to see, at a glance, whether you’ve written correct HTML (/XML/whatever). If you’ve made headway with this, you’re a braver soul than I am. (A small sketch of what the markup itself involves follows this list.)
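For what it’s worth, here is a minimal sketch of the markup involved, using HTML’s standard dir attribute (the Arabic phrase is just a placeholder of mine):

<!-- A document whose base direction is right-to-left -->
<html dir="rtl" lang="ar">
...
</html>

<!-- An LTR English paragraph quoting an RTL Arabic phrase; dir on the
     span keeps surrounding punctuation and numbers in the right order -->
<p>The phrase <span dir="rtl" lang="ar">العربية الفصحى</span> appears inline.</p>

Getting this markup right is not the hard part; seeing at a glance that it is right, in an editor where the tags and the text run in opposite directions, is.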

Even in 2018, then, there are few solutions that let you display bidirectional texts with great success and a minimum of bother: it’s just one of those things that you have to fiddle with a bit to get working right. But the situation in 2018 is better than it was five years ago, so we can always hope that support for bidirectional text in the next five years will make this post completely irrelevant. Meliora speramus– we hope for better things.

Mastodon: Space and Community

Over the past year or so, there’s been a soft but steady drumbeat of news about the social networking platform Mastodon. It got written up in tech circles a couple of times, was mentioned on a higher education site or two, and has been loudly praised and more quietly mused about over on Metafilter. But there’s been less hyperbolic coverage than sometimes accompanies these things, and it’s easy to think that the groundless hype that has accompanied other sites and services (cough Diaspora cough) may have made people leery of the next big social media thing. Also, my sense is that it’s only been in the past few months that Mastodon has become really vibrant: I joined mastodon.social back in December 2016, but the vibe then, and for several months after, was like that of a lot of niche social networking sites: a lot of promise, but not a lot of people.

There are important things to say about the technical side of the platform– it’s part of the reason that Mastodon has so much potential, and why I continued to fund development of it even when I wasn’t using it all that much. It’s based on ideas of decentralization and federation that have been kicking around the indie/FOSS social media world for a number of years, but which will seem completely baffling if they’re new to you.

But all of this stuff is not the important thing, because something funny has happened in the past few months: Mastodon has become a community, and one that reminds us of what communities are and can be. And a niggling little thing made me realize this: Mastodon gets quiet.

For years, I worked in the public spaces people hang out in most– coffee shops, public libraries– and one of the things you learn is that these spaces have a life of their own. There’s a rhythm to the day, and times when you know it’s going to be busy. But there are also times when activity slows to a crawl, and there’s little to do but read or neaten the place up or work on longer-term projects. But an entire human ecosystem moves in tandem with these rhythms: there’s the lonely guy who always stops in during the afternoon lull and stays to talk, there’s the adorable older couple that comes in every day at the same time, the rush of students, the slow-moving retirees. As Gombrich knew, rhythm is woven into our bodies, and there’s something deeply satisfying about places like these that allow for the rhythms of human existence.

In the last few years, however, I’ve realized that I’ve lost that sense of rhythm. If I’m bored waiting in line or on a bus, I can always jump back into reading the news, surround myself with the voices of the smartest podcasters, research something else I might want to buy. Some of this is due to the confluence of technological factors and the human desire for stimulation, but I think it’s the Skinner box of corporate social media that has rewired me. Before the ubiquity of fast internet, I was often alone: I was one of the people who relied on those clean, well-lighted places. But those places only get you so far, and you still have the ache of walking home in the dark by yourself. Or you used to.

Because we now live in a time of paradisiacal and almost unimaginable abundance. With its simple, homey design, Facebook gently guides you to what you want to see, gives you glimpses into the lives of old friends, and shows you what’s been going on in the world, or adorable pictures of what your nieces and nephews are doing. Twitter, by contrast, is like some galactic hub where the quickest wits, sharpest minds, and most reprehensible dirtbags from known space have gathered together to have a conversation or, more likely, yell at each other. And at any hour of the day, you can check in and feel like you’re in the thick of it: there’s always something going on on Twitter, and there are always more kids and cats and opinions and feels on Facebook, detached in time and presented to you in a single, unending scroll.

But Mastodon isn’t like that at all. There are days almost no one’s around, or evenings people go to bed early. There are times everyone is acting maniacally goofy and you’re not in the mood. There are times you find that you’ve come late to (or totally missed) an interesting conversation, and your belated rejoinders, carefully crafted as they are, can’t quite revive the subject: the moving finger of Mastodon writes, and having writ, moves on.

But the potential of Mastodon is precisely that it restores us, in some sense, to our postlapsarian existence. It reminds us that we are often lonely, bored, or both, that time passes rapidly, and that human connection is hard. But after you have been lonely in a new place for a while, the human connections are all the sweeter: you are immensely grateful to the people who are warm and welcoming at first, you’re impressed by the knowledge and wit of those around you, and you are reminded that a real community, which Mastodon is fast becoming, is an incredibly lucky thing.

Some great Digital Humanities resources

So I’ve been looking around for good materials on the Digital Humanities, and I’ve been a bit disappointed that so many of the resources out there don’t quite manage to do justice to either DH in relation to traditional scholarship or DH as a set of technologies. I’ve just come across Ted Underwood’s DH syllabi from earlier this year, though, which nicely cover both aspects of the field. I’m particularly impressed by the second syllabus (“Data Science in the Humanities”), which– though I’m not through all the materials on the syllabus yet– does an excellent job of giving in-depth consideration to the technical and interpretive challenges of the field.

His full blog post on both syllabi is well worth reading, too (as is his blog more generally!):

Two syllabi: Digital Humanities and Data Science in the Humanities.

Müller & Guido, Introduction to Machine Learning with Python (O’Reilly, 2016)

From O’Reilly and others, there’s been a profusion of data science books in the past few years. Given that many of these books are intended to introduce readers to data science methods and tools, it’s perhaps unsurprising that they overlap at various points: you’ve got to introduce the reader to NumPy, pandas, matplotlib, and the rest somehow, after all.

Müller & Guido’s Introduction to Machine Learning with Python is distinct from many of these other works in both its stated aims and its execution. In contrast to many of the more introductory books, Müller & Guido give readers with a serious interest in the practice of machine learning a thorough introduction to scikit-learn. That is to say, their Introduction largely eschews coverage of the data science tools often treated in introductory texts (though they briefly note the other tools they draw upon in Chapter 1). At the same time, because their book focuses on practice and on scikit-learn, they neither discuss the mathematical underpinnings of machine learning nor cover writing algorithms from scratch.

What is here is a comprehensive overview of things already implemented in scikit-learn (which is a considerable amount, as they show). More precisely, they focus on classification and regression in supervised learning, and clustering and signal decomposition in unsupervised learning. If your interest falls in those areas (particularly the former), their coverage is quite good. Chapters 2 and 3 discuss the algorithms for supervised and unsupervised learning respectively, and in considerable detail. That said– and though it’s somewhat less thorough– I might turn to the discussion of some of the same algorithms in Chapter 5 of VanderPlas’ Python Data Science Handbook before Müller & Guido’s; VanderPlas’ treatment is more conversational and less dry. (Note, however, that Müller & Guido do cover more territory.) Similarly, I was left wanting more from Chapter 7’s coverage of working with text.

Müller & Guido’s book really shines, though, when it discusses all of the other things that go into machine learning, beyond their march through the algorithms themselves. Chapter 4 discusses ways to numerically model categorical variables, also (briefly) covering ANOVA and other techniques of feature selection; Chapter 5 covers cross-validation and techniques for carefully tuning model parameters; Chapter 6 compellingly explains the importance of using the Pipeline class to prevent data leakage (during preprocessing, for example); and Chapter 8 discusses where scikit-learn and Python fit within the wider horizons of machine learning. The strongest parts of the book, then– and the parts where it’s the most fun to read– are where Müller & Guido discuss the practical details of machine learning. (One wonders if they felt a bit hamstrung by avoiding the mathematics of the algorithms they discuss.) There are points where the book is less engaging than other introductory data science books, but then it’s not really in the same category; rather than an introductory overview of the entire landscape, Müller & Guido provide a clear, comprehensive, detailed guidebook to one particular part of the map.