Janssens, Data Science at the Command Line (O’Reilly, 2014)

It normally takes me a week or two to read through a new tech book, but Janssens’ Data Science at the Command Line went by quickly. In part, this was because I was unusually excited about the premise of the book. I’ve been working with a number of my own data files recently, both on the command line and in Perl, and I was eager to learn new tricks and techniques. Does Janssens’ book live up to my (admittedly high) expectations? Partly, but the book was also a quick read because it’s more limited than I had hoped.

To start with the positive, Janssens’ book introduces users to a number of the most important command line tools: sed, awk, and grep, among others. A real strength of the book is that Janssens covers a number of lesser-known tools that are welcome additions to the usual suspects: jq (for working with JSON data), curlicue (a curl variant that handles the hassle of OAuth authentication), and the tools of csvkit (for both working with CSV files and converting other formats to CSV). Janssens has even written a few of his own tools that serve to soften the sometimes steep learning curve of the command line.

Janssens also gives a helpful overview of ways of working with data on the command line. Like many users, I know a fair amount about working with text at the command line, but Janssens opens up topics like creating attractive visualizations and using GNU Parallel for managing parallel commands. In giving this overview, Janssens demonstrates how the philosophy of the *nix command line can be applied to data analysis. However, the book seems intended to prove the viability of doing data analysis at the command line more than to serve as a systematic introduction. Important points are occasionally glossed over; the book fails to mention that regular users will be unable to chmod files they do not own without sudo, for example (pp. 44-5). Likewise, I imagine many readers would benefit from a clear discussion of using the tee command to save intermediate data to a file in the middle of a long pipeline. Janssens gives examples of using sed and awk, but with only brief explanations of how they operate; I imagine that many users will need to turn to the clearer, more systematic discussions in other resources (like Classic Shell Scripting or Unix Power Tools) to really move beyond the examples Janssens provides.
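
To illustrate the kind of tee usage I have in mind, here is a minimal sketch. It is my own example rather than anything from the book, and the file names (scores.csv, filtered.csv) and the sample data are made up for illustration.

```shell
# Use a throwaway directory so the sketch leaves nothing behind.
workdir=$(mktemp -d)

# Hypothetical sample data: one header line and three records.
printf 'name,score\nalice,90\nbob,72\ncarol,85\n' > "$workdir/scores.csv"

# tee copies its input to filtered.csv AND passes it down the pipe,
# so the intermediate result is saved while wc still counts the lines.
count=$(grep -v '^name,' "$workdir/scores.csv" \
  | tee "$workdir/filtered.csv" \
  | wc -l)

# Normalize whitespace, since some wc implementations pad their output.
count=$((count))

echo "records: $count"
cat "$workdir/filtered.csv"
```

The point is that the pipeline does not have to be run twice: filtered.csv holds exactly what grep produced, and the rest of the pipe carries on as if tee were not there.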

Furthermore, I’m not convinced that the command line is always the best approach, particularly if you’re more comfortable working with your data some other way. Some of the approaches Janssens suggests are rather clunky. There are heaps of XML (and HTML) data out there, for example, but the book suggests the awkward approach of converting HTML to JSON to CSV. Having spent time fussing with XML parsing, I genuinely understand the attraction of this approach, but it would have been nice if he’d covered proper XML parsing as well as just dumping everything into CSV files. (To be frank, though, I’m not sure there is a robust way to work with XML on the command line.)

In conclusion, then, Janssens’ book is worth a read, and I will be exploring the possibilities of command line data analysis in greater detail after reading this book. On the other hand, Janssens’ book is something of a missed opportunity: it is not the final statement on the subject, and is a bit skimpy as an introductory resource.
