Müller & Guido, Introduction to Machine Learning with Python (O’Reilly, 2016)

From O’Reilly and others, there’s been a profusion of data science books in the past few years. Given that many of these books are intended to introduce readers to data science methods and tools, it’s perhaps unsurprising that many of these books overlap at various points: you’ve got to introduce the reader to NumPy, pandas, matplotlib and the rest somehow, after all.

Müller & Guido’s Introduction to Machine Learning with Python is distinct from many of these other works in both its stated aims and in its execution. In contrast to many of the more introductory books on data science, Müller & Guido give readers with a serious interest in the practice of machine learning a thorough introduction to scikit-learn. That is to say, their Introduction largely eschews coverage of the data science tools often treated in introductory data science texts (though they briefly note the other tools they draw upon in Chapter 1). At the same time, because their book focuses on practice and scikit-learn, they neither discuss the mathematical underpinnings of machine learning, nor do they cover writing algorithms from scratch.

What is here is a comprehensive overview of things already implemented in scikit-learn (which is a considerable amount, as they show). More precisely, they focus on classification and regression in supervised learning, and clustering and signal decomposition in unsupervised learning. If your interest falls in those areas (particularly the former), their coverage is quite good. Chapters 2 and 3 discuss the algorithms for supervised and unsupervised learning respectively, and in considerable detail. That said– and though VanderPlas is somewhat less thorough– I might turn to the discussion of some of the same algorithms in Chapter 5 of VanderPlas’ Python Data Science Handbook before Müller & Guido’s; his treatment is more conversational and less dry. (Note, however, that Müller & Guido do cover more territory.) Similarly, I was left wanting more from Chapter 7’s coverage of working with text.

Müller & Guido’s book really shines, though, when it discusses all of the other things that go into machine learning, beyond their march through the algorithms themselves. Chapter 4 discusses ways to numerically model categorical variables, also (briefly) covering ANOVA and other techniques of feature selection; Chapter 5 covers cross-validation and techniques for carefully tuning model parameters; Chapter 6 compellingly explains the importance of using the Pipeline class to prevent data leakage (during preprocessing, for example); and Chapter 8 discusses where scikit-learn and Python fit within the wider horizons of machine learning. The strongest parts of the book, then– and the parts where it’s the most fun to read– are where Müller & Guido discuss the practical details of machine learning. (One wonders if they felt a bit hamstrung by avoiding the mathematics of the algorithms they discuss.) There are points where the book is less engaging than other introductory data science books, but then it’s not really in the same category; rather than an introductory overview of the entire landscape, Müller & Guido provide a clear, comprehensive, detailed guidebook to one particular part of the map.
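
To make the Chapter 6 point about leakage concrete, here is a minimal sketch of my own (not an example taken from the book). The dataset, the scaler, and the classifier are all assumptions chosen for illustration; the idea is simply that the preprocessing step lives inside the pipeline, so it is refit on the training portion of each cross-validation split rather than on the full data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative dataset that ships with scikit-learn (an assumption, not the book's choice).
X, y = load_breast_cancer(return_X_y=True)

# Bundling the scaler with the classifier means the scaler never sees
# the held-out part of a split when it computes its means and variances.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Five-fold cross-validation over the whole pipeline, not just the classifier.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```

Scaling the whole of X first and then cross-validating only the classifier would let information from the held-out data leak into the preprocessing, which is the mistake the Pipeline class is meant to prevent.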

Klemens, 21st Century C (2nd ed., O’Reilly, 2014)

As everyone knows, Mark Twain defined a classic as a book that everyone wants to have read and no one wants to read. Everybody who does some programming knows that K&R is a classic, by any standard– it’s the Rosetta stone of modern C programming, but it also helps to clarify the design principles (and of course the syntax) of many of the modern programming languages that are derived from C. Even beyond languages with a clear C lineage, it’s easy to see the way that a whole host of other modern programming languages have been written to simplify things that are tedious or risky in C. At the same time– and like a lot of other people, I’d imagine– K&R sits on my shelf and stares at me most of the time. When I open it, sometimes I think “What an impressively written book! What concision and clarity!” Most of the time, though, I think, “Wow, how gnarly– thank God for Python!”; or I wonder whether any of the details K&R fuss over are relevant today.

Ben Klemens’ 21st Century C is intended to resolve some of this shock, and to serve as an introduction to modern ways of working in C. The first part of the book presents tools and best practices for this (including debugging, testing, and version control), while the second half discusses how to write modern C– C in a world where, among other things, the rigid memory constraints taken for granted in K&R no longer apply. Some parts of the book (like the discussion of pointers) are clearly meant as an introduction or refresher for readers who aren’t comfortable in C, and the book includes a handy appendix on the basics of C. Other parts live up to the book’s billing as an explanation of what has changed in the world of C. The discussion of new structures in modern C, for example, is a highlight of the book. Klemens’ discussion of string handling in Chapter 9 was also interesting, though briefer than it might have been. (Perhaps with good reason: as someone who works almost exclusively with strings– and even though Unicode in modern languages isn’t always fun, either– I remain unconvinced that working with strings in C is something I want to do on a regular basis.)

As my comments above suggest, I am not an experienced C programmer (despite the occasional stab at the exercises in K&R), and am thus rather unqualified to pass judgment on the soundness of any of Klemens’ code. I can only assume that the infelicities and problems mentioned in reviews of the first edition of the work have been resolved. As a C tyro, though, I felt that Klemens effectively explains the ways that different practices– and the C standards, as well– have evolved over the years. It would be tempting, I think, for the book to remain at the level of vague generalities, but the book strikes a nice balance between high-level discussion of the way C programming has changed over the years and detailed discussion of what’s going on under the hood. It helps immensely, I think, that Klemens has a light, humorous touch– he notes that the manual memory model “is why Jesus weeps when he has to code in C”– and the humorous asides help to leaven some of the necessarily technical passages of the book.

Klemens’ book has the unenviable task of competing with K&R, and there are parts where 21st Century C suffers for the comparison. I still prefer K&R’s discussion of pointers; and I felt that there were a handful of sections that add little to what’s already in K&R. Klemens is fond of comparing C to punk rock, and upon reflection, I believe the comparison is an apt one. To push the metaphor further, there are ways in which K&R is, like a classic punk album, indelible in its simplicity and directness. To my mind, Klemens’ book is a worthy attempt to take that simplicity and directness and make it speak to a changed world. Klemens’ book isn’t perfect; if we’re honest with ourselves, though, even the hardiest classics aren’t always, either.

Goodliffe, Becoming a Better Programmer (O’Reilly, 2014): A non-professional’s take

Goodliffe’s Becoming a Better Programmer is marketed to a wide range of readers: to veterans, newcomers, and also to those who do some programming on the side as a hobby (Hi!). This is not entirely accurate– the book clearly has professional developers in mind most of the time– but I found the book to be an interesting discussion of aspects of the art and craft of programming all the same.

The first two parts of the book discuss a number of features of writing and working with code, both the theoretical/philosophical side and lower-level issues like producing and maintaining consistently formatted code. These sections are clearly oriented primarily towards professional developers, who are probably working in a production environment, have an existing codebase to work with, and may well be under pressure to skimp on design or testing in order to ship code more quickly. Even so, in talking about all of the problems that particular code or a particular codebase can have, these parts also talk at length about the principles and design behind good, sane code, and I found these sections useful and interesting. He discusses cohesion and coupling, omitting needless code (the YAGNI principle), and producing simple and sufficient code– along with practical advice about stuff like testing and version control.

The last three parts of the book are concerned with the softer side of being a developer, both personally and interpersonally– working well with your team, responding to superiors, and even personal things like ethical considerations and the importance of good posture. These sections are lighter weight (and often briefer), and sometimes repetitively summarize earlier points in the book. But they’re an easy read, and can be humorous in a way that the sometimes strained jokes of other sections aren’t. (For example, Goodliffe talks about your relationship with your primary language as a marriage, but then notes that, unlike most marriages, it can be quite helpful to play around on your “spouse.”)

It’s worth pointing out that Goodliffe’s book seems much more oriented towards discussion than to armchair reading. Each chapter takes up a subject, discusses different approaches to that subject (sometimes briefly), and then gives a set of questions. In most cases, Goodliffe is undogmatic– he lays out his position in the text, but the questions leave open the possibility that other experienced developers might have a different take. This format seems like it would work well for reading with a mentor (as the book suggests) or even a book club.

Goodliffe’s language-agnostic approach makes the book broadly accessible but also somewhat abstract. I think the book would have been stronger if he were clearer about applying principles to particular languages. Goodliffe’s book will not replace the resources that give advice and best practices for the idiomatic use of whatever language you’re working in, therefore, but it’s a quick read, and may get you thinking about the way you code, even if it’s only something you do in your spare time.

Janssens, Data Science at the Command Line (O’Reilly, 2014)

It normally takes me a week or two to read through a new tech book, but Janssens’ Data Science at the Command Line went by quickly. In part, this was because I was unusually excited about the premise of the book. I’ve been working with a number of my own data files recently, both on the command line and in Perl, and I was eager to learn new tricks and techniques. Does Janssens’ book live up to my (admittedly high) expectations? Partly, but the book was also a quick read because it’s more limited than I had hoped.

To start with the positive, Janssens’ book introduces users to a number of the most important command line tools: sed, awk, and grep, among others. A real strength of the book is that Janssens covers a number of lesser-known tools that are welcome additions to the usual suspects: jq (for working with JSON data), curlicue (a curl variant that handles the hassle of OAuth authentication), and the tools of csvkit (for both working with CSV files and converting other formats to CSV). Janssens has even written a few of his own tools that serve to soften the sometimes steep learning curve of the command line.

Furthermore, Janssens gives a helpful overview of ways of working with data on the command line. Like many users, I know a fair amount about working with text at the command line, but Janssens opens up topics like creating attractive visualizations and using GNU Parallel for managing parallel commands. In giving this overview, Janssens demonstrates how the philosophy of the *nix command line can be applied to data analysis. However, the book seems to be intended to prove the viability of doing data analysis at the command line more than to serve as a systematic introduction. Important points are occasionally glossed over; the book fails to mention that regular users will be unable to chmod files outside of their home directory without sudo, for example (pp. 44-5). Likewise, I imagine many readers would benefit from a clear discussion of using the tee command to drop data into a file when you’re piping data all over the place. Janssens gives examples of using sed and awk, but with only brief explanations of how they operate; I imagine that many users will need to turn to the clearer, more systematic discussions in other resources (like Classic Shell Scripting or Unix Power Tools) to really move beyond the examples Janssens provides.

Furthermore, if you’re more comfortable with another way of working with your data than the command line, I’m not convinced that the command line is always the best approach. Some of the approaches Janssens suggests are rather clunky, for example. There are heaps of XML (and HTML) data out there, but the book suggests the awkward approach of converting HTML to JSON to CSV. Having spent time fussing with XML parsing, I genuinely understand the attraction of this approach, but it would have been nice if he’d covered both proper XML parsing as well as just dumping everything into CSV files. (To be frank, I’m not sure there is a robust way to work with XML on the command line, though.)
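
For comparison, here is roughly what I mean by proper XML parsing, sketched in Python rather than at the command line. This is my own illustration, not something from Janssens' book: the sample document, the element names, and the books_to_csv helper are all hypothetical. It parses the XML with the standard library and only then flattens the fields it needs into CSV.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data, invented for this illustration.
XML_DATA = """\
<books>
  <book year="2014"><title>21st Century C</title><author>Klemens</author></book>
  <book year="2006"><title>Unicode Explained</title><author>Korpela</author></book>
</books>
"""


def books_to_csv(xml_text):
    """Parse the XML tree, then write one CSV row per <book> element."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["title", "author", "year"])
    for book in root.iter("book"):
        # Child elements and attributes are read from the parsed tree,
        # not scraped out of the raw text.
        writer.writerow(
            [book.findtext("title"), book.findtext("author"), book.get("year")]
        )
    return out.getvalue()


print(books_to_csv(XML_DATA))
```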

In conclusion, then, Janssens’ book is worth a read, and I will be exploring the possibilities of command line data analysis in greater detail after reading this book. On the other hand, Janssens’ book is something of a missed opportunity: it is not the final statement on the subject, and is a bit skimpy as an introductory resource.

Sklar and Trachtenberg, PHP Cookbook (3rd ed., 2014)

Sklar and Trachtenberg’s PHP Cookbook is a difficult book to review; the book is clearly written with at least two different audiences in mind, and this means that parts of the book vary in sophistication and depth. On the one hand, the book is intended in part to complement Sklar’s (2004!) Learning PHP 5, to serve as a second book for PHP novices, to cover some of the many topics that book’s “PHP with training wheels” approach did not. On the other hand, the book is intended for readers who are familiar with the basics of the language who want to learn how to do things well in PHP.

For the beginner, for example, Chap. 6 begins with an introduction to functions, and Chap. 7 on objects likewise begins with a very gentle introduction to objects; Chap. 1 covers the basics of working with substrings, and Chap. 18 introduces some issues in PHP security. However, beginners might be better served by reading the relevant sections in Tatroe et al.’s 2013 Programming PHP. Both cover much of the same ground, but I find Programming’s coverage to be clearer and in greater depth.

This book also does a lot of what it says on the tin by providing a reference for a lot of situations you might run into when programming PHP. Need to work with email? Drop in some regular expressions? Mess around with an object using array syntax? (Look at 4.25 for the latter– a nice trick.) I personally don’t spend all of my time in PHP, and it’s nice to have code snippets at hand when you need them.

More than these individual recipes, the value of the book to my mind lies in the sections for the programmer who realizes that there are different ways to tackle a particular problem in PHP. PHP comes with a heap of built-in functions (some of which are redundant); these functions are further complemented by libraries, packages, and software that extend and supplement the core of PHP. A lot of PHP programming is simply coming to terms with all of these competing ways to do things, and this is one of the strengths of the Cookbook. Sklar and Trachtenberg often tell you which function to use, and why (though some parts simply list different possibilities without much differentiation). To pick an example at random, (*rimshot*) they explain why the (built-in) mt_rand() function is better than the (built-in) rand() function for generating random numbers within a particular range. This is not something Programming PHP is always good at, actually, and its function reference simply lists all of the functions without explaining differences between them (php.net can often be helpful in this regard, too).

It limits the usefulness of the book, however, that you have to go digging for these sections, that you don’t know in advance whether the recipe you’re interested in is a brief or introductory discussion for the beginning user or a helpful guide through the PHP wilds. In some ways, this book is like the maps nature parks often give out to tourists: some parts only give you a vague idea, while other parts of the map can be a reliable guide to the terrain. Sklar and Trachtenberg’s PHP Cookbook can still help you to get around, but it’s a good idea to keep your wits about you, and to make use of other resources, as well.

Korpela, Unicode Explained (O’Reilly, 2006)

Korpela’s Unicode Explained was originally intended for three audiences, I think. The first was the casual user who might need to make some basic use of Unicode in everyday life (entering a little bit of Unicode in Windows, for example). The second was the advanced user who might need to draw on some Unicode wizardry in a few specialized cases: programming, HTML or other markup, or the internet. The third was those wanting an introduction to the principles behind Unicode without making a brute force attack on the Unicode Standard itself. The passage of time means that the book may now be less useful for either casual users or advanced users (who really do need current information). Nevertheless, Korpela’s work remains helpful, and his discussion of the theoretical side of Unicode is excellent, both clear and nuanced.

Major changes have happened in the Unicode world since this book was written. Characters and scripts have been added of course, but the real difference is that Unicode support is much more pervasive than it was when the book was written. Unicode is much, much more common on web sites now, Emacs 24 finally has long-overdue support for certain Unicode features such as bidi text (though vim is still a holdout in this case), and most programming languages have come to adopt Unicode as fundamental to the way that strings work (such as the adoption of Unicode in Python 3).

Very many of these changes occurred after the 2006 publication date of Korpela’s book, and this means that at points the book reads like a period piece– the changes were in the foreseeable future, but not there yet. This also means that some parts of the book are very out of date. The section on Perl, for example, is completely out of date: however sluggish it may have seemed in 2006, Perl has now adopted Unicode to such an extent that it’s even changed some of the fundamental ways Perl works. Long-beloved character class shortcuts speak Unicode now, which means it’s often less trouble to just use full character classes. (For more on Unicode in Perl, check out the relevant sections in the llama and– if you’re brave of heart– in the camel.)
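
The same shift is easy to see in Python 3, so here is a small sketch in Python rather than Perl. Treat it as an analogy of my own, not a statement about Perl's exact semantics: the \d shortcut matches any Unicode decimal digit on ordinary strings, while an explicit class such as [0-9] stays ASCII-only.

```python
import re

# The digits 1, 2, 3 written as Arabic-Indic numerals (U+0661 to U+0663).
arabic_indic = "\u0661\u0662\u0663"

# On str patterns, \d matches any Unicode decimal digit by default.
unicode_match = re.fullmatch(r"\d+", arabic_indic)

# An explicit class only matches the ASCII digits it lists.
explicit_match = re.fullmatch(r"[0-9]+", arabic_indic)

# The re.ASCII flag restores the older, ASCII-only meaning of \d.
ascii_match = re.fullmatch(r"\d+", arabic_indic, flags=re.ASCII)

print(bool(unicode_match), bool(explicit_match), bool(ascii_match))
# prints: True False False
```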

Much of the utility of a book like this is expert discussion of such advanced topics; having to check the book’s information against more recent sources defeats the purpose. On the other hand, my sense is that because much of the core structure of Unicode was in place in 2006, many of the basic ways of working with Unicode are the same. Though I’m no expert, my sense is that working with Unicode in HTML is unchanged, even for complicated stuff (bidi), while working with Unicode in MS Word can still be a pain.

But the real reason to read this book isn’t so much the practical advice– much of which you’d be better off looking up on StackExchange anyway– but the lucid explication of the structure and design of the Unicode framework. Not all of the explanations are equally clear– I found the first chapter a bit muddled, oddly– but Korpela remains a useful guide to the Unicode terrain.

Building Web Apps with WordPress, by Messenlehner and Coleman (O’Reilly, 2014)

Writers of books on WordPress are presented with a bit of a quandary, I think. On the one hand, one of the best resources for working with WordPress is the WordPress Codex itself, which is free, complete, regularly updated, and can cover a lot more territory than neatly fits within the covers of any one book. On the other hand, writers of WordPress books have to contend with the fact that a phenomenal book on WordPress already exists, Williams, Damstra, and Stern’s Professional WordPress: Design and Development. Messenlehner and Coleman’s Building Web Apps with WordPress enters this crowded field and acquits itself reasonably well. It’s no Professional WordPress, and it’s not the book it might have been, but it is a solid addition.

The book covers a lot of territory, starting from the basics of WordPress as a CMS and an app platform all the way to how to optimize your WordPress performance. The basic idea of the book, then, is that it will take you from some basic understanding of WordPress and WordPress plugins through to scaling and optimizing your wildly successful app in a production environment. Early chapters introduce WordPress and give some rough idea of how it works. Chapters 4 through 8 are the core of the book, and cover themes, custom post types, users and roles, other miscellaneous APIs and objects, and security. Later chapters introduce more specialized, supplementary topics such as mobile WordPress apps, and ecommerce apps.

Despite the clear layout of the chapters, the organization could be better, and important material is hidden in unexpected places. For example, the section of chapter 5 on custom post types does not actually cover the functions used to work with post metadata. Fundamental concepts like the loop, hooks, and the standard WordPress global variables are not in chapter 2, on WordPress Basics, but buried in Chapter 3, on Leveraging WordPress Plugins.

The quality of the chapters varies. Some chapters of the book are introductory overviews while others are advanced discussions; some are crammed full of advice, insight, and helpful code examples while others are essentially a function reference (a wider failing of PHP books, in my experience). Thorough, insightful discussions of WordPress development are scattered through the book: their comparison of custom taxonomies and post metadata in chapter 5, for example, is one of the best discussions I’ve seen. In general, though, I think the book is hampered by the decision to make it cover WordPress from basics to advanced topics. This means that the book competes with Professional WordPress on its own turf (not to mention a whole host of other books that cover the basics of WordPress), rather than striking out for fresher territory.

Messenlehner and Coleman do have experience designing and building apps, and it would have been interesting to get a deeper perspective on the nuances of WordPress app development. For one, there is a range of ways to interact with your data in WordPress, everything from the WP_Query class to the $wpdb object to using custom tables. Some of this is touched on, in chapter 3 and much later in chapter 16. But the commitment of the book to the whole basics-to-advanced gamut means that these discussions are less sustained, and less helpful, than they might have been if the authors had just dropped the pretense. This might also help to resolve some of the organizational problems the book has: they discuss working with custom tables in chapter 3, but the full explanation doesn’t come until chapter 16. (Part of the explanation has to do with performance when querying post metadata, which is not discussed in the discussions of post metadata in chapters 2 and 5.)

For similar reasons, the book uses a single app as the example throughout the book (their Schoolpress app). As a number of reviewers on Amazon have pointed out, this sample app is not in fact complete (in private beta, at the moment), nor is the code up on GitHub. If this app was in the early design stages when the book was written, one possibility would have been to give more thorough consideration to a range of examples, a range of design possibilities: an app where much of the work is in the theme, and the code in the functions.php file; a middle-of-the-road app, with some custom post types; and a very complex app like Schoolpress. I don’t think the fact that the Schoolpress app is incomplete is entirely fatal, but it seems like a missed opportunity: if the development process of Schoolpress hasn’t gone as smoothly as anticipated, the book might well have been enriched by the lessons of the development process.

Though I’ve dwelt on the book’s problems, the book contains insightful discussions of working with WordPress and making your app work well– though they may not be where you’d expect. With some reorganization, and a clearer sense of the book’s purpose, a second edition of this book may well earn a place next to Professional WordPress as an essential work for WordPress development.