Müller & Guido, Introduction to Machine Learning with Python (O’Reilly, 2016)

O’Reilly and others have published a profusion of data science books in the past few years. Given that many of them are intended to introduce readers to data science methods and tools, it’s perhaps unsurprising that they overlap at various points: you’ve got to introduce the reader to NumPy, pandas, matplotlib and the rest somehow, after all.

Müller & Guido’s Introduction to Machine Learning with Python is distinct from many of these other works in both its stated aims and its execution. In contrast to many of the more introductory books on data science, Müller & Guido give readers with a serious interest in the practice of machine learning a thorough introduction to scikit-learn. That is to say, their Introduction largely eschews coverage of the tools often treated in introductory data science texts (though they briefly note the other tools they draw upon in Chapter 1). At the same time, because their book focuses on practice and scikit-learn, they discuss neither the mathematical underpinnings of machine learning nor how to write algorithms from scratch.

What is here is a comprehensive overview of what is already implemented in scikit-learn (which is a considerable amount, as they show). More precisely, they focus on classification and regression in supervised learning, and clustering and signal decomposition in unsupervised learning. If your interest falls in those areas (particularly the former), their coverage is quite good. Chapters 2 and 3 discuss the algorithms for supervised and unsupervised learning respectively, and in considerable detail. That said, I might turn first to the discussion of some of the same algorithms in Chapter 5 of VanderPlas’ Python Data Science Handbook: though somewhat less thorough, VanderPlas’ treatment is more conversational and less dry. (Note, however, that Müller & Guido do cover more territory.) Similarly, I was left wanting more from Chapter 7’s coverage of working with text.

Müller & Guido’s book really shines, though, when it discusses all of the other things that go into machine learning, beyond their march through the algorithms themselves. Chapter 4 discusses ways to numerically model categorical variables, also (briefly) covering ANOVA and other techniques of feature selection; Chapter 5 covers cross-validation and techniques for carefully tuning model parameters; Chapter 6 compellingly explains the importance of using the Pipeline class to prevent data leakage (during preprocessing, for example); and Chapter 8 discusses where scikit-learn and Python fit within the wider horizons of machine learning. The strongest parts of the book, then, and the parts that are the most fun to read, are those where Müller & Guido discuss the practical details of machine learning. (One wonders if they felt a bit hamstrung by avoiding the mathematics of the algorithms they discuss.) There are points where the book is less engaging than other introductory data science books, but then it’s not really in the same category; rather than an introductory overview of the entire landscape, Müller & Guido provide a clear, comprehensive, detailed guidebook to one particular part of the map.
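
For readers who haven’t met the Pipeline idea from Chapter 6, here is a minimal sketch of the pattern as I understand it, not code taken from the book. The dataset, the scaler, the classifier, and the parameter grid are all my own illustrative choices. The point is that the scaler sits inside the pipeline, so during cross-validation it is refit on each training fold only and never sees the held-out fold.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Illustrative dataset bundled with scikit-learn (an assumption for this sketch).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling is a pipeline step, so it is fit only on the training portion
# of each cross-validation split rather than on the full training set.
pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])

# Parameters of a step are addressed as "<step name>__<parameter>".
# The grid values are arbitrary examples.
param_grid = {"svm__C": [0.1, 1, 10], "svm__gamma": [0.01, 0.1, 1]}

grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

# The test set is touched only once, after the search has finished.
test_score = grid.score(X_test, y_test)
print(grid.best_params_)
print(f"held-out accuracy: {test_score:.3f}")
```

Scaling the whole training set first and then cross-validating would let information from each validation fold leak into the preprocessing, which is exactly the mistake the chapter warns against.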