31 August 2014

Haskell Data Analysis Cookbook - a Book Review

As with my previous post, Clojure Data Analysis Cookbook - a Book Review, this time I was offered the chance to review Haskell Data Analysis Cookbook by Nishant Shukla.  First impressions: these are two very similar and related books with some overlapping ideas.  Not only are the programming languages totally different in "genre", but the content itself also covers some different data analysis ground, so the two could be treated as complementary books.


The book itself is very example oriented (much like the Clojure Data Analysis Cookbook), basically being a collection of code recipes for accomplishing various common data analysis tasks.  Each recipe does come with a quick explanation of why it works and pointers on what else to "see also".

It gives you recipes to take in raw data in the form of CSV, JSON, XML, or whatever, including data that lives on web servers (via HTTP GET or POST requests).  Then there are recipes to build up datasets in MongoDB or SQLite databases.  From there, it moves on to recipes for cleaning up that data, doing analysis (e.g. clustering with k-means), and visualizing, presenting, and exporting the results.

Each recipe is more or less self-contained, without much building on top of previous recipes.  That makes the book more "random access".  It's less a book to read through cover to cover, and more of a handy reference to use by full-text searching for key terms, clicking on the relevant topic in the table of contents, or looking up terms in the index.  It's definitely a book I'd rather have as a PDF ebook, so that I can access it anywhere in the world and do full-text searches in it.  It does come in Mobi as well as ePub formats, and code samples are provided in a separate zipped download.

Having said that, you can tell whether a book was made to be seriously used as a reference by looking at its index.  Here the index runs to 9 pages, equivalent to about 2.9% of the number of pages preceding it.  This book can certainly be used as a reference.

As a reference book, it's great for people who already have some familiarity with Haskell in general.  If you don't know Haskell, this book won't teach it to you.  That is, unfortunately, possibly a missed marketing opportunity, as those who don't know Haskell (but do know another programming language) really only need to understand a little of how functions are written in Haskell to pick up what's going on in the book.  This means that if you know another programming language and a bit about data analysis, you could use this book to learn some Haskell, so long as you pick up the basic syntax with another tutorial in hand (so it's really not a show stopper to using this book).

Similarly, I'd say you had best be familiar with data analysis as a discipline in itself.  If you don't know whether to do clustering or regression, or whether to use k-NN or k-means, this book won't teach it to you.

Much of that is, of course, echoing the Clojure Data Analysis Cookbook.  Where the Haskell Data Analysis Cookbook differs is what makes the two books complementary.  Whereas both books talk about concurrency and parallelism, the Clojure DAC goes into those topics (including distributed computing) in much more detail.

On the other hand, whereas both books talk about preparing and processing data (prior to performing statistics or machine learning on it), the Haskell DAC goes into much more detail on topics like processing strings with more advanced algorithms (as in computing the Jaro-Winkler distance between strings, not substring/concat operations), computing hashes and using Bloom filters, and working with trees and graphs (as in node-and-link graph theory graphs, not grade-school bar graphs).
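To give a flavour of the kind of string algorithm meant here, below is a minimal sketch of Jaro-Winkler similarity in plain Haskell.  It is not taken from the book; the function names (matches, jaro, jaroWinkler) are hypothetical, and it assumes the commonly used scaling factor of 0.1 with a common prefix capped at 4 characters.  It favours readability over speed.

```haskell
import Data.List (sort)

-- Greedily pair each character of s1 with the first unused, equal
-- character of s2 that lies within the match window.
-- Returns (index in s1, index in s2) pairs, ordered by index in s1.
matches :: String -> String -> [(Int, Int)]
matches s1 s2 = go (zip [0 ..] s1) []
  where
    window = max 0 (max (length s1) (length s2) `div` 2 - 1)
    indexed2 = zip [0 ..] s2
    go [] acc = reverse acc
    go ((i, c) : rest) acc =
      case [ j
           | (j, d) <- indexed2
           , d == c
           , abs (i - j) <= window
           , j `notElem` map snd acc
           ] of
        (j : _) -> go rest ((i, j) : acc)
        []      -> go rest acc

-- Jaro similarity: 1 means identical, 0 means no matching characters.
jaro :: String -> String -> Double
jaro s1 s2
  | null s1 && null s2 = 1
  | m == 0             = 0
  | otherwise          = (m / l1 + m / l2 + (m - t) / m) / 3
  where
    ms = matches s1 s2
    m  = fromIntegral (length ms)
    l1 = fromIntegral (length s1)
    l2 = fromIntegral (length s2)
    -- Matched characters of each string, in that string's own order.
    matched1 = [s1 !! i | (i, _) <- ms]
    matched2 = [s2 !! j | j <- sort (map snd ms)]
    -- Transpositions: half the number of positions that disagree.
    t = fromIntegral (length (filter id (zipWith (/=) matched1 matched2))) / 2

-- Jaro-Winkler: boost the Jaro score for a shared prefix.
-- Assumes scaling factor 0.1 and a prefix cap of 4 characters.
jaroWinkler :: String -> String -> Double
jaroWinkler s1 s2 = j + fromIntegral prefix * 0.1 * (1 - j)
  where
    j      = jaro s1 s2
    prefix = length (takeWhile id (take 4 (zipWith (==) s1 s2)))
```

Loaded into GHCi, jaroWinkler "MARTHA" "MARHTA" comes out at roughly 0.961: all six characters match, one pair is transposed, and the shared "MAR" prefix nudges the score up.  The result is a similarity, so a distance would be one minus that value.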

So in some sense, the Haskell Data Analysis Cookbook has more theory-heavy topics (graphs and trees!), whilst the Clojure Data Analysis Cookbook has more "engineering" topics (concurrency, parallelism, and distributed computing).

Neither book is a comprehensive treatise on the topic, but someone who needs a practical refresher on working with graphs and trees may find Haskell Data Analysis Cookbook quite useful.

All in all, I'd say this is a decent book.  If you have some familiarity with Haskell and with some of the basic technologies like JSON, MongoDB, or SQLite, have taken a class or two in data analysis or machine learning at university (or a MOOC?), and aren't expecting a lot of hand holding, then this book is a great guide to start you off doing some data analysis with Haskell.
