31 August 2014

Haskell Data Analysis Cookbook - a Book Review

As with my previous post, Clojure Data Analysis Cookbook - a Book Review, this time I was offered the chance to review Haskell Data Analysis Cookbook by Nishant Shukla.  First impressions: these are two very similar, related books with some overlapping ideas.  But not only are the programming languages totally different in "genre", the content itself also covers different data analysis ground, so the two could be treated as complementary books.


The book itself is very example-oriented (much like the Clojure Data Analysis Cookbook), basically being a collection of code recipes for accomplishing various common data analysis tasks.  It does give you quick explanations of the why, along with what else to "see also".

It gives you recipes to take in raw data in the form of CSV, JSON, XML, or whatever, including data that lives on web servers (via HTTP GET or POST requests).  Then there are recipes to build up datasets in MongoDB or SQLite databases.  From there, it moves on to recipes to clean up that data, do analysis (e.g. clustering with k-means), and visualize, present, and export that analysis.

Each recipe is more or less self-contained, without much building on top of previous recipes.  That makes the book more "random access".  It's less a book to read cover to cover, and more of a handy reference to use by full-text searching for key terms, clicking on the relevant topic in the table of contents, or looking up terms in the index.  It's definitely a book I'd rather have as a PDF ebook, so that I can access it anywhere in the world and run full-text searches on it.  It does come in Mobi as well as ePub formats, and code samples are provided in a separate zipped download.

Having said that, you can tell whether a book was made to be seriously used as a reference by looking at its index.  Here the index runs 9 pages, equivalent to about 2.9% of the number of pages preceding it.  This book can certainly be used as a reference.

As a reference book, it's great for people who already have some familiarity with Haskell.  If you don't know Haskell, this book won't teach it to you.  That is, unfortunately, possibly a missed marketing opportunity, as those who don't know Haskell (but know another programming language) really only need a small primer on how functions are written in Haskell to pick up what's going on in the book.  So if you know another programming language and a bit about data analysis, you could use this book to learn some Haskell, so long as you pick up the basic syntax with another tutorial in hand (so it's really not a show stopper to using this book).

Similarly, I'd say you had best be familiar with data analysis as a discipline in itself.  If you don't know whether to do clustering or regression, or whether to use k-NN or k-means, this book won't teach it to you.

Much of that is, of course, echoing the Clojure Data Analysis Cookbook.  Where the Haskell Data Analysis Cookbook differs is what makes the two books complementary.  Whereas both books talk about concurrency and parallelism, the Clojure DAC goes into those topics (including distributed computing) in much more detail.

On the other hand, whereas both books talk about preparing and processing data (prior to performing statistics or machine learning on it), the Haskell DAC goes into much more detail on topics like processing strings with more advanced algorithms (as in computing the Jaro-Winkler distance between strings, not substring/concat operations), computing hashes and using Bloom filters, and working with trees and graphs (as in node-and-link graph theory graphs, not grade-school bar graphs).

So in some sense, the Haskell Data Analysis Cookbook has more theory-heavy topics (graphs and trees!), whilst the Clojure Data Analysis Cookbook has more "engineering" topics (concurrency, parallelism, and distributed computing).

Neither book is a comprehensive treatise on the topic, but someone who needs a practical refresher on working with graphs and trees may find Haskell Data Analysis Cookbook quite useful.

All in all, I'd say this is a decent book.  If you have some familiarity with Haskell, have some familiarity with basic technologies like JSON, MongoDB, or SQLite, have taken a class or two on data analysis or machine learning in university (or a MOOC?), and aren't expecting a lot of hand holding, then this book is a great guide to start you off doing some data analysis with Haskell.

15 August 2014

Java has a deep expression problem for beginning students

There are many problems with Java as the first programming language to teach students, if we wish to provide the most effective learning experience.  I've even written on this in the past, in Learn Python instead of Java as your first language.  So what now?

Newbie, meet the Expression Problem

Stuart Sierra provides a very lucid explanation of the Expression Problem, a classic problem in software programming, in Solving the Expression Problem with Clojure 1.2.  Needless to say, Clojure provides a very clean solution.

Java, however, is a quagmire, and requires some heavy OOP software engineering concepts to solve the Expression Problem.  One wouldn't ordinarily think this has anything to do with beginning students just learning to program, but it does, and here's how.

Imagine our beginning student, "Sam", starts to learn Java and eventually sets out to write a classic game of Asteroids.  Sam plugs away and gets a decent game working: a single player ship shooting lasers at one kind of asteroid.  Not bad!  But Sam wants to do more.  Sam doesn't want just one kind of (big) asteroid; he also wants small asteroids to shoot at.

Alright, so Sam begins to modify the BigAsteroids class so that it can also represent a smaller kind of asteroid.  The teacher catches wind of this and tells Sam, "no, that's not good", and that Sam needs to use OOP principles and write a different class for SmallAsteroids.

Now most students would say, "why, Mr. Teach?  My way works."  But Sam is a good student and does as he's told.

So Sam goes and creates a second class for SmallAsteroids.  Except his program was built presuming that the only things to draw, to shoot lasers at, and to move around were BigAsteroids.  None of the methods he wrote to draw, to shoot lasers at, and to move around BigAsteroids work for SmallAsteroids.  Hmm...  Welcome to the Expression Problem, Sam.
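
To make Sam's predicament concrete, here is a minimal sketch of one way he might get unstuck in Java.  Only the class names BigAsteroids and SmallAsteroids come from the story above; the Asteroid interface, the RoundAsteroid base class, the AsteroidsDemo helpers, and the radius values are all hypothetical names and numbers I made up for illustration.  It is not the only way to do this, and drawing is left out to keep it short.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, placed first so the file can be launched directly.
class AsteroidsDemo {
    // Sam's original methods were tied to one concrete type, roughly:
    //   static void moveAll(List<BigAsteroids> asteroids, int dx, int dy)
    // Written against the interface instead, they accept either kind.
    static void moveAll(List<Asteroid> asteroids, int dx, int dy) {
        for (Asteroid a : asteroids) {
            a.move(dx, dy);
        }
    }

    // Counts how many asteroids a laser shot at (laserX, laserY) would hit.
    static int countHits(List<Asteroid> asteroids, int laserX, int laserY) {
        int hits = 0;
        for (Asteroid a : asteroids) {
            if (a.isHitBy(laserX, laserY)) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Asteroid> field = new ArrayList<>();
        field.add(new BigAsteroids(0, 0));
        field.add(new SmallAsteroids(100, 100));
        moveAll(field, 5, 5);
        System.out.println("Hits: " + countHits(field, 5, 5));
    }
}

// Hypothetical shared abstraction: the piece Sam's first version lacked.
interface Asteroid {
    void move(int dx, int dy);
    boolean isHitBy(int laserX, int laserY);
}

// Hypothetical base class holding what both kinds have in common.
abstract class RoundAsteroid implements Asteroid {
    private int x;
    private int y;
    private final int radius;

    RoundAsteroid(int x, int y, int radius) {
        this.x = x;
        this.y = y;
        this.radius = radius;
    }

    public void move(int dx, int dy) {
        x += dx;
        y += dy;
    }

    // A shot hits if it lands within the asteroid's radius.
    public boolean isHitBy(int laserX, int laserY) {
        long ddx = laserX - x;
        long ddy = laserY - y;
        return ddx * ddx + ddy * ddy <= (long) radius * radius;
    }
}

// The two classes from the story; the radii are assumed values.
class BigAsteroids extends RoundAsteroid {
    BigAsteroids(int x, int y) { super(x, y, 40); }
}

class SmallAsteroids extends RoundAsteroid {
    SmallAsteroids(int x, int y) { super(x, y, 10); }
}
```

Notice what this buys Sam and what it doesn't.  Adding a third kind of asteroid is now one new class, with no changes to moveAll or countHits.  But adding a new operation (say, drawing) means touching the Asteroid interface and every class that implements it, which is the other half of the Expression Problem, and it's a lot of machinery to ask of someone who just wanted smaller rocks to shoot at.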