2013-08-01

Select columns by filtering on column names with Clojure Incanter

Incanter is a pretty amazing library for working with data.  With a table of data, it's easy to select rows of data to work on by filtering on the data in each row.

For example:

(let [data (to-dataset [{:a 1 :b 2} {:a 3 :b 4}])]
        ($where {:a {:$gt 2}} data))

That will select all rows where given a row, its data under column :a is greater than 2.

But what if you want to filter the dataset to get rid of certain columns?  Say you only want column 0.  This can do that:

(let [data (to-dataset [{:a 1 :b 2} {:a 3 :b 4}])]
        ($ :all 0 data))

What if you want columns 0 and 2?  This can do that:

(let [data (to-dataset [{:a 1 :b 2 :c 3} {:a 3 :b 4 :c 5}])]
        ($ :all [0 2] data))

What if you want to select by column name, e.g. only selecting columns with name :a and :b?  This can do that:

(let [data (to-dataset [{:a 1 :b 2 :c 3} {:a 3 :b 4 :c 5}])]
        ($ :all [:a :b] data))

Here's the tricky one.  What if you want to select by column name, but you want to match the column name against a regular expression (say only names containing a vowel)?  The $where method only lets you select rows of data based on a query — it's a row-wise filtering operation.  We want column-wise filtering on column-names.

My first attempt involved taking the dataset, turning it into a clojure map, then filtering on the map's keys, and finally turning it back into an Incanter dataset.  But doing so ran into a problem in Incanter: to-dataset and to-map are not inverse functions of each other.

That is to say, there exists a file of data *filepath* that can be read into Incanter with read-dataset such that this does not work: (to-dataset (to-map (read-dataset *filepath*))).  For example, if the file is a CSV spreadsheet file with missing data in some cells (not nil, not 0, just no data).

Turns out the solution is much easier:

(let [data (to-dataset [{:a 1 :b 2 :c 3} {:a 3 :b 4 :c 5}])
        columns (filter #(re-find #"aeiou" (str %)) (:column-names data))]
        ($ columns data))

The idea is to create a list of column names of the columns you want, then use that list of column names with $ to select them out of the dataset.

Column-wise selection isn't as convenient as selecting row-wise, but this is one way that works.


No comments: