vellum

2021-08-07

Data Frames for Common Lisp

Upstream URL

github.com/sirherrbatka/vellum

Author

Marek Kochanowicz

License

BSD simplified

README.org

Si vis memoriae, para vellum.

0.1What is Vellum?

Vellum is a data frame library build for Common Lisp. It is designed to be easy to both learn and use (especially in interactive fashion), practical, and reasonably efficient. Note that this software is in beta quality at the moment.

0.2Why I made Vellum?

Common Lisp is a fine language with exceptional interactivity, multiple implementations providing the potential for acceptable performance, useful features like condition system, and a stable standard base. This, in theory, makes is very much suited toward data processing and analysis. Although practice paints a less favorable picture, there are still use cases where Common Lisp could be considered to be a very good pick. Also, I simply like writing Lisp code.

Before Vellum existed I usually represented my data as vector-of-vectors (outer vector hold rows, inner vectors are rows). Then I would define accessors to obtain specific value from the data structure. This works but has a few distinct shortcomings. First off, it is not very flexible: I don't want to keep track of what a column with a specific number holds, I just want to refer to it by name. Secondly, I found that code written around such representations is difficult to read. After getting back to an old file I had to explain each individual line back to myself and this means that I am very likely to introduce mistakes If I have to change a thing. Higher-level semantics would help with that. Finally, I often work with a very sparse, and rather large, datasets. Enough so actually that I may simply run out of memory to represent all of my giganulls.

None of the above issues alone is substantial enough to be considered a real problem. However, in combination, and especially in an interactive programming context; when you want to get your code running as soon as possible, they really add up to a frustrating environment. I needed a nice, easy to use data frame library. Vellum is a such library for me, and I sincerely hope that it is a useful tool for you as well!

0.3Mindset

The main goal of the Vellum was to create a lispy data frame library, and therefore it should come with no surprise that the design is very different from a more famous Pandas for python. Although I attempted to emulate R lang workflow, Common Lisp, even with all of its awesomeness, lacks some of the R features (namely first-class expressions and environments) therefore the end result can be considered to be quite a bit unique. You are probably considering this is a drawback. However, I hope to demonstrate that Vellum is a approachable library.

0.4Header + Columns = Data Frame

Data Frame is composed of header and columns. The header can be thought of as a data schema, containing column names, column types, and predicates for stored values and is required to create a table. Let's construct an empty data frame.
  (defparameter *table*
    (vellum:make-table
     :header (vellum:make-header 'vellum.header:standard-header
                                 '(:name first-column)
                                 '(:name second-column :type fixnum))))

You may be under impression that there has to be a better way to it. You would be correct.

  (defparameter *table*
    (vellum:make-table :columns '(first-column
                                  (:name second-column :type fixnum))))

0.5Read and write data

A freshly constructed data frame is obviously empty. However, the meaning of empty is somewhat blurred in vellum. Data frames are sparse in concept, meaning that they lack a natural bound. This means that although, you can locate the last non-empty row...
(print (vellum:row-count *table*)) ; => 0

But an attempt to get non-existing data won't signal error.

(print (vellum:at *table* 5 'first-column)) ; => :NULL
;; oh, btw, column names are actually strings, meaning that this...
(print (vellum:at *table* 5 "first-column")) ; => :NULL
;; will also work. Also, this means that you don't have to import symbols just to access data.

Furthermore, VELLUM:AT is actually a setfable place which means that you can even...

(setf (vellum:at *table* 5 'first-column) "first value")
(print (vellum:at *table* 5 'first-column)) ; => first value
(print (vellum:row-count *table*)) ; => 6
(setf (vellum:at *table* 5 'first-column) :null) ; => error signalled, to remove value from the table use erase! function
(vellum:erase! *table* 5 'first-column)
(print (vellum:row-count *table*)) ; => 0

If one would want to, he could fill the content of the data frame in a loop. But that would be awkward, wouldn't it? I suspect that you would prefer some sort of a batch method that would be handier.

0.6TRANSFORM and BIND-ROW

This is where the TRANSFORM function and BIND-ROW macro combination comes in. The intent is to construct simple to use and universal facility to perform all kinds of row operations in a handy way. It is easy to demonstrate this on an example For instance, let's say that we want to calculate body mass index from weight and height. The formula is simple enough: weight (in kilograms) divided by height (in meters) squared. Given that our data frame already contains weight, height, and BMI columns, code becomes...
  (vellum:transform *table*
                    (vellum:bind-row (bmi weight height)
                      (setf bmi (/ weight (* height height))))
                    :in-place t)

BIND-ROW is a macro that will expand to a lambda form returning another lambda form with a variables corresponding to the columns designated and extra boilerplate code for putting those values back into the data frame. Keyword argument :in-place allows for destructive changes. You may think that this is a bad idea sometimes given that a stray condition will leave your data messed up and I am happy to inform you that this is not the case. TRANSFORM is an exception-safe function that will alter the content of the table only after completion. IN-PLACE defaults to NIL anyway though.

TRANSFORM allows for a few extra tricks though. For instance, if we want to calculate BMI for individuals above a certain height, we can do this by using DROP-ROW. To demonstrate this idea:

  (defparameter *above-170-only*
    (vellum:transform *table*
                      (vellum:bind-row (bmi weight height)
                        (if (<= height 1.7)
                            (vellum:drop-row)
                            (setf bmi (/ weight (* height height))))))))

Notice that this time I've left the original frame untouched. Other keyword arguments for the transform functions are :START (the first row that should be touched, defaults to 0) and :END (1+ the last row that should be touched, defaults to the ROW-COUNT). It is possible to pass NIL as :END which will mean that the programmer decided to terminate transform manually. This is quite handy when populating an empty data frame...

  (defparameter *table*
    (alexandria:with-input-from-file (stream "bmi data.txt")
      (vellum:transform (vellum:make-table :columns '((:name weight)
                                                      (:name bmi)
                                                      (:name height)))
                        (vellum:bind-row (weight height bmi)
                          (let ((line (read-line stream nil nil)))
                            (if (null line)
                                (vellum:finish-transformation)
                                (destructuring-bind ((new-weight new-height) (parse-data-line line))
                                    (setf weight new-weight
                                          height new-height
                                          bmi (/ weight (* height height)))))))
                        :end nil)))

This won't work if your data source insists on driving reading data. If API has just MAP-DATA function and no iterator you may think that you are screwed. Don't worry, there is also TRANSFORMATION that construct a sort of transform iterator that you can move on your own. More of that later.

One more thing: you can actually use name different from :name in the column. This comes in handy when the name is a reserved symbol or (perhaps even more often) when there is simply no name, to begin with. In such case, the following will do the trick.

  (defparameter *table*
    (alexandria:with-input-from-file (stream "bmi data.txt")
      (vellum:transform (vellum:make-table :columns '((:name weight)
                                                      (:name bmi)
                                                      (:name height)))
                        (vellum:bind-row ((weight-but-different 0) height bmi)
                          (let ((line (read-line stream nil nil)))
                            (if (null line)
                                (vellum:finish-transformation)
                                (destructuring-bind ((new-weight new-height) (parse-data-line line))
                                    (setf weight-but-different new-weight
                                          height new-height
                                          bmi (/ weight-but-different (* height height)))))))
                        :end nil)))

0.7TO-TABLE and COPY-FROM

Instead of bothering with TRANSFORM just to create table, one can use either TO-TABLE or COPY-FROM. They actually do almost the same thing. TO-TABLE is a CL-DS aggregation function, meaning that it works on all kinds of CL-DS ranges (as long as they return CL:SEQUENCE) and SEQUENCEs holding SEQUENCEs. The inner SEQUENCE is a row. It is probabbly easier to demonstrate.
  (defparameter *table*
    (vellum:to-table '((:moose 1 2) (:gazelle 3 4))
                     :columns '(animal category1 category2)))

The above will construct table with 3 columns: ANIMAL, CATEGORY1 and CATEGORY2 and 2 rows. The first row is :moose 1 2, the second row is :gazelle 3 4. CL-DS ranges sometimes can be more memory efficient then the plain vector/list of vectors/lists. CL-DS:XPR for instance can be used to construct a lazy generator. Table with 500 numbers...

  (defparameter *table*
    (vellum:to-table (cl-ds:xpr (:i 0)
                       (when (< i 500)
                         (cl-ds:send-recur (list i) :i (1+ i))))
                     :columns '(iota)))

Aleternatively, one can use simpler CL-DS:XPR form, combined with the CL-DS.ALG:RESTRAIN-SIZE function.

  (defparameter *table*
    (vellum:to-table (cl-ds.alg:restrain-size
                      (cl-ds:xpr (:i 0)
                        (cl-ds:send-recur (list i) :i (1+ i)))
                      500)
                     :columns '(iota)))

TO-TABLE is an CL-DS aggregation function, meaning that it can be used in conjuction with GROUP-BY and other layer functions. This makes certain operations very simple. For instance, if we have a list with some montly values, and we want to calculate 3 months average for the each data point...

  (defparameter *monthly-numbers*
    '(1000 5000 2000
      3000 2130 150
      4000 9000 1130
      390 2000 1000))

  (defparameter *table*
    (vellum:to-table
     (cl-ds.alg:sliding-window *monthly-numbers* 3)
     :columns '(month-1 month month+1 avg)
     :body (vellum:bind-row (avg)
             (setf avg (cl-ds.math:average (vellum:vs 0 1 2) :sum 0.0)))))

But more on that later.

COPY-FROM is almost like TO-TABLE, but the input it is inteded to be used on objects that are not CL-DS ranges or CL sequences. This includes, for instance, paths to files, SQL queries, network handlers, and so one. If you want to implement new data source, just specialize COPY-FROM generic function and you are done! Let me demonstrate...

  (defmethod vellum:copy-from ((format (eql :custom-tab-separated-format)) file-path &key columns)
    (let ((columns-count (length columns)))
      (with-open-file (stream file-path)
        (vellum:transform (vellum:make-table :columns columns)
          (vellum:bind-row ()
             (let ((row (read-line stream nil nil)))
               (when (null row)
                 (vellum:finish-transformation))
               (let ((values (cl-ppcre:split #\tab row)))
                 (loop :for i :from 0 :below columns-count
                       :for elt :in values
                       :do (setf (vellum:rr i) elt)))))
          :in-place t))))

The above example sacrifices some of the features (most notably: converting from the textual format to a lisp data type) in the name of clarity. By the way, generic functions can make wonders for the extensions! As for the extensions... vellum-csv system implements COPY-FROM :CSV while vellum-postmodern implements COPY-FROM :POSTMODERN. If you want to see more examples of COPY-FROM implementations you can check the source code of these systems.

0.8Columns manipulation

Usually, BMI would not be present in the data frame from the start and must be somehow added. Vellum does not make this needlessly complex. Simply use NEW-COLUMNS function.
  (defparameter *table*
    (alexandria:with-input-from-file (stream "bmi data.txt")
      (vellum:add-columns (vellum:transform (vellum:make-table :columns '((:name weight)
                                                                          (:name height)))
                                            (vellum:bind-row (weight height)
                                              (let ((line (read-line stream nil nil)))
                                                (if (null line)
                                                    (vellum:finish-transformation)
                                                    (destructuring-bind (new-weight new-height) (parse-data-line line)
                                                        (setf weight new-weight
                                                              height new-height)))))
                                            :end nil)
                          '(:name bmi))))

Let's admit it: this is getting a little bit nested. From now one I will use threading macro ~> from serapeum. By using this macro we are getting a somewhat more understandable form that does the exact same thing.

  (defparameter *table*
    (alexandria:with-input-from-file (stream "bmi data.txt")
      (serapeum:~>
       (vellum:make-table :columns '((:name weight) (:name height)))
       (vellum:transform (vellum:bind-row (weight height)
                           (let ((line (read-line stream nil nil)))
                             (if (null line)
                                 (vellum:finish-transformation)
                                 (destructuring-bind (new-weight new-height) (parse-data-line line)
                                     (setf weight new-weight
                                           height new-height)))))
                         :end nil)
       (vellum:add-columns '(:name bmi))))

Either way, it is impossible to change the number of columns in the table in a destructive way. This is by design as vellum headers are immutable as well.

Selecting a subset of the columns is equally important to add new columns. To do this in Vellum we should use VELLUM:SELECT function. Now, this function is slightly more complex, and it is all because of the input. For instance, to select just a single column…

  (defparameter *table* (vellum:make-table :columns '((:name first-colum)
                                                      (:name second-column)
                                                      (:name third-column))))
  (defparameter *just-second-and-third* (vellum:select *table*
                                          :columns '(1 2)))
  (defparameter *just-second-and-third* (vellum:select *table*
                                          :columns '(second-column third-column)))
  (defparameter *just-second-and-third* (vellum:select *table*
                                          :columns (vellum:s (vellum:from :from 'second-column)))
  (defparameter *just-second-and-third* (vellum:select *table*
                                          :columns (vellum:s (vellum:from :from 1))))

All four ways to select second and third columns are equally valid. Selecting by range is probably not all that useful in the context of columns, however, the exact same syntax is used for :ROWS where it really it is in it's element.

  (defparameter *table* (vellum:make-table :columns '((:name first-colum)
                                                      (:name second-column)
                                                      (:name third-column)
                                                      (:name fourth-column)
                                                      (:name fifth-column)
                                                      (:name sixth-column)
                                                      (:name seventh-column)
                                                      (:name eight-column)
                                                      (:name nine-column))))
  (defparameter *columns-subset* (vellum:select *table*
                                   :columns '(2 3 4 7 8 9)))
  (defparameter *columns-subset* (vellum:select *table*
                                   :columns (alexandria:iota 6 :start 2)))
  (defparameter *columns-subset* (vellum:select *table*
                                   :columns (vellum:s '(2 . 10))))
  (defparameter *columns-subset* (vellum:select *table*
                                   :columns (vellum:s 2 (vellum:between :to 10))))
  (defparameter *columns-subset* (vellum:select *table*
                                   :columns (vellum:s 2 3 4 7 (vellum:between :to 10))))

Depending on the specific use case each of those ways can be the most suitable.

Select can be also used to establish new names for columns. To do so, provide list with two elements.

(defparameter *table* (vellum:make-table :columns '((:name first-colum)
                                                    (:name second-column)
                                                    (:name third-column))))
(defparameter *just-second-and-third* (vellum:select *table*
                                        :columns '((1 (:name new-first-column)) 2)))
(defparameter *just-second-and-third* (vellum:select *table*
                                        :columns '((second-column new-first-column)
                                                   third-column)))

The second element supports all the same options as the make-table :columns element itself.

0.9Joins

So, working with relations? Chances are you gonna use JOIN.
(let* ((frame-1 (transform (vellum:make-table :columns '(a b))
                           (vellum:bind-row (a b)
                             (setf a vellum.table:*current-row*)
                             (setf b (format nil "a~a" a)))
                             :end 5))
         (frame-2 (transform (vellum:make-table :columns '(a b))
                             (vellum:bind-row (a b)
                               (setf a vellum.table:*current-row*)
                               (setf b (format nil "b~a" a)))
                             :end 5))
         (result (vellum:join :hash :inner
                              `((:frame-1 ,frame-1 a)
                                (:frame-2 ,frame-2 a)))))
    (vellum:show :text result))

The above example demonstrates how to use the join function to construct a new data frame containing columns from both frame-1 and frame-2. As you may noticed, both frame-1 and frame-2 contain the columns of the same names, meaning that the input to the join function must contain the co called label (the first element of the inner list, in this specific example :frame-1 and :frame-2). Labels will be used to construct names of columns in the result table. In this specific example the result table looks as follows.

FRAME-1/A  FRAME-1/B  FRAME-2/A  FRAME-2/B
==========================================
4          a4         4          b4
3          a3         3          b3
2          a2         2          b2
1          a1         1          b1
0          a0         0          b0

Column A in both columns was used to perform the join (third element in the inner list). It is also possible to use multiple columns as a join key , in which case they will be combined into list. To do so, simply add additional columns after the second argument, for instance: (:frame-1 frame-1 a b). Note, that hash-join uses by default 'eql test for the internal hash-table, which is not suited to such use case.

Notice that the label was used as a prefix. This often means that you will want to rename columns. Use the SELECT function to do that (either on the join arguments, or the join result). It is also possible that the column names won't conflict in the resulting table, in such case you can place NIL as the label. For example:

(let* ((frame-1 (transform (make-table :columns '(a b))
                           (vellum:bind-row (a b)
                             (setf a vellum.table:*current-row*)
                             (setf b (format nil "a~a" a)))
                           :end 5))
       (frame-2 (transform (make-table :columns '(a b))
                           (vellum:bind-row (a b)
                             (setf a vellum.table:*current-row*)
                             (setf b (format nil "b~a" a)))
                           :end 5))
       (result (join :hash :inner
                     `((nil ,frame-1 a)
                       (:frame-2 ,frame-2 a)))))
  (vellum:show :text result))

A  B   FRAME-2/A  FRAME-2/B
===========================
4  a4  4          b4
3  a3  3          b3
2  a2  2          b2
1  a1  1          b1
0  a0  0          b0

Currently only the hash join algorithm is supported. On the other hand, you can use both :left and :inner joins, just like in the SQL.

0.10Other functions

Use HSTACK and VSTACK to concatenate tables column-wise and row-wise. Use ORDER-BY to sort table content. Use NEW-COLUMNS to add new columns to a table (non destructive).

0.11A few remarks about the inner representation and efficiency

Vellum stores data in a column format, where each column is a sparse variant of an RRB trie. I've chosen this type of representation for efficient copy-on-write. Copy-on-write is important as it allows for exception safety in the transform function as well as reduces memory usage by allowing safe sharing of the common data bits. However, at the same time, data frames expose a mutable interface. You could consider it to be unusual.

In fact, Vellum has a concept of ownership, meaning that each RRB trie node is owned by a data frame instance. If it happens that you are attempting to mutate a node owned by the current data frame, mutating is allowed. Otherwise, a new copy of the node is created but owned by the current data frame. This prevents spilling side effects outside of the data frame.

This also means that constructing a copy of the data frame can be optimized beyond a deep copy. REPLICA function will return a new instance of a data frame passed as the first argument. Changes performed on the returned data frame won't leak to the passed data frame. Additionally, if you pass T as the second argument (defaults to NIL) changes to the original data frame won't leak the new data frame. You probably won't be using REPLICA function all that often, but if you want to keep a history of your data changes in your lisp process this trick can be quite useful.

0.12Riding on the cl-data-structures

CL-data-structures is my other library. The name is a misnomer as the library grew into a hulking abomination of feature creep. I want to eventually divide it into smaller pieces but it is useful regardless. This is especially because of the ranges and algorithms implemented within. They work like Java Stream interface. Consider a common task of calculating the average of column. To do it with cl-data-structures you will just…
  (vellum:with-table (*table*)
    (cl-ds.math:average *table* :key (vellum:bind-row-closure (vellum:bind-row (column-name) column-name))))

Form passed as a :KEY is a very common pattern. Enough so that there is a shortening macro.

  (vellum:with-table (*table*)
    (cl-ds.math:average *table* :key (vellum:brr column-name)))

BRR stands for body row reference if you are wondering. The above code can be shortened further…

  (vellum:pipeline (*table*)
    (cl-ds.math:average :key (vellum:brr column-name)))

PIPELINE is called so because it is typically used to build longer control flows. For instance, by incorporating GROUP-BY.

  (vellum:pipeline (*table*)
    (cl-ds.alg:group-by :key (vellum:brr grouping-column-name))
    (cl-ds.math:average :key (vellum:brr column-name)))

Besides GROUP-BY there are also other functions altering how aggregation is performed. For instance CL-DS.ALG:ARRAY-ELEMENTWISE will apply aggregation function independently for each position in the array of the input and will return array as a result. Anyway, you probably would rather have GROUP-BY return a data frame instead of the cl-data-structures range. This will require just one more extra form.

  (vellum:pipeline (*table*)
    (cl-ds.alg:group-by :key (vellum:brr grouping-column-name))
    (cl-ds.math:average :key (vellum:brr column-name))
    (vellum:to-table :columns '((:name group) (:name aggregation-result))))

And if you simply want to write code that mimics MS Excel, you can do that as well by using AGGREGATE-ROWS macro.

  (vellum:aggregate-rows *table*
   :column1 ((cl-ds.math:average) :skip-nulls t)
   :column2 ((cl-ds.math:average) :skip-nulls t))

The above will construct a new data frame containing two columns (with names :column1 and :column2) and one row, holding the average of the column1 and column2 of the original table. The :SKIP-NULLS option prevents code from erroring out on the :NULL. It also improves performance somewhat.

0.13Integrating with other stuff

As hinted before, Vellum is designed to be easy to use with other libraries. For instance, let's say you want to use postmodern to access the postgres database where you are keeping your data safe and warm. To do so, you can do the following.
  (defparameter *table* (vellum:make-table :columns '((:name first-column) (:name second-column))))
  (vellum:with-table (*table*)
    (postmodern:with-connection '("database" "username" "password" "localhost")
      (let ((transformation (vellum.table:transformation *table* nil :in-place t :start 0))
            (vellum.header:set-row (vellum.table:standard-transformation-row transformation))
            (postmodern:doquery (:select 'first_column 'second_column :from 'table)
                (first_column second_column)
              (vellum.table:transform-row
               transformation
               (vellum:bind-row (first-column second-column)
                 (setf first-colum first_column
                       second-column second_column)))))
            (vellum.table:transformation-result transformation))))

Postmodern is a really nice library, but doquery insists on driving its own iteration. As you can see that's not a big deal. For libraries that present us with an iterator-like interface, you can simply use TRANSFORM directly. Just don't forget to pass NIL as :END and call VELLUM:FINISH-TRANSFORMATION from the BIND-ROW form. I already showed you how.

0.14The future

Support for additional data sources and storage formats will be added into seperated systems in the vellum project itself. I actually already added support for constructing data frames out of the postmodern queries as well as CSV file handling build with fare-csv library. I would like to also support parquet files (with the help of cl-apache-arrow). Stay tuned.

Dependencies (9)

  • alexandria
  • cl-data-structures
  • closer-mop
  • documentation-utils-extensions
  • iterate
  • lparallel
  • metabang-bind
  • prove
  • serapeum
  • GitHub
  • Quicklisp
  • Sponsor