cl-transducers

2023-10-21

Ergonomic, efficient data processing.

Upstream URL

git.sr.ht/~fosskers/cl-transducers

Author

Colin Woodbury <colin@fosskers.ca>

License

LGPL-3.0-only
README
Transducers: Ergonomic, efficient data processing

I think Transducers are a fundamental primitive that decouples critical logic from list/sequence processing, and if I had to do Clojure all over I would put them at the bottom.

-- Rich Hickey

Transducers are an ergonomic and extremely memory-efficient way to process a data source. Here "data source" means simple collections like Lists or Vectors, but also potentially large files or generators of infinite data.

Transducers...

  • allow the chaining of operations like map and filter without allocating memory between each step.
  • aren't tied to any specific data type; they need only be implemented once.
  • vastly simplify "data transformation code".
  • have nothing to do with "lazy evaluation".
  • are a joy to use!

Example: While skipping every second line of a file, sum the lengths of only evenly-lengthed lines.

(transduce
  ;; How do we want to process each element?
  (comp (step 2) (map #'length) (filter #'evenp))
  ;; How do we want to combine all the elements together?
  #'+
  ;; What's our original data source?
  #p"giant-file.txt")

This library has been confirmed to work with SBCL, CCL, ECL, and Clasp.

⚠ ABCL cannot be used due to lack of support for tail-call elimination and tail-recursive labels.

1History and Motivation

Originally invented in Clojure and adapted to Scheme as SRFI-171, Transducers are an excellent way to think about - and efficiently operate on - collections or streams of data. Transduction operations are strict and don't involve "laziness" or "thunking" in any way, yet only process the exact amount of data you ask them to.

This library draws inspiration from both the original Clojure and SRFI-171, while adding many other convenient operations commonly found in other languages.

2Installation

This library is available on Ultralisp. With Ultralisp installed as a distribution, you can simply run the following to download the main system:

(ql:quickload :transducers)

For the JSON extensions:

(ql:quickload :transducers-jzon)

3Usage

3.1Importing

Since this library reuses some symbol names also found in :cl, it is expected that you import transducers as follows in your defpackage:

(:local-nicknames (#:t #:transducers))

You can then make relatively clean calls like:

(t:transduce (t:map #'1+) #'t:vector '(1 2 3))
;; => #(2 3 4)

3.2Transducers, Reducers, and Sources

;; The fundamental pattern.
(t:transduce <transducer-chain> <reducer> <source>)

Data processing largely has three concerns:

  1. Where is my data coming from? (sources)
  2. What do I want to do to each element? (transducers)
  3. How do I want to collect the results? (reducers)

Each full "transduction" requires all three. We pass one of each to the transduce function, which drives the process. It knows how to pull values from the source, feed them through the transducer chain, and wrap everything together via the reducer.

  • Typical transducers are map, filter, and take.
  • Typical reducers are +, count, t:cons, and fold.
  • Typical sources are lists, vectors, strings, hash tables, and files.

Generators are a special kind of source that yield infinite data. Typical generators are repeat, cycle, and ints.

Let's sum the squares of the first 1000 odd integers:

(t:transduce
 (t:comp (t:filter #'oddp)             ;; (2) Keep only odd numbers.
         (t:take 1000)                 ;; (3) Keep the first 1000 filtered odds.
         (t:map (lambda (n) (* n n)))) ;; (4) Square those 1000.
 #'+         ;; (5) Reducer: Add up all the squares.
 (t:ints 1)) ;; (1) Source: Generate all positive integers.
;; => 1333333000 (31 bits, #x4F790C08)

Two things of note here:

  1. comp is used here to chain together different transducer steps. Notice thatthe order appears "backwards" from usual function composition. It may help toimagine that comp is acting like the ->> macro here. comp is supplied here asa convenience; you're free to use alexandria:compose if you wish.
  2. The reduction via + is listed as Step 5, but really it's occuring throughoutthe transduction process. Each value that makes it through the composedtransducer chain is immediately added to an internal accumulator.

Explore the other transducers and reducers to see what's possible! You'll never write a loop again.

3.3Using the fold Reducer

A reducer is a function that "reduces" or "folds" the results of the transducer chain into some single value. This could be a collection or some scalar. Some reducers can even short-circuit, yielding a desired value early.

fold is the ultimate reducer, and thus deserves special attention. fold creates an ad-hoc reducer based on a given 2-argument function. A SEED is required as the initial accumulator value, which also becomes the return value in case there were no input left in the transduction.

The normal CL functions + and * are automatically valid reducers, because they yield sane values even when given 0 or 1 arguments. Other functions like max cannot be used as-is as reducers since they can't be called without arguments. For functions like this, fold is appropriate.

;; The length of the longest word in this README.
(let ((xf (t:comp (t:map #'str:words)
                  #'t:concatenate
                  (t:filter (lambda (w) (every #'alpha-char-p w)))
                  (t:map #'length))))
  (t:transduce xf (t:fold #'cl:max 0) #p"README.org"))
;; => 14

In Clojure this function is called completing.

3.4Processing JSON Data

The system transducers-jzon provides automatic JSON streaming support via the jzon library. Like transducers itself, it is expected that you import this system with a nickname:

(:local-nicknames (#:j #:transducers-jzon))

Only two functions are exposed: read and write.

  • read is a source that accepts a pathname, open stream, or a string. Itproduces parsed JSON values as Lisp types. JSON Objects become Hash Tables.
  • write is a reducer that expects an open stream. It writes the stream of Lisptypes into their logical JSON equivalents.

Here is a simple example of reading some JSON data from a string, doing nothing to it, and outputting it again to a new string:

(with-output-to-string (stream)
  (t:transduce #'t:pass (j:write stream) (j:read "[{\"name\": \"A\"}, {\"name\": \"B\"}]")))
;; => "[{\"name\":\"A\"},{\"name\":\"B\"}]"

Note that the JSON data must be a JSON array. There is otherwise no size limit; the library can handle any amount of JSON input.

For more examples, see the Gallery below.

4API and Compatibility Charts

This library offers an extensive implementation of the Transducer pattern compared to a number of languages. Note that the xforms library offers a number extensions to what is normally available in Clojure.

4.1Transducers

CL transducers loop macro Clojure Scheme Rust Haskell
pass map identity map identity Just collect map id
map for x being the...
filter if / when
filter-map keep mapMaybe
remove unless
unique distinct nub
dedup dedupe
drop
drop-while
take
take-while while etc.
replace
Flat Map mapcat tappend-map flat_map >>=
uncons cat
concatenate cat flatten join
flatten
segment partition-all
window chunks
group-by partition-by
intersperse interpose
enumerate map-indexed
step by take-nth
scan
random-sample
log Print in loop body trace

4.2Higher-order Transducers

Transducers which can alter the transduction chain itself during runtime.

CL transducers loop macro Clojure Scheme Rust Haskell
branch
inject
split
zip

4.3Reducers

CL transducers loop macro Clojure Scheme Rust Haskell
Into List collect conj
Into Vector vconcat conj
Into String concat str
Into Map conj
Into Set conj
count
average
anyp
allp
first return etc.
last
fold completing
max maximize
min minimize
find return etc.
for-each do mapM_

Why oh why is it so difficult to find an implementation of average in many languages?

4.4Generators

CL transducers loop macro Clojure Scheme Rust Haskell
ints for x from N to M 1.. [1..]
cycle
repeat repeat
random
shuffle

4.5Data Sources

CL transducers loop macro Clojure Scheme Rust Haskell
File Lines
JSON Stream

5Example Gallery

5.1Words in a File

(t:transduce (t:comp (t:map #'str:words) #'t:concatenate)
             #'t:count #p"README.org")
;; => 977

5.2Reducing into Property Lists and Assocation Lists

There is no special reducer function for plists, because none is needed. If you have a stream of cons cells, you can break it up with uncons and then collect with cons as usual:

(t:transduce (t:comp (t:map (lambda (pair) (cons (car pair) (1+ (cdr pair)))))
                     #'t:uncons)
             #'t:cons (t:plist '(:a 1 :b 2 :c 3)))
;; => (:a 2 :b 3 :c 4)

Likewise, Association Lists are already lists-of-cons-cells, so no special treatment is needed:

(t:transduce #'t:pass #'t:cons '((:a . 1) (:b . 2) (:c . 3)))
;; => ((:a . 1) (:b . 2) (:c . 3)))

5.3JSON: Calculating average age

Since JSON Objects are parsed as Hash Tables, we use the usual functions to retrieve fields we want.

(t:transduce (t:filter-map (lambda (ht) (gethash "age" ht)))
             #'t:average
             (j:read "[{\"age\": 34}, {\"age\": 25}]"))
;; => 59/2 (29.5)

5.4Sieve of Eratosthenes

An ancient method of calculating Prime Numbers.

(let ((xf (t:comp (t:inject (lambda (prime) (t:filter (lambda (n) (/= 0 (mod n prime))))))
                  (t:take 10))))
  (cons 2 (t:transduce xf #'t:cons (t:ints 3 :step 2))))
;; => (2 3 5 7 11 13 17 19 23 29 31)

6Limitations

  1. This library is generally portable, but assumes your CL implementationsupports tail-recursive usage of labels.
  2. A way to model the common zip function has not yet been found.

7Further Work

  • Notes on performance.
  • More higher-order transducers.

8Resources

Dependencies (4)

  • jzon
  • parachute
  • sycamore
  • trivia

Dependents (0)

    • GitHub
    • Quicklisp