cl-transducers
2024-10-12
Ergonomic, efficient data processing.
Upstream URL
Author
License
I think Transducers are a fundamental primitive that decouples critical logic from list/sequence processing, and if I had to do Clojure all over I would put them at the bottom.
-- Rich Hickey
Transducers are an ergonomic and extremely memory-efficient way to process a data source. Here "data source" means simple collections like Lists or Vectors, but also potentially large files or generators of infinite data.
Transducers...
- allow the chaining of operations like
map
andfilter
without allocating memory between each step. - aren't tied to any specific data type; they need only be implemented once.
- vastly simplify "data transformation code".
- have nothing to do with "lazy evaluation".
- are a joy to use!
Example: While skipping every second line of a file, sum the lengths of only evenly-lengthed lines.
(transduce
;; How do we want to process each element?
(comp (step 2) (map #'length) (filter #'evenp))
;; How do we want to combine all the elements together?
#'+
;; What's our original data source?
#p"giant-file.txt")
This library has been confirmed to work with SBCL, CCL, ECL, Clasp and LispWorks 8.
⚠ ABCL cannot be used due to lack of support for tail-call elimination and tail-recursive
labels
.
1History and Motivation
Originally invented in Clojure and adapted to Scheme as SRFI-171, Transducers are an excellent way to think about - and efficiently operate on - collections or streams of data. Transduction operations are strict and don't involve "laziness" or "thunking" in any way, yet only process the exact amount of data you ask them to.
This library draws inspiration from both the original Clojure and SRFI-171, while adding many other convenient operations commonly found in other languages.
2Installation
This library is available on Quicklisp and Ultralisp. To download the main system:
(ql:quickload :transducers)
For the JSON extensions:
(ql:quickload :transducers-jzon)
3Usage
3.1Importing
Since this library reuses some symbol names also found in :cl
, it is expected
that you import transducers
as follows in your defpackage
:
(:local-nicknames (#:t #:transducers))
You can then make relatively clean calls like:
(t:transduce (t:map #'1+) #'t:vector '(1 2 3))
;; => #(2 3 4)
3.2Transducers, Reducers, and Sources
;; The fundamental pattern.
(t:transduce <transducer-chain> <reducer> <source>)
Data processing largely has three concerns:
- Where is my data coming from? (sources)
- What do I want to do to each element? (transducers)
- How do I want to collect the results? (reducers)
Each full "transduction" requires all three. We pass one of each to the
transduce
function, which drives the process. It knows how to pull values from
the source, feed them through the transducer chain, and wrap everything together
via the reducer.
- Typical transducers are
map
,filter
, andtake
. - Typical reducers are
+
,count
,t:cons
, andfold
. - Typical sources are lists, vectors, strings, hash tables, and files.
Generators are a special kind of source that yield infinite data. Typical
generators are repeat
, cycle
, and ints
.
Let's sum the squares of the first 1000 odd integers:
(t:transduce
(t:comp (t:filter #'oddp) ;; (2) Keep only odd numbers.
(t:take 1000) ;; (3) Keep the first 1000 filtered odds.
(t:map (lambda (n) (* n n)))) ;; (4) Square those 1000.
#'+ ;; (5) Reducer: Add up all the squares.
(t:ints 1)) ;; (1) Source: Generate all positive integers.
;; => 1333333000 (31 bits, #x4F790C08)
Two things of note here:
comp
is used here to chain together different transducer steps. Notice thatthe order appears "backwards" from usual function composition. It may help toimagine thatcomp
is acting like the->>
macro here.comp
is supplied here asa convenience; you're free to usealexandria:compose
if you wish.- The reduction via
+
is listed as Step 5, but really it's occuring throughoutthe transduction process. Each value that makes it through the composedtransducer chain is immediately added to an internal accumulator.
Explore the other transducers and reducers to see what's possible! You'll never
write a loop
again.
3.3Using the fold
Reducer
A reducer is a function that "reduces" or "folds" the results of the transducer chain into some single value. This could be a collection or some scalar. Some reducers can even short-circuit, yielding a desired value early.
fold
is the ultimate reducer, and thus deserves special attention. fold
creates
an ad-hoc reducer based on a given 2-argument function. A SEED is required as
the initial accumulator value, which also becomes the return value in case there
were no input left in the transduction.
The normal CL functions +
and *
are automatically valid reducers, because they
yield sane values even when given 0 or 1 arguments. Other functions like max
cannot be used as-is as reducers since they can't be called without arguments.
For functions like this, fold
is appropriate.
;; The length of the longest word in this README.
(let ((xf (t:comp (t:map #'str:words)
#'t:concatenate
(t:filter (lambda (w) (every #'alpha-char-p w)))
(t:map #'length))))
(t:transduce xf (t:fold #'cl:max 0) #p"README.org"))
;; => 14
In Clojure this function is called completing
.
3.4Processing JSON Data
The system transducers-jzon
provides automatic JSON streaming support via the
jzon library. Like transducers
itself, it is expected that you import this
system with a nickname:
(:local-nicknames (#:j #:transducers-jzon))
Only two functions are exposed: read
and write
.
read
is a source that accepts a pathname, open stream, or a string. Itproduces parsed JSON values as Lisp types. JSON Objects become Hash Tables.write
is a reducer that expects an open stream. It writes the stream of Lisptypes into their logical JSON equivalents.
Here is a simple example of reading some JSON data from a string, doing nothing to it, and outputting it again to a new string:
(with-output-to-string (stream)
(t:transduce #'t:pass (j:write stream) (j:read "[{\"name\": \"A\"}, {\"name\": \"B\"}]")))
;; => "[{\"name\":\"A\"},{\"name\":\"B\"}]"
Note that the JSON data must be a JSON array. There is otherwise no size limit; the library can handle any amount of JSON input.
For more examples, see the Gallery below.
4API and Compatibility Charts
This library offers an extensive implementation of the Transducer pattern compared to a number of languages. Note that the xforms library offers a number extensions to what is normally available in Clojure.
4.1Transducers
CL transducers |
loop macro |
Clojure | Scheme | Rust | Haskell | |
---|---|---|---|---|---|---|
pass |
✓ | map identity |
map identity |
Just collect |
map id |
|
map |
✓ | for x being the... |
✓ | ✓ | ✓ | ✓ |
filter |
✓ | if / when |
✓ | ✓ | ✓ | ✓ |
filter-map |
✓ | keep |
✓ | mapMaybe |
||
remove |
unless |
✓ | ||||
unique |
✓ | distinct |
✓ | nub |
||
dedup |
✓ | dedupe |
✓ | |||
drop |
✓ | ✓ | ✓ | ✓ | ✓ | |
drop-while |
✓ | ✓ | ✓ | ✓ | ✓ | |
take |
✓ | ✓ | ✓ | ✓ | ✓ | |
take-while |
✓ | while etc. |
✓ | ✓ | ✓ | ✓ |
replace |
✓ | ✓ | ||||
Flat Map | mapcat |
tappend-map |
flat_map |
>>= |
||
uncons |
✓ | cat |
||||
concatenate |
✓ | cat |
✓ | flatten |
join |
|
flatten |
✓ | ✓ | ||||
segment |
✓ | partition-all |
✓ | |||
window |
✓ | chunks |
||||
group-by |
✓ | partition-by |
✓ | |||
intersperse |
✓ | interpose |
✓ | ✓ | ✓ | |
enumerate |
✓ | map-indexed |
✓ | ✓ | ||
step |
✓ | by |
take-nth |
|||
scan |
✓ | ✓ | ||||
random-sample |
✓ | |||||
log |
✓ | Print in loop body | ✓ | trace |
4.2Higher-order Transducers
Transducers which can alter the transduction chain itself during runtime.
CL transducers |
loop macro |
Clojure | Scheme | Rust | Haskell | |
---|---|---|---|---|---|---|
branch |
✓ | |||||
inject |
✓ | |||||
split |
✓ | |||||
zip |
✓ | ✓ | ✓ |
4.3Reducers
CL transducers |
loop macro |
Clojure | Scheme | Rust | Haskell | |
---|---|---|---|---|---|---|
Into List | ✓ | collect |
conj |
✓ | ✓ | ✓ |
Into Vector | ✓ | vconcat |
conj |
✓ | ✓ | |
Into String | ✓ | concat |
str |
✓ | ✓ | |
Into Map | ✓ | conj |
✓ | ✓ | ||
Into Set | conj |
|||||
count |
✓ | ✓ | ✓ | ✓ | ✓ | |
average |
✓ | |||||
anyp |
✓ | ✓ | ✓ | ✓ | ||
allp |
✓ | ✓ | ✓ | ✓ | ||
first |
✓ | return etc. |
✓ | ✓ | ||
last |
✓ | ✓ | ✓ | |||
fold |
✓ | completing |
✓ | ✓ | ||
max |
✓ | maximize |
✓ | ✓ | ||
min |
✓ | minimize |
✓ | ✓ | ||
find |
✓ | return etc. |
✓ | |||
for-each |
✓ | do |
✓ | mapM_ |
Why oh why is it so difficult to find an implementation of average
in many
languages?
4.4Generators
CL transducers |
loop macro |
Clojure | Scheme | Rust | Haskell | |
---|---|---|---|---|---|---|
ints |
✓ | for x from N to M |
1.. |
[1..] |
||
cycle |
✓ | ✓ | ✓ | |||
repeat |
✓ | repeat |
✓ | ✓ | ||
random |
✓ | |||||
shuffle |
✓ |
4.5Data Sources
CL transducers |
loop macro |
Clojure | Scheme | Rust | Haskell | |
---|---|---|---|---|---|---|
File Lines | ✓ | ✓ | ✓ | ✓ | ✓ | |
JSON Stream | ✓ | ✓ | ✓ | ✓ |
5Example Gallery
5.1Words in a File
(t:transduce (t:comp (t:map #'str:words) #'t:concatenate)
#'t:count #p"README.org")
;; => 977
5.2Reducing into Property Lists and Assocation Lists
There is no special reducer function for plists, because none is needed. If you
have a stream of cons cells, you can break it up with uncons
and then collect
with cons
as usual:
(t:transduce (t:comp (t:map (lambda (pair) (cons (car pair) (1+ (cdr pair)))))
#'t:uncons)
#'t:cons (t:plist '(:a 1 :b 2 :c 3)))
;; => (:a 2 :b 3 :c 4)
Likewise, Association Lists are already lists-of-cons-cells, so no special treatment is needed:
(t:transduce #'t:pass #'t:cons '((:a . 1) (:b . 2) (:c . 3)))
;; => ((:a . 1) (:b . 2) (:c . 3)))
5.3JSON: Calculating average age
Since JSON Objects are parsed as Hash Tables, we use the usual functions to retrieve fields we want.
(t:transduce (t:filter-map (lambda (ht) (gethash "age" ht)))
#'t:average
(j:read "[{\"age\": 34}, {\"age\": 25}]"))
;; => 59/2 (29.5)
5.4Sieve of Eratosthenes
An ancient method of calculating Prime Numbers.
(let ((xf (t:comp (t:inject (lambda (prime) (t:filter (lambda (n) (/= 0 (mod n prime))))))
(t:take 10))))
(cons 2 (t:transduce xf #'t:cons (t:ints 3 :step 2))))
;; => (2 3 5 7 11 13 17 19 23 29 31)
6Limitations
- This library is generally portable, but assumes your CL implementationsupports tail-recursive usage of
labels
. - A way to model the common
zip
function has not yet been found.