CL-STRING-MATCH aims at providing robust implementations of string matching algorithms. These algorithms are also called "substring search" or "subsequence search" algorithms.
Currently it provides implementations of the following string matching algorithms (see wiki for details):
- Brute-force (also known as the naïve algorithm)
- Boyer-Moore (with mismatched character heuristic and good suffix shift)
- Boyer-Moore-Horspool algorithm
- Knuth-Morris-Pratt algorithm
- Rabin-Karp algorithm
- Shift-OR algorithm (single pattern)
- Aho-Corasick algorithm (with finite set of patterns, forward transition and fail function)
- A simple backtracking regular expression engine, `re`, similar to that of the Lua programming language. At the moment it significantly underperforms compared to CL-PPCRE.
Some string processing algorithms are also implemented:
- Simple (naïve) suffix tree construction algorithm
- Ukkonen's suffix tree construction algorithm
- Prefix trie
- Suffix tree
- Testing whether a string has the given suffix or prefix (starts with or ends with the pattern)
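As a minimal illustration of the last item, prefix and suffix tests can be expressed in portable Common Lisp with the standard `mismatch` function. The helper names below are illustrative sketches, not the library's exports:

```lisp
;; Illustrative helpers, NOT part of the CL-STRING-MATCH API.
;; STR starts with PAT when MISMATCH finds no difference, or the
;; first difference lies at or past the end of PAT.
(defun starts-with-p (pat str)
  (let ((pos (mismatch pat str)))
    (or (null pos) (>= pos (length pat)))))

;; With :FROM-END, MISMATCH compares from the right; a full suffix
;; match yields NIL (equal strings) or 0 (PAT exhausted).
(defun ends-with-p (pat str)
  (let ((pos (mismatch pat str :from-end t)))
    (or (null pos) (zerop pos))))
```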
Some algorithms (Brute-force, Boyer-Moore-Horspool) have parametric implementations (code templates) making it possible to declare specific implementations for application-specific custom data types and data structures.
This library is routinely tested on Steel Bank CL, Clozure CL, Embeddable CL and Armed Bear CL. Chances are really high that it will work on other platforms without problems (check its status on CL-TEST-GRID).
Check the API Reference for more details.
Since the standard `search` function works fine, one might ask: why do we need yet another implementation? The answer is simple: advanced algorithms offer performance benefits over the standard implementation, which is based on the brute-force algorithm.
Benchmarks show that, depending on the environment and usage pattern, a Boyer-Moore-Horspool implementation can outperform the standard `search` function in SBCL by almost 18 times! Check the code in the `bench` folder for further details.
CL-STRING-MATCH exports functions from the `cl-string-match` package (also nicknamed `sm`).

Shortcut functions search for a given pattern `pat` in the text `txt`. They are usually much slower (because they build index structures every time they are called) but are easier to use:
- `string-contains-brute pat txt`: Brute-force
- `string-contains-bm pat txt`: Boyer-Moore
- `string-contains-bmh pat txt`: Boyer-Moore-Horspool
- `string-contains-kmp pat txt`: Knuth-Morris-Pratt
- `string-contains-ac pat txt`: Aho-Corasick
- `string-contains-rk pat txt`: Rabin-Karp
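For instance, a shortcut call takes the pattern first (a sketch assuming, by analogy with the standard `search`, that the position of the first match is returned):

```lisp
;; Hypothetical usage sketch: look for "abc" in the text with the
;; Knuth-Morris-Pratt shortcut; the return value is assumed to be
;; the match position, as with the standard SEARCH.
(sm:string-contains-kmp "abc" "ababcfbgsldkj")
```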
A more robust approach is to use pre-calculated index data that is processed by a pair of "initialize" and "search" functions. `initialize-ac` can accept a list of patterns that are compiled into a trie.

The Brute-force algorithm does not use pre-calculated data and has no "initialize" function.
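For example, a sketch of the Aho-Corasick pair (`search-ac` is assumed here by analogy with the `initialize-bmh`/`search-bmh` naming; consult the API Reference for the exact entry points):

```lisp
;; Sketch under assumed naming: compile several patterns into a
;; trie once, then scan a text with the pre-built index.
(let ((trie (sm:initialize-ac '("he" "she" "his" "hers"))))
  (sm:search-ac trie "ushers"))
```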
The Boyer-Moore-Horspool implementation (the `-BMH8` functions) also accepts `:start2` and `:end2` keyword arguments for the "search" and "contains" functions.
The following example looks for the substring "abc" in a given line of text using the Boyer-Moore-Horspool implementation:

```lisp
(let ((idx (initialize-bmh "abc")))
  (search-bmh idx "ababcfbgsldkj"))
```
Counting all matches of a given pattern in a string:

```lisp
(loop with str = "____abc____abc____ab"
      with pat = "abc"
      with idx = (sm:initialize-bmh8 pat)
      with z = 0
      with s = 0
      while s
      do (when (setf s (sm:search-bmh8 idx str :start2 s))
           (incf z)
           (incf s (length pat)))
      finally (return z))
```
It should be noted that the Boyer-Moore-Horspool (`bmh`) implementation can offer an order of magnitude performance boost compared to the standard `search` function. However, some implementations create a "jump table" that can be the size of the alphabet (CHAR-CODE-LIMIT exceeds one million on implementations supporting Unicode) and thus consume a significant chunk of memory. There are different solutions to this problem, and at the moment a version for ASCII strings is offered:
`initialize-bmh8 pat` and `search-bmh8 bm txt`, as well as `string-contains-bmh8 pat txt`, work for strings whose characters fall within the 256 char code limit.
This project also contains code that is not directly involved with the pattern search algorithms but might nevertheless prove useful for text handling and processing. Check the `contrib` folder in the repository for more details. Currently it contains:
- `ascii-strings.lisp` aims to provide single-byte string functionality for Unicode-enabled Common Lisp implementations. Another goal is to reduce the memory footprint and boost the performance of string-processing tasks.
- `simple-scanf` implements a subset of the original POSIX `scanf` specification.
The project still lacks some important features and is under constant development. Any kind of contributions or feedback are welcome.
Visit project Wiki for additional information.