lol-re

2015-01-13

Tiny wrapper around CL-PPCRE that makes regexp usage more Perl-like. Inspired by the #~m and #~s read macros from Let Over Lambda (http://www.letoverlambda.com).

This package introduces two car-reader-macros (see CL-READ-MACRO-TOKENS), M~ and S~, plus the variant MR~ (see below). M~ is for matching, MR~ is for matching with automatic reset, and S~ is for substitution (in direct analogy with Perl's '=~ m//' and '=~ s///' idioms).

M~

Syntax:

m~ regexp [string] => function | (or null string)

Basic example:

(with-open-file (out-file "out-file" :direction :output)
  (iter (for line in-file "in-file" using #'read-line)
        (and (m~ "(some)regexp(?<with>with)grouping(s)" line)
             ;; plenty of anaphoric bindings are available after the match
             (format out-file #?"$($1) $($2) $($with) $($3)"))))

The first argument of M~ is read with the #?r reader installed for the double-quote character. Thus you don't have to escape the backslashes that appear in front of regexp metacharacters. Unfortunately, interpolation is not supported for now. Only double quotes are supported as the delimiter, since otherwise the lisp-mode of Emacs would go crazy on seeing something like this: /"/
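For comparison, here is roughly what the escaping looks like when CL-PPCRE is called directly from an ordinary Lisp string. This is only a sketch: FIRST-DECIMAL is a hypothetical helper name, and loading through Quicklisp is an assumption about your setup.

```lisp
;; Assumes Quicklisp is available; any other way of loading CL-PPCRE works too.
(ql:quickload "cl-ppcre" :silent t)

;; FIRST-DECIMAL is a hypothetical helper, not part of lol-re or CL-PPCRE.
(defun first-decimal (string)
  "Return the first decimal number (digits, dot, digits) found in STRING, or NIL."
  ;; In a standard Lisp string every regexp backslash must be doubled.
  ;; VALUES keeps only the matched string and drops the register vector.
  (values (cl-ppcre:scan-to-strings "\\d+\\.\\d+" string)))

;; (first-decimal "pi is 3.14") => "3.14"
```

With the reader described above, the same regexp would be written with single backslashes.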

When called with one argument (the regexp), M~ expands into a closure that accepts a string. When the closure is called repeatedly with the same string, it returns successive matches of the regexp in that string.

LOL-RE> (defparameter matcher (m~ "[a-z]"))
LOL-RE> (funcall matcher "asdf")
"a"
LOL-RE> (funcall matcher "asdf")
"s"
LOL-RE> (funcall matcher "asdf")
"d"
LOL-RE> (funcall matcher "asdf")
"f"
LOL-RE> (funcall matcher "asdf")
NIL

When the closure is called with different strings, the behavior may be surprising, because the remembered position carries over from one string to the next.

LOL-RE> (defparameter matcher (m~ "[a-z]"))
LOL-RE> (funcall matcher "foo")
"f"
LOL-RE> (funcall matcher "bar") ; matching starts from the position where the first match finished
"a"

However, when the closure is called with the :RESET keyword, its internal position counter is reset, so the closure can then be used on a new string (see also MR~ below).

LOL-RE> (defparameter matcher (m~ "[a-z]"))
LOL-RE> (funcall matcher "foo")
"f"
LOL-RE> (funcall matcher :reset)
T
LOL-RE> (funcall matcher "bar")
"b"

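The position-remembering behavior shown above can be imitated with plain CL-PPCRE. The following is a minimal sketch and not lol-re's actual implementation: MAKE-MATCHER is a hypothetical name, loading through Quicklisp is an assumption, and leaving the position untouched after a failed match is a choice made here for simplicity.

```lisp
;; Assumes Quicklisp is available; any other way of loading CL-PPCRE works too.
(ql:quickload "cl-ppcre" :silent t)

;; MAKE-MATCHER is a hypothetical name, not part of lol-re.
(defun make-matcher (regex)
  "Return a closure that remembers where its previous match ended."
  (let ((scanner (cl-ppcre:create-scanner regex))
        (pos 0))
    (lambda (arg)
      (if (eq arg :reset)
          (progn (setf pos 0) t)
          ;; Start scanning at the remembered position, clamped to the
          ;; string length so that SCAN never receives an invalid :START.
          (multiple-value-bind (match-start match-end)
              (cl-ppcre:scan scanner arg :start (min pos (length arg)))
            (when match-start
              (setf pos match-end)
              (subseq arg match-start match-end)))))))

;; (defparameter *matcher* (make-matcher "[a-z]"))
;; (funcall *matcher* "foo")   => "f"
;; (funcall *matcher* "bar")   => "a"   ; position 1 was carried over
;; (funcall *matcher* :reset)  => T
;; (funcall *matcher* "bar")   => "b"
```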
When called with two arguments (regexp and string), M~ expands into the application of a matching closure to that string, and so returns the first match of the regexp in that string.

LOL-RE> (m~ "[0-9]+" "foo123bar456")
"123"

M~ also sets some anaphoric bindings (as seen in the basic example):

  • $0 is the whole match; $-0 and $+0 are the beginning and end positions of the whole match
  • $1 is the first group; $-1 and $+1 are the beginning and end of the first group
  • ... and so on for all other groups
  • if a group is named, say, "foo", then the variables $FOO, $-FOO and $+FOO are also set, with similar meaning

When the symbol name is generated, the case of the register name is inverted, so for a register named "fOo" the symbols are $|FoO|, $-|FoO| and $+|FoO|.
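To relate these variables to what CL-PPCRE itself returns, here is a small sketch built on CL-PPCRE's SCAN function. DESCRIBE-MATCH is a hypothetical helper, loading through Quicklisp is an assumption, and the regexp is assumed to contain at least one group.

```lisp
;; Assumes Quicklisp is available; any other way of loading CL-PPCRE works too.
(ql:quickload "cl-ppcre" :silent t)

;; DESCRIBE-MATCH is a hypothetical helper, not part of lol-re.
;; It assumes REGEX contains at least one register group.
(defun describe-match (regex string)
  "Return (whole start end group-1 group-1-start group-1-end), or NIL."
  (multiple-value-bind (match-start match-end reg-starts reg-ends)
      (cl-ppcre:scan regex string)
    (when match-start
      (let ((g-start (aref reg-starts 0))
            (g-end (aref reg-ends 0)))
        (list (subseq string match-start match-end)        ; like $0
              match-start                                  ; like $-0
              match-end                                    ; like $+0
              ;; A group that took no part in the match has NIL positions.
              (and g-start (subseq string g-start g-end))  ; like $1
              g-start                                      ; like $-1
              g-end)))))                                   ; like $+1

;; (describe-match "foo(\\d+)" "xxfoo123") => ("foo123" 2 8 "123" 5 8)
```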

Since all those anaphoric bindings are (by default) global dynamic variables:

  • they always hold the values from the latest (in physical time) match performed
  • this may be tricky when multithreading, but see the RE-LOCAL macro below

MR~

Since the M~ macro generates a matcher closure that remembers the position from which to perform the next match, it may behave strangely in seemingly obvious situations.

LOL-RE> (dolist (elt '("1" "2" "3"))
          (format t "~s~%" (m~ "[0-9]" elt)))
"1"
NIL
NIL

This is because, after the first match, the remembered position is already 1, so the subsequent matches against one-character strings start past their only character and fail.

The intuitive behavior (a fresh match for each string) is provided by the MR~ macro (from "match resettingly"), which resets its position to 0 before performing each new match.

LOL-RE> (dolist (elt '("1" "2" "3"))
          (format t "~s~%" (mr~ "[0-9]" elt)))
"1"
"2"
"3"

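For comparison, direct CL-PPCRE calls keep no position between invocations, so they give the same per-string results as MR~ in this loop. A minimal sketch, assuming Quicklisp for loading; FIRST-DIGITS is a hypothetical name:

```lisp
;; Assumes Quicklisp is available; any other way of loading CL-PPCRE works too.
(ql:quickload "cl-ppcre" :silent t)

;; FIRST-DIGITS is a hypothetical helper, not part of lol-re.
(defun first-digits (strings)
  "Return the first digit found in each of STRINGS, scanning each from position 0."
  ;; MAPCAR collects only the primary value of SCAN-TO-STRINGS (the match).
  (mapcar (lambda (s) (cl-ppcre:scan-to-strings "[0-9]" s))
          strings))

;; (first-digits '("1" "2" "3")) => ("1" "2" "3")
```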
Of course, the most intuitive solution (and the one with the highest performance penalty) would be to maintain, inside each matcher generated by the M~ macro, a hash table of positions keyed by the string being matched. However, then the behavior in this example

LOL-RE> (dolist (elt '("a" "a" "a"))
          (format t "~s~%" (m~ "[a-z]" elt)))
???

would crucially depend on whether the strings share structure, which may depend on details of the compiler. That is even more confusing than the current situation with two macros (M~ and MR~), each of which behaves in a well-defined way.

Iterate drivers

The system also defines two drivers for iterate: IN-MATCHES-OF and MATCHING.

(iter (for match in-matches-of "asdf" using (m~ "[a-z]([a-z])"))
      (collect `(,match ,$0 ,$1)))
(("as" "as" "s") ("df" "df" "f"))

As the example shows, IN-MATCHES-OF iterates over all matches of the given regexp in the given string. Both the string and the regexp are evaluated only once, when the loop is initialized.
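A comparable loop can be written with CL-PPCRE's DO-SCANS macro. This is a sketch for comparison only, not how the driver is implemented: MATCHES-WITH-FIRST-GROUP is a hypothetical name, loading through Quicklisp is an assumption, and the regexp is assumed to contain at least one group that takes part in every match.

```lisp
;; Assumes Quicklisp is available; any other way of loading CL-PPCRE works too.
(ql:quickload "cl-ppcre" :silent t)

;; MATCHES-WITH-FIRST-GROUP is a hypothetical helper, not part of lol-re.
(defun matches-with-first-group (regex string)
  "Collect (whole-match first-group) for every match of REGEX in STRING."
  (let ((result '()))
    ;; DO-SCANS binds the four variables for each successive match and
    ;; finally returns the value of the result form (NREVERSE RESULT).
    (cl-ppcre:do-scans (match-start match-end reg-starts reg-ends
                        regex string (nreverse result))
      (push (list (subseq string match-start match-end)
                  (subseq string (aref reg-starts 0) (aref reg-ends 0)))
            result))))

;; (matches-with-first-group "[a-z]([a-z])" "asdf")
;; => (("as" "s") ("df" "f"))
```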

In contrast, the first (basic) example could be rewritten using the MATCHING driver as follows:

(with-open-file (out-file "out-file" :direction :output)
  (iter (for line in-file "in-file" using #'read-line)
        (for match matching line using (m~ "(some)regexp(?<with>with)grouping(s)"))
        (format out-file #?"$($1) $($2) $($with) $($3)")))

but there are a couple of important things that MATCHING does differently:

  • the regexp is evaluated only once, before the loop starts
  • even if the line does not match the regexp, FORMAT is still executed, printing a line of NILs

So, MATCHING is more-or-less analogous to

(let ((matcher (m~ "(some)regexp(?<with>with)grouping(s)")))
  (with-open-file (out-file "out-file" :direction :output)
    (iter (for line in-file "in-file" using #'read-line)
          (funcall matcher line)
          (format out-file #?"$($1) $($2) $($with) $($3)"))))

TODO:

  • (done) creation of scanner, when regex-spec is just plain string
  • (won't do) do all the expansion at read-time, so that ((m~ "asdf") str) syntax be possible
  • usage of cl-interpol strings as regex-spec
  • (done) list of strings instead of just one string (auto joining)
  • ability to turn off some anaphoric bindings
  • (done) convenient iterate macros
      • (done) for iterating over all matches within a given string
      • (done) for iterating over multiple strings with the same regexp
  • ability to use only #?r syntax on implementations not supported by CL-READ-MACRO-TOKENS

For more usage patterns, see the files tests.lisp and use-cases.lisp. The latter was assembled by grepping some Quicklisp-available libraries and rewriting the pieces that use CL-PPCRE with the help of M~ and S~.

S~

For now, S~ can replace only the first occurrence of the match. Still, there is no need to escape all those backslashes.

LOL-RE> (s~ "(\d{4})-(\d{2})-(\d{2})" "\3/\2/\1" "2014-04-07")
"07/04/2014"

When called with just two arguments, S~ generates a replacer closure:

LOL-RE> (funcall (s~ "(\d{4})-(\d{2})-(\d{2})" "\3/\2/\1") "2014-04-07")
"07/04/2014"
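For comparison, the same substitution can be sketched with CL-PPCRE's REGEX-REPLACE, which also replaces only the first match. Note the doubled backslashes and the different argument order (target string before replacement). MAKE-DATE-SWAPPER is a hypothetical name and loading through Quicklisp is an assumption.

```lisp
;; Assumes Quicklisp is available; any other way of loading CL-PPCRE works too.
(ql:quickload "cl-ppcre" :silent t)

;; MAKE-DATE-SWAPPER is a hypothetical helper, not part of lol-re.
(defun make-date-swapper ()
  "Return a closure that rewrites the first YYYY-MM-DD in a string as DD/MM/YYYY."
  (let ((scanner (cl-ppcre:create-scanner "(\\d{4})-(\\d{2})-(\\d{2})")))
    (lambda (string)
      ;; REGEX-REPLACE returns two values; VALUES keeps only the new string.
      (values (cl-ppcre:regex-replace scanner string "\\3/\\2/\\1")))))

;; (funcall (make-date-swapper) "2014-04-07") => "07/04/2014"
```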

TODO:

  • (done) creation of substituter, both target and replacement are plain strings, no named groups allowed
  • named groups are allowed in target
  • named groups are also allowed in replacement
  • cl-interpol #?r strings
  • lists are allowed instead of plain strings
  • G symbol switch to do all possible replacements

re-local

The purpose of this macro is to tackle issues that arise when multithreading.

(re-local (all-the-variables like $1 $2 $a)
          (arising-from-use-of-m~)
          (are-declared-local-special)
          (inside-re-local))

So

;; this is not thread-safe, as some other thread may corrupt $1 before PRINC gets executed
(and (m~ "foo(bar)" "foobar") (princ $1))

But

;; this is (supposedly) thread-safe, as all the relevant variables are implicitly
;; rebound as local dynamic variables, which are per-thread
(re-local (and (m~ "foo(bar)" "foobar") (princ $1)))
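The effect of such a local dynamic rebinding can be seen in plain Common Lisp, without lol-re. In the sketch below, *GROUP-1* is a hypothetical stand-in for a variable such as $1, and FAKE-MATCH and DEMO are hypothetical helpers. The ANSI standard does not cover threads, so the per-thread nature of LET-bound special variables is a property of threaded implementations (SBCL, for example) rather than something this snippet demonstrates.

```lisp
;; *GROUP-1* is a hypothetical stand-in for a global anaphoric variable like $1.
(defvar *group-1* nil)

;; FAKE-MATCH is a hypothetical helper that imitates a match assigning its variable.
(defun fake-match (value)
  "Pretend a match happened by assigning the group variable."
  (setf *group-1* value))

(defun demo ()
  "Return the value seen inside a local rebinding and the untouched global value."
  (setf *group-1* "global")
  (let ((inner (let ((*group-1* nil))   ; fresh dynamic binding
                 (fake-match "local")   ; assigns the inner binding only
                 *group-1*)))
    (list inner *group-1*)))

;; (demo) => ("local" "global")
```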

How it works: RE-LOCAL code-walks the body (with the help of HU.DWIM.WALKER), with M~ and S~ redefined via MACROLET to have the same expansion plus the side effect of telling RE-LOCAL which variables they are going to initialize.

N.B.: If M~ were defined fully as a read-time macro, it would not be possible to write RE-LOCAL even with code-walking, since it is not possible (read: very hard and ugly) to read a form twice from a stream. So I won't define M~ as a read-time macro, at the cost of sometimes having to write FUNCALL.

Gotchas

  • potential race conditions when multithreading (but see the re-local section)
Author
Alexander Popolitov <popolit@gmail.com>
License
GPL
Categories
regular expression