parser.common-rules
2020-07-15
Provides common parsing rules that are useful in many grammars.
Upstream URL
Author
Maintainer
License
1Introduction
This system provides rules and rule-constructing macros for theesrap parser library for common cases in the following categories:- Anchors
- Rules that match if the input around the position inquestion has certain properties.
- Whitespace
- Rules related to whitespace.
- Comments
- Rules for parsing different kinds of comments commonlyused in programming languages.
- Literals
- Rules for parsing commonly used literals such asbooleans, numbers and strings.
- Tokenization
- Macros for handling tokenization (which is usuallynot done in a separate lexing step in esrap-based parsers).
- Infix Operators
- Macros for defining families of unary, binaryand ternary operators with given precedence relations andassociativity.
This functionality is provided as a separate system since it introduces a dependency on the architecture.builder-protocol system.
2Tutorial
(ql:quickload '(:parser.common-rules :parser.common-rules.operators))
2.1Anchors
"Anchor" rules match when the input around the current position hascertain properties and do not consume any input. For example: (esrap:parse 'parser.common-rules:<beginning-of-line> (format nil "foo~%bar")
:start 4 :junk-allowed t)
:BEGINNING-OF-LINE 4 T
The rules srclisp[:exports code]{<end-of-line>}, srclisp[:exports code]{<beginning-of-input>} and srclisp[:exports code]{<end-of-input>} work similarly.
The final rule, srclisp[:exports code]{<same-line>} is different in that it does consume input:
(esrap:parse 'parser.common-rules:<same-line> (format nil "foo bar~%baz")
:start 4 :junk-allowed t)
"bar" 7 T
2.2Whitespace
The whitespace-related rules should be pretty self-explanatory: (esrap:parse 'parser.common-rules:whitespace+ " ")
NIL NIL T
2.3Comments
There are several rules for parsing different styles of commentscommonly used in programming languages. For example, the followingrule parses srcc[:exports code]{/* … */} comments: (esrap:parse
'parser.common-rules:c-style-comment/delimited
"/*
* Foo bar
*/")
" * Foo bar " NIL T
The above production is faithful to the input text with respect to whitespace, but that is not always desired. In such cases, the following comes in handy:
(esrap:parse
'parser.common-rules:c-style-comment/delimited/trimmed
"/*
* Foo bar
** fez baz
* * whoop
*/")
"Foo bar fez baz * whoop" NIL T
Note how prefixes of the same length are trimmed from all lines and
the * whoop in the third comment line remains intact.
2.4Literals
The system provides rules for parsing Boolean, integer, floatingpoint and string literals. The two most interesting are probably- floating point literals
(esrap:parse 'parser.common-rules:float-literal "0.12e-10")1.2f-11 NIL T
- string literals
(esrap:parse 'parser.common-rules:string-literal/double-quotes "\" foo \\\" bar \\x041 \\\\ baz \"")" foo \" bar A \\ baz " NIL T
(esrap:parse 'parser.common-rules:string-literal/sextuple-quotes "\"\"\" foo \\\" bar \\x041 \\\\ baz \"\"\"")" foo \\\" bar \\x041 \\\\ baz " NIL T
2.5Tokenization
Esrap-based grammars in most cases work without a separate lexicalanalysis phase. Among other things, this implies that the grammarrules have to handle tokenization. This system provides thesrclisp[:exports code]{defrule/s} macro to automate some of thiseffort.The macro is used in place of srclisp[:exports code]{esrap:defrule} to define rules which parse token-like things. For example
(parser.common-rules:defrule/s (identifier
:skippable-expression parser.common-rules:whitespace+
:skippable?-expression parser.common-rules:whitespace*)
(and (esrap:character-ranges (#\a #\z) (#\A #\Z))
(* (esrap:character-ranges (#\a #\z) (#\A #\Z) (#\0 #\9))))
(:text t))
Instead of one rule srclisp[:exports code]{identifier}, this form defines up to three rules
- srclisp[:exports code]{identifier}
- srclisp[:exports code]{identifier/s}
- srclisp[:exports code]{identifier/?s}
(parser.common-rules:defrule/s (equals
:skippable-expression parser.common-rules:whitespace+
:skippable?-expression parser.common-rules:whitespace*)
#\=)
(esrap:defrule declaration
(and identifier/?s equals/?s (* (digit-char-p character))))
This rule behaves like a parser with lexical analysis phase would:
(mapcar (lambda (input)
(list (prin1-to-string input)
(princ-to-string (esrap:parse 'declaration input))))
'("a=1" "a =1" "a= 1" "a = 1"))
| input | production |
|---|---|
| "a=1" | (a = (1)) |
| "a =1" | (a = (1)) |
| "a= 1" | (a = (1)) |
| "a = 1" | (a = (1)) |
Note that skippable text before and after the declaration is not handled by this rule but in the respective context in which the srclisp[:exports code]{declaration} rule is used (This could require defining the srclisp[:exports code]{declaration} rule using srclisp[:exports code]{defrule/s} as well).
The unwieldy specification of skippable expressions
(parser.common-rules:defrule/s (identifier
:skippable-expression parser.common-rules:whitespace+
:skippable?-expression parser.common-rules:whitespace*)
…)
can be avoided by defining rules for skippable text in the package of the symbol naming the rule:
(esrap:defrule skippable
parser.common-rules:whitespace+)
(esrap:defrule skippable?
parser.common-rules:whitespace*)
(parser.common-rules:defrule/s (identifier)
(and (esrap:character-ranges (#\a #\z) (#\A #\Z))
(* (esrap:character-ranges (#\a #\z) (#\A #\Z) (#\0 #\9))))
(:text t))
These rules can then be shared by all rules defined with srclisp[:exports code]{defrule/s}.
2.6Infix Operators
Because of additional dependencies, this part of the project is
provided as a separate system parser.common-rules.operators.
The macros for defining infix operators are probably the most complex but also most useful part of this project. The macro srclisp[:exports code]{define-operator-rules} defines a group of rules that implement a group of unary and binary operators with certain precedence relations:
(parser.common-rules.operators:define-operator-rules
(:skippable?-expression (* #\Space))
(2 assign ":=" :associativity :none) ; lowest binding power
(3 if-then-else "?" ":")
(2 term "+")
(2 factor "*")
(2 expon "^" :associativity :right)
(1 neg "-")
(1 inc "++" :fixity :postfix) ; highest binding power
character) ; leaf expression
Parse results are constructed using the architecture.builder-protocol system. The following parsing code and resulting parse tree demonstrate the precedence and associativity properties:
(architecture.builder-protocol:with-builder ('list)
(esrap:parse 'assign "x := a ? b : c + d^e^f * -g"))
(:BINARY-OPERATOR
(:OPERAND
((#\x)
((:TERNARY-OPERATOR
(:OPERAND
((#\a) (#\b)
((:BINARY-OPERATOR
(:OPERAND
((#\c)
((:BINARY-OPERATOR
(:OPERAND
(((:BINARY-OPERATOR
(:OPERAND
((#\d)
((:BINARY-OPERATOR (:OPERAND ((#\e) (#\f))) :OPERATOR "^"
:BOUNDS (19 . 22)))))
:OPERATOR "^" :BOUNDS (17 . 22)))
((:UNARY-OPERATOR (:OPERAND ((#\g))) :OPERATOR "-" :BOUNDS
(25 . 27)))))
:OPERATOR "*" :BOUNDS (17 . 27)))))
:OPERATOR "+" :BOUNDS (13 . 27)))))
:OPERATOR1 "?" :OPERATOR2 ":" :BOUNDS (5 . 27)))))
:OPERATOR ":=" :BOUNDS (0 . 27))
NIL
T
srclisp[:exports code]{define-operator-rules} is not concerned with overriding operator precedence and associativity via parentheses. This aspect is easily handled "manually", though:
(parser.common-rules.operators:define-operator-rules
(:skippable?-expression (* #\Space))
(2 assign ":=" :associativity :none)
(3 if-then-else "?" ":")
(2 term "+")
(2 factor "*")
(2 expon "^" :associativity :right)
(1 neg "-")
(1 inc "++" :fixity :postfix)
(or parenthesized character))
(esrap:defrule parenthesized
(and #\( assign #\))
(:function second))
Now, parenthesis can be used to override precedence and associativity:
(architecture.builder-protocol:with-builder ('list)
(esrap:parse 'assign "(((z := a) + b)^c)^d * (-e)"))
(:BINARY-OPERATOR
(:OPERAND
(((:BINARY-OPERATOR
(:OPERAND
(((:BINARY-OPERATOR
(:OPERAND
(((:BINARY-OPERATOR
(:OPERAND
(((:BINARY-OPERATOR (:OPERAND ((#\z) (#\a))) :OPERATOR ":="
:BOUNDS (3 . 9)))
(#\b)))
:OPERATOR "+" :BOUNDS (2 . 14)))
(#\c)))
:OPERATOR "^" :BOUNDS (1 . 17)))
(#\d)))
:OPERATOR "^" :BOUNDS (0 . 20)))
((:UNARY-OPERATOR (:OPERAND ((#\e))) :OPERATOR "-" :BOUNDS (24 . 26)))))
:OPERATOR "*" :BOUNDS (0 . 27))
NIL
T
3Dictionary
(ql:quickload '(:alexandria :split-sequence
:parser.common-rules :parser.common-rules.operators))
(defun doc (symbol kind)
(let* ((lambda-list (sb-introspect:function-lambda-list symbol))
(string (or (documentation symbol kind)
(error "~@<~A ~S is not documented.~@:>"
kind symbol)))
(lines (split-sequence:split-sequence #\Newline string))
(strip (reduce
#'min (rest lines)
:key (lambda (line)
(or (position #\Space line :test-not #'char=)
most-positive-fixnum))))
(trimmed (mapcar (lambda (line)
(subseq line (min strip (length line))))
(rest lines))))
(format nil "~(~A~) ~<~{~A~^ ~}~:@>~2%~{~A~^~%~}"
symbol (list lambda-list) (list* (first lines) trimmed))))
3.1Anchors
<beginning-of-input>
Matches at the beginning of the input (i.e. there is no preceding
character). Produces :beginning-of-input and does not consume input.
<end-of-input>
Matches at the end of the input line (i.e. there is no following
character). Produces :end-of-input and does not consume input.
<beginning-of-line>
Matches at the beginning of a line (i.e. the preceding character is
#\Newline or there is no preceding character). Produces
:beginning-of-line and does not consume input.
<end-of-line>
Matches at the end of a line (i.e. the following character is
#\Newline or there is no following character). Produces :end-of-line
and does not consume input.
<same-line>
Consumes all characters until <end-of-line> and produces the resulting
string.
3.2Whitespace
whitespace/not-newline
Consumes a single #\Space or #\Tab, produces nil.
whitespace/not-newline?
Consumes nothing or a single #\Space or #\Tab, produces nil.
whitespace
Consumes a single #\Tab, #\Space, #\Newline or #\Page, produces nil.
whitespace?
Consumes nothing or a single #\Tab, #\Space, #\Newline or #\Page,
produces nil.
whitespace+
Consumes one or more #\Tab, #\Space, #\Newline or #\Page characters,
produces nil.
whitespace*
Consumes zero or more #\Tab, #\Space, #\Newline or #\Page characters,
produces nil.
3.3Comments
c-style-comment/rest-of-line[/trimmed]
Consumes a comment of the form // … <end-of-line>, produces a string
from the enclosed characters. The /trimmed variant removes leading
#\/ characters. The plain variant uses the character unmodified.
c-style-comment/delimited[/trimmed]
Consumes a comment of the form /* … */, produces a string from the
enclosed characters. The /trimmed variant removes a common prefix
consisting of #\Space and #\* characters. The plain variant uses the
enclosed characters unmodified.
shell-style-comment[/trimmed]
Consumes a comment of the form # … <end-of-line>, produces a string
from the enclosed characters. The /trimmed variant removes leading
#\# characters. The plain variant uses the character unmodified.
lisp-style-comment[/trimmed]
Consumes a comment of the form ; … <end-of-line>, produces a string
from the enclosed characters. The /trimmed variant removes leading
#\; characters. The plain variant uses the character unmodified.
3.4Literals
boolean-literal/{lower-case,capital-case,extended}
Consumes a Boolean value of the form
true | false
or True | False
or true | false | t | f | 1 | 0
respectively and produces t or nil.
integer-literal/{binary,octal,decimal,hexdecimal}{,/prefix,/no-sign}
Consumes an integer literal and produces its value.
Variants:
/prefix plain /no-sign
binary {+,-,}[01]+ [01]+
octal {+,-,}0o[0-7]+ {+,-,}[0-7]+ [0-7]+
decimal {+,-,}[0-9]+ [0-9]+
hexadecimal {+,-,}0x[0-f]+ {+,-,}[0-f]+ [0-f]+
{,single-,double-}float-literal[/rational]
Consumes a floating point literal in fixed or scientific notation and
produces its value as rational, single-float or double-float value.
The /rational variants return the parsed number as a rational value
while the plain variants coerce the parsed number into the respective
float sub-type.
the single- and double- variants verify that the parsed number is
within the value range of the respective type.
number-literal
Consumes an integer or float literal and produces its value. In case
of a float literal, a single-float value is returned.
string-literal-{single,double,triple,sextuple}-quotes
Consumes a string literal delimited by ', ", ''' or """ respectively.
Produces the content of the literal (i.e. excluding the delimiters) as
a string.
For the single-quote and double-quote rules, the #\\ character
initiates escape sequences. The following escape sequences are
recognized:
\\ -> #\Backslash
\a -> #\Bel
\b -> #\Backspace
\f -> #\Page
\n -> #\Newline
\r -> #\Return
\t -> #\Tab
\v -> #\Line_Tabulation
\<octal number below decimal 256> -> the character with that code
\x<hexadecimal number below decimal 256> -> the character with that code
3.5Tokenization
(doc 'parser.common-rules:defrule/s 'function)
defrule/s NAME-AND-OPTIONS EXPRESSION &BODY OPTIONS
Like `esrap:defule' but define additional rules named NAME/s and
NAME/?s which respectively require/allow EXPRESSION to be followed
by skippable input (e.g. whitespace).
NAME-AND-OPTIONS can be either just a rule name or a list of the
form
(NAME &key
SKIPPABLE-EXPRESSION S?
SKIPPABLE?-EXPRESSION ?S?
DEFINER)
where SKIPPABLE-EXPRESSION and SKIPPABLE?-EXPRESSION name the rules
used to parse skippable input in the NAME/s and NAME/?s
variants. Default to `skippable' and `skippable?' respectively.
S? and ?S? control which of the NAME/S and NAME/?S rules should be
generated. Default is generating both.
DEFINER is the name of the macro used to define the "main"
rule. Defaults to `esrap:defrule'.
3.6Infix Operators
(doc 'parser.common-rules.operators:define-unary-operator-rule 'function)
define-unary-operator-rule NAME OPERATOR-EXPRESSION NEXT &KEY (FIXITY PREFIX)
(SKIPPABLE?-EXPRESSION) (DEFINER 'DEFRULE)
(NODE-KIND UNARY-OPERATOR)
Define a rule NAME for parsing an unary operator expressions with
operator OPERATOR-EXPRESSION and operand NEXT.
FIXITY has to be one of
:prefix
Generate a prefix operator, i.e.
(and OPERATOR-EXPRESSION SKIPPABLE?-EXPRESSION NEXT)
:postfix
Generate a postfix operator, i.e.
(and NEXT SKIPPABLE?-EXPRESSION OPERATOR-EXPRESSION)
If supplied, SKIPPABLE?-EXPRESSION is the expression to be used for
parsing skippable input (usually whitespace) between
OPERATOR-EXPRESSION and NEXT. If SKIPPABLE?-EXPRESSION is not
supplied, a rule whose name is
(find-symbol (string '#:skippable?) (symbol-package OPERATOR-NAME))
is used.
If supplied, DEFINER names the macro that should be used to define
the rule. Otherwise `esrap:defrule' is used.
(doc 'parser.common-rules.operators:define-binary-operator-rule 'function)
define-binary-operator-rule NAME OPERATOR-EXPRESSION NEXT &KEY
(ASSOCIATIVITY LEFT) (SKIPPABLE?-EXPRESSION)
(DEFINER 'DEFRULE) (NODE-KIND BINARY-OPERATOR)
Define a rule NAME for parsing a binary operator expressions with
operator OPERATOR-EXPRESSION and operands NEXT.
ASSOCIATIVITY has to be one of
:none
The defined binary operator will be non-associative, i.e. for an
OPERATOR-EXPRESSION ":=", the expressions x:=y:=z will not be
syntatically legal.
:left
The defined binary operator will associate to the left, i.e.
x+y+z will be parsed as (x+y)+z.
:right
The defined binary operator will associate to the right, i.e.
x^y^z will be parsed as x^(y^z).
:associative
The defined binary operator will associate to the left (but this
should not be relied upon).
If supplied, SKIPPABLE?-EXPRESSION is the expression to be used for
parsing skippable input (usually whitespace) between
OPERATOR-EXPRESSION and NEXT. If SKIPPABLE?-EXPRESSION is not
supplied, a rule whose name is
(find-symbol (string '#:skippable?) (symbol-package OPERATOR-NAME))
is used.
If supplied, DEFINER names the macro that should be used to define
the rule. Otherwise `esrap:defrule' is used.
(doc 'parser.common-rules.operators:define-ternary-operator-rule 'function)
define-ternary-operator-rule NAME OPERATOR1-EXPRESSION OPERATOR2-EXPRESSION
NEXT &KEY (SKIPPABLE?-EXPRESSION)
(DEFINER 'DEFRULE) (NODE-KIND TERNARY-OPERATOR)
Define a rule NAME for parsing a ternary operator expressions with
operators OPERATOR1-EXPRESSION and OPERATOR2-EXPRESSION and
operands NEXT.
If supplied, SKIPPABLE?-EXPRESSION is the expression to be used for
parsing skippable input (usually whitespace) between
OPERATOR-EXPRESSION and NEXT. If SKIPPABLE?-EXPRESSION is not
supplied, a rule whose name is
(find-symbol (string '#:skippable?) (symbol-package OPERATOR-NAME))
is used.
If supplied, DEFINER names the macro that should be used to define
the rule. Otherwise `esrap:defrule' is used.
(doc 'parser.common-rules.operators:define-operator-rules 'function)
define-operator-rules (&KEY SKIPPABLE?-EXPRESSION
(UNARY-NODE-KIND UNARY-OPERATOR)
(BINARY-NODE-KIND BINARY-OPERATOR)
(TERNARY-NODE-KIND TERNARY-OPERATOR))
&BODY CLAUSES
Define rules for parsing infix operators according to CLAUSES.
The order of clauses in CLAUSES determines the precedence of
operators:
(define-operator-rules ()
OPERATOR-WITH-LOWEST-BINDING-POWER
⋮
OPERATOR-WITH-HIGHEST-BINDING-POWER
LEAF-EXPRESSION)
All but the final clause in CLAUSES are of the form
(ARITY RULE-NAME OPERATOR-EXPRESSION &rest ARGS &key)
where
* ARITY is the number of operands accepted by the operator
being defined. The ARITY must be either 1, 2 or 3.
* RULE-NAME is the name of the rule generated for the operator.
* OPERATOR-EXPRESSION is an expression for parsing the operator
token, e.g. #\* for multiplication.
* ARGS can be any of the keyword arguments accepted by
`define-unary-operator-rule', `define-binary-operator-rule' or
`define-ternary-operator-rule' depending on ARITY, i.e.
* :fixity (:prefix | :postfix)
Only for unary operators. Fixity of the operator being defined.
* :associativity (:none | :left | :right | :associative)
Only for binary operators. Associativity of the operator being
defined.
* :skippable?-expression EXPRESSION
See below.
* :definer RULE-NAME
The macro used to define the operator rule. Defaults to
`esrap:defrule'.
The final LEAF-EXPRESSION clause is just a rule expression,
describing the "leafs" (i.e. not operator expressions) of the
operator grammar.
Whitespace handling can be controlled by specifying rules for
"skippable" input using the :skippable?-expression keyword
argument in ARGS. If supplied, SKIPPABLE?-EXPRESSION is applied to
all defined operators. If SKIPPABLE?-EXPRESSION is not supplied, a
rule whose name is
(find-symbol (string '#:skippable?) (symbol-package OPERATOR-NAME))
is used.
Example
(define-operator-rules (:skippable?-expression (* #\Space))
(2 term "+") ; lowest binding power
(2 factor "*")
(1 neg "-") ; highest binding power
#\x) ; leaf expression
(architecture.builder-protocol:with-builder ('list)
(esrap:parse 'term "x + x * -x"))
=>
(:BINARY-OPERATOR
(:OPERAND (("x")
((:BINARY-OPERATOR
(:OPERAND (("x")
((:UNARY-OPERATOR
(:OPERAND (("x")))
:OPERATOR "-" :BOUNDS (4 . 6)))))
:OPERATOR "*" :BOUNDS (2 . 6)))))
:OPERATOR "+" :BOUNDS (0 . 6))
Note that this macro is not concerned with forcing operator
bindings via parentheses. See the documentation for recommendations
on that.