parser.common-rules

2017-06-30

Introduction

This system provides rules and rule-constructing macros for the esrap parser library for common cases in the following categories:

Anchors

Rules that match if the input around the position in question has certain properties.

Whitespace

Rules related to whitespace.

Comments

Rules for parsing different kinds of comments commonly used in programming languages.

Literals

Rules for parsing commonly used literals such as booleans, numbers and strings.

Tokenization

Macros for handling tokenization (which is usually not done in a separate lexing step in esrap-based parsers).

Infix Operators

Macros for defining families of unary and binary operators with given precedence relations and associativity.

Tutorial

Anchors

"Anchor" rules match when the input around the current position has certain properties and do not consume any input. For example:

  (esrap:parse 'parser.common-rules:<beginning-of-line> (format nil "foo~%bar")
               :start 4 :junk-allowed t)
:BEGINNING-OF-LINE
4
T

The rules <end-of-line>, <beginning-of-input> and <end-of-input> work similarly.

The final rule, <same-line> is different in that it does consume input:

  (esrap:parse 'parser.common-rules:<same-line> (format nil "foo bar~%baz")
               :start 4 :junk-allowed t)
"bar"
7
T

Whitespace

The whitespace-related rules should be pretty self-explanatory:

  (esrap:parse 'parser.common-rules:whitespace+ "  ")
NIL
NIL
T

Comments

There are several rules for parsing different styles of comments commonly used in programming languages. For example, the following rule parses /* ? */ comments:

  (esrap:parse
   'parser.common-rules:c-style-comment/delimited
   "/*
     * Foo bar
     */")
"
   * Foo bar
   "
NIL
T

The above production is faithful to the input text with respect to whitespace, but that is not always desired. In such cases, the following comes in handy:

  (esrap:parse
   'parser.common-rules:c-style-comment/delimited/trimmed
   "/*
     * Foo bar
     ** fez baz
     * * whoop
     */")
"Foo bar
 fez baz
* whoop"
NIL
T

Note how prefixes of the same length are trimmed from all lines and the * whoop in the third comment line remains intact.

Literals

The system provides rules for parsing Boolean, integer, floating point and string literals. The two most interesting are probably

  1. floating point literals

      (esrap:parse 'parser.common-rules:float-literal "0.12e-10")
    
    1.2f-11
    NIL
    T
    
  2. string literals

      (esrap:parse 'parser.common-rules:string-literal/double-quotes
                   "\" foo \\\" bar \\x041 \\\\ baz \"")
    
    " foo \" bar A \\ baz "
    NIL
    T
    
      (esrap:parse 'parser.common-rules:string-literal/sextuple-quotes
                   "\"\"\" foo \\\" bar \\x041 \\\\ baz \"\"\"")
    
    " foo \\\" bar \\x041 \\\\ baz "
    NIL
    T
    

Tokenization

Esrap-based grammars in most cases work without a separate lexical analysis phase. Among other things, this implies that the grammar rules have to handle tokenization. This system provides the defrule/s macro to automate some of this effort.

The macro is used in place of esrap:defrule to define rules which parse token-like things. For example

  (parser.common-rules:defrule/s (identifier
                                  :skippable-expression  parser.common-rules:whitespace+
                                  :skippable?-expression parser.common-rules:whitespace*)
      (and (esrap:character-ranges (#\a #\z) (#\A #\Z))
           (* (esrap:character-ranges (#\a #\z) (#\A #\Z) (#\0 #\9))))
    (:text t))
Instead of one rule identifier, this form defines up to three rules
  • identifier
  • identifier/s
  • identifier/?s

The second and third rules parse an identifier followed by mandatory and optional "skippable" text (i.e. some form of whitespace in most cases) respectively. These rules can be used in places that require or allow an identifier to be separated by whitespace from the next token. For example:

  (parser.common-rules:defrule/s (equals
                                  :skippable-expression  parser.common-rules:whitespace+
                                  :skippable?-expression parser.common-rules:whitespace*)
      #\=)

  (esrap:defrule declaration
      (and identifier/?s equals/?s (* (digit-char-p character))))

This rule behaves like a parser with lexical analysis phase would:

  (mapcar (lambda (input)
            (list (prin1-to-string input)
                  (princ-to-string (esrap:parse 'declaration input))))
          '("a=1" "a =1" "a= 1" "a = 1"))
input production
"a=1" (a = (1))
"a 1" | (a (1))
"a= 1" (a = (1))
"a = 1" (a = (1))

Note that skippable text before and after the declaration is not handled by this rule but in the respective context in which the declaration rule is used (This could require defining the declaration rule using defrule/s as well).

The unwieldy specification of skippable expressions

  (parser.common-rules:defrule/s (identifier
                                  :skippable-expression  parser.common-rules:whitespace+
                                  :skippable?-expression parser.common-rules:whitespace*)
      ?)

can be avoided by defining rules for skippable text in the package of the symbol naming the rule:

  (esrap:defrule skippable
      parser.common-rules:whitespace+)

  (esrap:defrule skippable?
      parser.common-rules:whitespace*)

  (parser.common-rules:defrule/s (identifier)
      (and (esrap:character-ranges (#\a #\z) (#\A #\Z))
           (* (esrap:character-ranges (#\a #\z) (#\A #\Z) (#\0 #\9))))
    (:text t))

These rules can then be shared by all rules defined with defrule/s.

Infix Operators

Because of additional dependencies, this part of the project is provided as a separate system parser.common-rules.operators.

The macros for defining infix operators are probably the most complex but also most useful part of this project. The macro define-operator-rules defines a group of rules that implement a group of unary and binary operators with certain precedence relations:

  (parser.common-rules.operators:define-operator-rules
      (:skippable?-expression (* #\Space))
    (2 assign       ":="    :associativity :none)  ; lowest binding power
    (3 if-then-else "?" ":")
    (2 term         "+")
    (2 factor       "*")
    (2 expon        "^"     :associativity :right)
    (1 neg          "-")
    (1 inc          "++"    :fixity :postfix)      ; highest binding power
    character)                                     ; leaf expression

Parse results are constructed using the architecture.builder-protocol system. The following parsing code and resulting parse tree demonstrate the precedence and associativity properties:

  (architecture.builder-protocol:with-builder ('list)
    (esrap:parse 'assign "x := a ? b : c + d^e^f * -g"))
(:BINARY-OPERATOR
 (:OPERAND
  ((#\x)
   ((:TERNARY-OPERATOR
     (:OPERAND
      ((#\a) (#\b)
       ((:BINARY-OPERATOR
         (:OPERAND
          ((#\c)
           ((:BINARY-OPERATOR
             (:OPERAND
              (((:BINARY-OPERATOR
                 (:OPERAND
                  ((#\d)
                   ((:BINARY-OPERATOR (:OPERAND ((#\e) (#\f))) :OPERATOR "^"
                     :BOUNDS (19 . 22)))))
                 :OPERATOR "^" :BOUNDS (17 . 22)))
               ((:UNARY-OPERATOR (:OPERAND ((#\g))) :OPERATOR "-" :BOUNDS
                 (25 . 27)))))
             :OPERATOR "*" :BOUNDS (17 . 27)))))
         :OPERATOR "+" :BOUNDS (13 . 27)))))
     :OPERATOR1 "?" :OPERATOR2 ":" :BOUNDS (5 . 27)))))
 :OPERATOR ":=" :BOUNDS (0 . 27))
NIL
T

define-operator-rules is not concerned with overriding operator precedence and associativity via parentheses. This aspect is easily handled "manually", though:

  (parser.common-rules.operators:define-operator-rules
      (:skippable?-expression (* #\Space))
    (2 assign       ":="    :associativity :none)
    (3 if-then-else "?" ":")
    (2 term         "+")
    (2 factor       "*")
    (2 expon        "^"     :associativity :right)
    (1 neg          "-")
    (1 inc          "++"    :fixity :postfix)
    (or parenthesized character))

  (esrap:defrule parenthesized
      (and #\( assign #\))
    (:function second))

Now, parenthesis can be used to override precedence and associativity:

  (architecture.builder-protocol:with-builder ('list)
    (esrap:parse 'assign "(((z := a) + b)^c)^d * (-e)"))
(:BINARY-OPERATOR
 (:OPERAND
  (((:BINARY-OPERATOR
     (:OPERAND
      (((:BINARY-OPERATOR
         (:OPERAND
          (((:BINARY-OPERATOR
             (:OPERAND
              (((:BINARY-OPERATOR (:OPERAND ((#\z) (#\a))) :OPERATOR ":="
                 :BOUNDS (3 . 9)))
               (#\b)))
             :OPERATOR "+" :BOUNDS (2 . 14)))
           (#\c)))
         :OPERATOR "^" :BOUNDS (1 . 17)))
       (#\d)))
     :OPERATOR "^" :BOUNDS (0 . 20)))
   ((:UNARY-OPERATOR (:OPERAND ((#\e))) :OPERATOR "-" :BOUNDS (24 . 26)))))
 :OPERATOR "*" :BOUNDS (0 . 27))
NIL
T

Dictionary

Anchors

  <beginning-of-input>

  Matches at the beginning of the input (i.e. there is no preceding
  character). Produces :beginning-of-input and does not consume input.
  <end-of-input>

  Matches at the end of the input line (i.e. there is no following
  character). Produces :end-of-input and does not consume input.
  <beginning-of-line>

  Matches at the beginning of a line (i.e. the preceding character is
  #\Newline or there is no preceding character). Produces
  :beginning-of-line and does not consume input.
  <end-of-line>

  Matches at the end of a line (i.e. the following character is
  #\Newline or there is no following character). Produces :end-of-line
  and does not consume input.
  <same-line>

  Consumes all characters until <end-of-line> and produces the resulting
  string.

Whitespace

  whitespace/not-newline

  Consumes a single #\Space or #\Tab, produces nil.
  whitespace/not-newline?

  Consumes nothing or a single #\Space or #\Tab, produces nil.
  whitespace

  Consumes a single #\Tab, #\Space, #\Newline or #\Page, produces nil.
  whitespace?

  Consumes nothing or a single #\Tab, #\Space, #\Newline or #\Page,
  produces nil.
  whitespace+

  Consumes one or more #\Tab, #\Space, #\Newline or #\Page characters,
  produces nil.
  whitespace*

  Consumes zero or more #\Tab, #\Space, #\Newline or #\Page characters,
  produces nil.

Comments

  c-style-comment/rest-of-line[/trimmed]

  Consumes a comment of the form // ? <end-of-line>, produces a string
  from the enclosed characters. The /trimmed variant removes leading
  #\/ characters. The plain variant uses the character unmodified.
  c-style-comment/delimited[/trimmed]

  Consumes a comment of the form /* ? */, produces a string from the
  enclosed characters. The /trimmed variant removes a common prefix
  consisting of #\Space and #\* characters. The plain variant uses the
  enclosed characters unmodified.
  shell-style-comment[/trimmed]

  Consumes a comment of the form # ? <end-of-line>, produces a string
  from the enclosed characters. The /trimmed variant removes leading
  #\# characters. The plain variant uses the character unmodified.
  lisp-style-comment[/trimmed]

  Consumes a comment of the form ; ? <end-of-line>, produces a string
  from the enclosed characters. The /trimmed variant removes leading
  #\; characters. The plain variant uses the character unmodified.

Literals

  boolean-literal/{lower-case,capital-case,extended}

  Consumes a Boolean value of the form

       true | false
    or True | False
    or true | false | t | f | 1 | 0

  respectively and produces t or nil.
  integer-literal/{octal[/prefix],decimal,hexdecimal[/prefix]}

  Consumes an integer literal and produces its value.

  Variants:

                 /prefix         plain
    octal        {+,-,}0o[0-7]+  {+,-,}[0-7]+
    decimal                      {+,-,}[0-9]+
    hexadecimal  {+,-,}0x[0-f]+  {+,-,}[0-f]+
  float-literal[/rational]

  Consumes a floating point literal in fixed or scientific notation and
  produces its value.

  The /rational variant returns the parsed number as a rational value
  while the plain variant coerces the parsed number into a single-float.
  number-literal

  Consumes an integer or float literal and produces its value. In case
  of a float literal, a single-float value is returned.
  string-literal-{single,double,triple,sextuple}-quotes

  Consumes a string literal delimited by ', ", ''' or """ respectively.
  Produces the content of the literal (i.e. excluding the delimiters) as
  a string.

  For the single-quote and double-quote rules, the #\\ character
  initiates escape sequences. The following escape sequences are
  recognized:

    \\                                       -> #\Backslash

    \a                                       -> #\Bel
    \b                                       -> #\Backspace
    \f                                       -> #\Page
    \n                                       -> #\Newline
    \r                                       -> #\Return
    \t                                       -> #\Tab
    \v                                       -> #\Line_Tabulation

    \<octal number below decimal 256>        -> the character with that code
    \x<hexadecimal number below decimal 256> -> the character with that code

Tokenization

defrule/s NAME-AND-OPTIONS EXPRESSION &BODY OPTIONS

Like `esrap:defule' but define additional rules named NAME/s and
NAME/?s which respectively require/allow EXPRESSION to be followed
by skippable input (e.g. whitespace).

NAME-AND-OPTIONS can be either just a rule name or a list of the
form

  (NAME &key
        SKIPPABLE-EXPRESSION  S?
        SKIPPABLE?-EXPRESSION ?S?
        DEFINER)

where SKIPPABLE-EXPRESSION and SKIPPABLE?-EXPRESSION name the rules
used to parse skippable input in the NAME/s and NAME/?s
variants. Default to `skippable' and `skippable?' respectively.

S? and ?S? control which of the NAME/S and NAME/?S rules should be
generated. Default is generating both.

DEFINER is the name of the macro used to define the "main"
rule. Defaults to `esrap:defrule'.

Infix Operators

define-unary-operator-rule NAME OPERATOR-EXPRESSION NEXT &KEY (FIXITY PREFIX)
                           (SKIPPABLE?-EXPRESSION) (DEFINER 'DEFRULE)
                           (NODE-KIND UNARY-OPERATOR)

Define a rule NAME for parsing an unary operator expressions with
operator OPERATOR-EXPRESSION and operand NEXT.

FIXITY has to be one of

:prefix

  Generate a prefix operator, i.e.

    (and OPERATOR-EXPRESSION SKIPPABLE?-EXPRESSION NEXT)

:postfix

  Generate a postfix operator, i.e.

    (and NEXT SKIPPABLE?-EXPRESSION OPERATOR-EXPRESSION)

If supplied, SKIPPABLE?-EXPRESSION is the expression to be used for
parsing skippable input (usually whitespace) between
OPERATOR-EXPRESSION and NEXT. If SKIPPABLE?-EXPRESSION is not
supplied, a rule whose name is

  (find-symbol (string '#:skippable?) (symbol-package OPERATOR-NAME))

is used.

If supplied, DEFINER names the macro that should be used to define
the rule. Otherwise `esrap:defrule' is used.
define-binary-operator-rule NAME OPERATOR-EXPRESSION NEXT &KEY
                            (ASSOCIATIVITY LEFT) (SKIPPABLE?-EXPRESSION)
                            (DEFINER 'DEFRULE) (NODE-KIND BINARY-OPERATOR)

Define a rule NAME for parsing a binary operator expressions with
operator OPERATOR-EXPRESSION and operands NEXT.

ASSOCIATIVITY has to be one of

:none

  The defined binary operator will be non-associative, i.e. for an
  OPERATOR-EXPRESSION ":=", the expressions x:=y:=z will not be
  syntatically legal.

:left

  The defined binary operator will associate to the left, i.e.
  x+y+z will be parsed as (x+y)+z.

:right

  The defined binary operator will associate to the right, i.e.
  x^y^z will be parsed as x^(y^z).

:associative

  The defined binary operator will associate to the left (but this
  should not be relied upon).

If supplied, SKIPPABLE?-EXPRESSION is the expression to be used for
parsing skippable input (usually whitespace) between
OPERATOR-EXPRESSION and NEXT. If SKIPPABLE?-EXPRESSION is not
supplied, a rule whose name is

  (find-symbol (string '#:skippable?) (symbol-package OPERATOR-NAME))

is used.

If supplied, DEFINER names the macro that should be used to define
the rule. Otherwise `esrap:defrule' is used.
define-ternary-operator-rule NAME OPERATOR1-EXPRESSION OPERATOR2-EXPRESSION
                             NEXT &KEY (SKIPPABLE?-EXPRESSION)
                             (DEFINER 'DEFRULE) (NODE-KIND TERNARY-OPERATOR)

Define a rule NAME for parsing a ternary operator expressions with
operators OPERATOR1-EXPRESSION and OPERATOR2-EXPRESSION and
operands NEXT.

If supplied, SKIPPABLE?-EXPRESSION is the expression to be used for
parsing skippable input (usually whitespace) between
OPERATOR-EXPRESSION and NEXT. If SKIPPABLE?-EXPRESSION is not
supplied, a rule whose name is

  (find-symbol (string '#:skippable?) (symbol-package OPERATOR-NAME))

is used.

If supplied, DEFINER names the macro that should be used to define
the rule. Otherwise `esrap:defrule' is used.
define-operator-rules (&KEY SKIPPABLE?-EXPRESSION
                       (UNARY-NODE-KIND UNARY-OPERATOR)
                       (BINARY-NODE-KIND BINARY-OPERATOR)
                       (TERNARY-NODE-KIND TERNARY-OPERATOR))
                      &BODY CLAUSES

Define rules for parsing infix operators according to CLAUSES.

The order of clauses in CLAUSES determines the precedence of
operators:

  (define-operator-rules ()
    OPERATOR-WITH-LOWEST-BINDING-POWER
    ?
    OPERATOR-WITH-HIGHEST-BINDING-POWER
    LEAF-EXPRESSION)

All but the final clause in CLAUSES are of the form

  (ARITY RULE-NAME OPERATOR-EXPRESSION &rest ARGS &key)

where

* ARITY is the number of operands accepted by the operator
  being defined. The ARITY must be either 1, 2 or 3.

* RULE-NAME is the name of the rule generated for the operator.

* OPERATOR-EXPRESSION is an expression for parsing the operator
  token, e.g. #\* for multiplication.

* ARGS can be any of the keyword arguments accepted by
  `define-unary-operator-rule', `define-binary-operator-rule' or
  `define-ternary-operator-rule' depending on ARITY, i.e.

  * :fixity (:prefix | :postfix)

    Only for unary operators. Fixity of the operator being defined.

  * :associativity (:none | :left | :right | :associative)

    Only for binary operators. Associativity of the operator being
    defined.

  * :skippable?-expression EXPRESSION

    See below.

  * :definer RULE-NAME

    The macro used to define the operator rule. Defaults to
    `esrap:defrule'.

The final LEAF-EXPRESSION clause is just a rule expression,
describing the "leafs" (i.e. not operator expressions) of the
operator grammar.

Whitespace handling can be controlled by specifying rules for
"skippable" input using the :skippable?-expression keyword
argument in ARGS. If supplied, SKIPPABLE?-EXPRESSION is applied to
all defined operators. If SKIPPABLE?-EXPRESSION is not supplied, a
rule whose name is

  (find-symbol (string '#:skippable?) (symbol-package OPERATOR-NAME))

is used.

Example

  (define-operator-rules (:skippable?-expression (* #\Space))
    (2 term   "+") ; lowest binding power
    (2 factor "*")
    (1 neg    "-") ; highest binding power
    #\x)           ; leaf expression

  (architecture.builder-protocol:with-builder ('list)
    (esrap:parse 'term "x + x * -x"))
  =>
  (:BINARY-OPERATOR
   (:OPERAND (("x")
              ((:BINARY-OPERATOR
                (:OPERAND (("x")
                           ((:UNARY-OPERATOR
                             (:OPERAND (("x")))
                             :OPERATOR "-" :BOUNDS (4 . 6)))))
                :OPERATOR "*" :BOUNDS (2 . 6)))))
   :OPERATOR "+" :BOUNDS (0 . 6))

Note that this macro is not concerned with forcing operator
bindings via parentheses. See the documentation for recommendations
on that.

Settings

Author
Jan Moringen <jmoringe@techfak.uni-bielefeld.de>
Maintainer
Jan Moringen <jmoringe@techfak.uni-bielefeld.de>
License
MIT