parser.common-rules

2020-07-15

Provides common parsing rules that are useful in many grammars.

Upstream URL

github.com/scymtym/parser.common-rules

Author

Jan Moringen <jmoringe@techfak.uni-bielefeld.de>

Maintainer

Jan Moringen <jmoringe@techfak.uni-bielefeld.de>

License

MIT
README
parser.common-rules README

1Introduction

This system provides rules and rule-constructing macros for theesrap parser library for common cases in the following categories:
Anchors
Rules that match if the input around the position inquestion has certain properties.
Whitespace
Rules related to whitespace.
Comments
Rules for parsing different kinds of comments commonlyused in programming languages.
Literals
Rules for parsing commonly used literals such asbooleans, numbers and strings.
Tokenization
Macros for handling tokenization (which is usuallynot done in a separate lexing step in esrap-based parsers).
Infix Operators
Macros for defining families of unary, binaryand ternary operators with given precedence relations andassociativity.

This functionality is provided as a separate system since it introduces a dependency on the architecture.builder-protocol system.

https://travis-ci.org/scymtym/parser.common-rules.svg

2Tutorial

    (ql:quickload '(:parser.common-rules :parser.common-rules.operators))

2.1Anchors

"Anchor" rules match when the input around the current position hascertain properties and do not consume any input. For example:
     (esrap:parse 'parser.common-rules:<beginning-of-line> (format nil "foo~%bar")
                  :start 4 :junk-allowed t)
:BEGINNING-OF-LINE
4
T

The rules srclisp[:exports code]{<end-of-line>}, srclisp[:exports code]{<beginning-of-input>} and srclisp[:exports code]{<end-of-input>} work similarly.

The final rule, srclisp[:exports code]{<same-line>} is different in that it does consume input:

     (esrap:parse 'parser.common-rules:<same-line> (format nil "foo bar~%baz")
                  :start 4 :junk-allowed t)
"bar"
7
T

2.2Whitespace

The whitespace-related rules should be pretty self-explanatory:
     (esrap:parse 'parser.common-rules:whitespace+ "  ")
NIL
NIL
T

2.3Comments

There are several rules for parsing different styles of commentscommonly used in programming languages. For example, the followingrule parses srcc[:exports code]{/* … */} comments:
     (esrap:parse
      'parser.common-rules:c-style-comment/delimited
      "/*
        * Foo bar
        */")
"
   * Foo bar
   "
NIL
T

The above production is faithful to the input text with respect to whitespace, but that is not always desired. In such cases, the following comes in handy:

     (esrap:parse
      'parser.common-rules:c-style-comment/delimited/trimmed
      "/*
        * Foo bar
        ** fez baz
        * * whoop
        */")
"Foo bar
 fez baz
* whoop"
NIL
T

Note how prefixes of the same length are trimmed from all lines and the * whoop in the third comment line remains intact.

2.4Literals

The system provides rules for parsing Boolean, integer, floatingpoint and string literals. The two most interesting are probably
  1. floating point literals
            (esrap:parse 'parser.common-rules:float-literal "0.12e-10")
    
    1.2f-11
    NIL
    T
  2. string literals
            (esrap:parse 'parser.common-rules:string-literal/double-quotes
                         "\" foo \\\" bar \\x041 \\\\ baz \"")
    
    " foo \" bar A \\ baz "
    NIL
    T
            (esrap:parse 'parser.common-rules:string-literal/sextuple-quotes
                         "\"\"\" foo \\\" bar \\x041 \\\\ baz \"\"\"")
    
    " foo \\\" bar \\x041 \\\\ baz "
    NIL
    T

2.5Tokenization

Esrap-based grammars in most cases work without a separate lexicalanalysis phase. Among other things, this implies that the grammarrules have to handle tokenization. This system provides thesrclisp[:exports code]{defrule/s} macro to automate some of thiseffort.

The macro is used in place of srclisp[:exports code]{esrap:defrule} to define rules which parse token-like things. For example

     (parser.common-rules:defrule/s (identifier
                                     :skippable-expression  parser.common-rules:whitespace+
                                     :skippable?-expression parser.common-rules:whitespace*)
         (and (esrap:character-ranges (#\a #\z) (#\A #\Z))
              (* (esrap:character-ranges (#\a #\z) (#\A #\Z) (#\0 #\9))))
       (:text t))

Instead of one rule srclisp[:exports code]{identifier}, this form defines up to three rules

  • srclisp[:exports code]{identifier}
  • srclisp[:exports code]{identifier/s}
  • srclisp[:exports code]{identifier/?s}
The second and third rules parse an identifier followed by mandatoryand optional "skippable" text (i.e. some form of whitespace in mostcases) respectively. These rules can be used in places that requireor allow an identifier to be separated by whitespace from the nexttoken. For example:
     (parser.common-rules:defrule/s (equals
                                     :skippable-expression  parser.common-rules:whitespace+
                                     :skippable?-expression parser.common-rules:whitespace*)
         #\=)

     (esrap:defrule declaration
         (and identifier/?s equals/?s (* (digit-char-p character))))

This rule behaves like a parser with lexical analysis phase would:

     (mapcar (lambda (input)
               (list (prin1-to-string input)
                     (princ-to-string (esrap:parse 'declaration input))))
             '("a=1" "a =1" "a= 1" "a = 1"))
input production
"a=1" (a = (1))
"a =1" (a = (1))
"a= 1" (a = (1))
"a = 1" (a = (1))

Note that skippable text before and after the declaration is not handled by this rule but in the respective context in which the srclisp[:exports code]{declaration} rule is used (This could require defining the srclisp[:exports code]{declaration} rule using srclisp[:exports code]{defrule/s} as well).

The unwieldy specification of skippable expressions

     (parser.common-rules:defrule/s (identifier
                                     :skippable-expression  parser.common-rules:whitespace+
                                     :skippable?-expression parser.common-rules:whitespace*)
         …)

can be avoided by defining rules for skippable text in the package of the symbol naming the rule:

     (esrap:defrule skippable
         parser.common-rules:whitespace+)

     (esrap:defrule skippable?
         parser.common-rules:whitespace*)

     (parser.common-rules:defrule/s (identifier)
         (and (esrap:character-ranges (#\a #\z) (#\A #\Z))
              (* (esrap:character-ranges (#\a #\z) (#\A #\Z) (#\0 #\9))))
       (:text t))

These rules can then be shared by all rules defined with srclisp[:exports code]{defrule/s}.

2.6Infix Operators

Because of additional dependencies, this part of the project is provided as a separate system parser.common-rules.operators.

The macros for defining infix operators are probably the most complex but also most useful part of this project. The macro srclisp[:exports code]{define-operator-rules} defines a group of rules that implement a group of unary and binary operators with certain precedence relations:

     (parser.common-rules.operators:define-operator-rules
         (:skippable?-expression (* #\Space))
       (2 assign       ":="    :associativity :none)  ; lowest binding power
       (3 if-then-else "?" ":")
       (2 term         "+")
       (2 factor       "*")
       (2 expon        "^"     :associativity :right)
       (1 neg          "-")
       (1 inc          "++"    :fixity :postfix)      ; highest binding power
       character)                                     ; leaf expression

Parse results are constructed using the architecture.builder-protocol system. The following parsing code and resulting parse tree demonstrate the precedence and associativity properties:

     (architecture.builder-protocol:with-builder ('list)
       (esrap:parse 'assign "x := a ? b : c + d^e^f * -g"))
   (:BINARY-OPERATOR
    (:OPERAND
     ((#\x)
      ((:TERNARY-OPERATOR
        (:OPERAND
         ((#\a) (#\b)
          ((:BINARY-OPERATOR
            (:OPERAND
             ((#\c)
              ((:BINARY-OPERATOR
                (:OPERAND
                 (((:BINARY-OPERATOR
                    (:OPERAND
                     ((#\d)
                      ((:BINARY-OPERATOR (:OPERAND ((#\e) (#\f))) :OPERATOR "^"
                        :BOUNDS (19 . 22)))))
                    :OPERATOR "^" :BOUNDS (17 . 22)))
                  ((:UNARY-OPERATOR (:OPERAND ((#\g))) :OPERATOR "-" :BOUNDS
                    (25 . 27)))))
                :OPERATOR "*" :BOUNDS (17 . 27)))))
            :OPERATOR "+" :BOUNDS (13 . 27)))))
        :OPERATOR1 "?" :OPERATOR2 ":" :BOUNDS (5 . 27)))))
    :OPERATOR ":=" :BOUNDS (0 . 27))
   NIL
   T

srclisp[:exports code]{define-operator-rules} is not concerned with overriding operator precedence and associativity via parentheses. This aspect is easily handled "manually", though:

     (parser.common-rules.operators:define-operator-rules
         (:skippable?-expression (* #\Space))
       (2 assign       ":="    :associativity :none)
       (3 if-then-else "?" ":")
       (2 term         "+")
       (2 factor       "*")
       (2 expon        "^"     :associativity :right)
       (1 neg          "-")
       (1 inc          "++"    :fixity :postfix)
       (or parenthesized character))

     (esrap:defrule parenthesized
         (and #\( assign #\))
       (:function second))

Now, parenthesis can be used to override precedence and associativity:

     (architecture.builder-protocol:with-builder ('list)
       (esrap:parse 'assign "(((z := a) + b)^c)^d * (-e)"))
   (:BINARY-OPERATOR
    (:OPERAND
     (((:BINARY-OPERATOR
        (:OPERAND
         (((:BINARY-OPERATOR
            (:OPERAND
             (((:BINARY-OPERATOR
                (:OPERAND
                 (((:BINARY-OPERATOR (:OPERAND ((#\z) (#\a))) :OPERATOR ":="
                    :BOUNDS (3 . 9)))
                  (#\b)))
                :OPERATOR "+" :BOUNDS (2 . 14)))
              (#\c)))
            :OPERATOR "^" :BOUNDS (1 . 17)))
          (#\d)))
        :OPERATOR "^" :BOUNDS (0 . 20)))
      ((:UNARY-OPERATOR (:OPERAND ((#\e))) :OPERATOR "-" :BOUNDS (24 . 26)))))
    :OPERATOR "*" :BOUNDS (0 . 27))
   NIL
   T

3Dictionary

    (ql:quickload '(:alexandria :split-sequence
                    :parser.common-rules :parser.common-rules.operators))
    (defun doc (symbol kind)
      (let* ((lambda-list (sb-introspect:function-lambda-list symbol))
             (string      (or (documentation symbol kind)
                              (error "~@<~A ~S is not documented.~@:>"
                                     kind symbol)))
             (lines       (split-sequence:split-sequence #\Newline string))
             (strip       (reduce
                           #'min (rest lines)
                           :key (lambda (line)
                                  (or (position #\Space line :test-not #'char=)
                                      most-positive-fixnum))))
             (trimmed     (mapcar (lambda (line)
                                    (subseq line (min strip (length line))))
                                  (rest lines))))
        (format nil "~(~A~) ~<~{~A~^ ~}~:@>~2%~{~A~^~%~}"
                symbol (list lambda-list) (list* (first lines) trimmed))))

3.1Anchors

     <beginning-of-input>

     Matches at the beginning of the input (i.e. there is no preceding
     character). Produces :beginning-of-input and does not consume input.
     <end-of-input>

     Matches at the end of the input line (i.e. there is no following
     character). Produces :end-of-input and does not consume input.
     <beginning-of-line>

     Matches at the beginning of a line (i.e. the preceding character is
     #\Newline or there is no preceding character). Produces
     :beginning-of-line and does not consume input.
     <end-of-line>

     Matches at the end of a line (i.e. the following character is
     #\Newline or there is no following character). Produces :end-of-line
     and does not consume input.
     <same-line>

     Consumes all characters until <end-of-line> and produces the resulting
     string.

3.2Whitespace

     whitespace/not-newline

     Consumes a single #\Space or #\Tab, produces nil.
     whitespace/not-newline?

     Consumes nothing or a single #\Space or #\Tab, produces nil.
     whitespace

     Consumes a single #\Tab, #\Space, #\Newline or #\Page, produces nil.
     whitespace?

     Consumes nothing or a single #\Tab, #\Space, #\Newline or #\Page,
     produces nil.
     whitespace+

     Consumes one or more #\Tab, #\Space, #\Newline or #\Page characters,
     produces nil.
     whitespace*

     Consumes zero or more #\Tab, #\Space, #\Newline or #\Page characters,
     produces nil.

3.3Comments

     c-style-comment/rest-of-line[/trimmed]

     Consumes a comment of the form // … <end-of-line>, produces a string
     from the enclosed characters. The /trimmed variant removes leading
     #\/ characters. The plain variant uses the character unmodified.
     c-style-comment/delimited[/trimmed]

     Consumes a comment of the form /* … */, produces a string from the
     enclosed characters. The /trimmed variant removes a common prefix
     consisting of #\Space and #\* characters. The plain variant uses the
     enclosed characters unmodified.
     shell-style-comment[/trimmed]

     Consumes a comment of the form # … <end-of-line>, produces a string
     from the enclosed characters. The /trimmed variant removes leading
     #\# characters. The plain variant uses the character unmodified.
     lisp-style-comment[/trimmed]

     Consumes a comment of the form ; … <end-of-line>, produces a string
     from the enclosed characters. The /trimmed variant removes leading
     #\; characters. The plain variant uses the character unmodified.

3.4Literals

     boolean-literal/{lower-case,capital-case,extended}

     Consumes a Boolean value of the form

          true | false
       or True | False
       or true | false | t | f | 1 | 0

     respectively and produces t or nil.
     integer-literal/{binary,octal,decimal,hexdecimal}{,/prefix,/no-sign}

     Consumes an integer literal and produces its value.

     Variants:

                    /prefix         plain         /no-sign
       binary                       {+,-,}[01]+   [01]+
       octal        {+,-,}0o[0-7]+  {+,-,}[0-7]+  [0-7]+
       decimal                      {+,-,}[0-9]+  [0-9]+
       hexadecimal  {+,-,}0x[0-f]+  {+,-,}[0-f]+  [0-f]+
     {,single-,double-}float-literal[/rational]

     Consumes a floating point literal in fixed or scientific notation and
     produces its value as rational, single-float or double-float value.

     The /rational variants return the parsed number as a rational value
     while the plain variants coerce the parsed number into the respective
     float sub-type.

     the single- and double- variants verify that the parsed number is
     within the value range of the respective type.
     number-literal

     Consumes an integer or float literal and produces its value. In case
     of a float literal, a single-float value is returned.
     string-literal-{single,double,triple,sextuple}-quotes

     Consumes a string literal delimited by ', ", ''' or """ respectively.
     Produces the content of the literal (i.e. excluding the delimiters) as
     a string.

     For the single-quote and double-quote rules, the #\\ character
     initiates escape sequences. The following escape sequences are
     recognized:

       \\                                       -> #\Backslash

       \a                                       -> #\Bel
       \b                                       -> #\Backspace
       \f                                       -> #\Page
       \n                                       -> #\Newline
       \r                                       -> #\Return
       \t                                       -> #\Tab
       \v                                       -> #\Line_Tabulation

       \<octal number below decimal 256>        -> the character with that code
       \x<hexadecimal number below decimal 256> -> the character with that code

3.5Tokenization

     (doc 'parser.common-rules:defrule/s 'function)
   defrule/s NAME-AND-OPTIONS EXPRESSION &BODY OPTIONS

   Like `esrap:defule' but define additional rules named NAME/s and
   NAME/?s which respectively require/allow EXPRESSION to be followed
   by skippable input (e.g. whitespace).

   NAME-AND-OPTIONS can be either just a rule name or a list of the
   form

     (NAME &key
           SKIPPABLE-EXPRESSION  S?
           SKIPPABLE?-EXPRESSION ?S?
           DEFINER)

   where SKIPPABLE-EXPRESSION and SKIPPABLE?-EXPRESSION name the rules
   used to parse skippable input in the NAME/s and NAME/?s
   variants. Default to `skippable' and `skippable?' respectively.

   S? and ?S? control which of the NAME/S and NAME/?S rules should be
   generated. Default is generating both.

   DEFINER is the name of the macro used to define the "main"
   rule. Defaults to `esrap:defrule'.

3.6Infix Operators

     (doc 'parser.common-rules.operators:define-unary-operator-rule 'function)
   define-unary-operator-rule NAME OPERATOR-EXPRESSION NEXT &KEY (FIXITY PREFIX)
                              (SKIPPABLE?-EXPRESSION) (DEFINER 'DEFRULE)
                              (NODE-KIND UNARY-OPERATOR)

   Define a rule NAME for parsing an unary operator expressions with
   operator OPERATOR-EXPRESSION and operand NEXT.

   FIXITY has to be one of

   :prefix

     Generate a prefix operator, i.e.

       (and OPERATOR-EXPRESSION SKIPPABLE?-EXPRESSION NEXT)

   :postfix

     Generate a postfix operator, i.e.

       (and NEXT SKIPPABLE?-EXPRESSION OPERATOR-EXPRESSION)

   If supplied, SKIPPABLE?-EXPRESSION is the expression to be used for
   parsing skippable input (usually whitespace) between
   OPERATOR-EXPRESSION and NEXT. If SKIPPABLE?-EXPRESSION is not
   supplied, a rule whose name is

     (find-symbol (string '#:skippable?) (symbol-package OPERATOR-NAME))

   is used.

   If supplied, DEFINER names the macro that should be used to define
   the rule. Otherwise `esrap:defrule' is used.
     (doc 'parser.common-rules.operators:define-binary-operator-rule 'function)
   define-binary-operator-rule NAME OPERATOR-EXPRESSION NEXT &KEY
                               (ASSOCIATIVITY LEFT) (SKIPPABLE?-EXPRESSION)
                               (DEFINER 'DEFRULE) (NODE-KIND BINARY-OPERATOR)

   Define a rule NAME for parsing a binary operator expressions with
   operator OPERATOR-EXPRESSION and operands NEXT.

   ASSOCIATIVITY has to be one of

   :none

     The defined binary operator will be non-associative, i.e. for an
     OPERATOR-EXPRESSION ":=", the expressions x:=y:=z will not be
     syntatically legal.

   :left

     The defined binary operator will associate to the left, i.e.
     x+y+z will be parsed as (x+y)+z.

   :right

     The defined binary operator will associate to the right, i.e.
     x^y^z will be parsed as x^(y^z).

   :associative

     The defined binary operator will associate to the left (but this
     should not be relied upon).

   If supplied, SKIPPABLE?-EXPRESSION is the expression to be used for
   parsing skippable input (usually whitespace) between
   OPERATOR-EXPRESSION and NEXT. If SKIPPABLE?-EXPRESSION is not
   supplied, a rule whose name is

     (find-symbol (string '#:skippable?) (symbol-package OPERATOR-NAME))

   is used.

   If supplied, DEFINER names the macro that should be used to define
   the rule. Otherwise `esrap:defrule' is used.
     (doc 'parser.common-rules.operators:define-ternary-operator-rule 'function)
   define-ternary-operator-rule NAME OPERATOR1-EXPRESSION OPERATOR2-EXPRESSION
                                NEXT &KEY (SKIPPABLE?-EXPRESSION)
                                (DEFINER 'DEFRULE) (NODE-KIND TERNARY-OPERATOR)

   Define a rule NAME for parsing a ternary operator expressions with
   operators OPERATOR1-EXPRESSION and OPERATOR2-EXPRESSION and
   operands NEXT.

   If supplied, SKIPPABLE?-EXPRESSION is the expression to be used for
   parsing skippable input (usually whitespace) between
   OPERATOR-EXPRESSION and NEXT. If SKIPPABLE?-EXPRESSION is not
   supplied, a rule whose name is

     (find-symbol (string '#:skippable?) (symbol-package OPERATOR-NAME))

   is used.

   If supplied, DEFINER names the macro that should be used to define
   the rule. Otherwise `esrap:defrule' is used.
     (doc 'parser.common-rules.operators:define-operator-rules 'function)
   define-operator-rules (&KEY SKIPPABLE?-EXPRESSION
                          (UNARY-NODE-KIND UNARY-OPERATOR)
                          (BINARY-NODE-KIND BINARY-OPERATOR)
                          (TERNARY-NODE-KIND TERNARY-OPERATOR))
                         &BODY CLAUSES

   Define rules for parsing infix operators according to CLAUSES.

   The order of clauses in CLAUSES determines the precedence of
   operators:

     (define-operator-rules ()
       OPERATOR-WITH-LOWEST-BINDING-POWER
       ⋮
       OPERATOR-WITH-HIGHEST-BINDING-POWER
       LEAF-EXPRESSION)

   All but the final clause in CLAUSES are of the form

     (ARITY RULE-NAME OPERATOR-EXPRESSION &rest ARGS &key)

   where

   * ARITY is the number of operands accepted by the operator
     being defined. The ARITY must be either 1, 2 or 3.

   * RULE-NAME is the name of the rule generated for the operator.

   * OPERATOR-EXPRESSION is an expression for parsing the operator
     token, e.g. #\* for multiplication.

   * ARGS can be any of the keyword arguments accepted by
     `define-unary-operator-rule', `define-binary-operator-rule' or
     `define-ternary-operator-rule' depending on ARITY, i.e.

     * :fixity (:prefix | :postfix)

       Only for unary operators. Fixity of the operator being defined.

     * :associativity (:none | :left | :right | :associative)

       Only for binary operators. Associativity of the operator being
       defined.

     * :skippable?-expression EXPRESSION

       See below.

     * :definer RULE-NAME

       The macro used to define the operator rule. Defaults to
       `esrap:defrule'.

   The final LEAF-EXPRESSION clause is just a rule expression,
   describing the "leafs" (i.e. not operator expressions) of the
   operator grammar.

   Whitespace handling can be controlled by specifying rules for
   "skippable" input using the :skippable?-expression keyword
   argument in ARGS. If supplied, SKIPPABLE?-EXPRESSION is applied to
   all defined operators. If SKIPPABLE?-EXPRESSION is not supplied, a
   rule whose name is

     (find-symbol (string '#:skippable?) (symbol-package OPERATOR-NAME))

   is used.

   Example

     (define-operator-rules (:skippable?-expression (* #\Space))
       (2 term   "+") ; lowest binding power
       (2 factor "*")
       (1 neg    "-") ; highest binding power
       #\x)           ; leaf expression

     (architecture.builder-protocol:with-builder ('list)
       (esrap:parse 'term "x + x * -x"))
     =>
     (:BINARY-OPERATOR
      (:OPERAND (("x")
                 ((:BINARY-OPERATOR
                   (:OPERAND (("x")
                              ((:UNARY-OPERATOR
                                (:OPERAND (("x")))
                                :OPERATOR "-" :BOUNDS (4 . 6)))))
                   :OPERATOR "*" :BOUNDS (2 . 6)))))
      :OPERATOR "+" :BOUNDS (0 . 6))

   Note that this macro is not concerned with forcing operator
   bindings via parentheses. See the documentation for recommendations
   on that.

4Settings :noexport:

Dependencies (6)

  • alexandria
  • architecture.builder-protocol
  • esrap
  • fiveam
  • let-plus
  • split-sequence

Dependents (1)

  • GitHub
  • Quicklisp