cl-html5-parser

2017-10-19

cl-html5-parser: HTML5 parser for Common Lisp

Abstract

cl-html5-parser is an HTML5 parser for Common Lisp with the following features:

  • It is a port of the Python library html5lib.
  • It passes all relevant tests from html5lib.
  • It is not tied to a specific DOM implementation.

Requirements

  • SBCL or ECL.
  • CL-PPCRE and FLEXI-STREAMS.

It might work with CLISP, ABCL and Clozure CL, but many of the tests fail on those implementations.

Usage

Parsing

Parsing functions are in the package HTML5-PARSER.

parse-html5 source &key encoding strictp dom
    => document, errors

Parse an HTML document from source. Source can be a string, a pathname or a stream. Encoding detection is not supported when parsing from a stream, so the encoding must be supplied via the encoding keyword parameter.

When strictp is true, parsing stops on the first error.

Returns two values. The primary value is the document node. The secondary value is a list of errors found during parsing. The format of this list is subject to change.

The type of the document depends on the dom parameter. By default it is an instance of cl-html5-parser's own DOM implementation. See the section "The DOM" below for more information.
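As a minimal sketch of a call that uses both return values: the helper name parse-and-report is hypothetical, the system is assumed to be loadable through ASDF (for example after installing it with Quicklisp), and the :xmls target described later in this document is used so that the result prints as a plain list.

```lisp
;; Assumes ASDF can locate the system, e.g. after a Quicklisp install.
(asdf:load-system "cl-html5-parser")

(defun parse-and-report (source)
  "Hypothetical helper: parse SOURCE and return the document plus the
number of parse errors. SOURCE may be a string, a pathname or a stream."
  (multiple-value-bind (document errors)
      (html5-parser:parse-html5 source :dom :xmls)
    ;; Only the length of ERRORS is used, since the format of the
    ;; list is subject to change.
    (values document (length errors))))

;; The input has no doctype, so a non-empty error list is likely.
(parse-and-report "<title>Test</title><p>Hello")
```

Relying only on the number of errors, and not on their contents, keeps the caller independent of the error format.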

parse-html5-fragment source &key container encoding strictp dom
    => document-fragment, errors

Parses a fragment of HTML. Container names the context element in which the fragment is parsed and defaults to "div". Returns a document-fragment node. For the other parameters see PARSE-HTML5.

Example

(html5-parser:parse-html5-fragment "Parse <i>some</i> HTML" :dom :xmls)
==> ("Parse " ("i" NIL "some") " HTML")
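The container parameter matters because the same markup can be treated differently depending on the element it is parsed inside. The following sketch, with a hypothetical helper parse-in-context, parses a table cell in two contexts so that the results can be compared at the REPL. No expected output is shown, since it depends on the parser's tree construction rules.

```lisp
;; Assumes ASDF can locate the system, e.g. after a Quicklisp install.
(asdf:load-system "cl-html5-parser")

(defun parse-in-context (html container)
  "Hypothetical helper: parse HTML as the contents of a CONTAINER element."
  (html5-parser:parse-html5-fragment html :container container :dom :xmls))

;; A table cell parsed inside the default "div" context ...
(parse-in-context "<td>cell</td>" "div")
;; ... and inside a table row, where a td start tag is expected.
(parse-in-context "<td>cell</td>" "tr")
```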

The DOM

Parsing HTML5 is not possible without a DOM. cl-html5-parser defines a minimal DOM implementation for this task. Functions for traversing documents are exported by the HTML5-PARSER package.

Alternatively, the parser can be instructed to convert the document into other DOM implementations using the dom parameter. The conversion is done by calling the generic function transform-html5-dom. Support for other DOM implementations can be added by defining new methods for this generic function. The dom parameter is either a symbol or a list where the car is a symbol and the rest is keyword arguments. The currently supported target types are described below.
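A new target type could be added along the following lines. This is only a sketch under several assumptions: that transform-html5-dom takes the target type and the node as its two required arguments, that node-type returns keywords such as :text and :element, and that node-value and element-map-children behave as their names suggest. The target name :plain-text is made up for the example.

```lisp
;; Assumes ASDF can locate the system, e.g. after a Quicklisp install.
(asdf:load-system "cl-html5-parser")

;; :PLAIN-TEXT is a made-up target type for this sketch.
(defmethod html5-parser:transform-html5-dom
    ((to-type (eql :plain-text)) node &key)
  "Return the concatenated text content below NODE as a string."
  (with-output-to-string (out)
    (labels ((walk (n)
               (case (html5-parser:node-type n)
                 ;; Text nodes contribute their value.
                 (:text
                  (write-string (html5-parser:node-value n) out))
                 ;; Container nodes are traversed recursively.
                 ((:document :document-fragment :element)
                  (html5-parser:element-map-children #'walk n)))))
      (walk node))))

;; The new target is then selected like the built-in ones.
(html5-parser:parse-html5-fragment "Parse <i>some</i> HTML" :dom :plain-text)
```

Comments and doctypes fall through the case form and are ignored, which is a design choice of this sketch and not a requirement of the library.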

Namespace of elements and attributes

The HTML5 syntax has no support for namespaces. However, the standard defines special rules that set the expected namespace for SVG and MathML elements and for the following attributes: xlink:actuate, xlink:arcrole, xlink:href, xlink:role, xlink:show, xlink:title, xlink:type, xml:base, xml:lang, xml:space, xmlns, xmlns:xlink. Please note that the attribute rules only apply to SVG and MathML elements. Attributes of HTML elements will never get a namespace.

Examples

<html xml:lang='en'><svg xml:lang='en'></svg></html>
  • Element html with namespace http://www.w3.org/1999/xhtml
  • Attribute with name xml:lang (no prefix)
  • Element svg with namespace http://www.w3.org/2000/svg
  • Attribute with prefix xml, local name lang, namespace http://www.w3.org/XML/1998/namespace
(html5-parser:parse-html5 "<!doctype html><html xml:lang='en' xml@lang='en'>" :dom :xmls-ns)
==>
(("html" . "http://www.w3.org/1999/xhtml")
 (("xmlU00003Alang" "en") ("xmlU000040lang" "en")) ("head" NIL) ("body" NIL))

On an HTML element, xml:lang and xml@lang are just attributes with unusual characters in their names. In the HTML DOM these names are kept as is, but when converting to XML they are escaped so that the resulting XML is valid. This escaping can be reversed with HTML5-PARSER:XML-UNESCAPE-NAME.
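To illustrate reversing the escaping, the sketch below maps XML-UNESCAPE-NAME over the attribute names of the root element from the example above. The helper unescaped-attribute-names is hypothetical, and it assumes the XMLS layout shown in the output above, where the second item of an element is its attribute list and attribute names are strings.

```lisp
;; Assumes ASDF can locate the system, e.g. after a Quicklisp install.
(asdf:load-system "cl-html5-parser")

(defun unescaped-attribute-names (element)
  "Hypothetical helper: list the attribute names of an XMLS ELEMENT of
the form (name attributes . children), with the escaping undone."
  (mapcar (lambda (attribute)
            (html5-parser:xml-unescape-name (first attribute)))
          (second element)))

;; Should give back the names as they were written in the markup.
(unescaped-attribute-names
 (html5-parser:parse-html5
  "<!doctype html><html xml:lang='en' xml@lang='en'>" :dom :xmls-ns))
```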

(html5-parser:parse-html5 "<!doctype html><svg xml:lang='en' xml@lang='en' xlink:href='#' xlink:to='#'></svg>" :dom :xmls-ns)
==>
(("html" . "http://www.w3.org/1999/xhtml") NIL ("head" NIL)
 ("body" NIL
  (("svg" . "http://www.w3.org/2000/svg")
   (("xml:lang" "en") ("xmlU000040lang" "en") ("xlink:href" "#")
    ("xmlns:xlink" "http://www.w3.org/1999/xlink") ("xlinkU00003Ato" "#")))))

In this case xml:lang and xmlns:xlink are among the attributes with a known namespace when used on SVG and MathML elements. However, xlink:to is not in the list, even though it is defined in the XLink standard.

:XMLS or (:XMLS &key namespace comments)

Converts a node into a simple XMLS-like list structure. If the node is a document fragment, a list of XMLS nodes is returned. In all other cases a single XMLS node is returned.

If the namespace argument is true, tag names are conses of name and namespace URI.

By default comments are stripped. If the comments argument is true, comments are returned as (:COMMENT NIL "comment text"). This is an extension of the XMLS format.
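The list form of the dom parameter is how these arguments are passed. A small sketch, in which the wrapper fragment-to-xmls is a hypothetical name introduced only for illustration:

```lisp
;; Assumes ASDF can locate the system, e.g. after a Quicklisp install.
(asdf:load-system "cl-html5-parser")

(defun fragment-to-xmls (html &key namespace comments)
  "Hypothetical helper: parse HTML as a fragment and convert it to XMLS,
passing the options through the list form of the dom parameter."
  (html5-parser:parse-html5-fragment
   html :dom (list :xmls :namespace namespace :comments comments)))

;; Comments are stripped by default ...
(fragment-to-xmls "<!-- note --><p>Hi</p>")
;; ... and kept as (:COMMENT NIL "...") nodes when requested.
(fragment-to-xmls "<!-- note --><p>Hi</p>" :comments t)
;; With :namespace t, tag names become (name . namespace-uri) conses.
(fragment-to-xmls "<p>Hi</p>" :namespace t)
```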

:CXML

Converts to the Closure XML Parser DOM implementation. In order to use this you must load or depend on the system cl-html5-parser-cxml.

License

This library is available under the GNU Lesser General Public License v3.0.

Author
Thomas Bakketun <thomas.bakketun@copyleft.no>
License
GNU Lesser General Public License
Categories
html parser, lgpl