trivial-sanitize

2024-10-12

clean html strings: "<a>foo</a>" → "foo"

Upstream URL

codeberg.org/cage/trivial-sanitize

Author

cage

Maintainer

cage

License

LLGPL
README
trivial-sanitize

1 Introduction

trivial-sanitize is a tiny library to clean HTML strings.

2 Status

The library is under development.

3 Prerequisites:

  • alexandria
  • cl-ppcre
  • cl-html5-parser

To run the tests,

  • clunit2

also needs to be installed (via quicklisp).

4 Installation:

This library is not yet in quicklisp. To install it, copy the sources into your quicklisp local-projects directory; the system can then be loaded with:

    (ql:quickload "trivial-sanitize")

5 Usage

The public API has just a single function, sanitize.

    (sanitize:sanitize "<a>foo</a>" sanitize:*strips-all-tags-sanitize*)

The arguments of sanitize are:

    (sanitize:sanitize str rules &key
                       (case-insensitive-tag-match        t)
                       (strip-comments                    t)
                       (strips-all-tags-on-malformed-html nil))

where:

str
is the string that needs to be cleaned;
rules
is the set of cleaning filters to be applied to str; rules can be built using the macro define-rules, as in the example below:
        (define-rules *basic-sanitize*
          :tags   ("a" "abbr" "b" "blockquote" "br" "cite" "code" "dd" "dfn" "dl" "dt" "em" "i"
                       "kbd" "li" "mark" "ol" "p" "pre" "q" "s" "samp" "small" "strike" "strong"
                       "sub" "sup" "time" "u" "ul" "var")
          :attributes ((:all         . ("title"))
                       ("a"          . ("href"))
                       ("blockquote" . ("cite"))
                       ("dfn"        . ("title"))
                       ("q"          . ("cite"))
                       ("time"       . ("datetime" "pubdate")))
          :add-attributes (("a" . (("rel" . "nofollow"))))
          :protocols (("a"           . (("href" . (:ftp :http :https :mailto :relative))))
                      ("blockquote"  . (("cite" . (:http :https :relative))))
                      ("q"           . (("cite" . (:http :https :relative))))))
:tags
the argument for :tags is a list of allowed tags;
:attributes
the argument for :attributes is a list where each element is a list whose first element is a tag name or the special keyword :all, and whose rest is the list of allowed attributes for that tag;
:add-attributes
the argument for :add-attributes is a list of lists, where the first element of each inner list is the name of a tag and the rest is a list of conses that specify the name (the car) and the value (the cdr) of an attribute to be added to that tag; so, for example, a string like:
      <a href="http://...">email</a>

will become:

      <a href="http://..." rel="nofollow">email</a>
:protocols
the argument for :protocols is a list where each element is a list formed as follows:
      first element: tag-name rest: attribute-protocols
tag-name
the name of the tag to which this rule applies;
attribute-protocols
a list where each element is also a list:
       first element: attribute-name rest: allowed-protocols

For example, the list:

        '("a" . (("href" . (:ftp :http :https :mailto :relative))))

means that, for the tag a, the attribute href can contain only values that specify protocols of type "ftp", "http", "https", "mailto" or "relative".
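The nested lists used by :attributes and :protocols are ordinary association lists, so they can be read with standard Common Lisp. The sketch below is only an illustration of that structure and not how the library itself processes rules: the helper names allowed-attributes and allowed-protocols and the two example variables are hypothetical, and it assumes that entries under :all apply to every tag in addition to the tag-specific ones.

```lisp
;; Hypothetical helpers, NOT part of trivial-sanitize. They only show how
;; the nested lists passed to :attributes and :protocols can be read.

(defparameter *example-attributes*
  '((:all   . ("title"))
    ("a"    . ("href"))
    ("time" . ("datetime" "pubdate")))
  "Same shape as the :attributes argument of define-rules.")

(defparameter *example-protocols*
  '(("a"          . (("href" . (:ftp :http :https :mailto :relative))))
    ("blockquote" . (("cite" . (:http :https :relative)))))
  "Same shape as the :protocols argument of define-rules.")

(defun allowed-attributes (tag &optional (attributes *example-attributes*))
  "Attributes listed under :all followed by those listed under TAG.
Assumption: :all entries apply to every tag."
  (append (cdr (assoc :all attributes))
          (cdr (assoc tag attributes :test #'equal))))

(defun allowed-protocols (tag attribute
                          &optional (protocols *example-protocols*))
  "Protocols allowed for ATTRIBUTE of TAG, or NIL if no rule matches."
  (let ((attribute-protocols (cdr (assoc tag protocols :test #'equal))))
    (cdr (assoc attribute attribute-protocols :test #'equal))))

(allowed-attributes "a")                ; => ("title" "href")
(allowed-protocols "a" "href")          ; => (:FTP :HTTP :HTTPS :MAILTO :RELATIVE)
(allowed-protocols "blockquote" "href") ; => NIL
```

Because '("a" . (("href" . (:ftp ...)))) is the same object as '("a" ("href" :ftp ...)), two nested assoc lookups are enough to go from a tag and an attribute to its list of allowed protocols.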

Four sets of rules are already available:

  • *basic-sanitize*
  • *strips-all-tags-sanitize*
  • *restricted-sanitize*
  • *relaxed-sanitize*

Please see the file src/sanitize.lisp for their definitions.

case-insensitive-tag-match
if non-nil, ignore case differences when matching tags (or other elements);
strip-comments
if non-nil, remove comments (i.e. text wrapped in <!-- -->) on the same line;
strips-all-tags-on-malformed-html
if non-nil, when parsing str produces errors, run this function again using a set of rules that tries to strip all the tags from str.
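A call that sets every keyword argument explicitly might look like the sketch below. It assumes the system is already loaded and that *basic-sanitize* is accessible from the sanitize package in the same way as *strips-all-tags-sanitize* in the call shown earlier; clean-user-input is a hypothetical wrapper name, not part of the library.

```lisp
;; Assumes trivial-sanitize is already loaded, for example with
;; (ql:quickload "trivial-sanitize") once the sources are in local-projects.
;; CLEAN-USER-INPUT is a hypothetical wrapper, not part of the library.
(defun clean-user-input (html)
  "Sanitize HTML with the predefined basic rules, removing comments and
falling back to stripping every tag if the input is malformed."
  (sanitize:sanitize html
                     sanitize:*basic-sanitize*
                     :case-insensitive-tag-match        t
                     :strip-comments                    t
                     :strips-all-tags-on-malformed-html t))
```

Only the last keyword differs from its documented default (nil); it is switched on here so that input the parser cannot handle is reduced to plain text rather than returned partially filtered.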

Moreover, there are two important special variables that can help fine-tune the generated HTML:

whitespace-elements
a list of tag names that are replaced with whitespace when the tag is stripped, for example:

given the string

      "<div>foo</div>"

if div is present in the list bound to whitespace-elements, stripping the tag will result in

      " foo "

if it is not present, the result will be:

      "foo"
self-closing-elements
a list of tags that will be kept as self-closing tags, for example:
      "a<hr>b"

will be kept as is if hr is present in the list bound to self-closing-elements

      "a<hr>b"

but will become:

      "a<hr></hr>b"

if it is not present.

6 BUGS

  • when malformed HTML is passed to sanitize, a spurious "&lt;/root-tag&gt;" may appear at the end of the filtered string

Please send bug reports or patches to the issue tracker.

7 License

This library is released under the Lisp Lesser General Public License (see the COPYING.LESSER file).

8 NO WARRANTY

This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

9 Acknowledgment

My deep thanks to the authors of cl-sanitize and hunchentoot. Thank you!

Dependencies (5)

  • alexandria
  • cl-html5-parser
  • cl-ppcre
  • clunit2
  • uiop

Dependents (0)
