clean html strings: "<a>foo</a>" → "foo"
Moved to: https://codeberg.org/cage/trivial-sanitize/
This is a tiny library to clean HTML strings.
The library is under development.
To run the tests also
needs to be installed (via quicklisp).
The best way to get trivial-sanitize working is using the excellent
quicklisp.
This library is not yet in quicklisp; to install it, just copy the sources in your
The public API has just a single function:
(sanitize:sanitize "<a>foo</a>" sanitize:*strips-all-tags-sanitize*)
The arguments of sanitize:
(sanitize:sanitize str rules &key (case-insensitive-tag-match t) (strip-comments t) (strips-all-tags-on-malformed-html nil))
- str is the string that needs to be cleaned;
- rules is the set of cleaning filters to be applied to
str; the rules can be built using the macro define-rules:
(define-rules *basic-sanitize*
  :tags ("a" "abbr" "b" "blockquote" "br" "cite" "code" "dd" "dfn" "dl" "dt"
         "em" "i" "kbd" "li" "mark" "ol" "p" "pre" "q" "s" "samp" "small"
         "strike" "strong" "sub" "sup" "time" "u" "ul" "var")
  :attributes ((:all         . ("title"))
               ("a"          . ("href"))
               ("blockquote" . ("cite"))
               ("dfn"        . ("title"))
               ("q"          . ("cite"))
               ("time"       . ("datetime" "pubdate")))
  :add-attributes (("a" . (("rel" . "nofollow"))))
  :protocols (("a"          . (("href" . (:ftp :http :https :mailto :relative))))
              ("blockquote" . (("cite" . (:http :https :relative))))
              ("q"          . (("cite" . (:http :https :relative))))))
- the argument for :tags is a list of allowed tags;
- the argument for :attributes is a list where each element is a list whose first element is a tag name or the special keyword :all, and whose rest lists the allowed attributes for that tag;
- the argument for :add-attributes is a list of lists where the first element of each inner list is the name of a tag and the rest is a list of conses that specify the name of the attribute (as the car) and the value of the attribute (as the cdr) to be added to the attributes of the tag; so for example a string like:
<a href="http://..." rel="nofollow">email</a>
- the argument for :protocols is a list where each element is a list formed as:
first element: tag-name rest: attributes-protocols
- tag-name is the name of the tag where this rule applies;
- attributes-protocols is a list where each element is also a list, formed as:
first element: attribute-name rest: allowed-protocols
For example, the list:
'("a" . (("href" . (:ftp :http :https :mailto :relative))))
means that, for the tag a, the attribute href can contain only values that specify protocols of type "ftp", "http", "https", "mailto" or "relative".
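The nesting of these rule lists can be easier to follow when taken apart with plain list operations. The sketch below uses only standard Common Lisp and does not call the library; the variables *protocol-rules* and *add-attributes-rules* and the helper allowed-protocols are hypothetical names introduced for illustration, and the way trivial-sanitize walks its rules internally may differ.

```common-lisp
;; One :protocols entry and one :add-attributes entry, copied from the
;; rules shown above. The variable names are hypothetical.
(defparameter *protocol-rules*
  '(("a" . (("href" . (:ftp :http :https :mailto :relative))))))

(defparameter *add-attributes-rules*
  '(("a" . (("rel" . "nofollow")))))

;; Hypothetical helper (not part of trivial-sanitize): return the
;; protocols allowed for ATTRIBUTE on TAG, or NIL when no rule matches.
(defun allowed-protocols (rules tag attribute)
  (let* ((tag-entry  (assoc tag rules :test #'string=))
         (attr-entry (assoc attribute (cdr tag-entry) :test #'string=)))
    (cdr attr-entry)))

(allowed-protocols *protocol-rules* "a" "href")
;; => (:FTP :HTTP :HTTPS :MAILTO :RELATIVE)

(allowed-protocols *protocol-rules* "q" "cite")
;; => NIL

;; Each :add-attributes element is a cons: attribute name in the car,
;; attribute value in the cdr.
(let ((attr (first (cdr (assoc "a" *add-attributes-rules* :test #'string=)))))
  (list (car attr) (cdr attr)))
;; => ("rel" "nofollow")
```

Because the rules are ordinary association lists, assoc with a string= test is enough to inspect them at the REPL.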
Three sets of rules are already available:
Please see the file src/sanitize.lisp for their definition.
- case-insensitive-tag-match: if non-nil, ignore case when matching tags (or other elements);
- strip-comments: if non-nil, remove comments, i.e. text wrapped in <!-- --> on the same line;
- strips-all-tags-on-malformed-html: if non-nil, when parsing str produces an error, run this function again using a set of rules that tries to strip all the tags from it.
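As a rough illustration of the keyword arguments, the calls below follow the signature given earlier. This is a sketch under assumptions: it presumes the sources are somewhere ASDF can find them and that the system is named "trivial-sanitize" (inferred from the project name, not confirmed by this README). Only the first result is stated, because it is the one documented at the top of this file.

```common-lisp
;; Assumption: ASDF can locate the sources and the system is named
;; "trivial-sanitize"; adjust the name if your checkout differs.
(require "asdf")
(asdf:load-system "trivial-sanitize")

;; All keyword arguments left at the defaults shown in the signature.
(sanitize:sanitize "<a>foo</a>" sanitize:*strips-all-tags-sanitize*)
;; => "foo", per the example at the top of this README

;; Override the defaults: match tag names case-sensitively, keep
;; comments, and fall back to stripping every tag if parsing fails.
(sanitize:sanitize "<!-- note --><A>foo</A>"
                   sanitize:*strips-all-tags-sanitize*
                   :case-insensitive-tag-match nil
                   :strip-comments nil
                   :strips-all-tags-on-malformed-html t)
```

The result of the second call depends on how the library treats comments and tag case under the chosen rule set, so it is best checked at the REPL.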
Moreover, there are two important special variables that help fine-tune the generated HTML:
- whitespace-elements: a list of tag names that are replaced with white space when the tag is stripped; for example,
given the string
if div is present in the list bound to whitespace-elements, stripping the tag will result in
" foo "
if it is not present, the result will be:
- self-closing-elements: a list of tags that will be kept as self-closing tags; for example:
will be kept as is if present in the list bound to self-closing-elements
but will become:
if not present.
- when malformed HTML is provided to sanitize, a spurious "</root-tag>" could appear at the end of the filtered string.
Please send bug reports or patches to the issue tracker.
License
This library is released under the Lisp Lesser General Public License (see the COPYING.LESSER file).
This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.
My deep thanks to the authors of cl-sanitize and hunchentoot. Thank you!