trivial-sanitize
2024-10-12
clean html strings: "<a>foo</a>" → "foo"
1 Introduction
trivial-sanitize is a tiny library for cleaning HTML strings.
2 Status
The library is under development.
3 Prerequisites:
- alexandria
- cl-ppcre
- cl-html5-parser
To run the tests,
- clunit2
must also be installed (via quicklisp).
4 Installation:
The best way to get trivial-sanitize working is using the excellent
+quicklisp+
#+BEGIN_SRC common-lisp
+(ql:quickload "trivial-sanitize")+
#+END_SRC
This library is not yet in quicklisp; to install it, copy the sources into your local-projects
directory.
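Once the sources are in place, the system can be loaded from the REPL. The following is a minimal sketch, assuming a default Quicklisp setup (where local-projects lives under ~/quicklisp/) and assuming the system is named "trivial-sanitize", as in the struck-through snippet above.

#+BEGIN_SRC common-lisp
;; Assumption: the sources were copied to
;; ~/quicklisp/local-projects/trivial-sanitize/

;; Refresh Quicklisp's index of local projects (useful right after copying).
(ql:register-local-projects)

;; Load the system; the prerequisites listed above that Quicklisp provides
;; should be fetched automatically.
(ql:quickload "trivial-sanitize")
#+END_SRC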
5 Usage
The public API consists of a single function, sanitize:
(sanitize:sanitize "<a>foo</a>" sanitize:*strips-all-tags-sanitize*)
The arguments of sanitize are:
(sanitize:sanitize str rules &key
(case-insensitive-tag-match t)
(strip-comments t)
(strips-all-tags-on-malformed-html nil))
where:
- str
- is the string that needs to be cleaned;
- rules
- is the set of cleaning filters to be applied to str; the rules can be built using the macro define-rules, for example:
(define-rules *basic-sanitize*
  :tags ("a" "abbr" "b" "blockquote" "br" "cite" "code" "dd" "dfn" "dl" "dt"
         "em" "i" "kbd" "li" "mark" "ol" "p" "pre" "q" "s" "samp" "small"
         "strike" "strong" "sub" "sup" "time" "u" "ul" "var")
  :attributes ((:all . ("title"))
               ("a" . ("href"))
               ("blockquote" . ("cite"))
               ("dfn" . ("title"))
               ("q" . ("cite"))
               ("time" . ("datetime" "pubdate")))
  :add-attributes (("a" . (("rel" . "nofollow"))))
  :protocols (("a" . (("href" . (:ftp :http :https :mailto :relative))))
              ("blockquote" . (("cite" . (:http :https :relative))))
              ("q" . (("cite" . (:http :https :relative))))))
- :tags
- the argument for :tags is a list of allowed tags;
- :attributes
- the argument for :attributes is a list where each element is a list whose first element is a tag name or the special keyword :all, and whose rest is the list of attributes allowed for that tag;
- :add-attributes
- the argument for :add-attributes is a list of lists, where the first element of each inner list is the name of a tag and the rest is a list of conses, each specifying the name of an attribute (as the car) and the value of that attribute (as the cdr), to be added to the attributes of the tag; so, for example, a string like:
<a href="http://...">email</a>
will become:
<a href="http://..." rel="nofollow">email</a>
- :protocols
- the argument for :protocols is a list where each element is a list of the form
first element: tag-name rest: attribute-protocols
- tag-name
- the name of the tag to which this rule applies;
- attribute-protocols
- a list where each element is also a list, of the form
first element: attribute-name rest: allowed-protocols
for example the list:
'("a" . (("href" . (:ftp :http :https :mailto :relative))))
means that, for the tag a, the attribute href can contain only values that specify protocols of type "ftp", "http", "https", "mailto" or "relative".
Four sets of rules are already available:
*basic-sanitize*
*strips-all-tags-sanitize*
*restricted-sanitize*
*relaxed-sanitize*
Please see the file src/sanitize.lisp for their definition.
- case-insensitive-tag-match
- if non-nil, ignore case when matching tags (or other elements);
- strip-comments
- if non-nil, remove comments (i.e. text wrapped in <!-- --> on the same line);
- strips-all-tags-on-malformed-html
- if non-nil and parsing str produces errors, run this function again using a set of rules that tries to strip all the tags from str.
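The nested shape of the :attributes and :protocols arguments described above can be explored with plain association-list lookups. The following is a minimal sketch in portable Common Lisp that does not call the library at all: the variables and the helpers allowed-attributes and allowed-protocols are hypothetical names, introduced only to show how the data is laid out. It is not how trivial-sanitize processes rules internally, and it assumes that :all applies to every tag.

#+BEGIN_SRC common-lisp
;; Data copied from the *basic-sanitize* example above. These variables are
;; illustrative only and are not part of trivial-sanitize.
(defparameter *example-attributes*
  '((:all . ("title"))
    ("a" . ("href"))
    ("time" . ("datetime" "pubdate"))))

(defparameter *example-protocols*
  '(("a" . (("href" . (:ftp :http :https :mailto :relative))))
    ("q" . (("cite" . (:http :https :relative))))))

(defun allowed-attributes (tag attributes)
  "Hypothetical helper: attributes listed under :all followed by those for TAG."
  (append (cdr (assoc :all attributes))
          (cdr (assoc tag attributes :test #'equal))))

(defun allowed-protocols (tag attribute protocols)
  "Hypothetical helper: protocols listed for ATTRIBUTE of TAG, or NIL if none."
  (let ((attribute-protocols (cdr (assoc tag protocols :test #'equal))))
    (cdr (assoc attribute attribute-protocols :test #'equal))))

(allowed-attributes "a" *example-attributes*)
;; => ("title" "href")
(allowed-protocols "a" "href" *example-protocols*)
;; => (:FTP :HTTP :HTTPS :MAILTO :RELATIVE)
#+END_SRC

Because a dotted pair whose cdr is a list is just a longer list, '("a" . ("href")) and '("a" "href") denote the same structure; the dotted notation only makes the "first element / rest" reading explicit.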
Moreover, there are two important special variables that help fine-tune the generated HTML:
- *whitespace-elements*
- a list of tag names that are replaced with whitespace when the tag is stripped; for example, given the string
"<div>foo</div>"
if div is present in the list bound to *whitespace-elements*, stripping the tag results in
" foo "
otherwise the result is:
"foo"
- *self-closing-elements*
- a list of tags that are kept as self-closing tags; for example,
"a<hr>b"
is kept as is if hr is present in the list bound to *self-closing-elements*:
"a<hr>b"
but becomes:
"a<hr></hr>b"
if not present.
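Since these are special variables, they can be rebound with let for the duration of a single call. The sketch below rests on several assumptions that should be checked against src/sanitize.lisp: that the variable is exported from the sanitize package as *whitespace-elements*, and that its value is a list of tag names given as strings.

#+BEGIN_SRC common-lisp
;; Assumptions (not confirmed by this README): the symbol
;; sanitize:*whitespace-elements* is exported, and the list holds tag names
;; as strings such as "div".
(let ((sanitize:*whitespace-elements*
        (cons "div" sanitize:*whitespace-elements*)))
  ;; Inside this LET, "div" is in the list, so this call should correspond
  ;; to the first case described above.
  (sanitize:sanitize "<div>foo</div>" sanitize:*strips-all-tags-sanitize*))
#+END_SRC

Rebinding with let instead of setf keeps the change local to the dynamic extent of the call, so other callers keep seeing the default list.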
6 BUGS
- when malformed HTML is provided to sanitize, a spurious "</root-tag>" may appear at the end of the filtered string
Please send bug reports or patches to the issue tracker.
7 License
This library is released under the Lisp Lesser General Public License (see the COPYING.LESSER file).
8 NO WARRANTY
This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.
9 Acknowledgment
My deep thanks to the authors of cl-sanitize and hunchentoot. Thank you!