inquisitor
2019-05-21
Encoding/end-of-line detection and of external-format abstraction for Common Lisp
Inquisitor
Encoding/end-of-line detection and external-format abstraction for Common Lisp.
The Library is a sphere whose exact center is any one of its hexagons and whose circumference is inaccessible. -- "The Library of Babel" by Jorge Luis Borges
Features
- encoding/end-of-line name abstraction
- encoding/end-of-line detection
- external-format abstraction
- make external-format for each implementations
- make external-format from byte-array, stream and pathname (with auto-detection)
- [TODO] abstract external-format of babel and flexi-stream
- supports many implementations
- GNU CLISP
- Embeddable Common Lisp
- Steel Bank Common Lisp
- Clozure Common Lisp
- Armed Bear Common Lisp
- Allegro Common Lisp
- LispWorks
Installation
Get and install via quicklisp:
CL-USER> (ql:quickload :inquisitor)
Usage
Encoding detection
To detect encoding from stream, use (inq:detect-encoding stream scheme)
.
This returns implementation independent encoding name.
About scheme
, see Encoding scheme
.
for example:
CL-USER> (with-open-file (in #P"t/data/unicode/utf-8.txt" :direction :input :element-type '(unsigned-byte 8)) (inq:detect-encoding in :jp)) :UTF-8
You can see the list of available encodings:
CL-USER> inq:+available-encodings+ (:UTF-8 :UCS-2LE :UCS-2BE :UTF-16 :ISO-2022-JP :EUC-JP :CP932 :BIG5 :ISO-2022-TW :GB2312 :GB18030 :ISO-2022-CN :EUC-KR :JOHAB :ISO-2022-KR :ISO-8859-6 :CP1256 :ISO-8859-7 :CP1253 :ISO-8859-8 :CP1255 :ISO-8859-9 :CP1254 :ISO-8859-5 :KOI8-R :KOI8-U :CP866 :CP1251 :ISO-8859-2 :CP1250 :ISO-8859-13 :CP1257)
Encoding scheme
Encoding scheme is a hint to detect encoding.
It's mostly impossible to detect encoding universally, because there are two encoding such that use same byte sequences to represent other characters. So, limitting target encodings has benefit to encoding detection.
Here, in inquisitor, languages are used to limit the encodings. Where languages are, roughly speaking, writing systems used in anywhere arround the world. Fixing language is equivalent to fixing possible characters. Becaus of which, encoding detection be slightly eazy.
Supported scheme (languages) is as follows:
- jp: japanese
- tw: taiwanese
- cn: chinese
- kr: korean
- ru: russian (latin-5)
- ar: arabic (latin-6)
- tr: turkish (latin-9)
- gr: greek (latin-7)
- hw: hebrew (latin-8)
- pl: polish (latin-2)
- bl: baltic (latin-7)
End-of-line type detection
If you want to know end-of-line (line break) type, use (inq:detect-end-of-line stream)
.
This returns implementation independent end-of-line name.
CL-USER> (with-open-file (in "t/data/ascii/ascii-crlf.txt" :direction :input :element-type '(unsigned-byte 8)) (inquisitor:detect-end-of-line in)) :CRLF
Implementation dependent/independent names
If you want to know implementation dependent name of encodings or eol type, use (inq:independent-name dependent-name)
.
Returned value can be used as external-format, or its part.
CL-USER> (inq:independent-name :cp932) #<ENCODING "CP932" :UNIX> ; on CLISP :WINDOWS-CP932 ; on ECL :SHIFT_JIS ; on SBCL :WINDOWS-31J ; on CCL :|X-MS932_0213| ; on ABCL
If you want to know implementation independent name of encodings or eol type, use (inq:dependent-name independent-name)
.
Eol
If you want to know eol is available on your implementation, use (inq:eol-available-p)
.
CL-USER> (inq:eol-available-p) NIL ; on SBCL
Make external-format
To make external-format from impl independent names, use (inq:make-external-format enc eol)
.
In SBCL and CCL, same code returns different value.
On SBCL:
CL-USER> (let* ((file #P"t/data/ja/sjis.txt") (enc (inq:detect-encoding file :jp)) (eol (inq:detect-end-of-line file))) (inq:make-external-format enc eol)) :SHIFT_JIS
On CCL:
CL-USER> (let* ((file #P"t/data/ja/sjis.txt") (enc (inq:detect-encoding file :jp)) (eol (inq:detect-end-of-line file))) (inq:make-external-format enc eol)) #<EXTERNAL-FORMAT :WINDOWS-31J/:UNIX #x302001C574CD>
External-format detection
Inquisitor provides external-format detection method. It detects encoding and eol style, then make external-format from these. It can use with vector, byte stream and pathname.
Let's see examples with CCL.
From vector
CL-USER> (inq:detect-external-format (encode-string-to-octets "公的な捜索係、調査官がいる。 わたしは彼らが任務を遂行しているところを見た。") :jp) #<EXTERNAL-FORMAT :UTF-8/:UNIX #x30200046719D>
From stream
CL-USER> (with-open-file (in "t/data/unicode/utf-8.txt" :direction :input :element-type '(unsigned-byte 8)) (inq:detect-external-format in :jp)) #<EXTERNAL-FORMAT :UTF-8/:UNIX #x30200046719D>
From pathname
CL-USER> (inq:detect-external-format #P"t/data/unicode/utf-8.txt" :jp) #<EXTERNAL-FORMAT :UTF-8/:UNIX #x30200046719D>
Author
- Shiro Kawai - original code of encoding detection for Gauche
- Masayuki Onjo - porting from Gauche to Common Lisp
- zqwell - porting multilingual encoding detection from libguess
- Shinichi Tanaka (shinichi.tanaka45@gmail.com) - adding end-of-line detection and wrapping external-format
Copyright
Copyright (c) 2000-2007 Shiro Kawai (shiro@acm.org)
Copyright (c) 2007 Masayuki Onjo (onjo@lispuser.net)
Copyright (c) 2011 zqwell (zqwell@gmail.com)
Copyright (c) 2015 Shinichi Tanaka (shinichi.tanaka45@gmail.com)
License
Licensed under the MIT License.