cl-tesseract

2015-10-31
CL-TESSERACT is a set of CFFI bindings for the Tesseract OCR library v. 3.04: 
https://github.com/tesseract-ocr/tesseract

On OS X, Tesseract can be conveniently installed using Homebrew:
brew install tesseract

As Tesseract OCR?s capi changed in the update to v. 3.04, earlier versions such as 3.02
will not work with these bindings.

CL-TESSERACT also provides convenient lisp functions to retrieve text from images, 
IMAGE-TO-TEXT and IMAGE-TO-HOCR.

IMAGE-TO-TEXT accepts a lisp pathname and an optional language parameter and returns a 
unicode string:

* (image-to-text #P"~/eurotext.tif")
"The (quick) [brown] {fox} jumps!
Over the $43,456.78 <lazy> #90 dog
& duck/goose, as 12.5% of E-mail
from aspammer@website.com is spam.
Der ,,schnelle? braune Fuchs springt
?ber den faulen Hund. Le renard brun
?rapide? saute par-dessus le chien
paresseux. La volpe marrone rapida
salta sopra i] cane pigro. El zorro
marr?n r?pido salta sobre el perro
perezoso. A raposa marrom r?pida
salta sobre 0 C50 preguieoso.

"

* (image-to-text #P"~/eurotext.tif" :lang "rus")
"??? (?????) [?????] {???} ]????!
???? [?? $43?456.78 <1?2?> #90 603
& ?1???/3005?, ?? 12.5% ?? ?-????
???? ????????@???5?[?.??? ?? ????.
??? ?,5?11??11?? ?????? ????? ???????
????? ?!?? ?????? ????. ?? ?????? ????
????????? ???? ???-??????? 1? ?????
?????????. ?? ????? ?????? ??????
???? ????? ?] ???? ?????. ?1 ?????
?????? ?????? ???? ????? ?1 ????
????????. ? ?????? ????? ??????
???? ????? ? ??? ??????????.

"

Available languages are dependent on the Tesseract OCR .traineddata files located in the directory denoted by *TESSDATA-DIRECTORY*. CL-TESSERACT attempts to set this variable to 
a reasonable default for your platform.

IMAGE-TO-HOCR accepts a lisp pathname, the optional language parameter, and a optional 
page number (default 0) and return HOCR XML describing not just the recognized text, but 
its location in the page:

* (image-to-hocr #P"~/python-tesseract/eurotext.jpg?)
"  <div class='ocr_page' id='page_2' title='image \"/Users/Walrus/python-tesseract/eurotext.jpg\"; bbox 0 0 1024 800; ppageno 1'>
   <div class='ocr_carea' id='block_2_1' title=\"bbox 98 66 918 661\">
. . .
word_2_65' title='bbox 391 621 456 651; x_wconf 72' lang='eng' dir='ltr'>C50</span> <span class='ocrx_word' id='word_2_66' title='bbox 481 621 710 661; x_wconf 74' lang='eng' dir='ltr'>preguieoso.</span> 
     </span>
    </p>
   </div>
  </div>
"

This can be parsed using Common Lisp libraries such as Closure-XML and plump.

Tested on CCL and SBCL.

License: 

MIT

Author:
Edward Geist (egeist@stanford.edu)