cl-tesseract

2017-11-30

CFFI bindings to the Tesseract OCR library.

Upstream URL

github.com/GOFAI/cl-tesseract

Author

Edward Geist

License

MIT
README
CL-TESSERACT is a set of CFFI bindings for the Tesseract OCR library v. 3.04: https://github.com/tesseract-ocr/tesseract

On OS X, Tesseract can be conveniently installed using Homebrew:

    brew install tesseract

Because Tesseract OCR’s C API changed in the update to v. 3.04, earlier versions such as 3.02 will not work with these bindings.

CL-TESSERACT also provides convenient Lisp functions to retrieve text from images, IMAGE-TO-TEXT and IMAGE-TO-HOCR. IMAGE-TO-TEXT accepts a Lisp pathname and an optional language parameter and returns a Unicode string:

    * (image-to-text #P"~/eurotext.tif")
    "The (quick) [brown] {fox} jumps! Over the $43,456.78 <lazy> #90 dog & duck/goose, as 12.5% of E-mail from aspammer@website.com is spam. Der ,,schnelle” braune Fuchs springt fiber den faulen Hund. Le renard brun «rapide» saute par-dessus le chien paresseux. La volpe marrone rapida salta sopra i] cane pigro. El zorro marrén répido salta sobre el perro perezoso. A raposa marrom répida salta sobre 0 C50 preguieoso. "

    * (image-to-text #P"~/eurotext.tif" :lang "rus")
    "ТЬе (чиісК) [Ьгошп] {Гох} ]итрз! Очег [пе $43‚456.78 <1а2у> #90 603 & ‹1исК/3005е, аз 12.5% ог Е-таіі Ггот азраттег@шеЬ5і[е.сош із зрат. Бег ‚,5с11пе11е” Ьгаипе Риспз зргіпві ііЬег ‹!еп Тапіеп Нипа. Ье гепага Ьгип «гарісіе» заше раг-сіеззиз 1е сЬіеп рагеззеих. Ьа уоіре тапопе гаріаа зама зорга і] сапе рівго. Е1 гогго таггбп гёріао зама воЬге е1 репо регегозо. А гароза шапот гйріаа зака воЬге о еде ргевиісозо. "

The available languages depend on the Tesseract OCR .traineddata files located in the directory denoted by *TESSDATA-DIRECTORY*. CL-TESSERACT attempts to set this variable to a reasonable default for your platform.
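To see which values the language parameter can take on a given machine, the .traineddata files can be listed directly. The following is a minimal sketch in portable Common Lisp, not part of CL-TESSERACT: the helper name LIST-TRAINEDDATA-LANGUAGES is hypothetical, and it assumes the directory is given as a directory pathname (ending in a slash) and that each language is stored as <code>.traineddata.

```lisp
;; Hypothetical helper, not provided by CL-TESSERACT.
(defun list-traineddata-languages (tessdata-directory)
  "Return the sorted language codes of the .traineddata files in TESSDATA-DIRECTORY.
TESSDATA-DIRECTORY must name a directory, i.e. end in a slash."
  (let ((pattern (make-pathname :name :wild :type "traineddata"
                                :defaults (pathname tessdata-directory))))
    ;; PATHNAME-NAME of eng.traineddata is "eng", which is the language code.
    (sort (mapcar #'pathname-name (directory pattern)) #'string<)))

;; Possible usage, assuming the variable is exported from a package named
;; CL-TESSERACT and holds a directory pathname (both are assumptions):
;; (list-traineddata-languages cl-tesseract:*tessdata-directory*)
```

Any code returned this way should then be usable as the :lang argument of IMAGE-TO-TEXT, provided the file is a valid Tesseract language pack.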
IMAGE-TO-HOCR accepts a Lisp pathname, the optional language parameter, and an optional page number (default 0) and returns hOCR XML describing not just the recognized text, but also its location on the page:

    * (image-to-hocr #P"~/python-tesseract/eurotext.jpg")
    " <div class='ocr_page' id='page_2' title='image \"/Users/Walrus/python-tesseract/eurotext.jpg\"; bbox 0 0 1024 800; ppageno 1'>
    <div class='ocr_carea' id='block_2_1' title=\"bbox 98 66 918 661\">
    . . .
    word_2_65' title='bbox 391 621 456 651; x_wconf 72' lang='eng' dir='ltr'>C50</span>
    <span class='ocrx_word' id='word_2_66' title='bbox 481 621 710 661; x_wconf 74' lang='eng' dir='ltr'>preguieoso.</span>
    </span> </p> </div> </div> "

This can be parsed using Common Lisp libraries such as Closure-XML and plump.

Tested on CCL and SBCL.

License: MIT

Author: Edward Geist (egeist@stanford.edu)
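In the hOCR output shown in the README, the position of each element is carried in its title attribute, for example 'bbox 391 621 456 651; x_wconf 72'. Once an XML parser has extracted that attribute, its value still has to be split into properties. The sketch below does this in plain Common Lisp; the function names SPLIT-ON, PARSE-HOCR-TITLE and HOCR-BBOX are hypothetical and not part of CL-TESSERACT, and the code assumes properties are separated by semicolons and fields by single spaces, as in the sample above (a quoted image path that contains spaces or semicolons would need extra handling).

```lisp
;; Hypothetical helpers, not provided by CL-TESSERACT.
(defun split-on (char string)
  "Split STRING on CHAR, dropping empty pieces."
  (loop with start = 0
        for end = (position char string :start start)
        for piece = (subseq string start end)
        unless (string= piece "") collect piece
        while end
        do (setf start (1+ end))))

(defun parse-hocr-title (title)
  "Return an alist mapping each property name in TITLE to its string values."
  (loop for property in (split-on #\; title)
        for fields = (split-on #\Space property)
        when fields collect fields))

(defun hocr-bbox (title)
  "Return the bbox in TITLE as a list (x0 y0 x1 y1) of integers, or NIL."
  (let ((entry (assoc "bbox" (parse-hocr-title title) :test #'string=)))
    (when entry
      (mapcar #'parse-integer (rest entry)))))

;; (hocr-bbox "bbox 391 621 456 651; x_wconf 72")
;; => (391 621 456 651)
```

Values are kept as strings in PARSE-HOCR-TITLE so that non-numeric properties, such as the image path on the page element, do not signal an error; only HOCR-BBOX converts to integers.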

Dependencies (1)

  • cffi

Dependents (0)
