Tesseract (software)

from Wikipedia, the free encyclopedia
Basic data

Maintainer: Ray Smith et al.
Developer: Google LLC, HP Inc.
Current version: 4.1.1 (December 26, 2019)
Current preview version: 5.0.0-alpha (October 30, 2019)
Operating system: Windows, Linux, macOS, Cygwin, OS/2 or eComStation/ArcaOS
Programming language: C++
Category: Text recognition
License: Apache License, version 2.0
Available in German: Yes
Website: github.com/tesseract-ocr

Figure: Tesseract 3.03 in a Linux console

Tesseract is free software for text recognition (OCR). Its focus is on recognizing characters and lines of text, but it can also split a page into text blocks (layout analysis). To improve recognition rates, Tesseract uses language models such as dictionaries.

Recognition data for well over 100 languages and language variants is available as add-on modules. Besides Latin Antiqua typefaces, Tesseract supports Fraktur, Devanagari (Indian script), Chinese, Arabic, Greek, Hebrew, Cyrillic and other scripts.

History

The software was originally developed by Hewlett-Packard between 1984 and 1994 for its scanners, but was never used in a product there. In a 1995 test by the University of Nevada, Las Vegas (UNLV), it was one of the three most accurate candidates. After HP withdrew from the OCR market, development largely stalled until the code was handed over to UNLV's Information Science Research Institute in 2005. There it turned out that the former developer Ray Smith was now working at Google. Asked whether it was interested in the code, Google took it over, brought it up to date and released it the same year under the Apache License via SourceForge.

For free software, this meant a great leap in the quality of text recognition. The project later migrated from SourceForge to Google's own software development platform, Google Code, where it was developed further under Google's supervision. Since 2015, development has taken place on GitHub.

Since 2006, the program has been developed further as the basis of Google Books. Version 3.0 (September 2010) added direct output in hOCR format and a new module for page layout analysis.

Version 3.02 (October 28, 2012) introduced, among other things, the recognition of Arabic and Hebrew text in bidirectional mode.

The tesseractindic project aims to make the program usable with languages of the Indian language family.

At the end of 2016, a neural network for text recognition was introduced into Tesseract. Version 4 supports this new method, but can still work with the pattern matching of earlier versions.

Since December 2018, Tesseract has been able to output OCR results in the standardized ALTO format.

According to Google, the company uses Tesseract for text recognition on mobile devices and in videos, as well as for spam detection in e-mail images.

Usage

Tesseract is controlled from the command line and follows the usual Unix conventions, even under Windows. A call has the following form:

tesseract imagename outputbase [-l lang] [--oem ocrenginemode] [--psm pagesegmode] [configfiles...] 

Tesseract reads the image in Tagged Image File Format (TIFF) and writes the recognized text to the output file. Older versions of Tesseract had no layout analysis of their own, so they relied on external software such as OCRopus to split text columns into separate image files. Current versions use the Leptonica library for page layout analysis and also for reading all common image formats directly.
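The call format above can be sketched in Python as a small helper that assembles the argument list. This is only an illustration: the function name build_tesseract_command, the file names and the option values (deu, 3) are assumptions chosen for the example, and nothing is executed, so Tesseract does not need to be installed to try it.

```python
import shlex


def build_tesseract_command(image, output_base, lang=None, oem=None,
                            psm=None, configfiles=()):
    """Assemble the argument list for one Tesseract call.

    Hypothetical helper: it only mirrors the synopsis
    tesseract imagename outputbase [-l lang] [--oem N] [--psm N] [configfiles...]
    and does not validate any of the values.
    """
    cmd = ["tesseract", str(image), str(output_base)]
    if lang is not None:
        cmd += ["-l", lang]            # language data to load
    if oem is not None:
        cmd += ["--oem", str(oem)]     # OCR engine mode
    if psm is not None:
        cmd += ["--psm", str(psm)]     # page segmentation mode
    cmd += list(configfiles)           # optional config file names come last
    return cmd


# Example values are assumptions, not recommendations.
cmd = build_tesseract_command("page.tif", "page", lang="deu", psm=3)
print(shlex.join(cmd))  # tesseract page.tif page -l deu --psm 3
```

To actually run the command, the list could be passed to subprocess.run(cmd, check=True), provided the tesseract executable is on the search path.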

Automated processing can be implemented with ImageMagick, for example.

Since version 3, Tesseract can save the recognition results in hOCR format, which preserves the page layout. Searchable PDF files can also be generated directly with this version.
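How hOCR preserves the page layout can be illustrated with a short Python sketch that reads word positions back out of an hOCR document. It assumes the hOCR convention that each recognized word is a span element with the class ocrx_word whose title attribute contains "bbox x0 y0 x1 y1" in pixels. The names parse_bbox, HocrWordParser and extract_words are hypothetical, and the sample markup is invented for the example rather than taken from a real Tesseract run.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs from span elements of class ocrx_word."""

    def __init__(self):
        super().__init__()
        self.words = []    # finished (text, bbox) tuples
        self._depth = 0    # open span depth while inside a word, else 0
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._depth:                # nested span inside a word
            self._depth += 1
            return
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            self._depth = 1
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._depth:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1
            if self._depth == 0:       # the word span itself was closed
                self.words.append(("".join(self._text).strip(), self._bbox))


def extract_words(hocr):
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Invented sample markup, for illustration only.
sample = (
    "<div class='ocr_page' title='bbox 0 0 200 50'>"
    "<span class='ocr_line' title='bbox 10 10 190 40'>"
    "<span class='ocrx_word' title='bbox 10 10 60 40; x_wconf 96'>Hello</span> "
    "<span class='ocrx_word' title='bbox 70 10 130 40; x_wconf 91'>world</span>"
    "</span></div>"
)
for text, bbox in extract_words(sample):
    print(text, bbox)
```

Because every word carries its own bounding box, a downstream tool can place the recognized text exactly over the scanned image, which is the basis for a searchable text layer.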

A range of software integrates Tesseract as a backend. Tesseract can be used as the character recognition module in OCRopus, which additionally offers document structure analysis and statistical language models. Earlier versions of OCRopus used Tesseract as the default module; from version 0.4 on, OCRopus defaults to its own character recognition module based on neural networks. Tesseract is also one of several possible backends for character recognition in the desktop OCR application OCRFeeder. In the Linux-based document management system Archivista, it is used, for example via hocr2pdf, to generate a text layer for raster images of scanned paper documents so that they become machine-searchable.

Availability

Tesseract is distributed as free software in source code form under the terms of the Apache License, version 2.0 (Apache Software License, ASL). In practically all common Linux distributions, it can be installed directly from the standard package repositories. Installers for Windows are available from several providers.

Among others, the following programs use Tesseract as the basis of their text recognition:

  • gImageReader is a free graphic frontend and runs on Linux and Windows.
  • gscan2pdf is a document scanning program for Linux.
  • ecoDMS is a commercial document management system for Windows, Linux and macOS.
  • Nextcloud OCR is a free extension for Nextcloud that enables text recognition for all image and PDF files.
  • Office Manager Pro is a commercial document management system for Windows.
  • FreeOCR for Windows is available as version 5.4 (March 2015).
  • TesseractOCR Mac makes it available for macOS as well.
  • YAGF is one of several front ends that can be used on Linux.
  • PDFScanner is a program for scanning documents on Macs.
  • k2pdfopt is a platform-independent open source program that optimizes PDF files for e-readers. It can add a Tesseract-based OCR layer on top of a scanned PDF file. The MS Windows version offers a GUI.
  • Capture2Text is a utility that quickly recognizes text from a screenshot.
  • (a9t9) Free OCR is an open source (GPL) Tesseract frontend for Windows desktop.
  • Tesseract.js is a port of Tesseract to JavaScript, created with the help of Emscripten.
  • Tesseract Studio .Net is another open source Tesseract front end for Windows.
  • Apache Tika uses Tesseract to find text in image files.
  • VietOCR is an open source (Apache License) GUI frontend for Tesseract and runs on Linux, macOS, Windows and other operating systems.
  • OCRmyPDF adds a text layer to existing scanned PDFs with the help of Tesseract.
