OCRopus

Basic data

Developer: Thomas Breuel, DFKI
Initial release: 2007
Current version: 1.3.3 (December 16, 2017)
Operating system: FreeBSD, Linux, macOS, Windows 10
Programming language: C++, Python
Category: Text recognition
License: Apache License
Website: github.com/tmbdev/ocropy

OCRopus (also ocropy) is free software for document analysis and text recognition with a highly modular design. It was developed with the support of Google Inc. under the direction of Thomas Breuel of the German Research Center for Artificial Intelligence (DFKI) in Kaiserslautern and released as free software under version 2.0 of the Apache License.

Description

OCRopus was designed primarily for large-scale retrodigitization projects for books, for example at Google Books, the Internet Archive, or libraries. It is intended to support a large number of languages and scripts. It can also be used for office applications or as an aid for visually impaired users.

The main components of OCRopus are:

One or more command-line scripts are available for each of these components. The modular approach makes it possible to build custom workflows and to swap out individual steps.

By default, OCRopus comes with a model for English texts and a model for texts in Fraktur. These models are tied to the typeface and are largely independent of the actual language. New characters or language variants can be covered either by training a new model from scratch or by training an existing model further.

Recognition is based on recurrent neural networks (LSTM) and works entirely without a language model. This makes it possible to train language-independent models; good recognition results have been shown for a single model applied to English, German, and French at the same time. Beyond the Latin writing system, results have been reported for other languages and scripts such as Sanskrit, Urdu, Devanagari, and Greek.

With appropriate training, very good recognition rates can be achieved. The additional effort is particularly worthwhile for difficult documents or for typefaces that are no longer in common use and are not the focus of other OCR software.
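
Training needs line images together with their correct transcriptions. The following is a minimal sketch of a pre-training check. It assumes the file layout commonly used with ocropy, in which a line image such as 010001.bin.png has its transcription in 010001.gt.txt in the same directory. The function name split_by_ground_truth is hypothetical and not part of OCRopus.

```python
from pathlib import Path

IMAGE_SUFFIX = ".bin.png"   # binarized line images
TRUTH_SUFFIX = ".gt.txt"    # assumed naming convention for transcriptions


def split_by_ground_truth(line_dir):
    """Return (ready, missing): line images with and without a usable transcript."""
    ready, missing = [], []
    for image in sorted(Path(line_dir).glob("*" + IMAGE_SUFFIX)):
        # 010001.bin.png -> 010001.gt.txt
        stem = image.name[: -len(IMAGE_SUFFIX)]
        transcript = image.with_name(stem + TRUTH_SUFFIX)
        # An empty transcript is treated like a missing one.
        if transcript.is_file() and transcript.read_text(encoding="utf-8").strip():
            ready.append(image)
        else:
            missing.append(image)
    return ready, missing
```

Calling split_by_ground_truth("book/0001") before a training run shows which lines still need to be transcribed.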

History

On April 9, 2007, OCRopus was announced as a Google-sponsored project to develop advanced OCR technologies. The funding was planned for three years and covered, in particular, doctoral and postdoctoral positions at the DFKI and the University of Kaiserslautern. In return, Google Book Search also used OCRopus for automatic text recognition. The software was placed under an open-source license from the outset to facilitate collaboration between industrial and academic research. OCRopus received further funding from the Andrew W. Mellon Foundation and the German Federal Ministry of Education and Research (BMBF). As part of the TextGrid project, text recognition for Fraktur was addressed.

The first alpha version, 0.1, was published on October 22, 2007, and various pre-release versions followed between December 2007 and May 2009. Version 0.4.4 reached a stable state in 2010. The program was originally developed in C++, Python, and Lua, with Jam as the build system. Version 0.5, released in 2012, brought a complete refactoring of the source code into Python modules.

Initially, Tesseract was the only recognition module. Since version 0.4 (2009), Tesseract has been supported only as a plugin; an in-house, likewise segment-based text recognizer was used instead. From 2013, recognition based on recurrent neural networks (LSTM) was offered as well, and since version 1.0 (November 2014) it has been the only recognizer.

The source code is managed via GitHub and is maintained and further developed by the developer community. The current version of OCRopus is 1.3.3 (December 2017).

Spin-offs

The OCR software Kraken is derived from OCRopus. Calamari is another descendant, based on OCRopy and Kraken.

Usage

[Figure: OCRopus workflow]

OCRopus is a pure command-line program. It is developed primarily for Linux but should run on many platforms as long as its dependencies are met. It is invoked by passing the input image on the command line. For finer control, options can be passed to carry out specific actions, such as recognizing a single line. The results are written to standard output (stdout) as HTML and CSS with special formatting (hOCR).
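
Because hOCR is ordinary HTML, the recognized lines can be read back with a standard HTML parser. The sketch below uses only the Python standard library and assumes that each text line is an element with the class ocr_line whose title attribute carries a "bbox x0 y0 x1 y1" property, as described in the hOCR format. The SAMPLE string is hand-written for illustration and is not actual OCRopus output; the names HocrLineParser, parse_bbox, and extract_lines are hypothetical.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Extract (x0, y0, x1, y1) from an hOCR title such as 'bbox 10 20 300 60; ...'."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrLineParser(HTMLParser):
    """Collect (bbox, text) for every element whose class list contains 'ocr_line'."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._tag = None    # tag name of the line element currently open, if any
        self._depth = 0     # nesting depth of that tag name inside the line
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if self._tag is None:
            attrs = dict(attrs)
            if "ocr_line" in (attrs.get("class") or "").split():
                self._tag, self._depth = tag, 1
                self._bbox = parse_bbox(attrs.get("title") or "")
                self._text = []
        elif tag == self._tag:
            self._depth += 1

    def handle_endtag(self, tag):
        if self._tag is not None and tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                self.lines.append((self._bbox, "".join(self._text).strip()))
                self._tag = None

    def handle_data(self, data):
        if self._tag is not None:
            self._text.append(data)


def extract_lines(hocr_text):
    parser = HocrLineParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.lines


# Hand-written sample, for illustration only.
SAMPLE = """
<div class="ocr_page" title="bbox 0 0 800 600">
  <span class="ocr_line" title="bbox 10 20 300 60">First line</span>
  <span class="ocr_line" title="bbox 10 70 300 110">Second line</span>
</div>
"""

for bbox, text in extract_lines(SAMPLE):
    print(bbox, text)
```

Keeping the bounding box next to the text makes it possible to map each recognized line back to its position on the page image.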

Example of calling the OCRopus scripts to recognize the text in an image:

# Binarization:
ocropus-nlbin tests/ersch.png -o book

# Layout analysis for the page:
ocropus-gpageseg book/0001.bin.png

# Text recognition of the lines (using the Fraktur model):
ocropus-rpred -m models/fraktur.pyrnn.gz book/0001/*.bin.png

# Generate HTML output:
ocropus-hocr book/0001.bin.png -o book/0001.html
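
For batch processing, the same four steps can be driven from a small script. The following is a minimal sketch that assumes the ocropus-* commands are on the PATH and reuses the paths from the example above. The function run_pipeline and its parameters are hypothetical and not part of OCRopus. With dry_run=True it only returns the commands it would run, which is useful for checking paths before starting a long job.

```python
import glob
import subprocess


def run_pipeline(image, book_dir="book", page="0001",
                 model="models/fraktur.pyrnn.gz", dry_run=False):
    """Run binarization, layout analysis, line recognition, and hOCR export in order."""
    page_image = f"{book_dir}/{page}.bin.png"
    line_pattern = f"{book_dir}/{page}/*.bin.png"

    def recognition_command():
        # Line images exist only after layout analysis, so the pattern is
        # expanded here, right before the recognizer is started.
        lines = [line_pattern] if dry_run else sorted(glob.glob(line_pattern))
        return ["ocropus-rpred", "-m", model] + lines

    steps = [
        lambda: ["ocropus-nlbin", image, "-o", book_dir],
        lambda: ["ocropus-gpageseg", page_image],
        recognition_command,
        lambda: ["ocropus-hocr", page_image, "-o", f"{book_dir}/{page}.html"],
    ]

    executed = []
    for make_command in steps:
        command = make_command()
        executed.append(command)
        if not dry_run:
            # check=True stops the pipeline as soon as one step fails.
            subprocess.run(command, check=True)
    return executed


for command in run_pipeline("tests/ersch.png", dry_run=True):
    print(" ".join(command))
```

Passing the commands as argument lists avoids shell quoting problems with file names; the price is that the wildcard has to be expanded in Python rather than by the shell.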
