Text recognition

from Wikipedia, the free encyclopedia
[Image: Example of automatic text recognition]

Text recognition, also known as optical character recognition (abbreviated OCR), is a term from information technology. It describes the automated recognition of text within images.


Text recognition is necessary because optical input devices (scanners or digital cameras, but also fax receivers) can only deliver raster graphics, i.e. dots of different colors (pixels) arranged in rows and columns. Text recognition is the task of recognizing the depicted letters as such, i.e. identifying them and assigning each one the numerical value given to it by the usual text encodings (ASCII, Unicode). Automatic text recognition and OCR are often used as synonyms in the German-speaking world. From a technical point of view, however, OCR refers only to the sub-task of matching separated image segments, as candidates for individual characters, against stored patterns. This OCR step is preceded by a global structure analysis, in which text blocks are first distinguished from graphic elements, then the line structure is recognized, and finally individual characters are segmented. When deciding which character is present, further algorithms can take the linguistic context into account.
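The final step described above, assigning each recognized character its numerical value under a standard text encoding, can be illustrated with Python's built-in `ord` and `chr` functions, which map between characters and Unicode code points:

```python
# Once a glyph has been recognized, OCR assigns it the numeric value
# defined by the text encoding (ASCII / Unicode).
recognized_glyphs = ["H", "i", "!"]

codepoints = [ord(ch) for ch in recognized_glyphs]
print(codepoints)  # [72, 105, 33]

# The mapping is reversible: the code points reproduce the text.
text = "".join(chr(cp) for cp in codepoints)
print(text)  # Hi!
```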

Originally, specially designed fonts were developed for automatic text recognition and were used, for example, for printing cheque forms. These fonts were designed so that an OCR reader could distinguish the individual characters quickly and reliably. The font OCR-A (DIN 66008, ISO 1073-1) is characterized by particularly dissimilar characters, especially among the digits. OCR-B (ISO 1073-2) resembles an ordinary sans-serif, non-proportional font, while OCR-H (DIN 66225) is modeled on handwritten digits and capital letters.

The increased performance of modern computers and improved algorithms now also allow the recognition of "normal" printed fonts and even handwriting (for example, in mail sorting). However, where human readability is not a priority, barcodes are used instead, since they are easier to print and to identify.

Modern text recognition now encompasses more than OCR, i.e. the translation of individual characters. In addition, methods of context analysis, known as Intelligent Character Recognition (ICR), are used to correct the raw OCR results. A character initially recognized as "8" can be corrected to a "B" if it appears inside a word: instead of "8aum", "Baum" (German for "tree") is recognized, whereas a genuine alphanumeric combination such as "8th" should not be converted. In industrial text recognition, combined OCR/ICR systems are used. The boundaries of the term OCR are fluid, however, because OCR and ICR also serve as marketing terms for positioning technical developments. Intelligent Word Recognition (IWR) also falls into this category. This approach attempts to recognize cursive handwriting, in which the individual characters cannot be cleanly separated and therefore cannot be recognized by conventional OCR methods.
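The "8" to "B" correction described above can be sketched as a simple dictionary-guided substitution pass. This is a minimal toy illustration, not a real ICR engine; the confusion table and dictionary are made up for the example:

```python
# Toy ICR-style correction: digits that OCR commonly confuses with
# letters are substituted when, and only when, the result is a known
# dictionary word. Table and dictionary are illustrative assumptions.
CONFUSIONS = {"8": "B", "0": "O", "1": "l", "5": "S"}
DICTIONARY = {"Baum", "Boot", "Bild"}

def icr_correct(token: str) -> str:
    # Purely numeric tokens are left alone, so "2019" is not rewritten.
    if token.isdigit() or token in DICTIONARY:
        return token
    candidate = "".join(CONFUSIONS.get(ch, ch) for ch in token)
    # Only accept the substitution if it yields a dictionary word.
    return candidate if candidate in DICTIONARY else token

print(icr_correct("8aum"))  # Baum
print(icr_correct("8th"))   # 8th  ("Bth" is no word, so it is kept)
```

Note how "8th" survives unchanged: the substituted form "Bth" is not in the dictionary, so the alphanumeric combination is preserved, exactly the behavior the text calls for.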

A fundamentally different approach is used for handwriting recognition on touch screens or input fields (PDAs, etc.). Here, vector-based patterns are processed, either "offline" as a whole word or "online" with additional analysis of the stroke input (for example, Apple's Inkwell).

A special form of text recognition arises, for example, in the automated processing of incoming mail at large companies. One task is to sort the documents. The content does not always have to be analyzed for this; it is sometimes sufficient to recognize rough features, such as the characteristic layout of forms or company logos. As with OCR, certain types of text are classified by pattern recognition, but here the patterns relate globally to the entire sheet, or to defined regions of it, rather than to individual letters.


Procedure

The starting point is an image file (raster graphics) generated from the original using a scanner, digital camera, or video camera. Text recognition itself takes place in three stages:

Page and outline recognition

The image file is divided into relevant areas (texts, captions) and irrelevant areas (images, white areas, lines).

Pattern recognition

Error correction at the pixel level

The raw pixels can be corrected based on their relationships to neighboring pixels: isolated pixels are deleted, and missing pixels are filled in. This increases the hit rate of a pure pattern comparison and depends strongly on the contrast of the original.
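This neighborhood-based cleanup can be sketched on a binary raster: a set pixel with no set neighbors is treated as noise and deleted, and an unset pixel whose neighbors are all set is treated as a scanning gap and filled. A minimal sketch on toy data:

```python
# Toy pixel-level error correction on a binary raster (1 = ink, 0 = paper).
def clean(grid):
    h, w = len(grid), len(grid[0])

    def neighbors(y, x):
        # All 8-connected neighbors, clipped at the image border.
        return [grid[j][i]
                for j in range(max(0, y - 1), min(h, y + 2))
                for i in range(max(0, x - 1), min(w, x + 2))
                if (j, i) != (y, x)]

    out = [row[:] for row in grid]  # decisions use the original grid
    for y in range(h):
        for x in range(w):
            ns = neighbors(y, x)
            if grid[y][x] == 1 and sum(ns) == 0:
                out[y][x] = 0  # delete isolated speckle
            elif grid[y][x] == 0 and sum(ns) == len(ns):
                out[y][x] = 1  # fill isolated hole
    return out

noisy = [
    [1, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],  # hole at (1, 1) is surrounded by ink
    [1, 1, 1, 0, 0],
    [0, 0, 0, 0, 1],  # speckle at (3, 4) has no ink neighbors
]
cleaned = clean(noisy)
print(cleaned[1][1], cleaned[3][4])  # 1 0
```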

Pattern comparison (mapping)

The pixel patterns in the text areas are compared with patterns in a database , and raw digital copies are generated.
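The pattern comparison can be illustrated with a nearest-template classifier: each candidate bitmap is compared against a small template database, and the template with the smallest pixel-wise (Hamming) distance wins. The 3x3 "glyphs" below are toy data, not a real font:

```python
# Minimal template-matching sketch: 3x3 bitmaps flattened to tuples.
TEMPLATES = {
    "I": (0, 1, 0,  0, 1, 0,  0, 1, 0),
    "L": (1, 0, 0,  1, 0, 0,  1, 1, 1),
    "T": (1, 1, 1,  0, 1, 0,  0, 1, 0),
}

def classify(bitmap):
    # Hamming distance: number of differing pixels.
    def distance(a, b):
        return sum(x != y for x, y in zip(a, b))
    return min(TEMPLATES, key=lambda ch: distance(TEMPLATES[ch], bitmap))

# A noisy "T": the lower stem pixel is missing.
noisy_t = (1, 1, 1,  0, 1, 0,  0, 0, 0)
print(classify(noisy_t))  # T
```

Real systems compare far richer features than raw pixels, but the principle, pick the database pattern closest to the observed segment, is the same.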

Character-level error correction (Intelligent Character Recognition, ICR)

The raw digitized text is compared against dictionaries and evaluated by linguistic and statistical methods for its probable freedom from errors. Depending on this evaluation, the text is output or, if necessary, fed back into layout or pattern recognition with changed parameters.

Word-level error correction (Intelligent Word Recognition, IWR)

Cursive handwriting, in which the individual characters cannot be recognized separately, is compared with dictionaries on the basis of global word features. The hit accuracy decreases as the size of the dictionary grows, since the likelihood of confusion increases. Typical areas of application are form fields with a limited range of possible entries, for example handwritten addresses on envelopes.
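The idea of matching on global word features can be sketched with a coarse word-shape profile (ascender, descender, or x-height per letter) compared against the profiles of dictionary entries. The feature choice is purely illustrative:

```python
# Toy IWR-style matching: a whole word is reduced to a shape profile
# (a = ascender, d = descender, x = x-height) instead of being split
# into characters. Real systems use much richer global features.
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gjpqy")

def shape(word):
    return "".join(
        "a" if c in ASCENDERS else "d" if c in DESCENDERS else "x"
        for c in word.lower()
    )

DICTIONARY = ["hello", "world", "paper"]
PROFILES = {w: shape(w) for w in DICTIONARY}

def iwr_match(observed_profile):
    # All dictionary words with the same global shape; the larger the
    # dictionary, the more collisions (confusions) this produces.
    return [w for w, p in PROFILES.items() if p == observed_profile]

print(iwr_match(shape("hello")))  # ['hello']
```

The collision behavior also shows why the text notes that accuracy drops as the dictionary grows: with more entries, more words share the same coarse profile.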

Manual error correction

Many programs also offer a special mode for manual correction of characters that are not reliably recognized.

Coding in the output format

In the simplest case, a text file is generated in a defined encoding such as UTF-8. Depending on the task, the output can also go to a database or a PDF file. Specialized output formats such as the XML-based ALTO and PAGE, or hOCR, an HTML variant, store the text together with layout information.
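A minimal sketch of what hOCR-style output looks like: recognized words are embedded in HTML together with their bounding boxes, so that text and layout are preserved side by side. The words and coordinates here are made up for illustration:

```python
# Generate a minimal hOCR-like fragment: each word carries its
# bounding box (bbox x0 y0 x1 y1) as in the hOCR convention.
words = [("Hello", (10, 10, 60, 30)), ("world", (70, 10, 130, 30))]

spans = " ".join(
    f'<span class="ocrx_word" title="bbox {x0} {y0} {x1} {y1}">{w}</span>'
    for w, (x0, y0, x1, y1) in words
)
hocr = f'<div class="ocr_page"><span class="ocr_line">{spans}</span></div>'
print(hocr)
```

A downstream tool can recover both the plain text (by stripping tags) and the position of every word on the page (from the bbox attributes).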

The quality of text recognition is determined by several factors, including:

  • Quality of the layout recognition,
  • Extent and quality of the pattern database,
  • Size and quality of the dictionaries,
  • Quality of the error-correction algorithms,
  • Color, contrast, layout, and font of the original document,
  • Resolution and quality of the image file.

The number of undetected errors in a document can be estimated (see misspellings). While running text contains redundancy and therefore tolerates a higher error rate, lists of digits, such as telephone numbers, require repeated proofreading.

Success through neural networks

Recently, artificial neural networks have often outperformed competing learning methods in handwriting recognition. Between 2009 and 2012, the recurrent and deep feedforward neural networks of Jürgen Schmidhuber's research group at the Swiss AI laboratory IDSIA won a series of eight international pattern recognition competitions. In particular, their recurrent LSTM networks won three connected handwriting recognition competitions at the 2009 International Conference on Document Analysis and Recognition (ICDAR), without built-in a priori knowledge of the three different languages to be learned. The LSTM networks learned segmentation and recognition simultaneously. These were also the first international competitions won by means of deep learning or recurrent networks.

Deep feedforward networks such as Kunihiko Fukushima's convolutional network of the 1980s (the Neocognitron) are again important for handwriting recognition today. They have alternating convolutional layers and layers of neurons that compete with one another. Yann LeCun's team at New York University applied the well-known backpropagation algorithm to such networks in 1989. Modern variants use so-called "max pooling" for the competition stages. Finally, the deep network is topped with several fully connected layers of neurons. Fast GPU implementations of this combination were introduced by Scherer and colleagues in 2010. They have since won numerous handwriting and other pattern recognition competitions. GPU-based max-pooling convolutional networks were also the first methods able to recognize the handwritten digits of the MNIST benchmark as well as humans do.
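The two operations named above, convolution followed by max pooling, can be shown in a few lines of pure Python. This is a numerical sketch of the mechanism only, far from a trainable network; the image and kernel are toy data:

```python
# A 2x2 convolution followed by 2x2 max pooling, the building blocks
# of the convolutional networks described in the text.
def conv2d(img, kernel):
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(img) - kh + 1, len(img[0]) - kw + 1
    return [[sum(img[y + j][x + i] * kernel[j][i]
                 for j in range(kh) for i in range(kw))
             for x in range(w)] for y in range(h)]

def max_pool(img, size=2):
    # Keep only the strongest response in each size x size region.
    return [[max(img[y + j][x + i] for j in range(size) for i in range(size))
             for x in range(0, len(img[0]) - size + 1, size)]
            for y in range(0, len(img) - size + 1, size)]

image = [
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
]
# Vertical-edge kernel: responds where intensity changes left to right.
kernel = [[1, -1], [1, -1]]
features = conv2d(image, kernel)  # each row is [0, -2, 0, 2]
pooled = max_pool(features)
print(pooled)  # [[0, 2], [0, 2]]
```

Max pooling makes the strongest edge response dominate each region, which is what gives these networks a degree of translation tolerance.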

For printed text, too, there is a trend to replace classic character-by-character recognition with line-by-line recognition using neural networks. This technique is used in the programs OCRopus and Tesseract (from version 4).


Areas of application

  • Retrieval of text information from image files so that it can be edited further in a word processor or searched electronically
  • Recognition of relevant features (e.g. postal code, contract number, invoice number) for the mechanical (postal routing) or electronic (workflow management system) sorting of documents
  • Extended full-text search in databases or document management systems, making PDFs and images searchable
  • Recognition of features for the registration and, if necessary, tracking of objects (e.g. license plates)
  • Layout recognition: a formatted document is created that matches the original as closely as possible in terms of text, image, and table arrangement
  • Aids for the blind: text recognition makes it possible for blind people to read scanned texts on a computer with a Braille display, or to have them read aloud via speech output

OCR software

Proprietary software

As an auxiliary function in proprietary software:

Cloud based:

  • ABBYY Cloud OCR
  • Google Cloud Vision (Beta)
  • Microsoft Azure Computer Vision API
  • OCR.space Online OCR (proprietary, but freely usable)
  • TextScan Online OCR

Free software

References

  1. 2012 Kurzweil AI interview with Jürgen Schmidhuber about the eight competitions that his deep learning team won between 2009 and 2012
  2. Graves, Alex; Schmidhuber, Jürgen: Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks. In: Bengio, Yoshua; Schuurmans, Dale; Lafferty, John; Williams, Chris K. I.; Culotta, Aron (eds.): Advances in Neural Information Processing Systems 22 (NIPS 2009), December 7–10, 2009, Vancouver, BC. Neural Information Processing Systems (NIPS) Foundation, 2009, pp. 545–552; a preprint of the same name is available at http://people.idsia.ch/~juergen/nips2009.pdf
  3. A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, J. Schmidhuber. A Novel Connectionist System for Improved Unconstrained Handwriting Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no.5, 2009.
  4. Schmidhuber, Jürgen: Winning Handwriting Recognition Competitions Through Deep Learning, http://www.idsia.ch/~juergen/handwriting.html
  5. Bengio, Y. (2009): Learning Deep Architectures for AI. Now Publishers. (Archived copy in the Internet Archive, memento of March 21, 2014.)
  6. Schmidhuber, Jürgen: My First Deep Learning System of 1991 + Deep Learning Timeline 1962–2013, http://www.idsia.ch/~juergen/firstdeeplearner.html
  7. Fukushima, K.: Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. In: Biological Cybernetics 36, No. 4, 1980, pp. 193–202. doi:10.1007/BF00344251.
  8. Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel: Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation, 1(4): 541–551, 1989.
  9. M. Riesenhuber, T. Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience , 1999. PDF
  10. Dominik Scherer, Andreas C. Müller, and Sven Behnke: Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition. In: 20th International Conference on Artificial Neural Networks (ICANN), pp. 92–101, 2010. doi:10.1007/978-3-642-15825-4_10.
  11. J. Schmidhuber , 2009–2013: Deep Learning since 1991: First Deep Learners to Win Contests in Pattern Recognition, Object Detection, Image Segmentation, Sequence Learning, Through Fast & Deep / Recurrent Neural Networks. www.deeplearning.it
  12. D. C. Ciresan, U. Meier, J. Schmidhuber: Multi-column Deep Neural Networks for Image Classification. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) 2012. http://www.idsia.ch/~juergen/cvpr2012.pdf
  13. ABBYY Cloud OCR SDK. Retrieved December 4, 2017.
  14. Vision API - analysis of image content | Google Cloud Platform. Retrieved December 4, 2017 .
  15. Computer Vision API - Image Processing | Microsoft Azure. Retrieved December 4, 2017.
  16. OCR.space Free Online OCR. Retrieved March 15, 2019 .
  17. TextScan OCR. Retrieved October 25, 2019 .