hOCR (standard)


hOCR is an open standard that defines a data format for representing the results of optical character recognition (OCR). In addition to the recognized text, the format can record layout, recognition confidence, formatting, and other information. It is based on XHTML (or HTML). Metadata is stored in <meta> tags, following the Dublin Core convention for embedding metadata in HTML.
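
As an illustration of how text, layout, and confidence sit together in one document, the following minimal Python sketch parses a small hand-written hOCR fragment using only the standard library. The sample document, the engine name "example-engine", and the helper names parse_title, extract_metadata, and extract_words are assumptions made for this example; the class names ocr_page, ocr_line, and ocrx_word and the bbox and x_wconf properties follow common hOCR usage. The sketch assumes well-formed XHTML input.

```python
import xml.etree.ElementTree as ET

# Hand-written sample for illustration only; not the output of a real OCR engine.
SAMPLE_HOCR = """<html>
<head>
<meta name="ocr-system" content="example-engine 1.0"/>
</head>
<body>
<div class="ocr_page" title="bbox 0 0 800 600">
<span class="ocr_line" title="bbox 10 20 300 50">
<span class="ocrx_word" title="bbox 10 20 120 50; x_wconf 96">Hello</span>
<span class="ocrx_word" title="bbox 130 20 300 50; x_wconf 88">world</span>
</span>
</div>
</body>
</html>"""


def parse_title(title):
    """Split a title attribute such as 'bbox 1 2 3 4; x_wconf 90' into a dict."""
    props = {}
    for part in title.split(";"):
        tokens = part.split()
        if tokens:
            props[tokens[0]] = tokens[1:]
    return props


def extract_metadata(hocr_text):
    """Return the name/content pairs of all <meta> elements."""
    root = ET.fromstring(hocr_text)
    meta = {}
    for el in root.iter():
        # Compare only the local tag name so namespaced XHTML also works.
        if el.tag.split("}")[-1] == "meta" and el.get("name"):
            meta[el.get("name")] = el.get("content", "")
    return meta


def extract_words(hocr_text):
    """Return text, bounding box, and confidence for every ocrx_word element."""
    root = ET.fromstring(hocr_text)
    words = []
    for el in root.iter():
        if "ocrx_word" not in el.get("class", "").split():
            continue
        props = parse_title(el.get("title", ""))
        conf = props.get("x_wconf")
        words.append({
            "text": "".join(el.itertext()).strip(),
            "bbox": tuple(int(v) for v in props.get("bbox", [])),
            "confidence": float(conf[0]) if conf else None,
        })
    return words


if __name__ == "__main__":
    print(extract_metadata(SAMPLE_HOCR))
    for word in extract_words(SAMPLE_HOCR):
        print(word)
```

Real hOCR output is often HTML rather than strict XHTML, in which case an HTML parser would be needed in place of xml.etree.ElementTree.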

Software

The format was introduced in Google's OCRopus. In addition to OCRopus, it can be generated directly by CuneiForm, by HOCR (text recognition software specializing in Hebrew script), and, from version 3.0 onward, by Tesseract.

hocr-tools is a package of tools for processing (merging, splitting, inserting metadata) and analyzing hOCR data. hocr2pdf is a command-line tool that uses hOCR data to generate machine-searchable image PDF files.

References

  1. exactcode.de/site/open_source/exactimage/hocr2pdf