Optical character recognition (OCR)

webPDF uses the integrated toolbox "tesseract" (version 3.05) for optical character recognition (OCR), i.e. to convert image formats into PDF documents with text content. This toolbox is used by the "OCR" web service and by the portal page.

 

The external toolbox is located in the "tesseract/" subdirectory of webPDF's installation directory.

 

The "OCR" web service converts image source documents in the TIFF, JPEG or PNG format into PDF documents. The resulting PDF document shows the original image and, behind it (in a PDF layer), contains the text extracted by OCR. As a result, the PDF document is searchable and can, for example, be indexed.
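
The restriction to TIFF, JPEG and PNG can be checked before a document is submitted. The following is a minimal sketch, not part of webPDF: the function name detect_ocr_input_format is hypothetical, and the check relies only on the well-known file signatures of the three formats.

```python
from typing import Optional

# File signatures ("magic bytes") of the three input formats named above.
_SIGNATURES = {
    "PNG": (b"\x89PNG\r\n\x1a\n",),
    "JPEG": (b"\xff\xd8\xff",),
    "TIFF": (b"II*\x00", b"MM\x00*"),  # little-endian and big-endian TIFF
}


def detect_ocr_input_format(data: bytes) -> Optional[str]:
    """Return "PNG", "JPEG" or "TIFF" if data starts with a known signature.

    Hypothetical helper for illustration; returns None for anything else.
    """
    for name, prefixes in _SIGNATURES.items():
        # bytes.startswith accepts a tuple of alternative prefixes.
        if data.startswith(prefixes):
            return name
    return None
```

A caller could read the first few bytes of a file and skip the OCR request when the result is None.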

 

The "tesseract" toolbox is a freely available OCR engine. It delivers good recognition results, provided that the source images have a resolution of at least 200 DPI. Nevertheless, bear in mind that recognition is not error-free. Images with a resolution below 200 DPI often lead to poor results. Also bear in mind that the OCR engine does not support handwriting (or fonts that resemble handwriting).
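
The 200 DPI recommendation can be checked in advance for PNG input. The sketch below is an illustration only and makes several assumptions: it handles PNG files only, it reads the resolution from the optional pHYs chunk, and the names png_dpi and meets_min_dpi are hypothetical. It says nothing about how webPDF itself evaluates resolution.

```python
import struct
from typing import Optional, Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
MIN_DPI = 200  # threshold taken from the text above


def png_dpi(data: bytes) -> Optional[Tuple[float, float]]:
    """Return (x_dpi, y_dpi) from the PNG pHYs chunk, or None if unavailable."""
    if not data.startswith(PNG_SIGNATURE):
        return None
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        # Each chunk: 4-byte length, 4-byte type, body, 4-byte CRC.
        length, chunk_type = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        if chunk_type == b"pHYs" and len(body) == 9:
            x_ppu, y_ppu, unit = struct.unpack(">IIB", body)
            if unit != 1:  # 1 = pixels per metre; 0 = unit unknown
                return None
            return x_ppu * 0.0254, y_ppu * 0.0254
        if chunk_type in (b"IDAT", b"IEND"):
            break  # pHYs, if present, precedes the image data
        pos += 12 + length
    return None


def meets_min_dpi(data: bytes, min_dpi: float = MIN_DPI) -> bool:
    """True if both axes reach min_dpi; False if lower or unknown."""
    dpi = png_dpi(data)
    # pHYs stores whole pixels per metre, so 200 DPI is stored as 7874
    # and converts back to just under 200; round before comparing.
    return dpi is not None and round(min(dpi)) >= min_dpi
```

The CRC of each chunk is not verified here, and a PNG without a pHYs chunk is treated as "resolution unknown" rather than as too low by png_dpi.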

 

OCR requires a certain amount of data for character recognition to succeed. Recognition may therefore fail for individual pages that contain only a few lines of text. You can have such cases reported with error codes (see the "failOnWarning" parameter).

 

OCR can be run on rotated pages and allows the page to be “normalized,” i.e., rotated in such a way that the text in the target document no longer appears rotated.

 

It is also important to specify the language of the source document when using the web service so that the “special characters” of that language (such as öäü in German) are recognized. At present, the following languages are supported (see the "language" parameter):

 

English

French

Spanish

German

Italian

 

Further languages can be set up provided that they are stored in the "tesseract/tessdata" folder and a corresponding entry is added in the "tesseract/languages.xml" file.
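
To see which language data is present in the "tesseract/tessdata" folder, the directory can be scanned for Tesseract's "<code>.traineddata" files. The following sketch is a hedged illustration: the function names are hypothetical, the three-letter codes are Tesseract's own codes for the five languages listed above, and whether the "language" parameter accepts exactly these values is an assumption. It does not check the additional entry in "tesseract/languages.xml".

```python
from pathlib import Path
from typing import List

# Tesseract language codes for the five languages listed above.
# Assumption: these match the traineddata files shipped with webPDF.
DEFAULT_LANGUAGE_CODES = {
    "English": "eng",
    "French": "fra",
    "Spanish": "spa",
    "German": "deu",
    "Italian": "ita",
}


def installed_tessdata_languages(install_dir: str) -> List[str]:
    """List the codes of all *.traineddata files under tesseract/tessdata."""
    tessdata = Path(install_dir) / "tesseract" / "tessdata"
    if not tessdata.is_dir():
        return []
    # The file stem is the code, e.g. "deu.traineddata" -> "deu".
    # Note: the folder may also hold non-language data such as "osd".
    return sorted(p.stem for p in tessdata.glob("*.traineddata"))


def missing_default_languages(install_dir: str) -> List[str]:
    """Return the names of listed languages whose data file is absent."""
    installed = set(installed_tessdata_languages(install_dir))
    return [
        name
        for name, code in DEFAULT_LANGUAGE_CODES.items()
        if code not in installed
    ]
```

Calling missing_default_languages with the webPDF installation directory returns an empty list when all five data files are found.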

 

Hint

At present, languages that use a "Multibyte Character Set" (MBCS) are not supported. These are, for example, Asian or Arabic languages.