The "OCR" web service can be used to run character recognition in PDF documents or images.
If recognition is run on images, they will be converted to PDF documents. More specifically, one page will be generated in the PDF document for each image, and this page will contain the original image and a text layer with the recognized text.
Character recognition on PDF documents will only work with documents that do not contain text already. Normally, these will be documents that were generated by scanners and that only have an image per page in the PDF document.
ocr element
Used to define the text recognition parameters for images or PDF documents.
Used to specify the language for the output document (PDF/image). The language must be defined for the character recognition operation (OCR) so that the "special characters" of the respective language (e.g. "üäö" in German) can be recognized better. At present, the following languages are supported:
eng = English
fra = French
spa = Spanish
deu = German
ita = Italian
checkResolution (default: true) If "true", the DPI resolution of the output file will be checked. Resolutions of less than 200 DPI are rejected in this check because, as a rule, they do not produce good results for character recognition.
forceEachPage (default: false) If a PDF document contains text content on any page, the web service will refuse to run character recognition again. If, however, a value of "true" is passed for this option, all the pages in the document will be considered individually and character recognition will be run on all pages that do not contain text (layers) so that a new layer with text will be generated for them.
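The page selection described for forceEachPage can be modelled in a few lines. The following is only a sketch of the documented behaviour, not the service's implementation; the function name pages_to_recognize and the list-of-booleans input are hypothetical and chosen purely for illustration.

```python
from typing import List


def pages_to_recognize(page_has_text: List[bool], force_each_page: bool = False) -> List[int]:
    """Return the zero-based indices of pages that would receive a new text layer.

    page_has_text[i] is True if page i already contains text.
    (Hypothetical helper: it models the documented rule only.)
    """
    if not force_each_page:
        # Default behaviour: any existing text makes the service refuse the document.
        if any(page_has_text):
            raise ValueError("document already contains text; recognition refused")
        return list(range(len(page_has_text)))
    # forceEachPage = true: consider each page individually and skip pages with text.
    return [index for index, has_text in enumerate(page_has_text) if not has_text]
```

For a three-page document whose second page already contains text, calling the function with force_each_page=True selects the first and third pages, while the default setting raises an error for the whole document.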
imageDpi (default: 200) Used to set the minimum resolution at which images will be embedded in resulting PDF documents. When a value of 0 is set for this parameter, the images will be embedded using resolutions and dimensions as close as possible to those of the original source images.
jpegQuality (default: 75) A percentage that sets the compression ratio, and therefore the quality, of JPEG images embedded in resulting PDF documents. Higher values will result in less compressed images of higher quality.
outputFormat (default: "pdf") Different output formats can be created during character recognition. Generally, the document is generated as a PDF document, but the output can also be a text document or an XML document (hOCR) if desired.
text = Text
hocr = XML (hOCR)
pdf = PDF
normalizePageRotation (default: false) If "true", the system will attempt to rotate pages containing rotated text so that the text in the document no longer appears rotated and is shown "upright".
failOnWarning (default: false) If "true", character recognition will fail even in the event of warnings that do not prevent recognition, but that make it very unlikely for a meaningful result to be generated.
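The parameters of the ocr element can be assembled and checked before a request is sent. The sketch below builds such an element with Python's standard library, using the parameter names, defaults and permitted values listed above. Several details are assumptions that this section does not confirm: that the parameters are written as XML attributes, that the language is passed in an attribute called "language", that jpegQuality is limited to 0 to 100, and that no XML namespace is needed. The helper name build_ocr_element is hypothetical.

```python
import xml.etree.ElementTree as ET

# Values taken from the parameter descriptions above.
SUPPORTED_LANGUAGES = {"eng", "fra", "spa", "deu", "ita"}
OUTPUT_FORMATS = {"text", "hocr", "pdf"}
DEFAULTS = {
    "checkResolution": True,
    "forceEachPage": False,
    "imageDpi": 200,
    "jpegQuality": 75,
    "outputFormat": "pdf",
    "normalizePageRotation": False,
    "failOnWarning": False,
}


def _to_text(value):
    """Serialize booleans as "true"/"false" and everything else via str()."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def build_ocr_element(language, **overrides):
    """Build an <ocr> element; unknown or invalid settings raise ValueError."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    options = {**DEFAULTS, **overrides}
    if options["outputFormat"] not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output format: {options['outputFormat']!r}")
    # Assumption: a percentage is limited to the range 0..100.
    if not 0 <= options["jpegQuality"] <= 100:
        raise ValueError("jpegQuality must be between 0 and 100")
    element = ET.Element("ocr")
    element.set("language", language)  # assumption: attribute name is not given above
    for name, value in options.items():
        element.set(name, _to_text(value))
    return element


if __name__ == "__main__":
    ocr = build_ocr_element("deu", outputFormat="hocr", forceEachPage=True)
    print(ET.tostring(ocr, encoding="unicode"))
```

Validating the values on the client side is a design choice for this sketch; the service performs its own checks, and the exact request format should be taken from the service's schema.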
page element
If images are converted to PDF documents during the character recognition process, the size of the page will be computed based on the size of the image and the DPI resolution. This element can be used to specify a custom page size instead.
width (default: 210) height (default: 297) Height and width of the page in the PDF document.
metrics (default: "mm") Unit for the page size arguments. mm = Millimetres
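When no custom page size is given, the page size follows from the image size and the DPI resolution. The exact computation used by the service is not specified here; the sketch below shows the standard conversion from pixels to millimetres (25.4 mm per inch), with a hypothetical helper name.

```python
MM_PER_INCH = 25.4


def page_size_mm(width_px, height_px, dpi):
    """Convert an image size in pixels to a physical size in millimetres."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return (width_px / dpi * MM_PER_INCH, height_px / dpi * MM_PER_INCH)


if __name__ == "__main__":
    width, height = page_size_mm(1654, 2339, 200)
    print(f"{width:.1f} x {height:.1f} mm")
```

An image of 1654 x 2339 pixels at 200 DPI therefore corresponds to a page of roughly 210 x 297 mm, which matches the default width and height listed above.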
pdfa element
You can use the parameter structure described for the PDF/A service and insert it into this element.