OCR Parameters

The "OCR" web service runs character recognition on PDF documents or images.

 

If recognition is run on images, they are converted to a PDF document. Each image becomes one page of that document; the page contains the original image and a text layer with the recognized text.

 

By default, character recognition only works on PDF documents that do not already contain text (see forceEachPage for documents in which only some pages contain text). Typically, these are documents generated by scanners, in which each page consists of a single image.

 

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<operation xmlns="http://schema.webpdf.de/1.0/operation">
  <ocr language="deu"
       checkResolution="false"
       forceEachPage="true"
       imageDpi="200"
       jpegQuality="60"
       outputFormat="pdf"
       normalizePageRotation="true"
       failOnWarning="true">
    <page width="210"
          height="297"
          metrics="mm"/>
    <pdfa>
      ...
    </pdfa>
  </ocr>
</operation>

 

{
  "ocr": {
    "language": "deu",
    "checkResolution": false,
    "forceEachPage": true,
    "imageDpi": 200,
    "jpegQuality": 60,
    "outputFormat": "pdf",
    "normalizePageRotation": true,
    "failOnWarning": true,
    "page": {
      "width": 210,
      "height": 297,
      "metrics": "mm"
    },
    "pdfa": {
      ...
    }
  }
}
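
The JSON structure above can also be assembled programmatically. The following is a minimal Python sketch, not part of the web service itself: the function name build_ocr_operation and its keyword arguments are hypothetical, and the default values are the ones documented for each parameter further down. How the resulting JSON is sent to the server is outside the scope of this section and is not shown.

```python
import json


def build_ocr_operation(language="eng", check_resolution=True,
                        force_each_page=False, image_dpi=200,
                        jpeg_quality=75, output_format="pdf",
                        normalize_page_rotation=False,
                        fail_on_warning=False, page=None):
    """Return a dict mirroring the documented "ocr" JSON structure.

    Hypothetical helper: the keyword names are illustrative, only the
    JSON keys and default values come from this documentation.
    """
    ocr = {
        "language": language,
        "checkResolution": check_resolution,
        "forceEachPage": force_each_page,
        "imageDpi": image_dpi,
        "jpegQuality": jpeg_quality,
        "outputFormat": output_format,
        "normalizePageRotation": normalize_page_rotation,
        "failOnWarning": fail_on_warning,
    }
    # The "page" element is optional; add it only if a custom size is wanted.
    if page is not None:
        ocr["page"] = page
    return {"ocr": ocr}


# Rebuild the sample from above (without the elided "pdfa" element).
operation = build_ocr_operation(
    language="deu",
    check_resolution=False,
    force_each_page=True,
    jpeg_quality=60,
    normalize_page_rotation=True,
    fail_on_warning=True,
    page={"width": 210, "height": 297, "metrics": "mm"},
)
print(json.dumps(operation, indent=2))
```

Keeping the defaults in one place makes it easy to pass only the parameters that differ from them.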

 

ocr element

 

Used to define the text recognition parameters for images or PDF documents.

 

<ocr language="deu"
     checkResolution="false"
     forceEachPage="true"
     imageDpi="200"
     jpegQuality="60"
     outputFormat="pdf"
     normalizePageRotation="true"
     failOnWarning="true">

"ocr": {
  "language": "deu",
  "checkResolution": false,
  "forceEachPage": true,
  "imageDpi": 200,
  "jpegQuality": 60,
  "outputFormat": "pdf",
  "normalizePageRotation": true,
  "failOnWarning": true,
  ...
}

 

language (default: "eng")

Specifies the language of the source document (PDF/image). The language must be set for the character recognition operation (OCR) so that language-specific characters (e.g. "üäö" in German) are recognized more reliably. At present, the following languages are supported:

eng = English

fra = French

spa = Spanish

deu = German

ita = Italian

 

checkResolution (default: true)

If "true", the DPI resolution of the source file is checked. Resolutions of less than 200 DPI are rejected by this check because, as a rule, they do not produce good character recognition results.

 

forceEachPage (default: false)

If a PDF document contains text on any page, the web service refuses to run character recognition on it. If "true" is passed for this option, each page of the document is considered individually instead: character recognition is run on every page that does not contain text, and a new text layer is generated for those pages.
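
The page selection described for forceEachPage can be modelled in a few lines of Python. This is only a sketch of the documented behavior, not the service's implementation: the function name pages_to_recognize is hypothetical, the input is assumed to be one boolean per page stating whether that page already contains text, and raising ValueError merely stands in for the refusal, since the way the service reports it is not described here.

```python
def pages_to_recognize(page_has_text, force_each_page=False):
    """Return the zero-based indices of pages that would be recognized.

    page_has_text: list of booleans, one per page (assumed input format).
    """
    if not force_each_page:
        # Default behavior: any existing text makes the service refuse.
        if any(page_has_text):
            raise ValueError("document already contains text")
        return list(range(len(page_has_text)))
    # forceEachPage: look at each page on its own, skip pages with text.
    return [i for i, has_text in enumerate(page_has_text) if not has_text]


# A three-page document whose second page already has a text layer.
mixed_document = [False, True, False]
print(pages_to_recognize(mixed_document, force_each_page=True))
```

With the default setting, the same three-page document would be refused as a whole.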

 

imageDpi (default: 200)

Sets the minimum resolution at which images are embedded in the resulting PDF document. If a value of 0 is set for this parameter, the images are embedded with resolutions and dimensions as close as possible to those of the original source images.

 

jpegQuality (default: 75)

A percentage that sets the compression ratio, and thereby the quality, of the JPEG images embedded in the resulting PDF document. Higher values result in less compressed images of higher quality.

 

outputFormat (default: "pdf")

Character recognition can produce different output formats. By default, the result is a PDF document, but it can also be written as a plain text document or as an XML document (hOCR) if desired.

text = Text

hocr = XML (hOCR)

pdf = PDF

 

normalizePageRotation (default: false)

If "true", the system attempts to rotate pages with rotated text so that the text in the document no longer appears rotated and is shown "upright".

 

failOnWarning (default: false)

If "true", character recognition fails even on warnings that do not prevent recognition but that make a meaningful result very unlikely.
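
Before sending a request, the values documented for the ocr element can be checked on the client side. The sketch below is illustrative only: validate_ocr_parameters is a hypothetical helper, the language and output format sets are copied from the lists above, and the range checks (jpegQuality between 0 and 100, imageDpi not negative) are assumptions derived from the words "percentage" and "resolution", not limits stated by this documentation.

```python
SUPPORTED_LANGUAGES = {"eng", "fra", "spa", "deu", "ita"}
OUTPUT_FORMATS = {"text", "hocr", "pdf"}


def validate_ocr_parameters(ocr):
    """Return a list of problems found in an "ocr" dict (empty if none)."""
    problems = []

    language = ocr.get("language", "eng")
    if language not in SUPPORTED_LANGUAGES:
        problems.append(f"unsupported language: {language!r}")

    output_format = ocr.get("outputFormat", "pdf")
    if output_format not in OUTPUT_FORMATS:
        problems.append(f"unsupported outputFormat: {output_format!r}")

    # Assumption: a percentage is expected to lie between 0 and 100.
    jpeg_quality = ocr.get("jpegQuality", 75)
    if not 0 <= jpeg_quality <= 100:
        problems.append(f"jpegQuality out of range: {jpeg_quality}")

    # Assumption: 0 means "keep original", negative values make no sense.
    image_dpi = ocr.get("imageDpi", 200)
    if image_dpi < 0:
        problems.append(f"imageDpi must not be negative: {image_dpi}")

    return problems


for problem in validate_ocr_parameters({"language": "nld", "jpegQuality": 60}):
    print(problem)
```

Returning a list instead of raising on the first error lets a caller report all problems at once.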

 

page element

 

If images are converted to PDF documents during the character recognition process, the size of the page will be computed based on the size of the image and the DPI resolution. This element can be used to specify a custom page size instead.

 

<page width="210"
      height="297"
      metrics="mm"/>

"page": {
  "width": 210,
  "height": 297,
  "metrics": "mm"
}

 

width (default: 210)

height (default: 297)

Width and height of the page in the PDF document, given in the unit set by metrics.

 

metrics (default: "mm")

Unit for the page size arguments.

mm = Millimetres
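
When no page element is given, the page size is derived from the image size and the DPI resolution, as described at the start of this section. The usual conversion from pixels to millimetres can be sketched as follows. The function name page_size_mm is hypothetical, and whether the service rounds or otherwise adjusts the result is not documented here, so this only shows the underlying arithmetic (1 inch = 25.4 mm).

```python
MM_PER_INCH = 25.4


def page_size_mm(width_px, height_px, dpi):
    """Convert an image size in pixels to a page size in millimetres."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    # pixels / (pixels per inch) = inches; inches * 25.4 = millimetres
    width_mm = width_px / dpi * MM_PER_INCH
    height_mm = height_px / dpi * MM_PER_INCH
    return width_mm, height_mm


# An A4 page scanned at 200 DPI is about 1654 x 2339 pixels.
width_mm, height_mm = page_size_mm(1654, 2339, 200)
print(f"{width_mm:.1f} mm x {height_mm:.1f} mm")
```

The example yields a size very close to the 210 x 297 mm defaults of the page element.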

 

 

pdfa element

 

The parameter structure described for the "Pdfa" web service can be inserted into this element.

 

<pdfa>
  ...
</pdfa>

"pdfa": {
  ...
}

 

Hint: If this element is set, the "Pdfa" web service is called automatically after the conversion.