Extract contents

The extraction operation element extracts various kinds of content (mostly text) from a PDF document.

 

To select what is extracted, insert the corresponding content element into the extraction element.

 

These elements are as follows:

 

text = The PDF document’s text content

Generates an ASCII text, XML, or JSON file that will be returned as a result when the web service is called and that will contain all texts in the PDF document.

 

links = All the links in the PDF document

Generates an ASCII text, XML, or JSON file that will be returned as a result when the web service is called and that will contain all supported links in the PDF document that match the selection. In the ASCII file, every link is written to a separate line.

 

info = General information about the PDF document

Generates an XML or JSON file that will be returned as a result when the web service is called. This file will contain information about the PDF document such as the corresponding security settings, PDF properties, or PDF/A status.

 

words = All the words in the PDF document, with page and position information

Generates an ASCII text, XML, or JSON file that will be returned as a result when the web service is called. For each word in the text, the file will contain the page number and the X-axis and Y-axis coordinates on the relevant page. When the "text" output format is selected, only the words themselves are output, separated with line breaks.

 

paragraphs = Text content of the PDF document, separated by paragraphs

Generates an ASCII text, XML, or JSON file that will be returned as a result when the web service is called and that will contain all texts in the PDF document separated by paragraphs.

For this to work, the paragraphs must be present in the PDF as elements. A purely visual separation (for example, extra spacing between blocks of text) is not sufficient.

 

images = The PDF document’s image contents

Generates a ZIP file that is returned as a result when the web service is called. This file will contain all the images contained at the page level in a freely selectable page range.

 

tip

The format of the document generated with the "extraction" operation is described by the http://schema.webpdf.de/1.0/extraction/text.xsd schema for "<text>", "<links>", "<words>", and "<paragraphs>" and by the http://schema.webpdf.de/1.0/extraction/info.xsd schema for "<info>".

 

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<operation xmlns="http://schema.webpdf.de/1.0/operation">
<extraction>
  <text pages="" fileFormat="xml"/>
</extraction>
</operation>

{
"extraction": {
  "text": {
    "pages": "",
    "fileFormat": "xml"
   }
 }
}

 

General attributes for the content elements:

 

pages (default: "")

Used to define which page(s) should be used for the extraction mode. The page number can be either an individual page, a page range, or a list (separated with commas) (e.g., "1,5-6,9"). A blank value or "*" selects all pages of the PDF document.

 

fileFormat (default: "xml")

Used to define the output format for the PDF document text contents being extracted.

 

text = Text document

xml = XML document

json = JSON data structure
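
As an illustration of how the two general attributes combine, the following sketch requests the paragraphs of pages 1, 5 to 6, and 9 as plain text. The page list is the example value from the pages description above; choosing the paragraphs element is an assumption made purely for illustration.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<operation xmlns="http://schema.webpdf.de/1.0/operation">
<extraction>
  <!-- pages: individual pages and ranges, separated with commas -->
  <!-- fileFormat: one of "text", "xml", or "json" -->
  <paragraphs pages="1,5-6,9" fileFormat="text"/>
</extraction>
</operation>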

 

 

Special attributes for the words content element:

 

delimitAfterPunctuation (default: true)

If this attribute is set to true, a new word will be started after each punctuation mark.

 

extendedSequenceCharacter (default: false)

This attribute specifies whether quotation marks and apostrophes should be handled the same way as brackets (such as parentheses and square brackets), i.e., whether they should be placed before the word they enclose. For an example, please refer to the portal description.

 

removePunctuation (default: false)

Used to specify whether punctuation marks should be included in the export or whether they should be explicitly removed.
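
A minimal sketch of a words element that sets all three special attributes, assuming they are written next to the general pages and fileFormat attributes. The values are illustrative only and are not recommended settings.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<operation xmlns="http://schema.webpdf.de/1.0/operation">
<extraction>
  <!-- delimitAfterPunctuation="true": start a new word after each punctuation mark (the default) -->
  <!-- extendedSequenceCharacter="false": do not treat quotation marks like brackets (the default) -->
  <!-- removePunctuation="true": remove punctuation marks from the export (default is false) -->
  <words pages="*" fileFormat="json"
       delimitAfterPunctuation="true"
       extendedSequenceCharacter="false"
       removePunctuation="true"/>
</extraction>
</operation>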

 

 

The links element also contains the text sub-element:

 

links element

 

<links pages="" fileFormat="xml">
<text fromText="true" protocol="http" withoutProtocol="true"/>
</links>

"links": {
"text": {
  "fromText": true,
  "protocol": "http",
  "withoutProtocol": true
 }
}

 

text element

 

<text fromText="true"
     protocol="http"
     withoutProtocol="true"/>

"text": {
"fromText": true,
"protocol": "http",
"withoutProtocol": true
}

 

fromText (default: false)

Advanced mode for extracting links. In this mode, links are not extracted from annotations but directly from the text. Links that are not found in standard mode can therefore be found with this advanced mode, provided that they are present in the form of text.

 

protocol (default: "")

Filters the links being extracted by protocol. Multiple protocols must be separated with commas (e.g., "http,https,ftp"). The following values are valid: "http", "https", "ftp", "telnet", "mailto", "file", "nntp", and "notes".

 

withoutProtocol (default: true)

When enabled, incomplete URLs from which the protocol information is missing will be extracted as well when extracting links from text. This would apply to the following examples, for instance:

"www.webpdf.de" - There is no protocol information. If the option is enabled and "http" links are searched for, the link will be extracted.

"ftp.softvision.de" - There is no protocol information here either. If the option is enabled and "ftp" links are searched for, the link will be extracted.

 
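
Putting the three attributes of the text sub-element together, the following sketch extracts "http" and "ftp" links directly from the text of all pages, including incomplete URLs such as the two examples above. The protocol combination and the "text" output format are choices made for illustration.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<operation xmlns="http://schema.webpdf.de/1.0/operation">
<extraction>
  <links pages="*" fileFormat="text">
    <!-- fromText="true": search the text itself instead of annotations -->
    <!-- protocol: comma-separated filter; withoutProtocol: also accept URLs lacking a protocol -->
    <text fromText="true" protocol="http,ftp" withoutProtocol="true"/>
  </links>
</extraction>
</operation>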

images element

 

<images fileFormat="zip"
     pages="*"
     fileNameTemplate="file[%d]"
     folderNameTemplate="page[%d]"
     fallbackFormat="png"/>

"images": {
  "fileFormat": "zip",
  "pages": "*",
  "fileNameTemplate": "file[%d]",
  "folderNameTemplate": "page[%d]",
  "fallbackFormat": "png"
}

 

hint

The images mode can only be used to extract raster graphics (bitmap images). The extraction of vector graphics, as well as the rendering of vector graphics based on vectorial drawing paths, is not supported.

 

hint

For licensing reasons, the images mode currently only supports the extraction of basic JPEG2000 images that conform to the Part 1 core coding system definition in ISO/IEC 15444-1.

 

hint

It cannot be guaranteed that an image will be exported in its original source format, as the image may have already been converted when embedded in the PDF (this depends on whether the source format was supported by the PDF standard and on the application that was used to embed the image).

 

fileFormat (default: "zip")

Used to define the output format for the PDF document images being extracted.

 

zip = ZIP archive

 

pages (default: "")

Used to define which page(s) should be used for the images mode. The page number can be an individual page, a page range, or a list (separated with commas) (e.g., "1,5-6,9"). A blank value or "*" selects all pages of the PDF document.

 

fileNameTemplate (default: "file[%d]")

Used to set the template for the image files in the returned ZIP file. "file[%d]", for example, would result in a "file[1].png" entry for a PNG image.

 

folderNameTemplate (default: "page[%d]")

Used to set the template for the page folders in the returned ZIP file. "page[%d]", for example, would result in a folder called "page[1]" for page 1, etc.

 

fallbackFormat (default: "png")

Used to specify the format that should be used as the fallback format if extracting an image would result in a format that is not supported.

 

png = PNG file

jpeg = JPEG file
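
A sketch of an images element that is restricted to a page range and uses JPEG as the fallback format. The page range and the template names "img[%d]" and "p[%d]" are hypothetical values chosen for illustration; only the "[%d]" placeholder form is taken from the defaults described above.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<operation xmlns="http://schema.webpdf.de/1.0/operation">
<extraction>
  <!-- fileFormat: "zip" is the only documented output format for images -->
  <!-- fallbackFormat: used when an image's own format is not supported -->
  <images fileFormat="zip"
       pages="2-4"
       fileNameTemplate="img[%d]"
       folderNameTemplate="p[%d]"
       fallbackFormat="jpeg"/>
</extraction>
</operation>

Going by the template descriptions above, the images found on page 2 would presumably be placed in a folder named "p[2]" in the returned ZIP file, with entries such as "img[1].png".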