Data extraction

How to successfully extract data from documents

Before extracting data you need to:

  1. Create a document type
  2. Build a dataset
  3. Train the document type

Once you have successfully completed the above steps, you can use the /extract-data endpoint to extract data.
We have prepared a recipe with steps to help you out.



Supported file types

typless supports the following file types:

  • PDF
  • JPG
  • PNG
  • TIFF

If you are working with scanned documents we recommend using a resolution of 300 DPI to achieve optimal results.

🚧

Are you having problems with document quality?

We wrote a short blog on how to solve your problems here.



Line item extraction

If you want to extract line items, make sure you have defined the line-item structure in the document type.

Request parameters

Param               Type                          Required  Details
document_type_name  string                        YES       Name of the document type that you use for extraction.
file                string (Base64 encoded file)  YES       The original file of the document that you are extracting data from.
file_name           string                        YES       Name of the original file of the document that you are extracting data from. Name must include the file type suffix, e.g. document.pdf
customer            string                        NO        Your internal customer identification, used for the .csv usage report, e.g. "my customer"
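
As a rough illustration of the parameters above, the following Python sketch assembles a request body. The helper name build_extract_payload and the document type name "simple-invoice" are hypothetical, and the sketch assumes the endpoint accepts a JSON body. Sending the request (base URL, authentication) is left out because those details are not covered in this section.

```python
import base64
import json


def build_extract_payload(file_bytes, file_name, document_type_name, customer=None):
    """Assemble the request parameters for /extract-data (hypothetical helper)."""
    if "." not in file_name:
        # file_name must include the file type suffix, e.g. document.pdf
        raise ValueError("file_name must include a file type suffix")
    payload = {
        "document_type_name": document_type_name,
        # The original file is sent as a Base64 encoded string.
        "file": base64.b64encode(file_bytes).decode("ascii"),
        "file_name": file_name,
    }
    if customer is not None:
        # Optional: internal customer identification for the usage report.
        payload["customer"] = customer
    return payload


# Placeholder bytes stand in for the contents of a real PDF file.
payload = build_extract_payload(b"%PDF-1.4 example bytes", "document.pdf", "simple-invoice")
body = json.dumps(payload)  # assumed JSON request body
```

The optional customer parameter is only added when it is provided, so the request stays minimal by default.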


Understanding response

In the response, you will always find all of the fields you defined in the document type. The values for each field are sorted by confidence score. If a field is not present in the document, its value is set to null.
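
A minimal sketch of how a client might pick a single value per field is shown here. The function name best_value and the sample field (including its second, low-confidence candidate) are illustrative assumptions. Rather than relying on the sort order, the sketch compares confidence scores explicitly, converting them from strings first.

```python
def best_value(field):
    """Return the most confident non-null candidate of one extracted field, or None."""
    candidates = [v for v in field["values"] if v["value"] is not None]
    if not candidates:
        # The field was not found on the document.
        return None
    # confidence_score arrives as a string such as "0.968", so convert before comparing.
    return max(candidates, key=lambda v: float(v["confidence_score"]))


# Illustrative field; only the keys used by best_value are included.
field = {
    "name": "total_amount",
    "values": [
        {"value": "47.5300", "confidence_score": "0.990"},
        {"value": "41.0000", "confidence_score": "0.120"},
    ],
    "data_type": "NUMBER",
}
top = best_value(field)

missing = {
    "name": "issue_date",
    "values": [{"value": None, "confidence_score": "0.000"}],
    "data_type": "DATE",
}
nothing = best_value(missing)
```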

Response base params

Param             Type    Details
file_name         string  Same value as provided in request
object_id         string  Id of document for sending feed-back to dataset.
extracted_fields  list    List of extracted fields
customer          string  Same value as provided in request

Extracted fields params

Param      Type    Behaviour
name       string  Name of the field you defined in the document type
values     list    List of values for the field
data_type  string  Type of the field you defined in the document type

Extracted values params

Param             Type    Behaviour
x                 int     X coordinate of the top-left bounding box corner. If value is null, x will be -1
y                 int     Y coordinate of the top-left bounding box corner. If value is null, y will be -1
width             int     Bounding box width. If value is null, width will be -1
height            int     Bounding box height. If value is null, height will be -1
value             string  Value for the field in standard format
confidence_score  string  Value between 0 and 1. The higher the value, the more confident the system is
page_number       int     Page on which the value is present. If value is null, page_number will be -1

📘

The supplier_name field will never carry any positional data, just the value!

All positional parameters (x, y, width, height, page_number) will always be -1; only value and confidence_score are populated.
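
Because -1 acts as a "no position" marker, client code may want to turn it into an explicit missing value before drawing or cropping. The sketch below is one possible way to do that. The function name bounding_box is hypothetical, and the coordinates are returned in whatever units the response uses.

```python
def bounding_box(value):
    """Return (left, top, right, bottom) for one extracted value, or None without a position."""
    coords = (value["x"], value["y"], value["width"], value["height"])
    if any(c == -1 for c in coords):
        # -1 marks a value that carries no positional data.
        return None
    x, y, width, height = coords
    return (x, y, x + width, y + height)


# Candidates in the shape described by the extracted values params.
positioned = {"x": 1989, "y": 545, "width": 323, "height": 54,
              "value": "20190500005890", "confidence_score": "0.250", "page_number": 0}
unpositioned = {"x": -1, "y": -1, "width": -1, "height": -1,
                "value": "ScaleGrid", "confidence_score": "0.968", "page_number": -1}

box = bounding_box(positioned)
no_box = bounding_box(unpositioned)
```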


Example full response

{
    "file_name": "invoice_2.pdf",
    "object_id": "1cb25cc8-c9fa-4149-9a83-b4ed6a2173b9",
    "extracted_fields": [
        {
            "name": "supplier",
            "values": [
                {
                    "x": -1,
                    "y": -1,
                    "width": -1,
                    "height": -1,
                    "value": "ScaleGrid",
                    "confidence_score": "0.968",
                    "page_number": -1
                }
            ],
            "data_type": "AUTHOR"
        },
        {
            "name": "invoice_number",
            "values": [
                {
                    "x": 1989,
                    "y": 545,
                    "width": 323,
                    "height": 54,
                    "value": "20190500005890",
                    "confidence_score": "0.250",
                    "page_number": 0
                },
                {
                    "x": 167,
                    "y": 574,
                    "width": 391,
                    "height": 54,
                    "value": "GB123456789",
                    "confidence_score": "0.250",
                    "page_number": 0
                }
            ],
            "data_type": "STRING"
        },
        {
            "name": "issue_date",
            "values": [
                {
                    "x": 2072,
                    "y": 628,
                    "width": 240,
                    "height": 54,
                    "value": "2019-06-05",
                    "confidence_score": "0.358",
                    "page_number": 0
                }
            ],
            "data_type": "DATE"
        },
        {
            "name": "total_amount",
            "values": [
                {
                    "x": 2146,
                    "y": 1196,
                    "width": 126,
                    "height": 54,
                    "value": "47.5300",
                    "confidence_score": "0.990",
                    "page_number": 0
                }
            ],
            "data_type": "NUMBER"
        }
    ],
    "line_items": [
        [
            {
                "name": "Description",
                "values": [
                    {
                        "x": 208,
                        "y": 1196,
                        "width": 1022,
                        "height": 50,
                        "value": "5/2019-MongoBackend-MgmtStandalone-Small-744 hours",
                        "confidence_score": "0.661",
                        "page_number": 0
                    }
                ],
                "data_type": "STRING"
            },
            {
                "name": "Price",
                "values": [
                    {
                        "x": 2146,
                        "y": 1196,
                        "width": 126,
                        "height": 54,
                        "value": "47.5300",
                        "confidence_score": "0.582",
                        "page_number": 0
                    }
                ],
                "data_type": "NUMBER"
            },
            {
                "name": "Quantity",
                "values": [
                    {
                        "x": 1979,
                        "y": 1196,
                        "width": 23,
                        "height": 54,
                        "value": "1",
                        "confidence_score": "0.647",
                        "page_number": 0
                    }
                ],
                "data_type": "NUMBER"
            }
        ]
    ],
    "customer": null
}
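
In the response above, line_items is a list of rows, and each row is a list of fields shaped like the entries of extracted_fields. A small sketch of flattening that structure into one dictionary per row follows. The function name line_items_to_rows is an assumption, and the sample is trimmed to the keys the function reads.

```python
def line_items_to_rows(response):
    """Convert line_items into a list of {field name: most confident value} dictionaries."""
    rows = []
    for item in response.get("line_items", []):
        row = {}
        for field in item:
            values = field["values"]
            if values:
                # confidence_score is a string, so convert it before comparing.
                best = max(values, key=lambda v: float(v["confidence_score"]))
                row[field["name"]] = best["value"]
            else:
                row[field["name"]] = None
        rows.append(row)
    return rows


# Trimmed copy of the line item from the example response.
response = {
    "line_items": [
        [
            {"name": "Description", "data_type": "STRING", "values": [
                {"value": "5/2019-MongoBackend-MgmtStandalone-Small-744 hours",
                 "confidence_score": "0.661"}]},
            {"name": "Price", "data_type": "NUMBER", "values": [
                {"value": "47.5300", "confidence_score": "0.582"}]},
            {"name": "Quantity", "data_type": "NUMBER", "values": [
                {"value": "1", "confidence_score": "0.647"}]},
        ]
    ]
}
rows = line_items_to_rows(response)
```

Values are kept as the strings returned in the response; converting NUMBER fields to numeric types is left to the caller.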

Example null response

{
    "file_name": "invoice.pdf",
    "object_id": "26e01d82-e7f4-48d3-a902-b74283b73279",
    "extracted_fields": [
        {
            "name": "total_amount",
            "values": [
                {
                    "x": -1,
                    "y": -1,
                    "width": -1,
                    "height": -1,
                    "value": null,
                    "confidence_score": "0.000",
                    "page_number": -1
                }
            ],
            "data_type": "NUMBER"
        },
        {
            "name": "invoice_number",
            "values": [
                {
                    "x": -1,
                    "y": -1,
                    "width": -1,
                    "height": -1,
                    "value": null,
                    "confidence_score": "0.000",
                    "page_number": -1
                }
            ],
            "data_type": "STRING"
        },
        {
            "name": "issue_date",
            "values": [
                {
                    "x": -1,
                    "y": -1,
                    "width": -1,
                    "height": -1,
                    "value": null,
                    "confidence_score": "0.000",
                    "page_number": -1
                }
            ],
            "data_type": "DATE"
        },
        {
            "name": "supplier",
            "values": [
                {
                    "x": -1,
                    "y": -1,
                    "width": -1,
                    "height": -1,
                    "value": null,
                    "confidence_score": "0.000",
                    "page_number": -1
                }
            ],
            "data_type": "AUTHOR"
        }
    ],
    "line_items": [],
    "customer": null
}
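
A null response like the one above can be detected by checking which fields have no non-null value. The following sketch shows one way to do it. The function name missing_fields is hypothetical, and the sample is trimmed to the keys the function reads.

```python
def missing_fields(response):
    """Return the names of fields for which every candidate value is null."""
    return [
        field["name"]
        for field in response["extracted_fields"]
        if all(v["value"] is None for v in field["values"])
    ]


# Trimmed sample mixing one found field with two null fields.
response = {
    "extracted_fields": [
        {"name": "total_amount", "data_type": "NUMBER",
         "values": [{"value": None, "confidence_score": "0.000"}]},
        {"name": "invoice_number", "data_type": "STRING",
         "values": [{"value": None, "confidence_score": "0.000"}]},
        {"name": "supplier", "data_type": "AUTHOR",
         "values": [{"value": "ScaleGrid", "confidence_score": "0.968"}]},
    ],
    "line_items": [],
    "customer": None,
}
not_found = missing_fields(response)
```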