Extract structured data from your collection

This feature is in beta. We are actively evaluating its usability and quality. Please share your feedback through [email protected].

Extract Structured Data Using Google's Pinpoint

You can use Pinpoint to extract structured data from a collection of similarly-formatted digitized or scanned PDF documents into a set of spreadsheets.

This feature works best with collections whose documents:

  • Share the same template
  • Share the same reading order (left-to-right or right-to-left only)
  • Use a form-like or tabular format, or a combination of both

For example, if you have ten thousand scanned auto accident reports that use a similar form, you can import the scans and export a spreadsheet that enables you to group, sort, or filter accidents by date, automobile manufacturer, or any other fields provided in the source documents.
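Once you have an exported spreadsheet, the grouping and sorting described above can be done in a spreadsheet application or in a short script. The following is a minimal sketch, assuming the export is a CSV file; the column names ("document", "date", "manufacturer") and the sample rows are hypothetical and only illustrate the idea.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample of an exported spreadsheet. The column names and
# values are assumptions for illustration, not Pinpoint's actual output.
SAMPLE_CSV = """document,date,manufacturer
report_001.pdf,2021-03-04,Toyota
report_002.pdf,2021-01-15,Ford
report_003.pdf,2021-02-20,Toyota
"""

def load_rows(csv_text):
    """Parse CSV text into a list of dicts keyed by column header."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def group_by(rows, column):
    """Group rows by the value found in the given column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row)
    return dict(groups)

rows = load_rows(SAMPLE_CSV)
by_maker = group_by(rows, "manufacturer")
# ISO-formatted dates (YYYY-MM-DD) sort correctly as plain text.
sorted_by_date = sorted(rows, key=lambda r: r["date"])
```

If the dates in your documents are not in ISO format, you would need to parse them before sorting.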

You must have full access to Pinpoint to use this feature. If you don't have full access, you can request full access using this form.

 

Prepare your Pinpoint collection

  • Navigate to the collection containing the documents from which you want to extract structured data
  • If you don’t have a collection in Pinpoint for processing, create a new collection with the documents from which you want to extract structured data
  • Make sure your collection has been fully processed by Pinpoint. Depending on the size and number of files, processing can take up to 24 hours
  • Click the “Extract Structured Data” link on the lower-left side of the collection view
  • Click the “Process collection” button. Processing can take from seconds to hours, depending on the size of your collection
  • Once processing is complete, click “Annotate collection”

If you add documents to a processed Pinpoint collection, you need to reprocess the collection. See “Reprocess annotated collection” for more detail.

Choose golden document

The Extract Structured Data tool will direct you to the annotation editor page and automatically select a “golden” document for you. This is a single document in which you create an annotation template to be applied to all of the documents in the same collection.

If you think the selected golden document is not the best fit for annotation, you can replace it with another document in the collection. See “Replace golden document”.

If the document template in your collection has many optional fields, we recommend choosing the document with the most optional fields filled in as the golden document. This gives the best match with all of the documents in your collection.

In the rare case where not all desired fields are covered in a single golden document, you can then add more golden documents to accommodate additional optional fields. See “Add golden document”.

Annotate collection

The annotation editor page is divided into four major sections:

  1. Main editor
    This is the dominant part of the page where you will perform document annotations. You will see your golden document and your added annotations in this section.
     
  2. Toolbar
    This section is at the top of the page. It contains all the action menus for the annotation editor page, as well as the name of the golden document that you are working on.
     
  3. Annotations list
    This section is on the right-hand side of the page. It lists the annotations you created in the golden document.
     
  4. Preview table
    This section is at the bottom of the page. It previews the values of the extracted fields from 10 randomly selected documents in your collection.

Currently, the tool only supports extraction to text or checkbox (boolean). All numerical values will be converted to text/string.
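Because numerical values arrive as text, you may want to convert them after export before doing any arithmetic. The sketch below shows one way to do that. It assumes numbers use a comma as the thousands separator and a period as the decimal point, and that checkbox values appear as text such as "true" or "false". Both are assumptions, so check them against your own exported data.

```python
def to_number(text):
    """Convert extracted text such as "1,250.50" to a float.

    Returns None when the text is not numeric. Assumes "," is a
    thousands separator and "." is the decimal point.
    """
    cleaned = text.strip().replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None

def to_bool(text):
    """Convert a checkbox value stored as text to True or False.

    The accepted spellings are assumptions about the export format.
    Returns None for anything unrecognized.
    """
    value = text.strip().lower()
    if value in ("true", "yes", "1", "checked"):
        return True
    if value in ("false", "no", "0", "unchecked", ""):
        return False
    return None
```

Returning None for unrecognized values, rather than raising an error, makes it easy to list the cells that need manual review.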

Key/Value

This tool is best used to extract a single labeled value from your collection. An example of the result of this annotation is “Country” as key and “United States of America” as the value.

To annotate your document using the Key/value annotation, follow these steps:

  • Select the Key/value annotation tool at the top of the annotation editor page
  • Draw a rectangle around the value that you want to extract. Make the rectangle long enough to accommodate values with more characters in other documents
  • The tool will automatically select and mark a key for the value you selected. You can drag and edit this marker for accurate annotation
  • To change the name of the column header in the extracted data, edit the name of the key in the Annotations section on the right side of the window
  • Repeat these steps for all the key-value pairs you wish to extract from your document collection

Each annotation is an approximate marker for the tool to extract the data from all of the documents in your collection.

When available, you can follow grids or markers in your document. If not, make sure you leave room for longer values.

Repeated section

This tool is best used to extract a section with repeating key-value pair(s). The annotation will be able to cover any number of continuous repeated sections over multiple pages.

To annotate your document using the Repeated Section annotation, follow these steps:

  • Select the Repeated Section annotation tool at the top of the annotation editor page
  • Mark across the height of the first repeated instance of the section
  • The tool will automatically create a line approximately below the marked instance. Drag the line until the whole section you want to annotate is highlighted
  • Enter the name of the section in the “Repeated section name” pop-up
  • Click “Save section”
  • Select the Key/value annotation tool at the top of the annotation editor page
  • Within the range of the first repeated instance, follow the key/value annotation steps for all of the key-value pairs you want to extract

Tables

This tool is best used to extract data stored in tabular format. You will need to annotate each table you wish to extract in the document. Please note that the tool will work for a table that spans multiple pages, including repeated headers.

The tool will work best if the annotated table is of the same horizontal dimension, format, and headers across all of the documents in the collection.

To annotate your document using the Tables annotation, follow these steps:

  • Select the Tables annotation tool at the top of the annotation editor page
  • Draw a rectangle over the table you wish to extract data from. If the table spans multiple pages, you only need to highlight the part on the first page
  • The tool will try to detect the table’s approximate outline. If the detected outline doesn’t roughly cover the table, repeat the highlighting step
  • Adjust the outline to match the outline of the table. Drag the bottom line until all parts of the table are highlighted, including repeated headers and parts on following pages
  • Enter the table name in the pop-up box
  • Indicate whether the table has a header using the toggle in the pop-up box
  • Adjust the header and column separator lines to match the table’s formatting, so that they clearly mark the column widths and the header row as they appear in the document. You can add or delete column separators by right-clicking a column separator
  • Click “Save Table”

Extract and download your data

Once you are happy with the result shown in the preview table, you can extract your data by clicking the “Extract” button in the top-right corner of the annotation editor page. The extraction applies only to the current set of annotations. If you edit annotations for your collection at a later date, you need to run the extraction again.

Once the extraction is complete, you can download the data by clicking “Download”. You will get a zip file containing CSV file(s), one for each tab in the preview table and one summary file for all the documents in the collection.
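If you prefer to work with the download in a script rather than unpacking it by hand, the zip file can be read directly. The following is a minimal sketch that loads every CSV in the archive into memory. It assumes the files are UTF-8 encoded and have a header row, and the function name `read_csvs_from_zip` is made up for this example. The actual file names inside the archive are not assumed.

```python
import csv
import io
import zipfile

def read_csvs_from_zip(zip_source):
    """Return {member name: list of row dicts} for every CSV in the archive.

    zip_source may be a file path or a binary file-like object.
    Assumes UTF-8 text (with or without a byte order mark) and a header row.
    """
    tables = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue  # skip anything that is not a CSV file
            with archive.open(name) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
                tables[name] = list(csv.DictReader(text))
    return tables
```

For example, `read_csvs_from_zip("download.zip")` would return a dictionary with one entry per CSV file, where "download.zip" stands in for whatever name your downloaded file has.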

You can review the extraction result for a document by clicking the link corresponding to that document provided in the summary file. See "Review extraction result".

Review extraction result

After you extract fields from your collection, you may want to verify some of the extracted values against what you see in the document.
You can review the extraction results for each document in your collection by clicking that document’s link, either in the summary CSV file that you downloaded or in the preview table.
The document extraction result page lets you view and validate all the extracted values for a single document.
Selecting an annotation box in the document shows the extracted result in the panel on the right side. Conversely, selecting a value in the right-side panel navigates the document to the corresponding annotation box.
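For a large collection, reviewing every document is rarely practical, so one option is to spot-check a random sample. The sketch below picks a reproducible random subset of rows from the summary data. It assumes you have already loaded the summary CSV into a list of rows, and the function name `pick_for_review` is hypothetical.

```python
import random

def pick_for_review(rows, sample_size, seed=None):
    """Return a random subset of rows to review manually.

    Passing a seed makes the selection reproducible, so a colleague
    can check the same documents. If sample_size is at least the
    number of rows, all rows are returned.
    """
    if sample_size >= len(rows):
        return list(rows)
    rng = random.Random(seed)
    return rng.sample(rows, sample_size)
```

You can then open the link for each selected row and compare the extracted values with the source document.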

 

Manage annotated collection

Reprocess annotated collection

Reprocessing an annotated collection removes all annotations that you have previously made.

To redo the processing that the Extract Structured Data tool performs on your collection, follow these steps:

Manage golden documents

Replace golden document

To replace a golden document with another one, follow these steps: 

  • Navigate to the annotation editor page for your collection
  • In the annotation editor page, click the (three dots) menu
  • Select “Replace golden document”
  • Select your preferred golden document from the sample set, then click “OK”
  • In the document review page, click “Set as golden” in the top-right corner
  • Select “Replace an existing golden document”, then click “OK”

The next step depends on whether the collection has a previously annotated golden document:

Add golden document

While reviewing extraction results, you can add more golden documents to accommodate slight differences in the document template, or to annotate additional optional fields that appear only in some documents.

You can do this by following these steps: 

  • Navigate to the document review page linked from the sample set preview table or from the downloadable main summary CSV
  • Click “Set as golden” in the top-right corner
  • Select “Add a new golden document”, then click “OK”

The annotation process for additional golden documents differs from regular annotation. See “Annotation transfer” for details.

Remove a golden document from the set

  • Select the name of the golden document from the file name dropdown at the top of the annotation editor page
  • In the same dropdown, select “Remove from golden documents set”
  • Click “Delete” in the following prompt to approve the action

Annotation transfer

After you add a new golden document to the set or replace an existing annotated golden document, the tool approximately matches the existing annotations to the new golden document.

If the tool can’t match the previously annotated field to the new golden document, the field will be marked as “Needs attention” in the Annotations section on the right hand side of the annotation editor page. 

To resolve this, do one of the following:

  • If the field is actually available in the new golden document
    1. Add the annotation for that field
    2. Select “Resolve a “needs attention” key/value” in the prompt window
    3. Select the name of the field from the dropdown
    4. Click “OK”
  • If the field is not available in the new golden document
    1. Select the field box that needs attention in the Annotations section
    2. Click to mark the field as missing from only the new golden document 

If the new golden document contains data that is not listed in the Annotations section, you can annotate that data manually. The new annotations are added only to the new golden document.

Edit annotation

Change field name or type

  • Select the field box in the Annotations section on the right hand side of the annotation editor page
  • Edit field name or type directly in the field box
  • Click “OK” in the following prompt

Changes to a field’s name or type apply to all of the golden documents in the collection.

Adjust key/value annotation

  • Click the value annotation box you wish to adjust
  • Drag the selected box to move it, or drag its edges to resize it
  • The adjustment applies only to the golden document you are currently editing

Adjust repeated section annotation

  • Click anywhere on the repeated section annotation you wish to adjust
  • Adjust the section’s dimensions by moving the separator lines vertically
  • The adjustment applies only to the golden document you are currently editing

Adjust table annotation

  • Click anywhere on the table annotation box you wish to adjust
  • Drag the lines within the box to adjust the table’s dimensions, column widths, and header row
  • The adjustment applies only to the golden document you are currently editing

Delete annotation

To delete any annotation from all golden documents, follow these steps:

  • Select a field in the Annotations section on the right hand side of the annotation editor page
  • Click and acknowledge that you wish to delete the field from all golden documents