Lost in Translation? Boosting Accuracy in PDF-Based Data Extraction using LLMs
- Aniket Kulkarni
- Mar 12
- 3 min read
Introduction
Consider the financial document depicted below. At a glance, you and I immediately understand its structure: it contains two distinct columns, multiple tables (some with merged cells), and several empty cells. If asked about a specific detail, such as the Net Current Assets under the F&U Mandate column, you'd readily notice it's blank.

However, such documents are challenging for Large Language Models (LLMs) to interpret. Parsing a PDF and sending the raw text directly to an LLM typically loses spatial accuracy and reading order, leading to inaccuracies in data extraction and summarization tasks. Without clear spatial cues, the model struggles to interpret the document's layout correctly, particularly in complex tables.
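To make the failure mode concrete, here is a minimal sketch of that raw-text approach, assuming the pypdf library and a hypothetical sample_financials.pdf file. The extracted text arrives as a flat stream, so column boundaries, merged cells, and empty cells are no longer distinguishable by the time the content reaches the model.

```python
# Minimal sketch of the "raw PDF text" approach, assuming pypdf is installed
# and a hypothetical sample_financials.pdf is available locally.
from pypdf import PdfReader

reader = PdfReader("sample_financials.pdf")
raw_text = "\n".join(page.extract_text() for page in reader.pages)

# Table rows come back as flat lines of text: column headers drift away from
# their values, and an empty cell (like Net Current Assets under F&U Mandate)
# simply vanishes, which is exactly the spatial context the LLM then lacks.
print(raw_text[:500])
```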
At Newtuple, our mission is to harness the power of LLMs to streamline the tedious task of extracting structured information from PDFs. Imagine the transformative potential: converting chaotic, unstructured document content into clean, structured database entries. Our work spans diverse document types, from intricate financial statements and detailed invoices to nuanced inventory reports and dense legal texts. LLMs hold immense promise because of their capability to digest vast amounts of unstructured data and turn them into meaningful, structured information.
Yet, our initial experiments with feeding raw PDFs directly to these models highlighted a critical limitation: contextual comprehension, especially spatial awareness. For instance, invoice tables with visually distinct column headers and corresponding values pose significant challenges. While humans effortlessly interpret these visual cues, LLMs often overlook them when parsing raw PDF data. This oversight results in inaccuracies and reduces the reliability of extracted information.
Recognizing this bottleneck, we explored ways to improve the accuracy of PDF-based data extraction using LLMs. The solution we found effective is preprocessing documents into semi-structured formats like Markdown. Markdown's structured syntax preserves spatial context, greatly improving an LLM's ability to interpret tabular layouts accurately.
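As a rough illustration of this approach, the sketch below converts a PDF to Markdown with Docling's quick-start DocumentConverter and then builds an extraction prompt from the result. The file name and the prompt wording are illustrative assumptions, not our production pipeline.

```python
# Sketch of the Markdown preprocessing step using Docling's quick-start API.
# The file name and prompt text below are illustrative assumptions.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("sample_financials.pdf")

# Tables are exported as pipe-delimited Markdown, so headers, empty cells,
# and row/column alignment survive as explicit structure.
markdown = result.document.export_to_markdown()

prompt = (
    "The document below is provided as Markdown. "
    "What is the Net Current Assets value under the F&U Mandate column? "
    "Answer 'blank' if the cell is empty.\n\n" + markdown
)
# The prompt can now be sent to an LLM of choice for extraction.
```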
In this blog, we share our journey and insights into using essential preprocessing tools. We'll compare popular document parsing libraries that excel at converting PDFs into Markdown, focusing on their capacity to handle diverse document types and maintain tabular data integrity.
Preprocessors at a glance
Parsing documents into Markdown format has become an essential step in preparing content for Large Language Models (LLMs). Markdown’s simplicity and structured syntax make it an ideal choice for ensuring the content is clean, consistent, and ready for processing. In this blog, we explore four popular document parsing libraries—Docling, MegaParse, Tabled, and MarkItDown. These tools are evaluated based on their ability to handle diverse document types, maintain formatting integrity, and streamline the conversion process. If you're looking to optimize your workflow for feeding data into LLMs, this guide will help you choose the best library for your needs. Refer to each library's documentation for installation and setup.
To provide a clear and objective evaluation, we employed a specific methodology. Our primary focus was Markdown generation: analyzing each library's Markdown output, particularly how it renders tabular data extracted from PDFs.
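In practice, this amounted to converting each sample PDF with each library and inspecting the resulting Markdown side by side. The harness below is a rough sketch of that loop: the Docling and MarkItDown calls follow their published quick-start APIs, while the MegaParse and Tabled wrappers are left as placeholders, since their setup differs and is covered by each project's documentation. File names are illustrative.

```python
# Rough sketch of the comparison harness for the "Markdown generation" check:
# convert each sample PDF with each tool and write one .md file per
# (document, tool) pair for side-by-side inspection of the table output.
from pathlib import Path

from docling.document_converter import DocumentConverter
from markitdown import MarkItDown


def convert_with_docling(pdf_path: str) -> str:
    return DocumentConverter().convert(pdf_path).document.export_to_markdown()


def convert_with_markitdown(pdf_path: str) -> str:
    return MarkItDown().convert(pdf_path).text_content


def convert_with_megaparse(pdf_path: str) -> str:
    raise NotImplementedError("placeholder: call MegaParse here")


def convert_with_tabled(pdf_path: str) -> str:
    raise NotImplementedError("placeholder: call Tabled here")


CONVERTERS = {
    "docling": convert_with_docling,
    "markitdown": convert_with_markitdown,
    "megaparse": convert_with_megaparse,
    "tabled": convert_with_tabled,
}

# The three sample documents used in this study (file names are illustrative).
SAMPLES = ["physical_specs.pdf", "adc_specs.pdf", "financial_statement.pdf"]

out_dir = Path("markdown_outputs")
out_dir.mkdir(exist_ok=True)

for pdf in SAMPLES:
    for name, convert in CONVERTERS.items():
        try:
            markdown = convert(pdf)
        except NotImplementedError:
            continue  # skip tools whose wrapper is not filled in
        (out_dir / f"{Path(pdf).stem}_{name}.md").write_text(markdown)
```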
Example documents
The following images were used for this study; all are available on the open internet. They were chosen because they contain data in complex structures: scientific notation, complex tables, empty cells, and so on.
Physical Specifications of a product
A/D Converter specifications
Financial Statement
Results of PDF-Based Data Extraction Using LLMs
Physical Specifications of a product
Docling
Tabled
MegaParse
MarkItDown
A/D Converter Specifications
Docling
Tabled
MegaParse
MarkItDown
Financial Statement
Docling
Tabled
MegaParse
MarkItDown
Conclusion
In conclusion, our exploration into PDF preprocessing clearly illustrates the impact that structured preparation can have on the effectiveness of Large Language Models (LLMs) in extracting accurate, structured information from PDFs. By shifting from raw PDF inputs to structured Markdown using specialized tools like Docling, we significantly enhanced LLM accuracy by preserving vital spatial context and tabular integrity. Among the evaluated tools—Docling, MegaParse, Tabled, and MarkItDown—we observed clear differences in output, particularly when handling complex tabular data. Docling came out on top of all the tools evaluated.
Ultimately, our work at Newtuple underscores a crucial insight: preprocessing PDFs into semi-structured formats such as Markdown is essential. This preprocessing step unlocks the true potential of LLMs, ensuring the extraction of reliable, accurate, and actionable structured data. As we continue refining these techniques, we're excited about the possibilities they will open for businesses seeking to transform unstructured PDF data into valuable operational intelligence.