
End-to-End Distributed PDF Processing Pipeline
OCR, Spatial Analysis & GPU Embeddings with Python
by Malcolm Greaves
PDF processing pipelines break on edge cases. Data engineers spend hours debugging memory limitations, custom OCR scripts, and fragmented toolchains. Even processing a few PDFs can be challenging, and a whole host of new problems arises when you scale a PDF processing pipeline up. These challenges become critical across industries that process documents at scale:
- Financial Services: Mortgage processors analyze thousands of loan applications daily: financial statements, employment verification, property assessments. Traditional OCR fails on rotated pages, mixed layouts, and handwritten annotations.
- Legal Technology: Law firms process contract repositories with millions of agreements. Teams extract specific clauses, key terms, and language patterns while maintaining spatial context for legal interpretation.
- Healthcare: Medical organizations process patient records, insurance claims, and clinical reports containing structured forms, handwritten notes, and medical imagery, all of which require flexible extraction approaches.
Traditional document processing tools fragment these workflows across multiple systems, creating operational complexity and scaling bottlenecks.
Enter Daft, a distributed query engine providing simple and reliable data processing for any modality at any scale. Daft lets you process documents as naturally as you process tabular data—no infrastructure complexity, just declarative pipelines that scale automatically.
This blog post walks through our PDF handling tutorial. We build a production-quality PDF pipeline that downloads PDFs from an S3 bucket, parses and processes them into an easy-to-use form, and computes embeddings on the GPU.
👨💻 Watch the Complete Walkthrough
Software engineer Malcolm demonstrates building a feature-rich PDF processing pipeline from scratch. Watch him download PDFs from S3, extract text with OCR, perform spatial layout analysis, compute GPU embeddings, and save everything to Parquet—all in 20 minutes.
The Problem: Traditional Engines Can't Handle Documents
Traditional engines excel at structured data: rows, columns, simple types. Documents require a different architecture:
Data Type | Traditional Approach | Daft's Multimodal-First Design |
---|---|---|
GPU Workloads | Manual resource allocation, custom batching, separate orchestration | |
Nested Document Structure | Flatten to rows/columns, lose spatial relationships | Preserve hierarchical structure |
Embeddings Pipeline | Separate systems (extraction → processing → embedding → storage) | End-to-end in single declarative pipeline |
Schema Management | Manually maintaining Python code and corresponding Arrow schemas | Automatic Pydantic → Arrow conversion |
Scaling | Rewrite for distributed execution, manage partitioning | Same code from laptop to cluster |
Traditional PDF processing requires this complexity:
```python
# Traditional approach - manual everything
pdf_schema = StructType([
    StructField("pages", ArrayType(StructType([
        StructField("text_blocks", ArrayType(StructType([...])))
    ])))
])  # 15+ lines just for schema


def process_pdf_traditional(pdf_bytes):
    # how do you use the pdf_schema arrow type?
    pass


my_pdfs: list[Path] = ...
# This is sequential -- how do you balance running these in parallel?
for pdf in my_pdfs:
    with open(pdf, 'rb') as rb:
        processed = process_pdf_traditional(rb.read())
        # where do you collect these? in memory? write out to disk?
```
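One manual answer to the parallelism question is a thread pool. The following is a minimal sketch using only the standard library, not code from the tutorial: `process_one`, `process_all`, and the stub body of `process_pdf_traditional` are hypothetical stand-ins, and the demo files are throwaway placeholders rather than real PDFs.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
import tempfile


def process_pdf_traditional(pdf_bytes: bytes) -> dict:
    # Hypothetical stand-in for real parsing: just report the payload size.
    return {"num_bytes": len(pdf_bytes)}


def process_one(pdf: Path) -> tuple[str, dict]:
    # Read a single file and parse it; runs inside a worker thread.
    return pdf.name, process_pdf_traditional(pdf.read_bytes())


def process_all(pdfs: list[Path], max_workers: int = 4) -> dict[str, dict]:
    results: dict[str, dict] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(process_one, pdf) for pdf in pdfs]
        for future in as_completed(futures):
            name, parsed = future.result()  # re-raises any worker exception
            results[name] = parsed  # every result is held in memory
    return results


# Demo with throwaway files standing in for real PDFs.
with tempfile.TemporaryDirectory() as tmp:
    demo_paths = []
    for i in range(3):
        path = Path(tmp) / f"doc_{i}.pdf"
        path.write_bytes(b"%PDF-" + b"x" * i)
        demo_paths.append(path)
    demo_results = process_all(demo_paths)
```

Even this small version has to pick a worker count, decide how errors surface, and keep every result in memory, which are the concerns raised in the comments of the traditional example.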
Daft's approach:
```python
from daft import DataFrame, udf, Series
from pydantic import BaseModel

class ParsedPdf(BaseModel):
    ...

# Multimodal-native - clean and declarative
# (daft_pyarrow_datatype is not imported here; see the notebook for it)
@udf(return_dtype=daft_pyarrow_datatype(ParsedPdf))
class LoadAndParsePdf:
    def __call__(self, urls: Series, pdf_bytes: Series) -> Series:
        # Natural handling of complex document type into Arrow schema
        # Daft manages concurrency and spooling out results to your final write destination
        ...


df: DataFrame = ...
# Pass both columns that __call__ expects: the file path and the raw bytes
df = df.with_column("parsed", LoadAndParsePdf(df["path"], df["pdf_bytes"]))
```
Architecture Overview: End-to-End Multimodal Processing

Daft PDF Processing Pipeline Architecture - Complete data transformation flow from PDFs in S3 to ML-ready Parquet files.
The Complete Pipeline
The architecture consists of the following connected stages, each feeding directly into the next:
- Parallel S3 download → Handles thousands of documents simultaneously, with retry logic to obtain optimum throughput. Produces raw PDF bytes.
- Flexible PDF text extraction → Either OCR or PDF processing to extract text with bounding boxes.
- Spatial layout analysis → Coordinate-based grouping to recover line and paragraph structure.
- GPU embeddings → Uses readily available models to produce 384-dimensional vectors for each piece of grouped text.
- Structured Parquet output → Schema-validated data that’s ready to go for ML pipelines.

End-to-End Data Flow with Automatic Scaling - Complete pipeline visualization showing data transformations, distributed processing, and resource management.
🚀 Follow Along in Google Colab
Code alongside this tutorial with our interactive notebook. All dependencies are pre-installed, and you can run the complete pipeline with real data in minutes. Open the notebook →
Document Structure with Pydantic
Document processing requires structure. Unlike traditional NLP where text is one-dimensional, documents are spatial. Modern systems must understand both what the text says and where it appears.
```python
# This is just pseudocode that's based on the real code in the notebook!
# See the notebook for the full implementation.
from pydantic import BaseModel


class BoundingBox(BaseModel):
    ...  # coordinate fields are defined in the notebook


class TextBlock(BaseModel):
    text: str
    bounding_box: BoundingBox


class ParsedPage(BaseModel):
    page_index: int
    text_blocks: list[TextBlock]


class ParsedPdf(BaseModel):
    pdf_path: str
    total_pages: int
    pages: list[ParsedPage]
```
Pydantic models enforce structure while maintaining flexibility. Daft converts them automatically to Arrow types for efficient processing. This hierarchical structure captures the essential spatial relationships within documents.

Hierarchical Document Structure - Visual breakdown showing how PDFs map to nested Pydantic models. Each PDF contains multiple pages with spatially-aware text blocks that preserve both content and precise coordinate information for layout understanding.
Each PDF contains multiple pages, each page contains multiple text blocks, and each text block has both content and precise spatial coordinates. This spatial information enables understanding document layout and grouping related text together.
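To make the hierarchy concrete, here is a small sketch that mirrors the models with standard-library dataclasses (used here only as stand-ins for the Pydantic classes) and walks the nesting to produce one flat row per text block. The `BoundingBox` field names and the `flatten` helper are assumptions for illustration; the notebook may define them differently.

```python
from dataclasses import dataclass, asdict


@dataclass
class BoundingBox:
    # Assumed corner-style coordinates; the notebook may use other fields.
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass
class TextBlock:
    text: str
    bounding_box: BoundingBox


@dataclass
class ParsedPage:
    page_index: int
    text_blocks: list[TextBlock]


@dataclass
class ParsedPdf:
    pdf_path: str
    total_pages: int
    pages: list[ParsedPage]


def flatten(doc: ParsedPdf) -> list[dict]:
    """Return one flat row per text block, keeping its page and coordinates."""
    rows = []
    for page in doc.pages:
        for block in page.text_blocks:
            rows.append({
                "pdf_path": doc.pdf_path,
                "page_index": page.page_index,
                "text": block.text,
                "bounding_box": asdict(block.bounding_box),
            })
    return rows


doc = ParsedPdf(
    pdf_path="example.pdf",
    total_pages=1,
    pages=[ParsedPage(0, [
        TextBlock("Invoice", BoundingBox(72, 40, 140, 56)),
        TextBlock("Total: 42", BoundingBox(72, 700, 150, 716)),
    ])],
)
rows = flatten(doc)
```

Each row still carries its page index and bounding box, so the spatial context survives the move from a nested document to a tabular layout.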
PDF Processing with User-Defined Functions (UDF)
Now that we’ve defined our data structure, we can use Daft to process these complex documents at scale. Daft's UDF system defines custom logic once and scales automatically across distributed infrastructure:

Daft UDF Execution Model - Write processing logic once and Daft automatically distributes it across multiple workers. Data partitioning, worker management, and result aggregation happen transparently without infrastructure code.
Automatic parallelization eliminates infrastructure complexity—write processing logic once, Daft handles data partitioning, worker management, and result aggregation.
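```python
# This is just pseudocode, see notebook for full implementation!
# (daft_pyarrow_datatype, ocr_text_blocks and extract_text_blocks are not defined here; see the notebook)

import daft
from daft import Series, col, udf


@udf(return_dtype=daft_pyarrow_datatype(ParsedPdf))
class LoadAndParsePdf:
    def __init__(self, use_ocr: bool = False):
        # Set via LoadAndParsePdf.with_init_args(use_ocr=...)
        self.ocr = use_ocr

    def __call__(self, urls: Series, pdf_bytes: Series) -> Series:
        results = []
        for url, pdf_data in zip(urls.to_pylist(), pdf_bytes.to_pylist()):
            if self.ocr:
                parsed_doc = ocr_text_blocks(pdf_data)
            else:
                parsed_doc = extract_text_blocks(pdf_data)
            # Convert the Pydantic model to a plain dict before building the Series
            results.append(parsed_doc.model_dump())
        return Series.from_pylist(results)
```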
Raw extraction is only the first step. The resulting text fragments need intelligent organization to become useful.
Spatial Layout Analysis: From Fragments to Coherent Text
Raw OCR produces fragmented text blocks. Even reading text blocks from the PDF will often produce a jumbled mess: words split into two distinct text elements with their own bounding boxes, etc. Production systems require coherent, readable text maintaining document structure.

Spatial Layout Analysis Process - Three-stage transformation from scattered OCR fragments to coherent document structure. Coordinate-based algorithms preserve reading order and group related text while maintaining spatial relationships.
Here’s a snippet of the algorithm from the tutorial notebook that makes the raw extracted text values a bit more coherent. It uses coordinate-based heuristics to infer the document’s structure:
```python
# This is just pseudocode, see notebook for full implementation!

@udf(return_dtype=daft_pyarrow_datatype(list[ProcessedPage]))
class DocProcessor:
    def __init__(self, group_paragraphs: bool):
        # Set via DocProcessor.with_init_args(group_paragraphs=...)
        self.group_paragraphs = group_paragraphs

    def __call__(self, parsed_docs: Series) -> Series:
        results = []
        for doc in parsed_docs.to_pylist():
            pages = []  # collect per document so each input row yields one output row
            for page in doc['pages']:
                # Sort into reading order (left-to-right, top-to-bottom)
                sorted_blocks = self.sort_reading_order(page['text_blocks'])  # See notebook for coordinate-based grouping logic

                # Group into lines, then paragraphs
                if self.group_paragraphs:
                    grouped_blocks = self.group_into_paragraphs(sorted_blocks)  # See notebook for Y-coordinate clustering
                else:
                    grouped_blocks = self.group_into_lines(sorted_blocks)  # See notebook for horizontal grouping

                pages.append(grouped_blocks)
            results.append(pages)
        return Series.from_pylist(results)
```
The configurable thresholds (`row_tolerance`, `y_thresh`, `x_thresh`) handle different document layouts while the algorithm respects spatial relationships that give text meaning.
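As a rough illustration of coordinate-based grouping, the sketch below clusters blocks into lines by their top edge and then merges lines into paragraphs by vertical gap. It is not the notebook's algorithm: the `Block` class, the default values, the downward-growing y axis, and the exact meaning given to `row_tolerance` and `y_thresh` are all assumptions, and `x_thresh` is left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    x0: float  # left edge
    y0: float  # top edge (y is assumed to grow downward)
    x1: float  # right edge
    y1: float  # bottom edge


def group_into_lines(blocks: list[Block], row_tolerance: float = 3.0) -> list[list[Block]]:
    """Cluster blocks whose top edge lies within row_tolerance of the line's first block."""
    lines: list[list[Block]] = []
    for block in sorted(blocks, key=lambda b: b.y0):
        if lines and abs(block.y0 - lines[-1][0].y0) <= row_tolerance:
            lines[-1].append(block)
        else:
            lines.append([block])
    # Within each line, read left to right.
    for line in lines:
        line.sort(key=lambda b: b.x0)
    return lines


def group_into_paragraphs(lines: list[list[Block]], y_thresh: float = 8.0) -> list[str]:
    """Merge consecutive lines when the vertical gap between them is at most y_thresh."""
    paragraphs: list[list[str]] = []
    prev_bottom = None
    for line in lines:
        text = " ".join(b.text for b in line)
        top = min(b.y0 for b in line)
        if prev_bottom is not None and top - prev_bottom <= y_thresh:
            paragraphs[-1].append(text)
        else:
            paragraphs.append([text])
        prev_bottom = max(b.y1 for b in line)
    return [" ".join(p) for p in paragraphs]


blocks = [
    Block("world", 60, 101, 100, 112),
    Block("Hello", 10, 100, 50, 112),
    Block("Second line.", 10, 116, 90, 128),
    Block("New paragraph.", 10, 160, 110, 172),
]
lines = group_into_lines(blocks)
paragraphs = group_into_paragraphs(lines)
```

The tolerances trade off robustness against over-merging: a larger `row_tolerance` absorbs slightly skewed scans but can fuse tightly spaced rows, so the values usually need tuning per document family.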
GPU Embeddings at Scale
With the document structured and its reading order preserved, the final step is to generate semantic embeddings for the processed text, with Daft handling GPU allocation and lifecycle management:
```python
# This is just pseudocode, see notebook for full implementation!
import daft
import torch
from daft import Series, udf
from sentence_transformers import SentenceTransformer


@udf(
    return_dtype=daft.DataType.embedding(daft.DataType.float32(), 384),
    num_gpus=1  # Daft handles GPU lifecycle
)
class TextEmbedder:
    def __init__(self):
        # Load the model once and move it to the GPU in inference mode
        self.model = SentenceTransformer("all-MiniLM-L6-v2").to("cuda").eval()

    def __call__(self, texts: Series) -> Series:
        # Disable gradient tracking; only a forward pass is needed
        with torch.no_grad():
            embeddings = self.model.encode(texts.to_pylist(), convert_to_numpy=True)
        return Series.from_numpy(embeddings)
```
The UDF system manages model initialization, GPU memory allocation, and batching optimization automatically.
The Complete Pipeline
Wire everything together in a single, declarative workflow:
```python
# This is just pseudocode, see notebook for full implementation!

import daft
from daft import col

# Configuration
IO_CONFIG = daft.io.IOConfig(s3=daft.io.S3Config(anonymous=True))

# Build the complete pipeline
pipeline = (
    # Download the PDFs from S3
    daft.from_glob_path("s3://your-bucket/pdfs/*", io_config=IO_CONFIG)
    .with_column("pdf_bytes", col("path").url.download(io_config=IO_CONFIG))
    # Run either OCR or parse each PDF file to get out text and form into a structured object
    .with_column("parsed", LoadAndParsePdf.with_init_args(use_ocr=False)(col("path"), col("pdf_bytes")))
    # Reformat the text boxes structured objects (the `ParsedPdf` pydantic class instances) into rows.
    # Each row has the reading order index, the text, the page index, and the bounding box coordinates.
    .with_column("processed", DocProcessor.with_init_args(group_paragraphs=True)(col("parsed")))
    .explode("processed")  # Flatten nested structure
    .with_column("indexed_texts", col("processed").struct.get("indexed_texts"))
    .explode("indexed_texts")
    .with_column("text_blocks", col("indexed_texts").struct.get("text"))
    .with_column("reading_order_index", col("indexed_texts").struct.get("index"))
    .with_column("text", col("text_blocks").struct.get("text"))
    .with_column("bounding_box", col("text_blocks").struct.get("bounding_box"))
    # Produce embeddings for each piece of text.
    .with_column("embeddings", TextEmbedder.with_init_args()(col("text")))
)

# Execute and save
pipeline.write_parquet("./processed_documents")
```
What Makes This Different
Traditional engines excel at massive structured datasets and SQL workloads, and they offer proven reliability. Multimodal data requires a different architecture:
- Automatic resource management: GPU allocation, memory management, batching optimization happen transparently.
- Spatial operations: Built-in support for coordinate-based operations and complex nested structures preserving document meaning.
- Declarative scaling: Same code processes 10 PDFs on laptops or 10,000 PDFs on clusters with zero infrastructure changes.
When data extends beyond traditional tables, Daft eliminates the architectural complexity gap between what teams need and what legacy engines provide.
Results
The production PDF processing pipeline handles:
- Parallel I/O: Hundreds of PDFs simultaneously with connection pooling and retry logic
- Flexible extraction: OCR or direct parsing with intelligent fallback mechanisms
- Spatial awareness: Coordinate-based heuristics preserving document structure
- GPU embeddings: Automatic model lifecycle management and memory optimization
- Structured output: Schema-validated Parquet files ready for ML pipelines
The framework scales automatically from development to production clusters. Processing logic stays clean while Daft handles distribution, memory management, and fault tolerance.
Troubleshooting Common Issues
OCR fails on scanned documents: Install Tesseract

```shell
# Ubuntu / Debian systems
sudo apt install tesseract-ocr

# OS X
brew install tesseract
```
GPU out of memory: Reduce batch size or use a smaller embedding model.
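A minimal sketch of what reducing the batch size means in practice: the texts are fed to the encoder in small slices so that only one slice's activations sit in GPU memory at a time. `iter_batches`, `embed_in_batches`, and `fake_encode` are hypothetical helpers written for this illustration, not part of Daft or the notebook.

```python
from typing import Callable, Iterator, Sequence


def iter_batches(texts: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


def embed_in_batches(texts: Sequence[str], encode: Callable, batch_size: int = 32) -> list:
    """Call encode() on one small batch at a time and concatenate the results."""
    vectors = []
    for batch in iter_batches(texts, batch_size):
        vectors.extend(encode(batch))
    return vectors


# Hypothetical stand-in encoder: maps each text to a 1-dimensional "embedding".
def fake_encode(batch):
    return [[float(len(text))] for text in batch]


sample = ["a", "bb", "ccc", "dddd", "eeeee"]
vectors = embed_in_batches(sample, fake_encode, batch_size=2)
batch_sizes = [len(b) for b in iter_batches(sample, 2)]
```

With sentence-transformers, the same effect is usually available directly through the `batch_size` argument of `SentenceTransformer.encode`, so a helper like this is only needed when the encoder offers no such option.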
Different document formats: The same pipeline works for images, scanned PDFs, and native PDFs. If extracting text directly from the PDF isn’t working, set `use_ocr=True`. You should also do this for scanned documents.
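If you would rather not choose the mode by hand, one possible heuristic is to try direct extraction first and fall back to OCR when it returns almost no text. This is a sketch under stated assumptions, not the tutorial's behavior: `needs_ocr`, `parse_with_fallback`, the `min_chars` threshold, and the two fake helpers are all hypothetical.

```python
def needs_ocr(extracted_text: str, min_chars: int = 20) -> bool:
    """Treat near-empty direct extraction as a sign the page is a scanned image."""
    return len(extracted_text.strip()) < min_chars


def parse_with_fallback(pdf_bytes: bytes, extract_text, ocr_text, min_chars: int = 20):
    """Try direct extraction first; fall back to OCR when it yields too little text.

    Returns a (text, used_ocr) tuple.
    """
    text = extract_text(pdf_bytes)
    if needs_ocr(text, min_chars):
        return ocr_text(pdf_bytes), True
    return text, False


# Hypothetical stand-ins for the real extraction and OCR helpers.
def fake_extract(pdf_bytes: bytes) -> str:
    return pdf_bytes.decode("ascii", errors="ignore")


def fake_ocr(pdf_bytes: bytes) -> str:
    return "text recovered by OCR"


native = parse_with_fallback(b"A native PDF with a real text layer.", fake_extract, fake_ocr)
scanned = parse_with_fallback(b"  ", fake_extract, fake_ocr)
```

A character-count threshold is crude: it will miss PDFs whose text layer exists but is garbage, so treat it as a starting point rather than a reliable detector.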
Next Steps
Ready to build your own document processing pipeline?
- Try the interactive tutorial - Run the complete pipeline in Google Colab with real data
- Watch the video walkthrough - See our engineer build this pipeline step-by-step
- Read the docs - Explore multimodal processing and scale to production deployments
- Get started now - `pip install daft`
FAQ
Q: How does this compare to existing PDF processing tools?
A: Traditional tools require separate systems for extraction, processing, and embeddings. Daft handles the entire pipeline in a single framework with automatic scaling and resource management.
Q: Can I use custom embedding models?
A: Yes, any HuggingFace or sentence-transformers model works. Just change the `model_name` parameter. Make sure you also provide the correct embedding size if you change the model! (If using `SentenceTransformer`, call `.get_sentence_embedding_dimension()` on it!)
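A cheap way to catch a mismatch early is to compare the vectors a model produces against the declared size before writing anything. The guard below is a hypothetical helper for illustration (`check_embedding_dim` and `DECLARED_DIM` are not Daft APIs); it assumes embeddings arrive as plain sequences of floats.

```python
def check_embedding_dim(vectors, expected_dim: int) -> None:
    """Raise ValueError if any vector's length differs from expected_dim."""
    for i, vec in enumerate(vectors):
        if len(vec) != expected_dim:
            raise ValueError(
                f"vector {i} has {len(vec)} dimensions, expected {expected_dim}"
            )


DECLARED_DIM = 384  # the size declared in the embedding return type
check_embedding_dim([[0.0] * 384, [0.1] * 384], DECLARED_DIM)  # passes silently

# A model with a different output size is rejected with a clear message.
try:
    check_embedding_dim([[0.0] * 768], DECLARED_DIM)
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)
```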
Q: What about cost optimization?
A: Daft's lazy evaluation and automatic batching minimize compute costs. GPU resources are allocated only when needed, and the system optimizes memory usage automatically.