Create Knowledge for Your Models - Document Processing

Learn how to convert documents into knowledge for your AI applications. Process PDF files, including their images and tables, into structured data.

Updated Jul 15, 2025 · 23 min read

Converting complex documents like PDFs into knowledge that AI models can use effectively is harder than it looks. Flawed document processing is frequently the hidden culprit behind poor AI performance, yet the model often takes the blame. Even sophisticated tools like Optical Character Recognition (OCR) and Vision Language Models (VLMs) aren't perfect [1]: they can misinterpret layouts, miss text, struggle with tables, and generate inaccurate descriptions for images, leading to "garbage in" for your AI system.

This tutorial guides you through building a document processing pipeline designed to address some of these issues, using only tools that run locally. We'll use Docling [2] to convert PDFs into structured Markdown, with OCR via RapidOCR [3] and automated image descriptions from SmolVLM [4]. Throughout, we will stress visually inspecting the conversion output so that errors are caught early. You'll learn how to go beyond simple text extraction by integrating visual context and refining the structure.
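As a rough preview of the conversion step, the following is a minimal sketch of how Docling might be configured for OCR and picture descriptions. It assumes a recent Docling release with the RapidOCR extra installed; the function name `convert_pdf_to_markdown` is made up for this illustration, and option names may differ between Docling versions, so treat it as a starting point rather than the tutorial's exact code.

```python
from pathlib import Path


def convert_pdf_to_markdown(pdf_path: str) -> str:
    """Convert a PDF to Markdown with OCR and picture descriptions enabled."""
    # Imported lazily so this module can be loaded without Docling installed.
    # Assumption: these names match the installed Docling version.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import (
        PdfPipelineOptions,
        RapidOcrOptions,
        smolvlm_picture_description,
    )
    from docling.document_converter import DocumentConverter, PdfFormatOption

    options = PdfPipelineOptions()
    options.do_ocr = True                      # run OCR on scanned or image-only text
    options.ocr_options = RapidOcrOptions()    # use RapidOCR as the OCR engine
    options.do_table_structure = True          # recover table rows and columns
    options.generate_picture_images = True     # keep picture crops for inspection
    options.do_picture_description = True      # describe pictures with a VLM
    options.picture_description_options = smolvlm_picture_description

    converter = DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=options)}
    )
    result = converter.convert(Path(pdf_path))
    return result.document.export_to_markdown()
```

Calling `convert_pdf_to_markdown("report.pdf")` (a placeholder file name) returns the Markdown as a string. The first run typically downloads model weights, after which everything runs locally.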

Furthermore, we'll explore strategies for transforming the processed text into genuinely useful knowledge components. This involves using Large Language Models (LLMs) first for semantic chunking (breaking the document into meaningful sections) and then for contextual enrichment (adding summaries that situate each chunk within the document's broader narrative). This step-by-step process turns raw documents into high-quality, context-aware inputs ready for downstream AI applications like Retrieval-Augmented Generation (RAG).
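To make the chunk-then-enrich idea concrete, here is a minimal sketch using only the Python standard library. It swaps in a simple heading-based splitter as a stand-in for LLM-based semantic chunking, and it takes the LLM as a plain callable (`llm`) so that any local model can be plugged in. The names `Chunk`, `split_markdown_by_headings`, `build_context_prompt`, and `enrich_chunks`, as well as the prompt wording, are assumptions made for this illustration.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Chunk:
    heading: str
    text: str


def split_markdown_by_headings(markdown: str) -> list[Chunk]:
    """Simple baseline: start a new chunk at every Markdown heading."""
    chunks: list[Chunk] = []
    heading, lines = "", []
    for line in markdown.splitlines():
        if re.match(r"#{1,6}\s", line):
            # Close the previous chunk, skipping empty ones.
            if any(l.strip() for l in lines):
                chunks.append(Chunk(heading, "\n".join(lines).strip()))
            heading, lines = line.lstrip("#").strip(), []
        else:
            lines.append(line)
    if any(l.strip() for l in lines):
        chunks.append(Chunk(heading, "\n".join(lines).strip()))
    return chunks


def build_context_prompt(document: str, chunk: Chunk) -> str:
    """Ask the model to situate one chunk within the whole document."""
    return (
        f"<document>\n{document}\n</document>\n\n"
        f"<chunk>\n{chunk.text}\n</chunk>\n\n"
        "Write one or two sentences that situate this chunk within the "
        "overall document. Answer with the context only."
    )


def enrich_chunks(
    document: str, chunks: list[Chunk], llm: Callable[[str], str]
) -> list[str]:
    """Prefix each chunk with an LLM-generated contextual summary."""
    enriched = []
    for chunk in chunks:
        context = llm(build_context_prompt(document, chunk)).strip()
        enriched.append(f"{context}\n\n{chunk.text}")
    return enriched
```

Passing the model as a callable keeps the sketch independent of any particular LLM client and makes it easy to try out with a stub function before connecting a real model. Note that the splitter treats any line starting with `#` as a heading, so it would also split on comment lines inside code blocks.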

Tutorial Goals

  • Understand document processing pipeline steps
  • Convert PDF to Markdown using Docling
  • Visually inspect document processing output
  • Describe images using Vision Language Models (VLMs)
  • Implement simple and LLM-based document chunking strategies
  • Enrich chunks with LLM-generated contextual summaries

Setup

References

Footnotes

  1. How do open source VLMs perform at OCR

  2. Docling

  3. RapidOCR

  4. SmolVLM on HuggingFace

  5. Evaluating Chunking Strategies for Retrieval

  6. Introducing Contextual Retrieval