PDF extraction — Pith glossary


PDF extraction is the process of turning a PDF — a format designed for fixed visual layout, not for data — into structured, searchable text, typically combining optical character recognition (OCR) with layout analysis to recover reading order, headings, tables, and columns.

Why it matters

A PDF is a description of where ink goes on a page, not a record of what the document *says*. That makes extraction deceptively hard. A scanned PDF without an embedded text layer has no text at all, only an image, so OCR has to read the pixels. A born-digital PDF has text but, unless it was explicitly tagged, no structure: a two-column layout can extract as interleaved nonsense, a table can collapse into a jumble of numbers, and headers and footnotes can splice into the body mid-sentence. Reading order, which a human infers at a glance, has to be reconstructed.
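
To make the two-column failure concrete, here is a minimal sketch of reading-order reconstruction. It assumes words arrive as bounding boxes, that the page has exactly two columns, and that the gutter sits at the page midline. The `Word` type, both function names, and the sample page are hypothetical and for illustration only; real extractors detect gutters rather than assuming them.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x0: float   # left edge of the word's bounding box, in page units
    top: float  # distance from the top of the page, in page units


def naive_order(words):
    """Order words top-to-bottom, then left-to-right, ignoring columns."""
    return [w.text for w in sorted(words, key=lambda w: (w.top, w.x0))]


def column_aware_order(words, page_width):
    """Read the whole left column first, then the whole right column.

    Assumption: exactly two columns, split at the page midline.
    """
    mid = page_width / 2
    left = [w for w in words if w.x0 < mid]
    right = [w for w in words if w.x0 >= mid]

    def by_position(w):
        return (w.top, w.x0)

    ordered = sorted(left, key=by_position) + sorted(right, key=by_position)
    return [w.text for w in ordered]


# Hypothetical page, 600 units wide, with two lines in each column.
words = [
    Word("Reading", 50, 100), Word("order", 120, 100),
    Word("is", 50, 120), Word("inferred.", 80, 120),
    Word("Tables", 350, 100), Word("collapse", 420, 100),
    Word("without", 350, 120), Word("structure.", 430, 120),
]

print(" ".join(naive_order(words)))
print(" ".join(column_aware_order(words, page_width=600)))
```

The naive ordering reads straight across the gutter and interleaves the two columns line by line; the column-aware ordering finishes the left column before starting the right one.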

This matters because so much of the material worth keeping lives in PDFs: research papers, regulatory filings, analyst reports, slide decks exported to print. A tool that cannot reliably extract a PDF cannot search it, cite it, or feed it to a model. Extraction is the gate between "I have the file" and "I can use what's in it."

The state of the art has moved from rule-based parsers (PyMuPDF, pdfplumber) toward vision-and-language models (Mistral OCR and similar) that read a page the way a person does, recovering structure and reading order, not just characters. The honest limit is that extraction quality varies with the source: a clean filing extracts near-perfectly; a low-resolution scan of a faxed table does not, and no model fully closes that gap.

How Pith relates

Pith uses OCR (Mistral) to bring PDFs into the cited wiki: the extracted text is summarised, embedded, and synthesised alongside your web bookmarks, with citations back to the file. There is a deliberate boundary — images interpreted by vision are kept *out* of the wiki, because a vision guess is lower-provenance than text the OCR actually read, and the [source-grounding](/glossary/source-grounding) guarantee only holds for claims that trace to real extracted text.
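
The boundary can be pictured as a filter on provenance. The sketch below is an illustration of the idea only, not Pith's actual implementation: the `Chunk` type, the `"ocr_text"` and `"vision_guess"` labels, the `citable` function, and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    source_file: str
    provenance: str  # hypothetical labels: "ocr_text" or "vision_guess"


def citable(chunks):
    """Keep only chunks whose text was actually read from the page."""
    return [c for c in chunks if c.provenance == "ocr_text"]


# Hypothetical sample data: one chunk of extracted text, one vision description.
chunks = [
    Chunk("Revenue rose 12% year on year.", "filing.pdf", "ocr_text"),
    Chunk("A bar chart that appears to show rising revenue.", "filing.pdf", "vision_guess"),
]

for c in citable(chunks):
    print(f"{c.text} [{c.source_file}]")
```

Only the first chunk survives the filter, so every claim that reaches the citation step traces back to text that was extracted rather than inferred from an image.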
