April 2, 2026 · Iridis Engineering
Every RAG tutorial follows the same script: load a PDF, split it into chunks, embed them, store them in a vector database, query with natural language, get answers. It works beautifully on the author’s carefully curated demo corpus. Then you point it at actual enterprise documents and everything falls apart.
The gap between demo RAG and production document intelligence is not incremental. It is architectural. The systems we build to process real corporate documents — scanned contracts, multi-column assessment reports, handwritten-annotated engineering specs, overlapping policy manuals — share almost nothing with the LangChain quickstart tutorial that got your stakeholders excited.
This post is about what goes into the systems that actually work.
Most RAG implementations start with a hidden assumption: your documents are digitally native, single-column, cleanly formatted PDFs with consistent structure. This describes maybe 15% of documents in a typical enterprise corpus.
Here is what real document collections look like:

- Scanned contracts of wildly varying quality, some arriving via fax machines
- Multi-column assessment reports that defeat naive text extraction
- Engineering specs with handwritten annotations in the margins
- Overlapping policy manuals revised across years, with no canonical version
- Forms, tables, and mixed layouts inside a single document
If your extraction pipeline cannot handle all of these, your system does not work. It works on a subset of documents, which means users learn they cannot trust it, which means they stop using it.
Document extraction is where production pipelines are won or lost, and it is almost completely absent from the RAG discourse. The community obsesses over embedding models, vector databases, and retrieval strategies while treating extraction as a solved problem. It is not.
OCR quality variance is the first real problem. Tesseract on a clean scan gives you 98%+ character accuracy. Tesseract on a faxed document from 2011 gives you 85%. That 13% gap is not uniformly distributed — it clusters around proper nouns, numbers, and technical terms. Exactly the tokens your retriever most needs to match. We have seen pipelines where extraction errors in names caused complete retrieval failures on the most important queries.
The extraction decision tree looks roughly like this:

- Digital-native PDF, simple layout → fast traditional text extraction
- Clean scan, simple layout → traditional OCR
- Complex layout (multi-column, tables, forms) → layout analysis plus table extraction
- Low-confidence OCR output or handwritten content → vision model pass
We run a two-pass architecture on most corpora: fast traditional extraction first, then selective vision model passes on pages where the first pass produced low-confidence output or detected layout complexity. This cuts vision model costs by 70–80% while maintaining extraction quality on hard pages.
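The routing step of that two-pass architecture can be sketched as follows. The `Page` fields, the 0.90 confidence threshold, and the split into fast and vision queues are illustrative assumptions, not any specific library's API:

```python
from dataclasses import dataclass

@dataclass
class Page:
    number: int
    text: str
    ocr_confidence: float     # 0.0-1.0, reported by the first-pass extractor
    has_complex_layout: bool  # e.g. multi-column or table-heavy

def needs_vision_pass(page: Page, threshold: float = 0.90) -> bool:
    """Escalate only pages the fast pass handled poorly."""
    return page.ocr_confidence < threshold or page.has_complex_layout

def route_pages(pages: list[Page]) -> tuple[list[Page], list[Page]]:
    """Split pages into (keep fast result, send to vision model)."""
    fast, vision = [], []
    for page in pages:
        (vision if needs_vision_pass(page) else fast).append(page)
    return fast, vision
```

The threshold is the cost lever: raising it sends more pages to the vision model, lowering it accepts more first-pass noise.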
Structured vs. unstructured extraction is the second decision. Some documents have exploitable structure — section numbering, consistent headings, form fields, table-of-contents entries. When structure exists, use it. Parse the heading hierarchy. Extract table-of-contents entries and use them as a structural map. Pull form field labels and values as key-value pairs. This structural metadata is gold for downstream chunking and retrieval.
Fixed-size chunking (split every 512 tokens with 50-token overlap) is the default in every tutorial. It is also the single most common reason production RAG systems give bad answers.
Here is why. A 512-token chunk boundary does not respect the document’s information architecture. It will split a paragraph mid-sentence. It will separate a table header from its data rows. It will put a section title in one chunk and the section content in the next. It will combine the last paragraph of one topic with the first paragraph of an unrelated topic. Every one of these failures degrades retrieval quality.
Semantic chunking attempts to fix this by splitting at topic boundaries detected by embedding similarity. This is better than fixed-size, but it has its own failure modes. Embedding-based boundary detection is noisy — it will sometimes split mid-thought and sometimes fail to split between genuinely different topics. It also discards structural information that the document author intentionally created.
What actually works is hierarchical, structure-aware chunking. If you extracted document structure properly (and you should have), use it:

- Split at section boundaries, never at arbitrary token counts
- Keep tables atomic: the header and its data rows stay in one chunk
- Keep headings attached to the content they introduce
- Fall back to paragraph-boundary splits only inside oversized sections
Each chunk carries metadata: its position in the document hierarchy, its parent section, its page numbers, its document source, and its extracted structural type (narrative, table, list, heading). This metadata is not optional — it drives retrieval filtering, citation, and the consolidation layer.
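A minimal sketch of that per-chunk metadata. The field names (`section_path`, `content_type`, and so on) are assumed for illustration, not a fixed schema:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    doc_id: str
    section_path: list[str]  # e.g. ["3. Procedures", "3.2 Escalation"]
    pages: tuple[int, int]   # first and last page the chunk spans
    content_type: str        # "narrative" | "table" | "list" | "heading"

def section_filter(chunks: list[Chunk], prefix: list[str]) -> list[Chunk]:
    """Retrieval-time metadata filter: keep chunks under a section path."""
    return [c for c in chunks if c.section_path[:len(prefix)] == prefix]
```

The `section_path` is what makes both citation ("Section 3.2, page 14") and scoped retrieval possible downstream.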
This is where most teams realize they are not building a search engine — they are building a knowledge base. And those are different engineering problems.
When you have 12 source documents covering overlapping topics — a policy manual from 2023, an update memo from 2024, two conflicting procedure documents, and eight meeting minutes where decisions were made and revised — you face the consolidation problem. The user does not want 12 partially relevant chunks. They want a single coherent answer that reflects the current state of truth.
Entity resolution is the foundation. “Dr. Sarah Chen,” “S. Chen,” “the attending physician,” and “Dr. Chen (Cardiology)” must resolve to a single entity. Without this, your system retrieves four chunks that all describe the same person’s recommendation but treats them as independent sources. We build entity resolution as a preprocessing step during ingestion, not at query time.
Claim-level deduplication goes further. If three documents all state that “the project deadline is March 15,” you do not need three chunks saying the same thing. But if one says March 15 and another says March 22, that is a conflict, not a duplicate. The distinction matters. We extract atomic claims from chunks during indexing and store them with their source references.
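The duplicate-versus-conflict distinction can be sketched over atomic claims represented as (subject, value, source) triples; the triple representation is an assumption for illustration:

```python
def classify_claims(claims: list[tuple[str, str, str]]) -> dict[str, str]:
    """Each claim is (subject, value, source_doc). Multiple sources agreeing
    on one value are duplicates; different values on the same subject are
    a conflict that must be surfaced, not collapsed."""
    by_subject: dict[str, dict[str, list[str]]] = {}
    for subject, value, source in claims:
        by_subject.setdefault(subject, {}).setdefault(value, []).append(source)

    report = {}
    for subject, values in by_subject.items():
        if len(values) > 1:
            report[subject] = "conflict"
        elif len(next(iter(values.values()))) > 1:
            report[subject] = "duplicate"
        else:
            report[subject] = "unique"
    return report
```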
Temporal reasoning is the hardest piece. When Document A (dated January 2024) says the policy is X and Document B (dated June 2024) says the policy is Y, the correct answer is usually Y — but not always. Sometimes Document A establishes the policy and Document B is an unrelated department’s interpretation. Temporal ordering is necessary but not sufficient. You need document authority metadata: what type of document is this, who issued it, and what is its standing?
We model this as a directed graph where documents have authority relationships and temporal ordering. A board resolution supersedes a departmental memo regardless of date. A later version supersedes an earlier version of the same document. These rules are domain-specific and must be configured per deployment.
This is the hill we will die on: if your system generates a statement it cannot trace to a specific source, page, and passage, it is not a document intelligence system. It is a chatbot that read some documents.
Citation is an architectural concern, not a prompt engineering afterthought. “Please cite your sources” in the system prompt does not produce reliable citations. The model will hallucinate plausible-looking citations, invent page numbers, or attribute claims to the wrong source. We have tested this extensively. Prompt-only citation fails 20–40% of the time depending on the model and query complexity.
What works is constrained generation with a verification layer. The generation pipeline looks like this:

1. Assemble the retrieved context with source identifiers attached to every passage.
2. Generate the response with each claim tagged to the source passage that supports it.
3. Run a verification pass that checks every tagged claim against its cited passage, and rewrites or removes claims that fail.
Step 3 is where most teams cut corners, and it is the step that makes the system trustworthy. The verification pass catches hallucinated citations, misattributed claims, and subtle distortions where the model paraphrased a source inaccurately.
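A sketch of where the verification gate sits. The lexical-overlap check below is a deliberately crude stand-in for the real support check, which would use an entailment model; the function names and threshold are illustrative assumptions:

```python
def claim_supported(claim: str, source_passage: str,
                    min_overlap: float = 0.7) -> bool:
    """Crude lexical support check: fraction of the claim's content words
    (and numbers) present in the cited passage. A production system would
    use an entailment model here; this only shows where the gate sits."""
    words = [w.strip(".,").lower() for w in claim.split()]
    words = [w for w in words if len(w) > 3 or w.isdigit()]
    if not words:
        return False
    hits = sum(w in source_passage.lower() for w in words)
    return hits / len(words) >= min_overlap

def verify(tagged_claims: list[tuple[str, str]],
           sources: dict[str, str]) -> list[tuple[str, str]]:
    """tagged_claims: (claim_text, source_id) pairs. Returns the failures
    that must be rewritten or removed before the response ships."""
    return [(claim, sid) for claim, sid in tagged_claims
            if not claim_supported(claim, sources.get(sid, ""))]
```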
Cost note: the verification layer adds 15–30% to your generation costs. This is non-negotiable for any use case where accuracy matters. If you are building document intelligence for legal, financial, medical, or compliance domains and you skip verification, you are building a liability generator.
“It seems to work pretty well” is not a quality metric. Production document intelligence requires systematic measurement across the full pipeline.
We run a nightly evaluation suite against a held-out test set and trend all metrics over time. Pipeline changes that improve one metric at the expense of another get flagged for review, not auto-deployed.
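The flag-for-review logic can be sketched as a diff between nightly metric snapshots; the metric names and tolerance are assumptions for illustration:

```python
def flag_tradeoffs(previous: dict[str, float], current: dict[str, float],
                   tolerance: float = 0.01) -> dict:
    """Compare two nightly metric snapshots. Any regression beyond the
    tolerance means the change needs human review, not auto-deploy."""
    regressed = [m for m in current if current[m] < previous[m] - tolerance]
    improved = [m for m in current if current[m] > previous[m] + tolerance]
    return {"regressed": regressed, "improved": improved,
            "needs_review": bool(regressed)}
```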
Here is the architecture, end to end:
Ingestion: Document arrives (upload, email attachment, API push, watched S3 bucket) → file type detection → document type classification → route to extraction pipeline.
Extraction: Fast text extraction (digital) or OCR (scanned) → layout analysis → table extraction → structure detection (headings, sections, lists) → vision model pass on low-confidence pages → output: structured document representation with full positional metadata.
Quality gate 1: Extraction confidence score. Documents below threshold get flagged for human review, not silently ingested with bad data.
Chunking: Hierarchical split respecting document structure → metadata attachment (source, pages, section path, content type) → entity extraction and resolution → claim extraction for consolidation layer.
Indexing: Chunks embedded and stored in vector database with full metadata. Entity graph updated. Claim deduplication run. Temporal/authority graph updated.
Query time: User query → query analysis (what type of question, what filters apply) → hybrid retrieval (vector similarity + keyword + metadata filters) → reranking → context assembly (deduplicated, conflict-annotated, temporally ordered).
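One common way to merge the vector and keyword rankings in that hybrid step is reciprocal rank fusion; this is an assumed choice, since hybrid retrieval can fuse scores in several ways:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked result lists (e.g. vector, keyword) by RRF.
    Each document scores 1/(k + rank) per list it appears in; documents
    ranked well by multiple retrievers rise to the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF needs no score normalization across retrievers, which is why it is a popular default for combining cosine similarity with BM25-style keyword scores.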
Generation: Assembled context + query → generation model → raw response with claim-source tags.
Verification: Each tagged claim checked against source passage → failed claims flagged → rewrite or remove → final response with verified citations.
Monitoring: Every stage logs latency, error rates, and quality scores. Extraction failures, low-confidence chunks, verification failures, and consistency drops all trigger alerts. Weekly metric reviews drive pipeline improvements.
Production document intelligence is 70% data engineering and 30% AI. The retrieval model matters less than the extraction quality. The embedding model matters less than the chunking strategy. The generation model matters less than the verification layer.
The teams that build systems users actually trust spend most of their time on the parts that never make it into blog posts or conference talks: extraction heuristics for weird document formats, entity resolution edge cases, citation verification logic, and quality monitoring dashboards that nobody outside the team will ever see.
That is the work. It is not glamorous, and there are no shortcuts. But it is the difference between a demo and a system that a legal team trusts with their case preparation, that a compliance department trusts with their regulatory responses, or that an executive trusts with their due diligence review.
Build the pipeline. Measure everything. Verify every claim. That is document intelligence.