Document Intelligence: Making OCR Work at Scale
Enterprises are already using LLM-powered assistants such as Microsoft Copilot and Google Gemini for document tasks such as summarization and content extraction. An adjuster uses Copilot to summarize a medical narrative. Another uses Gemini to pull the demand amount from a demand letter. However, claims documents can span hundreds or thousands of pages, and insights are only useful when linked back to the original claim in the system of record. Ad hoc LLM extraction provides neither that connection nor that scale.
The OCR Landscape: Selecting for Scale
As a SaaS platform, CLARA Analytics processes millions of documents annually, including documents that regularly span hundreds, or even thousands, of pages. At that volume, OCR solution selection comes down to six factors: accuracy by document type, cost at volume, speed and throughput for very large documents, bounding boxes for citations and provenance, confidence scores for human-in-the-loop quality routing, and deterministic output for reproducibility in regulated workflows. No single solution wins across all six.
Cost per page varies dramatically with the approach. LLM-based vision extraction from vendors like OpenAI or Anthropic runs roughly 0.2 to 2.4 cents per page, depending on vendor and model tier. Google Vision API runs well under a penny. The spread across solutions is more than 10x for the same task, and not all options meet CLARA’s requirements.
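To make that spread concrete, here is a back-of-envelope sketch using the per-page figures above. The 10-million-page annual volume is an illustrative assumption, not our actual throughput.

```python
# Back-of-envelope annual OCR cost at an assumed 10M pages/year.
# Per-page prices are the illustrative figures cited in the text.
ANNUAL_PAGES = 10_000_000  # assumption for illustration only

price_per_page = {
    "Google Vision API": 0.0015,      # ~$1.50 per 1K pages
    "LLM vision (low tier)": 0.002,   # ~0.2 cents per page
    "LLM vision (high tier)": 0.024,  # ~2.4 cents per page
}

for solution, price in price_per_page.items():
    print(f"{solution}: ${price * ANNUAL_PAGES:,.0f}/year")

# Google Vision API: $15,000/year
# LLM vision (low tier): $20,000/year
# LLM vision (high tier): $240,000/year
```

At volume, the same transcription task costs anywhere from utility pricing to a six-figure line item.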
Document complexity also drives the choice. Standard text-heavy PDFs like medical records and claim forms? Neural network OCR, like Google Vision API, handles these at 99%+ accuracy. But documents with illustrations, complex nested tables, or mixed handwriting may need a different approach: multimodal LLMs outperform on visually complex pages.
For CLARA, word and paragraph bounding boxes are not a nice-to-have. They are how we link extracted insights back to exact page locations for citations and annotations in the portal. When an adjuster reviews a demand amount or treatment summary, they can trace it to the source. LLMs do not return bounding boxes, which means any LLM-based extraction requires a separate step to establish provenance.
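As a simplified sketch of what that separate provenance step can look like (the data shapes here are assumptions, not our production schema): given word-level OCR output, an extracted value is matched back to the words, and therefore the page coordinates, that produced it.

```python
from dataclasses import dataclass

@dataclass
class OcrWord:
    text: str
    page: int
    box: tuple  # (x0, y0, x1, y1) in page coordinates

def locate(extracted: str, words: list[OcrWord]) -> list[OcrWord]:
    """Find the consecutive OCR words that spell out an extracted value,
    so the value can be cited to exact page locations."""
    tokens = extracted.split()
    for i in range(len(words) - len(tokens) + 1):
        window = words[i : i + len(tokens)]
        if [w.text for w in window] == tokens:
            return window  # each word carries page + box = provenance
    return []

# Usage: link an LLM-extracted demand amount back to its source page.
words = [OcrWord("demand", 47, (50, 100, 120, 115)),
         OcrWord("of", 47, (125, 100, 145, 115)),
         OcrWord("$250,000", 47, (150, 100, 230, 115))]
hits = locate("$250,000", words)
print([(w.page, w.box) for w in hits])  # [(47, (150, 100, 230, 115))]
```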
The table below breaks down the current landscape across these six factors.
| Solution | Type | Cost/1K Pages | Speed/Page | Confidence Scores | Bounding Boxes | Document Complexity | Best For |
|---|---|---|---|---|---|---|---|
| Google Vision API | Neural network OCR | ~$1.50 | ms | Yes | Yes | Printed text 99%+. Clean handwriting 80-95%. Struggles with multi-column layouts and spatial arrangement. | Bulk transcription at scale |
| Google Document AI | GenAI-powered OCR | ~$3-10 | ms-sec | Yes | Yes | Forms, tables, key-value pairs, checkboxes. Math OCR add-on. Handwriting in 50 languages. | Structured forms, tables |
| Gemini Flash (Vertex AI) | LLM vision | ~$3.50 | sec | No | No | Strong on complex layouts, illustrations, nested tables. Understands spatial meaning. Non-deterministic output. | Complex layouts, interpretation |
| GPT-4.1 Mini (OpenAI) | LLM vision (mini) | ~$2.80 | sec | No | No | Handles mixed content, handwriting, multi-column. Weaker on dense data tables vs. flagship. | Cost-effective LLM extraction |
| GPT-5 Mini (OpenAI) | LLM vision (mini) | ~$2.70 | sec | No | No | Strong on mixed content, handwriting, embedded images. Better reasoning over layout than 4.1 Mini. | Balanced cost and quality |
| Claude Haiku (Anthropic) | LLM vision (mini) | ~$6 | sec | No | No | Handles mixed content, handwriting. Less capable on complex nested tables than flagship Sonnet. | Fast, budget LLM extraction |
| Mistral OCR 3 | Specialized LLM | ~$1-2 | sec | No | Limited | Claims strong table and handwriting support. Independent tests show mixed results on complex layouts, near-zero checkbox detection. | Budget extraction |
After evaluating these options across cost, accuracy, speed, and claims document extraction requirements, we arrived at a two-phase approach: Google Vision API for high-volume transcription, followed by targeted LLM interpretation on the pre-extracted text. This separation keeps “transcription” OCR costs at utility pricing while focusing LLM spend on the “interpretation” tasks that actually drive the most value.
OCR for Transcription. LLMs for Interpretation.
LLMs are remarkably powerful at document understanding, so the instinct is to reach for one for every document task. But when you are building a multi-tenant SaaS platform processing millions of documents, cost per page is a unit economics constraint, not an optimization exercise. Industry research is converging on a more nuanced answer, and it matches what we learned: the best document intelligence systems use AI at every layer, but an LLM is not always the right AI.
Our approach is hybrid: Google Vision OCR for transcription, foundation LLMs for interpretation downstream.
Google Vision API uses specialized neural networks trained for character recognition, not generative LLMs. The difference matters: deterministic output, word-level bounding boxes with confidence scores, millisecond processing, utility pricing. We embed the extracted text back into otherwise non-searchable PDFs, enabling portal search across every document.
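A minimal sketch of that word-level output, using the Google Cloud Vision Python client on a single page image (multi-page PDFs go through the async batch file API, omitted here for brevity):

```python
from google.cloud import vision

client = vision.ImageAnnotatorClient()

# Single page image for brevity; multi-page PDFs would go through
# the async_batch_annotate_files API instead.
with open("page.png", "rb") as f:
    image = vision.Image(content=f.read())

response = client.document_text_detection(image=image)

for page in response.full_text_annotation.pages:
    for block in page.blocks:
        for paragraph in block.paragraphs:
            for word in paragraph.words:
                text = "".join(s.text for s in word.symbols)
                verts = [(v.x, v.y) for v in word.bounding_box.vertices]
                # Word-level confidence + bounding box: the two things
                # LLM vision extraction does not return.
                print(text, round(word.confidence, 2), verts)
```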
The LLM work is targeted interpretation. Once text is extracted, we go after specific elements: legal field extraction (demand amounts, deadline dates, attorney names), medical document analysis (date of service, facility, treatment triples), visit-level summarization, and litigation risk signals. Each is a distinct pipeline stage running against pre-extracted text, not raw pages.
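Here is a hedged sketch of one such stage: legal field extraction over pre-extracted OCR text. The model choice, prompt, and field names are illustrative, not our production configuration.

```python
import json
from openai import OpenAI

client = OpenAI()

def extract_legal_fields(ocr_text: str) -> dict:
    """Targeted interpretation: pull specific legal fields from text
    already transcribed by OCR, not from raw page images."""
    response = client.chat.completions.create(
        model="gpt-4.1-mini",  # mini tier from the table above
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": ("Extract demand_amount, deadline_date, and "
                         "attorney_name from the demand letter text. "
                         "Return JSON; use null for missing fields.")},
            {"role": "user", "content": ocr_text},
        ],
    )
    return json.loads(response.choices[0].message.content)
```

Because the input is already text, each stage stays cheap, parallelizable, and independent of how the pages were transcribed.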
The value is not in transcribing a PDF. That is commodity work. The value is in finding the physician’s note on page 47 that changes the trajectory of a claim.
At our volume, vendor and model selection are critical cost levers. We burn billions of LLM tokens annually and benchmark foundation models from OpenAI, Google, and Anthropic on our real-world adjuster tasks. The most expensive reasoning model is rarely the best choice. One detail practitioners overlook: LLMs lack character-level confidence scores. For regulated workflows requiring human review of low-confidence extractions, that gap matters.
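A minimal sketch of the confidence-based routing that gap rules out for pure LLM extraction; the threshold here is an assumed tuning parameter, not our production value.

```python
REVIEW_THRESHOLD = 0.85  # assumed tuning parameter, set per document type

def route_extraction(field: str, value: str, ocr_confidences: list[float]):
    """Route a field to human review when the OCR words backing it fall
    below the confidence threshold. LLM vision output has no equivalent
    per-word score to route on."""
    if not ocr_confidences or min(ocr_confidences) < REVIEW_THRESHOLD:
        return {"field": field, "value": value, "status": "human_review"}
    return {"field": field, "value": value, "status": "auto_accepted"}

print(route_extraction("demand_amount", "$250,000", [0.99, 0.97, 0.91]))
# {'field': 'demand_amount', 'value': '$250,000', 'status': 'auto_accepted'}
```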
For CLARA Analytics’ architecture, every layer is independently swappable. Upgrade the LLM without touching OCR. Swap the search strategy without reprocessing documents. That modularity compounds over time, because the landscape keeps shifting.
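One way to picture that modularity, as a sketch with illustrative interface names: each stage sits behind a narrow contract, so implementations swap without touching their neighbors.

```python
from typing import Protocol

class Transcriber(Protocol):
    def transcribe(self, pdf_bytes: bytes) -> str: ...

class Interpreter(Protocol):
    def extract(self, text: str) -> dict: ...

class Pipeline:
    """Each layer is swappable behind its interface: upgrade the LLM
    interpreter without touching OCR, and vice versa."""
    def __init__(self, ocr: Transcriber, llm: Interpreter):
        self.ocr, self.llm = ocr, llm

    def process(self, pdf_bytes: bytes) -> dict:
        return self.llm.extract(self.ocr.transcribe(pdf_bytes))
```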
The Takeaway
We arrived at this architecture years ago out of necessity: specialized OCR for transcription at scale, LLMs for targeted interpretation, each layer independently swappable. We did it because the math demanded it, but cost turned out to be just one of the reasons it works.
Separating OCR from LLM interpretation gives us bounding boxes for citations and annotations. Confidence scores for automated quality routing. Deterministic output for audit trails. Parallel processing for 300-page medical records in seconds instead of minutes. And tens of millions of pages at under a penny each for transcription.
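As a rough sketch of that parallelism (with `ocr_page` standing in for the per-page Vision call): pages are independent, so a large record fans out across workers.

```python
from concurrent.futures import ThreadPoolExecutor

def ocr_page(page_bytes: bytes) -> str:
    """Stand-in for the per-page Google Vision call."""
    ...

def transcribe(pages: list[bytes], workers: int = 32) -> list[str]:
    # Pages are independent, so a 300-page record OCRs in roughly the
    # time of its slowest batch of pages, not the sum of all pages.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ocr_page, pages))
```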
The interpretation layer is where the value lives, not the transcription. And by keeping them separate, we can upgrade either independently as the landscape shifts.
The industry research now validates our approach. The OmniAI benchmark, Vellum’s extraction framework, and Cradl AI’s production analysis all converge on the same architecture we run in production. Even as LLM-based extraction improves and new models appear on the Hugging Face leaderboards quarterly, the fundamental advantages of specialized OCR for transcription remain: bounding boxes, confidence scores, deterministic output, and parallel processing at scale. What was once a pragmatic decision is now industry consensus.
Claims documents can span thousands of pages of legal demands, medical records, and bills. An adjuster pasting documents into Copilot gets a summary that lives nowhere, connects to nothing, and has no provenance. A purpose-built pipeline extracts what matters, summarizes legal and medical severity, surfaces time-sensitive deadlines, and cites every insight to the exact page and paragraph, linked to the claim in the system of record. Transcription at utility cost. Interpretation with provenance. Each layer is independently upgradeable. That is the architecture that scales.
References
- OmniAI OCR Benchmark: 1,000 documents, 10+ providers (https://getomni.ai/blog/ocr-benchmark)
- Vellum, “Document Data Extraction: LLMs vs OCRs” (https://www.vellum.ai/blog/document-data-extraction-llms-vs-ocrs)
- Cradl AI, “Using LLMs for OCR and PDF Parsing” (https://www.cradl.ai/posts/llm-ocr)
- Reducto, “Mistral OCR vs. Gemini Flash 2.0” (https://reducto.ai/blog/lvm-ocr-accuracy-mistral-gemini)