Cal Mercer
I've built document parsers for tax forms, bank statements, and invoices. None of them prepared me for Explanation of Benefits documents.
EOBs are the documents your health insurance sends after a medical visit. They explain what was billed, what insurance paid, and what you owe. Simple concept. Absolute nightmare to parse.
Here's why - and how we eventually cracked it.
Every insurance company formats EOBs differently. Not just "slightly different layouts" - completely different information hierarchies, terminology, and structures.
Blue Cross puts the patient responsibility at the top.
Aetna buries it in a table on page 2.
UnitedHealthcare uses cryptic codes that require a separate decoder ring.
Kaiser somehow makes it even more confusing.
And that's just the major payers. There are 900+ health insurance companies in the US, each with their own EOB format.
We tried Tesseract. It read the text fine but had no concept of what the text meant. To a regex, lines like "Amount Billed: $450.00" and "Your Responsibility: $450.00" look the same - but the first is informational and the second is what you actually owe.
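To make that concrete, here is a small Python sketch. The pattern and the sample lines are illustrative, not our production code. A regex that pulls "label: $amount" pairs matches both lines equally well, and nothing in the match says which amount the patient owes.

```python
import re

# Illustrative OCR output: same amount, different meanings.
ocr_lines = [
    "Amount Billed: $450.00",
    "Your Responsibility: $450.00",
]

# A typical "label: $amount" pattern. It captures text, not meaning.
AMOUNT_RE = re.compile(r"^(?P<label>[^:]+):\s*\$(?P<amount>\d+(?:\.\d{2})?)$")

matches = [AMOUNT_RE.match(line).groupdict() for line in ocr_lines]

for m in matches:
    print(m["label"], "->", m["amount"])
```

You could key on the label text instead, but every payer words those labels differently, which leads straight back to per-payer templates.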
We tried template matching. It worked for about 3 weeks until Blue Cross updated their EOB layout and broke everything.
We tried training custom models. The dataset problem is brutal - EOBs contain PHI (Protected Health Information), so you can't just scrape thousands of examples from the internet.
The solution came from treating this as a visual understanding problem, not a text extraction problem.
Modern vision models (Claude, GPT-4V) can look at an EOB and actually understand it the way a human does. They see the layout, recognize the patterns, and extract meaning.
But raw LLM output is unreliable. You need structured output with validation.
Here's the architecture that works:
EOB Image → Vision LLM → JSON Schema Validation → Structured Output
The key is the schema. We define exactly what fields we expect:
{
  "patient": {
    "name": "string",
    "member_id": "string",
    "group_number": "string"
  },
  "claim": {
    "claim_number": "string",
    "date_of_service": "date",
    "provider": "string"
  },
  "services": [{
    "cpt_code": "string",
    "description": "string",
    "billed_amount": "number",
    "allowed_amount": "number",
    "insurance_paid": "number",
    "patient_responsibility": "number",
    "adjustment_reason": "string"
  }],
  "summary": {
    "total_billed": "number",
    "total_allowed": "number",
    "total_insurance_paid": "number",
    "total_patient_responsibility": "number"
  }
}
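A minimal sketch of the validation step in Python, assuming the shorthand type names used in the schema ("string", "number", "date") and assuming dates arrive as ISO 8601 strings. The `validate` helper and the abbreviated `SCHEMA` are hypothetical; a real pipeline might use a full JSON Schema library instead.

```python
from datetime import date

# Abbreviated copy of the schema (hypothetical subset for the example).
SCHEMA = {
    "claim": {"claim_number": "string", "date_of_service": "date"},
    "services": [{"cpt_code": "string", "billed_amount": "number"}],
}

def _is_iso_date(value):
    # Assumption: dates are ISO 8601 strings such as "2024-03-15".
    if not isinstance(value, str):
        return False
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False

CHECKS = {
    "string": lambda v: isinstance(v, str),
    # bool is a subclass of int in Python, so exclude it explicitly.
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "date": _is_iso_date,
}

def validate(data, schema, path="$"):
    """Return a list of human-readable errors; an empty list means valid."""
    if isinstance(schema, dict):
        if not isinstance(data, dict):
            return [f"{path}: expected object"]
        errors = []
        for key, sub in schema.items():
            if key not in data:
                errors.append(f"{path}.{key}: missing")
            else:
                errors.extend(validate(data[key], sub, f"{path}.{key}"))
        return errors
    if isinstance(schema, list):
        if not isinstance(data, list):
            return [f"{path}: expected array"]
        errors = []
        for i, item in enumerate(data):
            errors.extend(validate(item, schema[0], f"{path}[{i}]"))
        return errors
    # Leaf: schema is one of the shorthand type names.
    return [] if CHECKS[schema](data) else [f"{path}: expected {schema}"]
```

Collecting every error, rather than stopping at the first, gives a retry something specific to say about what went wrong.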
The vision model extracts. The schema validates. Invalid responses get retried with more specific prompting.
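The extract-validate-retry loop might look roughly like this in Python. `call_model` and `validate` are hypothetical callables supplied by the caller (this is not any vendor's SDK), and the prompt wording and the attempt limit are assumptions made for the sketch.

```python
import json

BASE_PROMPT = "Extract the EOB fields as JSON matching the schema."

def extract_with_retries(image_bytes, call_model, validate, max_attempts=3):
    """Ask the model for JSON, validate it, and retry with the errors fed back.

    call_model(image_bytes, prompt) -> str   (hypothetical vision-model wrapper)
    validate(data) -> list[str]              (empty list means valid)
    """
    prompt = BASE_PROMPT
    errors = []
    for _ in range(max_attempts):
        raw = call_model(image_bytes, prompt)
        try:
            data = json.loads(raw)
            errors = validate(data)
        except json.JSONDecodeError as exc:
            errors = [f"response was not valid JSON: {exc.msg}"]
        if not errors:
            return data
        # Make the next prompt more specific by naming what failed.
        prompt = (
            BASE_PROMPT
            + " Your previous answer had these problems: "
            + "; ".join(errors)
        )
    raise ValueError(f"no valid extraction after {max_attempts} attempts: {errors}")
```

Capping the attempts matters: a document the model cannot read should surface as a failure for review, not loop forever.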
After 6 months of iteration:
The remaining errors are mostly edge cases: handwritten adjustments, multi-page EOBs where totals don't match line items, and the occasional payer that seems to actively obfuscate information.
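One of those edge cases, totals that disagree with line items, can at least be detected mechanically. A Python sketch, assuming the field names from the schema and a hypothetical one-cent tolerance. `Decimal` is used so currency sums don't pick up float rounding.

```python
from decimal import Decimal

# Maps each summary field to the per-service field it should be the sum of.
TOTAL_FIELDS = {
    "total_billed": "billed_amount",
    "total_allowed": "allowed_amount",
    "total_insurance_paid": "insurance_paid",
    "total_patient_responsibility": "patient_responsibility",
}

def find_total_mismatches(eob, tolerance=Decimal("0.01")):
    """Return {summary_field: (stated, computed)} for totals that disagree."""
    mismatches = {}
    summary = eob.get("summary", {})
    for total_key, line_key in TOTAL_FIELDS.items():
        if total_key not in summary:
            continue  # nothing stated, nothing to compare
        stated = Decimal(str(summary[total_key]))
        computed = sum(
            (Decimal(str(s[line_key])) for s in eob.get("services", []) if line_key in s),
            Decimal("0"),
        )
        if abs(stated - computed) > tolerance:
            mismatches[total_key] = (stated, computed)
    return mismatches
```

A mismatch doesn't say which number is wrong, only that the document needs a second look.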
We wrapped this into an API. Upload an EOB image, get structured JSON back.
curl -X POST https://eobextractor.com/api/parse \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@eob.pdf"
Response:
{
  "patient": {
    "name": "Jane Smith",
    "member_id": "XYZ123456789"
  },
  "services": [{
    "cpt_code": "99213",
    "description": "Office visit, established patient",
    "billed_amount": 150.00,
    "allowed_amount": 89.50,
    "insurance_paid": 71.60,
    "patient_responsibility": 17.90
  }],
  "summary": {
    "total_patient_responsibility": 17.90
  }
}
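If you'd rather call the API from Python than curl, a rough standard-library equivalent is sketched below. The endpoint, the `file` form field, and the bearer header come from the curl command; the helper names, the `application/octet-stream` part type, and the absence of error handling are assumptions made for this sketch.

```python
import json
import os
import urllib.request
import uuid

API_URL = "https://eobextractor.com/api/parse"  # endpoint from the curl example

def build_parse_request(file_bytes, filename, api_key):
    """Build a multipart/form-data POST with a single "file" field."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        # Assumption: the server accepts a generic binary content type.
        b"Content-Type: application/octet-stream\r\n\r\n",
        file_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )

def parse_eob(path, api_key):
    """Upload one EOB file and return the decoded JSON response."""
    with open(path, "rb") as f:
        request = build_parse_request(f.read(), os.path.basename(path), api_key)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Usage would be along the lines of `parse_eob("eob.pdf", "YOUR_API_KEY")["summary"]["total_patient_responsibility"]`.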
Three main use cases emerged:
The healthcare billing space is massive and still surprisingly manual. We're seeing customers process thousands of EOBs per month that were previously hand-keyed.
If you're building anything that touches healthcare billing, EOB Extractor has a free tier. Upload a few EOBs and see the output.
The parser handles most payer formats out of the box. If you find one it struggles with, we'll add support - we're still improving coverage.
This is part of a series on document parsing. Next up: why bank statements are deceptively complex.