Building an EOB Parser: Why Healthcare Documents Are the Hardest to Parse

Cal Mercer

I've built document parsers for tax forms, bank statements, and invoices. None of them prepared me for Explanation of Benefits documents.

EOBs are the documents your health insurance sends after a medical visit. They explain what was billed, what insurance paid, and what you owe. Simple concept. Absolute nightmare to parse.

Here's why - and how we eventually cracked it.

The Problem with EOBs

Every insurance company formats EOBs differently. Not just "slightly different layouts" - completely different information hierarchies, terminology, and structures.

Blue Cross puts the patient responsibility at the top.
Aetna buries it in a table on page 2.
UnitedHealthcare uses cryptic codes that require a separate decoder ring.
Kaiser somehow makes it even more confusing.

And that's just the major payers. There are 900+ health insurance companies in the US, each with its own EOB format.

Why Traditional OCR Fails

We tried Tesseract. It read the text fine but had no concept of what the text meant. Lines like "Amount Billed: $450.00" and "Your Responsibility: $450.00" look the same to a regex - but one is purely informational, and the other is what you actually owe.
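To make the ambiguity concrete, here is a minimal sketch of the kind of naive pattern we mean. The regex and variable names are illustrative, not our production code:

```python
import re

# A naive pattern: any label, a colon, then a dollar amount.
AMOUNT_LINE = re.compile(r"^(?P<label>[^:]+):\s*\$(?P<amount>[\d,]+\.\d{2})$")

lines = [
    "Amount Billed: $450.00",
    "Your Responsibility: $450.00",
]

matches = [AMOUNT_LINE.match(line).groupdict() for line in lines]

# Both lines match and yield the same amount; only the label differs.
labels = [m["label"] for m in matches]
amounts = {m["amount"] for m in matches}
```

The pattern happily extracts both amounts. Deciding which label means "what you owe" needs a per-payer mapping, and that brings back the one-format-per-insurer problem.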

We tried template matching. It worked for about 3 weeks until Blue Cross updated their EOB layout and broke everything.

We tried training custom models. The dataset problem is brutal - EOBs contain PHI (Protected Health Information), so you can't just scrape thousands of examples from the internet.

The Breakthrough: Vision LLMs + Structured Output

The solution came from treating this as a visual understanding problem, not a text extraction problem.

Modern vision models (Claude, GPT-4V) can look at an EOB and actually understand it the way a human does. They see the layout, recognize the patterns, and extract meaning.

But raw LLM output is unreliable. You need structured output with validation.

Here's the architecture that works:

EOB Image → Vision LLM → JSON Schema Validation → Structured Output

The key is the schema. We define exactly what fields we expect:

{
  "patient": {
    "name": "string",
    "member_id": "string",
    "group_number": "string"
  },
  "claim": {
    "claim_number": "string",
    "date_of_service": "date",
    "provider": "string"
  },
  "services": [{
    "cpt_code": "string",
    "description": "string",
    "billed_amount": "number",
    "allowed_amount": "number",
    "insurance_paid": "number",
    "patient_responsibility": "number",
    "adjustment_reason": "string"
  }],
  "summary": {
    "total_billed": "number",
    "total_allowed": "number",
    "total_insurance_paid": "number",
    "total_patient_responsibility": "number"
  }
}

The vision model extracts. The schema validates. Invalid responses get retried with more specific prompting.
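The validate-and-retry loop can be sketched roughly like this. It is a simplified illustration, not our exact implementation: `call_model` is a hypothetical stand-in for whatever function sends the EOB image plus a prompt to the vision model, the prompt strings are made up, and the hand-rolled checks cover only part of the schema (a JSON Schema library would do the same job).

```python
import json
from typing import Callable

NUMBER = (int, float)

# Required per-service fields and their types, mirroring part of the schema.
SERVICE_FIELDS = {
    "cpt_code": str,
    "description": str,
    "billed_amount": NUMBER,
    "allowed_amount": NUMBER,
    "insurance_paid": NUMBER,
    "patient_responsibility": NUMBER,
}


def validate_eob(data: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    services = data.get("services")
    if not isinstance(services, list) or not services:
        return ["services: expected a non-empty list"]
    errors = []
    for i, svc in enumerate(services):
        if not isinstance(svc, dict):
            errors.append(f"services[{i}]: expected an object")
            continue
        for field, expected in SERVICE_FIELDS.items():
            value = svc.get(field)
            # bool is a subclass of int, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, expected):
                errors.append(f"services[{i}].{field}: missing or wrong type")
    summary = data.get("summary")
    total = summary.get("total_patient_responsibility") if isinstance(summary, dict) else None
    if isinstance(total, bool) or not isinstance(total, NUMBER):
        errors.append("summary.total_patient_responsibility: missing or wrong type")
    return errors


def extract_with_retry(call_model: Callable[[str], str], max_attempts: int = 3) -> dict:
    """Call the model until its output validates, feeding errors back each time."""
    base_prompt = "Extract the EOB fields as JSON."  # illustrative prompt
    prompt = base_prompt
    errors: list[str] = []
    for _ in range(max_attempts):
        raw = call_model(prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as exc:
            errors = [f"not valid JSON: {exc.msg}"]
        else:
            if isinstance(data, dict):
                errors = validate_eob(data)
            else:
                errors = ["top level: expected an object"]
            if not errors:
                return data
        # Name the specific failures so the next attempt is more targeted.
        prompt = base_prompt + " Fix these problems: " + "; ".join(errors)
    raise ValueError(f"no valid extraction after {max_attempts} attempts: {errors}")
```

Feeding the validator's error list back into the prompt is what "more specific prompting" means here: the retry names the exact fields that failed instead of repeating the same request.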

Real-World Results

After 6 months of iteration:

  • 94% accuracy on patient responsibility amounts (the number that matters most)
  • 89% accuracy on individual service line items
  • Works across 50+ payer formats without template updates

The remaining errors are mostly edge cases: handwritten adjustments, multi-page EOBs where totals don't match line items, and the occasional payer that seems to actively obfuscate information.

The API

We wrapped this into an API. Upload an EOB image, get structured JSON back.

curl -X POST https://eobextractor.com/api/parse \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@eob.pdf"
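If you'd rather call it from Python, the same request can be built with only the standard library. This is a rough equivalent of the curl command, not an official client: the helper name `build_parse_request` is ours, and the `application/octet-stream` part type is an assumption.

```python
import urllib.request
import uuid

API_URL = "https://eobextractor.com/api/parse"


def build_parse_request(file_bytes: bytes, filename: str, api_key: str) -> urllib.request.Request:
    """Build a multipart/form-data POST with one form field named "file"."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        # Assumed part type; the server may accept or expect something else.
        b"Content-Type: application/octet-stream\r\n\r\n",
        file_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": f"multipart/form-data; boundary={boundary}",
    }
    return urllib.request.Request(API_URL, data=body, headers=headers, method="POST")
```

To send it, read the file's bytes, pass the resulting request to `urllib.request.urlopen`, and decode the body with `json.load`.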

Response:

{
  "patient": {
    "name": "Jane Smith",
    "member_id": "XYZ123456789"
  },
  "services": [{
    "cpt_code": "99213",
    "description": "Office visit, established patient",
    "billed_amount": 150.00,
    "allowed_amount": 89.50,
    "insurance_paid": 71.60,
    "patient_responsibility": 17.90
  }],
  "summary": {
    "total_patient_responsibility": 17.90
  }
}
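Once you have structured numbers, cheap arithmetic checks become possible, which helps flag the "totals don't match line items" cases. Here is a small sketch you could run on a response like the one above. The function names are hypothetical, and the per-line rule (allowed = insurance paid + patient responsibility) is an assumption that happens to hold for this sample; real EOBs may break it for legitimate reasons, so treat a hit as a flag for review, not an error.

```python
from decimal import Decimal


def money(value) -> Decimal:
    # Go through str() so 17.9 becomes Decimal("17.9") rather than a long
    # binary-float expansion, then normalize to cents.
    return Decimal(str(value)).quantize(Decimal("0.01"))


def consistency_problems(eob: dict) -> list[str]:
    """Return descriptions of arithmetic mismatches; empty means consistent."""
    problems = []
    line_total = Decimal("0.00")
    for i, svc in enumerate(eob["services"]):
        allowed = money(svc["allowed_amount"])
        paid = money(svc["insurance_paid"])
        owed = money(svc["patient_responsibility"])
        line_total += owed
        # Assumed identity: what the payer allows is split between payer and patient.
        if paid + owed != allowed:
            problems.append(f"services[{i}]: {paid} + {owed} != allowed {allowed}")
    summary_total = money(eob["summary"]["total_patient_responsibility"])
    if line_total != summary_total:
        problems.append(f"summary: line items sum to {line_total}, summary says {summary_total}")
    return problems


sample = {
    "services": [{
        "cpt_code": "99213",
        "billed_amount": 150.00,
        "allowed_amount": 89.50,
        "insurance_paid": 71.60,
        "patient_responsibility": 17.90,
    }],
    "summary": {"total_patient_responsibility": 17.90},
}

problems = consistency_problems(sample)
```

Using `Decimal` instead of floats keeps cent-level comparisons exact, so a one-cent rounding artifact never shows up as a false mismatch.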

Who's Using This?

Three main use cases emerged:

  1. Patient advocacy apps - Help people understand what they actually owe
  2. Healthcare billing teams - Reconcile EOBs against claims at scale
  3. HSA/FSA platforms - Auto-categorize healthcare expenses

The healthcare billing space is massive and still surprisingly manual. We're seeing customers process thousands of EOBs per month that were previously hand-keyed.

Try It

If you're building anything that touches healthcare billing, EOB Extractor has a free tier. Upload a few EOBs and see the output.

The parser handles most payer formats out of the box. If you find one it struggles with, we'll add support - we're still improving coverage.


This is part of a series on document parsing. Next up: why bank statements are deceptively complex.