One image schema for four VLM providers: we stopped reformatting payloads

Marco Rinaldi

TL;DR: We reconstruct grayscale frames from event-camera data and send them to vision-language models for weak scene labels. Four providers, four slightly different ways to attach an image, and our payload-building code had grown three branches. Putting Bifrost in front of the VLMs gave us one OpenAI-compatible image schema. Here's the honest version, including where LiteLLM and Portkey do it better.

So, the thing is, on my team at Prophesee we don't have classic RGB frames. We have event streams, and for some of our offline tooling we reconstruct short grayscale frames from those events so a vision-language model can give us a rough scene description. "Person crossing, low light, motion blur on the left." Weak labels. Useful for triaging which clips a human should look at first.

We send those reconstructed frames to four VLMs depending on the job: OpenAI's gpt-4o, Anthropic's Claude, Gemini through Google Vertex, and occasionally Mistral's vision model. Same picture, four annoyingly different request bodies.

The actual pain

Let me give you the full picture here. The OpenAI shape wants an image_url content block, and the URL can be a base64 data URI. Fine. Then you go to Vertex and the structure shifts, the field names shift, and the detail hint you were passing to control token cost stops meaning anything. Anthropic wants its own source object with a media type. None of this is hard. It's just three "if provider == ..." branches in a function that should do one thing.
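To make that concrete, here is a minimal sketch of what such a branching function tends to look like. It is illustrative, not our service code: build_image_part is a hypothetical name, the "vertex" key is just a label, and the field names reflect the providers' public request formats as I understand them, so check the current API references before relying on them.

```python
import base64


def build_image_part(provider: str, png_bytes: bytes, detail: str = "low") -> dict:
    """Return one image content block in the shape each provider expects.

    Hypothetical helper for illustration only. Field names are assumptions
    based on the providers' public docs and may change between API versions.
    """
    b64 = base64.b64encode(png_bytes).decode("ascii")

    if provider == "openai":
        # OpenAI: image_url block; the URL may be a base64 data URI.
        # "detail" is the hint that influences image token cost.
        return {
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}", "detail": detail},
        }

    if provider == "anthropic":
        # Anthropic: image block with its own source object and media type.
        return {
            "type": "image",
            "source": {"type": "base64", "media_type": "image/png", "data": b64},
        }

    if provider == "vertex":
        # Gemini on Vertex: an inline-data part; there is no "detail" field here.
        return {"inlineData": {"mimeType": "image/png", "data": b64}}

    raise ValueError(f"unknown provider: {provider}")
```

Same bytes in, three unrelated dictionaries out. The detail argument is silently ignored for two of the three branches, which is exactly the kind of thing that goes unnoticed until a bill or a batch looks wrong.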

We had maybe 80 lines of payload-shaping code in our Python annotation service. Every time a provider tweaked an API version, something silently broke and a batch of frames came back unlabelled at 2am during an overnight run. Not dramatic. Just paper cuts that add up over a 6-person team.

What we changed

We put Bifrost (an open-source AI gateway written in Go) in front of all four providers. It exposes a single OpenAI-compatible API, including multimodal, so our service now builds exactly one image message and never branches on provider again.

Running it locally was a one-liner:

npx -y @maximhq/bifrost
# or
docker run -p 8080:8080 maximhq/bifrost

Then our call looks identical no matter who serves it. We just change the model string:

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vertex/gemini-1.5-pro",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe the scene. One line."},
        {"type": "image_url",
         "image_url": {"url": "data:image/png;base64,iVBORw0K... "}}
      ]
    }]
  }'

Swap vertex/gemini-1.5-pro for openai/gpt-4o or anthropic/claude-... and the body doesn't move. That's the whole point. The gateway translates to each provider's native multimodal format on the way out. (Streaming and multimodal docs.)
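For a Python service, the same request can be sketched with only the standard library. This is a hedged example rather than our production code: build_payload and label_frame are hypothetical names, the URL is the local endpoint from the curl call, and the response parsing assumes the usual OpenAI chat-completions shape.

```python
import base64
import json
import urllib.request

# Same local gateway endpoint as in the curl example.
GATEWAY_URL = "http://localhost:8080/v1/chat/completions"


def build_payload(model: str, png_bytes: bytes,
                  prompt: str = "Describe the scene. One line.") -> dict:
    """Build the single OpenAI-compatible request body; only `model` varies."""
    b64 = base64.b64encode(png_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }


def label_frame(model: str, png_bytes: bytes, timeout: float = 60.0) -> str:
    """POST one frame to the gateway and return the model's text reply."""
    body = json.dumps(build_payload(model, png_bytes)).encode("utf-8")
    request = urllib.request.Request(
        GATEWAY_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.load(response)
    # Assumption: the reply follows the OpenAI chat-completions response shape.
    return reply["choices"][0]["message"]["content"]
```

Calling label_frame("vertex/gemini-1.5-pro", frame) and label_frame("openai/gpt-4o", frame) sends bodies that differ in the model string and nothing else.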

Two things came along for free that I didn't plan for. First, automatic fallback. If Vertex throws a 503 mid-batch, the request retries on a configured backup model instead of dropping the frame. (Fallbacks docs.) Second, semantic caching. Our reconstructed frames are repetitive, lots of near-identical low-motion clips, so caching on semantic similarity cut a chunk of redundant VLM calls. (Semantic caching docs.)
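If "automatic fallback" sounds abstract, this is roughly the logic you would otherwise write yourself. It is a conceptual sketch only, not Bifrost's implementation or its configuration format: UpstreamError, call_with_fallback and the send callable are all hypothetical names, and the real retry rules live in the gateway's fallback settings.

```python
from typing import Callable, Optional, Sequence, Tuple


class UpstreamError(Exception):
    """Hypothetical error carrying the HTTP status a provider returned."""

    def __init__(self, status: int):
        super().__init__(f"upstream returned {status}")
        self.status = status


def call_with_fallback(models: Sequence[str],
                       send: Callable[[str], str]) -> Tuple[str, str]:
    """Try each model in order and return (model_that_answered, reply).

    Only server-side failures (5xx) trigger the next model; a 4xx is
    re-raised immediately because a bad request would fail on the backup too.
    """
    last_error: Optional[UpstreamError] = None
    for model in models:
        try:
            return model, send(model)
        except UpstreamError as err:
            if err.status < 500:
                raise
            last_error = err
    if last_error is None:
        raise ValueError("no models configured")
    raise last_error
```

With the gateway in place, none of this lives in the annotation service. The frame still gets a label when the primary model has a bad minute, and the client never sees the retry.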

How it compares

I won't pretend Bifrost is the only thing that does this. It isn't.

| Concern | Bifrost | LiteLLM | Portkey |
| --- | --- | --- | --- |
| Unified multimodal API | Yes, OpenAI-compatible | Yes, very mature | Yes |
| Implementation | Go, self-hosted | Python, self-host or lib | Hosted-first, self-host option |
| Provider breadth | 23+ | Largest I've seen | Broad |
| Observability UI | Prometheus metrics | Functional | Strongest dashboards |

LiteLLM has been doing image normalisation longer and its provider list is wider than anyone's. If you're already deep in a Python stack and want a library call rather than a separate service, LiteLLM is genuinely the pragmatic pick. Portkey's hosted dashboards and guardrail tooling are more polished than what we run; for a team that wants observability out of the box without wiring Prometheus, that matters.

We picked Bifrost mostly because it's a single Go binary we self-host next to our annotation service, and the OpenAI-compatible surface meant near-zero changes to client code. Different teams will weigh that differently.

Trade-offs and Limitations

It's a network hop. You're adding a gateway between your service and the provider, so there's a small latency cost and one more thing that can fall over. For an offline labelling pipeline I don't care. For a tight real-time loop you'd measure it first.
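"Measure it first" can be as simple as timing the same call with and without the gateway in the path. Here is a small sketch under stated assumptions: time_calls and summarize are hypothetical helpers, and the callable you pass in is whatever function issues one request in your setup.

```python
import math
import statistics
import time
from typing import Callable, Dict, List


def time_calls(fn: Callable[[], object], runs: int = 20) -> List[float]:
    """Call fn repeatedly and return the wall-clock duration of each call in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples: List[float]) -> Dict[str, float]:
    """Return the median and a nearest-rank 95th percentile of the samples."""
    if not samples:
        raise ValueError("no samples to summarize")
    ordered = sorted(samples)
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {"median_s": statistics.median(ordered), "p95_s": ordered[p95_index]}
```

Run it once against the provider directly and once through the gateway, with the same frame and model, and compare the two summaries. VLM latency is noisy, so a handful of runs tells you little; use enough samples that the difference is not just provider jitter.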

You inherit the gateway's abstraction. If a provider ships a brand-new multimodal parameter, you wait for the gateway to expose it rather than calling the raw API. So far that's been fine for our image-description use, but it's a real constraint if you live on the bleeding edge of one provider's features.

And to be precise: this solved a payload-normalisation and reliability problem. It did nothing for our actual model quality. The VLM still mislabels heavy motion blur, and reconstructing good frames from sparse events is still the hard part of my week. The gateway just stopped me rewriting JSON shapes. If you can't make the upstream model better or the labels cleaner, no amount of plumbing fixes that.

One espresso's worth of setup, a quieter on-call. That trade I'll take.

Further Reading