
Jimmy Wei
My cofounder and I worked infra at Roblox. We tried a bunch of AI SRE tools. None of them worked. Not because the AI was bad — because it had no idea how our systems actually worked.
Here's what I mean.
At Roblox, we had shorthands for everything. Our databases had internal names. Our Redis clusters had acronyms. Our datacenters had conventions that you'd only know if you'd been there a while. Our microservices followed specific templates and naming patterns. We had internal tools for managing configs, checking CI/CD, viewing deployment status — none of which any vendor AI has ever heard of.
When an incident happens, you need to check what changed. That means reading GitHub PRs. Then checking CI/CD to see which deployment in which service caused the change. Or maybe it was a config change — then you need to talk to a completely different internal service. Maybe it was a feature flag. Maybe it was a traffic shift. Each of these lives in a different internal tool.
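To make the "what changed?" step concrete, here is a minimal Python sketch that merges change events from several sources into one timeline just before an incident. Everything in it is hypothetical: the ChangeEvent type, the source labels, and the sample events are made up for illustration. In a real setup each event would come from a call to the corresponding internal tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeEvent:
    source: str    # hypothetical label: "deploy", "config", "feature_flag", ...
    service: str
    summary: str
    at: datetime


def changes_before(events, incident_start, lookback=timedelta(hours=1)):
    """Return events inside the lookback window before the incident, newest first."""
    window_start = incident_start - lookback
    in_window = [e for e in events if window_start <= e.at <= incident_start]
    return sorted(in_window, key=lambda e: e.at, reverse=True)


# Illustrative data only; not taken from any real system.
incident_start = datetime(2024, 1, 1, 12, 0)
events = [
    ChangeEvent("deploy", "checkout", "rolled out a new build", datetime(2024, 1, 1, 11, 40)),
    ChangeEvent("config", "checkout", "raised connection pool size", datetime(2024, 1, 1, 11, 55)),
    ChangeEvent("feature_flag", "search", "enabled new ranking", datetime(2024, 1, 1, 9, 0)),
]

for e in changes_before(events, incident_start):
    print(f"{e.at:%H:%M} [{e.source}] {e.service}: {e.summary}")
```

The merge itself is trivial. The hard part is the one described above: each source is a different internal tool, so something has to know how to query each of them.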
Every AI SRE tool we tried connected to Datadog and called it a day. Ask it about our internal deploy system and it had nothing. Ask it what "chi1" means and it had no idea. Without that context, it's useless.
The other thing people don't talk about: even within the same company, every team's stack is different. The payments team uses different services, different databases, different monitoring than the matchmaking team. What works for one team's investigation workflow doesn't work for another.
A one-size-fits-all AI SRE will always produce generic advice. "Check your logs." "Look at recent deploys." Thanks — that's what I was already going to do.
So the answer is obvious: just build integrations with all your internal tools. Right?
The problem is nobody has time for that. Engineering teams aren't going to spin up MCP servers for every single internal tool from the ground up. They have features to ship. The integration work alone would take months, and by then priorities have shifted.
This is the gap we decided to focus on.
IncidentFox has an AI that researches your Slack history, Confluence docs, codebase, and metrics data to build an internal knowledge base. It figures out what internal tools exist, what they do, how they're called, what your team's shorthands mean — and then auto-generates the integrations.
Instead of months of integration work, teams get something working in hours.
But here's the part I care about most: engineers stay in control of everything.
We're all engineers. We know the feeling of using a tool that's a black box. So we made every aspect of the agent configurable.
The goal isn't to replace engineering judgment. It's to let engineers transfer their domain knowledge — all the tribal knowledge about how systems work, how to debug them, what to check first — into something repeatable and shareable across the organization.
The senior engineer who knows that "when RC2-west alerts fire, check the config service first because it's usually a regional failover issue" — that knowledge shouldn't live only in their head. It should be encoded into the agent so the next person on-call benefits from it too.
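As a rough illustration of what "encoded" could look like, here is a small Python sketch that stores that kind of hint as data and looks it up when an alert fires. This is not IncidentFox's actual configuration format; the RunbookHint type, the glob-style pattern, and the first_checks function are all assumptions made up for this example.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class RunbookHint:
    cluster_pattern: str  # glob matched against the alerting cluster's name
    first_check: str      # what the on-call engineer should look at first
    reason: str           # why, in the words of whoever wrote the hint


# Hypothetical hint list; the one entry mirrors the example in the text above.
HINTS = [
    RunbookHint("RC2-west*", "config service", "usually a regional failover issue"),
]


def first_checks(cluster_name, hints=HINTS):
    """Return every hint whose pattern matches the alerting cluster, in list order."""
    return [h for h in hints if fnmatchcase(cluster_name, h.cluster_pattern)]


for hint in first_checks("RC2-west"):
    print(f"Check the {hint.first_check} first: {hint.reason}")
```

Keeping hints as plain data means they can be reviewed, versioned, and edited like any other config, which fits the goal of keeping engineers in control.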
The AI SRE space is crowded. Everyone connects to the same standard vendors — Datadog, PagerDuty, Grafana — and calls it a day. But that's maybe 30% of the context you need during an incident. The other 70% lives in internal tools, Slack conversations, team-specific knowledge, and organizational context that no vendor integration will ever cover.
That's the gap. Not better AI reasoning. Not better prompts. Better access to the data that actually matters, and tools that let engineers build on top of them instead of being locked into someone else's workflow.
We open sourced the whole thing.
GitHub: github.com/incidentfox/incidentfox (Apache 2.0, fully self-hostable)
Demo Slack: Join our workspace — playground environment with real telemetry, one-click Slack bot install: https://join.slack.com/t/incidentfox/shared_invite/zt-3ojlxvs46-xuEJEplqBHPlymxtzQi8KQ
Website: incidentfox.ai
Would love feedback, especially from teams that have tried to build internal AI tooling. What did you run into?