Utkarsh Bahuguna
This is a submission for the Gemma 4 Challenge: Build with Gemma 4
🏆 An earlier version of this project finished in the top 50 out of 8,000+ teams at the Meta × PyTorch Hackathon (see LinkedIn post). This submission rebuilds it on Gemma 4, with reasoning-quality scoring designed around Gemma 4's native thinking modes.
MIRR is a stateful, RL-compatible environment where a Gemma 4 agent debugs failures in a simulated microservice system the way an on-call SRE would. It pulls logs, queries metrics, walks the service graph, and commits to a root-cause hypothesis under uncertainty.
Here's the problem I wanted to solve. Most "agent benchmarks" today reward only the outcome. Did the agent fix it? Yes / no. That's a terrible signal for incident response, where a good engineer can be right for bad reasons (lucky pattern match), and a great engineer can be wrong for excellent reasons (a sensible hypothesis ruled out by data the team didn't have). If we want LLMs that on-call engineers actually trust at 3 AM, we have to score how they think, not just whether they happened to land on the answer.
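MIRR introduces a novel diagnose() action that does exactly that. Every time the agent commits to a root-cause hypothesis, the environment scores two things: whether the root cause is correct (outcome), and how well the reasoning trace matches the ground-truth causal chain (reasoning quality).
The two are scored separately, with independent reward signals. That makes the env legible to RL fine-tuning with TRL. You can train Gemma 4 to be a better diagnostician, not just a luckier guesser.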
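To make the dual-signal idea concrete, here is a minimal sketch of what scoring a diagnose() call could look like. It is not MIRR's actual scorer: the names `DualReward`, `chain_coverage`, and `score_diagnosis`, the string labels, and the choice of an order-aware longest-common-subsequence measure are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DualReward:
    outcome: float    # 1.0 if the committed root cause is correct, else 0.0
    reasoning: float  # share of the true causal chain the trace followed, in order


def chain_coverage(trace_steps, causal_chain):
    """Length of the longest common subsequence divided by the chain length."""
    rows, cols = len(trace_steps), len(causal_chain)
    lcs = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            if trace_steps[i] == causal_chain[j]:
                lcs[i + 1][j + 1] = lcs[i][j] + 1
            else:
                lcs[i + 1][j + 1] = max(lcs[i][j + 1], lcs[i + 1][j])
    return lcs[rows][cols] / cols if cols else 0.0


def score_diagnosis(hypothesis, trace_steps, true_root_cause, causal_chain):
    """Score outcome and reasoning independently (hypothetical scoring rule)."""
    outcome = 1.0 if hypothesis == true_root_cause else 0.0
    reasoning = chain_coverage(trace_steps, causal_chain)
    return DualReward(outcome=outcome, reasoning=reasoning)


# Hypothetical episode: the labels below are made up for this example.
chain = ["payments:memory_leak", "payments:gc_pressure", "checkout:timeouts"]

# Right answer, thin reasoning: only the final symptom was examined.
lucky = score_diagnosis(
    "payments:memory_leak", ["checkout:timeouts"], "payments:memory_leak", chain
)

# Wrong answer, but the trace followed most of the real chain.
careful = score_diagnosis(
    "payments:bad_deploy",
    ["payments:gc_pressure", "checkout:timeouts"],
    "payments:memory_leak",
    chain,
)
```

Keeping the two numbers in separate fields, rather than summing them, is what lets a training loop weight or ablate them independently.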
The sim ships with:
step() / reset() interface, so the same env trains agents and serves the live demo
🎥 Live Gradio demo: [your-space-link-here]
Try the "memory leak in payments" episode if you want to see the thinking mode really earn its keep.
📦 Repo: github.com/u7k4rs6/MIRR
🤗 Rollouts dataset: huggingface.co/datasets/u7k4rs6/incident-response-rollouts
The repo includes the OpenEnv environment, fault generators, Gemma 4 fine-tuning scripts (TRL + Unsloth), eval harness, and the Gradio demo. The HF dataset contains agent rollouts from MIRR episodes (state, action, reasoning trace, dual reward), ready to drop straight into a TRL training loop.
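For readers who want a feel for how a step() / reset() loop turns into rollout rows of (state, action, reasoning trace, dual reward), here is a toy sketch. `ToyIncidentEnv`, `RolloutRecord`, the `verb:target` action strings, and the reward rule are all hypothetical stand-ins, not the classes or the dataset schema from the repo.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RolloutRecord:
    state: str
    action: str
    reasoning_trace: str
    outcome_reward: float
    reasoning_reward: float


class ToyIncidentEnv:
    """Tiny stand-in for a stateful incident environment (assumed interface)."""

    def __init__(self, root_cause):
        self.root_cause = root_cause
        self.inspected = set()

    def reset(self):
        self.inspected = set()
        return "alert: checkout p99 latency above threshold"

    def step(self, action):
        verb, _, target = action.partition(":")
        if verb == "diagnose":
            # Toy rule: reasoning is rewarded only if the faulty service
            # was actually inspected before the agent committed.
            rewards = {
                "outcome": 1.0 if target == self.root_cause else 0.0,
                "reasoning": 1.0 if self.root_cause in self.inspected else 0.0,
            }
            return "episode over", rewards, True
        self.inspected.add(target)
        return f"{verb} output for {target}", {"outcome": 0.0, "reasoning": 0.0}, False


def collect_rollout(env, scripted_steps):
    """Run (action, trace) pairs through the env and record one row per step."""
    records = []
    state = env.reset()
    for action, trace in scripted_steps:
        next_state, rewards, done = env.step(action)
        records.append(
            RolloutRecord(state, action, trace, rewards["outcome"], rewards["reasoning"])
        )
        state = next_state
        if done:
            break
    return records


records = collect_rollout(
    ToyIncidentEnv(root_cause="payments"),
    [
        ("get_logs:payments", "Latency alert points at checkout; check its upstream."),
        ("query_metrics:payments", "Logs show OOM restarts; confirm with memory metrics."),
        ("diagnose:payments", "Memory climbs until restart, so payments is the source."),
    ],
)

# One JSON object per line, a common shape for training datasets.
jsonl = "\n".join(json.dumps(asdict(record)) for record in records)
```

In a real setup the scripted steps would be replaced by model calls, but the loop and the row shape stay the same.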
I used a two-model strategy: Gemma 4 E4B for fast, on-device iteration and RL fine-tuning, and Gemma 4 31B Dense for the heavy reasoning that does the actual diagnosing in the live demo.
The 31B is doing real chain-of-thought work: walking a service graph, correlating timestamps across logs, ruling out hypotheses. Two Gemma 4 properties make it exactly the right model:
Configurable thinking modes. This is the whole game for MIRR. The diagnose() action needs a visible, structured reasoning trace, because the environment scores reasoning quality independently of outcome. Gemma 4's native thinking mode gives me 4K+ tokens of clean chain-of-thought to grade against the ground-truth causal chain. I'd rather have one model that thinks transparently than two models stapled together with a "show your work" prompt that the model is free to ignore.
256K context. A real incident has logs from five services, three dashboards, and a runbook. The 31B eats all of that in one shot, with no RAG plumbing and no summarization step quietly dropping the critical line. For incident response specifically, context fidelity is everything.
The 31B is served via HF Inference for the demo, which keeps the Space cheap and snappy.
RL fine-tuning the 31B on a hackathon budget is a non-starter. But because Gemma 4 ships the same architecture and tokenizer across the entire family, I could fine-tune the E4B on the MIRR rollouts dataset using TRL + Unsloth on a single Colab T4, and then transfer the learnings (reward shaping, prompting structure, the diagnose() action schema) to the 31B at inference time. Same family, same instincts.
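One way to picture what "transferring the action schema" means in practice: both checkpoints emit the same structured diagnose() payload, and a single parser validates it regardless of which model produced it. The sketch below assumes a JSON payload with `action`, `root_cause`, `confidence`, and `evidence` fields. Those field names and the `parse_diagnose` helper are illustrative guesses, not the schema in the repo.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class DiagnoseAction:
    root_cause: str
    confidence: float
    evidence: tuple


def parse_diagnose(raw_text):
    """Parse a model's JSON action; return None if it does not fit the schema."""
    try:
        payload = json.loads(raw_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict) or payload.get("action") != "diagnose":
        return None

    root_cause = payload.get("root_cause")
    confidence = payload.get("confidence")
    evidence = payload.get("evidence", [])

    if not isinstance(root_cause, str) or not root_cause:
        return None
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    if not 0.0 <= confidence <= 1.0:
        return None
    if not isinstance(evidence, list) or not all(isinstance(e, str) for e in evidence):
        return None
    return DiagnoseAction(root_cause, float(confidence), tuple(evidence))


# The same parser handles output from the small and the large checkpoint.
valid = parse_diagnose(
    '{"action": "diagnose", "root_cause": "payments", '
    '"confidence": 0.8, "evidence": ["OOM restarts in payments logs"]}'
)
malformed = parse_diagnose("I think it is payments")
overconfident = parse_diagnose(
    '{"action": "diagnose", "root_cause": "payments", "confidence": 1.5}'
)
```

Returning None instead of raising keeps the environment loop simple: an unparseable action can be treated as a wasted step rather than a crash.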
Per-Layer Embeddings (PLE) make E4B genuinely punchy too. Even the small model produces watchable demos on the on-device path, which matters for the eventual story of "your laptop runs a copy of your team's SRE agent locally."
If I had another week, I'd swap the 31B for the 26B MoE to cut inference cost on the demo (3.8B active params at 31B-class quality is hard to beat), and use E2B's native audio input to let on-call engineers literally talk to the agent during an incident. "What's burning?" → live answer, while the pager is still vibrating.
Gemma 4 is the first open model family where you can prototype on a phone-class checkpoint and ship on a workstation-class checkpoint without rewriting your stack. That's the unlock MIRR needed.