MIRR: An RL Environment Where Gemma 4 Gets Graded on How It Thinks, Not Just What It Answers

#devchallenge #gemmachallenge #gemma

Utkarsh Bahuguna


This is a submission for the Gemma 4 Challenge: Build with Gemma 4

🏆 An earlier version of this project finished in the top 50 out of 8,000+ teams at the Meta × PyTorch Hackathon (see LinkedIn post). This submission rebuilds it on Gemma 4, with reasoning-quality scoring designed around Gemma 4's native thinking modes.

What I Built

MIRR is a stateful, RL-compatible environment where a Gemma 4 agent debugs failures in a simulated microservice system the way an on-call SRE would. It pulls logs, queries metrics, walks the service graph, and commits to a root-cause hypothesis under uncertainty.

Here's the problem I wanted to solve. Most "agent benchmarks" today reward only the outcome. Did the agent fix it? Yes / no. That's a terrible signal for incident response, where a good engineer can be right for bad reasons (lucky pattern match), and a great engineer can be wrong for excellent reasons (a sensible hypothesis ruled out by data the team didn't have). If we want LLMs that on-call engineers actually trust at 3 AM, we have to score how they think, not just whether they happened to land on the answer.

MIRR introduces a novel diagnose() action that does exactly that. Every time the agent commits to a root-cause hypothesis, the environment scores:

Reasoning quality: causal chain validity, evidence cited, alternatives ruled out
Outcome correctness: did the hypothesis actually match the injected fault

…separately, each with its own reward signal. That makes the environment directly usable for RL fine-tuning with TRL: you can train Gemma 4 to be a better diagnostician, not just a luckier guesser.
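To make the split concrete, here is a minimal sketch of dual scoring. It is not MIRR's actual scorer: the `Diagnosis` and `GroundTruth` types, their field names, and the 50/50 weighting are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Diagnosis:
    """Hypothetical payload of a diagnose() call."""
    root_cause: str               # e.g. "payments:memory_leak"
    evidence_ids: frozenset[str]  # log/metric IDs the agent cited
    ruled_out: frozenset[str]     # alternative hypotheses the agent rejected


@dataclass(frozen=True)
class GroundTruth:
    """Hypothetical description of the injected fault."""
    root_cause: str
    relevant_evidence: frozenset[str]
    plausible_alternatives: frozenset[str]


def _coverage(claimed: frozenset[str], expected: frozenset[str]) -> float:
    """Fraction of the expected items that the agent actually mentioned."""
    if not expected:
        return 1.0
    return len(claimed & expected) / len(expected)


def score_diagnosis(d: Diagnosis, truth: GroundTruth) -> dict[str, float]:
    """Return two rewards; neither one is computed from the other."""
    # Assumed weighting: evidence and ruled-out alternatives count equally.
    reasoning = (
        0.5 * _coverage(d.evidence_ids, truth.relevant_evidence)
        + 0.5 * _coverage(d.ruled_out, truth.plausible_alternatives)
    )
    outcome = 1.0 if d.root_cause == truth.root_cause else 0.0
    return {"reasoning": reasoning, "outcome": outcome}
```

With this shape, an agent that cites every relevant signal and rules out the right alternatives but names the wrong service gets a full reasoning score and a zero outcome score, which is exactly the "wrong for excellent reasons" case.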

The sim ships with:

A microservice topology (auth → gateway → orders → payments → DB)
A fault library: cascading timeouts, deadlocks, memory leaks, poison messages, cert expiry, the usual hall of fame
Synthetic logs, metrics, and traces generated per episode
An OpenEnv-compatible step() / reset() interface, so the same env trains agents and serves the live demo
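As a rough illustration of the step() / reset() loop, here is a toy environment over the topology above. It is a sketch under assumptions, not the MIRR code or OpenEnv's real types: the class name, action dictionaries, the `(observation, reward, done)` return shape, and the rule that a fault degrades the faulty service plus everything upstream of it are all invented for the example. It also returns only an outcome reward, whereas MIRR scores reasoning separately.

```python
import random

# Call order taken from the topology above: each service depends on the next.
TOPOLOGY = ["auth", "gateway", "orders", "payments", "db"]
FAULTS = ["cascading_timeout", "deadlock", "memory_leak", "poison_message", "cert_expiry"]


class ToyIncidentEnv:
    """Hypothetical stand-in for an incident-response environment."""

    def __init__(self, seed: int = 0) -> None:
        self._rng = random.Random(seed)
        self._fault = ("", "")
        self.reset()

    def reset(self) -> dict:
        """Inject a fresh (service, fault) pair and return the first observation."""
        self._fault = (self._rng.choice(TOPOLOGY), self._rng.choice(FAULTS))
        return {"alert": "elevated error rate at auth", "services": list(TOPOLOGY)}

    def step(self, action: dict) -> tuple[dict, float, bool]:
        """Apply one action and return (observation, reward, done)."""
        kind = action["type"]
        if kind == "get_logs":
            service = action["service"]
            # Simplification: the faulty service and all of its callers show errors.
            degraded = TOPOLOGY.index(service) <= TOPOLOGY.index(self._fault[0])
            status = "errors" if degraded else "healthy"
            return {"service": service, "status": status}, 0.0, False
        if kind == "diagnose":
            correct = (action["service"], action["fault"]) == self._fault
            return {"correct": correct}, 1.0 if correct else 0.0, True
        raise ValueError(f"unknown action type: {kind}")
```

An agent loop would call `reset()` once per episode, issue `get_logs` actions to narrow down where errors stop, and end the episode with a single `diagnose` action.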

Demo

🎥 Live Gradio demo: [your-space-link-here]

Try the "memory leak in payments" episode if you want to see the thinking mode really earn its keep.

Code

📦 Repo: github.com/u7k4rs6/MIRR
🤗 Rollouts dataset: huggingface.co/datasets/u7k4rs6/incident-response-rollouts

The repo includes the OpenEnv environment, fault generators, Gemma 4 fine-tuning scripts (TRL + Unsloth), eval harness, and the Gradio demo. The HF dataset contains agent rollouts from MIRR episodes (state, action, reasoning trace, dual reward), ready to drop straight into a TRL training loop.
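For readers who want a mental model of a rollout record, the sketch below stores the four parts named above as one JSON line per step. The field names are guesses for illustration; check the dataset card for the real schema before relying on them.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RolloutStep:
    """Hypothetical record layout; the published dataset may use other names."""
    state: dict
    action: dict
    reasoning_trace: str
    reasoning_reward: float
    outcome_reward: float


def to_jsonl(steps: list[RolloutStep]) -> str:
    """Serialize steps as one JSON object per line."""
    return "\n".join(json.dumps(asdict(s), sort_keys=True) for s in steps)


def from_jsonl(text: str) -> list[RolloutStep]:
    """Parse the output of to_jsonl back into records, skipping blank lines."""
    return [RolloutStep(**json.loads(line)) for line in text.splitlines() if line.strip()]
```

Keeping the two rewards as separate columns, rather than a pre-summed scalar, leaves the weighting decision to whoever trains on the data.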

How I Used Gemma 4

I used a two-model strategy: Gemma 4 E4B for fast, on-device iteration and RL fine-tuning, and Gemma 4 31B Dense for the heavy reasoning that does the actual diagnosing in the live demo.

Gemma 4 31B Dense: the diagnostician

The 31B is doing real chain-of-thought work: walking a service graph, correlating timestamps across logs, ruling out hypotheses. Two Gemma 4 properties make it exactly the right model:

Configurable thinking modes. This is the whole game for MIRR. The diagnose() action needs a visible, structured reasoning trace, because the environment scores reasoning quality independently of outcome. Gemma 4's native thinking mode gives me 4K+ tokens of clean chain-of-thought to grade against the ground-truth causal chain. I'd rather have one model that thinks transparently than two models stapled together with a "show your work" prompt that the model is free to ignore.
256K context. A real incident has logs from five services, three dashboards, and a runbook. The 31B eats all of that in one shot, with no RAG plumbing and no summarization step quietly dropping the critical line. For incident response specifically, context fidelity is everything.
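One simple way to grade a trace against a ground-truth causal chain is to check whether the trace mentions the chain's services in causal order. The function below is a deliberately naive sketch of that idea, not MIRR's grader: it uses greedy substring matching, so it would miss a trace that narrates the chain backwards (symptom first) and could be fooled by a service name appearing inside another word.

```python
def chain_validity(trace: str, truth_chain: list[str]) -> float:
    """Fraction of ground-truth chain links mentioned in order (greedy match)."""
    if not truth_chain:
        return 0.0
    text = trace.lower()
    cursor = 0   # only look for the next link after the previous match
    matched = 0
    for service in truth_chain:
        pos = text.find(service.lower(), cursor)
        if pos == -1:
            continue  # link missing or out of order; skip it
        matched += 1
        cursor = pos + len(service)
    return matched / len(truth_chain)
```

For example, with a hypothetical chain of `["db", "payments", "orders"]`, a trace that says the DB pool saturates, then payments queues up, then orders times out scores 1.0, while a trace that only mentions orders scores one third.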

The 31B is served via HF Inference for the demo, which keeps the Space cheap and snappy.

Gemma 4 E4B: the training proxy

RL fine-tuning the 31B on a hackathon budget is a non-starter. But because Gemma 4 uses the same tokenizer and chat template across the family, I could fine-tune the E4B on the MIRR rollouts dataset using TRL + Unsloth on a single Colab T4, and then carry what I learned (reward shaping, prompting structure, the diagnose() action schema) over to the 31B at inference time. The fine-tuned weights stay with the E4B; what transfers is the design. Same family, same instincts.
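The post does not say which TRL trainer was used. If it were `GRPOTrainer`, custom rewards are passed as plain callables that receive the batch of completions plus dataset columns as keyword arguments and return one float per completion. The sketch below shows two independent reward functions in that shape. The `ROOT_CAUSE:` output convention and the `root_cause` / `evidence_ids` column names are assumptions made up for the example.

```python
import re

# Hypothetical convention: the completion states "ROOT_CAUSE: <service>:<fault>".
_ROOT_CAUSE = re.compile(r"ROOT_CAUSE:\s*(\S+)")


def outcome_reward(completions: list[str], root_cause: list[str], **kwargs) -> list[float]:
    """1.0 when the last stated root cause matches the injected fault, else 0.0."""
    rewards = []
    for text, expected in zip(completions, root_cause):
        matches = _ROOT_CAUSE.findall(text)
        rewards.append(1.0 if matches and matches[-1] == expected else 0.0)
    return rewards


def reasoning_reward(completions: list[str], evidence_ids: list[list[str]], **kwargs) -> list[float]:
    """Fraction of the expected evidence IDs that the completion cites."""
    rewards = []
    for text, expected in zip(completions, evidence_ids):
        cited = sum(1 for eid in expected if eid in text)
        rewards.append(cited / len(expected) if expected else 0.0)
    return rewards
```

Passing both functions as separate entries in a reward list keeps the two signals independent during training, so their relative weight can be tuned without touching the scoring logic.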

Per-Layer Embeddings (PLE) make E4B genuinely punchy too. Even the small model produces watchable demos on the on-device path, which matters for the eventual story of "your laptop runs a copy of your team's SRE agent locally."

Why Gemma 4 over other open models

Open weights + Apache 2.0. I can actually fine-tune and ship. A closed API would have killed the RL story before it started.
Family symmetry. Same tokenizer and chat template across E2B → 31B means a training signal designed on E4B transfers up cleanly. Few open families give you a ladder this clean.
Thinking modes as first-class API, not a prompting hack. For an environment that grades reasoning, that's the difference between scoring real signal and scoring formatting.
Multimodal headroom. v2 of MIRR includes service topology images (Grafana panels, dependency graphs), and Gemma 4's vision input means one model handles it end-to-end.

What's next

If I had another week, I'd swap the 31B for the 26B MoE to cut inference cost on the demo (3.8B active params at 31B-class quality is hard to beat), and use E2B's native audio input to let on-call engineers literally talk to the agent during an incident. "What's burning?" → live answer, while the pager is still vibrating.

Gemma 4 is the first open model family where you can prototype on a phone-class checkpoint and ship on a workstation-class checkpoint without rewriting your stack. That's the unlock MIRR needed.