Damien Gallagher

Today’s AI roundup: a new agent safety benchmark that shows KPI pressure can drive constraint violations, a practical ‘demo artifacts’ workflow for coding agents, and Qwen’s push toward usable 2K image generation + infographics.
Three stories worth paying attention to today — all pointing at the same underlying theme: we’re moving from models that talk to systems that act, and that changes how we should evaluate safety, QA, and “did it actually work?” proof.
A new paper introduces ODCV-Bench (Outcome-Driven Constraint Violations), a benchmark designed to test something most safety evals miss: what happens when an agent is strongly incentivized to hit a KPI over multiple steps.
Key points that jumped out:
If you’re building agent workflows inside a company: this is basically a warning label for “just give it an objective and let it run.” The KPI becomes the product, and the constraints become optional.
Source (paper abstract + PDF): https://arxiv.org/abs/2512.20798
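To make the evaluation idea concrete, here is a minimal sketch (my own illustration, not the actual ODCV-Bench API or scoring rubric) of grading a multi-step agent run on both KPI attainment and constraint violations, so a run that hits the target by breaking a rule is flagged rather than rewarded. All names here (`Step`, `evaluate`, the constraint labels) are assumptions for the example:

```python
# Hypothetical sketch, NOT the ODCV-Bench implementation: score a
# trajectory on KPI attainment AND constraint violations together.
from dataclasses import dataclass, field

@dataclass
class Step:
    action: str                        # what the agent did at this step
    kpi_delta: float                   # how much the step moved the KPI
    violated: set = field(default_factory=set)  # constraints broken here

def evaluate(trajectory: list, kpi_target: float) -> dict:
    kpi = sum(s.kpi_delta for s in trajectory)
    violations = [name for s in trajectory for name in sorted(s.violated)]
    return {
        "kpi_met": kpi >= kpi_target,
        "violations": violations,
        # A run only "passes" if it hits the KPI *without* any violation.
        "safe_pass": kpi >= kpi_target and not violations,
    }

run = [
    Step("discount_within_policy", 40.0),
    Step("waive_fee_without_approval", 70.0, {"approval_required"}),
]
result = evaluate(run, kpi_target=100.0)
# The KPI is met (110 >= 100), but via a violation, so safe_pass is False.
```

The point of the shape: if your harness only reports `kpi_met`, the second run above looks like a success, which is exactly the failure mode the benchmark is probing.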
Simon Willison shipped two tools, Showboat and Rodney, that solve a problem every team using agents runs into: agents can generate a lot of code, but it’s still hard to trust that it works without burning hours on manual verification.
This is the right direction for agentic development, in my opinion: don’t just output code; output evidence (demo docs, screenshots, transcripts, reproducible commands).
Source: https://simonwillison.net/2026/Feb/10/showboat-and-rodney/
Image generation is commoditized for “pretty pictures.” The harder problem is whether a model can generate usable assets: typography, layout, hierarchy, and longer structured instructions — basically the stuff you’d want for decks, posters, one-pagers, and product visuals.
Qwen is pushing that angle with Qwen-Image-2.0, highlighting:
I’m watching this space less as “who wins photorealism” and more as “who can reliably generate assets a team can actually ship without hand-fixing the text.”
Sources:
If you’re building agentic systems in 2026, two things matter more than ever:
1) Your evaluation loop can’t be “does it follow instructions?” — it has to include goal pressure and multi-step behavior.
2) Your QA artifacts shouldn’t be optional. Make “proof it works” (docs/screenshots/transcripts) a required output of the agent, not an afterthought for humans.
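One way to enforce point 2 is to gate the “done” status on evidence files actually existing. A minimal sketch, where the artifact file names are my assumptions and not a convention from Showboat or Rodney:

```python
# Hypothetical sketch: a run only counts as complete if its directory
# contains non-empty evidence artifacts. File names are illustrative.
from pathlib import Path
import tempfile

REQUIRED_ARTIFACTS = ["demo.md", "transcript.txt", "repro_commands.sh"]

def verify_run(run_dir: Path):
    """Return (complete, missing): the run passes only if every
    required artifact exists and is non-empty."""
    missing = [
        name for name in REQUIRED_ARTIFACTS
        if not (run_dir / name).is_file()
        or (run_dir / name).stat().st_size == 0
    ]
    return (not missing, missing)

# Usage: after the agent finishes, check evidence before marking done.
with tempfile.TemporaryDirectory() as d:
    run_dir = Path(d)
    (run_dir / "demo.md").write_text("# Demo\nSteps and output here.\n")
    ok, missing = verify_run(run_dir)
    # ok is False here: transcript.txt and repro_commands.sh are missing.
```

The design choice is deliberate: the check lives outside the agent, so the agent can’t “decide” it did enough. The same gate works in CI, where a missing artifact fails the build instead of waiting for a human to notice.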
More AI posts like this: https://buildrlab.com/blog