I Audited My Own Open-Source Project With 26 AI Agents (and Found a Real Vulnerability)

#ai #audit #agents #security

Odilon HUGONNOT

26 AI agents comb through my PHP media server in parallel: one vulnerability found, a bug my own fix introduced, and the best lesson — knowing when to stop.

ShareBox is my self-hosted streaming server: a PHP thing I built because I just wanted to send someone a link to a movie without installing Plex and its ten gigabytes of dependencies. It runs on my seedbox, serves my users, and one morning I notice it's starting to pick up a few stars on GitHub.

And then, that little voice: "does this thing actually hold up?" Because between "works on my machine" and "code that strangers are going to install on their own box," there's a chasm. A chasm full of flaws I can't see anymore, because I've had my nose in it for weeks.

Normally, you re-read your code. Except that re-reading 22,000 lines alone is something you honestly do badly: you skim over what you think you already know. So I tried something else: unleashing a pack of 26 AI agents on it, each with a precise mission, and seeing what surfaced. Spoiler: they found a flaw that had been sitting right under my eyes from the start.

26 agents to comb through my own code

The idea wasn't "AI, tell me if my code is good" — that always produces the same encouraging, useless mush. The idea was to orchestrate: split the audit into roles, run the agents in parallel, then have a final, deliberately harsh agent tear apart the conclusions.

The pipeline looked like this: eleven readers start in parallel, each swallowing an entire slice of the code (the core, the streaming handlers, the API, the front end, the tests, the Docker setup…). Their reports flow up into an architecture synthesis and a test-coverage analysis. Then twelve "radar" agents each score one single axis — security, performance, architecture, tests… And finally, a "verdict" agent re-reads every score in adversarial mode: its job is to knock down the ones that are too kind.

[Figure: audit pipeline. 11 readers in parallel (each slice of the code read in full), then an architecture + coverage synthesis (connect the pieces, measure the gaps), then 12 radar agents (one agent = one scored axis), then an adversarial verdict (knocks down kind scores, produces the final score and roadmap).]

The pipeline: read in parallel, connect, score axis by axis, then have it all torn apart by a deliberately harsh final agent.
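To make the fan-out and fan-in shape concrete, here is a minimal shell sketch of the stages described above. It is not the orchestration I actually ran: `run_agent` and `run_pipeline` are hypothetical names, `run_agent` is a stand-in that only echoes its role instead of calling a model, and the slice and axis lists are limited to the ones named in this article.

```shell
#!/bin/sh
# Hypothetical stand-in for a real agent call: it reads its input on stdin
# and prints one line with its role and the input size.
run_agent() {
    role=$1
    input=$(cat)
    printf '[%s] read %s characters\n' "$role" "${#input}"
}

run_pipeline() {
    work=$(mktemp -d)

    # Stage 1: one reader per code slice, all started in the background.
    for slice in core streaming api frontend tests docker; do
        printf 'slice: %s\n' "$slice" \
            | run_agent "reader-$slice" > "$work/reader-$slice.txt" &
    done
    wait  # fan-in: block until every reader has written its report

    # Stage 2: the synthesis reads every reader report.
    cat "$work"/reader-*.txt | run_agent synthesis > "$work/synthesis.txt"

    # Stage 3: one radar agent per axis, again in parallel.
    for axis in security performance architecture tests; do
        run_agent "radar-$axis" < "$work/synthesis.txt" > "$work/radar-$axis.txt" &
    done
    wait

    # Stage 4: a single adversarial verdict over all radar scores.
    cat "$work"/radar-*.txt | run_agent verdict
    rm -rf "$work"
}

run_pipeline
```

The important part is the two `wait` calls: no stage starts until every agent of the previous stage has finished, so the verdict always sees the complete set of scores.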

The verdict came in: 5.04 out of 10. The calibration was deliberately harsh: the final agent was told to score like a demanding staff engineer, keeping in mind that "a few-weeks-old PHP media server is not Jellyfin." It stings in the moment. But a low, well-argued score is worth a thousand complacent "great project!" comments.

The flaw that was right under my eyes

The moment that justifies the whole exercise on its own: one of the security agents flags the Docker startup script. My entrypoint.sh generates config.php from environment variables. And right above it sat a comment, in my own hand: "Sanitize strings to prevent PHP injection."

Except the sanitization only covered strings. Three numeric/boolean variables were interpolated raw into the generated PHP file:

define('STREAM_MAX_CONCURRENT', ${MAX_CONCURRENT});

Translation: if someone deploys the container with an environment variable like:

SHAREBOX_STREAM_MAX_CONCURRENT='1);system($_GET[x]);//'

…then the generated config.php contains executable PHP. Arbitrary code execution, in the config file, right under a comment that claimed to prevent it. The kind of thing you stop seeing because you wrote it yourself and you trust it.

Warning — always verify the agents. Before taking the agent's word for it, I went and re-read the source myself. That's the golden rule: an agent that says "vulnerability confirmed" can be wrong, and so can one that says "all good." Here the flaw was real. The fix: validate that these variables really are integers (or true|false) before writing them, otherwise fall back to a safe value.
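A minimal sketch of that validation, assuming a POSIX shell entrypoint. The helper names `sanitize_int` and `sanitize_bool` and the fallback value `4` are illustrative assumptions, not the actual ShareBox code.

```shell
#!/bin/sh
# Hypothetical helpers for an entrypoint script; names and defaults are assumptions.

# Print $1 if it is a plain non-negative integer, otherwise print the fallback $2.
# Leading zeros are rejected because PHP would read them as octal literals.
sanitize_int() {
    case "$1" in
        ''|*[!0-9]*|0?*) printf '%s\n' "$2" ;;
        *)               printf '%s\n' "$1" ;;
    esac
}

# Print $1 if it is exactly "true" or "false", otherwise print the fallback $2.
sanitize_bool() {
    case "$1" in
        true|false) printf '%s\n' "$1" ;;
        *)          printf '%s\n' "$2" ;;
    esac
}

# Only the validated value ever reaches the generated PHP line.
MAX_CONCURRENT=$(sanitize_int "${SHAREBOX_STREAM_MAX_CONCURRENT:-}" 4)
printf "define('STREAM_MAX_CONCURRENT', %s);\n" "$MAX_CONCURRENT"
```

With the malicious value shown earlier, `sanitize_int` sees characters other than digits and prints the fallback, so the payload is never written to config.php. An allowlist (digits only, or exactly `true`/`false`) is easier to get right here than trying to escape the dangerous characters.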

Along the way, the audit also flagged write endpoints (TMDB poster management) reachable by anyone holding a public link, an unbounded ZIP export that could monopolize the server, and a database backup triggered on every web request. Seven "quick win" fixes in total — each verified in a real Docker container with an end-to-end test suite before touching production.

When my own fix introduced a bug

Here's the most instructive moment, and the most humbling. After applying my seven fixes, I ran another agent cycle — this time to re-score and hunt for regressions. And one of them found a bug. In my fix.

By moving the database backup out of the web request (good idea), I'd wired it onto container startup. But that code runs as root, and SQLite in WAL mode creates side files (-wal, -shm). The result: on restart over an already-populated volume, those files were owned by root, and the web server (running as www-data) could no longer write to the database. A fix that works on first launch and breaks on the second. The worst kind of bug.
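One way to close that gap, sketched under assumptions: after the root-owned startup step has touched the database, hand the main file and any WAL side files back to the web user. The helper name `fix_db_ownership` and the path in the usage comment are hypothetical, and this is not necessarily the fix I shipped. Running the backup itself as www-data would be another option.

```shell
#!/bin/sh
# Hypothetical helper: give the SQLite database and its WAL side files
# back to the user the web server runs as.
fix_db_ownership() {
    db_path=$1
    owner=$2
    for f in "$db_path" "$db_path-wal" "$db_path-shm"; do
        # The side files may or may not exist at this point, so skip missing ones.
        [ -e "$f" ] || continue
        chown "$owner" "$f" || return 1
    done
    return 0
}

# Example call from an entrypoint running as root (path is an assumption):
#   fix_db_ownership /data/sharebox.db www-data:www-data
```

The case worth testing is the second start, over a volume that already contains a database, since that is where the first launch and every later one diverge.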

No rushed human re-runs a full audit right after "finishing." That's exactly where the agent pack wins: it doesn't get tired, it doesn't congratulate itself, it re-reads the diff with the same cold rigor the second time as the first. Bug fixed, re-tested on restart, and this time for real.

Tests that read the code vs. tests that run it

The other slap from the audit landed on my tests. I had hundreds of them, and I was proud of my little green badge. Except an agent put its finger on a comfortable lie: a large chunk of those tests read the source code and checked that a string was present in it, instead of running the code and checking its behavior.

// What the test did (reading the source):
$source = file_get_contents('functions.php');
$this->assertStringContainsString('aresample=async=1', $source);

// What a real test does (execution):
$args = buildFilterGraph(720, 0, burnSub: 2);
$this->assertStringContainsString('overlay', $args); // the filter is ACTUALLY built

The difference is huge. The first test stays green even if the function is broken, as long as the string is lying around somewhere in the file. It's fake coverage. The code that actually serves a file's bytes (HTTP Range handling, the heart of a streaming server) was never executed in a test.

So I converted the critical paths into real execution tests: an ephemeral PHP server that serves a file and checks the 206 Partial Content responses byte by byte, real calls to the ffmpeg command builders, the security gate tested over real HTTP. The "E2E coverage" axis went from 3 to 5 on the radar — the biggest jump of the whole exercise, and the most deserved.

The audit's best advice: don't follow it all the way

By the end, the score had climbed from 5.04 to 5.72. And the natural urge is to keep climbing. The verdict clearly pointed at the ceiling: two monolithic files of ~2,300 lines each, mixing routing, auth, business logic and views. The "textbook" answer: split it all up, introduce a router, controllers, namespaces.

And there, the final agent did something I didn't expect from an audit: it advised me not to.

A good audit also tells you what NOT to do. Rewriting 4,700 lines of procedural code that works, whose core (the ffmpeg pipeline) isn't even covered by execution tests, for a solo open-source project: weeks of work, a huge regression risk, and zero added value for the user. The risk/value ratio is bad. Keeping a deliberate procedural architecture is a defensible choice.

Instead, the highest-leverage move was invisible and risk-free: extend static analysis (PHPStan) to those never-analyzed big files, freezing the existing debt in a baseline. Result: ~4,700 of the riskiest lines are now under a net — any new regression fails CI, without forcing me to clean everything up today. Five minutes of config beat three weeks of rewrite.

What to take away

The agent pack didn't "audit in my place." It did what a lone human does badly: actually read every line, no skimming, no author's blind spot, in parallel, and start over after the fixes without getting bored. It found a real vulnerability, exposed fake test coverage, and even caught a bug in my own fix.

But at no point did it decide for me. I'm the one who verified the flaw in the source before believing it. I'm the one who decided one fix deserved a deploy and another deserved to wait. And it was the adversarial agent, not the complacent one, that produced the real value — by scoring harshly, and by saying "stop, don't overdo it."

The real win isn't the rising score. It's knowing, with numbers to back it, where the debt is, which part is worth repaying, and — the hardest thing for a developer — which part you should accept and leave alone.