Your Spec Files Are Lying to You. Mine Were Too.

Diya Burman

Preface

I want to be upfront about something before we get into it. None of the frameworks in this article is mine. The ideas here come from two people who have been thinking about this stuff way harder and longer than I have — and they deserve full credit before I say another word.

Dan Shapiro — CEO of Glowforge, Wharton Research Fellow, and the person who gave this whole conversation a vocabulary. His blog post “The Five Levels: from Spicy Autocomplete to the Dark Factory” is the conceptual spine of everything I’m about to say. Read the original. It’s short, sharp, and will make you uncomfortable in the best way. danshapiro.com

Nate B. Jones: AI strategist, zero-hype practitioner, and the person whose YouTube channel made me realize I had been fooling myself about where I actually sat on this ladder. His video “The 5 Levels of AI Coding (Why Most of You Won’t Make It Past Level 2)” is what triggered this entire newsletter. natebjones.com

This newsletter, The Level 5 Engineer, is my public learning log. I’m a Senior Software Engineer and a Tech Lead, currently somewhere between Level 2 and Level 3 on a good day (levels in the sense of this newsletter's title). The goal is Level 5. I’m documenting the climb in real time: the frameworks, the tools, the mindset shifts, and the moments where I realize I’ve been doing it wrong. If you’re on a similar journey, pull up a chair.

Every issue so far has worked with one service and one spec file. Issue #7 changes that. A second service enters the picture — a notification service that the order service calls after a confirmed payment — and with it comes the question that every growing system eventually forces: where do spec file boundaries go?

The answer turns out to matter more than it looks. And the audit at the end of this issue found seven spec debt items, four of them in a file we've been running since Issue #2. All passing. All carrying risk.

The notification service — and a design decision that has spec implications

The new service is minimal: POST /notifications/order-confirmed accepts an order id, user id, and total, and returns a notification id and a QUEUED status. Simple enough. The interesting part is how the order service calls it.

The call is fire-and-forget.

When an order is confirmed, the order service starts a daemon thread, fires the notification request, and returns the CONFIRMED response immediately — without waiting for the notification to succeed. If the notification service is down, slow, or returning errors, the order is still confirmed. The customer gets their confirmation. The notification may or may not arrive.

This is a deliberate design decision. The order service owns the transaction. The notification service owns delivery. Coupling the order confirmation response to notification delivery would mean a flaky notification service could block order creation — which is a much worse failure mode than a missed notification.
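The fire-and-forget call described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `confirm_order`, `broken_notifier`, and the order dictionary shape are names I'm assuming for the example.

```python
import threading
from typing import Callable

Order = dict  # assumed shape: {"order_id": ..., "user_id": ..., "total": ...}


def confirm_order(order: Order, notify: Callable[[Order], None]) -> dict:
    """Return CONFIRMED immediately; send the notification on a daemon thread."""

    def _send() -> None:
        try:
            notify(order)
        except Exception:
            # Swallow every notification failure: it must never reach the caller.
            pass

    # daemon=True so a stuck notification never blocks process shutdown.
    threading.Thread(target=_send, daemon=True).start()
    return {"order_id": order["order_id"], "status": "CONFIRMED"}


def broken_notifier(order: Order) -> None:
    # Stand-in for a notification service that is down.
    raise ConnectionError("notification service unreachable")


response = confirm_order(
    {"order_id": "ord-1", "user_id": "user-1", "total": "134.97"},
    broken_notifier,
)
```

The response is built without ever joining the thread, so the notifier's outcome cannot influence the order status.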

But the decision has a direct spec implication: any scenario that asserts Then the order status is "CONFIRMED" must remain true regardless of what the notification service does. The spec cannot simultaneously require CONFIRMED and make CONFIRMED depend on notification success. That would be a hidden coupling — the spec would look independent but the implementation would not be.

This is the kind of architectural decision that should be in the spec before it's in the code. Once it's in the code it becomes folklore.

The wrong way first: one big spec file

Before doing it right I did it wrong deliberately. I added two notification scenarios to the bottom of order_creation.feature — the existing file that's been covering order creation since Issue #2.

All 7 tests passed. Green across the board. pytest has no opinion about spec architecture.

The problems are structural, not functional:

Mixed ownership. order_creation.feature line 1 says Feature: Order Creation. By line 48 it's testing notification delivery. If the notification team changes their contract — say, adding a channel field to the request — they have to open order_creation.feature to update it. That file is not theirs. The filename, the feature declaration, and the existing scenarios all signal "this belongs to the order team." The notification scenarios are squatters.

The growing file problem. At 5 scenarios the file is readable. At 7 it starts to smell. Extrapolate to a real system: 10 downstream services, 5–10 scenarios each, all appended to the originating feature file because each was "triggered by" an order creation event. The file becomes a catch-all that nobody owns and everybody edits. Ownership dissolves into "whoever last touched it."

The agent routing problem. When an agent is handed order_creation.feature to build against, it must now implement both order logic and notification logic. It cannot know from the file whether the notification call belongs in POST /orders or in a separate endpoint. It will make a decision — probably the wrong one — and that decision will be baked into the implementation before anyone notices.

Spec debt seed. The scenario "Order confirmation succeeds even if notification fails" uses the step "the notification service is unavailable" without defining what unavailable means. TCP connection refused? 503? A 30-second hang? Each is a different failure mode with different implications for retry logic. An agent will pick one interpretation silently. Two agents will pick different ones. Both implementations will pass the spec. This is spec debt: it forms quietly, passes its tests, and surfaces as a production incident months later.
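To make the ambiguity concrete, here is a small sketch of how the three readings of "unavailable" look from the caller's side. The stub functions and outcome labels are hypothetical, and the hang is represented by a `TimeoutError` rather than a real 30-second wait.

```python
from typing import Callable


def stub_connection_refused() -> int:
    # Reading 1: nothing is listening on the port.
    raise ConnectionRefusedError("connection refused")


def stub_http_503() -> int:
    # Reading 2: the service answers, but with 503 Service Unavailable.
    return 503


def stub_hang() -> int:
    # Reading 3: the service hangs until the client's timeout fires.
    raise TimeoutError("no response before client timeout")


def notification_outcome(call: Callable[[], int]) -> str:
    """Classify what the order service observes for one notification attempt."""
    try:
        status = call()
    except ConnectionRefusedError:
        return "refused"
    except TimeoutError:
        return "timed_out"
    return f"http_{status}"


outcomes = [
    notification_outcome(stub)
    for stub in (stub_connection_refused, stub_http_503, stub_hang)
]
```

Each outcome would feed a different branch of any retry logic, which is why the step has to name one of them.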

The right way: bounded spec files

After documenting what was wrong, I moved the notification scenarios into their own file: tests/features/notification_service.feature. I rewrote both scenarios to:

Precisely define "unavailable" as 503 Service Unavailable — not a timeout, not a connection refused, not an ambiguous network failure
Describe the notification contract from the notification service's perspective
Make the file self-contained — a notification service team reading it wouldn't need to open order_creation.feature to understand it

The result:

order_creation.feature: 5 scenarios, all about order creation. No references to notifications.
notification_service.feature: 2 scenarios, all about notification delivery behaviour.

The file boundary is now a contract boundary. The two files can be versioned, owned, and handed to different agents independently.

Bounded spec files are not a tidiness preference. They are a precision tool for multi-agent systems. When a spec file is bounded to one service, an agent can be assigned exactly that file and nothing else. It builds one surface, tests one contract, returns. When the spec bleeds across services, the agent must make decisions about service ownership that were never written down. Those decisions accumulate as hidden assumptions in the implementation.
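One way to keep a boundary from eroding is a simple check that a feature file never mentions another service. The sketch below is an assumption on my part, not a tool from the project: `foreign_references` and the term list are hypothetical, and the sample text reuses the scenario name from the "wrong way" experiment.

```python
import re


def foreign_references(feature_text: str, foreign_terms: list[str]) -> list[tuple[int, str]]:
    """Return (line number, line) pairs that mention a term owned by another service."""
    hits = []
    for lineno, line in enumerate(feature_text.splitlines(), start=1):
        for term in foreign_terms:
            if re.search(rf"\b{re.escape(term)}", line, flags=re.IGNORECASE):
                hits.append((lineno, line.strip()))
                break  # one hit per line is enough
    return hits


sample = """Feature: Order Creation
  Scenario: Order confirmation succeeds even if notification fails
    Given the notification service is unavailable
    Then the order status is "CONFIRMED"
"""

hits = foreign_references(sample, ["notification"])
```

Run against the bounded order_creation.feature, a check like this should come back empty.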

The spec debt audit

With the bounded file structure in place, I audited all four feature files in the project for spec debt — places where the spec passes its tests but leaves decisions that should have been made explicitly.

Seven items. All passing. All carrying risk.

1. Ambiguous timeout measurement

File: order_creation.feature — Scenario: payment gateway times out
Step: And the response is returned within 12 seconds

From when? The moment the client sends the request? The moment the server receives it? The moment the last retry fires? Two agents will instrument this differently and both will pass.

Fix: And the response is returned within 12 seconds of the order being submitted, with "submitted" defined as the moment the HTTP request body is sent.
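Under the client-side reading of "submitted", the measurement might look like the sketch below. `elapsed_from_submission` is a hypothetical helper, and the `time.sleep` call only stands in for a real HTTP request.

```python
import time
from typing import Callable


def elapsed_from_submission(submit: Callable[[], object]) -> float:
    """Seconds from the moment the client starts sending until the response returns."""
    start = time.monotonic()  # the clock starts on the client, just before sending
    submit()
    return time.monotonic() - start


# Stand-in for a real request; a test would POST to the order service here.
elapsed = elapsed_from_submission(lambda: time.sleep(0.01))
within_limit = elapsed < 12
```

A server-side timer would start later and report a smaller number for the same request, which is exactly the disagreement the step needs to rule out.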

2. "Retried" vs "total attempts"

File: order_creation.feature — Scenario: payment gateway times out
Step: And the payment gateway is not retried more than 2 times

Does this mean 2 total attempts (1 original + 1 retry) or 2 retries on top of the original (3 total)? The English is genuinely ambiguous. An agent will pick one. The test will pass. The production system will behave differently than intended.

Fix: And the payment gateway receives no more than 2 charge requests total — "requests total" removes all ambiguity about whether the first attempt counts.
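The two readings differ by exactly one charge request, as this sketch shows. The helper and the always-failing gateway are hypothetical stand-ins, not the project's retry code.

```python
from typing import Callable


def count_charge_requests(charge: Callable[[], None], max_total_attempts: int) -> int:
    """Call the gateway until it succeeds or the attempt budget is spent."""
    attempts = 0
    while attempts < max_total_attempts:
        attempts += 1
        try:
            charge()
            return attempts
        except TimeoutError:
            continue
    return attempts


def gateway_that_times_out() -> None:
    raise TimeoutError("payment gateway timed out")


# Reading A: "2 times" means 2 charge requests in total.
reading_a = count_charge_requests(gateway_that_times_out, max_total_attempts=2)

# Reading B: "2 times" means 2 retries on top of the original attempt.
reading_b = count_charge_requests(gateway_that_times_out, max_total_attempts=1 + 2)
```

Here `reading_a` is 2 and `reading_b` is 3; only the "requests total" wording separates them.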

3. "Released" is not a mechanism

File: order_creation.feature — Scenario: payment declined
Step: And the inventory reservation is released

"Released" is not defined. Does the inventory service receive a DELETE? A POST to a release endpoint? Does a TTL fire? An agent will implement whichever mechanism seems natural. Two agents will produce incompatible implementations that both pass the spec.

Fix: Name the items and the mechanism: And the inventory service receives a reservation release request for SHOE-RED-42 and BELT-BRN-M.
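Once the mechanism is named, a test can assert on the exact requests the inventory service receives. The recording client and `handle_payment_declined` below are hypothetical; the sketch assumes the release is an explicit per-item request rather than a TTL.

```python
class RecordingInventoryClient:
    """Test double that records every reservation release request it receives."""

    def __init__(self) -> None:
        self.release_requests: list[str] = []

    def release_reservation(self, sku: str) -> None:
        self.release_requests.append(sku)


def handle_payment_declined(inventory: RecordingInventoryClient, reserved_skus: list[str]) -> None:
    # Assumed behaviour: one explicit release request per reserved item.
    for sku in reserved_skus:
        inventory.release_reservation(sku)


inventory = RecordingInventoryClient()
handle_payment_declined(inventory, ["SHOE-RED-42", "BELT-BRN-M"])
```

An implementation that relies on a TTL would leave `release_requests` empty and fail this check, so the two mechanisms can no longer both pass.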

4. "Explicit user action" describes a flow that doesn't exist

File: order_creation.feature — Scenario: partial availability
Step: And no order is confirmed without explicit user action

"Explicit user action" is not defined anywhere in the spec. A second API call? A UI confirmation? A webhook? This step passes trivially because no order is confirmed — the negative condition is true by absence. But it implies a follow-up confirmation flow that was never built, never specced, and never reviewed. If a future agent reads this step and builds a confirmation flow to satisfy it, it will invent something that was never intended.

Fix: Remove it if the follow-up flow is out of scope. Or replace it with a concrete step: And a subsequent POST to /orders/{order_id}/confirm is required to complete the order.
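The "true by absence" problem is easy to reproduce. In this sketch the `confirmed_by_user` flag and the helper name are assumptions made for illustration; the point is only that a check over an empty collection always passes.

```python
def no_order_confirmed_without_user_action(confirmed_orders: list[dict]) -> bool:
    """True when every confirmed order carries an explicit user confirmation."""
    return all(order.get("confirmed_by_user", False) for order in confirmed_orders)


# Partial availability: nothing was confirmed, so there is nothing to check.
passes_on_empty = no_order_confirmed_without_user_action([])

# The check only has teeth once a confirmed order actually exists.
passes_on_violation = no_order_confirmed_without_user_action(
    [{"order_id": "ord-1", "confirmed_by_user": False}]
)
```

`all()` over an empty list returns `True`, so the step passes without any confirmation flow existing.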

5. Presence without value

File: order_status_bad.feature
Step: field-name assertions without value or type assertions

Asserting that a field exists only catches absence — not incorrect presence. An agent can return {"status": null} and pass. The spec catches the wrong thing.

Fix: Assert the full expected shape with explicit values rather than just field names.
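The difference between the two assertion styles can be seen in a few lines. The expected shape here is a hypothetical one-field response; a real spec would list every field.

```python
import json


def has_field(body: dict, name: str) -> bool:
    """Presence-only check: passes as long as the key exists."""
    return name in body


def matches_shape(body: dict, expected: dict) -> bool:
    """Value check: passes only when keys and values both match."""
    return body == expected


body = json.loads('{"status": null}')

presence_passes = has_field(body, "status")
value_passes = matches_shape(body, {"status": "CONFIRMED"})
```

The null status satisfies the presence check and fails the value check, which is the behaviour the spec should have been asserting all along.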

6. "An order exists" doesn't say how

File: order_status_good.feature
Step: Given an order exists with status "CONFIRMED"

"An order exists" doesn't specify how it got there — full creation flow, or directly seeded into the store. The two methods produce different side effects. An agent building a test harness may seed the order directly, bypassing the creation flow entirely, which means the status endpoint tests never verify that a real confirmed order is actually readable via the API.

Fix: Given a previously confirmed order created via POST /orders with id "{order_id}" — or explicitly state that direct seeding is acceptable.

7. "Correct" is relative

File: notification_service.feature
Step: And the notification contains the correct order id and total

"Correct" compared to what? If the order total is computed, two agents may compute it differently and both pass "correct" against their own computation.

Fix: Hardcode the expected value: And the notification request body contains order_id matching the confirmed order and total of 134.97.

Why all seven of these matter even though they're all green

Every item in that audit passes its test. That is the point.

Spec debt is not visible in a green CI run. It is visible only when you ask: what would a second agent build from this spec? The step "the payment gateway is not retried more than 2 times" has been in the codebase since Issue #2. It has passed every run. But it encodes an ambiguity that will be resolved differently by every agent that implements it fresh. The "no order is confirmed without explicit user action" step describes a flow that does not exist anywhere in the codebase. It passes because the negative condition is trivially true.

If a future agent reads that step and builds a confirmation flow to satisfy it, it will build something that was never specced, never reviewed, and never integrated. The spec invited it. The tests blessed it. Nobody noticed.

This is the exact failure mode that makes AI-assisted development unreliable at scale. Specs that look precise, pass their tests, and silently invite incompatible implementations. The debt doesn't announce itself. It compounds.

Where the project stands

Fifteen tests passing across four bounded feature files. The notification service is integrated. The Pact contracts — which existed before this session — remain unbroken because the notification call happens after the transaction completes. Adding a new service boundary didn't require touching existing contracts.

Seven spec debt items documented. None fixed yet. The fixes are the next issue.

Next issue: The Spec Audit — applying the debt framework to a real existing service and building the diagnostic tool readers can use on their own codebases.

Sources & Further Reading

Cucumber + Gherkin documentation
Dan Shapiro — The Five Levels: from Spicy Autocomplete to the Dark Factory
Nate B. Jones — natebjones.com
Project repository
Session findings — Issue #7

This article was written with the assistance of AI tools.