
Gunnar Grosch

Build a multi-agent system where each agent has a RISEN-structured system prompt and the agents coordinate through tool calls. Move from the theory of "behavioral contracts" to a working implementation with Strands Agents SDK and Amazon Bedrock.
The RISEN post introduced system prompts as behavioral contracts. One reader comment cut to the core of what comes next: "What happens when you have multiple agents that each need their own contract?"
The answer isn't complicated, but it's specific. The Expectation section of one agent defines the input format for the next. Narrowing prevents agents from doing each other's work. Steps encode the routing logic: which specialists to call, when, and why. The contract between agents lives in the prompts, not in orchestration code.
This post builds a working multi-agent system that demonstrates these contracts in practice. Here's what it looks like. Three different purchase requests, three completely different agent journeys:
"I need a 15 inch laptop for work"
→ Calling Price Research Agent
→ Calling Delivery & Logistics Agent
Agents called: Price Research, Delivery & Logistics
Agents skipped: Financing, Risk Assessment, Contract Review
"I want to buy a VW Golf, probably a used one"
→ Calling Price Research Agent
→ Calling Financing Agent
→ Calling Risk Assessment Agent
→ Calling Delivery & Logistics Agent
Agents called: Price Research, Financing, Risk Assessment, Delivery & Logistics
Agents skipped: Contract Review
"Looking for office space to rent, two-year lease, around 20 people"
→ Calling Price Research Agent
→ Calling Financing Agent
→ Calling Contract Review Agent
→ Calling Risk Assessment Agent
Agents called: Price Research, Financing, Contract Review, Risk Assessment
Agents skipped: Delivery & Logistics
Same coordinator, same five specialists available. The coordinator reads the request, decides which ones are needed, and only calls those. The rest of this post explains how.
The instinct when building a multi-agent system is to reach for a workflow engine: define the steps, wire the handoffs, control the flow. That instinct works against what makes agents useful. With a workflow engine, you decide what happens next, in advance. With an agent, the model decides what happens next, at runtime. When you hardcode the routing, you lose the agent's ability to reason about what's actually needed. A laptop doesn't need risk assessment. A used car does. That's a judgment call, not a branching condition. With hardcoded routing, every new category of purchase means a code change. With the routing in the prompt, the coordinator handles it on its own.
The DEV415 session on A2A and MCP at re:Invent 2025 makes this point in a production context: you design agent behaviors, not control flow. The SwiftShip demo shows it running on AWS with Lambda functions and Agent-to-Agent communication.
The approach in this post is simpler: agents as tools. Each specialist is wrapped as a tool() function that the coordinator can invoke. No HTTP endpoints, no message queues, no infrastructure beyond a single TypeScript process. The pattern is the same as the production architecture. The implementation is small enough to clone and run in two minutes.
Not every task needs multiple agents. Here are signals that splitting makes sense:

- The task spans distinct domains with conflicting Narrowing constraints (pricing vs. risk vs. financing).
- The routing is conditional: different requests need different subsets of the work.
- Subtasks need different models (fast and cheap for research, stronger reasoning for risk analysis).
- Subtasks need their own scoped tools that the rest of the system shouldn't see.

If your task fits in one prompt with consistent Narrowing and no conditional branches, keep it as one agent. Adding coordination overhead for its own sake makes the system slower and harder to debug.
In a single-agent system, RISEN structures the output. In a multi-agent system, RISEN structures the coordination. Each component does double duty:
Expectation defines the handoff format. What one agent returns is what the next agent reads. The Price Research Agent's Expectation section says "return the typical price range with budget, mid-range, and premium tiers." That's what the coordinator gets back and synthesizes with findings from other specialists.
Narrowing defines ownership boundaries. The Financing Agent's Narrowing says "do not assess market pricing, delivery logistics, contract terms, or product risk." The Risk Assessment Agent's Narrowing says the same about financing, delivery, and contracts. No agent steps on another agent's job. Without this, agents drift into each other's domains and produce redundant or contradictory advice.
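To make that concrete, the Financing Agent's Narrowing section might look something like this. This is a sketch assembled from the constraints quoted above, not the exact prompt from the repo; the last bullet is an assumption about how such a prompt would typically handle out-of-scope requests:

```
# Narrowing
- Only evaluate financing: loan terms, payment structures, and total
  borrowing cost.
- Do not assess market pricing, delivery logistics, contract terms, or
  product risk. Other specialists own those domains.
- If the request contains no financing angle, say so briefly instead of
  inventing one.
```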
Steps encode routing logic. The coordinator's Steps section IS the routing decision. It's not code. It's plain English in the system prompt:
# Steps
1. Read the purchase request and identify: what is being purchased, the
likely category, the approximate value range, and any special
circumstances.
2. Always invoke the research_prices tool.
3. If the estimated value exceeds $5,000, or if financing is mentioned,
invoke the evaluate_financing tool.
4. If the item is a tangible physical product, invoke the plan_delivery
tool. Physical products always require delivery or collection planning.
5. If the purchase involves a subscription, lease, or multi-year
commitment, invoke the review_contract tool.
6. If the estimated value exceeds $10,000, the item is used, or the
category carries known risk, invoke the assess_risk tool.
7. Synthesize all specialist reports into a structured recommendation.
The model reads these instructions, assesses the purchase request against them, and calls the appropriate tools. A $1,000 laptop triggers Steps 2 and 4 (price research and delivery). A used car triggers Steps 2, 3, 4, and 6 (price research, financing, delivery, and risk). An office lease triggers Steps 2, 3, 5, and 6 (price research, financing, contract review, and risk). Delivery is skipped because office space is not a physical product. The routing is conditional and emerges from the prompt.
The demo has one coordinator agent and five specialist agents. Each specialist is wrapped as a tool() function and passed to the coordinator:
| Specialist | When called | Narrowing constraint |
|---|---|---|
| Price Research | Always | Only pricing. No risk, financing, or delivery. |
| Financing | Value > $5K or financing mentioned | Only financing. No pricing or contracts. |
| Delivery & Logistics | Physical product | Only logistics. No pricing or risk. |
| Risk Assessment | High value, used goods, or risky category | Only risk. No pricing or delivery. |
| Contract Review | Subscription, lease, or multi-year commitment | Only contract terms. No pricing or risk. |
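The display names in that table map to internal tool names somewhere in a small registry. A sketch of what the `allToolNames` record referenced later might contain (the exact shape in the repo may differ):

```typescript
// Hypothetical registry mapping internal tool names (the ones the
// coordinator's Steps reference) to the display names printed by the
// RoutingHook and checked by the routing eval.
export const allToolNames: Record<string, string> = {
  research_prices: 'Price Research',
  evaluate_financing: 'Financing',
  plan_delivery: 'Delivery & Logistics',
  review_contract: 'Contract Review',
  assess_risk: 'Risk Assessment',
}
```

Keeping one record for both the live routing output and the eval means a renamed tool only needs updating in one place.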
The Strands Agents TypeScript SDK doesn't have a built-in agent.asTool() method. Instead, you wrap each specialist using the tool() function. The callback creates a fresh agent, invokes it, and returns its output:
import { tool } from '@strands-agents/sdk'
import { z } from 'zod'
import { invokeSpecialist } from './create-specialist-agent.js'
import { riskAssessmentPrompt } from '../prompts/risk-assessment.js'
import { ADVANCED_MODEL } from '../models.js'
export const assessRisk = tool({
name: 'assess_risk',
description: 'Identifies purchase risks, recommends due diligence steps, '
+ 'and estimates realistic total cost of ownership.',
inputSchema: z.object({
item: z.string().describe('What is being purchased'),
condition: z.enum(['new', 'used', 'refurbished', 'unknown']),
estimatedValue: z.string(),
riskContext: z.string().optional(),
}),
callback: async (input) =>
invokeSpecialist(riskAssessmentPrompt, input, { modelId: ADVANCED_MODEL }),
})
The invokeSpecialist helper creates the agent, invokes it with a 60-second timeout, and returns the string output. The shared helper keeps each tool wrapper to a few lines:
export async function invokeSpecialist(
systemPrompt: string,
input: Record<string, unknown>,
options?: SpecialistOptions
): Promise<string> {
const agent = new Agent({
model: createModel(options?.modelId ?? SPECIALIST_MODEL),
systemPrompt,
tools: options?.tools,
printer: false,
})
let timer: ReturnType<typeof setTimeout> | undefined
// Timeout produces a rejection, not partial output.
// If a specialist times out, the coordinator gets an error, not a half-answer.
const result = await Promise.race([
agent.invoke(`Purchase request details:\n${JSON.stringify(input, null, 2)}`),
new Promise<never>((_, reject) => {
timer = setTimeout(() => reject(new Error('Specialist timed out')), 60_000)
}),
]).finally(() => clearTimeout(timer))
return result.toString()
}
Notice two things: the Risk Assessment Agent uses a different model (ADVANCED_MODEL, which defaults to Sonnet 4.6) because risk analysis requires stronger reasoning than standard price research (which runs on Haiku 4.5). And options.tools lets specialists have their own sub-tools. The Price Research Agent has a save_price_snapshot tool that writes structured price data to a local JSON file. The coordinator never sees this tool. It's scoped to the specialist.
The coordinator itself is straightforward. It gets the RISEN prompt, all five specialist tools, and a hook for routing visibility:
const hook = new RoutingHook()
const agent = new Agent({
model: createModel(COORDINATOR_MODEL),
systemPrompt: coordinatorPrompt,
tools: allTools,
hooks: [hook],
printer: false,
})
const result = await agent.invoke(request)
printRecommendation(result.toString())
printSummary(hook.getCalledTools())
The RoutingHook uses the SDK's BeforeToolCallEvent to print each specialist as the coordinator decides to call it. Readers see the routing decisions happen in real time before the specialist output appears:
export class RoutingHook implements HookProvider {
private readonly calledTools: string[] = []
getCalledTools(): string[] {
return [...this.calledTools]
}
registerCallbacks(registry: HookRegistry): void {
registry.addCallback(BeforeToolCallEvent, (event) => {
const toolName = event.toolUse.name
const displayName = allToolNames[toolName]
if (displayName) {
this.calledTools.push(toolName)
console.log(` → Calling ${displayName} Agent`)
}
})
}
}
The hook also tracks which tools were called, which powers the summary at the end showing called vs. skipped agents. This is observability for multi-agent systems without any infrastructure: one hook provider, attached to the coordinator, watching the decisions flow.
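The called-vs-skipped summary is simple set arithmetic over the tool-name registry. A sketch of the helper that could sit behind `printSummary` (the function name and shape are assumptions):

```typescript
// Hypothetical helper: split a registry of { toolName: displayName }
// into called and skipped display names, based on what the hook recorded.
export function summarizeRouting(
  calledTools: string[],
  registry: Record<string, string>,
): { called: string[]; skipped: string[] } {
  const called = calledTools
    .map((name) => registry[name])
    .filter((display): display is string => Boolean(display))
  const skipped = Object.entries(registry)
    .filter(([name]) => !calledTools.includes(name))
    .map(([, display]) => display)
  return { called, skipped }
}
```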
You saw the routing output at the top: laptop triggers two specialists, used car triggers four, office lease triggers a different four (contract review instead of delivery). Here's what the full output looks like for the used car:
════════════════════════════════════════════════════════════
PURCHASE REQUEST
════════════════════════════════════════════════════════════
"I want to buy a VW Golf, probably a used one"
Coordinator is analyzing your request...
→ Calling Price Research Agent
→ Calling Financing Agent
→ Calling Risk Assessment Agent
→ Calling Delivery & Logistics Agent
════════════════════════════════════════════════════════════
PURCHASING RECOMMENDATION
════════════════════════════════════════════════════════════
[Coordinator synthesis: pricing tiers for used Golfs, financing options
with monthly payment estimates, risk assessment covering DSG transmission
and hidden maintenance costs, delivery logistics for vehicle collection]
────────────────────────────────────────────────────────────
Agents called: Price Research, Financing, Risk Assessment, Delivery & Logistics
Agents skipped: Contract Review
────────────────────────────────────────────────────────────
The coordinator calls four specialists, waits for all of them, then synthesizes their findings into a single recommendation. Each specialist stays in its lane: the Risk Assessment Agent talks about DSG transmission issues and hidden maintenance costs, the Financing Agent talks about loan terms and monthly payments, and neither comments on the other's domain. That separation comes from the Narrowing section of each specialist's RISEN prompt.
Two specialists can also pull in opposite directions. Risk might flag a $3,000 first-year repair budget while Financing offers attractive loan terms. The coordinator doesn't resolve that tension. Its Expectation section says to surface findings from each specialist, clearly attributed. Presenting both sides is the right call: choosing one would mean overriding a specialist's domain, which is exactly what Narrowing is supposed to prevent.
How do you know the coordinator is making the right calls? The same pattern from the eval post applies here: define expected behavior, run it, check the results.
The routing eval replaces the real specialist tools with stubs that return immediately. The coordinator still runs against the real LLM, so this tests the RISEN Steps routing logic without paying for five specialist invocations per case:
const stubTools = Object.keys(allToolNames).map((name) =>
tool({
name,
description: `Stub for ${name}`,
inputSchema: z.object({ item: z.string() }).passthrough(),
callback: async () => `[stub response from ${name}]`,
})
)
Four test cases with expected routing:
ROUTING EVAL
Coordinator: global.anthropic.claude-sonnet-4-6
Test cases: 4 (specialist tools stubbed)
Budget laptop
PASS called Price Research
PASS called Delivery & Logistics
PASS skipped Financing
PASS skipped Risk Assessment
PASS skipped Contract Review
Used car with financing
PASS called Price Research
PASS called Financing
PASS called Risk Assessment
PASS called Delivery & Logistics
PASS skipped Contract Review
Office lease
PASS called Price Research
PASS called Financing
PASS called Contract Review
PASS called Risk Assessment
PASS skipped Delivery & Logistics
SaaS subscription
PASS called Price Research
PASS called Contract Review
PASS skipped Delivery & Logistics
RESULT: 4/4 routing cases passed
The SaaS case checks fewer assertions than the others. Financing and Risk Assessment are omitted from that test because the coordinator's decision on them is borderline for a 50-seat enterprise tool: the annual cost might exceed the financing and risk thresholds depending on how the coordinator estimates the value. The eval only asserts on routing decisions that are unambiguously right or wrong.
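One way to express those per-case assertions, sketched as a plain data structure plus a check (the repo's actual eval harness may differ):

```typescript
// Hypothetical routing test case: tools that must be called, tools that
// must be skipped, and no assertion at all for borderline decisions.
interface RoutingCase {
  name: string
  mustCall: string[]
  mustSkip: string[]
}

// Returns a list of failure messages; an empty array means the case passed.
export function checkRouting(calledTools: string[], testCase: RoutingCase): string[] {
  const failures: string[] = []
  for (const t of testCase.mustCall) {
    if (!calledTools.includes(t)) failures.push(`expected call: ${t}`)
  }
  for (const t of testCase.mustSkip) {
    if (calledTools.includes(t)) failures.push(`expected skip: ${t}`)
  }
  return failures
}
```

Leaving a tool out of both lists, as the SaaS case does for Financing and Risk Assessment, means the eval accepts either decision.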
The calibration loop caught two prompt issues. The coordinator was calling the Delivery Agent for SaaS subscriptions (a purely digital product). Adding a Narrowing constraint ("Do not invoke DeliveryAgent for purely digital purchases") fixed that. It was also inconsistently calling Delivery for laptops because Step 4 said "requires shipping" rather than asserting that physical products always need delivery planning. Making the Step explicit ("Physical products always require delivery or collection planning, even if the buyer has not mentioned it") stabilized it.
This demo runs in a single process. Everything happens in-memory: the coordinator calls tool functions, those functions create specialist agents, the specialists return strings. That's fine for development and for understanding the pattern. But it doesn't scale, it has no fault isolation, and there's no way to monitor or manage the agents independently.
Moving to production, the core pattern stays identical: each specialist becomes a callable endpoint, the coordinator's tools make HTTP calls instead of function calls, and the routing logic in the RISEN Steps doesn't change at all. Two paths to get there:
Option 1: Lambda with Function URLs or API Gateway. Each specialist becomes a Lambda function exposed over HTTP. You can use Lambda Function URLs (simpler, direct IAM auth) or API Gateway (more control, useful if corporate policy restricts Function URLs). Either way, the coordinator's tool callbacks switch from local invokeSpecialist calls to HTTP requests, and IAM auth restricts invocation to the coordinator's execution role only. The SwiftShip demo uses this pattern with Function URLs: a triage agent calling payment, warehouse, and order agents over HTTP.
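Setting IAM aside for a moment, the shape of that swap is small. A sketch of an HTTP-backed replacement for invokeSpecialist (the function name, URL handling, and error policy are assumptions; real Function URL calls with IAM auth would also need SigV4 request signing):

```typescript
// Hypothetical remote variant of invokeSpecialist: same input shape,
// same string output, but over HTTP instead of an in-process agent.
export async function invokeRemoteSpecialist(
  endpoint: string,
  input: Record<string, unknown>,
): Promise<string> {
  const res = await fetch(endpoint, {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify(input),
  })
  if (!res.ok) {
    throw new Error(`Specialist at ${endpoint} returned ${res.status}`)
  }
  return res.text()
}
```

Because the coordinator only ever sees a string come back, its prompt and tool schemas stay untouched by the migration.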
Option 2: Amazon Bedrock AgentCore Runtime. AgentCore is a serverless runtime purpose-built for AI agents. Each agent deploys as a containerized Express service inside AgentCore, which handles session isolation per user, automatic scaling, and built-in observability. It supports the Strands TypeScript SDK and the A2A protocol. The deployment model is more involved than Lambda (Docker and ECR are required). Choose it when you need per-user session state, want A2A protocol support without building your own routing layer, or need a runtime that scales agent sessions independently rather than per-request.
In both cases, the RISEN prompts carry over unchanged. The next post walks through a full deployment of this purchasing coordinator demo.
Not all agents need the same model. The coordinator and Risk Assessment Agent use Sonnet 4.6 for stronger reasoning. Standard specialists (Price Research, Financing, Delivery, Contract Review) use Haiku 4.5, which is faster and cheaper. The model IDs are centralized in models.ts with environment variable overrides:
export const COORDINATOR_MODEL = process.env.COORDINATOR_MODEL_ID
?? 'global.anthropic.claude-sonnet-4-6'
export const SPECIALIST_MODEL = process.env.SPECIALIST_MODEL_ID
?? 'global.anthropic.claude-haiku-4-5-20251001-v1:0'
export const ADVANCED_MODEL = process.env.ADVANCED_MODEL_ID
?? 'global.anthropic.claude-sonnet-4-6'
As noted earlier, the Price Research Agent has its own tool (save_price_snapshot) that writes structured price data to a local JSON file. The coordinator never sees this tool; it's scoped to the specialist.
In a production system, that tool could be anything: a call to a live pricing API, a search against a product catalog, a query to a DynamoDB table, a vector search against a knowledge base. The point is that specialist agents aren't just prompt wrappers. Each one can have its own tool surface, scoped to its domain, invisible to the coordinator. The coordinator stays focused on routing. The specialist handles whatever retrieval or action its domain requires.
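A sketch of what the persistence behind save_price_snapshot might look like. The snapshot shape, file handling, and function signature here are assumptions for illustration, not the repo's code:

```typescript
import { readFileSync, writeFileSync, existsSync } from 'node:fs'

// Hypothetical snapshot record produced by the Price Research Agent.
export interface PriceSnapshot {
  item: string
  budget: string
  midRange: string
  premium: string
  capturedAt: string
}

// Append a snapshot to a local JSON file, creating the file on first write.
export function savePriceSnapshot(snapshot: PriceSnapshot, filePath: string): string {
  const existing: PriceSnapshot[] = existsSync(filePath)
    ? JSON.parse(readFileSync(filePath, 'utf8'))
    : []
  existing.push(snapshot)
  writeFileSync(filePath, JSON.stringify(existing, null, 2))
  return `Saved price snapshot for ${snapshot.item} (${existing.length} total)`
}
```

The string return value matters: it's what the specialist's model sees as the tool result, so it should confirm what happened rather than dump the whole file.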
Testing routing decisions is cheaper than testing specialist output quality. By stubbing the specialist callbacks, the routing eval runs four coordinator LLM calls instead of twenty. The stubs return immediately, so the eval completes in under a minute. This is practical for the iterative calibration loop.
The coordinator is one-shot. It reads the request, calls specialists, synthesizes, done. A real purchasing advisor would ask follow-up questions: "What year range are you considering?" or "Do you have a trade-in?" After delivering a recommendation, a conversational coordinator could handle "Tell me more about the financing options" by calling just the Financing Agent again with the new context.
The Strands SDK supports multi-turn conversation through the agent's message history. Making this demo conversational would mean wrapping the coordinator invocation in a loop that reads user input and feeds it back to the same agent instance. The RISEN prompts wouldn't change. The coordinator's Steps already describe when to call each specialist, and those decisions would apply on follow-up turns too. The main addition would be a new Step telling the coordinator to ask clarifying questions when the request is ambiguous before routing to specialists.
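A sketch of that loop, with the input source abstracted so it's testable. The agent interface here is a minimal stand-in, not the SDK's Agent class, though the real instance would satisfy it since invoke() returns a result with toString():

```typescript
// Minimal interface covering what the loop needs from an agent.
export interface InvokableAgent {
  invoke(prompt: string): Promise<{ toString(): string }>
}

// Hypothetical multi-turn wrapper: the same agent instance handles every
// turn, so the SDK's message history carries earlier context forward.
export async function runConversation(
  agent: InvokableAgent,
  inputs: Iterable<string>,
  onReply: (text: string) => void,
): Promise<number> {
  let turns = 0
  for (const line of inputs) {
    if (line.trim().toLowerCase() === 'exit') break
    const result = await agent.invoke(line)
    onReply(result.toString())
    turns += 1
  }
  return turns
}
```

Wiring this to a real terminal would just mean feeding it lines from node:readline instead of an array.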
This is a natural extension but adds enough complexity (input loop, conversation state, deciding when to re-route vs. answer directly) that it's better as a separate iteration than a first demo.
The contracts between your agents are just prompts. Change the Steps, change the routing. Add a Narrowing constraint, prevent an overlap. No code changes required. That's the payoff of RISEN in a multi-agent context: the coordination logic is readable, editable, and testable without touching the orchestration code.
What purchase would you try first? Run npm start with something unexpected and see which specialists the coordinator calls. Let me know in the comments!