Moon Robert
Six months ago I shipped a feature that was supposed to extract structured data from messy customer support tickets — severity, product area, a short summary. Simple enough. I spent maybe two hours on the prompt, got it looking reasonable in playground testing, and pushed it. Within a week, our support team was filing complaints because the model kept hallucinating product names and miscategorizing anything that mentioned a bug in passing as "critical."
The problem wasn't the model. It was me treating prompt engineering like a one-shot intuition exercise instead of a discipline with actual techniques. I've since spent a lot of time fixing that gap. Here's what I learned — what genuinely moved the needle and what turned out to be mostly noise.
Most people's understanding of few-shot prompting stops at "include some examples in the prompt." That's technically correct but misses why it works and, more importantly, when it stops working.
The actual mechanism is that you're not just showing the model what the output looks like — you're implicitly encoding your decision logic. Every example is a little policy statement. If your five examples all treat ambiguous cases one way, the model will generalize that preference. The dangerous flip side: if your examples have inconsistent edge case handling, the model will pick up that inconsistency too.
Here's a simplified version of what I ended up with for the support ticket classifier:
```python
SYSTEM_PROMPT = """You are a support ticket classifier. Classify each ticket into:
- severity: critical | high | medium | low
- product_area: auth | billing | api | dashboard | other
- summary: one sentence, 15 words max
Rules:
- "critical" means production is down or data loss is occurring RIGHT NOW
- A mention of a past bug is NOT critical
- If product area is genuinely ambiguous, use "other"
"""

EXAMPLES = [
    {
        "input": "Users can't log in at all. Getting 500 errors on /auth/login since 2pm.",
        "output": '{"severity": "critical", "product_area": "auth", "summary": "Login endpoint returning 500 errors since 2pm, users fully blocked."}'
    },
    {
        "input": "Hey, last week there was a bug where invoices showed wrong amounts. Is that fixed?",
        "output": '{"severity": "low", "product_area": "billing", "summary": "Customer asking about status of a past billing display bug."}'
    },
    {
        "input": "The export button on the dashboard does nothing when I click it.",
        "output": '{"severity": "medium", "product_area": "dashboard", "summary": "Export button unresponsive for this user."}'
    },
]
```
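For completeness, here's how those examples can be wired into a chat-style API as alternating user/assistant turns before the real ticket. A minimal sketch (`build_messages` is my own helper, not from any SDK, and no API call is made; the EXAMPLES list is abbreviated):

```python
# Abbreviated version of the EXAMPLES list above.
EXAMPLES = [
    {"input": "Users can't log in at all. Getting 500 errors on /auth/login since 2pm.",
     "output": '{"severity": "critical", "product_area": "auth", "summary": "Login endpoint down."}'},
    {"input": "Hey, last week there was a bug where invoices showed wrong amounts. Is that fixed?",
     "output": '{"severity": "low", "product_area": "billing", "summary": "Question about a past billing bug."}'},
]

def build_messages(examples, ticket):
    """Fold few-shot examples into alternating user/assistant turns,
    ending with the real ticket as the final user message."""
    messages = []
    for ex in examples:
        messages.append({"role": "user", "content": ex["input"]})
        messages.append({"role": "assistant", "content": ex["output"]})
    messages.append({"role": "user", "content": ticket})
    return messages

msgs = build_messages(EXAMPLES, "Dashboard is loading slowly today.")
```

The point of the alternating-turn structure is that the model sees each example as a completed exchange, which tends to anchor the output format more firmly than examples pasted into one long system prompt.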
The second example is doing a lot of work. "Last week there was a bug" — I explicitly built in a case that a naive classifier would misread as critical. Before I added that example, the model was marking anything with words like "bug," "error," or "broken" as high or critical even when the ticket was clearly asking about a past resolved issue. One targeted example fixed it.
Practical takeaway: audit your failures, not your successes. The examples you need are the ones covering your actual edge cases, not a representative sample of easy cases.
Chain-of-thought prompting — getting the model to reason step-by-step before giving an answer — has genuinely good empirical backing for complex reasoning tasks. The classic "Let's think step by step" trick from the 2022 Kojima et al. paper still holds up, as does the few-shot variant from Wei et al. the same year. But I've watched people apply it to tasks that don't need it and wonder why their latency tripled.
Here's the rough mental model I use: CoT helps when the answer depends on intermediate steps that aren't obvious from the input alone. Math problems. Multi-hop logic. Anything where a wrong intermediate conclusion cascades into a wrong final answer. It doesn't help much — and actively hurts cost and speed — for classification tasks where the answer is mostly pattern matching.
For the support ticket classifier above, CoT would be overkill. The model isn't doing multi-step reasoning; it's categorizing. But for a different task I worked on — evaluating whether a proposed code change would break backwards compatibility given a changelog and a set of API contracts — CoT was essential. Without it, the model would confidently give wrong answers because it was skipping the intermediate check of "does the old signature still exist?"
The prompt addition was minimal:
```
Before giving your final answer, reason through:
1. What does the old API contract guarantee?
2. What does the proposed change actually modify?
3. Is any existing guarantee violated?
Then give your verdict as JSON.
```
That structure forced it to surface the intermediate logic rather than leap to a conclusion. Accuracy on our eval set went from 71% to 89%. I was honestly surprised the gap was that large.
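One practical wrinkle with this pattern: because the model emits its reasoning before the verdict, the caller has to peel the final JSON off the end of the response. A sketch of one way to do that (the response text here is invented for illustration, and the regex assumes the reasoning itself contains no braces):

```python
import json
import re

def extract_verdict(response_text):
    """Pull the trailing JSON object out of a response that contains
    free-form reasoning followed by a JSON verdict."""
    matches = re.findall(r"\{.*\}", response_text, re.DOTALL)
    if not matches:
        raise ValueError("no JSON object found in response")
    return json.loads(matches[-1])

response = """1. The old contract guarantees get_user(id) returns a User.
2. The change renames it to fetch_user.
3. Yes, an existing guarantee is violated.
{"breaks_compat": true, "reason": "get_user removed"}"""
verdict = extract_verdict(response)
```

In production you'd want this wrapped in a retry that re-prompts on a parse failure, since CoT outputs occasionally trail off without the final JSON.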
One thing I'd add: zero-shot CoT ("think step by step") works, but structured CoT (spelling out the reasoning steps explicitly) tends to be more reliable when you have domain-specific logic you want applied consistently. Your mileage may vary — I've only tested this on tasks in my own domain.
Around November last year I got excited about self-consistency sampling — the idea of running the same prompt multiple times and taking a majority vote on the answer. There's solid research behind it, and for high-stakes classifications it genuinely reduces variance.
So I implemented it for a content moderation helper we were building. Three samples, majority vote, done. Latency went up 3x. Cost went up 3x. And then I looked at the agreement rates: for ~85% of inputs, all three samples agreed immediately. I was paying triple cost for the 15% where it actually mattered — and for a lot of those disagreements, a smarter single prompt would have been just as good.
The lesson isn't that self-consistency is bad. It's that it's a latency and cost trade-off that only makes sense when: (a) you're on a high-stakes task where errors are genuinely expensive, and (b) you've already optimized your single-sample performance and hit a ceiling. I hadn't done (b). I just bolted it on as an improvement without checking whether my base prompt was already fixable by other means. It was — a few better examples brought the agreement rate up to ~94%, which made the remaining disagreements manageable with a simple human review queue.
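The gated version is simple to sketch: sample a few times, take the majority vote, and route any disagreement to the review queue rather than trusting a 2-of-3 split on a high-stakes call (`sample_fn` is a stub standing in for the real API call):

```python
from collections import Counter

def classify_with_consistency(sample_fn, ticket, n=3):
    """Sample n classifications, majority-vote the label, and flag
    any disagreement for human review instead of trusting the vote."""
    labels = [sample_fn(ticket) for _ in range(n)]
    label, votes = Counter(labels).most_common(1)[0]
    return {"label": label, "needs_review": votes < n}

# Stub sampler scripted to disagree once, as the real model sometimes did.
responses = iter(["high", "high", "medium"])
result = classify_with_consistency(lambda t: next(responses), "ticket text")
```

The `needs_review` flag is the part that made the economics work: unanimous results pass straight through, and only the genuinely ambiguous minority costs human time.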
Tree of Thought (ToT) takes CoT further by exploring multiple reasoning branches and backtracking when a path looks wrong. ReAct interleaves reasoning with tool use — the model thinks, calls a tool, observes the result, thinks again. Both are real techniques with real use cases. Both also add significant complexity.
I've used ReAct-style patterns in an internal tool that queries our metrics database to answer ad-hoc business questions. The flow is: model reasons about what query to write, executes it, checks if the result looks plausible, optionally refines the query. It works well. But "works well" required about two weeks of iteration on the tool-use schema, error handling for bad SQL, a retry loop, output validation, and a lot of prompt tuning around how the model decides when a result is "plausible enough." This is not an afternoon project.
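The skeleton of that loop is small even though the production version wasn't. A sketch with both the model and the database stubbed out (the action dict format and function names here are mine, not from any framework):

```python
def react_loop(model, run_query, question, max_steps=5):
    """Minimal ReAct-style loop: the model proposes a query, we execute
    it, feed the observation back, and repeat until the model answers."""
    history = [f"Question: {question}"]
    for _ in range(max_steps):
        action = model("\n".join(history))  # returns a dict: query or answer
        if action["type"] == "answer":
            return action["answer"]
        try:
            observation = run_query(action["sql"])
        except Exception as e:  # bad SQL: show the error, let the model retry
            observation = f"ERROR: {e}"
        history.append(f"Query: {action['sql']}\nObservation: {observation}")
    return None  # gave up; caller should fall back to a human

# Scripted stub model: one query, then an answer based on the observation.
steps = iter([
    {"type": "query", "sql": "SELECT COUNT(*) FROM signups"},
    {"type": "answer", "answer": "42 signups"},
])
answer = react_loop(lambda prompt: next(steps), lambda sql: 42, "How many signups?")
```

All the hard parts I mentioned — the retry loop, output validation, the "plausible enough" judgment — live inside what's stubbed here, which is exactly why the afternoon-project impression is misleading.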
ToT I've only used experimentally. For most production applications I've seen, the added complexity doesn't justify the gains unless you're doing something genuinely hard — planning problems, multi-constraint optimization, the kind of thing where a single reasoning chain regularly paints itself into a corner. For most business logic classification, extraction, or generation tasks, well-designed CoT with good examples gets you most of the way there at a fraction of the complexity.
So if you're considering ToT: run a careful eval on CoT first. If CoT is getting you to 85%+ accuracy on your task and you need to push to 92%+, maybe ToT is worth the engineering investment. If CoT is getting you 60%, your examples and instructions need work before you reach for fancier techniques.
One area that doesn't get enough attention in the "advanced techniques" conversation is caching. Anthropic's API (Claude Sonnet or Opus) supports prompt caching on long system prompts and example blocks, which can dramatically reduce costs on repeated calls that share the same base prompt.
I'm not going to go deep on implementation here because the docs cover it well, but the architectural implication is important: it incentivizes you to move as much stable instruction content as possible into the system prompt or early turns, and keep the variable user input lean. This actually aligns well with good prompt structure anyway — long examples and detailed rules in the system prompt, just the raw input in the user turn.
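For reference, the request shape looks roughly like this, assuming the `cache_control` content-block format from Anthropic's docs (the model name and prompt text below are placeholders, and no API call is made):

```python
SYSTEM_TEXT = "You are a support ticket classifier. <full rules and examples go here>"

def build_request(ticket):
    # Stable instructions live in the system block and are marked cacheable;
    # only the lean per-request ticket goes in the user turn.
    return {
        "model": "claude-sonnet-placeholder",  # placeholder, not a real model id
        "max_tokens": 256,
        "system": [{
            "type": "text",
            "text": SYSTEM_TEXT,
            "cache_control": {"type": "ephemeral"},
        }],
        "messages": [{"role": "user", "content": ticket}],
    }

req = build_request("Export button does nothing.")
```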
One gotcha I hit: any modification to the system prompt invalidates the cache. During active development I was tweaking examples constantly and wondering why my cache hit rate was zero. Obvious in retrospect. Keep a stable "production" prompt separate from your experimentation environment.
After all of this, here's my honest take on how to approach prompt engineering on a real project:
Start with zero-shot and understand why it fails. Don't jump to few-shot or CoT immediately. The failure modes of a zero-shot prompt tell you what examples to write and what reasoning steps to add. I wasted time writing five generic examples before I knew what edge cases I was actually dealing with.
When you add examples, make them diagnostic. Each example should cover a case that a naive prompt would get wrong. Three targeted examples beat ten generic ones.
Add CoT when you have multi-step reasoning or when intermediate conclusions are likely to cascade into downstream errors. Skip it for straightforward classification and extraction.
Don't reach for self-consistency or ToT until you've genuinely maxed out simpler approaches. Both are real tools but they come with real costs — latency, complexity, money. Most of the time, better instructions and better examples are the cheaper fix.
Build an eval set early. This sounds obvious but I kept skipping it "until things stabilized." Nothing stabilizes until you have numbers to stabilize against. Even 50 labeled examples you can run against automatically will save you from shipping regressions you didn't notice.
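The minimal harness really is this small (`classify_fn` is stubbed here; in practice it wraps your actual prompt and model call):

```python
def run_eval(classify_fn, labeled_examples):
    """Score a classifier against a labeled eval set and return
    accuracy plus the failures worth reading."""
    failures = []
    for ex in labeled_examples:
        got = classify_fn(ex["input"])
        if got != ex["expected"]:
            failures.append({"input": ex["input"],
                             "expected": ex["expected"], "got": got})
    accuracy = 1 - len(failures) / len(labeled_examples)
    return accuracy, failures

# Tiny illustrative eval set and a rule-based stand-in for the model.
eval_set = [
    {"input": "Login is down", "expected": "critical"},
    {"input": "Question about last week's bug", "expected": "low"},
]
acc, fails = run_eval(lambda t: "critical" if "down" in t else "low", eval_set)
```

The `failures` list matters more than the accuracy number: it's exactly the audit-your-failures input that tells you which diagnostic example to write next.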
The prompt engineering field has a habit of making simple things sound mystical and making genuinely hard things sound like a quick trick. Neither is true. It's disciplined iteration, and the techniques above are just tools for that iteration — not shortcuts around it.