Margarita Titova

Most QA engineers who try AI and conclude "it's not that useful" are right, but for the wrong reason. The tool is not the problem. The input is. AI output quality is almost entirely a function of prompt quality, and prompting is a skill that takes practice to develop. The gap between a vague prompt and a well-structured one is not talent. It is technique.
This article covers the techniques that produce the most consistent improvement in AI output for QA work, what makes each of them work, and an honest account of what did not pan out.
The single biggest improvement in my outputs came when I stopped describing what I wanted and started showing examples instead. This is called few-shot prompting: give the AI two or three examples of the exact pattern you want it to follow.
AI models are trained to recognize and reproduce patterns. They are much better at extrapolating from a concrete example than at interpreting an abstract description. "Write the title in the style of our team" means almost nothing. Three examples of titles your team has written means everything.
A prompt built on those examples produces a title that matches your team's format and specificity level without you having to explain what the format is. Once you understand this, few-shot prompting becomes the default for anything where format and style matter: ticket names, bug report summaries, test case titles, commit messages.
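The pattern is mechanical enough to script. Here is a minimal sketch of a few-shot prompt builder; the example titles and the helper name are hypothetical, not your team's real conventions:

```python
# Sketch: assembling a few-shot prompt for ticket titles.
# The example titles below are hypothetical stand-ins for your team's real ones.

def build_few_shot_prompt(examples: list[str], task: str) -> str:
    """Prefix the task with concrete examples of the desired pattern."""
    shots = "\n".join(f"Example: {e}" for e in examples)
    return f"{shots}\nNow, following the same pattern: {task}"

prompt = build_few_shot_prompt(
    [
        "[Quotes] Bundle discount not applied when 3rd policy added",
        "[Checkout] Payment step hangs after expired-card retry",
    ],
    "write a title for a bug where the agent portal drops the session on tab switch",
)
print(prompt)
```

Two or three examples is usually enough; past that, extra shots add length without changing the pattern the model extracts.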
For complex prompts with multiple components (role, context, data, task, constraints), XML tags prevent the AI from mixing up what belongs where. They act like labeled containers. Without structure, AI may treat your context as part of the task instruction, or blend the data you provide into the output in confusing ways.
Here is a realistic task written as plain text:
You are a QA engineer for an insurance SaaS. We are testing multi-policy bundling where a customer can add up to 3 insurance products to a single quote and it has not shipped before. Write 6 test cases -- happy path, validation edge cases, and one for the max bundle limit -- in TestRail format with Title, Preconditions, Steps, Expected Result and no intro text. Only reference fields listed in the ACs and do not invent fields.
The constraint "only reference fields listed in the ACs" arrives well after the context that defines what the ACs are. The model has to infer where context ends and instruction begins. With a prompt this dense, it regularly gets that boundary wrong: it invents fields that sound plausible or lets the role description bleed into the output. XML removes that ambiguity:
<role>You are a QA engineer writing test cases for an insurance SaaS product.</role>
<context>We are testing multi-policy bundling. A customer can add up to 3 insurance products to a single quote. The feature is new and has not shipped before.</context>
<task>Write 6 test cases. Include happy path, validation edge cases, and one case for the maximum bundle limit.</task>
<format>TestRail format: Title, Preconditions, Steps, Expected Result. No intro text.</format>
<constraints>Only reference fields that are listed in the ACs. Do not invent fields.</constraints>
Output is noticeably more structured when the prompt is structured. This required a genuine mindset shift for me: writing prompts like messages to a colleague produces conversational results. XML produces engineering-grade results. Both have their place. Quick questions work fine as plain text; for anything multi-part, XML is worth the extra ten seconds.
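If you build these prompts often, the wrapping is worth automating. A minimal sketch, assuming nothing beyond the tag names used above (they are a convention, not a formal schema):

```python
# Sketch: wrapping prompt components in XML tags so the model
# cannot confuse context with instructions. Tag names follow the
# convention shown above; they are not a formal schema.

def xml_prompt(**sections: str) -> str:
    """Wrap each named section in a matching pair of XML tags."""
    return "\n".join(f"<{tag}>{text}</{tag}>" for tag, text in sections.items())

prompt = xml_prompt(
    role="You are a QA engineer writing test cases for an insurance SaaS product.",
    context="We are testing multi-policy bundling. A customer can add up to 3 products to a quote.",
    task="Write 6 test cases. Include happy path, validation edge cases, and the bundle limit.",
    format="TestRail format: Title, Preconditions, Steps, Expected Result. No intro text.",
    constraints="Only reference fields that are listed in the ACs. Do not invent fields.",
)
print(prompt)
```

Keyword arguments keep the call site readable and preserve section order, so the constraints always land last, clearly separated from the context they govern.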
A useful signal: if you find yourself re-reading an AI response and feeling like it "missed the point," your prompt probably needed better structure. The model answered the prompt it inferred from your text, which may not have been the prompt you intended.
For analytical tasks (root cause analysis, flaky test diagnosis, risk assessment), asking AI to reason step-by-step before giving an answer consistently produces better output than asking for a direct answer. The mechanism is that intermediate reasoning steps constrain the model toward logical conclusions rather than surface-level pattern matching.
A useful habit: when you get a CoT response, read the reasoning chain before the conclusion. This is where you spot a wrong assumption before it bakes into the final answer. Correcting it mid-conversation rather than starting over is one of the most valuable moves you can make.
Without Chain of Thought:
Why is this test flaky? [code]
With Chain of Thought:
Analyze this test step by step.
Step 1: Explain what the test is actually doing.
Step 2: Identify every place where non-determinism could be introduced (timing, external state, network calls).
Step 3: Rank these by likelihood of causing flakiness.
Step 4: Suggest the minimal fix for the most likely cause.
[code]
The slower approach catches more. On more than one occasion, step-by-step analysis surfaced a race condition that a direct "what's wrong?" answer missed entirely. More importantly, the chain-of-thought format makes the model's reasoning visible, so you can see where it goes wrong and correct it mid-conversation rather than acting on a subtly flawed conclusion.
The cost is slightly longer responses, which is a fine trade when the task is genuinely complex. For simple tasks (generating 5 test case names), skip the CoT instruction. It adds overhead without value.
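The step list is reusable across diagnoses, so it can live in one place. A sketch of a CoT prompt builder, mirroring the flaky-test steps above (the placeholder string stands in for real test source):

```python
# Sketch: prepending explicit reasoning steps to an analytical prompt.
# The step list mirrors the flaky-test example above.

COT_STEPS = [
    "Explain what the test is actually doing.",
    "Identify every place where non-determinism could be introduced (timing, external state, network calls).",
    "Rank these by likelihood of causing flakiness.",
    "Suggest the minimal fix for the most likely cause.",
]

def cot_prompt(question: str, code: str, steps: list[str] = COT_STEPS) -> str:
    """Number the reasoning steps and attach the code under analysis."""
    numbered = "\n".join(f"Step {i}: {s}" for i, s in enumerate(steps, 1))
    return f"{question} Analyze this step by step.\n{numbered}\n\n{code}"

print(cot_prompt("Why is this test flaky?", "<paste test source here>"))
```

Swapping in a different step list turns the same builder into a root-cause or risk-assessment prompt.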
Telling AI who it is changes what it generates. Not because the model becomes that person, but because the role constrains the relevant domain knowledge and perspective it applies to the task. The same test plan looks different through the eyes of a senior QA engineer versus a junior one, a security auditor versus a UX tester.
Roles that are useful in practice:
"Senior QA engineer with 10 years in fintech/insurtech": for reviewing test cases and surfacing gaps a junior might miss
"Non-technical end user of an insurance portal, 60 years old": for finding UX-level issues and accessibility gaps in test coverage
"Security auditor reviewing this API for OWASP Top 10 vulnerabilities": for generating security-focused test cases you would not normally think of
"Developer who wrote this feature and is now doing code review": for a technical perspective on where things could break internally
You can even ask for multiple perspectives in sequence: "Review these test cases first as a senior QA engineer, then as a non-technical end user. What does each perspective catch that the other misses?" The contrast is often illuminating.
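The multi-perspective pass can be scripted as a list of persona-prefixed prompts. A sketch under one assumption: the artifact string is whatever test cases you want reviewed, and no model call is shown here:

```python
# Sketch: running the same review through several personas in sequence.
# The personas come from the list above; sending each prompt to a model
# is left out deliberately.

PERSONAS = [
    "a senior QA engineer with 10 years in fintech/insurtech",
    "a non-technical 60-year-old end user of an insurance portal",
    "a security auditor reviewing this API for OWASP Top 10 vulnerabilities",
]

def perspective_prompts(artifact: str) -> list[str]:
    """One review prompt per persona, each asking what that lens uniquely catches."""
    return [
        f"You are {p}. Review the following test cases and list what this "
        f"perspective catches that the others would miss.\n\n{artifact}"
        for p in PERSONAS
    ]

for p in perspective_prompts("<test cases here>"):
    print(p.splitlines()[0])
```

Running the personas as separate prompts, rather than one combined prompt, keeps each review focused and makes the contrasts easier to diff.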
After generating test cases, a bug report, or a test plan, add one more prompt:
Review the output you just wrote.
-- What scenarios are missing?
-- Which cases are duplicates or near-duplicates?
-- Where is the wording unclear or ambiguous to a junior QC?
-- Does this cover all the ACs?
A useful self-critique response looks like this:
Missing: no case for session expiry during step 3 of the checkout flow.
Duplicate: cases 2 and 4 both test the same validation path -- empty required field on submit.
Unclear: 'submit fails' in case 6 does not specify what the expected error message is.
Coverage gap: no test for the maximum bundle limit (3 policies) mentioned in the ACs.
This catches obvious gaps you would miss in a quick scan. It adds about thirty seconds to the process and regularly improves output quality. More importantly, it has changed how I review AI output personally: rather than just reading it, I now approach it in the same critical mode I would use reviewing a colleague's work. That shift in mindset matters more than the specific self-critique prompt.
The self-critique output also works as a review checklist. Even when the AI misses something specific, it flags the right categories to look for. "Check for missing negative scenarios" is a useful reminder even if the model missed a specific one.
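Because the follow-up is the same every time, it can be a constant appended to the conversation. A minimal sketch, assuming the common role/content message-list shape (the placeholder content stands in for a real generated draft):

```python
# Sketch: a reusable self-critique follow-up, sent as a second user turn
# in the same conversation after the model produces its first draft.

SELF_CRITIQUE = """Review the output you just wrote.
-- What scenarios are missing?
-- Which cases are duplicates or near-duplicates?
-- Where is the wording unclear or ambiguous to a junior QC?
-- Does this cover all the ACs?"""

def with_self_critique(history: list[dict]) -> list[dict]:
    """Append the critique turn to an existing message history."""
    return history + [{"role": "user", "content": SELF_CRITIQUE}]

history = [
    {"role": "user", "content": "Write test cases for the bundling flow."},
    {"role": "assistant", "content": "<generated cases>"},
]
history = with_self_critique(history)
```

Keeping the critique in the same conversation matters: the model critiques the draft it actually wrote, with full context, rather than a pasted copy.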
For large, complex features, a single prompt produces worse results than a sequence of smaller ones. AI attention degrades over long contexts. Details from the beginning of a long prompt can get lost by the time the model reaches the task instruction at the end. A prompt with a 10-page spec, 20 ACs, and a test count instruction at the end risks the model ignoring that constraint by the time it starts generating. Chunking prevents this.
A pattern that works well for test planning:
Step 1: "Read this specification and list the main user flows it describes. Nothing else."
Step 2: "For the [specific flow], write happy-path test cases only."
Step 3: "Now add negative scenarios and edge cases for the same flow."
Step 4: "Review the complete list and remove duplicates. Flag any cases that need domain expertise to verify."
Breaking it down gives both AI and you a chance to catch misunderstandings before they compound. A wrong assumption in step one does not silently contaminate steps two and three. You see it, correct it, and continue.
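The four steps above can be sketched as a scripted sequence. Sending each prompt to a model is omitted; this only shows how the flow name threads through the template:

```python
# Sketch: the four-step chunked flow as a prompt sequence.
# Only step 2 is parameterized by the flow under test.

STEPS = [
    "Read this specification and list the main user flows it describes. Nothing else.",
    "For the {flow} flow, write happy-path test cases only.",
    "Now add negative scenarios and edge cases for the same flow.",
    "Review the complete list and remove duplicates. Flag any cases that need domain expertise to verify.",
]

def chunked_prompts(flow: str) -> list[str]:
    """Fill the flow name into whichever steps reference it."""
    return [s.format(flow=flow) if "{flow}" in s else s for s in STEPS]

for i, p in enumerate(chunked_prompts("multi-policy bundling"), 1):
    print(f"Step {i}: {p}")
```

The point of the script is the checkpoint between steps: you inspect each response before issuing the next prompt, which is exactly where wrong assumptions get caught.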
One of the most common mistakes I see is treating AI as a one-shot oracle: you write a prompt, you get an answer, you either use it or you don't. The better mental model is a conversation. When the first response is 70% of what you need, the right move is not to rewrite the prompt. Continue the conversation instead.
Here is what that looks like in practice. First response from the AI covers the happy path and a few validation cases but payment testing is thin: two cases, both covering declined cards the same way. You do not start over. You continue:
Turn 1 (your prompt): Write test cases for the payment flow in the new policy checkout.
Turn 1 (AI response): [8 cases -- happy path, validation, 2 thin payment cases]
Turn 2 (you): The payment cases are too thin. Add 3 more covering: currency conversion edge
cases, expired card handling mid-flow, and concurrent submission from two browser tabs.
Turn 2 (AI response): [3 new cases, precisely scoped -- expired card mid-session with session
state preserved, currency mismatch on international cards, race condition on double-submit]
The second turn produces better cases than you would have gotten by asking for "comprehensive payment coverage" in turn one, because the constraint is precise and the AI already knows the context of the first response.
Or:
"These are too generic. Rewrite the preconditions to assume the user is logged in as an agent, not an end consumer. The feature works differently for agents."
Each follow-up narrows the output toward what you actually need. The context of the previous response is still in the conversation window, so you are building on it rather than starting over. Using AI as a conversation partner rather than a one-shot oracle changes your workflow more than any individual technique does.
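Mechanically, "continuing the conversation" means appending to one message list rather than issuing fresh prompts. A sketch of the two turns above; `ask()` is a placeholder, not part of any real SDK:

```python
# Sketch: treating the exchange as an accumulating message list rather
# than one-shot prompts. ask() is a placeholder for a real model call.

def ask(messages: list[dict]) -> str:
    # Placeholder: a real implementation would send `messages` to a model.
    return "<model response>"

messages = [{"role": "user",
             "content": "Write test cases for the payment flow in the new policy checkout."}]
messages.append({"role": "assistant", "content": ask(messages)})

# Turn 2: narrow the output instead of rewriting the original prompt.
messages.append({"role": "user", "content": (
    "The payment cases are too thin. Add 3 more covering: currency conversion "
    "edge cases, expired card handling mid-flow, and concurrent submission "
    "from two browser tabs.")})
messages.append({"role": "assistant", "content": ask(messages)})
```

Because the whole list is sent on every turn, the second request is interpreted against the first response, which is what makes the narrow follow-up land precisely.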
It is worth separating genuine technique limitations from user errors. The fixes are different.
Domain-specific calculation logic. Insurance premium rules involve age brackets, regional multipliers, policy type combinations, and bundling discounts. AI consistently produces plausible-sounding but incorrect boundary values for these rules. This domain knowledge lives in your requirements documents and your domain experts, not in a general-purpose language model. Write those test cases by hand.
Truly comprehensive coverage. "Comprehensive" is the wrong goal for AI output. It will generate a complete-looking set of cases that feels thorough, until a domain expert finds the subtle coverage gaps that require deep system knowledge to see. Use AI to generate a strong first pass, then review it the same way you would a junior engineer's test plan.
Confabulated specifics. AI will invent field names, endpoint paths, and error codes that sound plausible in the context of your product. These are confidently stated and easy to miss on a quick read. If you do not have the spec in front of you to verify, you will ship test cases that test the wrong thing.
Asking for test cases without providing ACs. Assuming AI knows what the feature does is a consistent mistake. It does not. Vague input produces vague output, every time. Reading the acceptance criteria carefully before prompting is still the most important step. AI amplifies your understanding of the requirements; it does not replace it.
Skipping review because the output looked clean. Polished formatting is not evidence of correct content. AI outputs that look professional are just as likely to be wrong as messy ones. Budget review time for AI output the same way you would for a colleague's first draft. The self-critique step described earlier is a forcing function for this: it creates a review checkpoint you cannot accidentally skip.
Prompting improves with practice: write a prompt, review the output critically, identify what was wrong, and adjust. Most engineers skip the middle step: they either accept the output or discard it without examining why it fell short. The engineers who improve fastest treat each output as feedback on how the model reads their language.
Start with the next task you would normally do manually. Write a structured prompt with a role, context, and explicit output format. See what comes back. Then ask yourself: what in the output do I need to change, and what in the prompt would have prevented that? That loop is where the skill builds. Each iteration sharpens your instincts for the next one.