I Built an AI Code Reviewer That Learns From Every Pull Request

Ditya

When we started building an AI code reviewer, we assumed the hardest part would be generating good review comments.
We were wrong.
The biggest improvement didn't come from changing the model, writing a better prompt, or increasing context length. It came from giving the reviewer memory.
That realization led us to build Sentimental.ai, a code review system designed around a simple idea: every engineering team accumulates knowledge, but most AI reviewers forget it after every interaction.
Human reviewers don't work that way. Senior engineers remember architectural decisions, recurring mistakes, production incidents, coding conventions, and years of review discussions. That accumulated context shapes every review they leave.
Most AI reviewers start from zero.
We wanted to see what would happen if they didn't.

## The Problem With Stateless AI Reviewers
Most AI code review tools follow a straightforward workflow:

    Pull Request
         ↓
    Large Language Model
         ↓
    Review Comments

The model receives a diff, generates recommendations, and moves on.
That sounds reasonable until you compare it with how engineering teams actually operate.
A payments team may require audit logs for every money-moving action.
An analytics team may enforce strict schema validation at ingestion boundaries.
An infrastructure team may care deeply about retry budgets, observability, and operational safety.
None of these conventions exist in the model's training data.
They're learned through years of pull requests, incidents, architecture discussions, and production experience.
A generic AI reviewer has no access to that context.
As a result, it often produces technically correct feedback that feels disconnected from the realities of the codebase.

## Treating Organizational Knowledge as Memory

Instead of viewing memory as an implementation detail, we made it the foundation of the system.
The reviewer stores and retrieves four categories of organizational knowledge.

    export type MemoryKind =
      | "review"
      | "team_standard"
      | "developer"
      | "architecture";

Each category captures a different aspect of how engineering teams make decisions.

### Review Memory
Past reviews become searchable memories.

    export interface ReviewMemory {
      prTitle: string;
      codePattern: string;
      comment: string;
      outcome: "accepted" | "rejected";
    }

This allows the system to remember which recommendations were accepted, which were rejected, and how similar situations were handled previously.
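
To make the idea of "remembering outcomes" concrete, here is a minimal sketch of how stored review memories could be summarized for one code pattern. The `summarizePrecedents` helper is hypothetical and not part of the project, and the exact string match on `codePattern` is a stand-in assumption for whatever similarity search the memory layer actually performs.

```typescript
// Same shape as the ReviewMemory interface shown above.
interface ReviewMemory {
  prTitle: string;
  codePattern: string;
  comment: string;
  outcome: "accepted" | "rejected";
}

// Hypothetical helper: summarize how past reviews of one code pattern turned out.
// Exact matching on codePattern is an assumption made for illustration only.
function summarizePrecedents(memories: ReviewMemory[], codePattern: string) {
  const matches = memories.filter((m) => m.codePattern === codePattern);
  const accepted = matches.filter((m) => m.outcome === "accepted").length;
  return {
    total: matches.length,
    accepted: accepted,
    rejected: matches.length - accepted,
    // Avoid dividing by zero when there is no precedent at all.
    acceptanceRate: matches.length === 0 ? 0 : accepted / matches.length,
  };
}
```

A summary like this is one way a reviewer could decide whether a recommendation has a track record before repeating it.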

### Team Standards

Engineering standards are stored as structured memories rather than static documentation.

    export interface TeamStandardMemory {
      standard: string;
      rationale: string;
      confidence: number;
      supportingReviewIds: string[];
    }

Every standard has supporting evidence and a confidence score that evolves over time.
Instead of treating rules as permanent truths, the system treats them as beliefs supported by historical reviews.

### Developer Patterns

The reviewer also remembers recurring developer-specific habits.
Examples include:

- Frequently missing audit log entries
- Mutating shared data structures in place
- Reading configuration inside hot paths
- Forgetting error-path test coverage

The goal isn't to judge engineers. The goal is to provide more context-aware reviews.

### Architecture Memory
Finally, the system stores architectural conventions.
Examples include:

- Repository + Service + Controller layering
- Retry budget wrappers for outbound requests
- Observability requirements for handlers
- Audit logging patterns

These memories help explain why the codebase is structured the way it is.

## Building a Memory-Aware Review Pipeline
The most important step in the entire review flow happens before the model sees the code.
For every pull request, the system retrieves relevant organizational context.

    const ctx = await retrieveContext(input);

The retrieval process gathers:

- Similar review precedents
- Team standards
- Developer patterns
- Architecture conventions

This is where we integrated Hindsight. We used it as the memory layer because we wanted persistent storage and retrieval of organizational knowledge rather than temporary prompt context.

Instead of asking the model to reason from a raw diff alone, we first retrieve relevant memories and assemble them into a structured context block.
The resulting prompt contains information that a senior engineer would already know.
Only then does the model generate feedback.
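
The assembly step can be sketched roughly as follows. This is an illustration under assumptions, not the project's actual code: the `RetrievedContext` shape, the `renderSection` and `buildReviewPrompt` helpers, and the section titles are all hypothetical names chosen to mirror the four kinds of context listed above.

```typescript
// Hypothetical shape for what a retrieval step might return (an assumption).
interface RetrievedContext {
  precedents: string[];
  standards: string[];
  developerPatterns: string[];
  architecture: string[];
}

// Render one titled bullet list, or an empty string when there is nothing to show.
function renderSection(title: string, items: string[]): string {
  if (items.length === 0) return "";
  const bullets = items.map((item) => "- " + item).join("\n");
  return "### " + title + "\n" + bullets;
}

// Put retrieved memories ahead of the diff so the model reads the context first.
function buildReviewPrompt(ctx: RetrievedContext, diff: string): string {
  const sections = [
    renderSection("Similar review precedents", ctx.precedents),
    renderSection("Team standards", ctx.standards),
    renderSection("Developer patterns", ctx.developerPatterns),
    renderSection("Architecture conventions", ctx.architecture),
  ].filter((s) => s.length > 0); // skip categories with no retrieved memories
  sections.push("### Diff under review\n" + diff);
  return sections.join("\n\n");
}
```

Dropping empty sections keeps the prompt from implying that a category was checked and found irrelevant when nothing was retrieved for it.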

## The Same Model, Different Reviews
One of the most interesting features we built is Evolution Mode.
The same pull request is reviewed three times, each time with a different amount of memory.

### Stage 1: No Memory
The model receives only the code diff.
No standards.
No architecture context.
No review history.
The feedback is generic and resembles what most AI reviewers produce today.

### Stage 2: Partial Memory
The reviewer receives a limited amount of retrieved context.
Some standards.
Some architecture notes.
Some historical reviews.
The feedback becomes noticeably more specific.

### Stage 3: Expert Memory
The reviewer receives the full organizational context.
Past precedents.
Architecture conventions.
Developer patterns.
Team standards.
The quality difference is difficult to ignore.
What's interesting is that the underlying model remains exactly the same.
Only the memory changes.
That experiment convinced us that memory quality often matters more than model selection in review workflows.
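
One way to picture the three stages is as a filter over the same retrieved context. The sketch below is an assumption about how such a switch could look, not a description of the real implementation: the `Stage` type, the `contextForStage` helper, and the `partialLimit` parameter are hypothetical, and leaving developer patterns out of the partial stage is simply a reading of the Stage 2 description above.

```typescript
type Stage = "none" | "partial" | "expert";

// Hypothetical shape for retrieved context (an assumption for this sketch).
interface RetrievedContext {
  precedents: string[];
  standards: string[];
  developerPatterns: string[];
  architecture: string[];
}

// Choose how much memory the model sees; the model itself never changes.
function contextForStage(
  ctx: RetrievedContext,
  stage: Stage,
  partialLimit: number = 2
): RetrievedContext {
  if (stage === "none") {
    // Stage 1: the diff only, no memories at all.
    return { precedents: [], standards: [], developerPatterns: [], architecture: [] };
  }
  if (stage === "partial") {
    // Stage 2: a few precedents, standards and architecture notes.
    return {
      precedents: ctx.precedents.slice(0, partialLimit),
      standards: ctx.standards.slice(0, partialLimit),
      developerPatterns: [],
      architecture: ctx.architecture.slice(0, partialLimit),
    };
  }
  // Stage 3: the full organizational context.
  return ctx;
}
```

Because only the context argument varies between runs, any difference in review quality can be attributed to memory rather than to the model.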

## Reviews That Show Their Work
One thing that has always bothered me about AI-generated feedback is the lack of traceability.
A reviewer might say:

> Move validation to the service layer.

Why?
Most systems can't answer that question.
Sentimental.ai requires evidence.
Every review comment includes supporting memories.
A recommendation can reference:

- Team standards
- Previously accepted reviews
- Developer patterns
- Architecture conventions

The system maps those references back into human-readable evidence and confidence scores. Instead of presenting recommendations as opinions, it presents them as arguments backed by organizational knowledge. Trust comes from evidence, not authority.
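
As an illustration of that mapping, the sketch below resolves memory IDs cited by a comment into labeled evidence entries. Everything here beyond `MemoryKind` is an assumption: the `StoredMemory` and `Evidence` shapes, the `resolveEvidence` helper, and the label wording are hypothetical names for the sake of the example.

```typescript
type MemoryKind = "review" | "team_standard" | "developer" | "architecture";

// Hypothetical stored record; the real storage format may differ.
interface StoredMemory {
  id: string;
  kind: MemoryKind;
  summary: string;
  confidence: number; // assumed to be in the range 0..1
}

interface Evidence {
  label: string;
  summary: string;
  confidence: number;
}

const KIND_LABELS: Record<MemoryKind, string> = {
  review: "Previous review",
  team_standard: "Team standard",
  developer: "Developer pattern",
  architecture: "Architecture convention",
};

// Turn the memory IDs cited by a comment into readable evidence.
// IDs that cannot be found are dropped rather than presented as evidence.
function resolveEvidence(memoryIds: string[], store: StoredMemory[]): Evidence[] {
  const evidence: Evidence[] = [];
  memoryIds.forEach((id) => {
    const hit = store.filter((m) => m.id === id)[0];
    if (hit) {
      evidence.push({
        label: KIND_LABELS[hit.kind],
        summary: hit.summary,
        confidence: hit.confidence,
      });
    }
  });
  return evidence;
}
```

Dropping unresolved IDs matters for trust: a citation the system cannot trace back to a stored memory should not appear as support for a recommendation.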

## Teaching the Reviewer Over Time
The most valuable part of the system happens after the review is generated.
Traditional AI systems stop there.
We decided to treat review outcomes as learning signals.
When an engineer accepts or rejects a recommendation, that outcome becomes memory.

    await storeReview(memory);

Accepted feedback strengthens related standards.
Rejected feedback weakens them.
Confidence scores evolve based on real engineering behavior rather than assumptions.
Over time, the reviewer becomes increasingly aligned with how the team actually works.
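
A confidence update of this kind could look roughly like the sketch below. The fixed step size, the clamping to the range 0..1, and the `applyOutcome` helper are assumptions made for illustration; the article does not specify the actual update rule.

```typescript
// Same shape as the TeamStandardMemory interface shown earlier.
interface TeamStandardMemory {
  standard: string;
  rationale: string;
  confidence: number;
  supportingReviewIds: string[];
}

// Hypothetical update rule: nudge confidence up on acceptance, down on rejection.
// The step of 0.1 and the 0..1 bounds are illustrative assumptions.
function applyOutcome(
  standard: TeamStandardMemory,
  reviewId: string,
  outcome: "accepted" | "rejected",
  step: number = 0.1
): TeamStandardMemory {
  const delta = outcome === "accepted" ? step : -step;
  const confidence = Math.min(1, Math.max(0, standard.confidence + delta));
  // Only accepted reviews are recorded as supporting evidence, and only once.
  const alreadyListed = standard.supportingReviewIds.indexOf(reviewId) !== -1;
  const supportingReviewIds =
    outcome === "accepted" && !alreadyListed
      ? standard.supportingReviewIds.concat([reviewId])
      : standard.supportingReviewIds;
  return {
    standard: standard.standard,
    rationale: standard.rationale,
    confidence: confidence,
    supportingReviewIds: supportingReviewIds,
  };
}
```

Returning a new object instead of mutating the stored standard makes it easy to log each belief change alongside the review that caused it.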

## Automatically Discovering New Standards
The feature I enjoyed building most was pattern extraction.
Engineering standards often emerge organically.
Nobody formally writes them down.
A reviewer leaves the same recommendation repeatedly.
Engineers consistently accept it.
Eventually, it becomes accepted practice.
We wanted the system to recognize that transition automatically.
The extraction process looks for recurring accepted review patterns.

    if (items.length >= 3) {
      // candidate standard
    }

Once a pattern appears frequently enough, the model synthesizes a new engineering standard.

For example:

    {
      "standard": "Every money-moving action writes an audit log entry",
      "rationale": "Required for compliance and dispute investigation."
    }

That standard is promoted into memory and becomes available during future reviews.
This creates a feedback loop where reviews influence standards and standards influence future reviews.
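
The detection half of that loop can be sketched as a simple grouping pass. The threshold of three comes from the snippet above; grouping by `codePattern`, counting only accepted reviews, and the `findCandidateStandards` name are assumptions for illustration. The synthesis step that asks the model to write the standard is deliberately left out.

```typescript
// Same shape as the ReviewMemory interface shown earlier.
interface ReviewMemory {
  prTitle: string;
  codePattern: string;
  comment: string;
  outcome: "accepted" | "rejected";
}

interface CandidateStandard {
  codePattern: string;
  items: ReviewMemory[];
}

// Group accepted reviews by pattern and keep groups that recur often enough.
// Grouping by exact codePattern is an assumption made for this sketch.
function findCandidateStandards(
  memories: ReviewMemory[],
  minOccurrences: number = 3
): CandidateStandard[] {
  const groups: Record<string, ReviewMemory[]> = {};
  memories.forEach((m) => {
    if (m.outcome !== "accepted") return; // rejected advice is not evidence
    const existing = groups[m.codePattern];
    if (existing) {
      existing.push(m);
    } else {
      groups[m.codePattern] = [m];
    }
  });
  const candidates: CandidateStandard[] = [];
  Object.keys(groups).forEach((codePattern) => {
    const items = groups[codePattern];
    if (items !== undefined && items.length >= minOccurrences) {
      candidates.push({ codePattern: codePattern, items: items });
    }
  });
  return candidates;
}
```

Each candidate carries its supporting reviews, so a synthesized standard can be stored with the evidence that justified promoting it.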

## Why We Chose Hindsight
The most important design decision in the project was separating memory from generation.
We wanted retrieval to be a first-class capability rather than an afterthought.
That's why we chose Hindsight as the memory layer.
Hindsight provides a retrieval-centric approach where memories can be searched, retrieved, updated, and reused across interactions.
It allowed us to build a reviewer that accumulates organizational knowledge instead of forgetting everything after every request.
While building the project, we also spent time exploring broader ideas around agent memory systems, particularly how persistent memory changes agent behavior over long time horizons.
The more we worked on the project, the more convinced we became that memory is one of the most important missing pieces in modern AI applications.

## What We Learned
A few lessons stood out during development.

### Memory compounds
Models improve incrementally.
Memory improves cumulatively.
Every accepted review makes the system slightly better.

### Feedback loops matter more than prompts
The most valuable information isn't the code itself.
It's whether engineers agreed with the recommendation.

### Context beats intelligence
The same model can behave like a generic assistant or an experienced reviewer depending entirely on the context it receives.

### Organizational knowledge is usually implicit
Teams know much more than they document.
Capturing behavior is often more valuable than generating advice.

### Trust requires evidence
Engineers don't trust recommendations because an AI generated them.
They trust recommendations when they can understand why they were generated.

## Closing Thoughts
Building Sentimental.ai changed how I think about AI systems.
Initially, I thought the challenge was generating better reviews.
Now I think the challenge is remembering the information that makes reviews useful.
The difference between a generic reviewer and a senior engineer isn't necessarily intelligence.
It's accumulated context.
Models already know a lot about software engineering.
What they're missing is the memory of how a specific team builds software.
The more we can capture and retrieve that knowledge, the closer AI systems get to feeling like real teammates rather than tools.

PROJECT REPOSITORY:
https://github.com/Anushka-Btech/Sentimental.ai--Code-Reviewer
