Debugging agent memory using Hindsight logs

Shreya Yagain
Did it seriously just remember a bug from three days ago? I stared at the logs as Hindsight flagged a recurring pointer logic error the candidate thought they had buried in a previous mission.

I have been building InterviewOS, a self-hosted platform designed to put SDE-1 candidates through a mission-based gauntlet of DSA, system design, and SQL. The idea is to move away from static LeetCode grinding and toward a stateful, evolving experience where an AI interviewer actually tracks your growth or lack thereof over a multi-day preparation journey.

The system is structured around a central mission dashboard that manages a stage-locked planner. Under the hood, it is a React-Vite frontend talking to a Supabase backend, but the real complexity lies in the orchestration of the AI mock interview module and the weakness detection engine. To make this work, the system has to do more than just summarize a single chat; it has to maintain a coherent narrative of a developer’s technical identity across dozens of sessions.

The Problem with Stateless Interviews

When I started, I followed the standard pattern: send the last five messages of the chat to the LLM, maybe include a short summary of the user's resume, and hope for the best.
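For the record, that naive setup looks roughly like this. This is a simplified reconstruction, not the actual InterviewOS code; the `ChatMessage` shape and function name are illustrative:

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Naive stateless context: the tail of the transcript plus a static resume
// summary. Nothing here survives across sessions, which is the whole problem.
function buildStatelessPrompt(
  history: ChatMessage[],
  resumeSummary: string,
  windowSize = 5
): ChatMessage[] {
  const recent = history.slice(-windowSize);
  return [
    { role: "system", content: `Candidate resume summary: ${resumeSummary}` },
    ...recent,
  ];
}
```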

It failed immediately. If a user struggled with a sliding window problem on Monday, the interviewer on Wednesday had no idea. The agent was essentially a goldfish. I tried the infinite context window approach, but costs spiraled, and the model started losing the thread of the actual interview rubrics. I realized I did not need a bigger window; I needed a way for the system to perform actual agent memory retrieval that felt surgical rather than broad.

I needed a system that could look at a failed SQL query from two days ago and decide that today, we are not just doing system design; we are doing system design with a heavy focus on data consistency because this candidate clearly does not understand ACID properties.

Implementing the Hindsight Layer

To solve this, I integrated Hindsight. Instead of just dumping logs into a database, I started using Hindsight to create a structured Technical Shadow of the user.

The core of the implementation lives in the weakness-detection-engine module. Every time a user submits code or answers a behavioral question, the output is not just graded; it is passed to a Hindsight observer. This observer does not just store the text; it categorizes the intellectual debt the user is accruing.

// src/lib/memory/hindsight-integration.ts
async function syncSessionToHindsight(sessionId: string, interaction: Interaction) {
  const hindsight = new HindsightClient(process.env.HINDSIGHT_API_KEY!);

  // Every graded interaction becomes a structured event, not raw chat text.
  await hindsight.capture({
    actor: "candidate",
    action: "code_submission",
    payload: {
      code: interaction.code,
      language: interaction.lang,
      feedback: interaction.aiFeedback,
      identifiedWeakness: interaction.tags
    },
    context: {
      missionDay: interaction.day,
      difficulty: "SDE-1"
    }
  });
}

Once I dug into the Hindsight documentation, I realized the logs were not just for me to read; they were for the agent to read before every new session.
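Concretely, the pre-session step pulls prior entries and ranks failures to the top before the interviewer composes its opening question. Everything below is an illustrative sketch: the `MemoryEntry` shape and the recency/severity scoring are my own simplifications, not Hindsight's actual API.

```typescript
interface MemoryEntry {
  missionDay: number;
  identifiedWeakness: string[]; // e.g. ["pointer-logic", "off-by-one"]
  severity: number;             // 0..1, how badly the interaction went
}

// Surface the most recent, most severe failures first. In the real pipeline,
// a ranking like this runs over entries returned by a Hindsight query.
function rankForRecall(entries: MemoryEntry[], topK = 3): MemoryEntry[] {
  const score = (e: MemoryEntry) => e.severity + e.missionDay * 0.1;
  return [...entries].sort((a, b) => score(b) - score(a)).slice(0, topK);
}
```

The recency weight is deliberately small: a severe failure from Day 1 should still outrank a mild stumble from Day 3.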

The Breakthrough: The Resume Interrogation Incident

The moment it clicked for me was during the development of the Resume Interrogation engine. I had a test user who had Distributed Systems on their resume. In a standard setup, the AI asks a generic question about Load Balancers.

With the Hindsight integration, the engine looked back at a Journal Entry from Day 2 of the mission where the user admitted they struggled to explain CAP theorem to a peer. When Day 4’s mock interview started, the agent did not ask a generic question. It said that earlier in the week, the user mentioned some difficulty articulating the tradeoffs in CAP theorem. It then asked how the user handled partition tolerance in the specific Project X listed on their resume.

That is not a chatbot; that is an interviewer with a memory.
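The mechanic behind that moment is almost embarrassingly simple once the recall exists. This is a hedged sketch of the idea; `JournalRecall` and `buildTargetedQuestion` are illustrative names, not the real Resume Interrogation engine:

```typescript
interface JournalRecall {
  day: number;
  note: string;          // e.g. "some difficulty articulating CAP tradeoffs"
  relatedTopic: string;  // e.g. "partition tolerance"
}

// Fold a recalled journal entry and a resume item into one targeted question,
// instead of falling back to a generic prompt about Load Balancers.
function buildTargetedQuestion(recall: JournalRecall, resumeProject: string): string {
  return (
    `Earlier in the week (Day ${recall.day}) you mentioned ${recall.note}. ` +
    `With that in mind, how did you handle ${recall.relatedTopic} in ${resumeProject}?`
  );
}
```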

Debugging Why Traditional Logs Failed

The hardest part was debugging why the agent would sometimes forget a critical failure. I found a bug where I was over-weighting successful completions and under-weighting Logic Gaps in the retrieval pipeline.

I had to refactor the retrieval logic to prioritize Negative Patterns: when the system queries the memory store, it now explicitly looks for failures. Pulling those Negative Patterns forward made the agent much more aggressive and realistic. It stopped being a tutor and started being an interviewer.
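The fix itself is a small weighting change. Assuming each retrieval candidate carries an outcome label and a base relevance score (both illustrative names, not the production schema), the boost looks something like this:

```typescript
type Outcome = "success" | "logic_gap" | "failure";

interface RetrievalCandidate {
  outcome: Outcome;
  relevance: number; // base similarity score from the memory store, 0..1
}

// The bug: every candidate was ranked on raw relevance, so a pile of wins
// drowned out the one Logic Gap that mattered. The fix: boost the negatives.
function scoreCandidate(c: RetrievalCandidate): number {
  const boost: Record<Outcome, number> = {
    success: 0.5,   // down-weight: the agent gains little by revisiting wins
    logic_gap: 1.5, // the recurring pattern we explicitly want resurfaced
    failure: 2.0,
  };
  return c.relevance * boost[c.outcome];
}
```

The exact multipliers matter less than the asymmetry: a moderately relevant failure should beat a highly relevant success.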

Results and Behavior

Today, the system behaves with a surprising amount of technical intuition.

In one scenario on Day 1, a user solved a Merge Intervals problem but used a nested loop instead of an optimized sorting approach. On Day 2, the same user missed an index-optimized query in a SQL exercise. By Day 3, instead of a random new problem, the Mission Controller identified the efficiency pattern in the Hindsight logs. It generated a System Design prompt for a Real-time Analytics Dashboard and specifically prompted the interviewer to ask about scaling the ingestion layer when the sort-merge process becomes a bottleneck.

The system effectively learned that the user has a blind spot regarding time complexity in data processing, and it forced that blind spot into a different domain to see if the knowledge gap was universal.
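That cross-domain escalation can be sketched as counting where a weakness tag has appeared and picking an untested domain once it recurs. The tag names, domain list, and two-domain threshold here are my own illustration of the Mission Controller's behavior, not its actual code:

```typescript
interface SessionLog {
  domain: "dsa" | "sql" | "system_design";
  weaknessTags: string[];
}

// If the same weakness tag shows up in two or more different domains,
// escalate it into a domain the candidate has not been tested in yet.
function pickEscalationDomain(
  logs: SessionLog[],
  tag: string,
  allDomains: SessionLog["domain"][] = ["dsa", "sql", "system_design"]
): SessionLog["domain"] | null {
  const seen = new Set(
    logs.filter((l) => l.weaknessTags.includes(tag)).map((l) => l.domain)
  );
  if (seen.size < 2) return null; // not yet a cross-domain pattern
  return allDomains.find((d) => !seen.has(d)) ?? null;
}
```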

Lessons Learned

Context is a liability if it is unstructured. Dumping a giant JSON of past events into a prompt makes the LLM lazy. Using a tool like Hindsight to pre-filter for relevant failures makes the agent much sharper.

State belongs in a specialized layer. I initially tried to keep user history in my primary Supabase database. It was a nightmare to query for patterns. Moving the event-based memory to a dedicated system allowed me to treat User History as a searchable stream rather than a static table.

The Goldfish problem is an architectural choice. We often blame the model for forgetting, but usually, it is our retrieval logic that lacks depth. If you do not give the model a reason to remember, like a specific log of a previous failure, it will not.

Skepticism is healthy. I still do not trust the AI to summarize a user. I trust the AI to query a structured log of the user's past actions.

Building InterviewOS taught me that the OS part is not about the UI; it is about the file system of the candidate's mind. For that, you need a very good set of logs.


[Image: snippet of the frontend code]

[Image: home page of the website]

[Image: architecture diagram]