
Pavan K
I Built Student Memory Into Groq Prompts Via Hindsight
I added twelve lines of Hindsight retrieval to the Groq prompt
pipeline and the AI's study plan recommendations went from generic
topic lists to specific, session-aware guidance. The model didn't
change. The context did.
The setup
StudyMind is a study tool built for college students. Upload a PDF
or TXT file and the app generates a 5-question multiple-choice quiz,
10 spaced-repetition flashcards, and a personalized study plan —
all via Groq's Llama 3.3 70B. Answers are stored in Supabase.
Hindsight sits between
the database and the prompt layer, turning raw event logs into
context the model can reason about.
The problem I kept running into wasn't inference quality. Groq
generates excellent questions. The problem was that every prompt
was a clean slate. A student could fail the same concept three
sessions in a row and the model would have no idea — because nobody
told it.
This article is about how I fixed that, what the prompt layer
looks like now, and what I'd do differently.
What a stateless prompt looks like
The original study plan generation was straightforward:
def generate_study_plan(topic_scores: dict, exam_date: str,
                        hours_per_day: int) -> str:
    prompt = f"""
    You are a study planner. A student has an exam on {exam_date}.
    They have {hours_per_day} hours per day to study.
    Their current topic accuracy:
    {format_scores(topic_scores)}
    Build a day-by-day study schedule that prioritizes weak topics.
    """
    return groq_client.chat(prompt)
This works. It produces a reasonable schedule. But topic_scores
is a snapshot — accuracy averaged across all time. It has no
concept of trajectory. A student at 65% who improved from 40%
last week gets the same prompt as a student at 65% who dropped
from 85%. The model treats them identically because the context
treats them identically.
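The trajectory signal is recoverable from the event log alone. As a sketch (the `accuracy_trend` helper is mine, not part of StudyMind), split each topic's answers into a recent window and everything before it:

```python
from collections import defaultdict

def accuracy_trend(quiz_events: list[dict], window: int = 10) -> dict:
    """Per-topic accuracy for the last `window` answers vs. the answers
    before them, so a prompt can distinguish an improving 65% from a
    declining 65%. Assumes events are ordered oldest-first."""
    by_topic = defaultdict(list)
    for e in quiz_events:
        by_topic[e["topic"]].append(1 if e["correct"] else 0)

    trend = {}
    for topic, results in by_topic.items():
        recent, older = results[-window:], results[:-window]
        trend[topic] = {
            "recent": sum(recent) / len(recent),
            "older": sum(older) / len(older) if older else None,
        }
    return trend
```

Feeding a dict like this into the prompt alongside the averaged score is what lets the model say "declining" instead of just "weak".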
The flashcard insight generation had the same problem:
def generate_daily_insight(topic_scores: dict) -> str:
    prompt = f"""
    A student has the following quiz accuracy by topic:
    {format_scores(topic_scores)}
    Give them one specific, actionable study tip for today.
    """
    return groq_client.chat(prompt)
One tip, based on a number, with no knowledge of what the student
actually did yesterday, or the day before, or what kinds of mistakes
they're making. The output was technically correct and functionally
useless. Students stopped reading the daily insight after two days.
What gets logged
Before redesigning the prompt layer, I needed the event layer to
capture the right things. I settled on four event types:
# Quiz answer — core performance signal
hindsight.log({
    "event": "quiz_answer",
    "user_id": user_id,
    "topic": topic,
    "question_id": question_id,
    "question_text": question_text,
    "correct": is_correct,
    "chosen_answer": chosen_option,
    "correct_answer": correct_option,
    "timestamp": now_iso(),
})

# Flashcard rating — confidence signal
hindsight.log({
    "event": "flashcard_rating",
    "user_id": user_id,
    "card_id": card_id,
    "topic": topic,
    "rating": rating,  # "hard" | "ok" | "easy"
    "timestamp": now_iso(),
})

# Session start — study frequency signal
hindsight.log({
    "event": "session_start",
    "user_id": user_id,
    "timestamp": now_iso(),
})

# Plan generated — prevents circular reinforcement
hindsight.log({
    "event": "plan_generated",
    "user_id": user_id,
    "topics_prioritized": topic_list,
    "timestamp": now_iso(),
})
The last one matters more than it looks. Without logging which
topics the planner already prioritized, the system would keep
recommending the same topics regardless of whether the student
had already worked on them. It was reinforcing its own output
rather than responding to new behavior.
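One way to use that event, sketched with a hypothetical `pending_recommendations` helper (the name and shape are mine): compare each `plan_generated` event against the quiz answers logged after it, so the planner knows which of its own recommendations the student has already acted on.

```python
def pending_recommendations(events: list[dict]) -> set[str]:
    """Topics a past plan prioritized that the student has not practiced
    since. Sketch: assumes events are ordered oldest-first, as logged."""
    recommended: set[str] = set()
    practiced: set[str] = set()
    for e in events:
        if e["event"] == "plan_generated":
            recommended.update(e["topics_prioritized"])
        elif e["event"] == "quiz_answer" and e["topic"] in recommended:
            practiced.add(e["topic"])
    return recommended - practiced
```

Topics in the returned set are safe to re-recommend; topics outside it should be judged on the new quiz evidence instead of repeated by default.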
Rebuilding the prompt layer
The retrieval call is the same across all prompt types — pull the
last 50 events for this user, filter by relevance inside Python
before passing to the model:
def get_student_context(user_id: str,
                        topic: str | None = None) -> dict:
    events = hindsight.retrieve(
        user_id=user_id,
        event_types=["quiz_answer", "flashcard_rating",
                     "session_start", "plan_generated"],
        limit=50,
    )
    quiz_events = [e for e in events if e["event"] == "quiz_answer"]
    if topic:
        quiz_events = [e for e in quiz_events
                       if e["topic"] == topic]
    wrong = [e for e in quiz_events if not e["correct"]]
    right = [e for e in quiz_events if e["correct"]]
    hard_cards = [e for e in events
                  if e["event"] == "flashcard_rating"
                  and e["rating"] == "hard"]
    session_count = len([e for e in events
                         if e["event"] == "session_start"])
    return {
        "wrong_answers": wrong,
        "right_answers": right,
        "hard_cards": hard_cards,
        "session_count": session_count,
        "raw_events": events,
    }
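Because the filtering is plain Python over dicts, the same logic can be exercised against synthetic events without a Hindsight backend. A minimal sketch:

```python
# Synthetic events standing in for a Hindsight retrieval result.
events = [
    {"event": "session_start"},
    {"event": "quiz_answer", "topic": "entropy", "correct": False,
     "chosen_answer": "decreases", "correct_answer": "increases"},
    {"event": "quiz_answer", "topic": "enthalpy", "correct": True,
     "chosen_answer": "A", "correct_answer": "A"},
    {"event": "flashcard_rating", "topic": "entropy", "rating": "hard"},
]

# The same filters get_student_context applies:
quiz = [e for e in events if e["event"] == "quiz_answer"]
wrong = [e for e in quiz if not e["correct"]]
hard = [e for e in events
        if e["event"] == "flashcard_rating" and e["rating"] == "hard"]
sessions = len([e for e in events if e["event"] == "session_start"])
```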
This context dict feeds into every prompt that touches student
performance. The study plan prompt now looks like this:
def generate_study_plan(topic_scores: dict, exam_date: str,
                        hours_per_day: int, user_id: str) -> str:
    ctx = get_student_context(user_id)
    wrong_summary = "\n".join(
        f"- {e['topic']}: chose '{e['chosen_answer']}' "
        f"(correct: '{e['correct_answer']}')"
        for e in ctx["wrong_answers"][-15:]
    )
    hard_summary = "\n".join(
        f"- {e['topic']} flashcard marked Hard"
        for e in ctx["hard_cards"][-10:]
    )
    prompt = f"""
    You are a study planner. A student has an exam on {exam_date}.
    They have {hours_per_day} hours per day and have completed
    {ctx['session_count']} study sessions so far.
    Current topic accuracy:
    {format_scores(topic_scores)}
    Recent wrong answers (what they chose vs. what was correct):
    {wrong_summary}
    Flashcards marked Hard recently:
    {hard_summary}
    Build a day-by-day schedule. Prioritize topics with declining
    accuracy and repeated Hard flashcard ratings. Where you can
    identify a specific misconception from the wrong answers,
    name it explicitly in the plan.
    """
    return groq_client.chat(prompt)
The instruction "where you can identify a specific misconception,
name it explicitly" is the line that changed the output most
dramatically. Without it, the model would summarize the history
and produce a schedule. With it, the model would produce output
like: "Day 3: Thermodynamics — focus specifically on entropy
direction in open systems. You've selected 'entropy decreases'
in three recent questions where the correct answer was 'entropy
increases.'" That's the difference between a study plan and a
tutor.
The insight generator got the same treatment, compressed into a
tighter prompt since it's a single tip rather than a full schedule:
def generate_daily_insight(topic_scores: dict,
                           user_id: str) -> str:
    ctx = get_student_context(user_id)
    recent_wrong = ctx["wrong_answers"][-5:]
    recent_hard = ctx["hard_cards"][-5:]
    prompt = f"""
    A student has completed {ctx['session_count']} study sessions.
    Topic accuracy: {format_scores(topic_scores)}
    Last 5 wrong answers:
    {format_wrong(recent_wrong)}
    Last 5 Hard flashcard ratings:
    {format_hard(recent_hard)}
    Give one specific, actionable tip for today's session.
    Reference a concrete pattern you see in their history —
    not a generic study suggestion.
    """
    return groq_client.chat(prompt)
Students started reading the daily insight again. The tips stopped
saying things like "review your weakest topics" and started saying
things like "you've marked the entropy flashcard Hard four times
this week — try working through numerical examples rather than
re-reading the definition." Specific, session-aware, and grounded
in what the student actually did.
This is what agent memory means in practice: not a smarter model, but a model with enough context to act like it knows the person it's talking to.
What I learned
Wrong answers are more valuable than correct ones. Logging
chosen_answer alongside correct_answer is what enables
misconception identification. A boolean correct field tells you
the student failed. The chosen answer tells you why — and that's
what goes into the prompt.
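A small sketch of how that signal can be surfaced before prompting (the `repeated_misconceptions` helper is illustrative, not StudyMind code): count repeated (topic, chosen answer) pairs among wrong answers, since a choice made once is noise and a choice made three times is a misconception.

```python
from collections import Counter

def repeated_misconceptions(wrong_answers: list[dict],
                            min_count: int = 2) -> list[tuple]:
    """Surface (topic, chosen_answer, correct_answer) triples the student
    has picked more than once: the raw material for the 'name the
    misconception explicitly' instruction in the plan prompt."""
    counts = Counter(
        (e["topic"], e["chosen_answer"], e["correct_answer"])
        for e in wrong_answers
    )
    return [triple for triple, n in counts.most_common() if n >= min_count]
```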
Log your system's own outputs. The plan_generated event
prevents the planner from recommending the same topics indefinitely.
Any AI feature that influences user behavior should log what it
recommended, so future prompts can account for what was already tried.
Context length is a prompt design problem. Fifty events fit
comfortably in a Groq context window. Five hundred don't. Filtering
in Python before passing to the model — recent events, wrong-only,
topic-filtered — keeps the prompt focused and the latency low.
The model doesn't need everything; it needs the right things.
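The trimming itself can be as simple as a budget applied newest-first. A sketch, using characters as a rough stand-in for tokens (a real tokenizer would count exactly):

```python
def trim_to_budget(lines: list[str], max_chars: int = 4000) -> list[str]:
    """Keep the most recent summary lines that fit a rough character
    budget. Input is assumed oldest-first; output preserves that order."""
    kept, used = [], 0
    for line in reversed(lines):  # walk newest to oldest
        if used + len(line) > max_chars:
            break
        kept.append(line)
        used += len(line)
    return list(reversed(kept))  # restore chronological order
```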
Specificity instructions unlock model capability. "Prioritize
weak topics" produces a schedule. "Where you can identify a specific
misconception from the wrong answers, name it explicitly" produces
a tutor. The model was capable of the second output all along.
It needed the instruction to aim there.
What's next
The chosen_answer field is logged for every wrong answer but not yet used for distractor generation — creating new quiz questions whose wrong choices directly mirror a student's known misconceptions. That's the next integration: passing historical wrong answers into the quiz generation prompt as candidate distractors, closing the loop between what a student misunderstands and what the quiz tests. The event log is already there. So is the Hindsight retrieval. It's a prompt change away.
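A sketch of that change, with a hypothetical `distractor_candidates` helper (the name and shape are mine, not shipped code): collect the student's own past wrong choices for a topic, deduplicated and newest-first, and hand them to the quiz-generation prompt as candidate wrong options.

```python
def distractor_candidates(wrong_answers: list[dict], topic: str,
                          limit: int = 5) -> list[str]:
    """The student's own past wrong choices for a topic, deduplicated,
    newest first: candidate distractors for the next generated quiz.
    Assumes wrong_answers are ordered oldest-first, as logged."""
    seen: set[str] = set()
    out: list[str] = []
    for e in reversed(wrong_answers):
        if e["topic"] == topic and e["chosen_answer"] not in seen:
            seen.add(e["chosen_answer"])
            out.append(e["chosen_answer"])
        if len(out) == limit:
            break
    return out
```

The quiz prompt would then carry a line like "where plausible, reuse these as wrong options", so the next quiz tests exactly the confusions the log has already caught.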