How Hindsight Turned Raw Quiz Data Into Study Plans

Nidhi V Aathreya

"This plan is just telling me to study everything." A student said
that after the first version of the planner went live. They were
right. Without behavioral history behind it, the AI had nothing
specific to prioritize — so it prioritized everything, which means
it prioritized nothing.

The pipeline in plain language

StudyMind takes a PDF or TXT file, generates a 5-question quiz and
10 flashcards via Groq's Llama 3.3 70B, tracks answers in Supabase,
and produces a personalized study plan keyed to an exam date.
The plan is the downstream output that students actually use. It's
also the place where weak data quality hurts most visibly — because
a bad quiz question is annoying, but a bad study plan wastes days.

The original planner pulled one thing from the database: accuracy
per topic. It handed that to Groq with an exam date and hours per
day, and asked for a schedule. The output was technically a study
plan. It wasn't a useful one.

Hindsight is what
sits between the raw quiz data and the prompt. Its job is to turn
a flat table of answers into a structured picture of how a specific
student actually learns — and hand that picture to the model
instead of a percentage.

What the planner was missing

The accuracy-only prompt had four blind spots.

Trajectory. A student at 65% who improved from 45% last week
and a student at 65% who dropped from 85% get identical prompts.
The first needs encouragement. The second needs immediate
intervention. Accuracy can't tell them apart.

Session frequency. A student who studies every day and a
student who studies once a week have completely different
scheduling needs. Session count and recency live in the event
log, not in the aggregate score.

Failure patterns. A score of 60% on thermodynamics could mean
random errors across all subtopics, or it could mean perfect recall
on everything except entropy. Those require different plans.
The only way to know is to look at which questions were answered
wrong — not just how many.

Prior plan history. If the planner already recommended spending
three days on thermodynamics last week, it should know that before
recommending the same thing again. Without logging its own output,
the planner repeats itself indefinitely.

Each of these blind spots has the same fix: log the event, retrieve
it later, pass it to the model.

The event schema

Four event types feed the planner. I've covered these in earlier
posts in the series, but the planner integration makes the
relationships between them concrete:

# Core performance signal
hindsight.log({
    "event": "quiz_answer",
    "user_id": user_id,
    "topic": topic,
    "question_id": question_id,
    "question_text": question_text,
    "correct": is_correct,
    "chosen_answer": chosen_option,
    "correct_answer": correct_option,
    "session_id": session_id,
    "timestamp": now_iso()
})

# Confidence signal — cross-referenced against quiz performance
hindsight.log({
    "event": "flashcard_rating",
    "user_id": user_id,
    "card_id": card_id,
    "topic": topic,
    "rating": rating,
    "session_id": session_id,
    "timestamp": now_iso()
})

# Prevents the planner from repeating itself
hindsight.log({
    "event": "plan_generated",
    "user_id": user_id,
    "exam_date": exam_date,
    "topics_prioritized": topic_list,
    "days_until_exam": days_remaining,
    "timestamp": now_iso()
})
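The `now_iso()` helper used in every snippet above isn't part of the Hindsight API — it's a one-line timestamp utility. A minimal version (the name and the choice of UTC are mine):

```python
from datetime import datetime, timezone

def now_iso() -> str:
    # UTC ISO-8601 timestamp. These sort lexicographically, which is
    # what the retrieval code relies on when ordering events by the
    # "timestamp" field.
    return datetime.now(timezone.utc).isoformat()
```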

The plan_generated event is the one most people skip. It's also
the one that most directly prevents the planner from becoming a
broken record. If thermodynamics has been the top priority in the
last two generated plans and accuracy hasn't moved, that's a signal
the current approach isn't working — not a reason to schedule it
a third time with the same framing.
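That "broken record" check can be made explicit before any prompt is built. A sketch — the function name, the repeat count, and the 5-point gain threshold are my own choices, not part of StudyMind:

```python
def is_stuck(topic: str, plan_events: list, quiz_events: list,
             min_repeats: int = 2, min_gain: float = 0.05) -> bool:
    """True if `topic` was prioritized in the last `min_repeats`
    plans but accuracy barely moved — a signal to change the
    study method rather than re-schedule the same one."""
    recent_plans = sorted(plan_events,
                          key=lambda e: e["timestamp"])[-min_repeats:]
    if len(recent_plans) < min_repeats:
        return False
    if not all(topic in p.get("topics_prioritized", [])
               for p in recent_plans):
        return False
    answers = sorted((e for e in quiz_events if e["topic"] == topic),
                     key=lambda e: e["timestamp"])
    if len(answers) < 4:
        return False  # not enough signal to judge movement
    half = len(answers) // 2
    before = sum(e["correct"] for e in answers[:half]) / half
    after = sum(e["correct"] for e in answers[half:]) / (len(answers) - half)
    return (after - before) < min_gain
```

When this returns True, the prompt can say "suggest a different method" instead of leaving it to the model to notice.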

Building the context object

Before any prompt is constructed, I retrieve and shape the event
history into a context object the prompt layer can reason from:

def build_planner_context(user_id: str,
                          topic_scores: dict) -> dict:
    events = hindsight.retrieve(
        user_id=user_id,
        event_types=["quiz_answer", "flashcard_rating",
                     "plan_generated"],
        limit=100
    )

    quiz_events = [e for e in events
                   if e["event"] == "quiz_answer"]
    plan_events = [e for e in events
                   if e["event"] == "plan_generated"]

    topic_trends = {}
    for topic in topic_scores:
        topic_quiz = sorted(
            [e for e in quiz_events if e["topic"] == topic],
            key=lambda e: e["timestamp"]
        )
        if len(topic_quiz) < 2:
            topic_trends[topic] = "insufficient_data"
            continue
        # Divide by the actual window size — with only two events,
        # dividing by a hardcoded 3 understates both accuracies.
        early_window = topic_quiz[:3]
        recent_window = topic_quiz[-3:]
        early = (sum(1 for e in early_window if e["correct"])
                 / len(early_window))
        recent = (sum(1 for e in recent_window if e["correct"])
                  / len(recent_window))
        delta = recent - early
        if delta > 0.15:
            topic_trends[topic] = "improving"
        elif delta < -0.15:
            topic_trends[topic] = "declining"
        else:
            topic_trends[topic] = "stalled"

    wrong_answers = [
        e for e in quiz_events if not e["correct"]
    ][-20:]

    previously_prioritized = []
    if plan_events:
        last_plan = sorted(plan_events,
                           key=lambda e: e["timestamp"])[-1]
        previously_prioritized = last_plan.get(
            "topics_prioritized", []
        )

    hard_cards = [
        e for e in events
        if e["event"] == "flashcard_rating"
        and e["rating"] == "hard"
    ]

    return {
        "topic_trends": topic_trends,
        "wrong_answers": wrong_answers,
        "previously_prioritized": previously_prioritized,
        "hard_cards": hard_cards,
        "total_sessions": len(set(
            e.get("session_id") for e in quiz_events
            if e.get("session_id")
        ))
    }

topic_trends is the key output. It classifies each topic as
improving, declining, stalled, or data-insufficient by comparing
early and recent accuracy. A declining topic and a stalled topic
both have low accuracy — but they need different responses from
the planner, and now the model can tell them apart.
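The classification boils down to one delta against one threshold. Pulled out as a standalone helper (my extraction, same 0.15 threshold as the code above), it shows how two students at the same 65% land in different buckets:

```python
def classify_trend(early: float, recent: float,
                   threshold: float = 0.15) -> str:
    # Compare early vs. recent accuracy; the label, not the raw
    # number, is what gets handed to the model.
    delta = recent - early
    if delta > threshold:
        return "improving"
    if delta < -threshold:
        return "declining"
    return "stalled"

classify_trend(0.45, 0.65)  # "improving" — needs encouragement
classify_trend(0.85, 0.65)  # "declining" — needs intervention
```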

The prompt

With the context object built, the prompt assembles itself:

def generate_study_plan(topic_scores: dict, exam_date: str,
                        hours_per_day: int, user_id: str) -> str:
    ctx = build_planner_context(user_id, topic_scores)

    trend_lines = "\n".join(
        f"- {topic}: {score:.0%} accuracy, "
        f"trend: {ctx['topic_trends'].get(topic, 'unknown')}"
        for topic, score in topic_scores.items()
    )

    wrong_lines = "\n".join(
        f"- {e['topic']}: chose '{e['chosen_answer']}' "
        f"(correct: '{e['correct_answer']}')"
        for e in ctx["wrong_answers"][-10:]
    )

    hard_lines = "\n".join(
        f"- {e['topic']} flashcard marked Hard"
        for e in ctx["hard_cards"][-8:]
    )

    prior_note = (
        f"The last generated plan prioritized: "
        f"{', '.join(ctx['previously_prioritized'])}. "
        f"If accuracy on those topics hasn't improved, "
        f"adjust the approach rather than repeating the same schedule."
        if ctx["previously_prioritized"] else ""
    )

    prompt = f"""
    You are a study planner for a student with an exam on {exam_date}.
    They have {hours_per_day} hours per day and have completed
    {ctx['total_sessions']} study sessions so far.

    Topic accuracy and trajectory:
    {trend_lines}

    Recent wrong answers:
    {wrong_lines}

    Flashcards marked Hard:
    {hard_lines}

    {prior_note}

    Rules:
    - Prioritize declining topics over stalled ones; stalled over
      improving ones.
    - Where wrong answers reveal a specific misconception, name it
      in the plan — do not just say "review thermodynamics."
    - If a topic was prioritized last time and hasn't moved,
      suggest a different study method, not more of the same.
    - Distribute Hard flashcard topics into early sessions where
      retention is highest.

    Output a day-by-day schedule. Be specific.
    """
    return groq_client.chat(prompt)

The four rules at the bottom do the most work. They translate
the structured context into planning logic the model can follow
consistently. Without them, the model reads the history and
produces a reasonable-looking schedule. With them, it produces
a schedule that reflects actual decisions: declining before
stalled, specific misconceptions named, prior failures acknowledged.

This is what agent memory
enables at the output layer — not smarter generation, but generation
that's accountable to what happened before.

What changed in the output

The before and after are worth making concrete.

Before Hindsight, a typical plan output:

Day 1–2: Thermodynamics (weakest topic, 62%). Day 3–4: Fluid
dynamics (71%). Day 5: Optics review. Day 6–7: Full revision.

After Hindsight, the same student, same exam date:

Day 1: Thermodynamics — entropy direction in open systems
specifically. You've selected "entropy decreases" in four recent
questions where the correct answer was "entropy increases." Work
through two numerical examples before re-attempting the flashcard.
Day 2: Thermodynamics continued — this topic is declining (80%
two weeks ago, 62% now) and was prioritized in your last plan
without improvement. Try a different approach: practice with
worked examples rather than definition review. Day 3: Fluid
dynamics — stable but unresolved. Focus on the Bernoulli
application questions you missed...

The model didn't get smarter. It got context.

Lessons

Trajectory beats snapshot. A topic's current accuracy is almost
useless for planning without knowing which direction it's moving.
Computing a simple early-vs-recent delta per topic and passing the
trend label to the model produced more calibrated plans than any
prompt engineering I did on the accuracy number alone.

Rules in the prompt encode planning logic. Explicit ordering
rules — declining before stalled, specific over generic — give the
model a decision framework rather than a vague goal. The output
becomes more consistent and more auditable.

Log what the system recommends. The plan_generated event
is the least glamorous part of this integration and the one I'd
most regret skipping. A planner that can't see its own history
will repeat its mistakes indefinitely.

The context object is the real engineering surface. The prompt
is downstream of the context. Getting build_planner_context
right — what to retrieve, how to aggregate, what to filter — is
where the planning quality is actually determined. The model
handles the prose.

Where this ends up

The planner currently treats each topic as independent. The next
version should account for topic dependencies — a student who
hasn't resolved thermodynamics fundamentals probably isn't ready
for the fluid dynamics questions that assume them. That dependency
graph isn't in the event log yet. It would need to be either
inferred from co-occurrence patterns in wrong answers or defined
explicitly in the content layer.
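Inferring dependencies from co-occurrence could start as simply as counting topic pairs that fail within the same session. A rough sketch of that first step — nothing like this exists in the codebase yet, and the function name is mine:

```python
from collections import Counter
from itertools import combinations

def cooccurring_failures(quiz_events: list) -> Counter:
    """Count topic pairs missed in the same session — a crude proxy
    for 'struggling with A predicts struggling with B'."""
    by_session: dict[str, set] = {}
    for e in quiz_events:
        if e.get("correct") or not e.get("session_id"):
            continue
        by_session.setdefault(e["session_id"], set()).add(e["topic"])
    pairs: Counter = Counter()
    for topics in by_session.values():
        # Sort so (A, B) and (B, A) count as the same pair.
        for pair in combinations(sorted(topics), 2):
            pairs[pair] += 1
    return pairs
```

Pairs that keep surfacing together would be candidates for an explicit prerequisite edge in the content layer.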

Either way, it starts with the same step: log the right event,
retrieve it deliberately, and shape it through Hindsight before
the model ever sees it. The pattern is the same. The data just
gets richer.