"How did it know we were falling behind on the backend?" I questioned, only to find Hindsight had been silently learning from our task completion velocity to calculate a project risk score that was uncomfortably accurate.
What I Built and Why
I wanted a project manager that could actually observe how a team works over time and reason about those patterns—not just display a Kanban board with coloured chips. Specifically: could I build something that watches task data accumulate over weeks and eventually surfaces something like "Charlie's documentation sprint is at risk" based on nothing more than historical completion patterns and deadline proximity?
The result is an AI Project Manager: a FastAPI backend (main.py) backed by a SQLite database (managed through SQLAlchemy in database.py and db_models.py), with a React + Vite frontend. The AI layer is handled by two components working in tandem—Groq as the LLM for language generation, and Hindsight for persistent agent memory. The whole thing is wrapped in a REST API with nine route namespaces: auth, projects, teams, tasks, decisions, ai, meetings, integrations, and reports.
On paper this is straightforward. In practice, the hard part wasn't any single component—it was deciding where the learning should live and how to give it enough signal to be useful.
The Memory Problem I Didn't Anticipate
My first instinct was to store everything in SQLite and query it directly whenever I needed insights. That works fine for simple aggregations like "how many tasks did Alice complete this sprint?" But it fails almost immediately when you ask questions like "does this pattern suggest a risk?" because SQL doesn't generalise—it only answers what you explicitly query.
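To make the contrast concrete, here is the kind of question SQL answers well, sketched against an in-memory sqlite3 table that mirrors a few DBTask columns (the rows are illustrative, not the project's seed data):

```python
import sqlite3

# Minimal stand-in for the tasks table (a few DBTask columns).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (task_name TEXT, assigned_to TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO tasks VALUES (?, ?, ?)",
    [
        ("Fix login bug", "Alice", "Completed"),
        ("Write API docs", "Alice", "Completed"),
        ("Schema migration", "Bob", "In Progress"),
    ],
)

# SQL handles explicit aggregations like this one directly...
completed = conn.execute(
    "SELECT COUNT(*) FROM tasks WHERE assigned_to = ? AND status = ?",
    ("Alice", "Completed"),
).fetchone()[0]
print(completed)  # 2

# ...but "does this pattern suggest a risk?" has no WHERE clause.
```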
So I started thinking about what a "memory" layer would actually need to store. The key insight was that every task completion—or missed deadline—is an event that carries contextual signal: who did it, what kind of task it was, how late or early, what the priority and difficulty were. My initial DBTask model already captured most of this:
```python
class DBTask(Base):
    __tablename__ = "tasks"

    id = Column(Integer, primary_key=True, index=True)
    task_name = Column(String, index=True)
    assigned_to = Column(String, index=True, nullable=True)
    status = Column(String, default="To Do")       # To Do, In Progress, Completed
    priority = Column(String, default="Medium")    # High, Medium, Low
    difficulty = Column(String, default="Medium")  # Easy, Medium, Hard
    ai_rationale = Column(String, nullable=True)
    confidence_score = Column(Integer, default=0)
    deadline = Column(DateTime, nullable=True)
    created_at = Column(DateTime, default=lambda: datetime.now(timezone.utc))
```
Notice ai_rationale and confidence_score. Those aren't for the user—they're the system's own notes about why a task was assigned the way it was. When you generate a rationale at assignment time and store it next to the task outcome, you've created a feedback loop. Later, you can pick up those notes and ask: "did the reasoning hold up?"
The problem is that SQLite doesn't help you do that lookup in a meaningful way. You can't query "find past tasks where the rationale mentioned Python skills and the outcome was delayed." That's where Hindsight comes in.
Wiring In Hindsight
Hindsight is an agent memory SDK. The short version: you push events into it, and it builds a semantic memory store you can retrieve from later using natural-language queries. It's designed specifically for this use case—an agent that needs to reason about a history of interactions and outcomes without you having to manually write retrieval logic.
In requirements.txt, you can see the explicit dependencies:

```
hindsight-client
groq
openai
```
The pattern I settled on is: whenever a task is completed (or missed its deadline), we push the full task context—name, assignee, rationale, deadline delta, difficulty—into Hindsight as a memory event. Then, when the /insights endpoint is hit, we retrieve semantically relevant memories to build the risk_insights list.
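Concretely, the write path can be sketched as a helper that bundles task context into a (content, metadata) pair—the shape matches the push_event(content, metadata) surface described later in this post, but the exact metadata keys here are my assumptions, not Hindsight's documented schema:

```python
def build_memory_event(task: dict, outcome: str) -> dict:
    """Bundle task context into a push_event-style payload.

    `task` mirrors a few DBTask columns; the metadata keys are
    illustrative -- adapt them to the client's actual event schema.
    """
    return {
        "content": (
            f"Task '{task['task_name']}' ({task['difficulty']}, "
            f"{task['priority']} priority) assigned to {task['assigned_to']} "
            f"was {outcome}. Rationale: {task['ai_rationale']}"
        ),
        "metadata": {
            "assignee": task["assigned_to"],
            "priority": task["priority"],
            "difficulty": task["difficulty"],
        },
    }

event = build_memory_event(
    {
        "task_name": "Fix UI dashboard bug",
        "assigned_to": "Alice",
        "priority": "Low",
        "difficulty": "Easy",
        "ai_rationale": "Assigned during High-Load phase to balance throughput.",
    },
    outcome="completed 1 day late",
)
print(event["metadata"]["assignee"])  # Alice
# In the real flow, this is where the Hindsight client call would go.
```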
The InsightsResponse model in models.py makes this shape explicit:
```python
class InsightsResponse(BaseModel):
    best_performing_member: Optional[str] = None
    most_delayed_member: Optional[str] = None
    stats: TaskCompletionStats
    risk_insights: List[str]
```
risk_insights is the key field. It's not computed from a SQL aggregate. It's assembled by querying Hindsight with something like "what patterns suggest project risk?" and letting the memory layer surface the events that match. Then Groq formats those into readable strings.
The /insights endpoint in main.py handles this:
```python
@app.get("/insights", response_model=models.InsightsResponse)
async def fetch_insights(
    db: Session = Depends(get_db),
    current_user: db_models.DBUser = Depends(auth_service.get_current_user)
):
    """
    Extract global team performance and memory-derived project risks.
    """
    return await get_insights(db)
```
The implementation lives in services/insights_service.py (referenced in main.py as from services.insights_service import get_insights). The function pulls raw stats from SQLite—completion counts, delay counts, average times—then augments them with Hindsight-retrieved memory to produce the risk narrative. The SQL layer gives you counts. The memory layer gives you patterns.
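The merge step can be sketched as a small pure function—this mirrors the InsightsResponse shape, but it is my sketch of the combination logic, not the project's actual insights_service code:

```python
def summarize_insights(stats: dict, risk_memories: list) -> dict:
    """Combine SQL-derived per-member stats with memory-derived risks.

    `stats` maps member -> {"completed": int, "delayed": int}; the
    input shape is an assumption for this sketch.
    """
    best = max(stats, key=lambda m: stats[m]["completed"], default=None)
    most_delayed = max(stats, key=lambda m: stats[m]["delayed"], default=None)
    return {
        "best_performing_member": best,
        "most_delayed_member": most_delayed,
        "stats": stats,
        # Strings surfaced by Hindsight retrieval, phrased by Groq.
        "risk_insights": risk_memories,
    }

result = summarize_insights(
    {"Alice": {"completed": 5, "delayed": 2},
     "Bob": {"completed": 3, "delayed": 0}},
    ["Alice has a history of delays on front-end tasks when overloaded."],
)
print(result["best_performing_member"])  # Alice
```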
How Hindsight Actually Learns: A Concrete Walk-Through
Here is what happens when you hit /seed:
```python
t3 = await create_task(db, models.TaskItemCreate(
    task_name="Fix UI dashboard bug",
    assigned_to="Alice",
    deadline=now - timedelta(days=1),  # already past due
    priority="Low",
    difficulty="Easy",
    ai_rationale="Assigned To Alice during High-Load phase to balance team throughput."
))
await mark_task_completed(db, t3.id)
```
Alice is a backend engineer. This task is UI work. It got a past-due deadline and was assigned because the team was short on bandwidth, not because Alice was the right person. The rationale even admits this: "High-Load phase to balance team throughput."
When mark_task_completed runs, it calculates the deadline delta (completed_at - deadline), bundles the task metadata, and pushes a memory event to Hindsight. The event carries the task name, the assignee, the rationale, the difficulty, the priority, and whether it was late or on time.
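The deadline-delta part is plain datetime arithmetic; a minimal sketch of the calculation (the real logic lives inside mark_task_completed):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def deadline_delta_days(completed_at: datetime,
                        deadline: Optional[datetime]) -> Optional[int]:
    """Days past deadline: positive = late, negative = early, None = no deadline.

    Note: timedelta.days floors toward negative infinity, so finishing
    an hour early still reads as -1 -- coarse, but fine as a risk signal.
    """
    if deadline is None:
        return None
    return (completed_at - deadline).days

now = datetime.now(timezone.utc)
print(deadline_delta_days(now, now - timedelta(days=1)))  # 1  (a day late)
print(deadline_delta_days(now, now + timedelta(days=2)))  # -2 (two days early)
```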
Now, the next time you call /insights, Hindsight retrieves this pattern and similar ones. It knows: "Alice was assigned an out-of-domain task (UI) under high load and it was late." If you later assign Alice another UI task under pressure, the system can surface: "Risk: Alice has a history of delays on front-end tasks when overloaded."
That's the loop. It's not magic—it's semantic search over tagged task-completion events, combined with an LLM that knows how to frame "past delayed, same-domain, same-assignee" as a risk statement. The Hindsight documentation covers the event schema and retrieval API in detail, but the key API surface from our side is: push_event(content, metadata) on write, and query(prompt) on read.
What the Chat Endpoint Adds
Beyond insights, there is a natural-language chat interface:
```python
@app.post("/chat")
async def chat_with_ai(
    payload: dict,
    db: Session = Depends(get_db),
    current_user: db_models.DBUser = Depends(auth_service.get_current_user)
):
    query = payload.get("query", "")
    if not query:
        return {"response": "I didn't receive a query.", "context_ref": "NONE"}
    return await process_chat_query(db, query)
```
process_chat_query pulls current DB state (open tasks, team members, recent decisions) and also queries Hindsight for relevant memory context before passing everything to Groq. This means when you ask "why is the backend behind?", the LLM has both the live task list and the retrieved memory of past incidents to reason from.
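The context assembly might look something like this—the function name, memory-chunk shape, and prompt layout are all assumptions for illustration, not the project's exact implementation:

```python
def build_chat_context(query: str, open_tasks: list, memory_chunks: list) -> dict:
    """Assemble the prompt handed to Groq plus the audit reference.

    memory_chunks: [{"ref": str, "text": str}, ...] retrieved from
    Hindsight. The shapes here are hypothetical.
    """
    task_lines = "\n".join(f"- {t['task_name']} ({t['status']})" for t in open_tasks)
    memory_lines = "\n".join(f"- {m['text']}" for m in memory_chunks)
    prompt = (
        f"Open tasks:\n{task_lines}\n\n"
        f"Relevant history:\n{memory_lines}\n\n"
        f"Question: {query}"
    )
    # The top-ranked memory's reference is kept so the caller can
    # audit which memory informed the answer.
    context_ref = memory_chunks[0]["ref"] if memory_chunks else "NONE"
    return {"prompt": prompt, "context_ref": context_ref}

ctx = build_chat_context(
    "why is the backend behind?",
    [{"task_name": "Schema migration", "status": "In Progress"}],
    [{"ref": "mem-042", "text": "Bob's last two database tasks were late."}],
)
print(ctx["context_ref"])  # mem-042
```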
The context_ref in the response—returned in every chat response—is the reference to which Hindsight memory chunk informed the answer. That's not cosmetic: it lets you audit why the system said what it said, which matters when the recommendation is "reassign Bob's database migration task."
Where It Surprised Me (and Where It Still Falls Down)
The memory architecture worked better than I expected for risk detection. The part I underestimated was retrieval quality at small scale. When the memory store has fewer than ~15 events, the semantic matches are noisy. The first few calls to /insights after a fresh seed return risk statements that are technically valid but too vague to be actionable—"some tasks are at risk of delay"—because there is not enough signal yet for the retrieval to discriminate.
This isn't a Hindsight problem; it's an intrinsic data density problem. Any memory system needs enough varied events before it can surface patterns rather than just facts. I worked around this in the seed data by injecting intentionally mismatched assignments (Alice on UI tasks, Bob on database schema) so the memory store starts with contrast—the system can only learn what "wrong fit" looks like if it has seen some.
The other limitation: I'm using SQLite with check_same_thread=False to handle FastAPI's async path operations:
```python
engine = create_engine(
    SQLALCHEMY_DATABASE_URL,
    connect_args={"check_same_thread": False}
)
```
This is fine for a single-server, low-concurrency deployment. The moment you need multiple workers or any kind of horizontal scaling, you need to move to PostgreSQL. SQLite's write locking will become the bottleneck long before Hindsight or Groq do.
Lessons Learned
The rationale field is more valuable than it looks. Storing ai_rationale with every task assignment turned an assignment log into an audit trail. You can replay decisions and see whether the reasoning was sound, which is exactly the kind of retrospective data a memory system needs.
Push events at the right granularity. I initially pushed events only on task completion. That meant the system learned nothing about in-progress behaviour. Adding events on assignment and on missed intermediate milestones (where milestones existed) made the risk detection significantly more accurate.
Separate the SQL layer from the memory layer clearly. SQL for operational queries (counts, joins, current state). Hindsight for pattern retrieval and historical reasoning. The moment I tried to blur this—running aggregates in Hindsight that should have been SQL—I got slower, less reliable results.
Confidence scores age. The confidence_score column on DBTask is set at assignment time based on the LLM's certainty about the assignee match. But that score doesn't update when the task outcome is known. Feeding the outcome back to update confidence is a natural next step that the schema already supports but the implementation doesn't do yet.
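That feedback step could be as small as this—a hypothetical update rule I have not implemented, where the step size and clamp range are arbitrary choices for the sketch:

```python
from typing import Optional

def update_confidence(confidence: int,
                      deadline_delta_days: Optional[int]) -> int:
    """Nudge the stored confidence_score once the outcome is known.

    Late outcomes penalise the original assignment confidence; on-time
    or early outcomes reinforce it. The +/-10 step and 0-100 clamp are
    arbitrary for this sketch.
    """
    if deadline_delta_days is None:
        return confidence  # no deadline, no signal
    step = -10 if deadline_delta_days > 0 else 10
    return max(0, min(100, confidence + step))

print(update_confidence(80, 3))   # 70: three days late, confidence drops
print(update_confidence(95, -1))  # 100: early completion, clamped at the cap
```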
Seed data matters for demos but it creates false memory. The seeded tasks are designed to produce interesting risk outputs. In a real deployment, you'd want a cold-start strategy—either accept a quiet period before risk insights are useful, or bootstrap with historical data imports rather than synthetic events.
The system isn't finished. The automation rules table (DBAutomationRule) with trigger types like TASK_COMPLETED and CREATE_TASK is scaffolded but not wired to the memory layer yet. When it is, you'd get something genuinely useful: a rule that fires when Hindsight detects a high-risk pattern and automatically creates a mitigation task or pings the relevant channel. That's the direction I want to go next—not more insight endpoints, but reactive memory that acts.
If you want to look at the memory SDK that makes this possible, the Hindsight GitHub repository is a good starting point. The pattern—push events on state changes, retrieve semantically on query—generalises to any agent that needs to reason about its own history.