Three dumb ways our prod got slow (and not one was a slow algorithm)

#postgres #webdev #performance #supabase

Vadym Arnaut

TL;DR. Three prod slowdowns in one week. A 30-second timeout that was secretly TWO MINUTES. A math library we shipped twice to students who don't even have math. And a dashboard that built tens of thousands of objects just to draw a few hundred rows. Not one of them was a slow algorithm. The cause was never where I looked first.

We run Equip (open-source LMS: FastAPI + React + Supabase Postgres). I spent a week chasing latency, and the slow-query log lied to me every single time. Every fix lived UPSTREAM of the code that looked slow. Here's the wall of shame.

1. The 30-second timeout that was actually two minutes

I was sure we had a 30s statement_timeout. I set it the obvious way:

# looks right. does absolutely nothing in prod.
connect_args = {"options": "-c statement_timeout=30000"}

Then I ran SHOW statement_timeout on prod. It said 2min.

Two. Minutes. Not 30 seconds. The raw cluster default.

Here's why. We sit behind Supabase's Supavisor in transaction-pooling mode, and a transaction pooler SILENTLY drops libpq startup options. The server connection you borrow for a transaction is not the one you "connected" with, so everything you set at startup just... evaporates. Our 30-second safety net was a comment. One ugly query could pin a pooled connection for the full two minutes.

The fix is a SET LOCAL on every transaction, wired through a SQLAlchemy engine event:

from sqlalchemy import event

# _engine is the app's SQLAlchemy Engine (the one create_engine returned)
@event.listens_for(_engine, "begin")
def _set_statement_timeout(conn):
    # SET LOCAL is scoped to THIS transaction, so it can't leak
    # onto the next client that borrows this pooled connection
    conn.exec_driver_sql("SET LOCAL statement_timeout = '30s'")

SET LOCAL is the ONLY form that behaves behind a transaction pooler. The startup options get dropped. A plain session-level SET sticks to the server connection and leaks onto whoever grabs it next. SET LOCAL is undone when the transaction ends, commit or rollback, so every transaction gets exactly the ceiling you asked for and nobody inherits it.
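
If the leak is hard to picture, here's a toy model of one pooled server connection in plain Python. It's a sketch of the behavior described above, not how Supavisor is actually implemented. ServerConnection and run_transaction are made-up names for illustration, and the "5s" value is arbitrary.

```python
from dataclasses import dataclass
from typing import Callable, Optional

CLUSTER_DEFAULT = "2min"  # what SHOW statement_timeout reported on prod


@dataclass
class ServerConnection:
    """Toy stand-in for one Postgres backend that the pooler lends out."""

    session_timeout: str = CLUSTER_DEFAULT  # session-level value, outlives transactions
    local_timeout: Optional[str] = None     # transaction-level value

    def set_session(self, value: str) -> None:
        # models: SET statement_timeout = ...
        self.session_timeout = value

    def set_local(self, value: str) -> None:
        # models: SET LOCAL statement_timeout = ...
        self.local_timeout = value

    def show(self) -> str:
        # models: SHOW statement_timeout
        return self.local_timeout or self.session_timeout

    def end_transaction(self) -> None:
        # SET LOCAL is undone at COMMIT or ROLLBACK; session values stay put
        self.local_timeout = None


def run_transaction(
    server: ServerConnection,
    setup: Optional[Callable[[ServerConnection], None]] = None,
) -> str:
    """Borrow the connection for one transaction and report the timeout it saw."""
    if setup is not None:
        setup(server)
    seen = server.show()
    server.end_transaction()
    return seen


server = ServerConnection()

# Client A relies on startup options, which never reach the server.
client_a = run_transaction(server)
# Client B uses SET LOCAL: it gets 30s, and the next borrower is unaffected.
client_b = run_transaction(server, lambda s: s.set_local("30s"))
client_c = run_transaction(server)
# Client D uses a plain SET: the value sticks to the shared connection.
client_d = run_transaction(server, lambda s: s.set_session("5s"))
client_e = run_transaction(server)

print(client_a, client_b, client_c, client_d, client_e)
# 2min 30s 2min 5s 5s
```

Client E never asked for 5s and got it anyway. That's the whole argument for SET LOCAL.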

Lesson: behind a transaction pooler, "I configured the connection" is a sentence you should not trust. That connection isn't yours.

2. We shipped KaTeX TWICE, to people who have no math

The student chapter page kept getting heavier and nobody could tell me why. The content is mostly text. There's basically no math in prod.

So I opened the bundle. We were shipping KaTeX twice. TWICE! One copy (v0.17.0) imported straight into ChapterView. A second copy (v0.16.47) dragged in transitively by the rich-text editor's math extension. Two full copies of a typesetting library, and the student-facing one loaded eagerly on every single chapter render, formula or not.

And there was a nastier landmine under it. katex.min.css only lived in the teacher-editor chunk. So the very first time a teacher typed a formula, it looked perfect for THEM and rendered unstyled and broken for every student, because students never load that chunk. Ouch.

Two fixes. Dedupe to one copy with an npm override so the editor extension uses the app's katex. Then make the whole thing lazy, loading the library AND its CSS only when there's actually math on the page:

// only runs if the page has unrendered math markers
const [{ default: katex }] = await Promise.all([
  import("katex"),
  import("katex/dist/katex.min.css"), // ship the CSS WITH the lib, not in some other chunk
]);

ChapterView fires it and forgets. The student chapter chunk dropped from 84.2 KB to 9.9 KB gzip. For the page loads with no math (almost all of them) katex now never downloads at all.

Lesson: a dependency that your dependency drags in is STILL your bytes. "Import it where it's used" beats "import it at the top" exactly when most people never hit that path.
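
One cheap way to spot this kind of duplicate before it ships is to scan the lockfile for packages installed at more than one version. Below is a minimal sketch in Python, assuming an npm v7+ package-lock.json (the format with a top-level "packages" map keyed by install path). duplicate_packages is a hypothetical helper, and hypothetical-math-ext stands in for whatever extension pulls in the second copy. This is not what we run in CI today.

```python
import json
from collections import defaultdict
from typing import Dict, List, Set


def duplicate_packages(lockfile: dict) -> Dict[str, List[str]]:
    """Map package name -> sorted versions, for packages installed more than once."""
    versions: Dict[str, Set[str]] = defaultdict(set)
    for path, meta in lockfile.get("packages", {}).items():
        # The root project has the key "" and no node_modules/ prefix; skip it.
        if "node_modules/" not in path or "version" not in meta:
            continue
        # "node_modules/a/node_modules/katex" -> "katex"
        name = path.rsplit("node_modules/", 1)[1]
        versions[name].add(meta["version"])
    return {name: sorted(found) for name, found in versions.items() if len(found) > 1}


# Stand-in for json.load(open("package-lock.json")); contents are illustrative.
sample_lock = json.loads(
    """
    {
      "packages": {
        "": {"name": "app"},
        "node_modules/katex": {"version": "0.17.0"},
        "node_modules/hypothetical-math-ext": {"version": "1.0.0"},
        "node_modules/hypothetical-math-ext/node_modules/katex": {"version": "0.16.47"}
      }
    }
    """
)

print(duplicate_packages(sample_lock))
# {'katex': ['0.16.47', '0.17.0']}
```

A lockfile check only proves two versions are installed, not that both end up in the same bundle, so treat it as a tripwire and not a replacement for looking at the bundle itself.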

3. We built tens of thousands of objects to draw a few hundred rows

This one wore three masks, all the same sin: doing O(N×M) work to render O(N) of UI.

/users/me/courses (the dashboard) eager-loaded the WHOLE module-and-chapter tree for every course. Hundreds of entities on a big course. Then the dashboard schema threw all of it in the bin, because the dashboard shows course cards, not chapters.
/progress/course/{id}/students shipped the full per-chapter breakdown for every student. That's students times chapters. On our fat-seed test course it was tens of thousands of objects in ONE response, just to draw a table of names and averages.
The localized course detail validated the same tree up to THREE times per request and cloned a couple hundred chapter objects with model_copy on the way.

The fixes are boring, and that's the point. A slim dashboard schema with no modules field, so the tree never loads. A progress list that returns server-computed averages and lazy-loads one student's detail only when you expand their row. And a single bottom-up validation pass with Pydantic v2's revalidate_instances="never", so each entity is validated once instead of three times.
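
To make the progress-list change concrete, here's a minimal sketch of collapsing students-times-chapters rows into one summary per student. It's plain Python with made-up names (StudentSummary, summarize, and the row layout are assumptions, not Equip's actual schema). In a real endpoint the same aggregation would more likely be a GROUP BY in the query, so the per-chapter rows never leave the database.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple

# (user_id, chapter_id, completed, score): one row per student per chapter
ProgressRow = Tuple[int, int, bool, Optional[float]]


@dataclass(frozen=True)
class StudentSummary:
    user_id: int
    completed_chapters: int
    average_score: Optional[float]  # None when the student has no scores yet


def summarize(rows: Iterable[ProgressRow]) -> List[StudentSummary]:
    """Reduce N x M per-chapter rows to N summary rows in a single pass."""
    totals: Dict[int, List[float]] = {}
    for user_id, _chapter_id, completed, score in rows:
        # [completed count, score sum, scored count]
        entry = totals.setdefault(user_id, [0, 0.0, 0])
        if completed:
            entry[0] += 1
        if score is not None:
            entry[1] += score
            entry[2] += 1
    return [
        StudentSummary(uid, int(done), total / count if count else None)
        for uid, (done, total, count) in totals.items()
    ]


rows: List[ProgressRow] = [
    (1, 10, True, 80.0),
    (1, 11, True, 100.0),
    (1, 12, False, None),
    (2, 10, False, None),
]

summaries = summarize(rows)
print(len(rows), "rows in,", len(summaries), "objects out")
# 4 rows in, 2 objects out
```

The response size now grows with the number of students only. The per-chapter detail is still there, it's just fetched for one student at a time when someone asks for it.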

Then the database side of the exact same story: we were missing indexes. Three of them, one migration. The two that mattered are partial composite indexes:

-- we only ever aggregate COMPLETED attempts, so don't index the rest
CREATE INDEX ix_quiz_attempts_quiz_user_completed
  ON quiz_attempts (quiz_id, user_id)
  WHERE completed_at IS NOT NULL;

CREATE INDEX ix_chapter_progress_chapter_user_completed
  ON chapter_progress (chapter_id, user_id)
  WHERE completed;

That WHERE predicate is the whole trick. These queries only ever touch completed work, so indexing the unfinished rows is dead weight. The third index, on certificates(status), was already declared in our ORM model and had just never been migrated to prod. A model-vs-schema drift the audit caught on the way past.
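
One catch with partial indexes: the planner only considers one when the query's own WHERE clause implies the index predicate. Here's a small sketch that shows the idea using Python's built-in sqlite3, so it runs without a server. SQLite is a stand-in here, not what we run in prod, and Postgres plans will look different, but the rule is the same in spirit. The table is trimmed down to the columns the index needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE quiz_attempts (
        id INTEGER PRIMARY KEY,
        quiz_id INTEGER NOT NULL,
        user_id INTEGER NOT NULL,
        completed_at TEXT
    );
    CREATE INDEX ix_quiz_attempts_quiz_user_completed
        ON quiz_attempts (quiz_id, user_id)
        WHERE completed_at IS NOT NULL;
    """
)


def plan(sql: str) -> str:
    """Return the planner's description of how it would run `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 is the detail text


# Repeats the index predicate, so the partial index is a candidate.
completed_plan = plan(
    "SELECT user_id FROM quiz_attempts "
    "WHERE quiz_id = 1 AND completed_at IS NOT NULL"
)
# Could match unfinished rows too, so the partial index cannot be used.
any_plan = plan("SELECT user_id FROM quiz_attempts WHERE quiz_id = 1")

print(completed_plan)
print(any_plan)
```

The first plan should name the index and the second should not. If a query you expected to be fast drops the completed filter, it silently falls back to a scan, so it's worth checking the plan and not just the migration.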

Lesson: the slow-query log shows you the query that's slow. It does NOT tell you the query should never have been fetching that shape to begin with.

The thread tying it together

None of this was algorithmic. A startup option a pooler ignores. A duplicated dependency. Over-fetching a tree just to throw it away. Indexes that never got created. Every one of them lived upstream of the code that profiled slow: in the connection layer, the bundler, the serializer, the migration history.

The profiler points at the symptom. Now I try to ask, every single time: "what BUILT this input?" before I optimize whatever's chewing through it.

A few things I'd honestly love to hear back on:

If you're behind a transaction pooler (Supavisor, PgBouncer): do you SET LOCAL per transaction, or push timeouts up to the gateway and keep the app dumb? We went app-side and I'm genuinely not sure it's right long-term.
Has a transitive dependency ever shipped a second copy of something already in your bundle? How did you catch it before prod? We're eyeing a gzip size budget in CI.
Where do you draw the lazy-loading line? Math was easy. I'm much less sure about, say, a charting lib used on a third of pages.

ArVaViT / equip

Free, open-source LMS for Bible schools, ministries, and nonprofit educational programs. React + FastAPI + Supabase.

Equip

A free, open-source learning management system built for Bible schools, church ministries, and nonprofit educational programs.

Live at equipbible.com. Teacher and admin views (gradebook, course editor, analytics) are behind sign-in — create a free account to explore.

Why this project?

Hundreds of small Bible schools, home churches, and missionary training programs around the world still manage courses on paper, WhatsApp, or spreadsheets. Commercial LMS platforms are expensive, overkill, or require technical expertise that volunteer-run organizations simply don't have.

Equip is designed to change that:

Free forever — MIT-licensed, no paywalls, no "premium" tiers.
Simple to deploy — one-click Vercel deploy with a free Supabase database. No Docker, no servers to manage.
Built for small scale — optimized for 20-100 students, not…