SongBhaav v3: Deduplication, Realtime, and Building a Self-Healing Pipeline

#ai #webdev #nextjs #architecture

Pulkit Goyal

In Part 1, I covered how SongBhaav's backend broke two days after launch and how the v2 rewrite solved the serverless timeout problem with an async job queue.

That architecture held up well. But a few things still bothered me — some were UX rough edges, one was a ticking time bomb I found out about from an email.

This post covers the four changes that make up v3: synchronous lyrics scraping for faster failure feedback, job deduplication, a switch from polling to Realtime, and a self-healing system for a Spotify policy change that was about to break my daily sync pipeline.

Problem 1: "Lyrics Not Found" Took Too Long to Tell You

In v2, the flow was: kick off a background job → wait → eventually find out if lyrics existed at all.

This meant that if a song's lyrics genuinely couldn't be found anywhere — LRCLIB, Genius, none of it — the user would sit through the entire async handoff, watch a loading state, and then get told there was nothing to show. The failure mode was slow in exactly the case where it should have been instant.

The fix in v3 was to move lyrics fetching back into the synchronous part of the request, but only the fetching — not the AI processing.

User → POST /api/start-job
         ├─ Check processed_songs (cache hit? return immediately)
         ├─ Check background_jobs (job already in-flight? return existing job_id)
         └─ New song: run fetchLyricsCascade() synchronously
                ├─ No lyrics found anywhere → return immediately, no job ever created
                └─ Lyrics found → save to DB → create job → push to QStash → return job_id

If lyrics genuinely don't exist, the user gets that answer in the time it takes to query three lyrics providers (a couple of seconds) instead of going through the full async detour first. Only the slow, unpredictable part (Gemini analysis) stays async. The fast, deterministic part (do these lyrics exist anywhere?) happens upfront.
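As a rough sketch of what a cascade like fetchLyricsCascade() could look like: try each provider in order, stop at the first hit, and report a clean "not found" if every provider misses. The Track and LyricsProvider types, the provider list parameter, and the choice to treat a provider error as a miss are all assumptions for illustration, not the actual SongBhaav code.

```typescript
// A lyrics provider takes a track description and resolves to lyrics text,
// or null when it has nothing for that track. (Hypothetical shape.)
type Track = { title: string; artist: string };
type LyricsProvider = {
  name: string;
  fetch: (track: Track) => Promise<string | null>;
};

type CascadeResult =
  | { found: true; source: string; lyrics: string }
  | { found: false };

// Try each provider in order and stop at the first one that returns lyrics.
async function fetchLyricsCascade(
  track: Track,
  providers: LyricsProvider[],
): Promise<CascadeResult> {
  for (const provider of providers) {
    try {
      const lyrics = await provider.fetch(track);
      if (lyrics !== null && lyrics.trim().length > 0) {
        return { found: true, source: provider.name, lyrics };
      }
    } catch {
      // Assumption: a provider error counts as a miss, so one flaky
      // source does not block the rest of the cascade.
    }
  }
  // Every provider missed: the caller can answer right away, no job needed.
  return { found: false };
}
```

With this shape, the route handler only creates a job and pushes to QStash when the result has found set to true.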

This also let me split the loading UI into two honest stages instead of one generic spinner — an orange "hunting down the lyrics" stage while scraping runs, and a purple "breaking down the bars" stage once AI processing kicks in. Small thing, but it makes the wait feel like progress rather than a black box.

Problem 2: Two Users, Same Song, Double the Work

This one only needs a small amount of concurrent traffic to show up, but it's a real bug. If two people searched the same uncached song within a few seconds of each other, both requests would independently check the cache, both would miss, and both would spin up separate QStash jobs: duplicate lyrics fetches, duplicate Gemini calls, and duplicate API spend for the exact same song.

The fix is a simple in-flight check before creating a new job:

check background_jobs for this spotify_id:
  - status = 'pending' or 'processing' already exists?
      → return that existing job_id to the new request
  - otherwise:
      → proceed to create a new job

Both users end up subscribed to the same job, watching the same background worker resolve once, and both get the result when it completes. No wasted API calls, no race condition, no double billing on Gemini calls for a song that was already being processed.
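The in-flight check can be sketched as a small "get or create" helper. This version uses an in-memory JobStore as a stand-in for the background_jobs table so the logic is easy to follow. The class, the helper name, and the "completed" and "failed" statuses are assumptions for illustration (the post only names 'pending' and 'processing').

```typescript
type JobStatus = "pending" | "processing" | "completed" | "failed";
type Job = { id: string; spotifyId: string; status: JobStatus };

// Hypothetical in-memory stand-in for the background_jobs table.
class JobStore {
  private jobs: Job[] = [];
  private nextId = 1;

  // Find a job for this song that has not finished yet.
  findInFlight(spotifyId: string): Job | undefined {
    return this.jobs.find(
      (j) =>
        j.spotifyId === spotifyId &&
        (j.status === "pending" || j.status === "processing"),
    );
  }

  create(spotifyId: string): Job {
    const job: Job = { id: `job_${this.nextId++}`, spotifyId, status: "pending" };
    this.jobs.push(job);
    return job;
  }
}

// Return the in-flight job for this song if one exists; otherwise create one.
function getOrCreateJob(
  store: JobStore,
  spotifyId: string,
): { job: Job; reused: boolean } {
  const existing = store.findInFlight(spotifyId);
  if (existing) return { job: existing, reused: true };
  return { job: store.create(spotifyId), reused: false };
}
```

One caveat with this pattern against a real database: the check and the insert are separate statements, so two requests arriving within the same few milliseconds could still both miss. If that narrow window matters, a partial unique index on spotify_id covering only in-flight statuses is one common way to let the database reject the second insert.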

Problem 3: Polling Worked, But Realtime Is Just Better Here

In v2, the frontend polled /api/check-job every few seconds to check status. It worked, and I wrote about why polling was a reasonable choice over WebSockets at the time — simpler, no persistent connection overhead.

In v3, I switched to Supabase Realtime instead. A few reasons pushed this:

Polling means either over-fetching (checking too often, wasting requests) or under-fetching (checking too rarely, feeling laggy). Realtime sidesteps that tradeoff entirely — the update arrives the instant the row changes, no guessing on interval timing.
With deduplication now in place, multiple clients can be waiting on the same job. Realtime handles "many clients subscribed to one row" more naturally than each client running its own polling loop against the same endpoint.
The actual implementation overhead turned out to be smaller than I expected — subscribing to a filtered channel on job_id is a few lines, and Supabase handles the connection lifecycle.

So this wasn't a case of "polling was wrong" — it was right for v2's constraints. It just stopped being the better tradeoff once concurrent job-sharing entered the picture.
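For a sense of how small the subscription code is, here is a minimal sketch. The RealtimeClientLike and RealtimeChannelLike interfaces are hypothetical, cut down to the calls used here and modelled on the supabase-js v2 channel API (channel, on with "postgres_changes", subscribe, removeChannel). In a real app you would call these methods directly on the client from @supabase/supabase-js. The watchJob name and the assumption that the job's primary key column is called id are also illustrative.

```typescript
type JobRow = { id: string; status: string };
type ChangePayload = { new: JobRow };

// Minimal structural types covering only the calls used below.
interface RealtimeChannelLike {
  on(
    type: "postgres_changes",
    filter: { event: "UPDATE"; schema: string; table: string; filter: string },
    callback: (payload: ChangePayload) => void,
  ): RealtimeChannelLike;
  subscribe(): RealtimeChannelLike;
}

interface RealtimeClientLike {
  channel(name: string): RealtimeChannelLike;
  removeChannel(channel: RealtimeChannelLike): unknown;
}

// Subscribe to updates for a single job row; returns an unsubscribe function.
function watchJob(
  client: RealtimeClientLike,
  jobId: string,
  onChange: (row: JobRow) => void,
): () => void {
  const channel = client
    .channel(`job-${jobId}`)
    .on(
      "postgres_changes",
      {
        event: "UPDATE",
        schema: "public",
        table: "background_jobs",
        // Assumption: the job's primary key column is named "id".
        filter: `id=eq.${jobId}`,
      },
      (payload) => onChange(payload.new),
    )
    .subscribe();

  // Call this when the job finishes or the component unmounts.
  return () => {
    client.removeChannel(channel);
  };
}
```

Because every client filters on the same row, two users sharing a deduplicated job each receive the same update without hitting an endpoint in a loop.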

Problem 4: An Email From Spotify About Something That Hadn't Broken Yet

SongBhaav runs a daily sync via GitHub Actions that pulls my recently played tracks from Spotify, so songs are pre-processed before anyone searches for them. This depends on a Spotify refresh token staying valid indefinitely.

Spotify recently emailed every developer with a policy change: starting July 20, 2026, refresh tokens will expire after six months. Once expired, any attempt to refresh returns an invalid_grant error, and the only fix is sending the user through the sign-in flow again.

This hadn't broken anything yet. But it was going to, on a specific date, with zero warning beyond that email — exactly the kind of failure that's easy to forget about until your daily sync silently stops working months from now and you have no idea why.

So instead of waiting for it to break, I built the recovery flow ahead of time:

Daily sync script runs
  → reads refresh_token from system_credentials table (DB, not env vars)
  → requests new access token from Spotify
  → if invalid_grant:
        → generate one-time secure ticket, insert into admin_sessions (1hr expiry)
        → POST a magic link to a private Discord webhook
        → exit gracefully (no retry, no crash loop)

When I click the Discord link:
  → /api/admin/spotify-login validates the ticket, marks it used, redirects to Spotify OAuth
  → /api/admin/spotify-callback exchanges the auth code, UPSERTs new refresh_token into system_credentials

Next sync run → reads the new token → succeeds
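The control flow of the sync script can be sketched with its side effects injected. Everything in SyncDeps is a hypothetical placeholder for the real Spotify, database, and Discord calls, and the ticket query parameter name is an assumption. The point is only the branching: succeed, ask for re-authorization exactly once on invalid_grant, or surface any other error.

```typescript
// Result of the token refresh call, reduced to the fields used here.
type TokenResponse =
  | { ok: true; accessToken: string }
  | { ok: false; error: string };

// Hypothetical dependencies, injected so the flow can be read and tested
// without real network or database calls.
type SyncDeps = {
  readRefreshToken: () => Promise<string>;
  requestAccessToken: (refreshToken: string) => Promise<TokenResponse>;
  createAdminTicket: (ttlMs: number) => Promise<string>;
  notifyDiscord: (message: string) => Promise<void>;
  syncRecentlyPlayed: (accessToken: string) => Promise<void>;
};

type SyncOutcome = "synced" | "reauth_requested";

const ONE_HOUR_MS = 60 * 60 * 1000;

async function runDailySync(deps: SyncDeps, baseUrl: string): Promise<SyncOutcome> {
  const refreshToken = await deps.readRefreshToken();
  const res = await deps.requestAccessToken(refreshToken);

  if (res.ok) {
    await deps.syncRecentlyPlayed(res.accessToken);
    return "synced";
  }

  if (res.error === "invalid_grant") {
    // Dead token: create a one-time ticket, send the magic link, and stop.
    const ticket = await deps.createAdminTicket(ONE_HOUR_MS);
    // Assumption: the ticket travels as a "ticket" query parameter.
    const link = `${baseUrl}/api/admin/spotify-login?ticket=${encodeURIComponent(ticket)}`;
    await deps.notifyDiscord(`Spotify refresh token expired. Re-authorize: ${link}`);
    return "reauth_requested";
  }

  // Anything else (for example a temporary server error) is not a dead token.
  throw new Error(`Unexpected token error: ${res.error}`);
}
```

Returning a distinct outcome for the re-auth case lets the calling script exit with success, so the scheduled workflow does not look like a crash.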

A few deliberate choices here:

Credentials live in the database, not environment variables. This means re-authorizing doesn't require a redeploy — the next scheduled run just picks up the new token from the DB automatically.

The ticket is single-use and time-boxed. Even if the Discord webhook URL ever leaked, an expired or already-used ticket can't be replayed.

The script exits cleanly on invalid_grant instead of retrying. Retrying a dead token wastes calls and clutters logs with the same error repeatedly. Better to fail once, loudly, and wait for the human-in-the-loop fix.
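The single-use, time-boxed ticket check is worth seeing in code, since that is what makes a leaked link harmless. Below is a minimal sketch using an in-memory Map in place of the admin_sessions table. The AdminTicket shape, the redeemTicket name, and the reason strings are assumptions for illustration.

```typescript
// Hypothetical row shape for admin_sessions.
type AdminTicket = { token: string; expiresAt: number; usedAt: number | null };

type TicketCheck =
  | { valid: true; ticket: AdminTicket }
  | { valid: false; reason: "unknown" | "expired" | "already_used" };

// Validate a ticket and mark it used in the same step, so a second click
// on the same link is rejected. Timestamps are milliseconds since epoch.
function redeemTicket(
  tickets: Map<string, AdminTicket>,
  token: string,
  now: number,
): TicketCheck {
  const ticket = tickets.get(token);
  if (!ticket) return { valid: false, reason: "unknown" };
  if (ticket.usedAt !== null) return { valid: false, reason: "already_used" };
  if (now >= ticket.expiresAt) return { valid: false, reason: "expired" };

  ticket.usedAt = now;
  return { valid: true, ticket };
}
```

Against a real table, the "check and mark used" step would ideally be one conditional UPDATE, so that two simultaneous clicks cannot both pass the check.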

The actual mechanism is simple — a magic link sent over a webhook. What made it worth building was reading the policy email in advance and treating "this will eventually fail" with the same urgency as "this is currently failing."

What Changed, In Summary

Concern	v2	v3
Lyrics-not-found feedback	Async, slow	Synchronous, a couple of seconds
Duplicate concurrent searches	Each spawns a new job	Deduplicated, shared job
Client notification	Polling `/api/check-job`	Supabase Realtime subscription
Spotify token expiry	Not handled	Self-healing via Discord + OAuth

None of these were emergencies. That's sort of the point — v2 wasn't broken, it just had room to get better, and one looming policy change that hadn't caused damage yet but eventually would have.

If you want to try SongBhaav yourself, drop any song in the search bar: song-bhaav.vercel.app