Your Postgres Is Failing Quietly: 7 SQL Checks That Catch It Before Grafana Does

# postgres# database# performance# sql

Victor

Production Postgres rarely fails with a clean red alarm. More often, one endpoint starts taking eight...

Production Postgres rarely fails with a clean red alarm. More often, one endpoint starts taking eight seconds instead of 200 ms while Grafana is still green. CPU looks fine. Memory is fine. Disk is not full. The database is alive.

It is also slowly getting worse.

A table has accumulated millions of dead rows. An index nobody uses is still maintained on every write. A forgotten transaction is holding back vacuum. A query that looked harmless at 4 ms is called two million times a day. None of this is a crash. It is just quiet decay.

For years I kept a queries.sql file for exactly this situation. When production felt wrong, I pasted the same checks into psql and looked for the usual suspects. Below are the seven checks I still reach for first. They do not require an agent, a paid monitoring product, or a new service in the middle of your database path.

Most queries work on PostgreSQL 13+. The one extension used here is pg_stat_statements, and I call that out where it appears.

1. Dead rows: the cheapest early warning

Postgres does not immediately remove old row versions after UPDATE or DELETE. Because of MVCC, dead tuples stay around until vacuum can clean them up. If vacuum cannot keep up, your table gets heavier and every scan has more garbage to step over.

Start with pg_stat_user_tables. It is an estimate, not a forensic report, but it is fast and usually good enough to tell you where to look.

SELECT
    schemaname,
    relname AS table_name,
    n_live_tup,
    n_dead_tup,
    round(
        n_dead_tup * 100.0 / nullif(n_live_tup + n_dead_tup, 0),
        1
    ) AS dead_ratio,
    last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 0
ORDER BY n_dead_tup DESC
LIMIT 20;

What I look for:

Large tables with dead_ratio above 20%.
Tables where last_autovacuum is empty or very old.
A hot table where dead tuples keep growing between checks.

Do not jump straight to VACUUM FULL. It rewrites the table and takes an ACCESS EXCLUSIVE lock, which usually means downtime for that table. For normal cleanup, tune autovacuum. For already accumulated bloat, look at tools like pg_repack and plan the operation properly.

The boring fix that often matters most is per-table autovacuum tuning:

ALTER TABLE public.orders SET (
    autovacuum_vacuum_scale_factor = 0.05,
    autovacuum_analyze_scale_factor = 0.02
);

Defaults are fine for many tables. They are not fine for every large write-heavy table.

2. The queries that actually consume database time

Slow query logs are useful, but they miss the other common failure mode: a query that is individually fast but called constantly.

That is where pg_stat_statements earns its keep. If it is not already enabled, you need it in shared_preload_libraries, then a restart, then:

CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

Then ask which normalized queries consumed the most total execution time:

SELECT
    calls,
    round(total_exec_time::numeric, 1) AS total_ms,
    round(mean_exec_time::numeric, 2) AS mean_ms,
    round(
        (100 * total_exec_time / nullif(sum(total_exec_time) OVER (), 0))::numeric,
        1
    ) AS pct,
    left(query, 180) AS query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 15;

The important sort key is total_exec_time, not mean_exec_time.

A query that averages 4 ms can be your biggest database cost if it runs millions of times. A 3-second report that runs twice a day is noisy, but it may not be where your server is spending its life.

Postgres 12 and older use total_time and mean_time. Postgres 13+ split planning and execution time, which is why the columns above are named total_exec_time and mean_exec_time.

Once you find the real offenders, the next step is usually:

EXPLAIN (ANALYZE, BUFFERS)
SELECT ...;

Use EXPLAIN on the actual query shape, with realistic parameters, not on a simplified toy version.

3. Tables that are read fully again and again

Sequential scans are not automatically bad. On a small table, a seq scan is often the right plan. On a large table, frequent full-table reads can become the invisible tax behind slow endpoints.

This query gives you candidates:

SELECT
    schemaname,
    relname AS table_name,
    seq_scan,
    idx_scan,
    seq_tup_read,
    CASE
        WHEN seq_scan > 0 THEN seq_tup_read / seq_scan
        ELSE 0
    END AS avg_rows_per_seq_scan,
    pg_size_pretty(pg_relation_size(relid)) AS table_size
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_tup_read DESC
LIMIT 20;

What I look for:

Big seq_tup_read on large tables.
High avg_rows_per_seq_scan.
A table with many seq scans and very few index scans.

This query does not tell you which index to create. It only tells you where to investigate. Cross-check it with the top queries from pg_stat_statements, then run EXPLAIN (ANALYZE, BUFFERS).

Do not index a table just because it appears here. If the table has 1,000 rows, the planner may be doing exactly the right thing.

4. Indexes that only slow down writes

An unused index is not free. It consumes disk, takes cache space, and must be updated on every relevant INSERT, UPDATE, and DELETE.

This is the first pass I use to find obvious candidates:

SELECT
    s.schemaname,
    s.relname AS table_name,
    s.indexrelname AS index_name,
    s.idx_scan,
    pg_size_pretty(pg_relation_size(s.indexrelid)) AS index_size
FROM pg_stat_user_indexes s
JOIN pg_index i ON i.indexrelid = s.indexrelid
WHERE s.idx_scan = 0
  AND NOT i.indisunique


  AND NOT i.indisprimary
ORDER BY pg_relation_size(s.indexrelid) DESC
LIMIT 20;

The exclusions matter. Primary keys and unique indexes enforce correctness. You do not drop them just because SELECT queries do not use them.

Also, idx_scan = 0 is a candidate, not a verdict. Check these before dropping anything:

How long has the database been up? Stats reset on restart.
Does the index support a monthly or quarterly job?
Is it used on a read replica rather than primary?
Is it attached to a feature that has not gone live yet?

When you are confident, prefer:

DROP INDEX CONCURRENTLY index_name;

It avoids taking the same kind of blocking lock as a plain DROP INDEX.

5. Long transactions that hold the whole system back

This is one of the most underrated checks. An open transaction can hold locks and keep the xmin horizon old, which prevents vacuum from removing dead row versions. One forgotten BEGIN can make bloat grow across the database.

SELECT
    pid,
    usename,
    application_name,
    client_addr,
    state,
    now() - xact_start  AS xact_age,
    now() - query_start AS query_age,
    wait_event_type,
    left(query, 120) AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND state <> 'idle'
ORDER BY xact_start
LIMIT 20;

What I look for:

idle in transaction sessions older than a few minutes.
Application names that point to a worker, migration, BI tool, or forgotten admin client.
A transaction age that lines up with vacuum not cleaning a table.

The server-side seat belt is:

ALTER SYSTEM SET idle_in_transaction_session_timeout = '2min';
SELECT pg_reload_conf();

Do not copy that value blindly. Some maintenance jobs legitimately need long transactions. But leaving the timeout unset forever is how a quiet bug becomes a Friday incident.

6. Who is blocking whom right now

When production is stuck, you do not want to read raw lock tables by hand. Start with pg_blocking_pids().

SELECT
    blocked.pid AS blocked_pid,
    blocked.usename AS blocked_user,
    left(blocked.query, 100) AS blocked_query,
    blocking.pid AS blocking_pid,
    blocking.usename AS blocking_user,
    left(blocking.query, 100) AS blocking_query
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
  ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE blocked.wait_event_type = 'Lock';

The left side is waiting. The right side is holding it up.

On a busy system, this can become a chain: one session blocks another, which blocks another, which blocks the request your users are waiting on.

Be careful with pg_terminate_backend(). Terminating the blocker may be the right move during an incident, but it can roll back work, break a migration, or surprise an application that thought it owned the transaction. Identify the session first. Terminate second.

7. One number when you only have 30 seconds

If I have almost no time, I still check cache hit ratio:

SELECT
    datname,
    round(
        blks_hit * 100.0 / nullif(blks_hit + blks_read, 0),
        2
    ) AS cache_hit_ratio
FROM pg_stat_database
WHERE datname = current_database();

For many OLTP workloads, you want this to be very high, often 99%+. If it drops sharply, it can mean:

the working set no longer fits well in memory;
a query is repeatedly dragging a large table through disk;
an index or plan change moved reads from cache-friendly to disk-heavy;
a batch job is disturbing the rest of the workload.

This is not a complete diagnosis. Postgres also benefits from the OS page cache, and this metric alone will not tell you which query is guilty. Treat it as a smoke alarm, then go back to sections 2 and 3.

What to do on Monday

Do not wait for an incident to run these. Put them into a queries.sql file and run them against your main database when everything is calm. Save the output. That gives you a baseline.

Then repeat every couple of weeks:

Top dead tuples by table.
Top total execution time by query.
Tables with heavy sequential reads.
Large unused indexes.
Long and idle transactions.
Current blocking chains.
Cache hit ratio.

The point is not to fix everything immediately. The point is to know what normal looks like before production teaches you what abnormal looks like.

The part I automated

Running these checks manually works. The problem is that manual checks are boring, so they usually happen only after something is already slow.

I eventually got tired of opening my old queries.sql file and built the same workflow into ide99, a local-first desktop IDE for Postgres that I am working on. The Health Screen is basically the checks above: bloat signals, slow queries from pg_stat_statements, unused indexes, hanging transactions, blocking chains, cache hit, vacuum status, and a visual EXPLAIN view next to the query editor.

Disclosure: ide99 is my project. It is free, open source, and does not need to sit between your app and your database. You can read the code or grab binaries from GitHub.

But the SQL in this post is the important part. Copy it, change it, put it in your own queries.sql, and run it even if you never install another tool.

What is in your Postgres diagnostic file? I am especially interested in weird production checks that caught something before the dashboard did.

Your Postgres Is Failing Quietly: 7 SQL Checks That Catch It Before Grafana Does

1. Dead rows: the cheapest early warning

2. The queries that actually consume database time

3. Tables that are read fully again and again

4. Indexes that only slow down writes

5. Long transactions that hold the whole system back

6. Who is blocking whom right now

7. One number when you only have 30 seconds

What to do on Monday

The part I automated

Useful links