
VictorProduction Postgres rarely fails with a clean red alarm. More often, one endpoint starts taking eight...
Production Postgres rarely fails with a clean red alarm. More often, one endpoint starts taking eight seconds instead of 200 ms while Grafana is still green. CPU looks fine. Memory is fine. Disk is not full. The database is alive.
It is also slowly getting worse.
A table has accumulated millions of dead rows. An index nobody uses is still maintained on every write. A forgotten transaction is holding back vacuum. A query that looked harmless at 4 ms is called two million times a day. None of this is a crash. It is just quiet decay.
For years I kept a queries.sql file for exactly this situation. When production felt wrong, I pasted the same checks into psql and looked for the usual suspects. Below are the seven checks I still reach for first. They do not require an agent, a paid monitoring product, or a new service in the middle of your database path.
Most queries work on PostgreSQL 13+. The one extension used here is pg_stat_statements, and I call that out where it appears.
Postgres does not immediately remove old row versions after UPDATE or DELETE. Because of MVCC, dead tuples stay around until vacuum can clean them up. If vacuum cannot keep up, your table gets heavier and every scan has more garbage to step over.
Start with pg_stat_user_tables. It is an estimate, not a forensic report, but it is fast and usually good enough to tell you where to look.
SELECT
schemaname,
relname AS table_name,
n_live_tup,
n_dead_tup,
round(
n_dead_tup * 100.0 / nullif(n_live_tup + n_dead_tup, 0),
1
) AS dead_ratio,
last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 0
ORDER BY n_dead_tup DESC
LIMIT 20;
What I look for:
dead_ratio above 20%.last_autovacuum is empty or very old.Do not jump straight to VACUUM FULL. It rewrites the table and takes an ACCESS EXCLUSIVE lock, which usually means downtime for that table. For normal cleanup, tune autovacuum. For already accumulated bloat, look at tools like pg_repack and plan the operation properly.
The boring fix that often matters most is per-table autovacuum tuning:
ALTER TABLE public.orders SET (
autovacuum_vacuum_scale_factor = 0.05,
autovacuum_analyze_scale_factor = 0.02
);
Defaults are fine for many tables. They are not fine for every large write-heavy table.
Slow query logs are useful, but they miss the other common failure mode: a query that is individually fast but called constantly.
That is where pg_stat_statements earns its keep. If it is not already enabled, you need it in shared_preload_libraries, then a restart, then:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
Then ask which normalized queries consumed the most total execution time:
SELECT
calls,
round(total_exec_time::numeric, 1) AS total_ms,
round(mean_exec_time::numeric, 2) AS mean_ms,
round(
(100 * total_exec_time / nullif(sum(total_exec_time) OVER (), 0))::numeric,
1
) AS pct,
left(query, 180) AS query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 15;
The important sort key is total_exec_time, not mean_exec_time.
A query that averages 4 ms can be your biggest database cost if it runs millions of times. A 3-second report that runs twice a day is noisy, but it may not be where your server is spending its life.
Postgres 12 and older use total_time and mean_time. Postgres 13+ split planning and execution time, which is why the columns above are named total_exec_time and mean_exec_time.
Once you find the real offenders, the next step is usually:
EXPLAIN (ANALYZE, BUFFERS)
SELECT ...;
Use EXPLAIN on the actual query shape, with realistic parameters, not on a simplified toy version.
Sequential scans are not automatically bad. On a small table, a seq scan is often the right plan. On a large table, frequent full-table reads can become the invisible tax behind slow endpoints.
This query gives you candidates:
SELECT
schemaname,
relname AS table_name,
seq_scan,
idx_scan,
seq_tup_read,
CASE
WHEN seq_scan > 0 THEN seq_tup_read / seq_scan
ELSE 0
END AS avg_rows_per_seq_scan,
pg_size_pretty(pg_relation_size(relid)) AS table_size
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_tup_read DESC
LIMIT 20;
What I look for:
seq_tup_read on large tables.avg_rows_per_seq_scan.This query does not tell you which index to create. It only tells you where to investigate. Cross-check it with the top queries from pg_stat_statements, then run EXPLAIN (ANALYZE, BUFFERS).
Do not index a table just because it appears here. If the table has 1,000 rows, the planner may be doing exactly the right thing.
An unused index is not free. It consumes disk, takes cache space, and must be updated on every relevant INSERT, UPDATE, and DELETE.
This is the first pass I use to find obvious candidates:
SELECT
s.schemaname,
s.relname AS table_name,
s.indexrelname AS index_name,
s.idx_scan,
pg_size_pretty(pg_relation_size(s.indexrelid)) AS index_size
FROM pg_stat_user_indexes s
JOIN pg_index i ON i.indexrelid = s.indexrelid
WHERE s.idx_scan = 0
AND NOT i.indisunique
AND NOT i.indisprimary
ORDER BY pg_relation_size(s.indexrelid) DESC
LIMIT 20;
The exclusions matter. Primary keys and unique indexes enforce correctness. You do not drop them just because SELECT queries do not use them.
Also, idx_scan = 0 is a candidate, not a verdict. Check these before dropping anything:
When you are confident, prefer:
DROP INDEX CONCURRENTLY index_name;
It avoids taking the same kind of blocking lock as a plain DROP INDEX.
This is one of the most underrated checks. An open transaction can hold locks and keep the xmin horizon old, which prevents vacuum from removing dead row versions. One forgotten BEGIN can make bloat grow across the database.
SELECT
pid,
usename,
application_name,
client_addr,
state,
now() - xact_start AS xact_age,
now() - query_start AS query_age,
wait_event_type,
left(query, 120) AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
AND state <> 'idle'
ORDER BY xact_start
LIMIT 20;
What I look for:
idle in transaction sessions older than a few minutes.The server-side seat belt is:
ALTER SYSTEM SET idle_in_transaction_session_timeout = '2min';
SELECT pg_reload_conf();
Do not copy that value blindly. Some maintenance jobs legitimately need long transactions. But leaving the timeout unset forever is how a quiet bug becomes a Friday incident.
When production is stuck, you do not want to read raw lock tables by hand. Start with pg_blocking_pids().
SELECT
blocked.pid AS blocked_pid,
blocked.usename AS blocked_user,
left(blocked.query, 100) AS blocked_query,
blocking.pid AS blocking_pid,
blocking.usename AS blocking_user,
left(blocking.query, 100) AS blocking_query
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE blocked.wait_event_type = 'Lock';
The left side is waiting. The right side is holding it up.
On a busy system, this can become a chain: one session blocks another, which blocks another, which blocks the request your users are waiting on.
Be careful with pg_terminate_backend(). Terminating the blocker may be the right move during an incident, but it can roll back work, break a migration, or surprise an application that thought it owned the transaction. Identify the session first. Terminate second.
If I have almost no time, I still check cache hit ratio:
SELECT
datname,
round(
blks_hit * 100.0 / nullif(blks_hit + blks_read, 0),
2
) AS cache_hit_ratio
FROM pg_stat_database
WHERE datname = current_database();
For many OLTP workloads, you want this to be very high, often 99%+. If it drops sharply, it can mean:
This is not a complete diagnosis. Postgres also benefits from the OS page cache, and this metric alone will not tell you which query is guilty. Treat it as a smoke alarm, then go back to sections 2 and 3.
Do not wait for an incident to run these. Put them into a queries.sql file and run them against your main database when everything is calm. Save the output. That gives you a baseline.
Then repeat every couple of weeks:
The point is not to fix everything immediately. The point is to know what normal looks like before production teaches you what abnormal looks like.
Running these checks manually works. The problem is that manual checks are boring, so they usually happen only after something is already slow.
I eventually got tired of opening my old queries.sql file and built the same workflow into ide99, a local-first desktop IDE for Postgres that I am working on. The Health Screen is basically the checks above: bloat signals, slow queries from pg_stat_statements, unused indexes, hanging transactions, blocking chains, cache hit, vacuum status, and a visual EXPLAIN view next to the query editor.
Disclosure: ide99 is my project. It is free, open source, and does not need to sit between your app and your database. You can read the code or grab binaries from GitHub.
But the SQL in this post is the important part. Copy it, change it, put it in your own queries.sql, and run it even if you never install another tool.
What is in your Postgres diagnostic file? I am especially interested in weird production checks that caught something before the dashboard did.