
Design an incremental-forever backup system for PostgreSQL that minimizes RPO, reduces storage, and enables reliable point-in-time recovery.
Incremental-forever changes the economics of PostgreSQL backups: one full snapshot up front, then a continuous stream of small, reliable increments tied to WAL makes sub-hour (and often sub-minute) RPOs realistic without multiplying storage and restore time. This is the pattern you operate when you treat the WAL as the source of truth and automate every step from archive to verification.
The symptoms I see in the field are consistent: teams run heavyweight fulls because nightly schedules feel safer, then hit exploding storage bills and long restore windows; others enable WAL archiving but treat the archive as “write-only” and never prove restores, which destroys confidence when an incident arrives. Without continuous WAL capture you cannot reliably perform point-in-time recovery (PITR) — PostgreSQL requires a base backup plus the matching WAL stream for PITR and the server’s archive_command / restore_command plumbing must be correct.
A traditional nightly-full plan makes your RPO equal to the backup cadence (e.g., 24 hours) and multiplies storage by the number of fulls you keep. Incremental-forever flips the trade: one full backup, then store only changed blocks + WAL. That reduces data written per job, shortens windows, and keeps storage growth roughly linear with change-rate rather than with retention-count.
Example (simple math that your CFO understands, assuming a 1.0 TB database with ~10 GB of daily change):
| Strategy | Stored data after 30 days |
|---|---|
| Daily fulls (30 fulls kept) | 30 × 1.0 TB = 30 TB |
| Weekly full + diffs | 4 × 1.0 TB + 26 × ~10 GB = ~4.26 TB |
| Incremental-forever (1 full + increments) | 1.0 TB + 30 × 10 GB = 1.3 TB |
The cost math and operational surface both favor incremental-forever when your daily change rate is small relative to the full size.
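The table's arithmetic is easy to rerun against your own numbers; a minimal Bash sketch, using the same illustrative assumptions as the table (a 1000 GB full, 10 GB of daily change, a 30-day window):

```shell
# Model 30 days of retained backup data under each strategy (sizes in GB).
FULL_GB=1000; DELTA_GB=10; DAYS=30
echo "daily fulls:         $(( DAYS * FULL_GB )) GB"                 # 30 fulls kept
echo "weekly full + diffs: $(( 4 * FULL_GB + 26 * DELTA_GB )) GB"   # 4 fulls + 26 diffs
echo "incremental-forever: $(( FULL_GB + DAYS * DELTA_GB )) GB"     # 1 full + 30 increments
```

Swap in your measured change rate; the incremental-forever line grows with change rate, not with retention count.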
A robust incremental-forever architecture for PostgreSQL has three minimal pieces that must be engineered together:
Base backup (the initial full): create one consistent physical base using pg_basebackup or a vendor tool that integrates with PostgreSQL’s backup API. pg_basebackup writes a manifest and coordinates WAL handling for you; tools such as wal-g and pgBackRest provide higher-level integration for pushing the base to object storage.
WAL streaming/archive (continuous change capture): set wal_level = replica (or higher), enable archive_mode = on, and use an archive_command that reliably transfers completed WAL segments to durable storage. For streaming replication use replication slots to avoid premature WAL removal; for archive mode configure archive_timeout to bound the delay between transaction commit and WAL availability. These settings are the backbone of PITR.
Durable object storage and a repository format: store base backups and WAL in a versioned, durable object repository (S3/GCS/Azure or equivalent). Tools like wal-g can backup-push and wal-push directly to S3/GCS; pgBackRest supports multi-repo strategies and has strong retention/expire semantics for WAL and backups.
Concrete config examples (short snippets):
postgresql.conf (core WAL settings)
```ini
# essential
wal_level = replica
archive_mode = on
archive_timeout = 60 # seconds — force a switch on low-traffic systems
max_wal_senders = 5
# archive_command examples:
# wal-g
archive_command = 'envdir /etc/wal-g.d/env wal-g wal-push %p'
# pgBackRest
# archive_command = 'pgbackrest --stanza=demo archive-push %p'
```
Those archive_command forms are standard integration points for wal-g and pgBackRest.
A standard run: take the base backup once (or weekly), then continuously wal-push each WAL segment as PostgreSQL completes it. The archive is your point-in-time data stream.
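On the scheduling side, only the base backup needs a timer; WAL flows continuously via archive_command. A sketch of a crontab entry for the postgres user (the path and schedule are assumptions):

```
# m h dom mon dow  command
# Weekly base backup at 02:00 Sunday. WAL segments are pushed continuously
# by archive_command, so they need no cron entry.
0 2 * * 0  envdir /etc/wal-g.d/env wal-g backup-push /var/lib/postgresql/data
```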
Retention policy must align with your RPO window, legal retention requirements, and the restore window you are willing to accept. Two categories exist: backup-object retention (how many and which base backups to keep) and WAL retention (how long WAL is kept, and which WAL segments are necessary to restore to a particular base).
- pgBackRest exposes repo*-retention-* options such as repo1-retention-full, repo1-retention-diff, and repo1-retention-archive to express retention as counts or days; expiration removes backups and their dependent WAL segments atomically.
- wal-g uses delete retain semantics to prune backups and relies on WAL metadata to expire WAL safely; wal-g also documents features like reverse-delta unpack and redundant-archive skipping to reduce restore I/O.
Space optimization levers (what to tune and why):
- Compression: zstd or lz4 for a balanced CPU-vs-size trade-off (pgBackRest supports compress-type and compress-level).
- Delta: pgBackRest's --delta option (used on restore or backup) leverages checksums to skip unchanged files; this dramatically reduces I/O during restore/backup in many environments.
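As a sketch, a pgbackrest.conf fragment combining the retention and compression levers (values are illustrative; stanza name demo and paths are assumptions):

```
[global]
repo1-path=/var/lib/pgbackrest
repo1-retention-full=2
repo1-retention-diff=7
compress-type=zst
compress-level=3

[demo]
pg1-path=/var/lib/postgresql/data
```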
How to compute required WAL retention: keep every WAL segment from the start of the oldest base backup you may still restore from through the present; once a base backup expires, all WAL older than the start of the next-oldest base can expire with it.
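Within a single timeline, WAL segment file names sort lexicographically, so the expiry rule can be sketched in Bash (segment names below are hypothetical; in practice tools like pgbackrest expire and wal-g delete apply this logic for you):

```shell
# Any archived segment older than the start segment of the oldest retained
# base backup can be expired; everything newer must be kept for PITR.
OLDEST_BASE_START_WAL="000000010000000A00000003"
for seg in 000000010000000A00000001 000000010000000A00000002 000000010000000A00000005; do
  if [[ "$seg" < "$OLDEST_BASE_START_WAL" ]]; then
    echo "expire $seg"
  else
    echo "keep   $seg"
  fi
done
```

Note the caveat: this comparison only holds within one timeline; after a timeline switch, rely on the tool's metadata rather than name ordering.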
A restore plan must be explicit and automated. There are three restore patterns you will use repeatedly:
PITR (physical) with pgBackRest (example):
```shell
# restore to a point in time and auto-generate recovery settings
# (pgBackRest will write the recovery config)
sudo -u postgres pgbackrest --stanza=demo --delta \
    --type=time --target="2025-11-01 12:34:56+00" --target-action=promote \
    restore
# start postgres (now configured to replay WAL up to that time)
sudo systemctl start postgresql
```
pgBackRest will create the restore_command and recovery parameters so PostgreSQL can fetch WAL from the configured repo during startup.
PITR with wal-g (pattern):
```shell
# fetch base backup
wal-g backup-fetch /var/lib/postgresql/data LATEST
# configure restore_command to fetch WAL segments
echo "restore_command = 'wal-g wal-fetch %f %p'" >> /var/lib/postgresql/data/postgresql.auto.conf
# create recovery.signal (Postgres 12+)
touch /var/lib/postgresql/data/recovery.signal
chown -R postgres:postgres /var/lib/postgresql/data
pg_ctl -D /var/lib/postgresql/data start
```
wal-g supports wal-fetch for restore_command and backup-fetch for base restore.
Partial restores and the pragmatic pattern:
Restore the full cluster to an isolated host, dump just the objects you need (pg_dump -t schema.table), then import into the primary. Tools such as pgBackRest offer --db-include to limit what files are restored, and wal-g has an experimental --restore-only option for database-level partial restores, but the safe, proven model is the isolated restore + logical dump.
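Once the isolated restore host is up, the pattern reduces to two commands; a sketch (hostnames, database, and table names are assumptions):

```shell
# Dump just the needed table from the isolated restore host...
pg_dump -h restore-host -d appdb -t app.orders -Fc -f /tmp/orders.dump
# ...then load it into the primary (drop or rename the damaged table first as needed).
pg_restore -h primary-host -d appdb --no-owner /tmp/orders.dump
```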
Verification steps in every restore:
Watch recovery progress; check server logs for missing-segment errors and for confirmation that recovery_target_time was reached.
Automation turns theory into safety. These are the automation items I run in production-grade fleets.
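Concretely, verification can start with a couple of psql probes against the restored instance (port 5433 and the table name are assumptions):

```shell
# Is the server still replaying WAL, and how far has replay progressed?
psql -p 5433 -At -c "SELECT pg_is_in_recovery();"
psql -p 5433 -At -c "SELECT pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp();"
# Spot-check business data near the recovery target (hypothetical table)
psql -p 5433 -At -c "SELECT max(created_at) FROM important_table;"
```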
Monitoring primitives (minimum set):
- Backup age: ccp_backrest_last_full_backup_time_since_completion_seconds; alert when it exceeds your RPO threshold.
- Archive integrity: wal-g wal-show / wal-verify, or pgBackRest info showing missing WAL segments.
- Repository capacity: pgbackrest info --output json (or wal-g metadata) to feed repo capacity dashboards.
- Restore testing: a restore_success metric from your automated restore drills.
Sample Prometheus alert (pgBackRest + pgMonitor metrics):
```yaml
- alert: FullBackupTooOld
  expr: ccp_backrest_last_full_backup_time_since_completion_seconds > 86400  # 24h
  labels:
    severity: critical
  annotations:
    summary: "Full backup older than 24h for stanza {{ $labels.stanza }}"
```
pgMonitor and exporters translate pgBackRest/wal-g repo info into metrics you can alert on.
Automated restore testing (scripting pattern)
- Fetch the latest base backup (backup-fetch) and populate restore_command.
- Create recovery.signal (touch recovery.signal for PG >= 12).
Example minimalist test-restore script using wal-g (Bash):
```shell
#!/usr/bin/env bash
set -euo pipefail
export WALG_S3_PREFIX="s3://my-bucket/pg"
export AWS_ACCESS_KEY_ID="XXX"
export AWS_SECRET_ACCESS_KEY="YYY"
DATA=/tmp/pg_restore_test
rm -rf "$DATA"
mkdir -p "$DATA"
# fetch latest base backup
wal-g backup-fetch "$DATA" LATEST
# recovery settings: use wal-g to fetch WAL
cat >> "$DATA/postgresql.auto.conf" <<'EOF'
restore_command = 'wal-g wal-fetch %f %p'
recovery_target_time = '2025-12-01 00:00:00+00'  # example target
recovery_target_action = 'promote'               # leave recovery once the target is reached
EOF
touch "$DATA/recovery.signal"
chown -R postgres:postgres "$DATA"
# start Postgres on a side port/socket so the test cannot collide with production
PGDATA="$DATA" pg_ctl -w -D "$DATA" -o "-p 5433 -k /tmp" start
# run verification queries (example)
psql -h /tmp -p 5433 -d postgres -At -c "SELECT count(*) FROM important_table;" \
  || { echo "verification failed"; exit 2; }
pg_ctl -D "$DATA" stop
echo "restore-test succeeded"
```
Run this in CI weekly (or after any backup-critical change). wal-g (backup-fetch) and pgBackRest (restore) both produce logs you can assert on.
Important: Automated restores are non-optional. A backup that has never been restored is not a backup — it’s a liability. Schedule restore tests, record success rates, and measure the time-to-usable-data as your RTO metric.
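Measuring time-to-usable-data is a one-line wrapper around the restore test; a minimal sketch (restore_test.sh is a stand-in name for your actual test script, replaced here by sleep so the sketch runs anywhere):

```shell
# Record wall-clock RTO for the restore drill and emit it as a metric line.
start=$(date +%s)
sleep 1                      # stand-in for: ./restore_test.sh
end=$(date +%s)
echo "rto_seconds $(( end - start ))"
```

Push the resulting value to your metrics pipeline so RTO trends are visible alongside the restore_success metric.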
Pre-flight checklist (before enabling archiving on production)
- Confirm wal_level = replica and archive_mode = on are acceptable for your workload (both require a restart).
Quick bootstrap (wal-g pattern)
- Install wal-g and place credentials in something like /etc/wal-g.d/env.
- Set archive_command = 'envdir /etc/wal-g.d/env wal-g wal-push %p' and keep a restore_command template ready for recoveries.
```shell
# as postgres user
wal-g backup-push $PGDATA
wal-g wal-show
wal-g wal-verify integrity
```
- Schedule backup-push (e.g., a weekly full) plus hourly incrementals if you use tool-specific incrementals.
Quick bootstrap (pgBackRest pattern)
- Install pgBackRest, create a stanza, and configure repository paths in /etc/pgbackrest/pgbackrest.conf.
- Set archive_command = 'pgbackrest --stanza=demo archive-push %p' in postgresql.conf.
```shell
sudo -u postgres pgbackrest --stanza=demo backup
sudo -u postgres pgbackrest --stanza=demo info
```
- Set repo1-retention-full, repo1-retention-diff, and archive-async as needed, and validate pgbackrest info output.
Minimal verification checklist for every backup:
- Backup command exits 0 with concise logs.
- info shows the new backup and its WAL start/stop LSNs.
- Time since the last WAL push is below your RPO threshold (monitoring metric).
Short automation snippets:
- Cron or systemd timers for scheduled incrementals (e.g., pgBackRest --type=incr runs).
Final operational tips that matter:
- Pin backups you must keep indefinitely so retention cannot expire them (mark them permanent in wal-g, or use repo*-retention with a high count in pgBackRest).
Sources:
- PostgreSQL: Continuous Archiving and Point-in-Time Recovery (PITR) - official PostgreSQL documentation describing WAL archiving, archive_command, restore_command, base backup requirements, and recovery target settings used for PITR.
- WAL-G for PostgreSQL (Read the Docs) - wal-g usage for backup-push, backup-fetch, wal-push/wal-fetch, and features like reverse-delta unpack and partial-restore options.
- pgBackRest User Guide - pgBackRest concepts: full/diff/incr backups, the --delta restore option, retention flags (repo1-retention-*), and archive-push/archive-get integration.
- Azure Backup glossary (RPO/RTO definitions) - clear definitions of RPO and RTO and how they drive backup design.
- pgMonitor exporter (Crunchy Data) - Backup Metrics - recommended Prometheus metrics for tracking pgBackRest backups and repository health.
- pgbackrest_exporter (GitHub) - Prometheus exporter that scrapes pgbackrest info and exposes backup metrics for alerting and dashboards.
- Managing the lifecycle of objects - Amazon S3 User Guide - S3 lifecycle rules and considerations (transition to Glacier/Deep Archive, minimum-storage-duration caveats).