Incremental-Forever Backup Architecture for PostgreSQL


Design an incremental-forever backup system for PostgreSQL that minimizes RPO, reduces storage, and enables reliable point-in-time recovery.

  • Why incremental-forever outguns nightly fulls for RPO/RTO
  • Essential components: base backups, WAL streaming, and durable storage
  • Retention, pruning and storage optimizations that actually save dollars
  • Restore playbook: fast PITR and practical partial restores
  • Automation, monitoring, and automated restore testing
  • Practical application: checklists and scripts you can run today

Incremental-forever changes the economics of PostgreSQL backups: take one full snapshot up front, then ship a continuous stream of small, reliable increments tied to WAL, and sub-hour (often sub-minute) RPOs become realistic without multiplying storage or restore time. This is the pattern you operate when you treat the WAL as the source of truth and automate every step from archive to verification.

The symptoms I see in the field are consistent: teams run heavyweight fulls because nightly schedules feel safer, then hit exploding storage bills and long restore windows; others enable WAL archiving but treat the archive as “write-only” and never prove restores, which destroys confidence when an incident arrives. Without continuous WAL capture you cannot reliably perform point-in-time recovery (PITR) — PostgreSQL requires a base backup plus the matching WAL stream for PITR and the server’s archive_command / restore_command plumbing must be correct.

Why incremental-forever outguns nightly fulls for RPO/RTO

A traditional nightly-full plan makes your RPO equal to the backup cadence (e.g., 24 hours) and multiplies storage by the number of fulls you keep. Incremental-forever flips the trade: one full backup, then store only changed blocks + WAL. That reduces data written per job, shortens windows, and keeps storage growth roughly linear with change-rate rather than with retention-count.

  • The fundamental enabler for sub-hour RPOs is continuous WAL capture (archive or streaming), because WAL carries the minimal, ordered set of changes needed to roll a base backup forward to an exact timestamp.
  • RPO and RTO are distinct design constraints: RPO dictates how often you must snapshot or ship WAL; RTO dictates how quickly you must fetch base + WAL and validate the restore. Use RPO to size your WAL persistence, use RTO to size your fetch/restore pipeline and test cadence.
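A quick sketch of how archive_timeout bounds the worst-case delay before committed WAL reaches the archive (and thus your data-loss window) under archive-based capture. The WAL generation rate below is an assumed illustration, not a measurement:

```shell
#!/usr/bin/env bash
# Worst-case delay before a committed transaction's WAL reaches the archive.
# Assumes the standard 16 MB WAL segment size; the WAL rate is illustrative.
segment_mb=16
wal_rate_kb_s=100          # ~0.1 MB/s on a quiet system (assumed)
archive_timeout_s=60

fill_s=$(( segment_mb * 1024 / wal_rate_kb_s ))        # time to fill one segment
worst_case_s=$(( fill_s < archive_timeout_s ? fill_s : archive_timeout_s ))
echo "segment fill: ${fill_s}s, worst-case archive delay: ${worst_case_s}s"
```

On busy systems the segment fills long before the timeout fires; on quiet systems archive_timeout is what caps your effective RPO.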

Example (simple math that your CFO understands):

  • Base backup: 1.0 TB
  • Average daily changed data (block-level): 10 GB/day
  • Retention: 30 days
  Strategy                                    Stored data after 30 days
  Daily fulls (30 fulls kept)                 30 × 1.0 TB = 30 TB
  Weekly full + diffs                         4 × 1.0 TB + 26 × ~10 GB ≈ 4.26 TB
  Incremental-forever (1 full + increments)   1.0 TB + 30 × 10 GB = 1.3 TB

The cost math and operational surface both favor incremental-forever when your daily change rate is small relative to the full size.
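The storage comparison above can be checked with trivial shell arithmetic (same inputs, all in GB):

```shell
#!/usr/bin/env bash
# Back-of-envelope storage math for the three strategies, in GB.
base_gb=1000          # 1.0 TB base backup
daily_change_gb=10    # average block-level change per day
days=30

daily_fulls=$(( days * base_gb ))                      # 30 fulls kept
weekly=$(( 4 * base_gb + 26 * daily_change_gb ))       # 4 fulls + 26 diffs
incr_forever=$(( base_gb + days * daily_change_gb ))   # 1 full + 30 increments

echo "daily fulls:         ${daily_fulls} GB"
echo "weekly full + diffs: ${weekly} GB"
echo "incremental-forever: ${incr_forever} GB"
```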

Essential components: base backups, WAL streaming, and durable storage

A robust incremental-forever architecture for PostgreSQL has three minimal pieces that must be engineered together:

  1. Base backup (the initial full): create one consistent physical base using pg_basebackup or a vendor tool that integrates with PostgreSQL’s backup API. pg_basebackup writes a manifest and coordinates WAL handling for you; tools such as wal-g and pgBackRest provide higher-level integration for pushing the base to object storage.

  2. WAL streaming/archive (continuous change capture): set wal_level = replica (or higher), enable archive_mode = on, and use an archive_command that reliably transfers completed WAL segments to durable storage. For streaming replication use replication slots to avoid premature WAL removal; for archive mode configure archive_timeout to bound the delay between transaction commit and WAL availability. These settings are the backbone of PITR.

  3. Durable object storage and a repository format: store base backups and WAL in a versioned, durable object repository (S3/GCS/Azure or equivalent). Tools like wal-g can backup-push and wal-push directly to S3/GCS; pgBackRest supports multi-repo strategies and has strong retention/expire semantics for WAL and backups.

Concrete config examples (short snippets):

postgresql.conf (core WAL settings)

# essential
wal_level = replica
archive_mode = on
archive_timeout = 60          # seconds — force a switch on low-traffic systems
max_wal_senders = 5
# archive_command examples:
# wal-g
archive_command = 'envdir /etc/wal-g.d/env wal-g wal-push %p'
# pgBackRest
# archive_command = 'pgbackrest --stanza=demo archive-push %p'

Those archive_command forms are standard integration points for wal-g and pgBackRest.

A standard run: take the base backup once (or weekly), then continuously wal-push each WAL segment as PostgreSQL completes it. The archive is your point-in-time data stream.

Retention, pruning and storage optimizations that actually save dollars

Retention policy must align with your RPO window, legal retention, and restore window you are willing to accept. Two categories exist: backup-object retention (how many/which base backups to keep) and WAL retention (how long WAL is kept and which WAL segments are necessary to restore to a particular base).

  • pgBackRest exposes repo*-retention-* options such as repo1-retention-full, repo1-retention-diff and repo1-retention-archive to express retention as counts or days; expirations remove backups and their dependent WAL segments atomically.
  • wal-g provides delete retain semantics to prune backups and relies on WAL meta to expire WAL safely; wal-g also documents features like reverse-delta unpack and redundant-archive skipping to reduce restore I/O.

Space optimization levers (what to tune and why):

  • Compression: use zstd or lz4 for balanced CPU vs size (pgBackRest supports compress-type and compress-level).
  • Block-level incremental or checksum delta: pgBackRest’s --delta option (used on restore or backup) leverages checksums to skip unchanged files; this dramatically reduces I/O during restore/backup in many environments.
  • Reverse-delta and tar composition: wal-g supports reverse delta unpack and rating composer modes to place frequently changing files into separate tarballs to speed targeted restores.
  • Object storage lifecycle: once a backup/WAL region ages past frequent restore windows, transition it to cheaper archival tiers (Glacier, Deep Archive) using S3 lifecycle rules. Account for minimum storage durations and transition request costs.
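As a sketch of the lifecycle point: the rule below transitions aged backup/WAL objects to colder tiers. The bucket name, prefix, and day thresholds are illustrative assumptions; apply it with the aws CLI once adapted to your repo layout:

```shell
#!/usr/bin/env bash
# Write an S3 lifecycle policy that moves aged backup/WAL objects to
# Glacier tiers. Prefix "pg/" and the day thresholds are assumptions.
cat > /tmp/pg-backup-lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "archive-old-pg-backups",
      "Filter": { "Prefix": "pg/" },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 90,  "StorageClass": "GLACIER" },
        { "Days": 365, "StorageClass": "DEEP_ARCHIVE" }
      ]
    }
  ]
}
EOF
# apply with:
# aws s3api put-bucket-lifecycle-configuration \
#   --bucket my-bucket --lifecycle-configuration file:///tmp/pg-backup-lifecycle.json
echo "wrote /tmp/pg-backup-lifecycle.json"
```

Remember the minimum-storage-duration charges on Glacier tiers before setting aggressive transition windows.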

Example retention matrix (illustrative):

  • Keep hourly increments for 48 hours (fast recovery during immediate incidents).
  • Keep daily point-in-time for 14 days.
  • Keep weekly full synthetic/retained images for 12 weeks.
  • Archive monthly fulls to cold storage for 7 years (regulatory needs).

How to compute required WAL retention:

  • Keep WAL until the latest point you might need to recover to (earliest base backup you will keep) plus a safety margin for delays. In practice, expire WAL only when pgBackRest/wal-g confirms that a retained full (or synthetic full) no longer needs the earlier WAL.
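In practice the expiry rule reduces to comparing each archived segment against the WAL start of the oldest retained base backup. A minimal sketch, assuming a single timeline (PostgreSQL's 24-hex-digit WAL file names sort lexicographically within one timeline; the segment names here are hypothetical):

```shell
#!/usr/bin/env bash
# Decide whether an archived WAL segment is still needed for PITR.
# oldest_needed = WAL start segment of the oldest retained base backup
# (hypothetical value; single-timeline assumption).
oldest_needed="000000010000000A00000030"

can_expire() {
  local seg="$1"
  if [[ "$seg" < "$oldest_needed" ]]; then
    echo "expire $seg"    # older than any retained base backup needs
  else
    echo "keep $seg"      # still required to roll a retained base forward
  fi
}

can_expire 000000010000000A0000002F
can_expire 000000010000000A00000031
```

Real deployments should let pgBackRest expire or wal-g delete make this decision, since they also handle timeline switches and backup dependencies.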

Restore playbook: fast PITR and practical partial restores

A restore plan must be explicit and automated. There are three restore patterns you will use repeatedly:

  1. Full cluster restore to a timestamp (PITR).
  2. Restore-to-standby for reporting or verification (standby recovery).
  3. Partial (table/DB) restores achieved by restoring a cluster to an isolated host and extracting logical data.

PITR (physical) with pgBackRest (example):

# restore to a point in time and auto-generate recovery settings (pgBackRest will write recovery config)
sudo -u postgres pgbackrest --stanza=demo --delta \
  --type=time --target="2025-11-01 12:34:56+00" --target-action=promote \
  restore
# start postgres (now configured to replay WAL up to that time)
sudo systemctl start postgresql

pgBackRest will create the restore_command and recovery parameters so PostgreSQL can fetch WAL from the configured repo during startup.

PITR with wal-g (pattern):

# fetch base backup
wal-g backup-fetch /var/lib/postgresql/data LATEST
# configure restore_command to fetch WAL segments
echo "restore_command = 'wal-g wal-fetch %f %p'" >> /var/lib/postgresql/data/postgresql.auto.conf
# create recovery.signal (Postgres 12+)
touch /var/lib/postgresql/data/recovery.signal
chown -R postgres:postgres /var/lib/postgresql/data
pg_ctl -D /var/lib/postgresql/data start

wal-g supports wal-fetch for restore_command and backup-fetch for base restore.

Partial restores and the pragmatic pattern:

  • A physical backup cannot “inject” a single table into a running primary. The practical flow: restore the physical backup to an isolated host (or an ephemeral container), start it in recovery mode up to the desired PITR, run logical export (e.g., pg_dump -t schema.table), then import into primary. Tools such as pgBackRest offer --db-include to limit what files are restored, and wal-g has an experimental --restore-only for database-level partial restores, but the safe, proven model is the isolated restore + logical dump.

Verification steps in every restore:

  • Confirm the backup set’s WAL coverage up to the target LSN/time before restore.
  • Start PostgreSQL and watch for recovery progress; check server logs for missing segment errors and recovery_target_time success.
  • Run application-level smoke queries and checksums to validate business data integrity.

Automation, monitoring, and automated restore testing

Automation turns theory into safety. These are the automation items I run in production-grade fleets.

Monitoring primitives (minimum set):

  • Time since last successful backup (full/diff/incr) per stanza. Metric example from pgMonitor: ccp_backrest_last_full_backup_time_since_completion_seconds. Alert when > your RPO threshold.
  • WAL archive health: detect gaps in WAL archive (wal-g wal-show/wal-verify or pgBackRest info showing missing WAL segments).
  • Repository size and growth rate: use pgbackrest info --output=json (or wal-g metadata) to feed repo capacity dashboards.
  • Restore test success rate: a synthetic pipeline should run a restore into an ephemeral host and report restore_success metric.

Sample Prometheus alert (pgBackRest + pgMonitor metrics):

- alert: FullBackupTooOld
  expr: ccp_backrest_last_full_backup_time_since_completion_seconds > 86400  # 24h
  labels:
    severity: critical
  annotations:
    summary: "Full backup older than 24h for stanza {{ $labels.stanza }}"

pgMonitor and exporters translate pgBackRest/wal-g repo info into metrics you can alert on.

Automated restore testing (scripting pattern)

  1. Provision ephemeral test host (VM / container) with the same Postgres minor version.
  2. Fetch the base backup (wal-g backup-fetch or pgbackrest restore) and populate restore_command.
  3. Start Postgres in recovery mode (touch recovery.signal for PG >=12).
  4. Wait for recovery completion; run a set of deterministic verification queries (row counts, known checksums).
  5. Publish result to CI and to your monitoring system.

Example minimalist test-restore script using wal-g (Bash):

#!/usr/bin/env bash
set -euo pipefail
export WALG_S3_PREFIX="s3://my-bucket/pg"
export AWS_ACCESS_KEY_ID="XXX"
export AWS_SECRET_ACCESS_KEY="YYY"

DATA=/tmp/pg_restore_test
rm -rf "$DATA"
mkdir -p "$DATA"

# fetch latest base backup
wal-g backup-fetch "$DATA" LATEST

# recovery settings: use wal-g to fetch WAL
cat >> "$DATA/postgresql.auto.conf" <<'EOF'
restore_command = 'wal-g wal-fetch %f %p'
recovery_target_time = '2025-12-01 00:00:00+00'  # example target
recovery_target_action = 'promote'               # open for queries once the target is reached
EOF
touch "$DATA/recovery.signal"
chown -R postgres:postgres "$DATA"

# start Postgres on a spare port so it cannot clash with a live cluster,
# and wait (-w) for startup/recovery to complete
sudo -u postgres pg_ctl -w -D "$DATA" -o "-p 5433" start
# run verification queries (example; port matches the one above)
sudo -u postgres psql -p 5433 -At -c "SELECT count(*) FROM important_table;" \
  || { echo "verification failed"; exit 2; }
sudo -u postgres pg_ctl -D "$DATA" stop
echo "restore-test succeeded"

Run this in CI weekly (or after any backup-critical change). wal-g and pgBackRest both support backup-fetch and will produce logs you can assert on.

Important: Automated restores are non-optional. A backup that has never been restored is not a backup — it’s a liability. Schedule restore tests, record success rates, and measure the time-to-usable-data as your RTO metric.

Practical application: checklists and scripts you can run today

Pre-flight checklist (before enabling archiving on production)

  • Have reliable object storage credentials and service limits validated.
  • Ensure wal_level = replica and archive_mode = on are acceptable for your workload.
  • Confirm you have monitoring (Prometheus + dashboard) and alerting for WAL gap and backup age.

Quick bootstrap (wal-g pattern)

  1. Install wal-g and place credentials in something like /etc/wal-g.d/env.
  2. Set archive_command = 'envdir /etc/wal-g.d/env wal-g wal-push %p' and restore_command template for recoveries.
  3. Run initial base backup:
# as postgres user
wal-g backup-push $PGDATA
  4. Verify WAL archive health:
wal-g wal-show
wal-g wal-verify integrity
  5. Add periodic backup-push (e.g., weekly full) and hourly incremental scheduling if you use tool-specific incrementals.

Quick bootstrap (pgBackRest pattern)

  1. Install pgBackRest, create a stanza and configure repository paths in /etc/pgbackrest/pgbackrest.conf.
  2. Configure archive_command = 'pgbackrest --stanza=demo archive-push %p' in postgresql.conf.
  3. Run:
sudo -u postgres pgbackrest --stanza=demo backup
sudo -u postgres pgbackrest --stanza=demo info
  4. Configure repo1-retention-full, repo1-retention-diff, and archive-async as needed and validate pgbackrest info output.

Minimal verification checklist for every backup:

  • Backup command exits 0 and its logs are free of warnings/errors.
  • Repository info shows the new backup and its WAL start/stop LSN.
  • Time since last WAL pushed is below your RPO threshold (monitoring metric).
  • Periodic restore test completed within RTO budget and smoke queries pass.

Short automation snippets

  • Cron job (example): hourly incremental + weekly base (or automated pgBackRest --type=incr runs).
  • Systemd timer for restore-test container, run weekly, post metric to Prometheus pushgateway.
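A cron.d sketch for the first bullet (stanza name and times are illustrative; real fleets should add locking and failure alerting):

```shell
# /etc/cron.d/pgbackrest-backups — illustrative schedule (stanza "demo" assumed)
# weekly full, Sunday 02:00
0 2 * * 0   postgres  pgbackrest --stanza=demo --type=full backup
# hourly incremental at minute 15
15 * * * *  postgres  pgbackrest --stanza=demo --type=incr backup
```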

Final operational tips that matter:

  • Rotate and test credentials for object storage.
  • Track the last available WAL LSN and alert if you can’t reach the needed WAL for your oldest retained base.
  • Preserve at least one permanent full backup for disaster scenarios (wal-g backup-mark to pin a backup as permanent, or a generous repo*-retention-full count in pgBackRest).

Sources:
PostgreSQL: Continuous Archiving and Point-in-Time Recovery (PITR) - Official PostgreSQL documentation describing WAL archiving, archive_command, restore_command, base backup requirements and recovery target settings used for PITR.

WAL-G for PostgreSQL (Read the Docs) - wal-g usage for backup-push, backup-fetch, wal-push/wal-fetch, features like reverse-delta unpack and partial restore options.

pgBackRest User Guide - pgBackRest concepts: full/diff/incr backups, --delta restore option, retention flags (repo1-retention-*), and archive-push/archive-get integration.

Azure Backup glossary (RPO/RTO definitions) - clear definitions of RPO and RTO and how they drive backup design.

pgMonitor exporter (Crunchy Data) — Backup Metrics - recommended Prometheus metrics for tracking pgBackRest backups and repository health.

pgbackrest_exporter (GitHub) - Prometheus exporter that scrapes pgbackrest info and exposes backup metrics for alerting and dashboards.

Managing the lifecycle of objects — Amazon S3 User Guide - S3 lifecycle rules and considerations (transition to Glacier/Deep Archive, minimum-storage-duration caveats).