RAID is not a backup. A mirror survives a dead disk and nothing else — not a fat-fingered rm, not a bad upgrade, not a fire. Real protection is layered.
The homelab follows the 3-2-1 rule — three copies of the data, on two kinds of media, with one copy off-site — built up as four layers that each survive a bigger failure than the last. ZFS makes this cheap: snapshots are near-free until data changes, and zfs send ships only the blocks that moved.

Four layers of defense

Each layer is strictly stronger than the one before it. The first one barely counts as a backup at all — it is in the table to make the point.
| Layer | What it is | Survives | Does not survive |
| --- | --- | --- | --- |
| 0 | In-pool redundancy (mirror / raidz) | A failed disk | Deletion, corruption, node loss |
| 1 | Local snapshots | A bad change, an accidental delete | Pool or node death |
| 2 | Cross-node replication | Loss of a whole node | Site loss (fire, theft, flood) |
| 3 | Off-site / offline copy | Site loss, ransomware | (the last line) |
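One way to read the table is as a lookup: for a given failure, which is the lowest layer that still holds a good copy? The sketch below encodes the "Survives" column as data. The failure labels (`disk_failure`, `node_loss`, and so on) and the function name `first_surviving_layer` are illustrative choices for this example, not names from any tool.

```python
from typing import Optional

# (layer number, what it is, failures it is the first layer to survive).
# The failure keys are hypothetical labels derived from the table's
# "Survives" column.
LAYERS = [
    (0, "In-pool redundancy", {"disk_failure"}),
    (1, "Local snapshots", {"bad_change", "accidental_delete"}),
    (2, "Cross-node replication", {"node_loss"}),
    (3, "Off-site / offline copy", {"site_loss", "ransomware"}),
]


def first_surviving_layer(failure: str) -> Optional[int]:
    """Return the lowest layer that survives `failure`, or None if no layer does."""
    for number, _name, survives in LAYERS:
        if failure in survives:
            return number
    return None


if __name__ == "__main__":
    for failure in ("disk_failure", "accidental_delete", "node_loss", "ransomware"):
        print(failure, "-> layer", first_surviving_layer(failure))
```

Because each layer is stronger than the one before it, the returned number can be read as "this layer and every layer above it".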

What replicates, and what doesn’t

Not everything earns a second copy. Replication costs bandwidth, disk, and snapshot retention on the far side, so it is reserved for data that is irreplaceable or is the system of record. Everything else gets local snapshots only — enough to undo a mistake, but never shipped across the wire.
| Tier | What it covers | Policy |
| --- | --- | --- |
| Replicate aggressively | Configs, databases, secrets stores, irreplaceable media, system-of-record telemetry | Snapshot and replicate to a second node, then reach the offline copy |
| Snapshot-only | Scratch, transient downloads, re-downloadable model weights, queue buffers | Local snapshots only; flagged so replication skips them |
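The rule that decides which bucket a dataset lands in is simple: if the data is irreplaceable or is the system of record, it replicates; everything else is snapshot-only.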
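Going by the two criteria stated in this section (irreplaceable, or the system of record), the decision can be sketched as a small predicate. The `Dataset` type, its field names, and the dataset names in the example are hypothetical; how a dataset is actually flagged on disk is not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    name: str               # hypothetical dataset name, e.g. "tank/configs"
    irreplaceable: bool     # cannot be recreated or re-downloaded
    system_of_record: bool  # the authoritative copy of this data


def tier(ds: Dataset) -> str:
    """Classify a dataset into one of the two tiers from the table above."""
    if ds.irreplaceable or ds.system_of_record:
        return "replicate"
    return "snapshot-only"


if __name__ == "__main__":
    examples = [
        Dataset("tank/configs", irreplaceable=True, system_of_record=False),
        Dataset("tank/telemetry", irreplaceable=False, system_of_record=True),
        Dataset("tank/model-weights", irreplaceable=False, system_of_record=False),
    ]
    for ds in examples:
        print(ds.name, "->", tier(ds))
```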

How replication flows

Two always-on nodes replicate to each other on a nightly incremental schedule. Only changed blocks move, so even large datasets sync in seconds once seeded. A third node stays powered down most of the time. When it wakes, it pulls the latest snapshots from both always-on nodes, then shuts back off.

That offline window is a feature, not a gap. A node that is powered off is air-gapped: ransomware and a bad zfs destroy can’t reach it. And because the cold node pulls rather than being pushed to, a compromised primary has no standing credentials to corrupt the archive. The powered-down copy is the “1” in 3-2-1.
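To make the pull direction concrete, here is a minimal sketch of the pipeline the cold node might run for one dataset: it asks the source node to `zfs send` an incremental stream between two snapshots and feeds that stream into a local `zfs receive`. The host, dataset, and snapshot names are hypothetical, and the function only builds the command string without executing it. In this homelab syncoid does this work; the sketch only shows the shape of what it wraps.

```python
import shlex


def pull_command(source_host: str, dataset: str, prev_snap: str,
                 new_snap: str, target_dataset: str) -> str:
    """Build (not run) an incremental pull, as seen from the receiving node."""
    # On the source: send only the blocks that changed between the two snapshots.
    send = ["zfs", "send", "-i", f"{dataset}@{prev_snap}", f"{dataset}@{new_snap}"]
    # The receiver initiates the SSH session, so the source holds no
    # credentials for the archive.
    remote = ["ssh", source_host, shlex.join(send)]
    # Locally: apply the stream to the archive dataset.
    receive = ["zfs", "receive", target_dataset]
    return f"{shlex.join(remote)} | {shlex.join(receive)}"


if __name__ == "__main__":
    # Hypothetical names for the two always-on nodes.
    for host in ("node-a", "node-b"):
        print(pull_command(host, "tank/data", "nightly-1", "nightly-2",
                           f"cold/{host}/data"))
```

Building the command with `shlex.join` keeps the remote half quoted as a single SSH argument, so the pipe is interpreted on the receiving side rather than on the source.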

The toolchain

Each concern maps to one well-worn open-source tool. None of it is bespoke.
| Concern | Tool | Role |
| --- | --- | --- |
| Snapshot scheduling & retention | sanoid | Takes time-based snapshots and prunes them on an hourly / daily / monthly ladder |
| Incremental replication | syncoid | Wraps zfs send piped into zfs receive to ship only changed blocks between nodes |
| App-consistent VM/LXC backup | Proxmox Backup Server | Deduplicated, verifiable guest backups; complements raw ZFS send for things mid-write |
| Capacity alerting | ntfy | Pushes a notification when a pool crosses 50% / 75% / 90% |
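The capacity-alerting row implies edge-triggered behavior: notify when a pool crosses a threshold, not on every check while it sits above one. A minimal sketch of that logic follows, assuming two consecutive usage readings in percent. The function names and the topic URL are hypothetical; the notification itself uses ntfy's plain HTTP publish interface (POST the message body to a topic URL).

```python
import urllib.request
from typing import Optional, Sequence

THRESHOLDS = (50, 75, 90)  # percent, from the table above


def crossed_threshold(previous_pct: float, current_pct: float,
                      thresholds: Sequence[int] = THRESHOLDS) -> Optional[int]:
    """Return the highest threshold crossed upward between two readings, else None."""
    crossed = [t for t in thresholds if previous_pct < t <= current_pct]
    return max(crossed) if crossed else None


def notify(topic_url: str, pool: str, threshold: int) -> None:
    """Publish an alert to an ntfy topic. `topic_url` is a hypothetical
    address such as "https://ntfy.example.net/homelab-storage"."""
    request = urllib.request.Request(
        topic_url,
        data=f"Pool {pool} crossed {threshold}% capacity".encode("utf-8"),
        headers={"Title": "ZFS capacity alert"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10):
        pass  # the response body is not needed


if __name__ == "__main__":
    print(crossed_threshold(48.0, 52.0))  # a 50% crossing
    print(crossed_threshold(52.0, 60.0))  # no new crossing
```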
Snapshots and replication protect the filesystem; Proxmox Backup Server protects the guests (a database mid-transaction needs an application-consistent backup, not just a block snapshot). The two are complementary, not redundant.
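The hourly / daily / monthly ladder named in the table can be illustrated with a simplified pruning pass: keep the newest snapshot in each of the most recent N hours, days, and months, and drop the rest. This is a sketch of the general idea only, not sanoid's actual algorithm, and the retention counts in the example are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Set


def snapshots_to_keep(snapshots: Iterable[datetime], hourly: int,
                      daily: int, monthly: int) -> List[datetime]:
    """Return the snapshots a simple retention ladder would keep, newest first."""
    ordered = sorted(snapshots, reverse=True)
    keep: Set[datetime] = set()
    rungs = (("%Y-%m-%d %H", hourly), ("%Y-%m-%d", daily), ("%Y-%m", monthly))
    for bucket_format, limit in rungs:
        seen: List[str] = []
        for taken_at in ordered:
            bucket = taken_at.strftime(bucket_format)
            if bucket in seen:
                continue  # a newer snapshot already represents this bucket
            if len(seen) == limit:
                break     # this rung of the ladder is full
            seen.append(bucket)
            keep.add(taken_at)
    return [taken_at for taken_at in ordered if taken_at in keep]


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 0, 0)
    hourly_snaps = [start + timedelta(hours=h) for h in range(72)]  # three days
    for kept in snapshots_to_keep(hourly_snaps, hourly=2, daily=2, monthly=1):
        print(kept.isoformat())
```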

What this connects to

Homelab: The hardware the pools run on.

ansible-proxmox: Where sanoid, syncoid, and the ZFS roles are defined.

terraform-proxmox: Declares the nodes, pools, and the backup-server guest.

Infrastructure overview: How the Proxmox stack fits the rest of the homelab.