Project:Infrastructure/Backups v3

= Planning new backups =

Infra used to have good backups, but all past backups were lost along with multiple sponsors: we had two copies of the data, one in Europe and one in the USA, and both were lost for different reasons within a short time frame.

We should get backups going again, and cover EVERYTHING in future, and keep a redundant copy of the backups in the cloud, since we lost multiple hosts before :-(.

Backups will probably be sent to a backup server "nearby" (in the same continent), and archived from there to multiple cloud locations. The loss of a backup server will require at least restoring metadata from the other server or cloud, but hopefully not full data from the cloud.

https://3ofcoins.net/2013/11/14/backups-suck-a-rant/

Requirements

 * MUST: be open-source (including server-side if any)
 * MUST: provide a CLI
 * MUST: True incremental
   * with garbage collection
 * MUST: Encryption
   * SHOULD: encryption should be optional, as we have some public data
 * MUST: Safe unattended backup
   * MUST NOT: require disclosing the passphrase or key data in order to generate new backups
 * MUST: Compression
   * SHOULD: be optional, as some repos already consist of pre-compressed data (e.g. distfiles, releases)
 * MUST: provide validation of backups
 * MUST: Support External/off-host storage
   * MUST NOT: require a complete local copy of the data to be retained


 * SHOULD: support multiple threads
   * Some data sets for backup are very large, but change slowly, and we have lots of IO behind them
   * Having multiple readers (of files to backup)/writers (of backup data) would massively improve performance.
 * SHOULD: support permanent/forever retention
   * with purging on a per-file basis.
 * SHOULD: Support cloud storage (AWS-S3, Ceph-S3, other).
   * This can be in the form of simply syncing the off-host storage to the cloud
   * AWS S3-IA/Glacier are $0.0125 USD/GB/month or less.
   * rsync.net has a Borg/Attic option at $0.03 USD/GB/month (http://rsync.net/products/attic.html)
   * Maybe Rackspace account...
   * Maybe free Ceph-S3 from Dreamhost...
 * SHOULD: Scale to >2TB single repos (releases+distfiles historical)
   * Attic had a corruption issue at scale: http://librelist.com/browser/attic/2015/3/31/comparison-of-attic-vs-bup-vs-obnam/#cbbe599389a20c787a74b137dc78fb1a
 * SHOULD: Provide de-duplication
   * PREFERENCE:
     * sliding-window
     * fixed-block
     * file-based
   * WANT: de-dupe data from different hosts.
 * SHOULD: bundle small files into blobs
 * SHOULD: provide a catalog of files
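
The de-duplication preference above lists sliding-window ahead of fixed-block. The following is a minimal Python sketch of why that ordering matters, not the algorithm of any tool listed on this page. The window size, mask, polynomial hash and all function names are illustrative assumptions. It chunks the same random data before and after a single byte is inserted at the front, then counts how many chunks could be reused.

```python
import hashlib
import random

WINDOW = 16            # bytes covered by the rolling hash (illustrative value)
MASK = (1 << 6) - 1    # cut where the low 6 hash bits are zero
BASE = 257
MOD = (1 << 31) - 1    # prime modulus for the polynomial hash
BASE_POW = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window


def fixed_chunks(data, size=64):
    """Fixed-block chunking: cut every `size` bytes regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def sliding_chunks(data):
    """Sliding-window chunking: cut where the hash of the last WINDOW bytes matches."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            # Drop the byte that just left the window.
            h = (h - data[i - WINDOW] * BASE_POW) % MOD
        h = (h * BASE + byte) % MOD
        # Enforce a minimum chunk length of WINDOW bytes.
        if i + 1 - start >= WINDOW and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def shared_chunks(old, new):
    """Number of chunks in `new` whose digest is already stored for `old`."""
    known = {hashlib.sha256(c).digest() for c in old}
    return sum(hashlib.sha256(c).digest() in known for c in new)


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(20000))
modified = b"!" + original  # one byte inserted at the front shifts everything

fixed_reused = shared_chunks(fixed_chunks(original), fixed_chunks(modified))
sliding_reused = shared_chunks(sliding_chunks(original), sliding_chunks(modified))
```

In this toy setup the fixed-block chunker should find essentially nothing to reuse, because every block boundary moved by one byte. The sliding-window chunker picks its boundaries from the content itself, so it should fall back into step shortly after the insertion. File-based de-duplication would treat the whole modified file as new.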

Known Backup software
[Arch Linux Sync & Backup programs] contains a good feature comparison list.

Incremental by-chunk

 * Arq (closed source, included for comparison)
 * Attic
   * Borg fork is actively maintained
   * See also borgmatic, Atticmatic
 * bup
 * ddar
 * Duplicati
 * obnam
 * tarsnap (closed source, included for comparison)
 * ZBackup
   * See also https://github.com/davidbartonau/zbackup-tar
 * Rdedup
   * https://github.com/dpc/rdedup
 * Backup v4.x
   * https://github.com/backup/backup
 * backup
   * https://github.com/marfarma/backup
Incremental by-file

 * duplicity
 * rdiff-backup
 * rsnapshot

Snapshot

 * Amanda
 * BackupPC
 * Bacula
   * See fork Bareos
 * Dirvish
 * UrBackup

Other

 * Unison
 * git-annex
 * SyncThing

TODO

 * btar
 * DAR
 * Restic
 * Snebu

Wrappers

 * Backupninja
   * Wraps duplicity & rdiff-backup
 * Burp
   * librsync
 * Deja-dup
   * Wraps duplicity
   * Includes cloud targets for storing
 * deltaic
 * SafeKeep
   * Wrapper for rdiff-backup

Misc Ideas

 * Combine:
   * obnam (with encryption)
   * git-annex the obnam repo to S3 & alternate hosts.
   * Put extra catalogs in git-annex


 * Combine:
   * borg-backup
   * rclone to send data over to S3 & alternate hosts
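
A rough sketch of how the borg-backup plus rclone combination could be driven from Python, assuming Borg 1.x command syntax. The repo path, rclone remote name and function names are hypothetical placeholders, not an agreed layout. The command lists are built separately from execution so they can be inspected without running anything.

```python
import subprocess


def build_backup_commands(repo, paths, remote):
    """Return the argv lists for one backup run (nothing is executed here)."""
    # Borg 1.x expands the {hostname} and {now} placeholders itself.
    archive = f"{repo}::{{hostname}}-{{now}}"
    create = ["borg", "create", "--compression", "lz4", archive, *paths]
    # Mirror the finished repo to the cloud remote configured in rclone.
    sync = ["rclone", "sync", repo, remote]
    return [create, sync]


def run_backup(repo, paths, remote, runner=subprocess.run):
    """Run the backup first, then the upload; stop at the first failure."""
    for argv in build_backup_commands(repo, paths, remote):
        runner(argv, check=True)
```

For example, `run_backup("/srv/backup/borg", ["/etc", "/home"], "s3remote:infra-backups")` would create an archive and then mirror the repo. Running the two steps one after the other avoids uploading a repo that is still being written. Note that `rclone sync` deletes destination files that are missing from the source, so `rclone copy` may be the safer choice if a damaged local repo must not propagate to the cloud copy.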


 * Combine:
   * zbackup-tar
   * Run both a per-filesystem tar
   * AND a tar for each actual file in the system (ignore symlinks, dirs)
   * Should de-dupe superbly
   * Should allow intelligent retention.

Review notes

 * Attic & Borg need to be ruled out because they require clients to take an exclusive lock on the repo for the entire backup. We could work around this by having a repo for each thing, but then we don't get de-duplication between hosts.
 * obnam works so far, but is slow, even with the recommended performance tuning. It seems to be limited by its bad gpg usage.
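
To make the cost of the one-repo-per-host workaround from the Attic & Borg note concrete, here is a small Python sketch. The chunk contents and host names are invented for illustration, and the model ignores compression and metadata. It compares the bytes stored when each host de-duplicates only within its own repo against a single shared repo.

```python
import hashlib


def stored_bytes(repos):
    """Total bytes kept when each repo de-duplicates only within itself."""
    total = 0
    for chunks in repos:
        # One entry per distinct chunk digest inside this repo.
        unique = {hashlib.sha256(c).digest(): len(c) for c in chunks}
        total += sum(unique.values())
    return total


# Hypothetical chunk lists: two hosts sharing the same base system files.
base = [b"libc" * 100, b"python" * 100]
host_a = base + [b"a-config"]
host_b = base + [b"b-config"]

per_host = stored_bytes([host_a, host_b])   # one repo per host
shared = stored_bytes([host_a + host_b])    # one repo shared by both hosts
```

With separate repos the common base chunks are stored once per host. With a shared repo they are stored once in total, which is the saving the per-host workaround gives up.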