Project:Infrastructure/Backups v3

Planning new backups
Infra used to have good backups, but all past backups were lost along with multiple sponsor losses: we had two copies of the data, one in Europe and one in the USA, and both were lost, for different reasons, within a short time frame.

We should get backups going again, cover EVERYTHING in the future, and keep a redundant copy of the backups in the cloud, since we have lost multiple hosts before :-(.

Backups will probably be sent to a backup server "nearby" (on the same continent), and archived from there to multiple cloud locations. Losing a backup server should, at worst, require restoring metadata from the other server or from the cloud, but hopefully not restoring the full data from the cloud.

https://3ofcoins.net/2013/11/14/backups-suck-a-rant/

Strict

 * MUST: be open-source (including any server-side components)
 * MUST: provide a CLI
 * MUST: true incremental backups
   * with garbage collection
 * MUST: encryption
   * SHOULD: encryption should be optional, as we have some public data
 * MUST: safe unattended backups
   * MUST NOT: disclose the passphrase or key data needed to generate new backups
 * MUST: compression
   * SHOULD: compression should be optional; some repos hold already-compressed data (e.g. distfiles, releases)
 * MUST: provide validation of backups
 * MUST: support external/off-host storage
 * MUST: support cloud storage (AWS S3, Ceph S3, other)
   * This can be as simple as syncing the off-host storage to the cloud
   * AWS S3-IA/Glacier is USD 0.0125/GB/mo or less; Glacier Deep Archive is USD 0.001/GB/mo
   * rsync.net has a Borg/Attic option at USD 0.03/GB/mo (http://rsync.net/products/attic.html)
   * Maybe free S3-based storage from a sponsor (Dreamhost, DigitalOcean, ...)
 * MUST NOT: require a complete local copy of the data to be retained
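In practice, "validation of backups" means recording a digest for every stored object at backup time and re-hashing on demand to detect corruption. A minimal sketch of that idea follows; real tools do this per chunk or archive, and the manifest layout here is invented purely for illustration:

```python
# Sketch of the "validation of backups" requirement: record a digest for
# every stored object at backup time, then re-hash on demand and compare.
# The manifest layout here is invented purely for illustration.
import hashlib
import os
import tempfile

def snapshot_digests(root: str) -> dict:
    """Map relative path -> SHA-256 hex digest for every file under root."""
    digests = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            with open(path, "rb") as fh:
                digests[rel] = hashlib.sha256(fh.read()).hexdigest()
    return digests

def verify(root: str, manifest: dict) -> list:
    """Return relative paths whose current digest no longer matches the manifest."""
    current = snapshot_digests(root)
    return [p for p, d in manifest.items() if current.get(p) != d]

with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "data.txt"), "wb") as fh:
        fh.write(b"hello backups")
    manifest = snapshot_digests(root)
    assert verify(root, manifest) == []           # intact backup verifies cleanly
    with open(os.path.join(root, "data.txt"), "wb") as fh:
        fh.write(b"bit rot!")
    print("corrupted:", verify(root, manifest))   # -> corrupted: ['data.txt']
```

Any candidate tool should provide the equivalent built in (and ideally verify against the remote copy, not just local state).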

Nice-to-have

 * SHOULD: support multiple threads
   * Some data sets for backup are very large but change slowly, and we have plenty of IO capacity behind them
   * Having multiple readers (of files to back up) and writers (of backup data) would massively improve performance
 * SHOULD: support permanent/forever retention
   * with purging on a per-file basis
 * SHOULD: scale to >2TB single repos (historical releases + distfiles)
   * Attic had a corruption issue at scale: http://librelist.com/browser/attic/2015/3/31/comparison-of-attic-vs-bup-vs-obnam/#cbbe599389a20c787a74b137dc78fb1a
 * SHOULD: provide de-duplication
   * PREFERENCE, in order:
     * sliding-window
     * fixed-block
     * file-based
   * WANT: de-dupe data between different hosts
 * SHOULD: bundle small files into blobs
 * SHOULD: provide a catalog of files
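The de-duplication preference order matters because the chunking strategy determines how well backups survive small edits: with fixed-size blocks a single inserted byte shifts every later block, while a sliding window picks chunk boundaries from the content itself, so boundaries re-synchronise after an insertion. A toy sketch of the idea (the window size and boundary mask are invented; real tools use proper rolling hashes such as buzhash or Rabin fingerprints):

```python
# Toy content-defined ("sliding-window") chunker: a rolling sum over the
# last WINDOW bytes decides chunk boundaries, so an insertion only changes
# the chunks it touches.  WINDOW and MASK are arbitrary values for this
# sketch, not any real tool's parameters.
import random

WINDOW = 16   # bytes in the rolling window
MASK = 0x3F   # boundary when (rolling sum & MASK) == MASK (~64 B average chunk)

def chunks(data: bytes) -> list:
    """Split data at positions chosen by a rolling sum over the last WINDOW bytes."""
    out, start, acc = [], 0, 0
    for i, byte in enumerate(data):
        acc += byte
        if i >= WINDOW:
            acc -= data[i - WINDOW]
        # cut only if the current chunk is at least WINDOW bytes long
        if (acc & MASK) == MASK and i + 1 - start >= WINDOW:
            out.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        out.append(data[start:])
    return out

random.seed(0)
original = bytes(random.randrange(256) for _ in range(4096))
shifted = b"XYZ" + original        # 3 bytes inserted at the very front

a, b = set(chunks(original)), set(chunks(shifted))
fixed = lambda d: {d[i:i + 64] for i in range(0, len(d), 64)}
print("sliding-window chunks shared:", len(a & b), "of", len(a))
print("fixed 64-byte blocks shared: ", len(fixed(original) & fixed(shifted)))
```

With the sliding window, nearly all chunks are still shared after the insertion; with fixed blocks, essentially none are, which is why sliding-window is the top preference.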

Known backup software
The [Arch Linux Sync & Backup programs] wiki page contains a good feature comparison.

Active contenders

 * rdup + rdedup

Contenders tried

 * obnam
   * Run by robbat2 2016/08 - 2016/12
   * Very slow; not a good fit for the problem
   * Single-threaded, with locking issues on repo access
   * 2017/08: upstream announced that development has stopped; please migrate away
 * restic
   * Tried briefly in 2017/01
   * Very promising EXCEPT for the symmetric-encryption issue: either the backup password lives on the host (where an attacker can take it), or you can't do backups at all

Incremental by-chunk

 * Arq
   * Included for comparison; good design
   * Ruled out: closed source
 * Attic/Borg
   * The Borg fork is actively maintained
   * Ruled out: symmetric encryption only
   * See also borgmatic, Atticmatic
 * btar
   * http://vicerveza.homeunix.net/~viric/cgi-bin/btar/doc/trunk/doc/home.wiki
   * Ruled out: not really a backup tool, but rather an advanced tar-like tool
 * bup
   * Ruled out: no encryption
 * ddar
   * https://github.com/basak/ddar
   * Ruled out: no encryption (dedupe tool ONLY)
 * Duplicacy
   * Ruled out: upstream Go build is hard to package
   * Ruled out: not compliant with the Social Contract; no longer freely licensed (https://duplicacy.com/buy.html). Upstream was approached about this and was not willing to provide a DFSG-free license.
 * Duplicati
   * Ruled out: problematic to build on Linux (a very up-to-date Mono environment is needed)
 * obnam
   * Ruled out: slow
   * Thread raised on the upstream list about alternate encryption, because the core algorithm is fast but still single-threaded
   * 2017/08: upstream announced that development has stopped; please migrate away
 * tarsnap
   * Included for comparison
   * Ruled out: closed source
 * ZBackup
   * http://zbackup.org/
   * See also https://github.com/davidbartonau/zbackup-tar
   * Ruled out: symmetric encryption only (needs the password to run the backup)
 * rdedup
   * https://github.com/dpc/rdedup
 * restic
   * https://restic.github.io/
   * Ruled out: symmetric encryption only (needs the password to run the backup)

Incremental by-file

 * duplicity
   * Ruled out: very heavy on temp space usage
 * rdiff-backup
   * Ruled out: no encryption
 * rsnapshot
   * Ruled out: no encryption
 * Snebu
   * Ruled out: no encryption

Snapshot
Most of these are very old-school backup programs.
 * Amanda
   * Ruled out: no true forever-incremental
 * BackupPC
   * Ruled out: no encryption
 * Bacula/Bareos
   * What is the state of the business dispute between Bacula & the Bareos fork? AGPL issues?
 * DAR
   * TODO
 * Dirvish
   * Ruled out: rsync/rdiff wrapper
   * Previously used by infra; painful to scale
 * UrBackup
   * Ruled out: no encryption

Other

 * Unison
   * Ruled out: sync tool, not a backup tool
 * git-annex
   * Ruled out: not a backup tool per se
 * SyncThing
   * Ruled out: sync tool, not a backup tool

Wrappers

 * Backupninja
   * Wraps duplicity & rdiff-backup
   * Ruled out: wrapper only
   * Ruled out: no encryption in the underlying tools (rdiff-backup)
   * Ruled out: very heavy on temp space usage (duplicity)
 * Burp
   * librsync in v2, real dedup in v2?
   * Ruled out: wrapper only?
 * Deja-dup
   * Wraps duplicity
   * Includes cloud targets for storage
   * Ruled out: wrapper only
   * Ruled out: very heavy on temp space usage (duplicity)
 * deltaic
   * NOTE: may be useful separately for capturing the GitHub organization
   * Ruled out: wrapper only
   * Ruled out: no incremental
 * SafeKeep
   * Wrapper for rdiff-backup
   * Ruled out: wrapper only
   * Ruled out: no encryption in the underlying tools (rdiff-backup)
 * backup
   * https://github.com/marfarma/backup
   * Wrapper for many tools
   * Ruled out: wrapper only
   * Ruled out: minimal incremental support (simple rsync only)
 * Backup v4.x
   * https://github.com/backup/backup
   * Newer version of the marfarma backup tool
   * Ruled out: wrapper only
   * Ruled out: minimal incremental support (simple rsync only)

Misc ideas

 * Combine:
   * obnam (with encryption)
   * git-annex the obnam repo to S3 & alternate hosts
   * Put extra catalogs in git-annex

 * Combine:
   * borg-backup
   * rclone to send data over to S3 & alternate hosts

 * Combine:
   * zbackup-tar
   * Run both a per-filesystem tar
   * AND a tar for each actual file in the system (ignoring symlinks and dirs)
   * Should de-dupe superbly
   * Should allow intelligent retention

Review notes

 * Attic & Borg need to be ruled out because they require clients to take an exclusive lock on the repo for the entire backup. We could work around this by having a separate repo for each client, but then we don't get de-duplication between hosts.
 * obnam works so far, but is slow, even with the recommended performance tuning. It seems to be limited by its poor gpg usage.