Project:Infrastructure/Backups v3

From Gentoo Wiki

Planning new backups

Infra used to have good backups, but all past backups were lost when multiple sponsors went away: we had two copies of the data, one in Europe and one in the USA, and both were lost for different reasons within a short time frame.

We should get backups going again, cover EVERYTHING in future, and keep a redundant copy of the backups in the cloud, since we have lost multiple hosts before :-(.

Backups will probably be sent to a backup server "nearby" (on the same continent), and archived from there to multiple cloud locations. Losing a backup server will require at least restoring metadata from the other server or the cloud, but hopefully not the full data from the cloud.

https://3ofcoins.net/2013/11/14/backups-suck-a-rant/

Requirements

Strict

  • MUST: be open-source (including server-side if any)
  • MUST: provide a CLI
  • MUST: True incremental
    • with garbage collection
  • MUST: Encryption
    • SHOULD: encryption should be optional, as we have some public data
  • MUST: Safe unattended backup
    • MUST NOT: disclose the passphrase or key data to generate new backups
  • MUST: Compression
    • SHOULD: be optional, as some repos already hold pre-compressed data (e.g. distfiles, releases)
  • MUST: provide validation of backups
  • MUST: Support External/off-host storage
  • MUST NOT: require a complete local copy of the data to be retained
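
To make "true incremental with garbage collection" and "validation of backups" concrete, here is a minimal sketch, not a proposal for an actual tool. It assumes a content-addressed store where whole files stand in for chunks; `ChunkStore` and its method names are hypothetical and only illustrate the idea.

```python
import hashlib


class ChunkStore:
    """Toy content-addressed store: chunks are keyed by their SHA-256 digest."""

    def __init__(self):
        self.chunks = {}     # digest -> bytes
        self.snapshots = {}  # snapshot name -> {path: digest}

    def backup(self, name, files):
        """Record a snapshot; return how many new chunks had to be written."""
        written = 0
        manifest = {}
        for path, data in files.items():
            digest = hashlib.sha256(data).hexdigest()
            if digest not in self.chunks:  # incremental: skip data already stored
                self.chunks[digest] = data
                written += 1
            manifest[path] = digest
        self.snapshots[name] = manifest
        return written

    def forget(self, name):
        """Drop a snapshot; its chunks stay until collect_garbage() runs."""
        del self.snapshots[name]

    def collect_garbage(self):
        """Mark chunks referenced by any snapshot, then sweep the rest."""
        live = {d for m in self.snapshots.values() for d in m.values()}
        dead = [d for d in self.chunks if d not in live]
        for digest in dead:
            del self.chunks[digest]
        return len(dead)

    def verify(self):
        """Return digests that are missing or whose stored bytes no longer match."""
        bad = set()
        for manifest in self.snapshots.values():
            for digest in manifest.values():
                data = self.chunks.get(digest)
                if data is None or hashlib.sha256(data).hexdigest() != digest:
                    bad.add(digest)
        return sorted(bad)


store = ChunkStore()
day1 = {"etc/passwd": b"root:x:0:0", "var/log/a": b"first"}
day2 = {"etc/passwd": b"root:x:0:0", "var/log/a": b"second"}
new_day1 = store.backup("day1", day1)   # everything is new
new_day2 = store.backup("day2", day2)   # only the changed file is written
store.forget("day1")
swept = store.collect_garbage()         # removes the chunk only day1 used
problems = store.verify()               # empty list when the repo is intact
```

Every snapshot is a full view of the data, yet only changed content is written, and expiring an old snapshot never invalidates a newer one. That is the property "true incremental" is meant to capture here.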

Nice-to-have

  • SHOULD: support multiple threads
    • Some data sets for backup are very large, but change slowly, and we have lots of IO behind them
    • Having multiple readers (of files to backup)/writers (of backup data) would massively improve performance.
  • SHOULD: support permanent/forever retention
    • with purging on a per-file basis.
  • SHOULD: Support cloud storage (AWS-S3, Ceph-S3, other).
    • This can be in the form of simply syncing the off-host storage to the cloud
    • AWS S3-IA/Glacier are $0.0125 USD/GB/month or less.
    • rsync.net has a Borg/Attic option at $0.03 USD/GB/month (http://rsync.net/products/attic.html)
    • Maybe Rackspace account...
    • Maybe free Ceph-S3 from Dreamhost...
  • SHOULD: Scale to >2TB single repos (releases+distfiles historical)
  • SHOULD: Provide de-duplication
    • PREFERENCE:
      1. sliding-window
      2. fixed-block
      3. file-based
    • WANT: de-dupe data from different hosts.
  • SHOULD: bundle small files into blobs
  • SHOULD: provide a catalog of files
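
The de-duplication preference above (sliding-window over fixed-block) can be illustrated with a small sketch. It assumes a simple polynomial rolling hash; the constants `WINDOW`, `BASE`, `MOD` and `AVG` are arbitrary illustrative choices, not values taken from any of the tools listed on this page.

```python
import hashlib
import random

WINDOW = 16          # bytes of context that decide a chunk boundary
BASE = 257
MOD = (1 << 31) - 1  # prime modulus for the rolling hash
AVG = 256            # roughly one boundary per AVG positions
DROP = pow(BASE, WINDOW, MOD)  # weight of the byte leaving the window


def fixed_chunks(data, size=AVG):
    """Fixed-block chunking: boundaries depend only on the byte offset."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def sliding_chunks(data):
    """Sliding-window chunking: boundaries depend only on the last WINDOW bytes."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = (h * BASE + byte) % MOD
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * DROP) % MOD  # forget the oldest byte
        if h % AVG == AVG - 1:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def digests(chunks):
    return {hashlib.sha256(c).hexdigest() for c in chunks}


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(64 * 1024))
shifted = b"!" + original  # the same data with one byte inserted at the front

shared_fixed = len(digests(fixed_chunks(original)) & digests(fixed_chunks(shifted)))
shared_sliding = len(digests(sliding_chunks(original)) & digests(sliding_chunks(shifted)))
print("chunks reused, fixed-block:   ", shared_fixed)
print("chunks reused, sliding-window:", shared_sliding)
```

A single inserted byte shifts every fixed block, so nothing is reused, while the sliding-window boundaries resynchronise shortly after the insertion and most chunks are stored only once. File-based de-duplication is the degenerate case where the whole file is one chunk.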

Known Backup software

[Arch Linux Sync & Backup programs] contains a good feature comparison list.

Active Contenders

  • rdup + rdedup

Contenders tried

  • obnam
    • Run by robbat2 2016/08 - 2016/12
    • very slow, not a good fit to problem
    • Single-threaded, locking issues on repo access
  • restic
    • 2017/01 briefly
    • Very promising EXCEPT for the symmetric encryption issue: either the backup password lives on the host (where an attacker can take it), or you can't do backups at all.
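
The symmetric-only problem, and what the "asymmetric+symmetric paired" option in the comparison below buys us, can be sketched as follows. This is a toy illustration of the key separation only: the RSA key is built from two small known primes and the cipher is a hash-based XOR stream, neither of which is secure. All function names are hypothetical, and a real setup would rely on a vetted tool such as GPG.

```python
import hashlib
import secrets

# Toy RSA key pair from two known Mersenne primes. NOT secure; illustration only.
P, Q = 2**89 - 1, 2**107 - 1
N, E = P * Q, 65537                 # public half: the only key on the backed-up host
D = pow(E, -1, (P - 1) * (Q - 1))   # private half: kept offline, needed only to restore


def keystream_xor(key, data):
    """XOR data with a SHA-256 counter keystream (toy symmetric cipher)."""
    out = bytearray()
    for block in range(0, len(data), 32):
        pad = hashlib.sha256(key + block.to_bytes(8, "big")).digest()
        out.extend(b ^ p for b, p in zip(data[block:block + 32], pad))
    return bytes(out)


def encrypt_on_host(public_n, public_e, plaintext):
    """Runs unattended: needs only the public key, never a passphrase."""
    session_key = secrets.token_bytes(16)
    wrapped = pow(int.from_bytes(session_key, "big"), public_e, public_n)
    return wrapped, keystream_xor(session_key, plaintext)


def decrypt_offline(private_d, public_n, wrapped, ciphertext):
    """Runs only at restore time, wherever the private key is kept."""
    session_key = pow(wrapped, private_d, public_n).to_bytes(16, "big")
    return keystream_xor(session_key, ciphertext)


secret = b"contents of /etc/shadow"
wrapped, blob = encrypt_on_host(N, E, secret)
restored = decrypt_offline(D, N, wrapped, blob)
```

An attacker who takes over the host learns only `N` and `E`, which let them write new backups but not read old ones. With a symmetric-only tool the same secret does both jobs, which is the objection raised against restic above.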

Incremental by-chunk

Incremental by-file

  • duplicity
    • Ruled out: Very heavy on temp space usage (duplicity)
  • rdiff-backup
    • Ruled out: No Encryption
  • rsnapshot
    • Ruled out: No Encryption
  • Snebu
    • Ruled out: No encryption

Snapshot

Most of these are very old-school backup programs.

  • Amanda
    • Ruled out: No true forever incremental
  • BackupPC
    • Ruled out: No encryption
  • Bacula/Bareos
    • What is the state of the business argument between Bacula & the Bareos fork? AGPL issues?
  • DAR
    • TODO
  • Dirvish
    • Ruled out: rsync/rdiff wrapper
    • Previously used by infra, painful to scale.
  • UrBackup
    • Ruled out: No encryption

Other

  • Unison
    • Ruled out: Sync tool, not backup
  • git-annex
    • Ruled out: not backup per se.
  • SyncThing
    • Ruled out: Sync tool, not backup

TODO

Wrappers

  • Backupninja
    • Wraps duplicity & rdiff-backup
    • Ruled out: wrapper only
    • Ruled out: no encryption in underlying tools (rdiff-backup)
    • Ruled out: Very heavy on temp space usage (duplicity)
  • Burp
    • librsync in v2, real dedup in v2?
    • Ruled out: wrapper only??
  • Deja-dup
    • Wraps duplicity
    • Includes cloud targets for storing
    • Ruled out: wrapper only
    • Ruled out: Very heavy on temp space usage (duplicity)
  • deltaic
    • NOTE: May be useful separately to capture the GitHub Organization
    • Ruled out: wrapper only
    • Ruled out: no incremental
  • SafeKeep
    • Wrapper for rdiff-backup
    • Ruled out: wrapper only
    • Ruled out: no encryption in underlying tools (rdiff-backup)
  • backup
  • Backup v4.x

Software Comparison

Legend

- Key Option 1 ...
Encryption | Symmetric-only | asymmetric+symmetric paired | GPG | other | None
Unattended | Yes | No | Only if unencrypted, disclosure of passphrase

Data

Software | Encryption | Unattended | Notes | ...
Amanda | GPG, asymmetric+symmetric paired | Yes | ...
Arq | asymmetric+symmetric paired | Yes | ...
Attic | Symmetric-only | Only if unencrypted, disclosure of passphrase | ...
BackupPC | No | ... | ...
Bacula | ... | ... | ...
btar | ... | ... | ...
bup | No (see encbup) | Yes | ...
Burp | ... | ... | ...
DAR | Yes | ... | ...
ddar | No | ... | ...
Dirvish | No | ... | ...
Duplicati | GPG | Yes | Requires very modern DotNet/Mono environment, tools not up to date in Gentoo | ...
duplicity | GPG | ... | ...
git-annex | Yes, many forms | ... | ...
obnam | Yes | Yes | ...
rdiff-backup | No | ... | ...
restic | Symmetric | Only if unencrypted, disclosure of passphrase | ...
rsnapshot | No | ... | ...
SafeKeep | No | ... | ...
Snebu | No | ... | ...
SyncThing | special | ... | ...
tarsnap | asymmetric+symmetric paired | Yes | ...
Unison | No | Yes | ...
UrBackup | ... | ... | ...
ZBackup | Symmetric-only | No | ...

Misc Ideas

  • Combine:
    • obnam (with encryption)
    • git-annex the obnam repo to S3 & alternate hosts.
    • Put extra catalogs in git-annex


  • Combine:
    • borg-backup
    • rclone to send data over to S3 & alternate hosts
  • Combine:
    • zbackup-tar
      • Run both a per-filesystem tar
      • AND a tar for each actual file in the system (ignore symlinks, dirs)
      • Should de-dupe superbly
      • Should allow intelligent retention.

Review notes

  • Attic & Borg need to be ruled out because they require clients to take an exclusive lock on the repo for the entire backup. We could work around this by having a repo for each thing, but then we don't get de-duplication between hosts.
  • obnam works so far, but is slow, even with the recommended performance tuning. It seems to be limited by its inefficient gpg usage.
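
The cost of the "one repo per thing" workaround for the Attic/Borg lock can be estimated with a rough sketch. The three hosts and their file contents are invented for illustration, and each file body is treated as a single chunk.

```python
import hashlib


def chunk_ids(files):
    """Treat each file body as one chunk and return the set of SHA-256 ids."""
    return {hashlib.sha256(body).hexdigest() for body in files.values()}


# Hypothetical hosts that share most of their base system.
hosts = {
    "host-a": {"bin/bash": b"bash-5", "etc/hostname": b"a", "lib/libc.so": b"glibc"},
    "host-b": {"bin/bash": b"bash-5", "etc/hostname": b"b", "lib/libc.so": b"glibc"},
    "host-c": {"bin/bash": b"bash-5", "etc/hostname": b"c", "lib/libc.so": b"glibc"},
}

# One repo per host: every repo stores its own copy of the common chunks.
per_host_total = sum(len(chunk_ids(files)) for files in hosts.values())

# One shared repo: common chunks are stored once for all hosts.
shared_total = len(set().union(*(chunk_ids(files) for files in hosts.values())))

print("chunks stored, repo per host:", per_host_total)
print("chunks stored, shared repo:  ", shared_total)
```

The more the hosts have in common, the larger the gap, which is why an exclusive repo-wide lock conflicts with the "de-dupe data from different hosts" item in the requirements.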