Project:Infrastructure/Rsync replacement

Replacing rsync

While cheap to operate, rsync is a non-HTTP protocol in a world of HTTP protocols. The primary aim of replacing it is to reduce our dependence on custom protocols. Using 'dumb' HTTP pipes allows us to simplify our serving infrastructure and use cheaper commodity components such as content delivery networks (CDNs). Supporting the rsync protocol requires us to run compute nodes near our users, which in turn requires us to sustain and operate a network edge. We have not done well at this in recent years, which is why we are looking to get rid of it.

Requirements

How clients use the service.

Clients typically perform one of three operations:

A full sync. This is for clients that don't have a tree.
A check sync. This is for any client looking to do an update; it wants to know whether it is behind (and if so, by how much).
An incremental sync. This is for any client with some copy of the repo; it wants to pull the incremental updates.
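As a rough illustration of how a client might pick between these operations, here is a minimal sketch. It assumes the repository state can be summarized by a monotonically increasing revision number, which is an assumption made for this example only and not part of the requirements; the names SyncOp, revisions_behind and choose_sync are hypothetical.

```python
from enum import Enum
from typing import Optional


class SyncOp(Enum):
    FULL = "full"                # client has no tree at all
    INCREMENTAL = "incremental"  # client has a tree but is behind
    NONE = "none"                # client is already up to date


def revisions_behind(local_rev: int, remote_rev: int) -> int:
    """The 'check' step: how far behind is the local copy?

    Assumes revisions are plain integers that only ever increase
    (a hypothetical scheme chosen for this sketch).
    """
    return max(0, remote_rev - local_rev)


def choose_sync(local_rev: Optional[int], remote_rev: int) -> SyncOp:
    """Decide which operation a client should run next."""
    if local_rev is None:
        # No local tree yet, so only a full sync makes sense.
        return SyncOp.FULL
    if revisions_behind(local_rev, remote_rev) > 0:
        return SyncOp.INCREMENTAL
    return SyncOp.NONE
```

In this sketch the check is the cheap operation that every client runs first, and it gates the more expensive incremental sync.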

Typically, full syncs are rare (most clients have a copy). Many clients *check* every few hours and do an incremental sync if updates are available. Because of the pace of development, updates are almost always visible to these clients.

Client

Syncs should take < 2m for a full sync
Syncs should take < 10 for an incremental sync
Syncs should include repository metadata
There needs to be some kind of latency / bandwidth consideration, but it's less clear what it is.

Server

Syncs should be possible over HTTP
Syncs should scale easily, so that we can serve e.g. 1 million syncs a day
In reality there are 3 types of syncs: "full", "incremental", and "check".
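To put the example figure of 1 million syncs a day in perspective, the short calculation below converts it to an average request rate. It assumes syncs are spread evenly over the day, which is a simplification; real traffic would likely have peaks well above the average. The function name average_syncs_per_second is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_syncs_per_second(syncs_per_day: int) -> float:
    """Average rate, assuming an even spread over the day (a simplification)."""
    return syncs_per_day / SECONDS_PER_DAY


rate = average_syncs_per_second(1_000_000)
print(f"{rate:.2f} syncs/second on average")  # prints "11.57 syncs/second on average"
```

An average of roughly 12 syncs per second is modest for static HTTP serving, although the mix of full, incremental and check syncs will determine the actual bandwidth needed.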