While cheap to operate, rsync is a non-HTTP protocol in a world of HTTP protocols. The primary aim is to reduce our dependence on custom protocols. Using 'dumb' HTTP pipes allows us to simplify serving infrastructure and use cheaper commodity components such as content delivery networks (CDNs). Rsync requires us to run compute nodes near our users (to support the rsync protocol), which in turn requires us to sustain and operate a network edge. We have not operated that edge well in recent years, which is why we are looking to get rid of it.
How clients use the service: clients typically perform one of 3 operations:
- A full sync. This is for clients who don't have a tree.
- A check sync. This is for any client looking to do an update; it wants to know whether it is behind and, if so, by how much.
- An incremental sync. This is for any client that already has some copy of the repo and wants to pull the incremental updates.
Typically, full syncs are rare (most clients have a copy). Many clients *check* every few hours and do an incremental sync if updates are available. Because of the pace of development, updates are almost always available when these clients check.
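The relationship between the three operations can be sketched as a small decision function. This is only an illustration, not a proposed implementation: it assumes each tree state is identified by a monotonically increasing integer revision, and the names `SyncAction`, `choose_action`, and `revisions_behind` are hypothetical.

```python
from enum import Enum
from typing import Optional


class SyncAction(Enum):
    FULL = "full"
    INCREMENTAL = "incremental"
    UP_TO_DATE = "up-to-date"


def revisions_behind(local_revision: int, remote_revision: int) -> int:
    """Answer the 'by how much' part of a check sync.

    Assumes revisions are monotonically increasing integers
    (an assumption made for this sketch only).
    """
    return max(0, remote_revision - local_revision)


def choose_action(local_revision: Optional[int], remote_revision: int) -> SyncAction:
    """Pick the operation a client should perform after a check."""
    if local_revision is None:
        # No local tree at all: only a full sync can help.
        return SyncAction.FULL
    if revisions_behind(local_revision, remote_revision) > 0:
        # The client has a copy but is behind: pull the incremental updates.
        return SyncAction.INCREMENTAL
    return SyncAction.UP_TO_DATE
```

In this sketch a client that checks every few hours would call `choose_action` with its stored revision and the revision reported by the check, and would fall through to `UP_TO_DATE` only in the uncommon case where nothing has changed.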
- Syncs should take < 2m for a full sync
- Syncs should take < 10 for an incremental sync
- Syncs should include repo metadata
- There needs to be some kind of latency / bandwidth consideration, but it's less clear what it is.
- Syncs should be possible over HTTP
- Syncs should scale easily, so that we can serve e.g. 1 million syncs a day
- In reality there are 3 types of syncs: "full", "incremental", and "check".
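To put the example figure of 1 million syncs a day in perspective, the arithmetic below converts a daily total into an average request rate. It is a rough sketch: the function names are hypothetical, and `peak_factor` is an assumed multiplier for bursty traffic that does not come from any measurement.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_syncs_per_second(syncs_per_day: int) -> float:
    """Average rate if syncs were spread evenly across the day."""
    return syncs_per_day / SECONDS_PER_DAY


def peak_syncs_per_second(syncs_per_day: int, peak_factor: float) -> float:
    """Rate during busy periods.

    peak_factor is a hypothetical multiplier (e.g. many clients
    checking on the hour); it is an assumption, not a measured value.
    """
    return average_syncs_per_second(syncs_per_day) * peak_factor


if __name__ == "__main__":
    # 1,000,000 / 86,400 is roughly 11.6
    print(f"{average_syncs_per_second(1_000_000):.1f} syncs/s on average")
```

Evenly spread, 1 million syncs a day is under a dozen syncs per second, so real capacity planning would depend mostly on how bursty the traffic is and on the mix of full, incremental, and check syncs.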