Project:Portage/Sync/post-rsync-world

IN PROGRESS

This page discusses repository syncing at a high level (the goals, and the tradeoffs made to reach them) and then drills down into various implementations. Repository syncing means syncing repository data and metadata; currently this design focuses on ebuild repositories only (not the installed packages repository, nor binary package repos). Specifically, we focus on the ::gentoo repository, as it is the most common one and is large in size, making it an ideal candidate for discussion. However, the implementation should be usable for any ebuild repository.

What is an ebuild repository?

An ebuild repository is a combination of artifacts that nominally includes:

Everything in gentoo.git (ebuilds, eclasses, manifests, profiles, scripts, repository metadata.)
metadata/news (gentoo-news.git merged news/)
metadata/md5-cache (ebuild metadata cache; a latency optimization for repository consumers.)
metadata/glsa (data/glsa.git merged into glsa/)
metadata/dtd (dtds)
metadata/xml-schema (schemas for XML files.)
metadata/projects.xml (list of projects)

Some of these are optional; but in general:

gentoo.git, news, md5-cache, and glsa are non-optional components. We currently generate a friendly version of a 'repository' in git at: https://gitweb.gentoo.org/repo/sync/gentoo.git/tree/metadata.
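As a rough illustration of the component list above, the following sketch reports which non-optional components are missing from a checkout. The function name `missing_components` is hypothetical, and using the `profiles` and `eclass` directories as a stand-in for "everything in gentoo.git" is an assumption made for this example, not something Portage itself does.

```python
from pathlib import Path

# Paths relative to the repository root. The metadata/* entries come from the
# component list above; "profiles" and "eclass" are an assumed stand-in for
# "everything in gentoo.git".
REQUIRED_COMPONENTS = (
    "profiles",
    "eclass",
    "metadata/news",
    "metadata/md5-cache",
    "metadata/glsa",
)


def missing_components(repo_root, required=REQUIRED_COMPONENTS):
    """Return the required component paths that are not directories under repo_root."""
    root = Path(repo_root)
    return [rel for rel in required if not (root / rel).is_dir()]
```

A checkout of gentoo.git alone would be expected to report the three metadata/* entries as missing.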

Not only gentoo.git

Alec asserts here that using a bare git repository (e.g. gentoo.git) as your ebuild repository is a bad idea and will generally result in a bad time, because important data is missing from gentoo.git (GLSAs, news, metadata-cache, etc.).

A series of commits

The repository itself is then a combination of various git repos (glsa, news, gentoo, etc.) that we combine into 'a repository'. If git is a series of commits (in a chain), then a repository is those repos combined, hopefully in a smart way.

Why combine at all?

The obvious next question, then, is why combine at all? We could simply sync 4-6 git repos as a single 'sync'. Or we could use git submodules to essentially make a 'bare repository git repo' that syncs the submodules.
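To make the "sync several git repos" alternative concrete, here is a minimal sketch that only builds the git commands such a sync would run, one per component repo. The `COMPONENT_DIRS` mapping and the `sync_commands` helper are hypothetical, and the directory layout is an assumption based on the component list earlier on this page.

```python
from pathlib import Path

# Hypothetical mapping: component repo name -> checkout directory relative to
# the repository root. The names come from this page; the layout is assumed.
COMPONENT_DIRS = {
    "gentoo": ".",
    "news": "metadata/news",
    "glsa": "metadata/glsa",
}


def sync_commands(repo_root, components=COMPONENT_DIRS):
    """Build one 'git pull --ff-only' command per component checkout.

    The commands are returned, not executed, so the plan can be inspected.
    """
    root = Path(repo_root)
    return [
        ["git", "-C", str(root / rel), "pull", "--ff-only"]
        for _name, rel in sorted(components.items())
    ]
```

Each command updates its checkout independently, so nothing in this plan ties the resulting heads together.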

TODO(antarus): Why don't we do this? Traditionally it was because of rsync?

TODO(antarus): Note that metadata-cache is not in any repo, and is generated server side. If we wanted to, however, we could put it in a git repo. This is generally not a great idea though (it's not what git is for), and you already start to see the downside of the "put everything in git" mentality. Note as well that the metadata-cache is mapped directly to a specific git commit; having the metadata-cache and the git repos individually sync-able is an error, and in fact they should be related / joined.

A repository is a series of commits + metadata

Drawing from the previous segment, then, a repository is a series of commits (the git repos) plus the associated repository metadata. We generate the metadata on the server because the metadata maps 1:1 to a given head commit.
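The 1:1 relationship described above can be sketched as a small record that pairs a head commit with the commit its metadata was generated for. The `Snapshot` type and its field names are hypothetical and exist only for illustration; this page does not define such a structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """A repository state: a head commit plus the metadata generated for it."""

    head_commit: str      # commit id the tree is checked out at
    metadata_commit: str  # commit id the metadata (e.g. md5-cache) was built from

    def is_consistent(self) -> bool:
        # The metadata is only valid for the exact commit it was generated from.
        return self.head_commit == self.metadata_commit
```

Under this model, syncing the commits and the metadata separately can produce a snapshot where `is_consistent()` is false, which is the error case the page warns about.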