GLEP:44

Abstract
This GLEP proposes a new format for the Portage Manifest and digest file system by unifying both filetypes into one to improve functional and non-functional aspects of the Portage Tree.

Motivation
Please see for a general overview. The main long term goals of this proposal are to:


 * Remove the tiny digest files from the tree. They are a major annoyance as on a typical configuration they waste a lot of disk space and the simple transmission of the names for all digest files during a  needs a substantial amount of bandwidth.
 * Reduce redundancy when multiple hash functions are used
 * Remove potential for checksum collisions if a file is recorded in more than one digest file
 * Differentiate between filetypes for a more flexible verification system

Specification
The new Manifest format would change the existing format in the following ways:
 * Addition of a filetype specifier; the types currently planned are:
   * for files directly used by ebuilds (e.g. patches or initscripts), located in the  subdirectory
   * for all ebuilds
   * for files not directly used by ebuilds like  or   files
   * for release tarballs recorded in the  variable of an ebuild; these were previously recorded in the digest files
   * Future portage improvements might extend this list (for example with types relevant for eclasses or profiles)
 * Only have one line per file listing all information instead of one line per file and checksum type
 * Remove the separated digest-* files in the  subdirectory

Each line in the new format has the following layout:

...

However, these entries will be stored in the existing Manifest files.
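As a rough illustration of the one-line-per-file idea, the sketch below parses a single new-style entry. The field layout is elided in the text above, so the order used here (filetype, file name, size, then pairs of hash name and hash value) is an assumption, as are the names `Manifest2Entry` and `parse_manifest2_line`.

```python
from dataclasses import dataclass


@dataclass
class Manifest2Entry:
    filetype: str   # e.g. "DIST"; the type names are assumed, not taken from the text above
    name: str
    size: int
    hashes: dict    # hash name -> hex digest


def parse_manifest2_line(line):
    """Parse one entry, assuming: <type> <name> <size> (<hashname> <hashvalue>)+"""
    fields = line.split()
    # Three fixed fields plus at least one complete hash name/value pair.
    if len(fields) < 5 or (len(fields) - 3) % 2 != 0:
        raise ValueError("malformed entry: %r" % line)
    filetype, name, size = fields[0], fields[1], int(fields[2])
    pairs = fields[3:]
    hashes = dict(zip(pairs[0::2], pairs[1::2]))
    return Manifest2Entry(filetype, name, size, hashes)


entry = parse_manifest2_line("DIST foo-1.0.tar.gz 1234 SHA1 abc MD5 def")
```

Because every hash for a file lives on the same line, the file name and size are parsed exactly once per file.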

An actual example for a (pure) Manifest2 file..

Compatibility Entries
To maintain compatibility with existing portage versions, a transition period is required after the introduction of the Manifest2 format, during which portage will not only have to be capable of using existing Manifest and digest files but also generate them in addition to the new entries. Fortunately this can be accomplished for the Manifest files by simply mixing old and new style entries in one file; existing portage versions will simply ignore the new style entries. For the digest files there are no new entries to care about.
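The mixing of old and new style entries in one file can be pictured with the following sketch, which sorts the lines of an unsigned Manifest by their first keyword. It rests on two assumptions that are not stated in the text above: that new-style lines begin with one of the type names `AUX`, `EBUILD`, `MISC` or `DIST`, and that old-style lines begin with a hash name such as `MD5`. The function name is hypothetical.

```python
# Assumed keyword sets; neither list is taken from the text above.
NEW_STYLE_TYPES = {"AUX", "EBUILD", "MISC", "DIST"}
OLD_STYLE_HASHES = {"MD5", "SHA1", "SHA256", "RMD160"}


def split_mixed_manifest(text):
    """Group Manifest lines into old-style, new-style and unrecognised ones."""
    groups = {"old": [], "new": [], "unknown": []}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if fields[0] in NEW_STYLE_TYPES:
            groups["new"].append(line)
        elif fields[0] in OLD_STYLE_HASHES:
            groups["old"].append(line)
        else:
            groups["unknown"].append(line)
    return groups


mixed = "\n".join([
    "MD5 aa foo-1.0.ebuild 100",
    "EBUILD foo-1.0.ebuild 100 SHA1 bb",
    "",
    "DIST foo-1.0.tar.gz 2000 SHA1 cc",
])
groups = split_mixed_manifest(mixed)
```

A reader that only understands one style can simply skip the other group, which is the behaviour the transition period relies on.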

Scope
It is important to note that this proposal only deals with a change of the format of the digest and Manifest system.

It does not expand the scope of the system to cover eclasses, profiles or anything else not already covered by the Manifest system. It also doesn't affect the Manifest signing efforts in any way (though the implementations of both might be coupled).

Also while multiple hash functions will become standard with the proposed implementation they are not a specific feature of this format.

Number of hashes
While using multiple hashes for each file is a major feature of this proposal, the number of hashes listed must be limited to avoid an explosion of the Manifest size that would negate the main benefit of this proposal (reducing tree size). Therefore at most three different hash functions will be used when generating entries. For compatibility, though, at least one hash function must always be present; this proposal suggests using SHA1 for this purpose (it is supposed to be more secure than MD5, currently only SHA1 and MD5 are directly available in python, and MD5 doesn't have any benefit in terms of compatibility).
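The two constraints above (at most three hash functions, SHA1 always present) could be enforced at generation time roughly as follows. This is a sketch using Python's standard `hashlib` module, not the portage implementation; the function name, the default hash selection and the output field order are assumptions.

```python
import hashlib

MAX_HASHES = 3          # limit stated in this section
REQUIRED_HASH = "sha1"  # the hash this proposal suggests always including


def make_entry(filetype, name, data, hash_names=("sha1", "sha256")):
    """Build one entry line for in-memory file contents (assumed field order)."""
    if REQUIRED_HASH not in hash_names:
        raise ValueError("SHA1 must always be present")
    if len(hash_names) > MAX_HASHES:
        raise ValueError("at most %d hash functions per entry" % MAX_HASHES)
    fields = [filetype, name, str(len(data))]
    for hash_name in hash_names:
        digest = hashlib.new(hash_name, data).hexdigest()
        fields += [hash_name.upper(), digest]
    return " ".join(fields)


line = make_entry("DIST", "empty.tar", b"", hash_names=("sha1",))
```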

Rationale
The main goals of the proposal have been listed in the Motivation. This section explains why they are improvements and how the proposed format accomplishes them.

Removal of digest files
Normal users that don't use a "tuned" filesystem for the portage tree are wasting several dozen to a few hundred megabytes of disk space with the current system, largely because of the digest files. Most filesystems have a standard blocksize of four kilobytes while most digest files are under one kilobyte in size, which results in a waste of approximately three kilobytes per digest file (likely even more). At the time of this writing the tree contains roughly 22,000 digest files, so the overall waste caused by digest files is estimated at about 70-100 megabytes. Furthermore it is assumed that the new format will also reduce the disk space wasted by the Manifest files as they will contain more content, but this hasn't been verified yet.

By unifying the digest files with the Manifest these tiny files are eliminated (in the long run), reducing the apparent tree size by about 20%, benefiting both users and the Gentoo infrastructure.
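The disk-space estimate above can be reproduced with a short calculation. The sketch assumes a filesystem that allocates whole 4096-byte blocks and, purely for illustration, that every digest file is 1000 bytes long; real digest files vary in size, so the result is only an order-of-magnitude check.

```python
def slack_bytes(size, block_size=4096):
    """Bytes allocated but unused when a file occupies whole blocks."""
    if size == 0:
        return 0
    blocks = -(-size // block_size)  # ceiling division
    return blocks * block_size - size


DIGEST_FILES = 22000      # rough count quoted in this section
ASSUMED_FILE_SIZE = 1000  # illustrative size in bytes, an assumption

total_waste = DIGEST_FILES * slack_bytes(ASSUMED_FILE_SIZE)
print(total_waste / 1e6, "MB wasted")
```

Smaller digest files push the total toward the upper end of the quoted range, since each one still occupies a full block.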

Reducing redundancy
When multiple hashes are used with the current system, both the filename and filesize are repeated for every checksum type used, as each checksum is standalone. This repetition doesn't add any functionality, so the new format removes it. This is a theoretical improvement at this moment as only one hash function is in use, but that is expected to change soon (see ).
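To make the redundancy concrete, the sketch below renders the same file once in an old-style form (one line per checksum type) and once in a new-style form (one line per file). Both layouts are assumptions made for illustration: the old one is taken to be hash name, checksum, file name, size, and the new one filetype, file name, size, then hash pairs.

```python
def old_style_lines(name, size, hashes):
    """One standalone line per checksum type (assumed old layout)."""
    return ["%s %s %s %d" % (h, value, name, size) for h, value in hashes.items()]


def new_style_line(filetype, name, size, hashes):
    """A single line carrying every checksum (assumed new layout)."""
    parts = [filetype, name, str(size)]
    for h, value in hashes.items():
        parts += [h, value]
    return " ".join(parts)


hashes = {"MD5": "aa", "SHA1": "bb"}
old = old_style_lines("foo-1.0.tar.gz", 10, hashes)
new = new_style_line("DIST", "foo-1.0.tar.gz", 10, hashes)
```

With two hash functions the file name and size appear twice in the old form and once in the new one; the saving grows with each additional hash.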

Removal of checksum collisions
The current system theoretically allows for a  type file to be recorded in multiple digest files with different sizes and/or checksums. In such a case one version of a package would report a checksum violation while another one would not. This could create confusion and uncertainty among users. So far this case hasn't been observed, but it can't be ruled out with the existing system.

As the new format lists each file exactly once, this would no longer be possible.
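The collision scenario can be sketched as a check over several digest files that record the same file. The data structure and the function name are hypothetical; the point is only that conflicting records are possible when a file is listed more than once.

```python
def find_conflicts(digests):
    """Return files recorded with differing (size, checksum) pairs.

    digests maps a digest-file name to {filename: (size, checksum)}.
    """
    seen = {}
    for digest_name, records in digests.items():
        for filename, record in records.items():
            seen.setdefault(filename, {}).setdefault(record, []).append(digest_name)
    # A file is in conflict if more than one distinct record exists for it.
    return {f: by_record for f, by_record in seen.items() if len(by_record) > 1}


digests = {
    "digest-foo-1.0": {"foo.tar.gz": (100, "aa"), "patch.bz2": (5, "cc")},
    "digest-foo-1.1": {"foo.tar.gz": (100, "bb"), "patch.bz2": (5, "cc")},
}
conflicts = find_conflicts(digests)
```

With a single entry per file there is nothing for such a check to find, which is the guarantee described above.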

Flexible verification system
Right now portage verifies the checksum of every file listed in the Manifest before using any file of the package and all  files of an ebuild before using that ebuild. This is unnecessary in many cases:


 * During the "depend" phase (when the ebuild metadata is generated) only files of type  are used, so verifying the other types isn't necessary. Theoretically it is possible for an ebuild to include other files like those of type   at this phase, but that would be a major QA violation and should never occur, so it can be ignored here. It is also not a security concern as the ebuild is verified before it is parsed, so any manipulation would show up.
 * Generally files of type  don't need to be verified as they are only used in very specific situations, aren't executed (just parsed at most) and don't affect the package build process.
 * Files of type  only need to be verified directly after fetching and before unpacking them (which often will be one step), not every time their associated ebuild is used.
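The selective verification described in the list above could look roughly like the following sketch. The type names (`AUX`, `EBUILD`, `MISC`, `DIST`), the phase names and the choice to verify `AUX` files during a build are all assumptions made for illustration; the list above only states the cases where verification can be skipped.

```python
# Hypothetical mapping from a phase to the entry types worth verifying in it.
TYPES_TO_VERIFY = {
    "depend": {"EBUILD"},        # metadata generation only reads the ebuild
    "fetch": {"DIST"},           # tarballs are checked once, right after fetching
    "build": {"EBUILD", "AUX"},  # assumption: auxiliary files matter when building
}


def entries_to_verify(entries, phase):
    """Filter (filetype, filename) pairs down to those relevant for a phase."""
    wanted = TYPES_TO_VERIFY[phase]
    return [entry for entry in entries if entry[0] in wanted]


entries = [
    ("EBUILD", "foo-1.0.ebuild"),
    ("AUX", "fix-build.patch"),
    ("MISC", "ChangeLog"),
    ("DIST", "foo-1.0.tar.gz"),
]
to_check = entries_to_verify(entries, "depend")
```

In this sketch `MISC` entries are never selected, mirroring the point that such files do not affect the build process.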

Backwards Compatibility
Switching the Manifest system is a task that will need a long transition period like most changes affecting both portage and the tree. In this case the implementation will be rolled out in several phases:
 * 1) Add support for verification of Manifest2 entries in portage
 * 2) Enable generation of Manifest2 entries in addition to the current system
 * 3) Ignore digests during   to get the size-benefit clientside. This step may be omitted if the following steps are expected to follow soon.
 * 4) Disable generation of entries for the current system
 * 5) Remove all traces of the current system from the tree (serverside)

Each step has its own issues. While 1) and 2) can be implemented without any compatibility problems all later steps have a major impact:


 * Step 3) can only be implemented when the whole tree is Manifest2 ready (ideally speaking; practically the requirement will be more like 95% coverage, with the expectation that for the remaining 5% either bugs will be filed after step 3) is completed or they'll be updated at step 5)).
 * Steps 4) and 5) will render all portage versions without Manifest2 support basically useless (users would have to regenerate the digest and Manifest for each package before being able to merge it), so this requires almost 100% coverage of the userbase with Manifest2 capable portage versions (with step 1) completely implemented).

Another problem is that some steps affect different targets:
 * Steps 1) and 3) target portage versions used by users
 * Steps 2) and 4) target portage versions used by devs
 * Step 5) targets the portage tree on the cvs server

While it is relatively easy to get all devs to use a new portage version, this is practically impossible with users as some don't update their systems regularly. Six months are probably sufficient to reach 95% coverage, while one year is estimated to be needed for almost-complete coverage. All times are relative to the stable-marking of a compatible portage version.

No timeframe for implementation is presented here as it is highly dependent on the completion of each step.

In summary, while a full conversion will take over a year to complete due to the compatibility issues mentioned above, some benefits of the system can selectively be used as soon as step 2) is completed.

Impacts on infrastructure
While one long-term goal of this proposal is to reduce the size of the tree and therefore make life for the Gentoo Infrastructure easier, this will only take effect once the implementation is rolled out completely. In the meantime it will increase the tree size due to keeping checksums in both formats. It's not possible to give a usable estimate of the degree of the increase as it depends on many variables such as the exact implementation timeframe, propagation of Manifest2 capable portage versions among devs or the update rate of the tree. It has been suggested that Manifest files that are not gpg signed could be mass converted in one step. This could certainly help, but only to some degree (according to recent research about 40% of all Manifests in the tree are signed, but this number hasn't been verified).

Reference Implementation
A patch for a prototype implementation of Manifest2 verification and partial generation has been posted at, it will be reworked before being considered for inclusion in portage. However it shows that adding support for verification is quite simple, but generation is a bit tricky and will therefore be implemented later.

Options
Some things have been considered for this GLEP but aren't part of the proposal yet for various reasons:


 * timestamp field: the author has considered adding a timestamp field for each entry to list the time the entry was created. However so far no practical use for such a feature has been found.
 * convert size field into checksum: Another idea was to treat the size field like any other checksum. So far no real benefit has been seen (other than a slightly more modular implementation), while it has several drawbacks: for one, unlike checksums, the size field is definitely required for all  files; it would also slightly increase the length of each entry by adding a   keyword.
 * removal of the  type: It has been suggested to completely drop entries of type  . This would result in a minor space reduction (it's rather unlikely to free any blocks) but completely remove the ability to check these files for integrity. While they don't influence portage or packages directly they can contain valuable information for users, so the author is of the opinion that at least the option for integrity checks should be kept.

Credits
Thanks to the following persons for their input on or related to this GLEP (even though they might not have known it): Ned Ludd (solar), Brian Harring (ferringb), Jason Stubbs (jstubbs), Robin H. Johnson (robbat2), Aron Griffis (agriffis)

Also thanks to Nicholas Jones (carpaski) for making the current Manifest system resilient enough to handle this change without too many transition problems.

Copyright
This work is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/.