User:OstCollector/Reproducible Build
Motivation
Reproducible builds are generally a good thing. [1] Several distributions, including Debian, Fedora, Arch Linux, openSUSE, NixOS and NetBSD, are working towards reproducible builds or have already achieved them.
Current state of Gentoo Linux
There has been some discussion about reproducibility, but Gentoo seems to lack the resources to focus on it. sam is one of the developers interested in this field. The value of reproducible builds for Gentoo Linux is still open to question.
What is the progress?
Currently, with a customized build process, the @world set of stage3 can be reproduced with most files outside /var identical across builds. The following files currently cannot be reproduced:
- files generated by portage during installation
  - System-wide
    /var/cache/edb/mtimedb
    /var/cache/edb/vdb_metadata.pickle
    /var/cache/edb/vdb_metadata_delta.json
  - Per-package, located at /var/db/pkg/<category>/<P>
    ./BINPKGMD5
    ./BUILD_TIME
    ./CONTENTS
- files generated by other applications during installation
  /var/cache/ldconfig/aux-cache
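To find out which files still differ, two build outputs can be compared file by file. The following is only a minimal sketch of such a check, not the tooling used for the results above; the function names `file_digests` and `differing_files` are made up for illustration, and symlinks, permissions and mtimes are deliberately ignored.

```python
import hashlib
import os


def file_digests(root):
    """Map each regular file's path (relative to root) to its SHA-256 digest."""
    digests = {}
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # walk subdirectories in a stable order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # symlinks and special files are skipped in this sketch
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            digests[os.path.relpath(path, root)] = digest
    return digests


def differing_files(root_a, root_b):
    """Return sorted relative paths that differ in content or exist on one side only."""
    a, b = file_digests(root_a), file_digests(root_b)
    return sorted(p for p in a.keys() | b.keys() if a.get(p) != b.get(p))
```

Calling `differing_files("build1", "build2")` on two output roots (hypothetical directory names) returns the list of paths to investigate; an empty list means the file contents are identical.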
Customized Build Process
Since catalyst seems quite hard to deal with (I don't understand the effects of its environment variables and make.conf), a custom build system that generates a stage3-like output is used instead. It consists of the steps described below:
- Fetch a stage3 tarball and a later portage snapshot. Alternatively, just use the official images on Docker.
- Add the customized portage configuration to /etc/portage
- Run
emerge -uDN @world
to bring the build environment up to date.
- emerge the packages that must be modified for reproducibility. (I tried to emerge these packages together with the previous step, but it seems that not every specified package gets emerged.)
- emerge packages to a special path, like
emerge @world --root=output_path
What is the penalty?
The mtime of the ebuild is used as an input, so the resulting binaries may be identical to those built on other machines, which may make attacking Gentoo easier.
When Python loads a .pyc file, it also has to compute the hash of the corresponding source file and check whether it matches the hash stored in the .pyc file, so loading becomes slower.
How to make it?
To achieve reproducible builds, the build system has to be modified slightly so that the factors that make builds differ are removed. Some of the identified factors and their mitigations are shown below.
The examples assume that the files used for mitigation are placed in PORTAGE_CONFIG.
Timestamps in static library
Though it is not recommended to use static libraries in the *nix world, Gentoo still builds static libraries for several low-level packages, including glibc, tcl and libcap, in the default configuration.
However, .a files contain the timestamps of the object files, which makes them differ across builds. Luckily, binutils, which is responsible for creating .a files, has a configure-time flag --enable-deterministic-archives. When binutils is configured with this flag, it generates .a files with zeroed timestamps, so the .a files become reproducible. The Gentoo maintainer of binutils has added this option but commented it out. It can be activated with package-level environment variables in portage:
root #
cat ${PORTAGE_CONFIG}/env/binutils-deterministic
EXTRA_ECONF="--enable-deterministic-archives"
root #
cat ${PORTAGE_CONFIG}/package.env/binutils
sys-devel/binutils binutils-deterministic
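Whether an installed .a file was actually written deterministically can be checked by reading the mtime field of each member header. The sketch below assumes the common ar layout (an 8-byte `!<arch>\n` magic followed by 60-byte member headers with a 12-byte decimal mtime at offset 16); the function names are made up for illustration and the check looks only at timestamps, not at uid, gid or mode.

```python
AR_MAGIC = b"!<arch>\n"
HEADER_SIZE = 60


def ar_member_mtimes(data):
    """Return (name, mtime) for each member header of an ar archive given as bytes."""
    if not data.startswith(AR_MAGIC):
        raise ValueError("not an ar archive")
    members = []
    offset = len(AR_MAGIC)
    while offset + HEADER_SIZE <= len(data):
        header = data[offset:offset + HEADER_SIZE]
        if header[58:60] != b"`\n":
            raise ValueError("corrupt member header")
        name = header[0:16].decode("ascii").rstrip()
        mtime = int(header[16:28].decode("ascii").strip() or "0")
        size = int(header[48:58].decode("ascii").strip())
        members.append((name, mtime))
        # Member data is padded to an even length.
        offset += HEADER_SIZE + size + (size % 2)
    return members


def has_zero_timestamps(data):
    """True if every member header carries an mtime of 0."""
    return all(mtime == 0 for _, mtime in ar_member_mtimes(data))
```

For example, `has_zero_timestamps(open("libfoo.a", "rb").read())` (with a hypothetical file name) should return True for an archive created by a binutils built with the flag above.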
The use of SOURCE_DATE_EPOCH and PYTHONHASHSEED
C/C++ has the __DATE__ and __TIME__ macros, which expand to the date and time at which the preprocessor is invoked.
Also, Python injects the mtime of the source file into .pyc files. The mtime is used to check whether the original source file has been modified. These timestamps make reproducible builds impossible.
reproducible-builds.org has proposed using SOURCE_DATE_EPOCH to get rid of such problems. GCC and LLVM take this environment variable into account [2] [3].
Python honors it as well: when .pyc files are compiled with SOURCE_DATE_EPOCH set to a valid timestamp, Python uses a hash of the source file rather than its mtime to check whether the .pyc is still valid.
Note: This behavior degrades performance, since checking whether the mtime is still correct is much faster than hashing the source file.
Fedora Linux uses a different technique: it keeps the mtime check, but sets the timestamp of all source files to SOURCE_DATE_EPOCH [4].
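The difference between mtime-based and hash-based .pyc files can be observed with the standard `py_compile` module. The sketch below compiles the same source twice with two different mtimes and returns both results; the helper name `compile_twice` and the two timestamps are arbitrary choices for illustration. The invalidation mode is passed explicitly here instead of relying on SOURCE_DATE_EPOCH being set in the environment.

```python
import os
import py_compile
import tempfile


def compile_twice(mode):
    """Compile one source file under two different mtimes; return both .pyc contents."""
    results = []
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "mod.py")
        cfile = os.path.join(tmp, "mod.pyc")
        with open(src, "w") as f:
            f.write("x = 1\n")
        for mtime in (1_000_000_000, 1_500_000_000):
            os.utime(src, (mtime, mtime))  # change only the timestamp, not the content
            pyc = py_compile.compile(src, cfile=cfile, doraise=True,
                                     invalidation_mode=mode)
            with open(pyc, "rb") as f:
                results.append(f.read())
    return results
```

With `py_compile.PycInvalidationMode.TIMESTAMP` the two results differ, because the mtime is stored in the .pyc header. With `py_compile.PycInvalidationMode.CHECKED_HASH` they are byte-identical, which is the property a reproducible build needs.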
In addition, Python randomizes string hashing by default, so the iteration order of hash-based containers such as sets can change from run to run. The order can be fixed by setting PYTHONHASHSEED.
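The effect of PYTHONHASHSEED can be checked by starting fresh interpreters with a fixed seed. This is only a small demonstration; the helper name `set_order` and the sample strings are made up for illustration.

```python
import os
import subprocess
import sys

# A set of strings whose iteration order depends on string hashing.
SNIPPET = "print(list({'alpha', 'beta', 'gamma', 'delta', 'epsilon'}))"


def set_order(seed):
    """Run a fresh interpreter with the given PYTHONHASHSEED and return what it prints."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run([sys.executable, "-c", SNIPPET], env=env,
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()
```

Two calls with the same seed, for example `set_order(1)` and `set_order(1)`, return the same string, whereas without a fixed seed the order may change between interpreter runs.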
The following patch injects SOURCE_DATE_EPOCH, set to the mtime of the ebuild file, into the build process. The patch also sets PYTHONHASHSEED to the same value.
root #
cat ${PORTAGE_CONFIG}/patches/sys-apps/portage/source-time-epoch.patch
diff -aurN portage-3.0.44.orig/lib/portage/package/ebuild/_config/special_env_vars.py portage-3.0.44/lib/portage/package/ebuild/_config/special_env_vars.py
--- portage-3.0.44.orig/lib/portage/package/ebuild/_config/special_env_vars.py 2023-01-15 22:11:19.000000000 -0000
+++ portage-3.0.44/lib/portage/package/ebuild/_config/special_env_vars.py 2023-06-06 23:16:46.706213851 -0000
@@ -187,6 +187,8 @@
     "ROOT",
     "ROOTPATH",
     "SANDBOX_LOG",
+    "SOURCE_DATE_EPOCH",
+    "PYTHONHASHSEED",
     "SYSROOT",
     "T",
     "TMP",
diff -aurN portage-3.0.44.orig/lib/portage/package/ebuild/doebuild.py portage-3.0.44/lib/portage/package/ebuild/doebuild.py
--- portage-3.0.44.orig/lib/portage/package/ebuild/doebuild.py 2023-01-15 22:11:19.000000000 -0000
+++ portage-3.0.44/lib/portage/package/ebuild/doebuild.py 2023-06-06 17:13:08.494561529 -0000
@@ -346,6 +346,8 @@
     mysettings = settings
     mydbapi = db
     ebuild_path = os.path.abspath(myebuild)
+    source_date_epoch = os.stat(ebuild_path).st_mtime_ns
+    source_date_epoch = source_date_epoch // 1000 ** 3
     pkg_dir = os.path.dirname(ebuild_path)
     mytree = os.path.dirname(os.path.dirname(pkg_dir))
     mypv = os.path.basename(ebuild_path)[:-7]
@@ -420,6 +422,8 @@
         mysettings["PORTAGE_DEBUG"] = "1"
 
     mysettings["EBUILD"] = ebuild_path
+    mysettings["SOURCE_DATE_EPOCH"] = str(source_date_epoch)
+    mysettings["PYTHONHASHSEED"] = str(source_date_epoch)
     mysettings["O"] = pkg_dir
     mysettings.configdict["pkg"]["CATEGORY"] = cat
     mysettings["PF"] = mypv
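The core of the patch above is the conversion of the ebuild's mtime into whole seconds. A standalone sketch of that computation is shown below; the function name `reproducible_env` is made up for illustration and is not part of portage.

```python
import os


def reproducible_env(ebuild_path):
    """Return the two variables the patch sets, derived from the ebuild's mtime."""
    # st_mtime_ns is an integer number of nanoseconds; floor-divide to whole seconds
    # so that no floating-point rounding is involved.
    epoch = os.stat(ebuild_path).st_mtime_ns // 1000 ** 3
    return {
        "SOURCE_DATE_EPOCH": str(epoch),
        "PYTHONHASHSEED": str(epoch),
    }
```

Note that Python only accepts PYTHONHASHSEED values from 0 to 4294967295, so reusing the epoch as the seed relies on the timestamp staying within that range.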
Debuginfo stripping
Portage strips executables in parallel. However, some build processes install executables as hard links to other executables, e.g. binutils and gcc. With the current portage implementation this leads to nondeterministic results. This can be fixed with the following patch:
root #
cat ${PORTAGE_CONFIG}/patches/sys-apps/portage/deterministic-strip.patch
diff -aurN portage-3.0.63.orig/bin/estrip portage-3.0.63/bin/estrip
--- portage-3.0.63.orig/bin/estrip 2024-02-25 08:29:43.000000000 +0000
+++ portage-3.0.63/bin/estrip 2024-05-02 10:53:39.562353114 +0000
@@ -508,11 +508,13 @@
 	while IFS= read -d '' -r x ; do
 		inode_link=$(get_inode_number "${x%.estrip}") || die "stat failed unexpectedly"
 		echo "${x%.estrip}" >> "${inode_link}" || die "echo failed unexpectedly"
-	done < <(find "${ED}" -name '*.estrip' -delete -print0)
+	done < <(find "${ED}" -name '*.estrip' -delete -print0 | sort -z)
 fi
 
 # Now we look for unstripped binaries.
 for inode_link in $(shopt -s nullglob; echo *) ; do
+(
+__multijob_child_init
 while read -r x
 do
 
@@ -521,8 +523,6 @@
 		banner=true
 	fi
 
-	(
-	__multijob_child_init
 	f=$(file -S "${x}") || exit 0
 	[[ -z ${f} ]] && exit 0
 
@@ -570,10 +570,9 @@
 	if ${was_not_writable} ; then
 		chmod u-w "${x}"
 	fi
-	) &
-	__multijob_post_fork
-
 done < "${inode_link}"
+) &
+__multijob_post_fork
 done
 
 # With a bit more work, we could run the rsync processes below in
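As far as can be read from the patch, it does two things: it sorts the file list, and it handles all paths that share one inode sequentially inside a single child process instead of forking one child per path. The grouping idea can be sketched in Python as follows; this illustrates the principle only and is not how estrip is implemented, and the function name `group_by_inode` is made up.

```python
import os
from collections import defaultdict


def group_by_inode(paths):
    """Group paths that are hard links of the same file, in a deterministic order."""
    groups = defaultdict(list)
    for path in sorted(paths):  # sorting fixes the order inside every group
        st = os.lstat(path)
        groups[(st.st_dev, st.st_ino)].append(path)
    # Sort the groups as well, so the overall processing order is stable.
    return sorted(groups.values())
```

Each returned group could then be handed to one worker, so that two workers never operate on the same inode at the same time.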
Portage metadata
Portage exports various metadata to /var/db/pkg/<category>/<P>. Some of the problematic items have been identified:
environment.bz2 from bash
Bash has several dynamic variables, such as ${RANDOM}, ${SRANDOM}, ${EPOCHREALTIME} and ${EPOCHSECONDS}, whose values change all the time. These variables end up in environment.bz2. Use the following patch to drop them:
root #
cat ${PORTAGE_CONFIG}/patches/sys-apps/portage/trim-bash-var.patch
diff -aurN portage-3.0.44.orig/bin/save-ebuild-env.sh portage-3.0.44/bin/save-ebuild-env.sh
--- portage-3.0.44.orig/bin/save-ebuild-env.sh 2023-01-15 22:11:19.000000000 -0000
+++ portage-3.0.44/bin/save-ebuild-env.sh 2023-06-10 04:53:29.835669374 -0000
@@ -115,6 +115,9 @@
 	# user config variables
 	unset DOC_SYMLINKS_DIR INSTALL_MASK PKG_INSTALL_MASK
 
+	# Always changing variables, causing build nonreproducible
+	unset EPOCHREALTIME EPOCHSECONDS RANDOM SRANDOM
+
 	declare -p
 	declare -fp
 	if [[ ${BASH_VERSINFO[0]} == 3 ]]; then
FEATURES variable
Update (2024-09-30): This has also been observed by the Chromium developers, and a similar patch has already been merged. [5]
Under certain circumstances, the contents of the FEATURES file and of the FEATURES variable in environment.bz2 are not ordered deterministically, even though PYTHONHASHSEED is set.
The following patch makes the order of FEATURES deterministic:
root #
cat ${PORTAGE_CONFIG}/patches/sys-apps/portage/predictable-features-with-test.patch
diff -aurN portage-3.0.44.orig/lib/portage/package/ebuild/config.py portage-3.0.44/lib/portage/package/ebuild/config.py
--- portage-3.0.44.orig/lib/portage/package/ebuild/config.py 2023-01-15 22:11:19.000000000 -0000
+++ portage-3.0.44/lib/portage/package/ebuild/config.py 2023-06-11 23:16:01.171073495 -0000
@@ -2206,7 +2206,7 @@
                 # "test" is in IUSE and USE=test is masked, so execution
                 # of src_test() probably is not reliable. Therefore,
                 # temporarily disable FEATURES=test just for this package.
-                self["FEATURES"] = " ".join(x for x in self.features if x != "test")
+                self["FEATURES"] = " ".join(x for x in sorted(self.features) if x != "test")
 
         # Allow _* flags from USE_EXPAND wildcards to pass through here.
         use.difference_update(
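The one-line change can be tried in isolation. In the sketch below a plain Python set stands in for portage's features object, and the function name `features_string` is made up for illustration.

```python
def features_string(features):
    """Join the features in sorted order, leaving out "test" as the patched line does."""
    return " ".join(x for x in sorted(features) if x != "test")


# The result no longer depends on the iteration order of the container.
print(features_string({"userpriv", "test", "sandbox"}))
```

Because `sorted()` returns a list ordered by value, the joined string is the same no matter how the underlying set happens to iterate.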
Per package
Perl
Perl has perlbug and perlthank, which are generated at build time with a timestamp embedded. The following patch can be used to get rid of it:
root #
cat ${PORTAGE_CONFIG}/patches/dev-lang/perl/reproducible.patch
--- perl-5.36.0.orig/Configure 2023-06-11 12:08:49.131553625 -0000
+++ perl-5.36.0/Configure 2023-06-11 12:11:24.282596273 -0000
@@ -3867,7 +3867,11 @@
 . ./posthint.sh
 
 : who configured the system
-cf_time=`LC_ALL=C; LANGUAGE=C; export LC_ALL; export LANGUAGE; $date 2>&1`
+if $test -n "${SOURCE_DATE_EPOCH}" ;then
+	cf_time=`LC_ALL=C; LANGUAGE=C; export LC_ALL; export LANGUAGE; $date --date="@${SOURCE_DATE_EPOCH}" 2>&1`
+else
+	cf_time=`LC_ALL=C; LANGUAGE=C; export LC_ALL; export LANGUAGE; $date 2>&1`
+fi
 case "$cf_by" in
 "")
 	cf_by=`(logname) 2>/dev/null`
diff -aurN perl-5.36.0.orig/utils/perlbug.PL perl-5.36.0/utils/perlbug.PL
--- perl-5.36.0.orig/utils/perlbug.PL 2020-12-28 16:57:44.000000000 -0000
+++ perl-5.36.0/utils/perlbug.PL 2023-06-11 12:11:55.788746681 -0000
@@ -31,6 +31,9 @@
     or die "Can't find patchlevel.h: $!";
 
 my $patchlevel_date = (stat _)[9];
+if ( exists $ENV{'SOURCE_DATE_EPOCH'} && defined $ENV{'SOURCE_DATE_EPOCH'} ) {
+    $patchlevel_date = $ENV{'SOURCE_DATE_EPOCH'};
+}
 
 # TO DO (perhaps): store/embed $Config::config_sh into perlbug. When perlbug is
 # used, compare $Config::config_sh with the stored version. If they differ then
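Both hunks follow the same pattern: use SOURCE_DATE_EPOCH when it is set and fall back to the original time source otherwise. The pattern can be sketched in Python as follows; the function name `build_timestamp` and the output format are made up for illustration. Unlike the `date --date=@...` call in the patch, which formats the time in the local time zone, this sketch always formats in UTC so that the result does not depend on the build machine's time zone.

```python
import os
import time


def build_timestamp(environ=None):
    """Format the build time, preferring SOURCE_DATE_EPOCH over the current time."""
    environ = os.environ if environ is None else environ
    epoch = environ.get("SOURCE_DATE_EPOCH")
    # Fall back to the wall clock only when the variable is unset or empty.
    seconds = int(epoch) if epoch else time.time()
    return time.strftime("%Y-%m-%d %H:%M:%S UTC", time.gmtime(seconds))
```

For example, `build_timestamp({"SOURCE_DATE_EPOCH": "0"})` returns `1970-01-01 00:00:00 UTC` on every machine.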
Perl also embeds the kernel version string into Errno.pm and the hostname of the build machine into Config.h:
root #
cat ${PORTAGE_CONFIG}/patches/dev-lang/perl/reproducible-hostinfo.patch
diff -aurN perl-5.38.2.orig/config.over perl-5.38.2/config.over
--- perl-5.38.2.orig/config.over 2024-05-02 11:05:55.379906469 +0000
+++ perl-5.38.2/config.over 2024-05-02 13:00:22.592634643 +0000
@@ -9,3 +9,8 @@
 done
 ccdlflags="$tmp"
 lddlflags="$lddlflags $LDFLAGS"
+
+osvers="6.6.21-gentoo"
+myuname="gentoolinux"
+myhostname="gentoolinux"
+
This diff is based on [6]. Arch Linux also sets cf_time, but it is still unclear to the writer why that works.
vim
Similar to perl, vim embeds the hostname of the build machine into the executable. This can be fixed by setting an emerge-time environment:
root #
cat ${PORTAGE_CONFIG}/env/vim-reproducible
EXTRA_ECONF="--with-compiledby='Gentoo Linux'"
root #
cat ${PORTAGE_CONFIG}/package.env/vim
app-editors/vim vim-reproducible
lsof
lsof traditionally embeds the kernel version string into its executable. This was intended to help deal with behavioral differences between versions of the same OS. But Linux has a stable procfs interface, so this mechanism is never used there. A PR addressing this was recently merged upstream [7].
kubelet
This package is not in the @world set of stage3, but it was tested anyway and found to be non-reproducible. The reason is still unknown.
Again, about timestamps and binpkgs
There was some discussion about whether portage could install files with a deterministic timestamp, but the portage developers said that the timestamp is actually used to check whether a file needs to be updated. (The details would have to be looked up in the IRC history.) The developers said it may be possible to make binpkgs reproducible, though.
A related question is which time should be used as the timestamp of the emerge process. On this page, the mtime of the ebuild is chosen, but is that reasonable? Should the timestamp be updated if any of the dependencies is updated?
References
- ↑ citation needed
- ↑ https://gcc.gnu.org/onlinedocs/cpp/Environment-Variables.html
- ↑ https://releases.llvm.org/16.0.0/tools/clang/docs/ReleaseNotes.html#non-comprehensive-list-of-changes-in-this-release
- ↑ https://fedoraproject.org/wiki/Changes/ReproducibleBuildsClampMtimes#Python_bytecode
- ↑ https://github.com/gentoo/portage/pull/1312/commits/1d0a15277ab72c4862f85598f6998076d9f1841e
- ↑ https://gitlab.archlinux.org/archlinux/packaging/packages/perl/-/blob/f290a49cbdc193561f9aaa3fc189a2335dd12ae4/config.over
- ↑ https://github.com/lsof-org/lsof/pull/314