User:OstCollector/Reproducible Build

From Gentoo Wiki

Motivation

Reproducible builds are generally considered beneficial. [1] Several distributions, including Debian, Fedora, Arch Linux, openSUSE, NixOS, and NetBSD, are working toward reproducible builds or have already achieved them.

Current state of Gentoo Linux

There has been some discussion about reproducibility, but Gentoo seems to lack the resources to focus on it. sam is one of the developers interested in this field. The value of reproducible builds for Gentoo Linux is still open to question.

What is the progress?

Currently, with a customized build process, the @world set of a stage3 can be rebuilt so that most files, except some under /var, are identical across builds. The following files cannot currently be reproduced:

  • files generated by portage during installation
    • System-wide
      • /var/cache/edb/mtimedb
      • /var/cache/edb/vdb_metadata.pickle
      • /var/cache/edb/vdb_metadata_delta.json
    • Per-package, located at /var/db/pkg/<category>/<P>
      • ./BINPKGMD5
      • ./BUILD_TIME
      • ./CONTENTS
  • files generated by other apps during installation
    • /var/cache/ldconfig/aux-cache
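
One way to check such a claim is to build twice into two output roots and compare file digests while skipping the paths listed above. The following is only a minimal sketch under that assumption; the function names and the demo layout are hypothetical and not part of the build process described here.

```python
import hashlib
import os
import tempfile

# Paths from the list above, relative to the image root.
VOLATILE_PATHS = {
    "var/cache/edb/mtimedb",
    "var/cache/edb/vdb_metadata.pickle",
    "var/cache/edb/vdb_metadata_delta.json",
    "var/cache/ldconfig/aux-cache",
}
VOLATILE_VDB_NAMES = {"BINPKGMD5", "BUILD_TIME", "CONTENTS"}

def is_known_volatile(rel):
    """True for files that the list above names as non-reproducible."""
    if rel in VOLATILE_PATHS:
        return True
    parts = rel.split("/")
    # var/db/pkg/<category>/<P>/<file>
    return (len(parts) == 6 and parts[:3] == ["var", "db", "pkg"]
            and parts[5] in VOLATILE_VDB_NAMES)

def file_digest(path):
    """SHA-256 of a regular file; symlinks are compared by their target."""
    if os.path.islink(path):
        return "symlink:" + os.readlink(path)
    if not os.path.isfile(path):
        return "special"
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def snapshot(root):
    """Map every non-volatile relative path under root to its digest."""
    result = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            if not is_known_volatile(rel):
                result[rel] = file_digest(full)
    return result

def differing_files(root_a, root_b):
    """Sorted list of paths that are missing or different in either root."""
    a, b = snapshot(root_a), snapshot(root_b)
    return sorted(p for p in a.keys() | b.keys() if a.get(p) != b.get(p))

def demo():
    """Build two tiny fake roots (made-up content) and compare them."""
    with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
        for root, stamp in ((a, "1"), (b, "2")):
            for rel, content in (
                ("usr/bin/same", "identical"),
                ("usr/bin/differs", "build " + stamp),
                ("var/cache/edb/mtimedb", "time " + stamp),
                ("var/db/pkg/sys-apps/foo-1.0/BUILD_TIME", stamp),
            ):
                path = os.path.join(root, rel)
                os.makedirs(os.path.dirname(path), exist_ok=True)
                with open(path, "w") as f:
                    f.write(content)
        return differing_files(a, b)

if __name__ == "__main__":
    print(demo())
```

The sketch compares contents only; file modes, ownership and mtimes would need separate checks.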

Customized Build Process

Since catalyst seems quite hard to deal with (I don't understand the effects of its environment variables and make.conf), a build system that generates a stage3-like output is used instead. It consists of the steps described below:

  • Fetch a stage3 tarball and a later portage snapshot. Alternatively, use the official Docker images.
  • Add the customized portage configuration to /etc/portage
  • Run emerge -uDN @world to bring the build environment up to date.
  • Emerge the packages that must be modified for reproducibility. (I tried to emerge these packages together with the previous step, but it seems that not every specified package was emerged.)
  • Emerge packages into a separate root, for example emerge @world --root=output_path

What is the penalty?

The mtime of the ebuild is used as an input, so the resulting binaries may be identical to those on other machines, which may make attacking Gentoo systems easier.

When Python loads a checked hash-based .pyc file, it also computes the hash of the corresponding source file and checks whether it matches the hash stored in the .pyc file, so loading becomes slower.

How to make it?

To achieve reproducible builds, the build system has to be modified slightly so that the factors that make builds differ are removed. Some identified factors and their mitigations are described below.

The files used for mitigation are assumed to be placed in PORTAGE_CONFIG.

Timestamps in static library

Although static libraries are not recommended in the *nix world, Gentoo still builds some static libraries for several low-level packages, including glibc, tcl and libcap, with the default configuration.

However, .a files contain the timestamps of the object files, which makes them differ across builds. Luckily, binutils, which is responsible for creating .a files, has a configure-time flag --enable-deterministic-archives. When binutils is configured with this flag, it generates .a files with zero timestamps, so the .a files become reproducible. The Gentoo maintainer of binutils has added this option but left it commented out. It can be activated using package-level environment variables in portage:

root #cat ${PORTAGE_CONFIG}/env/binutils-deterministic
EXTRA_ECONF="--enable-deterministic-archives"
root #cat ${PORTAGE_CONFIG}/package.env/binutils
sys-devel/binutils binutils-deterministic
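
To check whether an archive was written deterministically, the member headers of the common ar format can be inspected. The following is a minimal sketch, assuming the usual 60-byte member header layout (name, mtime, uid, gid, mode, size); make_member is a hypothetical helper that only exists to build demo input.

```python
AR_MAGIC = b"!<arch>\n"
HEADER_SIZE = 60

def ar_members(data):
    """Yield (name, mtime, uid, gid) for each member header in an ar archive."""
    if not data.startswith(AR_MAGIC):
        raise ValueError("not an ar archive")
    offset = len(AR_MAGIC)
    while offset + HEADER_SIZE <= len(data):
        header = data[offset:offset + HEADER_SIZE]
        if header[58:60] != b"`\n":
            raise ValueError("corrupt member header")
        name = header[0:16].decode("ascii").rstrip()
        # Special members may leave numeric fields blank; treat blank as 0.
        mtime = int(header[16:28].strip() or 0)
        uid = int(header[28:34].strip() or 0)
        gid = int(header[34:40].strip() or 0)
        size = int(header[48:58].strip())
        yield name, mtime, uid, gid
        # Member data is padded to an even number of bytes.
        offset += HEADER_SIZE + size + (size % 2)

def is_deterministic(data):
    """True if every member has zero mtime, uid and gid."""
    return all(mtime == 0 and uid == 0 and gid == 0
               for _name, mtime, uid, gid in ar_members(data))

def make_member(name, mtime, payload, uid=0, gid=0):
    """Hypothetical helper: build one ar member for the demo below."""
    header = "{:<16}{:<12}{:<6}{:<6}{:<8}{:<10}".format(
        name, mtime, uid, gid, "100644", len(payload)).encode("ascii") + b"`\n"
    return header + payload + (b"\n" if len(payload) % 2 else b"")

if __name__ == "__main__":
    stamped = AR_MAGIC + make_member("a.o/", 1700000000, b"abc", 1000, 1000)
    zeroed = AR_MAGIC + make_member("a.o/", 0, b"abc")
    print(is_deterministic(stamped), is_deterministic(zeroed))
```

For a real check, the bytes of a library such as a libc.a from the output root would be passed to is_deterministic.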

The use of SOURCE_DATE_EPOCH and PYTHONHASHSEED

C/C++ has the __DATE__ and __TIME__ macros, which expand to the date and time at which the preprocessor is invoked.

Python also embeds the mtime of the source file into .pyc files. The mtime is used to check whether the original source file has been modified. These timestamps make reproducible builds impossible.

reproducible-builds.org has proposed SOURCE_DATE_EPOCH to get rid of such problems. GCC and LLVM honor this environment variable [2] [3].
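
A build tool that wants to honor SOURCE_DATE_EPOCH typically replaces its "current time" lookup with the value of the variable when it is set. The following is a minimal sketch of that pattern; build_timestamp is a hypothetical name and is not taken from GCC, LLVM or Portage.

```python
import os
import time
from datetime import datetime, timezone

def build_timestamp(environ=os.environ):
    """Return the time to embed into build output, as an aware UTC datetime.

    Uses SOURCE_DATE_EPOCH (seconds since the Unix epoch) when it is set,
    and falls back to the current time otherwise.
    """
    raw = environ.get("SOURCE_DATE_EPOCH")
    seconds = int(raw) if raw else int(time.time())
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

if __name__ == "__main__":
    print(build_timestamp({"SOURCE_DATE_EPOCH": "86400"}).isoformat())
```

Formatting in UTC matters as much as the fixed value, since a local-time rendering would reintroduce a dependency on the build machine's time zone.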

On the other hand, when Python builds a .pyc file with SOURCE_DATE_EPOCH set to a valid timestamp, it uses a hash of the source file rather than its mtime to check whether the .pyc is still valid.

Note: This behavior degrades performance, since checking whether the mtime is still correct is much faster.
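
The difference between the two invalidation modes can be observed in the 16-byte header of the .pyc file. The following sketch compiles the same source twice with different mtimes and compares the headers; it passes the mode explicitly through py_compile instead of relying on the environment, and the file names are made up for the demo.

```python
import os
import py_compile
import tempfile

def pyc_header(source_path, mode):
    """Compile source_path and return the 16-byte header of the resulting .pyc."""
    with tempfile.TemporaryDirectory() as tmp:
        cfile = os.path.join(tmp, "out.pyc")
        py_compile.compile(source_path, cfile=cfile, doraise=True,
                           invalidation_mode=mode)
        with open(cfile, "rb") as f:
            return f.read(16)

def header_flags(header):
    """Flags word of a .pyc header: bit 0 = hash-based, bit 1 = check source."""
    return int.from_bytes(header[4:8], "little")

def demo():
    """For each mode, report whether the header survives an mtime change."""
    modes = py_compile.PycInvalidationMode
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "mod.py")
        with open(src, "w") as f:
            f.write("x = 1\n")
        for mode in (modes.TIMESTAMP, modes.CHECKED_HASH):
            os.utime(src, (1_000_000_000, 1_000_000_000))
            first = pyc_header(src, mode)
            os.utime(src, (1_500_000_000, 1_500_000_000))
            second = pyc_header(src, mode)
            results[mode.name] = (first == second)
    return results

if __name__ == "__main__":
    print(demo())
```

Only the header is compared here, because it is the part that records either the mtime or the source hash.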

Fedora Linux uses a different technique: it still relies on the mtime, but sets the timestamp of all source files to SOURCE_DATE_EPOCH [4].

In addition, Python randomizes string hashes, so the iteration order of hash-based containers such as sets can differ between runs. The order can be fixed with PYTHONHASHSEED.
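
The effect of PYTHONHASHSEED can be observed by running a child interpreter with a fixed seed and printing a set of strings. This is only an illustrative sketch; SNIPPET and set_order are hypothetical names.

```python
import os
import subprocess
import sys

# Program run in the child interpreter: prints a set in iteration order.
SNIPPET = "print(' '.join({'alpha', 'beta', 'gamma', 'delta'}))"

def set_order(seed):
    """Return the set iteration order seen by a child run with the given seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.run([sys.executable, "-c", SNIPPET], env=env,
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

if __name__ == "__main__":
    # Same seed, same order; different seeds may or may not differ.
    print(set_order(1))
    print(set_order(1))
```

With the same seed, repeated runs of the same interpreter print the same order, which is the property a reproducible build relies on.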

The following patch sets SOURCE_DATE_EPOCH to the mtime of the ebuild file and injects it into the build process. The patch also sets PYTHONHASHSEED to the same value.

root #cat ${PORTAGE_CONFIG}/patches/sys-apps/portage/source-time-epoch.patch
diff -aurN portage-3.0.44.orig/lib/portage/package/ebuild/_config/special_env_vars.py portage-3.0.44/lib/portage/package/ebuild/_config/special_env_vars.py
--- portage-3.0.44.orig/lib/portage/package/ebuild/_config/special_env_vars.py  2023-01-15 22:11:19.000000000 -0000
+++ portage-3.0.44/lib/portage/package/ebuild/_config/special_env_vars.py       2023-06-06 23:16:46.706213851 -0000
@@ -187,6 +187,8 @@
         "ROOT",
         "ROOTPATH",
         "SANDBOX_LOG",
+        "SOURCE_DATE_EPOCH",
+        "PYTHONHASHSEED",
         "SYSROOT",
         "T",
         "TMP",
diff -aurN portage-3.0.44.orig/lib/portage/package/ebuild/doebuild.py portage-3.0.44/lib/portage/package/ebuild/doebuild.py
--- portage-3.0.44.orig/lib/portage/package/ebuild/doebuild.py  2023-01-15 22:11:19.000000000 -0000
+++ portage-3.0.44/lib/portage/package/ebuild/doebuild.py       2023-06-06 17:13:08.494561529 -0000
@@ -346,6 +346,8 @@
     mysettings = settings
     mydbapi = db
     ebuild_path = os.path.abspath(myebuild)
+    source_date_epoch = os.stat(ebuild_path).st_mtime_ns
+    source_date_epoch = source_date_epoch // 1000 ** 3
     pkg_dir = os.path.dirname(ebuild_path)
     mytree = os.path.dirname(os.path.dirname(pkg_dir))
     mypv = os.path.basename(ebuild_path)[:-7]
@@ -420,6 +422,8 @@
         mysettings["PORTAGE_DEBUG"] = "1"

     mysettings["EBUILD"] = ebuild_path
+    mysettings["SOURCE_DATE_EPOCH"] = str(source_date_epoch)
+    mysettings["PYTHONHASHSEED"] = str(source_date_epoch)
     mysettings["O"] = pkg_dir
     mysettings.configdict["pkg"]["CATEGORY"] = cat
     mysettings["PF"] = mypv
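
The value computation in the patch can be tried outside Portage. The sketch below mirrors the integer division of st_mtime_ns; the range check is an addition for illustration (CPython documents PYTHONHASHSEED as an integer in the range 0 to 4294967295), and reproducible_env is a hypothetical name.

```python
import os
import tempfile

MAX_HASH_SEED = 4294967295  # documented upper bound for PYTHONHASHSEED

def epoch_from_mtime(path):
    """Whole seconds of the file's mtime, computed as in the patch above."""
    return os.stat(path).st_mtime_ns // 1000 ** 3

def reproducible_env(ebuild_path):
    """Return the two variables the patch injects, derived from the ebuild mtime."""
    epoch = epoch_from_mtime(ebuild_path)
    # Not part of the patch: guard against values PYTHONHASHSEED cannot hold.
    if not 0 <= epoch <= MAX_HASH_SEED:
        raise ValueError("mtime %d is outside the PYTHONHASHSEED range" % epoch)
    return {"SOURCE_DATE_EPOCH": str(epoch), "PYTHONHASHSEED": str(epoch)}

def demo():
    """Create a fake ebuild with a known mtime and derive the variables."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "foo-1.0.ebuild")
        with open(path, "w") as f:
            f.write("# hypothetical ebuild\n")
        stamp_ns = 1_700_000_000_987_654_321
        os.utime(path, ns=(stamp_ns, stamp_ns))
        return reproducible_env(path)

if __name__ == "__main__":
    print(demo())
```

Because the sub-second part is discarded, two checkouts whose ebuild mtimes differ only in nanoseconds yield the same values.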

Debuginfo stripping

Portage strips executables in parallel. However, the build process of some packages, e.g. binutils and gcc, installs some executables as hard links to other executables. The current Portage implementation produces nondeterministic results in this case. This can be fixed with the following patch:

root #cat ${PORTAGE_CONFIG}/patches/sys-apps/portage/deterministic-strip.patch
diff -aurN portage-3.0.63.orig/bin/estrip portage-3.0.63/bin/estrip
--- portage-3.0.63.orig/bin/estrip      2024-02-25 08:29:43.000000000 +0000
+++ portage-3.0.63/bin/estrip   2024-05-02 10:53:39.562353114 +0000
@@ -508,11 +508,13 @@
 while IFS= read -d '' -r x ; do
        inode_link=$(get_inode_number "${x%.estrip}") || die "stat failed unexpectedly"
        echo "${x%.estrip}" >> "${inode_link}" || die "echo failed unexpectedly"
-done < <(find "${ED}" -name '*.estrip' -delete -print0)
+done < <(find "${ED}" -name '*.estrip' -delete -print0 | sort -z)
 fi
 
 # Now we look for unstripped binaries.
 for inode_link in $(shopt -s nullglob; echo *) ; do
+(
+__multijob_child_init
 while read -r x
 do
 
@@ -521,8 +523,6 @@
                banner=true
        fi
 
-       (
-       __multijob_child_init
        f=$(file -S "${x}") || exit 0
        [[ -z ${f} ]] && exit 0
 
@@ -570,10 +570,9 @@
        if ${was_not_writable} ; then
                chmod u-w "${x}"
        fi
-       ) &
-       __multijob_post_fork
-
 done < "${inode_link}"
+) &
+__multijob_post_fork
 done
 
 # With a bit more work, we could run the rsync processes below in
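
As read here, the patch does two things: it sorts the list of files, and it moves the parallelism from single files to groups of files that share an inode, so hard links to the same data are handled one after another by a single job. The following sketch illustrates that idea only; it is not how estrip is implemented, and the names are hypothetical.

```python
import os
import tempfile
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def group_by_inode(paths):
    """Group paths that are hard links to the same file, in sorted order."""
    groups = defaultdict(list)
    for path in sorted(paths):
        st = os.lstat(path)
        groups[(st.st_dev, st.st_ino)].append(path)
    return sorted(groups.values())

def process_groups(groups, handle, workers=4):
    """Run handle() on every item; items of one group stay in one job."""
    def run(group):
        return [handle(item) for item in group]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run, groups))

def demo():
    """Create 'as', 'ld' and a hard link 'ld.bfd' -> 'ld', then group them."""
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("as", "ld"):
            with open(os.path.join(tmp, name), "w") as f:
                f.write(name)
        os.link(os.path.join(tmp, "ld"), os.path.join(tmp, "ld.bfd"))
        paths = [os.path.join(tmp, n) for n in os.listdir(tmp)]
        return [[os.path.basename(p) for p in g] for g in group_by_inode(paths)]

if __name__ == "__main__":
    print(demo())
```

Groups can still run concurrently, but no two jobs ever touch the same inode at the same time.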

Portage metadata

Portage exports various metadata to /var/db/pkg/<category>/<P>. Some sources of nondeterminism in it have been identified:

environment.bz2 from bash

Bash has several dynamic variables, such as ${RANDOM}, ${SRANDOM}, ${EPOCHREALTIME} and ${EPOCHSECONDS}, whose values change constantly. These variables would be saved to environment.bz2. Use the following patch to drop them:

root #cat ${PORTAGE_CONFIG}/patches/sys-apps/portage/trim-bash-var.patch
diff -aurN portage-3.0.44.orig/bin/save-ebuild-env.sh portage-3.0.44/bin/save-ebuild-env.sh
--- portage-3.0.44.orig/bin/save-ebuild-env.sh  2023-01-15 22:11:19.000000000 -0000
+++ portage-3.0.44/bin/save-ebuild-env.sh       2023-06-10 04:53:29.835669374 -0000
@@ -115,6 +115,9 @@
        # user config variables
        unset DOC_SYMLINKS_DIR INSTALL_MASK PKG_INSTALL_MASK

+       # Always changing variables, causing build nonreproducible
+       unset EPOCHREALTIME EPOCHSECONDS RANDOM SRANDOM
+
        declare -p
        declare -fp
        if [[ ${BASH_VERSINFO[0]} == 3 ]]; then
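
Whether the variables are really gone can be checked by scanning a saved environment.bz2 for their declarations. The following is a minimal sketch, assuming the file contains `declare -p` style lines; the sample content in the demo is made up.

```python
import bz2
import re

VOLATILE = ("EPOCHREALTIME", "EPOCHSECONDS", "RANDOM", "SRANDOM")

# Matches lines such as: declare -i RANDOM="12345"
PATTERN = re.compile(r"^declare\s+-\S+\s+(%s)=" % "|".join(VOLATILE),
                     re.MULTILINE)

def volatile_vars(compressed):
    """Return the volatile variable names declared in a bz2-compressed dump."""
    text = bz2.decompress(compressed).decode("utf-8", errors="replace")
    return sorted(set(PATTERN.findall(text)))

def demo():
    """Scan a made-up environment dump."""
    sample = (
        'declare -x PF="foo-1.0"\n'
        'declare -i RANDOM="12345"\n'
        'declare -- EPOCHSECONDS="1700000000"\n'
    )
    return volatile_vars(bz2.compress(sample.encode("utf-8")))

if __name__ == "__main__":
    print(demo())
```

An empty result for every package under /var/db/pkg would suggest that the patch took effect.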

FEATURES variable

Updated on 2024-09-30: This was also observed by the Chromium developers, and a similar patch has already been merged. [5]

Under certain circumstances, the FEATURES file and the FEATURES variable in environment.bz2 are not ordered deterministically, even though PYTHONHASHSEED is set.

The following patch makes the order of FEATURES deterministic:

root #cat ${PORTAGE_CONFIG}/patches/sys-apps/portage/predictable-features-with-test.patch
diff -aurN portage-3.0.44.orig/lib/portage/package/ebuild/config.py portage-3.0.44/lib/portage/package/ebuild/config.py
--- portage-3.0.44.orig/lib/portage/package/ebuild/config.py    2023-01-15 22:11:19.000000000 -0000
+++ portage-3.0.44/lib/portage/package/ebuild/config.py 2023-06-11 23:16:01.171073495 -0000
@@ -2206,7 +2206,7 @@
                 # "test" is in IUSE and USE=test is masked, so execution
                 # of src_test() probably is not reliable. Therefore,
                 # temporarily disable FEATURES=test just for this package.
-                self["FEATURES"] = " ".join(x for x in self.features if x != "test")
+                self["FEATURES"] = " ".join(x for x in sorted(self.features) if x != "test")

         # Allow _* flags from USE_EXPAND wildcards to pass through here.
         use.difference_update(
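
The changed line joins the feature names after sorting them, so the result no longer depends on the iteration order of the underlying container. A standalone sketch of the same expression follows; features_string is a hypothetical name, not a Portage function.

```python
def features_string(features, drop=("test",)):
    """Join feature names in sorted order, leaving out the dropped ones."""
    return " ".join(x for x in sorted(features) if x not in drop)

if __name__ == "__main__":
    print(features_string({"sandbox", "test", "usersandbox", "buildpkg"}))
```

Any input order of the same names yields the same string, which is what the saved FEATURES value needs.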

Per package

Perl

Perl has perlbug and perlthanks, which are generated at build time with an embedded timestamp. The following patch can be used to get rid of it:

root #cat ${PORTAGE_CONFIG}/patches/dev-lang/perl/reproducible.patch
--- perl-5.36.0.orig/Configure  2023-06-11 12:08:49.131553625 -0000
+++ perl-5.36.0/Configure       2023-06-11 12:11:24.282596273 -0000
@@ -3867,7 +3867,11 @@
 . ./posthint.sh

 : who configured the system
-cf_time=`LC_ALL=C; LANGUAGE=C; export LC_ALL; export LANGUAGE; $date 2>&1`
+if $test -n "${SOURCE_DATE_EPOCH}" ;then
+       cf_time=`LC_ALL=C; LANGUAGE=C; export LC_ALL; export LANGUAGE; $date --date="@${SOURCE_DATE_EPOCH}" 2>&1`
+else
+       cf_time=`LC_ALL=C; LANGUAGE=C; export LC_ALL; export LANGUAGE; $date 2>&1`
+fi
 case "$cf_by" in
 "")
        cf_by=`(logname) 2>/dev/null`
diff -aurN perl-5.36.0.orig/utils/perlbug.PL perl-5.36.0/utils/perlbug.PL
--- perl-5.36.0.orig/utils/perlbug.PL   2020-12-28 16:57:44.000000000 -0000
+++ perl-5.36.0/utils/perlbug.PL        2023-06-11 12:11:55.788746681 -0000
@@ -31,6 +31,9 @@
     or die "Can't find patchlevel.h: $!";

 my $patchlevel_date = (stat _)[9];
+if ( exists $ENV{'SOURCE_DATE_EPOCH'} && defined $ENV{'SOURCE_DATE_EPOCH'} ) {
+    $patchlevel_date = $ENV{'SOURCE_DATE_EPOCH'};
+}

 # TO DO (perhaps): store/embed $Config::config_sh into perlbug. When perlbug is
 # used, compare $Config::config_sh with the stored version. If they differ then

Perl also embeds the kernel version string into Errno.pm and the hostname of the build machine into Config.h. The following patch overrides both with fixed values:

root #cat ${PORTAGE_CONFIG}/patches/dev-lang/perl/reproducible-hostinfo.patch
diff -aurN perl-5.38.2.orig/config.over perl-5.38.2/config.over
--- perl-5.38.2.orig/config.over        2024-05-02 11:05:55.379906469 +0000
+++ perl-5.38.2/config.over     2024-05-02 13:00:22.592634643 +0000
@@ -9,3 +9,8 @@
 done
 ccdlflags="$tmp"
 lddlflags="$lddlflags $LDFLAGS"
+
+osvers="6.6.21-gentoo"
+myuname="gentoolinux"
+myhostname="gentoolinux"
+

This diff is based on [6]. Arch Linux also sets cf_time, but it is still unclear to the author why that works.

vim

Similar to perl, vim embeds the hostname of the build machine into the executable. This can be fixed by setting an emerge-time environment variable:

root #cat ${PORTAGE_CONFIG}/env/vim-reproducible
EXTRA_ECONF="--with-compiledby='Gentoo Linux'"
root #cat ${PORTAGE_CONFIG}/package.env/vim
app-editors/vim vim-reproducible

lsof

lsof traditionally embeds the kernel version string into its executable. This behavior was intended to help deal with behavioral differences between versions of the same OS. However, Linux has a stable procfs interface, so the mechanism is never used there. A PR addressing this was recently merged upstream [7].

kubelet

This package is not in the @world set of stage3, but it was tested anyway and found to be non-reproducible. The reason is still unknown.

Again, about timestamps and binpkgs

There was some discussion about whether Portage could install files with a deterministic timestamp, but the Portage developers said that the timestamp is actually used to check whether a file needs to be updated. (The details would have to be recovered from the IRC history.) The developers said that it may be possible to make binpkgs reproducible, though.

A related question is which time should be used as the timestamp of the emerge process. In this article, the mtime of the ebuild is chosen, but is that reasonable? Should the timestamp change when any of the dependencies is updated?

References