s6 and s6-rc-based init system

From Gentoo Wiki

An s6 and s6-rc-based init system is an init system built using components from the s6 and s6-rc packages, following a general design supported by the s6-linux-init-maker program from package s6-linux-init (sys-apps/s6-linux-init). It can be used as an alternative to sysvinit (sys-apps/sysvinit) + OpenRC, or to systemd.

General setup

Warning
While Gentoo does offer s6, s6-rc and s6-linux-init packages in its official repository, it does not completely support using them to make an init system. Gentoo users wanting to do that might need to use alternative ebuild repositories and/or do some local tweaking.

The general setup of an s6 and s6-rc based init system is as follows:

  1. When the machine boots, all initialization tasks needed to bring it to its stable, normal 'up and running' state are split into a stage1 init and a stage2 init. The stage1 init runs as process 1, and replaces itself with the s6-svscan program from s6 when its work is done. The stage2 init runs as a child of process 1, blocks until s6-svscan starts to execute, and exits when its work is done.
  2. During most of the machine's uptime, s6-svscan runs as process 1 with signal diversion turned on, and there is an s6 supervision tree rooted in process 1, that is launched as soon as s6-svscan starts to execute.
  3. A supervised catch-all logger is launched as part of the supervision tree. The catch-all logger logs messages sent by supervision tree processes to s6-svscan's standard output and error.
  4. The stage2 init initializes the s6-rc service manager and starts a subset of the services defined in its compiled services database. Some of them might carry out part of the machine's initialization tasks.
  5. While s6-svscan is running as process 1, supervised processes and s6-rc-managed services can be controlled with s6 programs, and the s6-rc program's subcommands.
  6. When the administrator wants to initiate the machine's shutdown sequence, a signal is sent to process 1. The BusyBox (sys-apps/busybox) halt, poweroff and reboot applets, or the s6-halt, s6-poweroff and s6-reboot programs from s6-linux-init, can be used for this.
  7. s6-svscan then executes an appropriate diverted signal handler as a child process, which in turn executes a stage2_finish program that performs some of the tasks needed to shut the machine down, and stops all s6-rc-managed services.
  8. When the stage2_finish program exits, the s6-svscan diverted signal handler invokes the s6-svscanctl program, which makes s6-svscan perform its finish procedure, and results in execution of the .s6-svscan/finish file in process 1's scan directory.
  9. The finish process makes the catch-all logger exit cleanly, if it didn't when the supervision tree was brought down by s6-svscan's finish procedure, and then replaces itself with a stage3 init.
  10. The stage3 init runs as process 1 and performs all remaining tasks needed to shut the machine down.
  11. When the stage3 init's work is done, it halts, powers off or reboots the machine as requested.

The boot sequence

The stage1 init

When the machine starts booting (if an initramfs is being used, after it passes control to the 'main' init), a stage1 init executes as process 1. Therefore, to use an s6 and s6-rc-based init system, if the stage1 init is named, for example, s6-gentoo-init and placed in /sbin, an init=/sbin/s6-gentoo-init argument can be added to the kernel's command line using the bootloader's available mechanisms (e.g. a linux command in some 'Gentoo with s6 + s6-rc' menu entry for GRUB2). It is possible to go back to sysvinit + OpenRC, or to any other init system, at any time by reverting the change.

The stage1 init runs with its standard input, output and error redirected to the machine's console. It must do all necessary setup for s6-svscan to be able to run. This includes setting up its scan directory, and because at that point the root filesystem might be the only mounted filesystem, and possibly read-only, the stage1 init must also mount a read-write filesystem to hold s6-svscan and s6-supervise control files that need to be written to. The customary setup of an s6 and s6-rc-based init system uses a run image containing the initial scan directory, that is copied to a tmpfs that the stage1 init mounts read-write, normally on /run. When s6-svscan starts running as process 1, it uses as its scan directory the copy in the tmpfs. The run image can be in a read-only filesystem.

Also, all special files that might be needed by s6-svscan and the stage1 and stage2 inits, such as the /dev/null and /dev/console device nodes, must be made available by the stage1 init before they are needed. Because of this and requirements of programs and libc functions that might be used for machine initialization, the Linux /dev and /proc filesystems will likely have to be mounted by the stage1 init.

Because the stage1 init runs as process 1, if it exits or is killed, there will be a kernel panic and the machine will hang. Therefore, it must be kept simple enough not to fail, because recovery at this stage of initialization is almost impossible. This is why s6 and s6-rc-based init systems split initialization into a stage1 init and a stage2 init. The stage2 init is spawned as a child process by the stage1 init, which, as soon as it finishes its work, replaces itself with s6-svscan using a POSIX exec...() call.

The author of s6 has designed the execline package (dev-lang/execline) so that the stage1 init can be an execline script. The general structure of an execline stage1 script is as follows, or a variation thereof:

CODE Execline stage1 script
#!/bin/execlineb -S0
# 'execlineb -S0' allows the script to use arguments supplied by the kernel as $1, $2, etc.
# If no arguments are used, '-P' can be specified instead of '-S0'.

# Adjust the environment set up by the kernel:
# /bin/s6-envdir -I -- ${stage1_envdir}
# Or at least set a suitable PATH environment variable:
# /bin/export PATH xxx

cd /
s6-setsid -qb
# Set umask:
# umask xxx

ifelse -nX { 
# Initialization.
# ...

# This includes mounting a read-write tmpfs.
# Using mount from util-linux; s6-mount from s6-linux-utils works too:
# if { mount -t tmpfs -o rw,xxx tmpfs ${tmpfsdir} }

# This also includes copying the run image to the tmpfs.
# Using cp from GNU Coreutils; s6-hiercopy from s6-portable-utils works too:
# if { cp -a -- ${run_image} ${tmpfsdir} }
}
{
# Do something if anything in the ifelse block failed, e.g. call sulogin(8) or sh(1).
# ...
}

# Can be done here for both s6-svscan and the stage2 init, or later:
# redirfd -r 0 /dev/null
redirfd -wnb 1 ${logger_fifo}
background
{
   s6-setsid
   redirfd -w 1 ${logger_fifo}
   # stdin: /dev/null or /dev/console
   # stdout: the catch-all logger's FIFO
   # stderr: /dev/console
   # Further file descriptor adjustments can be done here with execline's fdmove,
   # or left to the stage2 init to do it.
   ${stage2_init}
}
# If it hasn't been done yet:
# redirfd -r 0 /dev/null

emptyenv -p
# Set up the supervision tree's environment if desired:
# s6-envdir -I -- ${s6_svscan_envdir}

fdmove -c 2 1
# stdin: /dev/null
# stdout: the catch-all logger's FIFO
# stderr: the catch-all logger's FIFO
s6-svscan -st 0 -- ${tmpfsdir}/${scandir_relpath}

Where:

  • ${stage1_envdir} is the absolute pathname of an environment directory to be used by the stage1 and stage2 init (e.g. /lib/s6-init/env).
  • ${tmpfsdir} is the absolute pathname of the directory where the read-write tmpfs will be mounted (normally /run).
  • ${run_image} is the absolute pathname of the directory where the run image is stored (e.g. /lib/s6-init/run-image in the rootfs).
  • ${logger_fifo} is the absolute pathname of the catch-all logger's FIFO (e.g. ${tmpfsdir}/${scandir_relpath}/s6-svscan-log/fifo).
  • ${stage2_init} is the name (if PATH search would find it) or absolute pathname of the stage2 init (e.g. /lib/s6-init/init-stage2).
  • ${s6_svscan_envdir} is the absolute pathname of an environment directory used to set up the supervision tree's initial environment (e.g. /etc/s6-svscan/env).
  • ${scandir_relpath} is the pathname, relative to ${tmpfsdir}, of process 1's scan directory (e.g. s6/service, so absolute pathname would be /run/s6/service).

Gentoo's official repository does not supply any package with a stage1 init for s6 and s6-rc-based init systems. Users must create one from scratch or take it from somewhere else (e.g. alternative ebuild repositories). The s6-linux-init-maker program from s6-linux-init can create a minimal execline stage1 script with the aforementioned structure, which uses programs from packages s6-portable-utils (sys-apps/s6-portable-utils) and s6-linux-utils (sys-apps/s6-linux-utils), and can be used as a basis for writing a custom or more elaborate one, if so desired. The scan directory set up by the s6-linux-init-maker stage1 script is named service (so by default, its absolute pathname would be /run/service), and the only additional initialization the script does is optionally mounting a devtmpfs on /dev, and optionally dumping the kernel's environment in an environment directory using s6-dumpenv from s6-portable-utils.

The stage2 init

The stage2 init is spawned by the stage1 init as a child process, and is blocked from running until the latter replaces itself with s6-svscan. To achieve this, the child process of the stage1 init opens the catch-all logger's FIFO for writing using the POSIX open() call. The call will block until some other process opens the FIFO for reading. The catch-all logger is a supervised process, so it starts executing when s6-svscan does, and opens the FIFO for reading, thereby unblocking the process, which then replaces itself with the stage2 init.
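This rendezvous relies only on standard FIFO open() semantics, and can be reproduced with a plain POSIX shell sketch (all names and messages here are illustrative, not part of any s6 interface):

```shell
#!/bin/sh
# Demonstration of the FIFO-based synchronization described above.
dir=$(mktemp -d)
fifo="$dir/logger-fifo"
mkfifo -m 0600 "$fifo"

# Stand-in for the stage1 init's child: its open() for writing blocks
# until some other process opens the FIFO for reading.
( exec > "$fifo"; echo "unblocked: a reader appeared" ) &

sleep 1                # the writer is still blocked here: no reader exists yet
read line < "$fifo"    # stand-in for the catch-all logger opening the read end
wait
echo "$line"
```

In the real init system, the blocked child process, once unblocked, replaces itself with the stage2 init instead of printing a message.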

The stage2 init executes with s6-svscan as process 1, and performs all remaining initialization tasks needed to bring the machine to its stable, normal 'up and running' state. It can execute with a few vital supervised long-lived processes already running, started as part of process 1's supervision tree, including the catch-all logger. Part of the remaining initialization is creating the s6-rc service manager's live state directory using the s6-rc-init program, which can't be done until s6-svscan is running. This program takes the pathname of a compiled services database as an argument (or defaults it to /etc/s6-rc/compiled), as well as the pathname of process 1's scan directory. So a suitable services database must exist and be available at least in a read-only filesystem. This is the boot-time services database. The live state directory must be in a read-write filesystem, and the customary setup of an s6 and s6-rc-based init system has s6-rc-init create it in the read-write tmpfs mounted by the stage1 init.

s6-rc-init also copies to the live state directory all s6-rc longruns' compiled s6 service directories, creates symbolic links to them in process 1's scan directory, and uses an s6-svscanctl -a command to trigger a scan. The scan makes process 1 spawn an s6-supervise child for each longrun, but because s6-rc-compile produces s6 service directories that contain a down file, the longrun doesn't execute yet.

The initial state of all s6-rc services, as set by s6-rc-init, is 'down'. So the stage2 init must also start all atomic services (oneshots and longruns) that are needed to complete the machine's initialization, if any, and the longruns that are wanted up at the end of the boot sequence. This is normally done by defining a service bundle in the boot-time services database that groups these atomic services, and having the stage2 init start them with an s6-rc -u change command naming the bundle. This bundle would be the s6-rc counterpart of OpenRC's default runlevel, systemd's default.target unit, or nosh's normal target bundle directory.
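Putting these steps together, a minimal stage2 init written as a shell script might look like the following sketch. It cannot run standalone (it needs the s6-rc binaries and a running s6-svscan), and the database symlink, live directory, scan directory pathnames and the 'default' bundle name are illustrative assumptions, not values fixed by s6-rc:

```shell
#!/bin/sh -e
# Hypothetical minimal stage2 init (all pathnames and the bundle name are examples).

# Remaining machine initialization (mounts, hostname, sysctl, ...) would go here.

# Create the s6-rc live state directory in the tmpfs mounted by the stage1 init,
# using the boot-time compiled services database:
s6-rc-init -c /etc/s6-rc/boot -l /run/s6-rc /run/s6/service

# Bring up all atomic services grouped under the boot-time bundle:
s6-rc -l /run/s6-rc -u change default
```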

When the stage2 init finishes its work, it exits and gets reaped by s6-svscan. The stage2 init can be, and normally is, an execline or shell script. Gentoo's official repository does not supply any package with a stage2 init for s6 and s6-rc-based init systems. Users must create one from scratch or take it from somewhere else (e.g. alternative ebuild repositories). The s6-linux-init package contains an example execline stage2 script: the examples/rc.init file in the package's /usr/share/doc subdirectory.

The catch-all logger

In the context of an s6 and s6-rc-based init system, the catch-all logger is a supervised long-lived process that logs messages sent by supervision tree processes to s6-svscan's standard output and error, normally in an automatically rotated logging directory. In a logging chain arrangement, the leaf processes of a supervision tree normally have dedicated loggers that collect and store messages sent to the process' standard output and error in per-service logs. Messages from s6-svscan, s6-supervise processes, logger processes themselves, and leaf processes that exceptionally don't have a logger, are printed on process 1's standard output or error, which, at the beginning of the boot sequence, are redirected to the machine's console. It is possible to redirect them later so that the messages are delivered to the catch-all logger, using a setup that involves a FIFO. Only the catch-all logger's standard error remains redirected to the machine's console, as a last resort.

An s6 and s6-rc-based init system has a FIFO somewhere in the filesystem, reserved for the catch-all logger. The FIFO is owned by root and has permissions 0600 (i.e. the output of ls -l displays prw-------).

The run image that is copied to the read-write tmpfs mounted by the stage1 init contains s6-svscan's initial scan directory, with at least a service directory for the catch-all logger already present, and possibly an additional service directory for an agetty process or similar. The former ensures that the catch-all logger is launched as soon as s6-svscan starts executing as process 1; the latter makes it possible to log in to the machine if the supervision tree starts successfully, even if something else fails (e.g. s6-rc's setup).

The code of the catch-all logger's run file opens the FIFO for reading, redirects its standard input to it and its standard error to /dev/console, drops privileges (e.g. by invoking s6-setuidgid or s6-applyuidgid if it is a script) and calls the logger program, which is normally s6-log. The logging directory is owned by the logger's effective user after dropping privileges, and normally has permissions 2700 (i.e. the output of ls -l displays drwx--S---).

Because it is possible to have a setup where a read-only rootfs is the only filesystem available, the logging directory is also normally placed in the read-write tmpfs mounted by the stage1 init, unless a different read-write filesystem can be guaranteed to exist before s6-svscan starts executing as process 1 (e.g. /var/log/s6-svscan is used, but /var is guaranteed to be in the rootfs and either the kernel mounts the rootfs read-write or the stage1 init remounts it read-write, or /var is a filesystem mounted read-write by the stage1 init or the initramfs, etc.). If the logging directory is in the aforementioned tmpfs, it must be created with appropriate owner and permissions by the code of the catch-all logger's run file, or be present as an empty directory with appropriate owner and permissions in the run image copied to the tmpfs.
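As an illustration of this description, a catch-all logger run file written as a shell script could look like the following sketch; the s6log user name, the FIFO location and the uncaught-logs directory are assumptions matching the examples in this article:

```shell
#!/bin/sh
# Hypothetical catch-all logger run file (user name and pathnames are examples).
exec < /run/s6/service/s6-svscan-log/fifo   # open the FIFO for reading on stdin
exec 2> /dev/console                        # keep stderr on the console as a last resort
exec s6-setuidgid s6log \
     s6-log -bp t /run/uncaught-logs        # -p: ignore SIGTERM; t: prepend TAI64N timestamps
```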

The stage1 init redirects its standard output and error to the catch-all logger's FIFO before replacing itself with s6-svscan. However, opening a FIFO for writing is an operation that blocks until some other process opens it for reading, and a POSIX non-blocking open() call fails with an error status if it specifies the 'open for writing only' flag (O_WRONLY) and there is no reader. Execline's redirfd program was written in a way that specifically addresses this problem: it is a chain loading program that, if invoked with options -w, -n and -b, will execute the next program in the chain with the specified file descriptor open for writing and without blocking, even if the specified pathname corresponds to a FIFO and there is no reader.

The s6-log program supports a -p option that makes it ignore the SIGTERM signal, so that it can't get killed that way. If s6-log is used as the catch-all logger program and, to minimize the risk of losing logs, was invoked with this option, a special procedure is needed to make it exit cleanly. When its parent s6-supervise process receives a SIGTERM signal while the supervision tree is being brought down by s6-svscan's finish procedure, it sends s6-log a SIGTERM signal followed by a SIGCONT signal, but s6-log ignores the SIGTERM and keeps running, and s6-supervise doesn't exit until its supervised process does. For this reason, the s6-svc program supports a special option, -X (capital 'x'), that works like -x (lowercase 'x') but also makes s6-supervise redirect its standard input, output and error to /dev/null. The code of process 1's finish file uses an s6-svc -X command with the catch-all logger's service directory as the argument. By the time finish runs, s6-svscan and all other s6-supervise processes will normally have exited, so this leaves the catch-all logger's FIFO with no writers, causing s6-log to detect end-of-file on its standard input and exit.

Gentoo's official repository does not supply any package with a catch-all logger service directory for s6 and s6-rc-based init systems. Users must create one from scratch or take it from somewhere else (e.g. alternative ebuild repositories). The s6-linux-init-maker program from s6-linux-init can create a catch-all logger service directory named s6-svscan-log, that can be used as a basis for writing a custom or more elaborate one, if so desired. The s6-linux-init-maker catch-all logger uses s6-log with the -p option, and logs to a subdirectory named uncaught-logs of the tmpfs mounted by the s6-linux-init-maker stage1 script. The logger's FIFO is named fifo and is located in its service directory.

Shutdown and reboot

Signals and the stage2_finish program

An s6 and s6-rc-based init system is asked to initiate the shutdown sequence by sending signals to process 1. Because the program running as process 1 is s6-svscan with signal diversion turned on, the signals must be chosen from the set it can divert. The BusyBox halt, poweroff and reboot applets, and the s6-halt, s6-poweroff and s6-reboot programs from s6-linux-init, are capable of sending suitable signals to process 1:

Operation   BusyBox signal   s6-linux-init signal
Halt        SIGUSR1          SIGUSR2
Poweroff    SIGUSR2          SIGUSR1
Reboot      SIGTERM          SIGINT


When process 1 receives such a signal, the corresponding diverted signal handler is executed as a child process. The handler then calls a stage2_finish program that performs part of the tasks needed to shut the machine down. Generally speaking, the stage2_finish program undoes what the stage2 init has done at boot time. This part of the machine's shutdown sequence can be carried out by s6-rc services and can use s6 tools, since s6-svscan is still running. However, all s6-rc-managed services have to be stopped (normally with an s6-rc -da change command) before the stage2_finish program exits, because s6-svscan will stop running after it does, and s6-rc does not work without an s6 supervision tree. The stage2_finish program can be, and normally is, an execline or shell script.

The general structure of an execline diverted signal handler script is as follows, or a variation thereof:

FILE ${tmpfsdir}/${scandir_relpath}/.s6-svscan/SIGxxx: Execline diverted signal handler script
#!/bin/execlineb -P
foreground { ${stage2_finish} }
s6-svscanctl ${option} .

Where:

  • ${tmpfsdir} is the absolute pathname of the directory where the stage1 init mounted the read-write tmpfs (normally /run).
  • ${scandir_relpath} is the pathname, relative to ${tmpfsdir}, of process 1's scan directory (e.g. s6/service, so absolute pathname would be /run/s6/service).
  • ${stage2_finish} is the name (if PATH search would find it) or absolute pathname of the stage2_finish program (e.g. /lib/s6-init/stage2_finish).
  • ${option} is the s6-svscanctl option for the operation corresponding to the signal:
    • -0 or -st for halt.
    • -7 or -pt for poweroff.
    • -6 or -rt for reboot.

The s6-linux-init-maker program from s6-linux-init can create execline handler scripts for all s6-svscan diverted signals, compatible with s6-halt, s6-poweroff and s6-reboot. They can also be made to work with the BusyBox halt, poweroff and reboot applets by swapping the SIGUSR1 and SIGUSR2 handlers.

Gentoo's official repository does not supply any package with a stage2_finish program for s6 and s6-rc-based init systems. Users must create one from scratch or take it from somewhere else (e.g. alternative ebuild repositories). The s6-linux-init package contains an example execline stage2_finish script: the examples/rc.tini file in the package's /usr/share/doc subdirectory.

This means that s6-svscan is not directly compatible with sysvinit's telinit, halt, poweroff, reboot, and shutdown commands. However, many programs (e.g. those from desktop environments) expect to be able to call programs with those names during operation, so if such a thing is needed, it is possible to use compatibility execline scripts:

FILE shutdown
#!/bin/execlineb -P
# For BusyBox:
# busybox poweroff
# For s6-linux-init:
# s6-poweroff
FILE reboot
#!/bin/execlineb -P
# For BusyBox:
# busybox reboot
# For s6-linux-init:
# s6-reboot

The stage3 init

When the stage2_finish program exits, the s6-svscan diverted signal handler that invoked it then calls the s6-svscanctl program with an appropriate option to make s6-svscan perform its finish procedure. s6-svscan executes the finish file in the .s6-svscan control subdirectory of its scan directory, using the POSIX execve() call, passing a halt, poweroff or reboot argument to it. finish executes as process 1, redirects its standard output and error to /dev/console, uses an s6-svc -X command to make the catch-all logger exit cleanly, and replaces itself with a stage3 init, again using a POSIX exec...() call, passing along the argument supplied by s6-svscan. Alternatively, the stage3 init code might be part of the finish file, in which case that file would be considered the stage3 init.

The general structure of a process 1 execline finish script is as follows, or a variation thereof:

FILE ${tmpfsdir}/${scandir_relpath}/.s6-svscan/finish: Execline finish script
#!/bin/execlineb -S0
cd /
redirfd -w 2 /dev/console
fdmove -c 1 2
foreground { s6-svc -X -- ${tmpfsdir}/${scandir_relpath}/${logger_servicedir} }
unexport ?
wait -r { }
${stage3_init} $@

Where:

  • ${tmpfsdir} is the absolute pathname of the directory where the stage1 init mounted the read-write tmpfs (normally /run).
  • ${scandir_relpath} is the pathname, relative to ${tmpfsdir}, of process 1's scan directory (e.g. s6/service, so absolute pathname would be /run/s6/service).
  • ${logger_servicedir} is the name of the catch-all logger's service directory (e.g. s6-svscan-log, so absolute pathname would be /run/s6/service/s6-svscan-log).
  • ${stage3_init} is the name (if PATH search would find it) or absolute pathname of the stage3 init (e.g. /lib/s6-init/init-stage3).

The s6-linux-init-maker program from s6-linux-init can create a suitable process 1 execline finish script.

The stage3 init runs as process 1 to perform all remaining tasks needed to shut the machine down. It must also kill all other processes that are still running at that point, after a grace period to allow them to exit on their own, so that filesystems can be synced and unmounted, or remounted read-only. This can be done with a POSIX kill() call specifying -1 as the process ID argument, usually sending a SIGTERM signal followed by a SIGCONT signal first, waiting for a short period of time, and then sending a SIGKILL signal. Because the stage3 init runs as process 1, and process 1 does not get killed by a kill(-1, SIGKILL) call, it continues executing after that. Sending a SIGKILL signal to all processes from a non-PID 1 process that is expected to continue running is much harder.

The stage3 init can be, and normally is, an execline or shell script. The kill program provided by the GNU Coreutils package (sys-apps/coreutils), the util-linux package (sys-apps/util-linux) or the procps package (sys-process/procps) can be used in such a script as kill -TERM -1, kill -CONT -1 and kill -KILL -1 (the last form will also kill the kill process itself, but not the stage3 init). The s6-nuke program from the s6-portable-utils package can also be used in such a script, as s6-nuke -t (SIGTERM + SIGCONT) and s6-nuke -k (SIGKILL). A shell stage3 script that invokes a shell with a builtin kill utility works too; in that case, process 1 will be a shell process that sends the signals itself. A wait -r {} command can be used in an execline stage3 script to reap all resulting zombie processes.
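The kill sequence described above could be sketched in a shell stage3 script like this excerpt (the grace period is illustrative, and this must only ever run as process 1 during shutdown, never on a live system):

```shell
#!/bin/sh
# Hypothetical excerpt of a shell stage3 init; the shell here runs as process 1.
kill -TERM -1     # ask all remaining processes to exit...
kill -CONT -1     # ...and wake up stopped ones so they can handle the SIGTERM
sleep 2           # illustrative grace period
kill -KILL -1     # forcibly kill whatever is left; process 1 itself survives
sync
# Filesystems can now be unmounted or remounted read-only, and then the
# halt, poweroff or reboot operation is performed.
```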

When the stage3 init finishes its work, it performs the halt, poweroff or reboot operation with a Linux reboot() call. If it is a script, it can use the BusyBox halt, poweroff and reboot applets, or the s6-halt, s6-poweroff and s6-reboot programs from s6-linux-init, passing them an -f (force) option and the argument supplied by s6-svscan:

CODE Execline stage3 script
#!/bin/execlineb -S0
# Shutdown tasks
# ...
# For BusyBox:
# busybox $1 -f
# For s6-linux-init:
# s6-$1 -f

Gentoo's official repository does not supply any package with a stage3 init for s6 and s6-rc-based init systems. Users must create one from scratch or take it from somewhere else (e.g. alternative ebuild repositories). The s6-linux-init package contains an example execline stage3 script: the examples/rc.shutdown file in the package's /usr/share/doc subdirectory.

Service management

On an s6 and s6-rc-based init system, the s6-rc package is used for service management. In particular, the administrator can replace the init system's compiled services database with a new one using the s6-rc-update program, and can create a new boot-time services database, to be used next time the machine boots, with the s6-rc-compile program and a set of service definitions in the program's supported source format. It is best to have the s6-rc-init invocation in the stage2 init use a symbolic link as the compiled services database pathname, so that the boot-time database can be changed by modifying the symlink instead of the stage2 init code, e.g. by having an /etc/s6-rc/db directory for storing one or more compiled databases, making /etc/s6-rc/boot a symbolic link to one of those databases, and using the symlink in the s6-rc-init invocation.
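For example, with hypothetical database names initial and updated, the symlink scheme from the paragraph above can be managed like this (shown in a scratch directory so the sketch is self-contained):

```shell
#!/bin/sh
# Hypothetical layout mirroring /etc/s6-rc (the database names are examples).
root=$(mktemp -d)
mkdir -p "$root/etc/s6-rc/db/initial" "$root/etc/s6-rc/db/updated"

# Point the boot-time database symlink at the database to use on next boot:
ln -s db/initial "$root/etc/s6-rc/boot"

# Later, switch databases by replacing only the symlink
# (-n replaces the symlink itself instead of descending into its target):
ln -sfn db/updated "$root/etc/s6-rc/boot"
readlink "$root/etc/s6-rc/boot"
```

The s6-rc-init invocation in the stage2 init would then always name /etc/s6-rc/boot, and only the symlink needs to change.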

It is possible to have long-lived processes not managed by s6-rc but supervised by process 1, by directly managing s6 service directories, placing them (or symbolic links to them) in process 1's scan directory, and using s6-svscanctl -a, s6-svscanctl -n or s6-svscanctl -N commands as needed. It is also possible to use s6-svscan as process 1 and just s6 tools, without s6-rc, but then the init system becomes more like runit. In that case, executing s6-svscan with signal diversion turned on is not necessary.
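As a sketch (the service name and pathnames are illustrative, and a running s6-svscan on the scan directory is required), installing such an unmanaged service directory could look like:

```shell
#!/bin/sh
# Hypothetical example of adding a supervised-but-unmanaged service.
cp -a /etc/s6-extra/mydaemon /run/s6/service/   # install the service directory...
s6-svscanctl -a /run/s6/service                 # ...and trigger a rescan to pick it up
```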

s6 service directories and s6-rc service definitions for anything not supplied in packages from Gentoo's official repository must be created by the administrator, either from scratch or taken from somewhere else (e.g. alternative ebuild repositories).

See also