User:Doskanoness/Unprivileged LXC containers

Unprivileged containers
Unprivileged containers are the safest containers. Ordinary privileged LXC containers should be considered unsafe: although they run in separate namespaces, UID 0 in the container is still UID 0 (root) outside of it. If you somehow get access to any host resource through proc, sys or some random syscall, you can potentially escape the container, and then you are root on the host. User namespaces were designed to address exactly that. Each user who is allowed to use them gets assigned a range of unused UIDs and GIDs on the host. An unprivileged LXC container then maps, for instance, user and group IDs 0 through 65,000 in the container to the IDs 100,000 through 165,000 on the host, so UID 0 (root) in the container is UID 100,000 outside of it. If something goes wrong and an attacker manages to escape the container, they end up with about as many rights as the nobody user.
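The mapping described above is plain offset arithmetic. The following is a minimal sketch of it, not LXC code: the function name `container_to_host_id` is hypothetical, and the default host start of 100,000 and range size of 65,536 are illustrative assumptions rather than values read from a real system.

```python
def container_to_host_id(container_id: int,
                         host_start: int = 100_000,
                         count: int = 65_536) -> int:
    """Translate a UID or GID inside the container to the ID seen on the host.

    host_start and count are illustrative assumptions; real values come
    from the range allocated to the user on the host.
    """
    if not 0 <= container_id < count:
        # IDs outside the mapped set have no host equivalent.
        raise ValueError(f"ID {container_id} is outside the mapped range")
    return host_start + container_id


print(container_to_host_id(0))     # container root as seen on the host
print(container_to_host_id(1000))  # a regular container user
```

An ID outside the mapped range raises an error here, which mirrors the rule that operations against unmapped UIDs and GIDs are not allowed.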

The standard paths also have their unprivileged equivalents:


 * /etc/lxc/lxc.conf => ~/.config/lxc/lxc.conf
 * /etc/lxc/default.conf => ~/.config/lxc/default.conf
 * /var/lib/lxc => ~/.local/share/lxc
 * /var/lib/lxcsnaps => ~/.local/share/lxcsnaps
 * /var/cache/lxc => ~/.cache/lxc
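
As a small illustration of the list above, the table can be expressed as a lookup from a system path to its per-user equivalent. This is only a sketch: the names `UNPRIVILEGED_PATHS` and `unprivileged_path` are hypothetical, and the home directory `/home/lxc` is an assumed example.

```python
from pathlib import Path

# System-wide path -> location relative to the user's home directory,
# copied from the list above.
UNPRIVILEGED_PATHS = {
    "/etc/lxc/lxc.conf": ".config/lxc/lxc.conf",
    "/etc/lxc/default.conf": ".config/lxc/default.conf",
    "/var/lib/lxc": ".local/share/lxc",
    "/var/lib/lxcsnaps": ".local/share/lxcsnaps",
    "/var/cache/lxc": ".cache/lxc",
}


def unprivileged_path(system_path: str, home: Path) -> Path:
    """Return the per-user equivalent of a privileged LXC path."""
    return home / UNPRIVILEGED_PATHS[system_path]


# /home/lxc is an assumed home directory used only for illustration.
print(unprivileged_path("/var/lib/lxc", Path("/home/lxc")))
```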

Your user, while it can create new user namespaces in which it'll be UID 0 and will have some of root's privileges against resources tied to that namespace, will not be granted any extra privilege on the host. Unfortunately, this also means that the following common operations aren't allowed:


 * Mounting most filesystems.
 * Creating device nodes.
 * Any operation against a UID/GID outside of the mapped set.

This also means that your user won't be allowed to create new network devices on the host or change bridge configuration. To work around that, the LXC team wrote a tool called “lxc-user-nic”, which is the only setuid binary in LXC 1.0 and performs one simple task: it parses a configuration file and, based on its content, creates network devices for the user and bridges them. To prevent abuse, you can restrict the number of devices a user can request and the bridges they may be added to by editing that configuration file.

Prerequisites
Prerequisites for well-functioning unprivileged containers include:


 * Kernel 3.13 plus a couple of staging patches, or a later version
 * User namespaces enabled in the kernel (CONFIG_USER_NS=y)
 * A very recent version of shadow that supports subuid/subgid (sys-apps/shadow-4.2.1 or later)
 * Per-user cgroups on all controllers
 * LXC 1.0 or higher
 * A version of PAM with a loginuid patch (it is a dependency of the recent shadow version mentioned above, so it is installed automatically together with shadow-4.2.1)
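
Two of the prerequisites above lend themselves to a quick scripted check. The sketch below is an illustration under assumptions, not an official tool: the function names are hypothetical, the kernel release string and the kernel configuration text are passed in by the caller, and passing the version check alone does not prove that the staging patches mentioned above are present.

```python
import re


def kernel_at_least(release: str, minimum=(3, 13)) -> bool:
    """Check that a release string such as '3.13.0-gentoo' is >= minimum."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum


def user_ns_enabled(config_text: str) -> bool:
    """Check that the kernel configuration text contains CONFIG_USER_NS=y."""
    return any(line.strip() == "CONFIG_USER_NS=y"
               for line in config_text.splitlines())


# Sample inputs chosen for illustration only.
print(kernel_at_least("3.13.0-gentoo"))
print(user_ns_enabled("CONFIG_PID_NS=y\nCONFIG_USER_NS=y\n"))
```

On a real host, the release string could come from `platform.release()`, and the configuration text from wherever the running kernel's configuration is available on that system.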

LXC pre-built containers
Because of the limitations mentioned above, you won't be allowed to create a block or character device in a user namespace, as being allowed to do so would let you access anything on the host. The same goes for some filesystems: you won't, for example, be allowed to do loop mounts or mount an ext partition, even if you can access the block device. Those limitations are a big problem during the initial bootstrap of a container, as tools like debootstrap, yum, … usually try to perform some of those restricted actions and will fail pretty badly.

Some templates may be tweaked to work, and a workaround such as a modified fakeroot could be used to bypass some of those limitations, but the current state is that most distribution templates (including Gentoo) simply won't work in unprivileged containers. Instead, you should use the "download" template, which provides pre-built images of the distributions that are known to work in such an environment. Rather than assembling the rootfs and configuration locally, this template contacts a server that hosts daily pre-built rootfs and configuration for the most common templates.

Those images are built on the LXC project's Jenkins server. The build process is pretty straightforward: a basic chroot is assembled, the current git master is downloaded and built, and the standard templates are run with the right release and architecture. The resulting rootfs is compressed, and a basic config and metadata (expiry, files to template, …) are saved. The result is then pulled by the LXC project's main server, signed with a dedicated GPG key and published on the public webserver.

The client side is a simple template that contacts the server over HTTPS (the domain is also DNSSEC-enabled and available over IPv6), grabs signed indexes of all the available images, and checks whether the requested combination of distribution, release, and architecture is supported. If it is, the template grabs the rootfs and metadata tarballs, validates their signatures and stores them in a local cache. Any container creation after that point uses that cache until the cache entries expire, at which point a new copy is fetched from the server. You can also use the "--flush-cache" parameter to flush the local copy (if present).

The template has been carefully written to work on any system that has a POSIX-compliant shell. gpg is recommended but can be disabled if your host doesn't have it (at your own risk). The current list of images can be requested from the template:

While the template was designed to work around the limitations of unprivileged containers, it works just as well with system containers, so even on a system that doesn’t support unprivileged containers you can do:

And you'll get a new container running the latest build of Ubuntu 15.04 Vivid Vervet amd64.

Configuring unprivileged LXC
Install the required packages:

Create files necessary for assigning subuids and subgids:

Create a new user if not yet created, set its password, and log in. In this example we are using the username "lxc":

Make sure your user has a UID and GID map defined in the subuid and subgid files:

On Gentoo, a default allocation of 65536 UIDs and GIDs is given to every new user on the system, so you should already have one. If not, you'll have to assign a set of subuids and subgids for a user manually:

That last one is required because LXC needs it to access after it switched to the mapped UIDs. If you’re using ACLs, you may instead use “u:100000:x” as a more specific ACL.

Now create  with the following content:

The last two lines mean that you have one UID map and one GID map defined for the container, which map UIDs and GIDs 0 through 65,535 in the container to UIDs and GIDs 100,000 through 165,535 on the host (a range of 65,536 IDs). Those values should match the ones found in the subuid and subgid files; the values above are just illustrative.
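
To show how an allocation and the container's map lines relate, here is a hedged sketch that parses an entry in the `name:start:count` format used for subordinate ID allocations and prints matching map lines. The helper names are hypothetical, the sample entry `lxc:100000:65536` is illustrative, and the configuration key is an assumption: `lxc.id_map` is the spelling used by LXC 1.x, while newer releases spell it `lxc.idmap`, so check the key your version expects.

```python
def parse_subid_line(line: str):
    """Split a 'name:start:count' allocation entry into its three fields."""
    name, start, count = line.strip().split(":")
    return name, int(start), int(count)


def id_map_lines(uid_start: int, uid_count: int,
                 gid_start: int, gid_count: int,
                 key: str = "lxc.id_map"):
    """Build one UID and one GID map line, both starting at container ID 0.

    The default key is an assumption for LXC 1.x; pass key="lxc.idmap"
    for releases that use the newer spelling.
    """
    return [
        f"{key} = u 0 {uid_start} {uid_count}",
        f"{key} = g 0 {gid_start} {gid_count}",
    ]


# Illustrative allocation entries, not read from a real system.
_, uid_start, uid_count = parse_subid_line("lxc:100000:65536")
_, gid_start, gid_count = parse_subid_line("lxc:100000:65536")

for line in id_map_lines(uid_start, uid_count, gid_start, gid_count):
    print(line)

# Last host ID covered by the range: start + count - 1.
print(uid_start + uid_count - 1)
```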

And with:

This declares that the user “lxc” is allowed to create up to 2 veth-type devices and add them to the bridge called br0.1.
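
The quota rule just described can be modelled in a few lines. This is a sketch of the idea only, not how lxc-user-nic is implemented: it assumes entries of the form `user type bridge count`, and the names `NicQuota`, `parse_usernet` and `may_add_nic` are hypothetical.

```python
from typing import List, NamedTuple


class NicQuota(NamedTuple):
    user: str
    link_type: str
    bridge: str
    limit: int


def parse_usernet(text: str) -> List[NicQuota]:
    """Parse 'user type bridge count' entries, skipping blank lines."""
    quotas = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        user, link_type, bridge, limit = line.split()
        quotas.append(NicQuota(user, link_type, bridge, int(limit)))
    return quotas


def may_add_nic(quotas: List[NicQuota], user: str, bridge: str,
                in_use: int) -> bool:
    """True if the user may attach one more device to the given bridge."""
    return any(q.user == user and q.bridge == bridge and in_use < q.limit
               for q in quotas)


quotas = parse_usernet("lxc veth br0.1 2\n")
print(may_add_nic(quotas, "lxc", "br0.1", in_use=1))  # one slot left
print(may_add_nic(quotas, "lxc", "br0.1", in_use=2))  # quota exhausted
print(may_add_nic(quotas, "lxc", "br1", in_use=0))    # bridge not listed
```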

Don't forget to add the directory containing the LXC commands to the PATH environment variable, either system-wide so that it takes effect for all users or only for the current user. Otherwise the lxc-* commands will not work in your user environment. This is not needed for lxc-1.1.0-r5, lxc-1.1.1 and later versions, because they install the commands into a standard path. Example:

Now let’s create our first unprivileged container with:

Don't forget to change the root password of the unprivileged container by running the following commands as your user:

Then you can log in with your new password as usual, again as your user:

If you get the error "Permission denied, can't create directory /sys/fs/cgroup/alpha", please see the LXC section.

P.S. To be completed: a "Creating cgroups" section has to be added, with or without cgmanager, for OpenRC and systemd respectively (for now, see the "Creating cgroups" paragraph there as an example).

OpenRC configuration pre-check
For systems booted with OpenRC, check that OpenRC mounts cgroups v2.

Open  and check those line:

By default (with the rc_cgroup_mode line commented out), the mode is set to "hybrid".
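
The check above can be scripted. The sketch below reads the effective `rc_cgroup_mode` from the text of an OpenRC configuration file and falls back to "hybrid" when the line is commented out or absent. The function name is hypothetical, and the caller is assumed to supply the file contents.

```python
import re


def rc_cgroup_mode(rc_conf_text: str, default: str = "hybrid") -> str:
    """Return the active rc_cgroup_mode value, or the default if unset."""
    mode = default
    for line in rc_conf_text.splitlines():
        # Commented lines start with '#', so they never match here.
        match = re.match(r"""\s*rc_cgroup_mode=(["']?)(\w+)\1""", line)
        if match:
            mode = match.group(2)  # the last assignment wins
    return mode


# Sample configuration snippets used for illustration.
print(rc_cgroup_mode('#rc_cgroup_mode="hybrid"\n'))
print(rc_cgroup_mode('rc_cgroup_mode="unified"\n'))
```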

Namespace create script (cgroupv2)
On systems without systemd, an external script has to create the user cgroup namespace manually. In our case, it should create all the required directories for the lxc user, give that user permission on them, and move the currently active bash shell into the user's cgroup namespace.

Instead of creating cgroup namespaces manually, we can use libcgroup which will make managing the cgroup namespaces much easier.

Install the required packages:

Add the following code to the file :

Make sure file  contains line

Now, lxc should manage cgroups by itself and both systemd and non-systemd containers should work.

Validate configuration
After logging in again as the user lxc, that user should have the lxc namespace. Let's verify it:

The output should be:

Create container example
Now we can execute any lxc-* command as the lxc user without any permission problems. For example: