Why does nspawn need two child processes?
Luke Shumaker
2017-06-01 00:40:38 UTC
Hi all,

I have a question about `systemd-nspawn` internals.

When creating the child process, it does something like:

    parent
    |
    clone(MOUNT)
    | `-------------,
    |               outer_child()
    |               |
    |               clone(rest)
    |               | `-------------,
    |               return          inner_child()
    | ,-------------'               |
    wait()                          |
    |                               exec()
    |                               |||
    |                               exit()
    | ,-----------------------------'
    wait()

where in the first `clone()` it unshares the mount namespace, and in
the second `clone()` it unshares all of the other namespaces (except
for the cgroup namespace).

Initially, I was confused by the awkward dance with two children; I
couldn't imagine why a separate `inner_child` and `outer_child` would
be necessary. Why can't everything be done in a single child process?

    parent
    |
    clone(MOUNT)
    | `-------------,
    |               child()
    |               |
    |               unshare(rest)
    |               |
    |               exec()
    |               |||
    |               exit()
    | ,-------------'
    wait()

`systemd-nspawn` has used the current two-child approach since
user-namespace support was first completed in commit 03cfe0d5, whose
commit message is only the brief "nspawn: finish user namespace
support", so there aren't many clues to be found in the commit log.

Part of the answer lies in the behavior of `unshare(CLONE_NEWPID)`.
Unlike all of the other namespaces that may be unshared, calling
`unshare(CLONE_NEWPID)` doesn't actually move *this* process into a
new PID namespace; instead, children created by subsequent
`fork()`/`clone()` calls are placed in the new namespace. So even if
we changed `systemd-nspawn` to the `clone(MOUNT)/unshare(rest)` model,
it would still have to `clone()` (or plain `fork()` at that point) a
second, inner, child process.
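
For illustration, here is a minimal C sketch of that behavior. It is not
taken from nspawn: `first_child_pid_after_unshare` is a hypothetical
helper, and `CLONE_NEWUSER` is added only on the assumption that it lets
the sketch run without root. The unsharing process keeps its own PID,
and only the child it forks afterwards sees itself as PID 1.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from systemd).  Returns the PID that the first
 * child forked after unshare(CLONE_NEWPID) sees for itself, -2 if the
 * unsharing process's own PID changed, or -1 if anything failed (for
 * example, unshare() being refused).
 */
int first_child_pid_after_unshare(void)
{
    int fds[2];
    if (pipe(fds) < 0)
        return -1;

    /* Do the unshare in a throwaway process so the caller is unaffected. */
    pid_t helper = fork();
    if (helper < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (helper == 0) {
        int result = -1;
        pid_t before = getpid();

        close(fds[0]);
        /* Assumption: CLONE_NEWUSER is added only so no root is needed. */
        if (unshare(CLONE_NEWUSER | CLONE_NEWPID) == 0) {
            if (getpid() != before) {
                result = -2;            /* would mean the caller itself moved */
            } else {
                pid_t inner = fork();   /* the second, inner child */
                if (inner == 0) {
                    result = (int) getpid();
                    if (write(fds[1], &result, sizeof result) < 0)
                        _exit(1);
                    _exit(0);
                }
                if (inner > 0) {
                    waitpid(inner, NULL, 0);
                    _exit(0);
                }
            }
        }
        /* unshare() or fork() failed, or the PID changed: report that. */
        if (write(fds[1], &result, sizeof result) < 0)
            _exit(1);
        _exit(1);
    }

    /* Original caller: collect the reported value and reap the helper. */
    close(fds[1]);
    int result = -1;
    ssize_t n = read(fds[0], &result, sizeof result);
    close(fds[0]);
    waitpid(helper, NULL, 0);
    return n == (ssize_t) sizeof result ? result : -1;
}
```

Where unprivileged user namespaces are permitted, the function should
return 1; where they are not, `unshare()` fails and it returns -1.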

So then, I'm left wondering why unsharing the PID namespace can't be
moved up to the initial `clone()`, allowing everything else to be
`unshare`(2)ed in the initial child process:

    parent
    |
    clone(MOUNT|PID)
    | `-------------,
    |               child()
    |               |
    |               unshare(rest)
    |               |
    |               exec()
    |               |||
    |               exit()
    | ,-------------'
    wait()

So my question becomes: what has to be done *after* unsharing the
mount namespace, but *before* unsharing the PID namespace?
--
Happy hacking,
~ Luke Shumaker
Lennart Poettering
2017-06-07 08:04:49 UTC
Luke Shumaker wrote:
> So my question becomes: what has to be done *after* unsharing the
> mount namespace, but *before* unsharing the PID namespace?

The various types of namespaces are not orthogonal, even if they are
exposed as supposedly independent bits in the clone() flags parameter:
if a new namespace (in particular a file system namespace, CLONE_NEWNS,
and a PID namespace, CLONE_NEWPID) is created at the same time as a
CLONE_NEWUSER user namespace, then those namespaces will be "owned" by
the user namespace. That has various effects, in particular on who may
mount/umount mount points in that namespace and on what is exposed in
/proc.

There are some mounts we never want the host to see, but which the
container itself must not be able to modify either, for example the
container's root directory (which is mounted to a temporary
subdirectory of /tmp). Hence we set them up in a new file system
namespace that is neither the host's nor the container's, but is
inherited into the container's: i.e. between the two CLONE_NEWNS.
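
To make the ordering concrete, here is a rough C sketch of the two
stages. It is not nspawn's actual code: `clone_like_fork`, `wait_ok` and
`spawn_two_stage` are hypothetical names, `outer_child`/`inner_child`
are borrowed from the diagram in the question, and the raw `clone`
argument order is assumed to be the x86-64 one. As a simplification, the
outer child waits for the inner one here, whereas in the diagram the
parent does. The first clone creates only a mount namespace, still
owned by the host's user namespace; the second creates the user
namespace together with a second mount namespace and the PID namespace.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <sys/mount.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: clone() that behaves like fork(), i.e. no new stack.
 * Assumes the raw syscall takes the flags first, as on x86-64. */
static pid_t clone_like_fork(unsigned long flags)
{
    return (pid_t) syscall(SYS_clone, flags | SIGCHLD, NULL, NULL, NULL, NULL);
}

/* Wait for a child; "exited with status 0" maps to 0, anything else to -1. */
static int wait_ok(pid_t pid)
{
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) && WEXITSTATUS(status) == 0 ? 0 : -1;
}

/* Stand-in for the container payload; the real thing would exec() here. */
static int inner_child(void)
{
    return 0;
}

/* Runs in the first new mount namespace, still in the host's user namespace. */
static int outer_child(void)
{
    /* Keep mount changes from propagating back to the host. */
    if (mount(NULL, "/", NULL, MS_REC | MS_SLAVE, NULL) < 0)
        return 1;

    /* ... mounts the host must not see and the container must not modify ... */

    /* Second stage: these namespaces are owned by the new user namespace. */
    pid_t inner = clone_like_fork(CLONE_NEWUSER | CLONE_NEWNS | CLONE_NEWPID);
    if (inner < 0)
        return 1;
    if (inner == 0)
        _exit(inner_child());
    return wait_ok(inner) == 0 ? 0 : 1;
}

/* Returns 0 if both stages ran, -1 otherwise (e.g. missing privileges). */
int spawn_two_stage(void)
{
    pid_t outer = clone_like_fork(CLONE_NEWNS);   /* first stage */
    if (outer < 0)
        return -1;
    if (outer == 0)
        _exit(outer_child());
    return wait_ok(outer);
}
```

The first clone needs CAP_SYS_ADMIN, so without privileges
`spawn_two_stage()` simply returns -1.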

I hope that makes sense?

Lennart
--
Lennart Poettering, Red Hat