[systemd-devel] systemd-nspawn containers
Michał Zegan
2016-11-09 17:24:39 UTC
Hello.

Does systemd-nspawn intend to be a full secure container technology? Or
is it maybe one already? What is missing?
Lennart Poettering
2016-11-11 12:52:32 UTC
Post by Michał Zegan
Hello.
Does systemd-nspawn intend to be a full secure container technology? Or
is it maybe one already? What is missing?

I am not sure what "full secure container technology" really is
supposed to mean.

nspawn right now is great for two things:

a) full OS containers (think VMs, except based on container
technology. This means that inside the container you have a proper
PID 1 running, and a network configuration daemon and most other
things that would run on a normal, physical system, except one
thing: no device manager, as the kernel does not virtualize
devices)

b) as a building block for whatever you want it to be. It's a pretty
generic tool that you can use as a base for anything you like. The "rkt"
container manager makes use of this facet.

There are a number of things nspawn is better at than other container
managers. For example, in conjunction with networkd, networking happens
pretty much entirely automatically out of the box. It also ships
userns support that is relatively usable without much manual
intervention. OTOH it clearly doesn't do a lot of stuff that other
container managers do, and that we have no intention of ever doing:
IP-level configuration in the manager itself, support for ZFS and other
exotic (possibly out-of-tree) storage technology, and so on.

So it really depends what you mean by "full secure container
technology". We do a lot, we will add more, but there are also things
I don't see on our list at all.

(And "secure" is a difficult thing anyway: currently the security of
containers on Linux is pretty limited in general, due to kernel
limitations.)

Lennart
--
Lennart Poettering, Red Hat
Michał Zegan
2016-11-11 15:41:25 UTC
Thank you for your answers!

What I meant by secure containers is mostly containers that are or will
be secure enough to use for things like virtual private server
hosting. Is nspawn intended to be usable for such things in the future,
or maybe it already is?
What kernel limitations do you mean when you talk about security?
For now I know that in full containers with userns, file capabilities do
not work (I think), you have no virtualized /proc/meminfo and friends
(do cgroup namespaces give a chance to change that?), you cannot mknod
devices (no whitelist possible at this level), there is no FUSE support,
no automatic kernel-level uid shifting, no possibility to mount physical
filesystems in userns, and no possibility to have SELinux etc. per
container. Do you mean such limitations or something else? I am
interested in this topic, but it is quite hard for me to track progress
in that area (kernel side), even though I subscribe to some kernel
mailing lists and know about at least some of the submitted patches.
What else is missing that I didn't mention and that would be good to have?
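
To illustrate the /proc/meminfo point above: since the kernel does not virtualize that file, a program inside a container sees the host's total memory there, and has to consult its cgroup's memory limit separately. The following is a minimal Python sketch under stated assumptions, not anything nspawn itself provides. The function names are hypothetical, and the code only parses text, so the caller decides which files to read.

```python
def meminfo_total_bytes(meminfo_text):
    """Return MemTotal from /proc/meminfo-style text, converted from kB to bytes."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            # Line looks like: "MemTotal:       16384000 kB"
            return int(line.split()[1]) * 1024
    raise ValueError("MemTotal not found")


def cgroup_limit_bytes(limit_text):
    """Parse a cgroup memory limit; cgroup v2 writes 'max' when unlimited."""
    value = limit_text.strip()
    return None if value == "max" else int(value)


def effective_memory_bytes(meminfo_text, limit_text):
    """Smaller of the host total and the cgroup limit (if any limit is set)."""
    total = meminfo_total_bytes(meminfo_text)
    limit = cgroup_limit_bytes(limit_text)
    return total if limit is None else min(total, limit)
```

In practice the first argument would come from /proc/meminfo and the second from the container's memory limit file (assumed here to be memory.max on cgroup v2 or memory.limit_in_bytes on cgroup v1; the exact path depends on how the hierarchy is mounted).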

Also, what about setting cgroup parameters per container? nspawn does not
allow doing that, and you probably do not intend it to be done by
overriding the container's scope unit settings, for example?
Lennart Poettering
2016-11-11 17:28:59 UTC
Post by Michał Zegan
Thank you for your answers!
What I meant by secure containers is mostly containers that are or will
be secure enough to use for things like virtual private server
hosting. Is nspawn intended to be usable for such things in the future,
or maybe it already is?

I run my own server this way, already as an exercise in dogfooding.

So, yes, running a VPS like this certainly works, but do note that
nspawn doesn't do orchestration or anything. It's good enough for me,
but if you need fancy orchestration tools then nspawn won't be
sufficient.
Post by Michał Zegan
What kernel limitations do you mean when you talk about security?

Well, a lot of subsystems cannot be locked down properly for use in
containers yet. You can lock down a lot, in particular if you use
userns, but there are still a lot of holes in there, and in particular
userns by itself has been a major source of CVEs in the most recent
kernels.

Right now, "containers" in general are not about security. Some
companies claim they are secure, but they really aren't. And that's
not a bug in nspawn, or docker, or lxc for that matter, it's simply a
limitation of the kernel.

Or to say this differently: we'll do in nspawn everything we can to
lock things down properly, but there are limits based on what the
kernel provides... As the kernel gets improved in this area, we'll
update nspawn to make use of it. We are sitting in the same boat in
this regard as other container managers, and they have more or less
the same limits we have.
Post by Michał Zegan
For now I know that in full containers with userns, file capabilities do
not work (I think), you have no virtualized /proc/meminfo and friends
(do cgroup namespaces give a chance to change that?), you cannot mknod
devices (no whitelist possible at this level), there is no FUSE support,
no automatic kernel-level uid shifting, no possibility to mount physical
filesystems in userns, and no possibility to have SELinux etc. per
container. Do you mean such limitations or something else?

Well, devices are not virtualized at all (with the exception of
network devices), which means no udev, no hotplug events and so
on. Some container managers ignore this, and provide access to
selected device nodes anyway, but we don't do something like that in
nspawn, since it's pretty broken (as /sys wouldn't match what you see
in /dev). In general, I think people should just accept that
containers mean "you don't get physical device access". And if you
want physical device access, then don't use containers...
Post by Michał Zegan
I am interested in this topic, but it is quite hard for me to track
progress in that area (kernel side), even though I subscribe to some
kernel mailing lists and know about at least some of the submitted
patches. What else is missing that I didn't mention and that would be
good to have?

Well, a lot of stuff is still not properly virtualized. Audit, autofs,
keyring, cgroups, … come to mind.
Post by Michał Zegan
Also, what about setting cgroup parameters per container? nspawn does not
allow doing that, and you probably do not intend it to be done by
overriding the container's scope unit settings, for example?

You can actually do that just fine. Simply set it in the nspawn service
file. Or, if you run nspawn from the command line, use the "--property="
switch. Or make your changes dynamically via "systemctl set-property".
It's all supported and works well.
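
As an illustration of the dynamic route via "systemctl set-property", here is a minimal Python sketch that only builds the command line and does not run it. It assumes the container runs as an instance of the systemd-nspawn@.service template; the helper name, the machine name "web" and the property values are hypothetical examples.

```python
def set_property_command(machine, properties, runtime=True):
    """Build a 'systemctl set-property' argv for an nspawn container unit.

    Assumes the container runs as systemd-nspawn@<machine>.service.
    With runtime=True the change is not persisted across reboots.
    """
    cmd = ["systemctl", "set-property"]
    if runtime:
        cmd.append("--runtime")
    cmd.append(f"systemd-nspawn@{machine}.service")
    # Each property is passed as a NAME=VALUE assignment.
    cmd.extend(f"{name}={value}" for name, value in properties.items())
    return cmd


# Hypothetical example: cap memory and task count for a container named "web".
example = set_property_command("web", {"MemoryMax": "2G", "TasksMax": "512"})
```

The resulting list can be handed to subprocess.run(example, check=True) by a privileged user. Which properties take effect depends on the systemd version and the cgroup hierarchy in use.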

Lennart
--
Lennart Poettering, Red Hat
Michał Zegan
2016-11-11 18:21:19 UTC
audit/autofs are not properly virtualized, I know. But I thought
keyrings and cgroups are.
Lennart Poettering
2016-11-11 18:24:11 UTC
Post by Michał Zegan
audit/autofs are not properly virtualized, I know. But I thought
keyrings and cgroups are.

Most container managers turn off keyrings entirely (as we do in nspawn,
actually).

Delegating controllers in cgroupsv1 is unsafe: if you do it, the
container can easily make the system hang.

Delegating controllers in cgroupsv2 is safe, but cgroupsv2 is
incomplete as of now: the most relevant controller (cpu) is not
available for it yet.
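
Whether a given kernel exposes the cpu controller on cgroupsv2 can be checked by reading the cgroup.controllers file at the root of the unified hierarchy. The following Python sketch shows one way to do that; the function names are hypothetical, and the default mount point /sys/fs/cgroup is an assumption (on hybrid setups the v2 hierarchy may be mounted elsewhere).

```python
from pathlib import Path


def parse_controllers(text):
    """Split the space-separated contents of cgroup.controllers into a set."""
    return set(text.split())


def available_controllers(root="/sys/fs/cgroup"):
    """Return the cgroup v2 controllers available at root, or None.

    None means cgroup.controllers could not be read there, for example
    because no unified (v2) hierarchy is mounted at that path.
    """
    try:
        return parse_controllers((Path(root) / "cgroup.controllers").read_text())
    except OSError:
        return None
```

A caller would then test for membership, for example `"cpu" in (available_controllers() or set())`.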

Lennart
--
Lennart Poettering, Red Hat
Michał Zegan
2016-11-11 18:36:02 UTC
Why do you turn off keyrings? At least the man pages say that userns
virtualizes keyrings or something similar...
Lennart Poettering
2016-11-11 18:41:43 UTC
Post by Michał Zegan
Why do you turn off keyrings? At least the man pages say that userns
virtualizes keyrings or something similar...

That'd be a new feature then...

Lennart
--
Lennart Poettering, Red Hat
Michał Zegan
2016-11-11 18:49:02 UTC
Well, you can read user_namespaces(7), the beginning of it at least. It
probably says something about keyrings. So either this info is
incorrect, or I understand it wrongly.
Also, when you say that containers currently have holes and so
are still not really secure, I don't actually see any example of that
except the small number of things you just cannot do there at all (for
example use/access audit, or use FUSE or file capabilities), and things
like cgroups that are work in progress at this very moment. Well, file
caps are also work in progress at the moment, I believe; I saw some
patches lately. I probably don't see such problems because I am not a
security expert and I am not working with any kind of servers/containers
in production; this technology is just extremely interesting to me.