Discussion: [HEADSUP] cgroup changes
Lennart Poettering
2013-06-21 17:36:03 UTC
Heya,

On monday I posted this mail:

http://lists.freedesktop.org/archives/systemd-devel/2013-June/011388.html

Here's an update and a bit on the bigger picture:

Half of what I mentioned there is now in place. There's now a new
"slice" unit type in git, and everything is hooked up to
it. logind will now also keep track of running containers/VMs. The
various container/VM managers have to register with logind now. This
serves the purpose of better integration of containers/VMs everywhere
(so that "ps" can show for each process where it belongs). However,
the main reason for this is that registration is eventually going to be
the only way for containers/VMs to get a cgroup of their own.

So, in that context, a bit of the bigger picture:

It took us a while to realize the full extent of how awfully unusable
cgroups currently are. The attributes have way more interdependencies
than people might think, and it is trivial to create nonsensical
configurations...

Of course, understanding how awful the status quo is makes for a good
first step. But we really needed to figure out what we can do about this to
clean this up in the long run, and how we can get to something useful
quickly. So, after much discussion between Tejun (the kernel cgroup
maintainer) and various other folks here's the new scheme that we want
to go for:

1) In the long run there's only going to be a single kernel cgroup
hierarchy, the per-controller hierarchies will go away. The single
hierarchy will allow controllers to be individually enabled for each
cgroup. The net effect is that the hierarchies the controllers see are
not orthogonal anymore, they are always subtrees of the full single
hierarchy.

2) This hierarchy becomes the private property of systemd. systemd will
set it up. systemd will maintain it. systemd will rearrange it. Other
software that wants to make use of cgroups can do so only through
systemd's APIs. This single-writer logic is absolutely necessary, since
interdependencies between the various controllers, the various
attributes, and the various cgroups are non-obvious, and we simply
cannot allow cgroup users to alter the tree independently of each other
forever. Due to all this, the "Pax Cgroup" document is a thing of the
past, it is dead.

3) systemd will almost entirely hide the fact that cgroups are used
internally. In fact, we will take away the unit configuration options
ControlGroup=, ControlGroupModify=, ControlGroupPersistent=,
ControlGroupAttribute= in their entirety. The high-level options
CPUShares=, MemoryLimit=, .. and so on will continue to exist and we'll
add additional ones like them. The system.conf setting
DefaultControllers=cpu will go away too. Basically, you'll get more
high-level settings, but all the low level bits will go away without
replacement. We will take away the ability for the admin to set
arbitrary low-level attributes, to arrange things in completely
arbitrary cgroup trees or to enable arbitrary controllers for a service.
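To give a rough idea of the direction (a sketch only: the option names
are the ones discussed above, the exact set is not final, and the
service name is made up), a unit would carry just the high-level knobs:

    # foobar.service (illustrative sketch, not a real shipped unit)
    [Service]
    ExecStart=/usr/bin/foobard
    # High-level resource settings; translated to cgroup attributes
    # internally by systemd:
    CPUShares=1500
    MemoryLimit=512M

instead of any ControlGroup*= or DefaultControllers= stanzas.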

4) systemd git introduced a new unit type called "slice" (see
above). This is for partitioning the resources of the system into
slices. Slices are hierarchical, and other units (such as services, but
also containers/VMs and logged in users) can then be assigned to these
slices. Slices internally map to cgroups, but they are a very high-level
construct. Slices will expose the same CPUShares=, MemoryLimit=
properties as the other units do. This means resource management will
become a first-class, built-in functionality of systemd. You can create
slices for your customers, and in them subslices for their departments,
and then run services, users, and VMs in them. In the long run these
will even be dynamically movable (while they are running), but that'll
take more kernel work. By default there will be three slices: "system.slice"
(where all system services are located by default), "user.slice" (where
all logged in users are located by default), "machine.slice" (where all
running VMs/containers are located by default). However, the admin will
have full freedom to create arbitrary slices and then move the other
units into them.
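As a sketch of how that might look (unit and option names hypothetical
where not mentioned above, syntax not final):

    # customer1.slice -- a slice for one customer
    [Slice]
    CPUShares=2048

    # customer1-web.slice -- a sub-slice; nesting would be expressed
    # via the "-" in the slice name, so this lives inside
    # customer1.slice

    # shop.service -- a service assigned into that sub-slice
    [Service]
    Slice=customer1-web.slice
    ExecStart=/usr/bin/shopd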

5) systemd's logind daemon has so far kept track of logged-in
users/sessions. It is now extended to also keep track of virtual
machines/containers. In fact, this is how libvirt/nspawn and friends
will now get their own cgroups. They register as a machine, which means
passing a bit of meta info to systemd, and getting a cgroup assigned in
response. This registration ensures that "ps" and friends can show
which VM/container a process belongs to, but it also easily allows
other tools to query container/VM info, so that in the long run we'll
be able to provide a level of container/VM integration similar to what
Solaris zones offer.
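To illustrate the query side of this, a minimal sketch in C against the
sd-login API (sd_pid_get_unit() exists today; sd_pid_get_machine_name()
is an assumed counterpart for machines, modeled on it):

    /* Sketch: ask systemd/logind where a PID belongs.
     * Build: cc demo.c $(pkg-config --cflags --libs libsystemd-login)
     * sd_pid_get_machine_name() is an assumption here. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <systemd/sd-login.h>

    int main(int argc, char *argv[]) {
            pid_t pid = argc > 1 ? (pid_t) atoi(argv[1]) : getpid();
            char *unit = NULL, *machine = NULL;

            if (sd_pid_get_unit(pid, &unit) >= 0)
                    printf("PID %d belongs to unit: %s\n", (int) pid, unit);
            if (sd_pid_get_machine_name(pid, &machine) >= 0)
                    printf("PID %d belongs to machine: %s\n", (int) pid, machine);

            free(unit);
            free(machine);
            return 0;
    }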

So, all of this together sounds like an awful lot of change. #1 and #2
are long-term changes. However, #3, #4, #5 are something we can and
should do now, as preparation for the single-writer, unified cgroup
tree. We really, really shouldn't ship the cgroup mess any longer, lest
people make use of the current systemd APIs that expose way too many
internal guts, stuff that we *know* right now is broken and will cease
to exist. We don't want to expose low-level details we already know
*now* we cannot support for long.

Even though #3, #4, #5 sound like major work, they are not. In fact, #4
and #5 are already fully implemented upstream on the systemd side. I am
working on #3. I am confident that I'll have this finished in a few
days too, since this is really more about deleting code than writing
it.

With #3, #4, #5 we have something in place that should cover the basics
and, first and foremost, will hide all the lower-level details of
cgroups. This has the big benefit of allowing us to rearrange these
details later without having to break the user or
programming interfaces, and that's what I really care about here.

Now, what does this mean for other projects using cgroups? Basically,
since we won't implement #1 + #2 immediately, the cgroup tree stays
relatively open for other cgroup users. They can continue to fiddle
with it for now, but it must be clear that this is temporary, and that
they shouldn't attempt anything too fancy. Direct access to the cgroup
tree is on its way out, and that must be clear to everybody.

More specifically: libcgroup is out of the game with
this. libvirt/openshift/lxc/.. can continue to do what they do for now,
however they should be updated sooner rather than later to do things the
systemd way, i.e. rely on systemd VM/container registration and user
cgroup management.

And to make one last thing clear: this time, it's not Kay and me who are
taking away the cgroup tree from everybody else, it's actually all
Tejun's fault as the kernel cgroup maintainer... ;-) He wants a unified,
single-writer hierarchy, and it took us a while to agree to that, but
we're now fully on the same page with him.

If you are using non-trivial cgroup setups with systemd right now, then
things will change for you. We will provide you with functionality
similar to what you had before, but things will be different and less
low-level. As long as you only used the high-level options such as
CPUShares=, MemoryLimit= and so on, you should be on the safe side.

I hope this makes some sense,

Lennart
--
Lennart Poettering - Red Hat, Inc.
Kok, Auke-jan H
2013-06-21 19:59:01 UTC
On Fri, Jun 21, 2013 at 10:36 AM, Lennart Poettering
Post by Lennart Poettering
Heya,
http://lists.freedesktop.org/archives/systemd-devel/2013-June/011388.html
Thanks for doing this - I am really looking forward to seeing this all
take shape, and I hope to be able to leverage this in the future :^)

All the points below are great, and problems that I've encountered in
the past have all hinted towards this being the right way forward.

#2 below has my interest - when you have some ideas about how the API
will look I'd like to review it and match against our use cases...

Auke
Lennart Poettering
2013-06-21 20:10:08 UTC
Post by Kok, Auke-jan H
Post by Lennart Poettering
http://lists.freedesktop.org/archives/systemd-devel/2013-June/011388.html
Thanks for doing this - I am really looking forward to seeing this all
take shape, and I hope to be able to leverage this in the future :^)
All the points below are great, and problems that I've encountered in
the past have all hinted towards this being the right way forward.
#2 below has my interest - when you have some ideas about how the API
will look I'd like to review it and match against our use cases...
Point #2 is precisely about not having APIs for this... ;-)

So, in the future, when you have some service, and that service wants to
alter some cgroup resource limits for itself (let's say: set its own cpu
shares value to 1500), this is what should happen: the service should
use a call like sd_pid_get_unit() to get its own unit name, and then use
dbus to invoke SetCPUShares(1500) for that service. systemd will then do
the rest. (*)
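By way of illustration, a minimal sketch of that flow in C
(sd_pid_get_unit() is existing sd-login API; the bus object path,
interface and the SetCPUShares() method are only the sketch from this
mail, not a finalized interface):

    /* Sketch only: raise our own CPU shares via systemd. The method
     * name and signature are assumptions, not a finalized API. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <dbus/dbus.h>
    #include <systemd/sd-login.h>

    int main(void) {
            char *unit = NULL;
            dbus_uint64_t shares = 1500;
            DBusError err;
            DBusMessage *msg, *reply;
            DBusConnection *bus;

            if (sd_pid_get_unit(getpid(), &unit) < 0)
                    return 1;

            dbus_error_init(&err);
            bus = dbus_bus_get(DBUS_BUS_SYSTEM, &err);
            if (!bus)
                    return 1;

            /* Hypothetical manager method taking the unit name; the
             * real call might live on the unit's own object instead. */
            msg = dbus_message_new_method_call(
                            "org.freedesktop.systemd1",
                            "/org/freedesktop/systemd1",
                            "org.freedesktop.systemd1.Manager",
                            "SetCPUShares");
            dbus_message_append_args(msg,
                            DBUS_TYPE_STRING, &unit,
                            DBUS_TYPE_UINT64, &shares,
                            DBUS_TYPE_INVALID);

            reply = dbus_connection_send_with_reply_and_block(bus, msg, -1, &err);
            if (!reply)
                    fprintf(stderr, "SetCPUShares failed: %s\n", err.message);

            free(unit);
            return 0;
    }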

Lennart

(*) to make this even simpler we have been thinking of defining a new
"virtual" bus object path /org/freedesktop/systemd1/self/ or so which
will always point to the caller's own unit. This would be similar to
/proc/self/, which for each process points to its own PID
directory... With that in place you could then set any resource setting
you want with a single bus method call.
--
Lennart Poettering - Red Hat, Inc.
Kok, Auke-jan H
2013-06-21 21:10:43 UTC
On Fri, Jun 21, 2013 at 1:10 PM, Lennart Poettering
Post by Lennart Poettering
Post by Kok, Auke-jan H
Post by Lennart Poettering
http://lists.freedesktop.org/archives/systemd-devel/2013-June/011388.html
Thanks for doing this - I am really looking forward to seeing this all
take shape, and I hope to be able to leverage this in the future :^)
All the points below are great, and problems that I've encountered in
the past have all hinted towards this being the right way forward.
#2 below has my interest - when you have some ideas about how the API
will look I'd like to review it and match against our use cases...
Point #2 is precisely about not having APIs for this... ;-)
So, in the future, when you have some service, and that service wants to
alter some cgroup resource limits for itself (let's say: set its own cpu
shares value to 1500), this is what should happen: the service should
use a call like sd_pid_get_unit() to get its own unit name, and then use
dbus to invoke SetCPUShares(1500) for that service. systemd will then do
the rest. (*)
Lennart
(*) to make this even simpler we have been thinking of defining a new
"virtual" bus object path /org/freedesktop/systemd1/self/ or so which
will always point to the caller's own unit. This would be similar to
/proc/self/, which for each process points to its own PID
directory... With that in place you could then set any resource setting
you want with a single bus method call.
This is fine for applications that manage themselves, but I'm seeing
more interest in use cases where we want external influence on cgroup
hierarchies, for instance:

- foreground/background priorities - a window manager marks background
applications and puts them in the freezer, changes oom_score_adj so
that old apps can get automatically cleaned up in case memory
availability is low.
- detecting runaway apps and taking cpu slices away from them.
- thermally constraining classes of applications

Those would be tasks that an external process would do by manipulating
properties of cgroups, not something each task would do on its own.

Do you suggest these manipulations should be implemented without
high-level systemd APIs, with the "controller" just manipulating the
cgroups directly?

Auke
Lennart Poettering
2013-06-21 21:17:59 UTC
Post by Kok, Auke-jan H
Post by Lennart Poettering
So, in the future, when you have some service, and that service wants to
alter some cgroup resource limits for itself (let's say: set its own cpu
shares value to 1500), this is what should happen: the service should
use a call like sd_pid_get_unit() to get its own unit name, and then use
dbus to invoke SetCPUShares(1500) for that service. systemd will then do
the rest. (*)
Lennart
(*) to make this even simpler we have been thinking of defining a new
"virtual" bus object path /org/freedesktop/systemd1/self/ or so which
will always point to the caller's own unit. This would be similar to
/proc/self/, which for each process points to its own PID
directory... With that in place you could then set any resource setting
you want with a single bus method call.
This is fine for applications that manage themselves, but I'm seeing
more interest in use cases where we want external influence on cgroup
- foreground/background priorities - a window manager marks background
applications and puts them in the freezer, changes oom_score_adj so
that old apps can get automatically cleaned up in case memory
availability is low.
- detecting runaway apps and taking cpu slices away from them.
- thermally constraining classes of applications
Those would be tasks that an external process would do by manipulating
properties of cgroups, not something each task would do on its own.
Do you suggest these manipulations should be implemented without
high-level systemd APIs, with the "controller" just manipulating the
cgroups directly?
All changes to cgroup attributes must go through systemd. If the WM
wants to freeze or adjust OOM he needs to issue systemd bus calls for
that.

The run-away stuff I can't follow: the kernel will distribute CPU
evenly among running apps if they all want it, so I'm not seeing why
more monitoring is needed.

The thermal stuff is probably best done in-kernel, I guess... Too
dangerous/subject-to-latency for userspace, no?

Lennart
--
Lennart Poettering - Red Hat, Inc.
Kok, Auke-jan H
2013-06-21 21:47:34 UTC
On Fri, Jun 21, 2013 at 2:17 PM, Lennart Poettering
Post by Lennart Poettering
Post by Kok, Auke-jan H
Post by Lennart Poettering
So, in the future, when you have some service, and that service wants to
alter some cgroup resource limits for itself (let's say: set its own cpu
shares value to 1500), this is what should happen: the service should
use a call like sd_pid_get_unit() to get its own unit name, and then use
dbus to invoke SetCPUShares(1500) for that service. systemd will then do
the rest. (*)
Lennart
(*) to make this even simpler we have been thinking of defining a new
"virtual" bus object path /org/freedesktop/systemd1/self/ or so which
will always point to the caller's own unit. This would be similar to
/proc/self/, which for each process points to its own PID
directory... With that in place you could then set any resource setting
you want with a single bus method call.
This is fine for applications that manage themselves, but I'm seeing
more interest in use cases where we want external influence on cgroup
- foreground/background priorities - a window manager marks background
applications and puts them in the freezer, changes oom_score_adj so
that old apps can get automatically cleaned up in case memory
availability is low.
- detecting runaway apps and taking cpu slices away from them.
- thermally constraining classes of applications
Those would be tasks that an external process would do by manipulating
properties of cgroups, not something each task would do on its own.
Do you suggest these manipulations should be implemented without
high-level systemd APIs, with the "controller" just manipulating the
cgroups directly?
All changes to cgroup attributes must go through systemd. If the WM
wants to freeze or adjust OOM he needs to issue systemd bus calls for
that.
The run-away stuff I can't follow: the kernel will distribute CPU
evenly among running apps if they all want it, so I'm not seeing why
more monitoring is needed.
The thermal stuff is probably best done in-kernel, I guess... Too
dangerous/subject-to-latency for userspace, no?
Only userspace can distinguish between e.g. a foreground and
background application (WM) and decide that CPU consumption of certain
apps in the background is excessive, and throttle it down further,
which is basically somewhat similar to using the freezer to just
SIGSTOP them entirely.

Thermal throttling from userspace allows you to distinguish between
"never make my SETI turn the fan on" and "throttle the entire system
when I reach high fan speeds". You can't do that in the kernel. [1]
Arguably this could be done in-task and not by an external controller,
but you're still trusting the task to do the right thing, which may
not be something you want to do.


Auke


[1] Note that the new Intel P-state driver by Dirk Brandewie changes
how things work with nice(). The old behaviour was abused by folks
running bitcoin miners at nice values which caused ondemand to do
something irrational: nice-only tasks would keep the CPU in the lowest
frequencies, which is terrible from a power perspective - now every
daemon running at a nice value takes much longer to complete its task,
burning more power than when it had raced to idle.
Kay Sievers
2013-06-21 22:07:07 UTC
On Fri, Jun 21, 2013 at 11:47 PM, Kok, Auke-jan H
Post by Kok, Auke-jan H
Only userspace can distinguish between e.g. a foreground and
background application (WM) and decide that CPU consumption of certain
apps in the background is excessive, and throttle it down further,
This would probably be some bus call to the systemd --user instance
managing the services in the session, if that's what you mean?

Kay
Kok, Auke-jan H
2013-06-21 23:26:05 UTC
Post by Kay Sievers
On Fri, Jun 21, 2013 at 11:47 PM, Kok, Auke-jan H
Post by Kok, Auke-jan H
Only userspace can distinguish between e.g. a foreground and
background application (WM) and decide that CPU consumption of certain
apps in the background is excessive, and throttle it down further,
This would probably be some bus call to the systemd --user instance
managing the services in the session, if that's what you mean?
for instance, yes.

Auke
Lennart Poettering
2013-06-24 13:33:51 UTC
Post by Kok, Auke-jan H
Post by Lennart Poettering
Post by Kok, Auke-jan H
Do you suggest these manipulations should be implemented without
high-level systemd APIs, with the "controller" just manipulating the
cgroups directly?
All changes to cgroup attributes must go through systemd. If the WM
wants to freeze or adjust OOM he needs to issue systemd bus calls for
that.
The run-away stuff I can't follow: the kernel will distribute CPU
evenly among running apps if they all want it, so I'm not seeing why
more monitoring is needed.
The thermal stuff is probably best done in-kernel, I guess... Too
dangerous/subject-to-latency for userspace, no?
Only userspace can distinguish between e.g. a foreground and
background application (WM) and decide that CPU consumption of certain
apps in the background is excessive, and throttle it down further,
which is basically somewhat similar to using the freezer to just
SIGSTOP them entirely.
Yes, userspace can do that via systemd; there will be high-level
operations on the bus for this. For example: SetCPUShares() to alter the
cpu.shares value, and so on. This method call will do much more,
though, than just write this value. One of the complexities of the
cgroup stuff here is that adding a unit to a controller like "cpu"
means you have to do the same for all its immediate siblings (i.e.
other units in the same slice) plus all its parent slices (and
recursively their
that is in the "cpu" controller the same amount of CPU *in total* as the
other services in the same slice get *individually* for each
process. And that would be grossly unfair...
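To spell that out with a quick worked example (numbers invented):

    system.slice/
        a.service  <- added to the "cpu" controller, cpu.shares=1024
        b.service  <- not in "cpu"; its processes stay in the parent group
        c.service  <- ditto

With one process per service this happens to look fine: a.service (one
group, weight 1024) competes against b's and c's single processes
(weight 1024 each), and everybody gets ~1/3. But if b and c run 10
processes each, a.service still gets only 1024/(21*1024), i.e. ~1/21 of
the CPU *in total*, while b and c each get ~10/21. To preserve the
intended equal split, b.service and c.service (and, recursively, the
sibling slices further up) have to be put into the "cpu" controller as
well.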
Post by Kok, Auke-jan H
Thermal throttling from userspace allows you to distinguish between
"never make my SETI turn the fan on" and "throttle the entire system
when I reach high fan speeds". You can't do that in the kernel. [1]
Arguably this could be done in-task and not by an external controller,
but you're still trusting the task to do the right thing, which may
not be something you want to do.
So, if userspace needs to communicate something to kernel space about
what kind of cooling strategy it would prefer, and that is done via
cgroups, then I am sure we can add similar high-level per-unit control
calls to systemd, too.

Lennart
--
Lennart Poettering - Red Hat, Inc.
Lennart Poettering
2013-06-24 13:27:15 UTC
1. I put the entire world into a separate, highly constrained
cgroup. My real-time code runs outside that cgroup. This seems to be
exactly what slices are for, but I need kernel threads to go into
the constrained cgroup. Will systemd support this?
I am not sure whether the ability to move kernel threads into cgroups
will stay around at all, from the kernel side. Tejun, can you comment on this?
2. I manage services and tasks outside systemd (for one thing, I
currently use Ubuntu, but even if I were on Fedora, I have a bunch
of fine-grained things that figure out how they're supposed to
allocate resources, and porting them to systemd just to keep working
in the new world order would be a PITA [1]).
(cgroups have the odd feature that they are per-task, not per thread
group, and the systemd proposal seems likely to break anything that
actually wants task granularity. I may actually want to use this,
even though it's a bit evil -- my real-time thread groups have
non-real-time threads.)
Here too, Tejun is pretty keen on removing the ability to split up
threads into cgroups from the kernel, and will only allow this
per-process. Tejun, please comment!
I think that what I want is something like sub-unit cgroups -- I
want to be able to ask systemd to further subdivide the group for my
unit, login session, or whatever. Would this be reasonable?
(Another way of thinking of this is that a unit would have a whole
cgroup hierarchy instead of just one cgroup.)
The idea is not even to allow this. Basically, if you want to partition
your daemon into different cgroups, you need to do that through systemd's
abstractions: slices and services. To make this more palatable we'll
introduce "throw-away" units though, so that you can dynamically run
something as a workload and don't need to be concerned about naming
this, or cleaning it up.
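To make the throw-away idea concrete, a rough sketch of how starting
such a unit over the bus might look (the StartTransientUnit name,
signature and semantics here are assumptions extrapolated from the
description above, not a finalized API):

    /* Hypothetical sketch: run a workload as a throw-away unit. */
    #include <stdio.h>
    #include <dbus/dbus.h>

    int main(void) {
            DBusError err;
            DBusConnection *bus;
            DBusMessage *msg;
            DBusMessageIter it, props;
            const char *name = "throwaway-workload.service"; /* made up */
            const char *mode = "fail";

            dbus_error_init(&err);
            bus = dbus_bus_get(DBUS_BUS_SYSTEM, &err);
            if (!bus)
                    return 1;

            msg = dbus_message_new_method_call(
                            "org.freedesktop.systemd1",
                            "/org/freedesktop/systemd1",
                            "org.freedesktop.systemd1.Manager",
                            "StartTransientUnit"); /* assumed name */

            dbus_message_iter_init_append(msg, &it);
            dbus_message_iter_append_basic(&it, DBUS_TYPE_STRING, &name);
            dbus_message_iter_append_basic(&it, DBUS_TYPE_STRING, &mode);

            /* Unit properties (ExecStart, CPUShares, Slice, ...) would
             * be appended here; left empty in this sketch. */
            dbus_message_iter_open_container(&it, DBUS_TYPE_ARRAY, "(sv)", &props);
            dbus_message_iter_close_container(&it, &props);

            if (!dbus_connection_send_with_reply_and_block(bus, msg, -1, &err))
                    fprintf(stderr, "failed: %s\n", err.message);
            return 0;
    }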
I think that the single-hierarchy model will require that I
subdivide my user session so that the default sub-unit cgroup is
constrained similarly to the default slice. I'll lose
functionality, but I don't think this is a showstopper.
A different approach would be to allow units to (with systemd's
cooperation) escape into their own, dynamically created unit. This
seems kind of awful.
This is basically what I meant with "throw-away" units.
3. My code runs unprivileged, but it still wants to configure
itself. If needed, I can write a little privileged daemon to handle
the systemd calls.
So, at least in the beginning, I am pretty sure we'll restrict
manipulating the resource parameters to root only, since this is much
more security-sensitive than one might assume and we simply cannot
oversee it all.
I think I can get away without anything fancy if a unit (login
session?) grants the right to manipulate sub-unit cgroups to a
non-root user.
As mentioned, this will not be possible.
4. As mentioned, I'm on Ubuntu some of the time. I'd like to keep
the same code working on systemd and non-systemd systems.
How hard would it be to run systemd as just a cgroup controller?
That is, have systemd create its slices, run exactly one unit that
represents the whole system, and let other things use the cgroup
API.
I have no idea, I don't develop Ubuntu. They will have to come up with
some cgroup maintenance daemon of their own. As I know them they'll
either do a "port" of the systemd counterpart (but that's going to be
tough!), or they'll stick something half-baked into Upstart...

Sorry if this all sounds a bit disappointing. But yeah, this all is not
a trivial change...

Lennart
--
Lennart Poettering - Red Hat, Inc.
Daniel P. Berrange
2013-06-24 13:39:53 UTC
Post by Lennart Poettering
1. I put the entire world into a separate, highly constrained
cgroup. My real-time code runs outside that cgroup. This seems to be
exactly what slices are for, but I need kernel threads to go into
the constrained cgroup. Will systemd support this?
I am not sure whether the ability to move kernel threads into cgroups
will stay around at all, from the kernel side. Tejun, can you comment on this?
KVM uses the vhost_net device for accelerating guest network I/O
paths. This device creates a new kernel thread on each open(),
and that kernel thread is attached to the cgroup associated
with the process that open()d the device.

If systemd allows for a process to be moved between cgroups, then
it must also be capable of moving any associated kernel threads to
the new cgroup at the same time. This co-placement of vhost-net
threads with the KVM process is very critical for the I/O performance
of KVM networking.

Regards,
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
Tejun Heo
2013-06-24 18:19:33 UTC
Hello,
Post by Daniel P. Berrange
Post by Lennart Poettering
1. I put the entire world into a separate, highly constrained
cgroup. My real-time code runs outside that cgroup. This seems to be
exactly what slices are for, but I need kernel threads to go into
the constrained cgroup. Will systemd support this?
I am not sure whether the ability to move kernel threads into cgroups
will stay around at all, from the kernel side. Tejun, can you comment on this?
KVM uses the vhost_net device for accelerating guest network I/O
paths. This device creates a new kernel thread on each open(),
and that kernel thread is attached to the cgroup associated
with the process that open()d the device.
If systemd allows for a process to be moved between cgroups, then
it must also be capable of moving any associated kernel threads to
the new cgroup at the same time. This co-placement of vhost-net
threads with the KVM process is very critical for the I/O performance
of KVM networking.
Yeah, the way virt drivers use cgroups right now is pretty hacky. I
was thinking about adding a per-process workqueue which follows the
cgroup association of the process, after the unified hierarchy lands,
and then converting virt to use that.

At any rate, those kthreads can be moved via cgroup.procs, so the
unified hierarchy wouldn't break it from the kernel side. Not sure how
the interface would look from the systemd side, though.

Thanks.
--
tejun
Andy Lutomirski
2013-06-24 15:21:31 UTC
On Mon, Jun 24, 2013 at 6:27 AM, Lennart Poettering
Post by Lennart Poettering
2. I manage services and tasks outside systemd (for one thing, I
currently use Ubuntu, but even if I were on Fedora, I have a bunch
of fine-grained things that figure out how they're supposed to
allocate resources, and porting them to systemd just to keep working
in the new world order would be a PITA [1]).
[...]
Post by Lennart Poettering
I think that what I want is something like sub-unit cgroups -- I
want to be able to ask systemd to further subdivide the group for my
unit, login session, or whatever. Would this be reasonable?
(Another way of thinking of this is that a unit would have a whole
cgroup hierarchy instead of just one cgroup.)
The idea is not even to allow this. Basically, if you want to partition
your daemon into different cgroups, you need to do that through systemd's
abstractions: slices and services. To make this more palatable we'll
introduce "throw-away" units though, so that you can dynamically run
something as a workload and don't need to be concerned about naming
this, or cleaning it up.
Hmm. My particular software can maybe live with this, with unpleasant
modifications, but this will break anything that, say, accepts a
connection from a client, forks into a (possibly new) cgroup based on
the identity of that client, and then does something.

How can this support containers or the use of cgroups in a
non-systemwide systemd instance? Containers may no longer be allowed
to escape from the cgroup they start in, but there should (IMO) still
be a way for things to subdivide their cgroup-controlled resources.

If I want to have a hierarchy more than two levels deep, I suspect I'm
SOL under this model. If I'm understanding correctly, there will be
slices, then units, and that's it.
Post by Lennart Poettering
4. As mentioned, I'm on Ubuntu some of the time. I'd like to keep
the same code working on systemd and non-systemd systems.
How hard would it be to run systemd as just a cgroup controller?
That is, have systemd create its slices, run exactly one unit that
represents the whole system, and let other things use the cgroup
API.
I have no idea, I don't develop Ubuntu. They will have to come up with
some cgroup maintenance daemon of their own. As I know them they'll
either do a "port" of the systemd counterpart (but that's going to be
tough!), or they'll stick something half-baked into Upstart...
Sorry if this all sounds a bit disappointing. But yeah, this all is not
a trivial change...
I'm worried that the impedance mismatch between systemd and any other
possible API is going to be enormous. On systemd, I'll have to:

- Create a throwaway unit
- Figure out how to wire up stdout and stderr correctly (I use them
for communication between processes)
- Translate the current directory, the environment, etc. into systemd
configuration
- Translate my desired resource controls into systemd's "let's
pretend that there aren't really cgroups underlying it" configuration
- Start the throwaway unit
- Figure out how to get notified when it finishes

Without systemd, I'll have to:

- fork()
- Ask whatever is managing cgroups to switch me to a different cgroup
- exec()

This is going to suck, I think.
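(For reference, the three-step non-systemd path above is the classic
direct-cgroupfs pattern, roughly as sketched below; the cgroup path is
made up, and this kind of direct write is exactly what is slated to go
away:)

    /* Classic pattern: fork, move the child into an existing cgroup
     * by writing its PID to the cgroupfs "tasks" file, then exec. */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>

    int main(void) {
            pid_t pid = fork();
            if (pid == 0) {
                    /* child: move self into the target cgroup */
                    FILE *f = fopen("/sys/fs/cgroup/cpu/mygroup/tasks", "w");
                    if (f) {
                            fprintf(f, "%d\n", (int) getpid());
                            fclose(f);
                    }
                    execl("/usr/bin/some-worker", "some-worker", (char *) NULL);
                    _exit(1);
            }
            return pid > 0 ? 0 : 1;
    }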

--Andy
Tejun Heo
2013-06-24 18:38:32 UTC
Hello,
Post by Lennart Poettering
1. I put the entire world into a separate, highly constrained
cgroup. My real-time code runs outside that cgroup. This seems to be
exactly what slices are for, but I need kernel threads to go into
the constrained cgroup. Will systemd support this?
I am not sure whether the ability to move kernel threads into cgroups
will stay around at all, from the kernel side. Tejun, can you comment on this?
Any kernel threads with PF_NO_SETAFFINITY set already can't be removed
from the root cgroup. In general, I don't think moving kernel threads
into !root cgroups is a good idea. They're in most cases shared
resources and userland doesn't really have much idea what they're
actually doing, which is the fundamental issue.

Which kthreads are running on the kernel side and what they're doing
is a strict implementation detail of the kernel. There's no effort on
the kernel side to keep them stable, and userland is likely to get
things completely wrong - e.g. many kernel threads named after
workqueues in any recent kernels don't actually do anything until the
system is under heavy memory pressure. Userland can't tell and has no
control over what's being executed where at all and that's the way it
should be.

That said, there are cases where certain async executions are
concretely bound to userland processes - say, (planned) aio updates,
virt drivers and so on. Right now, virt implements something pretty
hacky but I think they'll have to be tied closer to the usual process
mechanism - i.e. they should be saying that these kthreads are serving
this process and should be treated as such in terms of resource
control rather than the current "move this kthread to this set of
cgroups, don't ask why" thing. Another not-well-thought-out aspect of
the current cgroup. :(

I have an idea where it should be headed in the long term but am not
sure about a short-term solution. Given that the only sort-of widespread
use case is virt kthreads, maybe it just needs to be special-cased for
now. Not sure.
Post by Lennart Poettering
2. I manage services and tasks outside systemd (for one thing, I
currently use Ubuntu, but even if I were on Fedora, I have a bunch
of fine-grained things that figure out how they're supposed to
allocate resources, and porting them to systemd just to keep working
in the new world order would be a PITA [1]).
(cgroups have the odd feature that they are per-task, not per thread
group, and the systemd proposal seems likely to break anything that
actually wants task granularity. I may actually want to use this,
even though it's a bit evil -- my real-time thread groups have
non-real-time threads.)
Here too, Tejun is pretty keen on removing the ability to split up
threads into cgroups from the kernel, and will only allow this
per-process. Tejun, please comment!
Yes, again, the biggest issue is how much of low-level cgroup details
become known to individual programs. Splitting threads into different
cgroups would in most cases mean that the binary itself would become
aware of cgroups, and it's akin to burying sysctl knob tunings into
individual binaries. cgroup is not an interface for each individual
program to fiddle with. If certain thread-granular control is
absolutely necessary and justifiable, it's something to be added to
the existing thread API, not something to be bolted on using cgroups.

So, I'm quite strongly against allowing splitting threads of
the same process into different cgroups.

Thanks.
--
tejun
Andy Lutomirski
2013-06-24 18:49:05 UTC
Post by Tejun Heo
Hello,
Post by Lennart Poettering
1. I put the entire world into a separate, highly constrained
cgroup. My real-time code runs outside that cgroup. This seems to be
exactly what slices are for, but I need kernel threads to go into
the constrained cgroup. Will systemd support this?
I am not sure whether the ability to move kernel threads into cgroups
will stay around at all, from the kernel side. Tejun, can you comment on this?
Any kernel threads with PF_NO_SETAFFINITY set already can't be removed
from the root cgroup. In general, I don't think moving kernel threads
into !root cgroups is a good idea. They're in most cases shared
resources and userland doesn't really have much idea what they're
actually doing, which is the fundamental issue.
Which kthreads are running on the kernel side and what they're doing
is a strict implementation detail of the kernel. There's no effort on
the kernel side to keep them stable, and userland is likely to get
things completely wrong - e.g. many kernel threads named after
workqueues in any recent kernels don't actually do anything until the
system is under heavy memory pressure. Userland can't tell and has no
control over what's being executed where at all and that's the way it
should be.
That said, there are cases where certain async executions are
concretely bound to userland processes - say, (planned) aio updates,
virt drivers and so on. Right now, virt implements something pretty
hacky but I think they'll have to be tied closer to the usual process
mechanism - i.e. they should be saying that these kthreads are serving
this process and should be treated as such in terms of resource
control rather than the current "move this kthread to this set of
cgroups, don't ask why" thing. Another not-well-thought-out aspect of
the current cgroup. :(
I have an idea where it should be headed in the long term but am not
sure about a short-term solution. Given that the only sort-of widespread
use case is virt kthreads, maybe it just needs to be special-cased for
now. Not sure.
I'll be okay (I think) if I can reliably set affinities of these
threads. I'm currently doing it with cgroups.

That being said, I don't like the direction that kernel thread magic
affinity is going. It may be great for cache performance and reducing
random bouncing, but I have a scheduling-jitter-sensitive workload and
I don't care about overall system throughput. I need the kernel to
stay the f!&k off my important cpus, and arranging for this to happen
is becoming increasingly complicated.
Post by Tejun Heo
Post by Lennart Poettering
2. I manage services and tasks outside systemd (for one thing, I
currently use Ubuntu, but even if I were on Fedora, I have a bunch
of fine-grained things that figure out how they're supposed to
allocate resources, and porting them to systemd just to keep working
in the new world order would be a PITA [1]).
(cgroups have the odd feature that they are per-task, not per thread
group, and the systemd proposal seems likely to break anything that
actually wants task granularity. I may actually want to use this,
even though it's a bit evil -- my real-time thread groups have
non-real-time threads.)
Here too, Tejun is pretty keen on removing the ability to split up
threads into cgroups from the kernel, and will only allow this
per-process. Tejun, please comment!
Yes, again, the biggest issue is how much of low-level cgroup details
become known to individual programs. Splitting threads into different
cgroups would in most cases mean that the binary itself would become
aware of cgroups, and it's akin to burying sysctl knob tunings into
individual binaries. cgroup is not an interface for each individual
program to fiddle with. If certain thread-granular control is
absolutely necessary and justifiable, it's something to be added to
the existing thread API, not something to be bolted on using cgroups.
cgroups are most certainly something that a binary can be aware of.
It's not like a sysctl knob at all -- it's per process. I have lots
of binaries that have worked quite well for a couple years that move
themselves into different cgroups. I have no problem with a unified
hierarchy, but I need control of my little piece of the hierarchy.

I don't care if the interface to do so changes, but the basic
functionality is important.
Post by Tejun Heo
So, I'm quite strongly against allowing splitting threads of
the same process into different cgroups.
I don't need that feature. (Which is not to say that no one else does.)

--Andy
Tejun Heo
2013-06-24 19:10:00 UTC
Hello, Andy.
Post by Andy Lutomirski
Post by Tejun Heo
I have an idea where it should be headed in the long term but am not
sure about a short-term solution. Given that the only sort-of widespread
use case is virt kthreads, maybe it just needs to be special-cased for
now. Not sure.
I'll be okay (I think) if I can reliably set affinities of these
threads. I'm currently doing it with cgroups.
That being said, I don't like the direction that kernel thread magic
affinity is going. It may be great for cache performance and reducing
random bouncing, but I have a scheduling-jitter-sensitive workload and
I don't care about overall system throughput. I need the kernel to
stay the f!&k off my important cpus, and arranging for this to happen
is becoming increasingly complicated.
Why is it becoming increasingly complicated? The biggest change
probably was the shared workqueue pool implementation but that was
years ago, and workqueue has grown pool attributes recently, adding
more properly designed flexibility; adding default affinity for
!per-cpu workqueues, for example, should be pretty easy now. But
anyway, if it's an issue, it should be examined and properly solved
rather than hacking up a hacky solution with cgroups.
Post by Andy Lutomirski
cgroups are most certainly something that a binary can be aware of.
It's not like a sysctl knob at all -- it's per process. I have lots
No, it definitely is not. Sure it is more granular than sysctl but
that's it. It exposes control knobs which are directly tied into
kernel implementation details. It is not a properly designed
programming API by any stretch of imagination. It is an extreme
failure on the kernel side that that part hasn't been made crystal
clear from the beginning. I don't know how intentional it was but the
whole thing is completely botched.

cgroup *never* was held to the standard necessary for any widely
available API and many of the controls it exposes are exactly at the
level of sysctls. As the interface was a filesystem, it could evade
scrutiny and with the hierarchical organization also gave the
impression that it's something which can be used directly by
individual applications. It found a loophole in the way we implement
and police kernel APIs and then exploited it like there's no tomorrow.

We are firmly bound to maintain what already has been exposed from the
kernel side and I'm not gonna break any of them but the free-for-all
cgroup is broken and deprecated. It's gonna wither and fade away and
any attempt to reverse that will be met with extreme prejudice.
Post by Andy Lutomirski
of binaries that have worked quite well for a couple years that move
themselves into different cgroups. I have no problem with a unified
hierarchy, but I need control of my little piece of the hierarchy.
I don't care if the interface to do so changes, but the basic
functionality is important.
Whether you care or not is completely irrelevant. Individual binaries
widely incorporating cgroup details automatically binds the kernel to
those details. It becomes excruciatingly painful to back out after a
certain point. I don't think we're there yet, given the overall
immaturity and brokenness of cgroups, and it's imperative that we back
the hell out as fast as
possible before this insanity spreads any wider.

Thanks.
--
tejun
Andy Lutomirski
2013-06-24 19:24:38 UTC
Post by Tejun Heo
Hello, Andy.
Post by Andy Lutomirski
Post by Tejun Heo
I have an idea where it should be headed in the long term but am not
sure about a short-term solution. Given that the only sort-of widespread
use case is virt kthreads, maybe it just needs to be special-cased for
now. Not sure.
I'll be okay (I think) if I can reliably set affinities of these
threads. I'm currently doing it with cgroups.
That being said, I don't like the direction that kernel thread magic
affinity is going. It may be great for cache performance and reducing
random bouncing, but I have a scheduling-jitter-sensitive workload and
I don't care about overall system throughput. I need the kernel to
stay the f!&k off my important cpus, and arranging for this to happen
is becoming increasingly complicated.
Why is it becoming increasingly complicated? The biggest change
probably was the shared workqueue pool implementation but that was
years ago, and workqueue has grown pool attributes recently, adding
more properly designed flexibility; adding default affinity for
!per-cpu workqueues, for example, should be pretty easy now. But
anyway, if it's an issue, it should be examined and properly solved
rather than hacking up a hacky solution with cgroups.
Because more things are becoming per-cpu, without the option of moving
per-cpu work done on behalf of one cpu to another cpu. RCU is a nice
exception.
Post by Tejun Heo
Post by Andy Lutomirski
cgroups are most certainly something that a binary can be aware of.
It's not like a sysctl knob at all -- it's per process. I have lots
No, it definitely is not. Sure it is more granular than sysctl but
that's it. It exposes control knobs which are directly tied into
kernel implementation details. It is not a properly designed
programming API by any stretch of imagination. It is an extreme
failure on the kernel side that that part hasn't been made crystal
clear from the beginning. I don't know how intentional it was but the
whole thing is completely botched.
cgroup *never* was held to the standard necessary for any widely
available API and many of the controls it exposes are exactly at the
level of sysctls. As the interface was a filesystem, it could evade
scrutiny and with the hierarchical organization also gave the
impression that it's something which can be used directly by
individual applications. It found a loophole in the way we implement
and police kernel APIs and then exploited it like there's no tomorrow.
We are firmly bound to maintain what already has been exposed from the
kernel side and I'm not gonna break any of them but the free-for-all
cgroup is broken and deprecated. It's gonna wither and fade away and
any attempt to reverse that will be met with extreme prejudice.
The functionality I care about is that a program can reliably and
hierarchically subdivide system resources -- think rlimits but
actually useful. I, and probably many other things, want this
functionality. Yes, the current cgroup interface is awful, but it
gets one thing right: it's a hierarchy.

Back when my software ran on Windows, I used the awful "job" interface
to allocate resources among different parts of my software. When I
switched to Linux, I lost some of that functionality and replaced
other bits with cgroups. It's hackish, but it works.

Now we're apparently moving toward having a unified hierarchy
(great!), a more sane API (great!), and a nasty userspace situation
where systemd-using systems control the hierarchy through a highly
limiting systemd-specific interface and non-systemd systems do
something else which will presumably look nothing like what systemd
does.

I would argue that designing a kernel interface that requires exactly
one userspace component to manage it and ties that one userspace
component to something that can't easily be deployed everywhere (the
init system) is as big a cheat as the old approach of sneaking bad
APIs in through a filesystem was.

IOW, please, when designing this, please specify an API that programs
are permitted to use, and let that API be reviewed.

--Andy
Tejun Heo
2013-06-24 19:37:43 UTC
Hello,
Post by Andy Lutomirski
Because more things are becoming per-cpu, without the option of moving
per-cpu work done on behalf of one cpu to another cpu. RCU is a nice
exception.
Hmm... but in most cases it's per-cpu on the same cpu that initiated
the task. If a given CPU is just crunching numbers and IRQ affinity
is properly configured, the CPU shouldn't be bothered too much by
per-cpu work items. If there are any, please let us know. We can hunt
them down.
Post by Andy Lutomirski
The functionality I care about is that a program can reliably and
hierarchically subdivide system resources -- think rlimits but
actually useful. I, and probably many other things, want this
functionality. Yes, the current cgroup interface is awful, but it
gets one thing right: it's a hierarchy.
And the hierarchy support was completely broken for many resource
controllers up until only several releases ago.
Post by Andy Lutomirski
I would argue that designing a kernel interface that requires exactly
one userspace component to manage it and ties that one userspace
component to something that can't easily be deployed everywhere (the
init system) is as big a cheat as the old approach of sneaking bad
APIs in through a filesystem was.
In terms of API, it is firmly at the level of sysctl. That's it.

While I agree that having a proper kernel API for hierarchical
resource management could be nice, that currently is out of scope.
We're already knee-deep in shit with the limited capabilities we're
trying to implement. Also, I really don't think cgroup is the right
interface for such thing even if we get to that. It should be part of
the usual process/thread model, not this completely separate thing on
the side.
Post by Andy Lutomirski
IOW, please, when designing this, please specify an API that programs
are permitted to use, and let that API be reviewed.
cgroup is not that API and it's never gonna be in all likelihood. As
for systemd vs. non-systemd compatibility, I'm afraid I don't have a
good answer. This is still all in a pretty early phase and the
proper abstractions and APIs are being figured out. Hopefully, we'll
converge on a mostly compatible high-level abstraction which can be
presented regardless of the actual base system implementation.

Thanks.
--
tejun
Andy Lutomirski
2013-06-24 23:01:07 UTC
Post by Tejun Heo
Hello,
Post by Andy Lutomirski
Because more things are becoming per-cpu, without the option of moving
per-cpu work done on behalf of one cpu to another cpu. RCU is a nice
exception.
Hmm... but in most cases it's per-cpu on the same cpu that initiated
the task. If a given CPU is just crunching numbers and IRQ affinity
is properly configured, the CPU shouldn't be bothered too much by
per-cpu work items. If there are any, please let us know. We can hunt
them down.
I'm not just crunching numbers -- I do (nonblocking) I/O as well.
Post by Tejun Heo
Post by Andy Lutomirski
The functionality I care about is that a program can reliably and
hierarchically subdivide system resources -- think rlimits but
actually useful. I, and probably many other things, want this
functionality. Yes, the current cgroup interface is awful, but it
gets one thing right: it's a hierarchy.
And the hierarchy support was completely broken for many resource
controllers up until only several releases ago.
Post by Andy Lutomirski
I would argue that designing a kernel interface that requires exactly
one userspace component to manage it and ties that one userspace
component to something that can't easily be deployed everywhere (the
init system) is as big a cheat as the old approach of sneaking bad
APIs in through a filesystem was.
In terms of API, it is firmly at the level of sysctl. That's it.
While I agree that having a proper kernel API for hierarchical
resource management could be nice, that currently is out of scope.
We're already knee-deep in shit with the limited capabilities we're
trying to implement. Also, I really don't think cgroup is the right
interface for such thing even if we get to that. It should be part of
the usual process/thread model, not this completely separate thing on
the side.
Post by Andy Lutomirski
IOW, please, when designing this, please specify an API that programs
are permitted to use, and let that API be reviewed.
cgroup is not that API and it's never gonna be in all likelihood. As
for systemd vs. non-systemd compatibility, I'm afraid I don't have a
good answer. This is still all in a pretty early phase and the
proper abstractions and APIs are being figured out. Hopefully, we'll
converge on a mostly compatible high-level abstraction which can be
presented regardless of the actual base system implementation.
So what is cgroup for? That is, what's the goal for what the new API
should be able to do?

AFAICT the main reason that systemd uses cgroup is to efficiently
track which service various processes came from and to send signals,
and it seems like that use case could be handled without cgroups at
all by creative use of subreapers and a syscall to broadcast a signal
to everything that has a given subreaper as an ancestor. In that
case, systemd could be asked to stay away from cgroups even in the
single-hierarchy case.

--Andy
Tejun Heo
2013-06-24 23:19:52 UTC
Hello,
Post by Andy Lutomirski
So what is cgroup for? That is, what's the goal for what the new API
should be able to do?
It is for controlling and distributing resources. That part doesn't
change. It's just not built to be used directly by individual
applications. It's an admin tool just like sysctl - be that admin a
human or a userland base system.

There's a huge chasm between something which can be generally used by
normal applications and something which is restricted to admins and
base systems in terms of interface generality and stability, security,
how the abstractions fit together with the existing APIs and so on.
cgroup firmly belongs to the latter. It still serves the same purpose
but isn't, in a way, developed enough to be used directly by
individual applications and I'm not even sure we want or need to
develop it to such a level.

Thanks.
--
tejun
Andy Lutomirski
2013-06-24 23:27:17 UTC
Post by Tejun Heo
Hello,
Post by Andy Lutomirski
So what is cgroup for? That is, what's the goal for what the new API
should be able to do?
It is for controlling and distributing resources. That part doesn't
change. It's just not built to be used directly by individual
applications. It's an admin tool just like sysctl - be that admin a
human or a userland base system.
There's a huge chasm between something which can be generally used by
normal applications and something which is restricted to admins and
base systems in terms of interface generality and stability, security,
how the abstractions fit together with the existing APIs and so on.
cgroup firmly belongs to the latter. It still serves the same purpose
but isn't, in a way, developed enough to be used directly by
individual applications and I'm not even sure we want or need to
develop it to such a level.
My application is running on a single-purpose system I administer.

I guess what I'm trying to say here is that many systems will rather
fundamentally use systemd. Admins of those systems should still have
access to a reasonably large subset of cgroup functionality. If the
single-hierarchy model is going to prevent going around systemd and if
systemd isn't going to expose all of the useful cgroup functionality,
then perhaps there should be a way to separate systemd's hierarchy
from the cgroup hierarchy.

Looking at http://0pointer.de/blog/projects/cgroups-vs-cgroups.html,
it looks like systemd doesn't actually need the cgroup resource
control functionality. Maybe there's a way to disentangle this stuff.
The /proc/<pid>/children feature that CRIU added seems like a decent
start.

--Andy
Tejun Heo
2013-06-24 23:37:16 UTC
Hello, Andy.
Post by Andy Lutomirski
I guess what I'm trying to say here is that many systems will rather
fundamentally use systemd. Admins of those systems should still have
access to a reasonably large subset of cgroup functionality. If the
single-hierarchy model is going to prevent going around systemd and if
systemd isn't going to expose all of the useful cgroup functionality,
then perhaps there should be a way to separate systemd's hierarchy
from the cgroup hierarchy.
I don't think systemd will prevent you from building your own
hierarchy on the side. It sure won't be properly supported and things
might break in corner cases / over time, but if you're willing to take
such risks anyway... In the long term, though, what should happen
probably is examining use cases like yours and then incorporating
sensible mechanisms to support that into the base system
infrastructure. It might not be completely identical but I'm sure
over time we'll be able to find what are the fundamental pieces and
proper abstractions. Right now, we're exposing way too much without
even clearly understanding what is being enabled. It is
unsustainable.

Thanks.
--
tejun
Andy Lutomirski
2013-06-24 23:38:15 UTC
Post by Tejun Heo
Hello, Andy.
Post by Andy Lutomirski
I guess what I'm trying to say here is that many systems will rather
fundamentally use systemd. Admins of those systems should still have
access to a reasonably large subset of cgroup functionality. If the
single-hierarchy model is going to prevent going around systemd and if
systemd isn't going to expose all of the useful cgroup functionality,
then perhaps there should be a way to separate systemd's hierarchy
from the cgroup hierarchy.
I don't think systemd will prevent you from building your own
hierarchy on the side. It sure won't be properly supported and things
might break in corner cases / over time, but if you're willing to take
such risks anyway... In the long term, though, what should happen
probably is examining use cases like yours and then incorporating
sensible mechanisms to support that into the base system
infrastructure. It might not be completely identical but I'm sure
over time we'll be able to find what are the fundamental pieces and
proper abstractions. Right now, we're exposing way too much without
even clearly understanding what is being enabled. It is
unsustainable.
Now I'm confused. I thought that support for multiple hierarchies was
going away. Is it here to stay after all?

--Andy
Tejun Heo
2013-06-24 23:40:11 UTC
Hello,
Post by Andy Lutomirski
Now I'm confused. I thought that support for multiple hierarchies was
going away. Is it here to stay after all?
It is going to be deprecated but also stay around for quite a while.
That said, I didn't mean to use multiple hierarchies. I was saying
that if you build a sub-hierarchy in the unified hierarchy, you're
likely to get away with it in most cases.

Thanks.

--
tejun
Andy Lutomirski
2013-06-24 23:42:51 UTC
Post by Tejun Heo
Hello,
Post by Andy Lutomirski
Now I'm confused. I thought that support for multiple hierarchies was
going away. Is it here to stay after all?
It is going to be deprecated but also stay around for quite a while.
That said, I didn't mean to use multiple hierarchies. I was saying
that if you build a sub-hierarchy in the unified hierarchy, you're
likely to get away with it in most cases.
Isn't that exactly what I was originally asking for? Quoting from
earlier in the thread:

On Mon, Jun 24, 2013 at 6:27 AM, Lennart Poettering
Post by Tejun Heo
Post by Andy Lutomirski
2. I manage services and tasks outside systemd (for one thing, I
currently use Ubuntu, but even if I were on Fedora, I have a bunch
of fine-grained things that figure out how they're supposed to
allocate resources, and porting them to systemd just to keep working
in the new world order would be a PITA [1]).
[...]
Post by Tejun Heo
Post by Andy Lutomirski
I think that what I want are something like sub-unit cgroups -- I
want to be able to ask systemd to further subdivide the group for my
unit, login session, or whatever. Would this be reasonable?
(Another way of thinking of this is that a unit would have a whole
cgroup hierarchy instead of just one cgroup.)
The idea is not even to allow this. Basically, if you want to partition
your daemon into different cgroups you need to do that through systemd's
abstractions: slices and services. To make this more palatable we'll
introduce "throw-away" units though, so that you can dynamically run
something as a workload and don't need to be concerned about naming it
or cleaning it up.
If I can subdivide my service in the hierarchy, then I'm happy. If
this gets lost *and* systemd insists on controlling the one and only
cgroup hierarchy, then I think I have serious problems with the new
regime.

--Andy
Lennart Poettering
2013-06-24 23:57:20 UTC
Permalink
Post by Andy Lutomirski
AFAICT the main reason that systemd uses cgroups is to efficiently
track which service various processes came from and to send signals,
and it seems like that use case could be handled without cgroups at
all by creative use of subreapers and a syscall to broadcast a signal
to everything that has a given subreaper as an ancestor. In that
case, systemd could be asked to stay away from cgroups even in the
single-hierarchy case.
systemd uses cgroups to manage services. Managing services means many
things. Among them: keeping track of processes, listing processes of a
service, killing processes of a service, doing per-service logging
(which means reliably, immediately, and race-freely tracing back
messages to the service which logged them), about 55 other things, and
also resource management.

I don't see how I can do any of this without something like cgroups,
i.e. a hierarchical construct with resource management involved, which
allows me to securely put labels on processes.

Lennart
--
Lennart Poettering - Red Hat, Inc.
Andy Lutomirski
2013-06-25 00:09:23 UTC
Permalink
On Mon, Jun 24, 2013 at 4:57 PM, Lennart Poettering
Post by Lennart Poettering
Post by Andy Lutomirski
AFAICT the main reason that systemd uses cgroups is to efficiently
track which service various processes came from and to send signals,
and it seems like that use case could be handled without cgroups at
all by creative use of subreapers and a syscall to broadcast a signal
to everything that has a given subreaper as an ancestor. In that
case, systemd could be asked to stay away from cgroups even in the
single-hierarchy case.
systemd uses cgroups to manage services. Managing services means many
things. Among them: keeping track of processes, listing processes of a
service, killing processes of a service, doing per-service logging
(which means reliably, immediately, and race-freely tracing back
messages to the service which logged them), about 55 other things, and
also resource management.
I don't see how I can do any of this without something like cgroups,
i.e. a hierarchical construct with resource management involved, which
allows me to securely put labels on processes.
Boneheaded straw-man proposal: two new syscalls and a few spare processes.

int sys_task_reaper(int tid): Returns the reaper for the task tid
(which is 1 if there's no subreaper). (This could just as easily be a
file in /proc.)

int sys_killall_under_subreaper(int subreaper, int sig): Broadcasts
sig to all tasks under subreaper (excluding subreaper). Guarantees
that, even if those tasks are forking, they all get the signal.

Then, when starting a service, systemd forks, sets the child to be a
subreaper, then forks that child again to exec the service.

Does this do everything that's needed? sys_task_reaper is trivial to
implement (that functionality is already there in the reparenting
code), and sys_killall_under_subreaper is probably not so bad.


This has one main downside I can think of: it wastes a decent number
of processes (one subreaper per service).

--Andy
Lennart Poettering
2013-06-25 09:43:31 UTC
Permalink
Post by Andy Lutomirski
On Mon, Jun 24, 2013 at 4:57 PM, Lennart Poettering
Post by Lennart Poettering
Post by Andy Lutomirski
AFAICT the main reason that systemd uses cgroups is to efficiently
track which service various processes came from and to send signals,
and it seems like that use case could be handled without cgroups at
all by creative use of subreapers and a syscall to broadcast a signal
to everything that has a given subreaper as an ancestor. In that
case, systemd could be asked to stay away from cgroups even in the
single-hierarchy case.
systemd uses cgroups to manage services. Managing services means many
things. Among them: keeping track of processes, listing processes of a
service, killing processes of a service, doing per-service logging
(which means reliably, immediately, and race-freely tracing back
messages to the service which logged them), about 55 other things, and
also resource management.
I don't see how I can do any of this without something like cgroups,
i.e. a hierarchical construct with resource management involved, which
allows me to securely put labels on processes.
Boneheaded straw-man proposal: two new syscalls and a few spare processes.
int sys_task_reaper(int tid): Returns the reaper for the task tid
(which is 1 if there's no subreaper). (This could just as easily be a
file in /proc.)
int sys_killall_under_subreaper(int subreaper, int sig): Broadcasts
sig to all tasks under subreaper (excluding subreaper). Guarantees
that, even if those tasks are forking, they all get the signal.
Then, when starting a service, systemd forks, sets the child to be a
subreaper, then forks that child again to exec the service.
Does this do everything that's needed?
No. It doesn't do anything that's needed. How do I list all PIDs in a
service with this? How do I determine the service of a PID? How do I do
resource management with this?
Post by Andy Lutomirski
sys_task_reaper is trivial to
implement (that functionality is already there in the reparenting
code), and sys_killall_under_subreaper is probably not so bad.
This has one main downside I can think of: it wastes a decent number
of processes (one subreaper per service).
Yeah, also the downside that it doesn't do what we need.

Lennart
--
Lennart Poettering - Red Hat, Inc.
Andy Lutomirski
2013-06-25 12:28:26 UTC
Permalink
Post by Lennart Poettering
Post by Andy Lutomirski
On Mon, Jun 24, 2013 at 4:57 PM, Lennart Poettering
Post by Lennart Poettering
Post by Andy Lutomirski
AFAICT the main reason that systemd uses cgroups is to efficiently
track which service various processes came from and to send signals,
and it seems like that use case could be handled without cgroups at
all by creative use of subreapers and a syscall to broadcast a signal
to everything that has a given subreaper as an ancestor. In that
case, systemd could be asked to stay away from cgroups even in the
single-hierarchy case.
systemd uses cgroups to manage services. Managing services means many
things. Among them: keeping track of processes, listing processes of a
service, killing processes of a service, doing per-service logging
(which means reliably, immediately, and race-freely tracing back
messages to the service which logged them), about 55 other things, and
also resource management.
I don't see how I can do any of this without something like cgroups,
i.e. a hierarchical construct with resource management involved, which
allows me to securely put labels on processes.
Boneheaded straw-man proposal: two new syscalls and a few spare processes.
int sys_task_reaper(int tid): Returns the reaper for the task tid
(which is 1 if there's no subreaper). (This could just as easily be a
file in /proc.)
int sys_killall_under_subreaper(int subreaper, int sig): Broadcasts
sig to all tasks under subreaper (excluding subreaper). Guarantees
that, even if those tasks are forking, they all get the signal.
Then, when starting a service, systemd forks, sets the child to be a
subreaper, then forks that child again to exec the service.
Does this do everything that's needed?
No. It doesn't do anything that's needed. How do I list all PIDs in a
service with this?
Walk /proc/<subreaper>/children recursively. A kernel patch to make that
field show up unconditionally instead of hiding under EXPERT would help.
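Something along these lines (assuming the children file is compiled in,
and single-threaded children for brevity; note it's inherently racy
against forking processes, which is part of Lennart's objection):

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Recursively print every descendant of the given pid by following
 * /proc/<pid>/task/<tid>/children; a complete version would iterate
 * every entry in /proc/<pid>/task/. */
static void walk(pid_t pid)
{
        char path[64];
        FILE *f;
        int child;

        snprintf(path, sizeof(path), "/proc/%d/task/%d/children",
                 (int) pid, (int) pid);
        f = fopen(path, "r");
        if (!f)
                return; /* feature not compiled in, or pid already gone */
        while (fscanf(f, "%d", &child) == 1) {
                printf("%d\n", child);
                walk((pid_t) child);
        }
        fclose(f);
}

int main(int argc, char **argv)
{
        if (argc != 2)
                return 1;
        walk((pid_t) atoi(argv[1]));
        return 0;
}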
Post by Lennart Poettering
How do I determine the service of a PID?
Call sys_task_reaper, then look up what service that subreaper comes from.
Post by Lennart Poettering
How do I do
resource management with this?
With cgroups, unless the admin has configured systemd not to use cgroups,
in which case you don't. (The whole point would be to keep
DefaultControllers= without using the one and only cgroup hierarchy.)

--Andy
Post by Lennart Poettering
Post by Andy Lutomirski
sys_task_reaper is trivial to
implement (that functionality is already there in the reparenting
code), and sys_killall_under_subreaper is probably not so bad.
This has one main downside I can think of: it wastes a decent number
of processes (one subreaper per service).
Yeah, also the downside that it doesn't do what we need.
Lennart
--
Lennart Poettering - Red Hat, Inc.
Andy Lutomirski
2013-06-22 22:19:28 UTC
Permalink
Post by Lennart Poettering
2) This hierarchy becomes private property of systemd. systemd will set
it up. Systemd will maintain it. Systemd will rearrange it. Other
software that wants to make use of cgroups can do so only through
systemd's APIs. This single-writer logic is absolutely necessary, since
interdependencies between the various controllers, the various
attributes, the various cgroups are non-obvious and we simply cannot
allow that cgroup users alter the tree independently of each other
forever. Due to all this: The "Pax Cgroup" document is a thing of the
past, it is dead.
If you are using non-trivial cgroup setups with systemd right now, then
things will change for you. We will provide you with similar
functionality as before, but things will be different and less
low-level. As long as you only used the high-level options such as
CPUShares, MemoryLimit and so on you should be on the safe side.
Hmm. This may be tricky for my use case. Here are a few issues. For
all I know, they may already be supported (or planned), but I don't want
to get caught.

1. I put the entire world into a separate, highly constrained
cgroup. My real-time code runs outside that cgroup. This seems to be
exactly what slices are for, but I need kernel threads to go into the
constrained cgroup. Will systemd support this?

2. I manage services and tasks outside systemd (for one thing, I
currently use Ubuntu, but even if I were on Fedora, I have a bunch of
fine-grained things that figure out how they're supposed to allocate
resources, and porting them to systemd just to keep working in the new
world order would be a PITA [1]).

(cgroups have the odd feature that they are per-task, not per thread
group, and the systemd proposal seems likely to break anything that
actually wants task granularity. I may actually want to use this, even
though it's a bit evil -- my real-time thread groups have non-real-time
threads.)

I think that what I want are something like sub-unit cgroups -- I want
to be able to ask systemd to further subdivide the group for my unit,
login session, or whatever. Would this be reasonable? (Another way of
thinking of this is that a unit would have a whole cgroup hierarchy
instead of just one cgroup.)

I think that the single-hierarchy model will require that I subdivide my
user session so that the default sub-unit cgroup is constrained
similarly to the default slice. I'll lose functionality, but I don't
think this is a showstopper.

A different approach would be to allow units to (with systemd's
cooperation) escape into their own, dynamically created unit. This
seems kind of awful.

3. My code runs unprivileged, but it still wants to configure itself.
If needed, I can write a little privileged daemon to handle the systemd
calls.

I think I can get away without anything fancy if a unit (login session?)
can grant the right to manipulate sub-unit cgroups to a non-root user.

4. As mentioned, I'm on Ubuntu some of the time. I'd like to keep the
same code working on systemd and non-systemd systems.

How hard would it be to run systemd as just a cgroup controller? That
is, have systemd create its slices, run exactly one unit that represents
the whole system, and let other things use the cgroup API.


[1] Some day, I might convert my code to use a session systemd instance.
I'm not holding my breath, but it could be nice.

--Andy
David Strauss
2013-06-24 15:38:20 UTC
Permalink
On Fri, Jun 21, 2013 at 10:36 AM, Lennart Poettering
Post by Lennart Poettering
As long as you only used the high-level options such as
CPUShares, MemoryLimit and so on you should be on the safe side.
This is already representative of how we're doing things in large-scale
production and how we recommend other users use cgroups on
systemd-based distributions.
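For example, a minimal sketch of a unit using only those high-level
knobs (the unit name is made up; CPUShares= and MemoryLimit= are the
options Lennart names):

# /etc/systemd/system/example.service (hypothetical)
[Service]
ExecStart=/usr/bin/example-daemon
CPUShares=512
MemoryLimit=1G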

So, +1.

--
David Strauss
| ***@davidstrauss.net
| +1 512 577 5827 [mobile]
Brian Bockelman
2013-06-25 02:21:14 UTC
Permalink
Post by Lennart Poettering
2) This hierarchy becomes private property of systemd. systemd will set
it up. Systemd will maintain it. Systemd will rearrange it. Other
software that wants to make use of cgroups can do so only through
systemd's APIs. This single-writer logic is absolutely necessary, since
interdependencies between the various controllers, the various
attributes, the various cgroups are non-obvious and we simply cannot
allow that cgroup users alter the tree independently of each other
forever. Due to all this: The "Pax Cgroup" document is a thing of the
past, it is dead.
Hi [1],

I currently contribute cgroup support to a batch system
(http://research.cs.wisc.edu/htcondor/) and
am trying to figure out how this will affect me.

Right now, I take the resources provided by the cgroup set up by the
sysadmin and sub-divide them amongst the running jobs.
Cgroups are used for resource management, resource accounting, and job
management (using the freezer controller to deliver signals to all
processes at once). Jobs last from seconds to hours; a setup time of,
say, several hundred milliseconds is acceptable, as long as we can
easily create and destroy many jobs.

A few questions came to mind which may provide interesting input
to your design process:
1) I use cgroups heavily for resource accounting. Do you envision
me querying via dbus for each accounting attribute? Or do you
envision me querying for the cgroup name, then accessing the
controller statistics directly?
2) I currently fork and set up the resource environment (namespaces,
environment, working directory, etc). Can an appropriately privileged
process create a sub-slice, place itself in it, and then drop privs
/ exec?
3) More generally, will I be able to interact with slices directly, or
will I need to create throw-away units and launch them via systemd
(versus a "normal" fork/exec)?
- The latter causes quite a bit of anxiety for me - we currently
support many POSIX platforms plus Windows (hey - at least
we dropped HPUX) and I'd like to avoid a completely independent
code path for spawning jobs on Linux.
4) Will many short-lived jobs cause any heartache? Would anything
untoward happen to my system if I spawned / destroyed jobs (and
corresponding units or slices) at, say, 1Hz?
5) Will I be able to delegate management of a subslice to a non-privileged user?

I'm excited to see new ideas (again, having system tools be aware of
the batch system activity is intriguing [2]), but am a bit worried about
losing functionality and the cost of porting things to the new era!

Thanks!

Brian

[1] apologies if the reply comes through mangled; posting through
the gmane web interface.
[2] Hopefully something that works better than
"ps xawf -eo pid,user,cgroup,args" which currently segfaults for me :(
Lennart Poettering
2013-06-25 09:56:26 UTC
Permalink
Post by Brian Bockelman
A few questions came to mind which may provide interesting input
1) I use cgroups heavily for resource accounting. Do you envision
me querying via dbus for each accounting attribute? Or do you
envision me querying for the cgroup name, then accessing the
controller statistics directly?
Good question. Tejun wants systemd to cover that too. I am not entirely
sure. I don't like the extra roundtrip for measuring the accounting
bits. But maybe we can add a library that avoids the roundtrip, and
simply provides you with high-level accounting values for cgroups. That
way, for *changing* things you'd need to go via the bus, for *reading*
things we'd give you a library that goes directly to the cgroupfs and
avoids the roundtrip.
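Roughly, the read side could then be as simple as this (a sketch only;
the exact cgroupfs path layout is an assumption, not something we have
committed to):

#include <stdio.h>

int main(void)
{
        /* Read one accounting attribute straight from cgroupfs, the way
         * the proposed read-side library might do internally. */
        const char *attr =
            "/sys/fs/cgroup/memory/system.slice/example.service/memory.usage_in_bytes";
        FILE *f = fopen(attr, "r");
        unsigned long long usage;

        if (!f) {
                perror("fopen");
                return 1;
        }
        if (fscanf(f, "%llu", &usage) == 1)
                printf("memory usage: %llu bytes\n", usage);
        fclose(f);
        return 0;
}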
Post by Brian Bockelman
2) I currently fork and setup the resource environment (namespaces,
environment, working directory, etc). Can an appropriately privileged
process create a sub-slice, place itself in it, and then drop privs
/ exec?
We'll probably have a way for you to take an existing set of processes
and turn them dynamically into a new unit in systemd. These units would
be mostly like service units, except that systemd wouldn't start the
processes; they would be "foreign"-created. We are not sure about
the name for this yet (i.e. whether to cover it under the ".service"
suffix), but we'll probably call them "scopes" instead, with the suffix
".scope".

The scope units could then be manipulated at runtime for (cgroup based)
resource management the way normal services are too.

So basically, a service unit could be assigned to a slice unit, and
could then create "scope" units which detach subprocesses from the
original service unit, and get their own cgroup in the same slice or any
other.
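Nothing here is final, but a bus call for creating such a throw-away
scope might look roughly like this. The method name StartTransientUnit,
its signature, and the unit name are all assumptions at this point, not
settled API:

#include <systemd/sd-bus.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        sd_bus *bus = NULL;
        sd_bus_error error = SD_BUS_ERROR_NULL;
        sd_bus_message *reply = NULL;
        int r = sd_bus_open_system(&bus);
        if (r < 0)
                return 1;

        /* Ask the manager to wrap the calling process in a new scope
         * unit; "PIDs" carries the processes to adopt. */
        r = sd_bus_call_method(bus,
                "org.freedesktop.systemd1",
                "/org/freedesktop/systemd1",
                "org.freedesktop.systemd1.Manager",
                "StartTransientUnit",          /* assumed method name */
                &error, &reply,
                "ssa(sv)a(sa(sv))",
                "job-42.scope",                /* hypothetical unit name */
                "fail",                        /* job mode */
                1, "PIDs", "au", 1, (uint32_t) getpid(),
                0);                            /* no auxiliary units */
        if (r < 0)
                fprintf(stderr, "call failed: %s\n", error.message);

        sd_bus_error_free(&error);
        sd_bus_message_unref(reply);
        sd_bus_unref(bus);
        return r < 0 ? 1 : 0;
}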
Post by Brian Bockelman
3) More generally, will I be able to interact with slices directly, or
will I need to create throw-away units and launch them via systemd
(versus a "normal" fork/exec)?
Basically, with this "scope" concept in place, you'd create a throw-away
scope. In fact, "scope" units can only be created as throw-away units.
Post by Brian Bockelman
- The latter causes quite a bit of anxiety for me - we currently
support many POSIX platforms plus Windows (hey - at least
we dropped HPUX) and I'd like to avoid a completely independent
code path for spawning jobs on Linux.
4) Will many short-lived jobs cause any heartache? Would anything
untoward happen to my system if I spawned / destroyed jobs (and
corresponding units or slices) at, say, 1Hz?
Well, the idea is that these "scopes" are very lightweight. And we need
to make them scale (but I don't see why they shouldn't).
Post by Brian Bockelman
5) Will I be able to delegate management of a subslice to a
non-privileged user?
Unlikely, at least in the beginning.
Post by Brian Bockelman
I'm excited to see new ideas (again, having system tools be aware of
the batch system activity is intriguing [2]), but am a bit worried about
losing functionality and the cost of porting things to the new era!
There's certainly going to be some lost flexibility. But of course we'll
try to cover all interesting use cases.
Post by Brian Bockelman
[2] Hopefully something that works better than
"ps xawf -eo pid,user,cgroup,args" which currently segfaults for me :(
Hmm, could you file a bug, please?

Lennart
--
Lennart Poettering - Red Hat, Inc.
Brian Bockelman
2013-06-25 13:31:27 UTC
Permalink
Post by Lennart Poettering
Post by Brian Bockelman
A few questions came to mind which may provide interesting input
1) I use cgroups heavily for resource accounting. Do you envision
me querying via dbus for each accounting attribute? Or do you
envision me querying for the cgroup name, then accessing the
controller statistics directly?
Good question. Tejun wants systemd to cover that too. I am not entirely
sure. I don't like the extra roundtrip for measuring the accounting
bits. But maybe we can add a library that avoids the roundtrip, and
simply provides you with high-level accounting values for cgroups. That
way, for *changing* things you'd need to go via the bus, for *reading*
things we'd give you a library that goes directly to the cgroupfs and
avoids the roundtrip.
I like this idea. Hopefully single-writer, multiple-reader is a more sustainable path forward.

What about the notification APIs? We currently use the memory.oom_control to get a notification when a job hits limits (this allows us to know the job died due to memory issues, as the user code itself typically just SIGSEGV's). Is subscribing to notifications considered reading or writing in this case?
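(For reference, this is the mechanism we rely on today: the cgroup v1
eventfd registration via cgroup.event_control. The cgroup path below is
made up:)

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/eventfd.h>

int main(void)
{
        const char *dir = "/sys/fs/cgroup/memory/htcondor/job1";
        char buf[256];
        int efd = eventfd(0, 0), ofd, cfd, n;
        uint64_t count;

        snprintf(buf, sizeof(buf), "%s/memory.oom_control", dir);
        ofd = open(buf, O_RDONLY);
        snprintf(buf, sizeof(buf), "%s/cgroup.event_control", dir);
        cfd = open(buf, O_WRONLY);
        if (efd < 0 || ofd < 0 || cfd < 0)
                return 1;

        /* Writing "<eventfd> <attribute fd>" registers the notification. */
        n = snprintf(buf, sizeof(buf), "%d %d", efd, ofd);
        if (write(cfd, buf, n) != n)
                return 1;

        /* Block until the kernel signals an OOM event in the job cgroup. */
        if (read(efd, &count, sizeof(count)) == sizeof(count))
                printf("OOM events in job cgroup: %llu\n",
                       (unsigned long long) count);
        return 0;
}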
Post by Lennart Poettering
Post by Brian Bockelman
2) I currently fork and set up the resource environment (namespaces,
environment, working directory, etc). Can an appropriately privileged
process create a sub-slice, place itself in it, and then drop privs
/ exec?
We'll probably have a way for you to take an existing set of processes
and turn them dynamically into a new unit in systemd. These units would
be mostly like service units, except that systemd wouldn't start the
processes; they would be "foreign"-created. We are not sure about
the name for this yet (i.e. whether to cover it under the ".service"
suffix), but we'll probably call them "scopes" instead, with the suffix
".scope".
The scope units could then be manipulated at runtime for (cgroup based)
resource management the way normal services are too.
So basically, a service unit could be assigned to a slice unit, and
could then create "scope" units which detach subprocesses from the
original service unit, and get their own cgroup in the same slice or any
other.
This sounds manageable.
Post by Lennart Poettering
Post by Brian Bockelman
5) Will I be able to delegate management of a subslice to a
non-privileged user?
Unlikely, at least in the beginning.
(Very) long-term, this is attractive for us. We prefer the batch system to run unprivileged when possible (and to sacrifice the minimal amount of functionality to do so!).
Post by Lennart Poettering
Post by Brian Bockelman
I'm excited to see new ideas (again, having system tools be aware of
the batch system activity is intriguing [2]), but am a bit worried about
losing functionality and the cost of porting things to the new era!
There's certainly going to be some lost flexibility. But of course we'll
try to cover all interesting use cases.
I'll try to lurk and provide guidance about how us nutty batch system folks may try to use it.
Post by Lennart Poettering
Post by Brian Bockelman
[2] Hopefully something that works better than
"ps xawf -eo pid,user,cgroup,args" which currently segfaults for me :(
Hmm, could you file a bug, please?
Couldn't figure out a patch -- too little time. However, I at least tracked down the offending code. Bug report is here:

https://bugzilla.redhat.com/show_bug.cgi?id=977854

Thanks,

Brian
Lennart Poettering
2013-07-17 00:43:00 UTC
Permalink
Post by Brian Bockelman
Post by Lennart Poettering
Post by Brian Bockelman
A few questions came to mind which may provide interesting input
1) I use cgroups heavily for resource accounting. Do you envision
me querying via dbus for each accounting attribute? Or do you
envision me querying for the cgroup name, then accessing the
controller statistics directly?
Good question. Tejun wants systemd to cover that too. I am not entirely
sure. I don't like the extra roundtrip for measuring the accounting
bits. But maybe we can add a library that avoids the roundtrip, and
simply provides you with high-level accounting values for cgroups. That
way, for *changing* things you'd need to go via the bus, for *reading*
things we'd give you a library that goes directly to the cgroupfs and
avoids the roundtrip.
I like this idea. Hopefully single-writer, multiple-reader is a more sustainable path forward.
What about the notification APIs? We currently use the
memory.oom_control to get a notification when a job hits limits (this
allows us to know the job died due to memory issues, as the user code
itself typically just SIGSEGV's). Is subscribing to notifications
considered reading or writing in this case?
That sounds like another case for the library, i.e. it is considered
more like reading. That said, I think the current notification
infrastructure for cgroup attributes is really awful, so I am not too
keen to support that right away.

Lennart
--
Lennart Poettering - Red Hat, Inc.