Discussion:
dbus interface for disabling default cgroups
Daniel Poelzleithner
2011-02-10 11:44:12 UTC
Hi,

I just released ulatencyd[1] 0.4.5 which now works nicely under systemd
under two conditions:

* DefaultControllers should be unset
* pam_systemd should also get an empty controllers=.

systemd currently does not seem to have a D-Bus interface that allows
changing the DefaultControllers, so I suggest something like this:

org.freedesktop.systemd1.Manager.ChangeController(string subsystem,
boolean enable, boolean clear)

- subsystem is the subsystem name.
- enable enables or disables the subsystem, so a daemon that exits can
give control back to systemd.
- clear causes the tree to be flushed to an empty state.
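The proposed method could be described in D-Bus introspection terms roughly like this (a sketch only: the Manager interface name is systemd's real one, but ChangeController and its argument signature are just the proposal above, not an existing API):

```xml
<!-- Hypothetical sketch of the proposed method. -->
<interface name="org.freedesktop.systemd1.Manager">
  <method name="ChangeController">
    <arg name="subsystem" type="s" direction="in"/>
    <arg name="enable" type="b" direction="in"/>
    <arg name="clear" type="b" direction="in"/>
  </method>
</interface>
```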


About pam_systemd I'm not so sure; I guess it will need adjustment, as
it does not, and should not, talk to D-Bus. Maybe it could read a file
like /var/run/systemd/pam_disabled_controllers which could simply be
changed by the started daemon and should be cleaned up on exit.
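A minimal sketch of the file-based handshake suggested above, assuming the file simply lists one controller name per line (the path comes from this mail; the format and function names are illustrative, not an actual pam_systemd mechanism):

```python
import os

def write_disabled_controllers(path, controllers):
    """Daemon side: record the controllers it has taken over, one per line."""
    with open(path, "w") as f:
        f.write("\n".join(controllers) + "\n")

def read_disabled_controllers(path):
    """pam_systemd side: controllers it should leave alone; empty if no daemon runs."""
    try:
        with open(path) as f:
            return [line.strip() for line in f if line.strip()]
    except FileNotFoundError:
        return []

def cleanup(path):
    """Daemon exit: remove the file so pam_systemd resumes normal behaviour."""
    try:
        os.remove(path)
    except FileNotFoundError:
        pass
```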

What do you think?


PS:
The service file is not in the current release; it will be in the next one and is here:
https://github.com/poelzi/ulatencyd/blob/master/conf/ulatencyd.service

kind regards
Daniel
Lennart Poettering
2011-02-13 21:36:53 UTC
Post by Daniel Poelzleithner
Hi,
heya,
Post by Daniel Poelzleithner
I just released ulatencyd[1] 0.4.5 which now works nicely under systemd
* DefaultControllers should be unset
* pam_systemd should also get an empty controllers=.
systemd seems currently not to have a dbus interface which allows to
change the DefaultControllers, so I suggest something like this
I think disabling this should be left to the admin and not be done
programmatically.

Note that pam_systemd in git now explicitly resets the "cpu" cgroup of
all sessions to the root group since otherwise RT scheduling will not be
available for any user daemons. This is a general limitation of the
"cpu" scheduler right now, and makes it impractical to muck around with
it for generic sessions.
Post by Daniel Poelzleithner
org.freedesktop.systemd1.Manager.ChangeController(string subsystem,
boolean enable, boolean clear)
Nah, if at all then the controller property should just become
writable. But to be frank I am not sure we actually want to make this
changeable at runtime.

Lennart
--
Lennart Poettering - Red Hat, Inc.
Daniel Poelzleithner
2011-02-13 23:38:23 UTC
Post by Lennart Poettering
Note that pam_systemd in git now explicitly resets the "cpu" cgroup of
all sessions to the root group since otherwise RT scheduling will not be
available for any user daemons. This is a general limitation of the
"cpu" scheduler right now, and makes it impractical to muck around with
it for generic sessions.
I stumbled on this one too, but fixed it by giving most cgroups an
rt_sched_slice of 1. This way they can be set to RT, and the process is
then moved to a group with a larger RT slice if there is a rule to move it there.
Post by Lennart Poettering
Post by Daniel Poelzleithner
org.freedesktop.systemd1.Manager.ChangeController(string subsystem,
boolean enable, boolean clear)
Nah, if at all then the controller property should just become
writable. But to be frank I am not sure we actually want to make this
changeable at runtime.
If the admin decides he wants to run a program that optimizes the system
better, he has already made this decision.
If the user is required to edit a config file so one service can run, he
doesn't have the freedom to run it whenever it fits best.
Another, maybe more important, reason is that most optimisations apply
to a running, not a starting, system. If the config must be changed,
ulatencyd for example would have to be started very early, where it may
do a poor job. It optimises extremely well when the user is using the
computer, not while booting it. If the property can be changed at
runtime, ulatencyd could simply be started when most stuff is up; that
would be perfect.


kind regards
daniel
Lennart Poettering
2011-02-14 09:42:59 UTC
Post by Daniel Poelzleithner
Post by Lennart Poettering
Note that pam_systemd in git now explicitly resets the "cpu" cgroup of
all sessions to the root group since otherwise RT scheduling will not be
available for any user daemons. This is a general limitation of the
"cpu" scheduler right now, and makes it impractical to muck around with
it for generic sessions.
I stumbled on this one too, but got it fixed by giving the most cgroups
a rt_sched_slice of 1. this way they can be set to rt and then, move the
process to a group with more rt slice if there is a rule to move it there.
I am not sure what "rt_sched_slice" is supposed to be, but in case you
are referring to cpu.rt_runtime_us:

That's not a fix. That's a hack.

The time slices you hand out need to add up to less than the period, and
you break assumptions in the programs with that. If you limit the RT
runtime to 1us per 1s then this is barely enough time to do almost
anything. Usually, if apps ask for RT they actually want to do
non-trivial work in that timeslice too... The whole point of RT sched is
to be able to monopolize the CPU until the app gives it up
voluntarily. Mucking with cpu.rt_runtime_us interferes with that and
allows installing upper limits on the RT time the apps get, in a "safety
net" sense, to ensure that apps don't monopolize the CPU for too
long. But if you unconditionally just set this value to the smallest
value possible then your upper limit basically makes RT entirely
useless.
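The budget constraint described here can be sketched numerically (illustrative Python; 950000us per 1s period is the kernel's default root RT allowance, i.e. 95%):

```python
def rt_budget_ok(period_us, runtimes_us):
    """The RT runtimes handed out to the groups must fit within the period,
    or the kernel will refuse the configuration."""
    return sum(runtimes_us) <= period_us

PERIOD_US = 1_000_000  # default cpu.rt_period_us: one second
```

A thousand 1us slices fit trivially but are useless for real work, while one realistic audio budget already consumes almost the whole period:

```python
assert rt_budget_ok(PERIOD_US, [1] * 1000)            # fits, but uselessly small
assert rt_budget_ok(PERIOD_US, [950_000])             # one realistic RT budget
assert not rt_budget_ok(PERIOD_US, [950_000, 100_000])  # no room for a second one
```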
Post by Daniel Poelzleithner
Post by Lennart Poettering
Post by Daniel Poelzleithner
org.freedesktop.systemd1.Manager.ChangeController(string subsystem,
boolean enable, boolean clear)
Nah, if at all then the controller property should just become
writable. But to be frank I am not sure we actually want to make this
changeable at runtime.
If the admin decides he wants to run a program that optimizes the system
better, he already made this decision.
Well, if he doesn't want systemd to muck with the "cpu" controller, then
he can easily disable that in a config file.

Lennart
--
Lennart Poettering - Red Hat, Inc.
Daniel Poelzleithner
2011-02-14 10:41:21 UTC
Post by Lennart Poettering
Post by Daniel Poelzleithner
I stumbled on this one too, but got it fixed by giving the most cgroups
a rt_sched_slice of 1. this way they can be set to rt and then, move the
process to a group with more rt slice if there is a rule to move it there.
I am not sure what "rt_sched_slice" is supposed to be, but in case you
That's not a fix. That's a hack.
Hell yeah. I have hack rules to fix the bad behaviour of GNOME and KDE
because they do not call setpgid on started tasks as they are supposed
to...
Optimising a system seems like one big workaround ;-)
Post by Lennart Poettering
The time slices need to add up and you break assumptions in the programs
with that. The us you hand out need to add up. And if you limit the RT
runtime to 1us per 1s then this is barely enough time to do almost
anything. Usually, if apps ask for RT they actually want to do
non-trivial work in that timeslice too... The whole point of RT sched is
too be able to monopolize the CPU until the app gives it up
voluntarily. Mucking with cpu.rt_runtime_us intreferes with that and
allows to install upper limits to the RT time the apps get, in a "safety
net" sense, to ensure that apps don't monopolize the CPU for too
long. But if you unconditionally just set this value to the smallest
value possible then your upper limit basically makes RT entirely
useless.
Of course 1ms is useless. That's the point. What processes does a normal
user run that need realtime scheduling? I know exactly two: pulseaudio and
jackd. For both, rules exist that move the process to a group with higher
cpu.rt_runtime_us, enough that they should work properly but cannot
bring down your system. It's enough to start, but not enough to use.
Maybe I will add a rule that matches RT tasks and moves them to a special
group; I will think about that.

If you are doing high-end audio, for example, the default desktop
configuration will not suit you, so you simply need to switch to a
configuration that fits better. That's one root D-Bus call away, or with
the new GUI, one click and a password. It is simply not possible to
configure a system that fits all workloads.
Post by Lennart Poettering
Well, if he doesn't want systemd to much with the "cpu" controller, then
he can easily disable that in a config file.
Then the package install scripts will have to change that...


Daniel
Lennart Poettering
2011-02-14 10:54:27 UTC
Post by Daniel Poelzleithner
Post by Lennart Poettering
Post by Daniel Poelzleithner
I stumbled on this one too, but got it fixed by giving the most cgroups
a rt_sched_slice of 1. this way they can be set to rt and then, move the
process to a group with more rt slice if there is a rule to move it there.
I am not sure what "rt_sched_slice" is supposed to be, but in case you
That's not a fix. That's a hack.
Hell yeah. I got hack rules to fix the bad behaviour of gnome and kde
because they do not set the setpgid on started task as they are supposed
to...
Optimising a system seems like one big workaround ;-)
I am not sure what you mean by "do not set the setpgid"? Do you want
gnome-session to become its own session or the desktop services
themselves?
Post by Daniel Poelzleithner
Post by Lennart Poettering
The time slices need to add up and you break assumptions in the programs
with that. The us you hand out need to add up. And if you limit the RT
runtime to 1us per 1s then this is barely enough time to do almost
anything. Usually, if apps ask for RT they actually want to do
non-trivial work in that timeslice too... The whole point of RT sched is
too be able to monopolize the CPU until the app gives it up
voluntarily. Mucking with cpu.rt_runtime_us intreferes with that and
allows to install upper limits to the RT time the apps get, in a "safety
net" sense, to ensure that apps don't monopolize the CPU for too
long. But if you unconditionally just set this value to the smallest
value possible then your upper limit basically makes RT entirely
useless.
Of course 1ms is useless. Thats the point. What processes do a normal
So, are you setting things to 1us, or 1ms?
Post by Daniel Poelzleithner
user run that need realtime rt. I know exactly 2, thats pulseaudio and
jackd. For both exist rules that move the process to a group with higher
cpu.rt_runtime_us, enough that they should work properly but can not
bring down your system. It's enough to start, but not enough to use.
Maybe I will add a rule that matches rt tasks and move them to a special
group, I will think about that.
Well, ideally your entire pipeline should be RT if you do audio. For
example, all Jack clients should have an RT thread. And that's already
quite a few programs.
Post by Daniel Poelzleithner
If you are doing highend audio for example, the default desktop
configuration will not suite you, so you need simply to switch to an
configuration that fits better. Thats one root dbus call away, or with
the new gui one click and password. It is simply not possible to
configure a system that will fit all workloads.
Well, I don't buy that. I am working on something that is equally
suitable for all uses. I don't think that kind of scalability is
impossible. The Linux kernel itself has already shown that it scales
equally well to supercomputers and embedded devices.

The need for configuration is a bad thing, not a good thing. Where we
can we should create systems that just work.
Post by Daniel Poelzleithner
Post by Lennart Poettering
Well, if he doesn't want systemd to much with the "cpu" controller, then
he can easily disable that in a config file.
Then the package install scripts will have to change that...
Thankfully most distributions don't allow mucking around with other
packages' configurations from the install scripts of a different
package.

Lennart
--
Lennart Poettering - Red Hat, Inc.
Daniel Poelzleithner
2011-02-14 12:24:39 UTC
Post by Lennart Poettering
I am not sure what you mean by "do not set the setpgid"? Do you want
gnome-session to become its own session or the desktop services
themselves?
gnome-panel for example should do a setpgid on the programs it starts, as
they are logically a new process group that has nothing to do with the UI
itself. gnome-session is a little different: some tasks should get
a new group, like the processes started from autostart, but others
logically belong to the session.

https://bugzilla.gnome.org/show_bug.cgi?id=641387
Post by Lennart Poettering
So, are you setting things to 1us, or 1ms?
1us; maybe I will increase it to 1ms, I will think about that. If the
process gets into a realtime group it of course gets a lot more.
Post by Lennart Poettering
Well, ideally your entire pipeline should be RT if you do audio. For
example, all Jack clients should have an RT thread. And that's already
quite a few programs.
I will look into that and write a helper rule that marks all processes
from a Jack graph.
Post by Lennart Poettering
Well, I don't buy that. I am working on something that is equally well
suitable for all uses. I don't think that scalability in your solutions
is impossible. The Linux kernel itself has already shown that it scales
equally well to supercomputers and embedded devices.
The scheduler scales extremely well, no objection there. The point
is simply that thought needs to be given to what's important to a user and
what's not. If I do a make -j whatever in a console, I don't want my
desktop UI to become sluggish just because the scheduler is giving all
processes the same CPU shares. There are, and always will be, processes
that are more important to the user. The program currently having focus is
more important than some random background task. Even assuming that all
running programs behave nicely is just unrealistic. When I run some very
CPU-bound tasks, I still want my video to run without glitches. If
someone compiles a new kernel and wants to play a newer game while it's
running, the priorities are clear: everything to the stuff necessary for
the game, and whatever is left goes to the compiler. That's what the user
expects and should get.
I strongly doubt that a solution for this works without lots of heuristics.
Post by Lennart Poettering
The need for configuration is a bad thing, not a good thing. Where we
can we should create systems that just work.
I really think that "one configuration fits all" will just not work
until proven otherwise :-)
It may work, but it will just not be perfect, and for perfection you need
configurations for at least the different workloads.
Post by Lennart Poettering
Thankfully most distributions don't allow mucking around with other
package's configurations from the install scripts of a different
package.
Yes, and that's a good thing. There is a reason why I wanted this D-Bus
call/property set: with it, there would be no need to change the
configuration when some program starts that will handle that. If a
program has root, it will be able to manipulate the cgroup tree anyway,
but when init always moves processes there first, it's just an annoying
clash. Nothing gained for anyone.

If you accept a patch making DefaultGroups writable, I will write
one, even adding a config variable to disable writability.

kind regards
daniel
Andrey Borzenkov
2011-02-14 12:40:14 UTC
Post by Daniel Poelzleithner
If you accept a patch making the DefaultGroups writeable, I will write
one. Even making a config variable to disable writeable.
If you could start by adding a framework for setting properties via
D-Bus, it would be quite helpful - there are some other (systemd-global)
properties that are definitely worth changing from ro to rw,
like the log level or log destination.
Lennart Poettering
2011-02-14 16:04:39 UTC
Post by Daniel Poelzleithner
Post by Lennart Poettering
I am not sure what you mean by "do not set the setpgid"? Do you want
gnome-session to become its own session or the desktop services
themselves?
gnome-panel for example should do a setpgid on the programs it starts as
they are logical new process group, that have nothing to do with ui
itself. gnome-session is a little bit different, some tasks should have
a new group, like the processes started from autostart, but others
belong to logically to the session.
https://bugzilla.gnome.org/show_bug.cgi?id=641387
I am not sure I actually buy that. I kinda like the fact that sending a
signal to the gnome-shell process group delivers a signal to all
processes it spawned.

That said, as soon as systemd takes over session management we actually go
much further: systemd services not only run in their own process group,
but also in their own session.
Post by Daniel Poelzleithner
Post by Lennart Poettering
So, are you setting things to 1us, or 1ms?
1us, maybe i will increase it to 1ms. I will think about that. If the
process gets into a realtime group he gets of course a lot more.
Well, this remains a hack, and the higher you raise that value the more
obvious it becomes: since all your slices summed up need to be < your
period length, you limit how many services/apps you can start. Since the
period length by default is 1s, setting the runtime length to 1ms means
you enforce a hard limit of 1000 groups controlled by you.
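The arithmetic behind that hard limit, as a quick sketch (the 1s default period is the value cited in this mail):

```python
PERIOD_US = 1_000_000  # default cpu.rt_period_us: one second

def max_groups(runtime_us, period_us=PERIOD_US):
    """How many groups can each be given `runtime_us` before the period is exhausted."""
    return period_us // runtime_us
```

With 1ms slices at most 1000 groups fit in the period; 1us slices allow a million groups, but each slice is far too small to do real work.

```python
assert max_groups(1_000) == 1000
assert max_groups(1) == 1_000_000
```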

In short: this doesn't work. Fix the kernel. Don't try to work around
broken kernel APIs in userspace.
Post by Daniel Poelzleithner
Post by Lennart Poettering
Well, ideally your entire pipeline should be RT if you do audio. For
example, all Jack clients should have an RT thread. And that's already
quite a few programs.
I will look into that, writing a helper rule that marks all processes
from a jack graph.
Uh, this sounds wrong. You apply "rules" asynchronously to processes?
How do you do that? With cn_proc? That is ugly and broken. It wasn't right
in Upstart and is still broken outside of Upstart.
Post by Daniel Poelzleithner
Post by Lennart Poettering
Well, I don't buy that. I am working on something that is equally well
suitable for all uses. I don't think that scalability in your solutions
is impossible. The Linux kernel itself has already shown that it scales
equally well to supercomputers and embedded devices.
The scheduler is scaling extremely well, no objection there. The point
is simply that it needs to be thought what's important to an user and
what's not. If I do a make -j whatever in an console, I don't want my
desktop ui to become slugish just because the scheduler is giving all
processes the same cpu shares. There are and ever will be processes that
are more important to the user. The program currently having focus is
more important then some random background task. Even assuming all
programs running behave nicely is just unworldly. When I run some very
cpu bound tasks, I still want my video to run without glitches. If
someone compiles a new kernel and wants to play a newer game while it's
running the priorities are clear: everything to the stuff necessary for
the game and whats left goes to the compiler. Thats what the user
expects and should get.
I strongly doubt that a solution for this works without lots of heuristics.
Well, I disagree.
Post by Daniel Poelzleithner
Post by Lennart Poettering
The need for configuration is a bad thing, not a good thing. Where we
can we should create systems that just work.
I really think that "one configuration fits all" will just not work
until proven otherwise :-)
I believe we can get very, very far without any heuristics and configured
rules. So far you haven't listed any valid use case where we couldn't come
up with a generic logic that would work for everybody.
Post by Daniel Poelzleithner
It may work, but it will just not be perfect, and for perfect you need
configurations for at least different workloads.
Well, since you advocate workarounds your solutions are by definition
imperfect. I think reaching perfection goes by fixing problems where
they are...
Post by Daniel Poelzleithner
Post by Lennart Poettering
Thankfully most distributions don't allow mucking around with other
package's configurations from the install scripts of a different
package.
Yes and thats a good thing. There is a reason why I wanted this dbus
call/property set. With it, there would be no need to change the
configuration when some program starts that will handle that. If a
program has root, he will be able to manipulate the cgroup tree anyway,
but when init always moves processes there first, it's just an annoying
clash. Nothing gained for anyone.
If you accept a patch making the DefaultGroups writeable, I will write
one. Even making a config variable to disable writeable.
Well, I don't agree with your reasons, but yes, I would accept a patch
that makes some properties of the Manager object writable, including
DefaultGroups. As Andrey pointed out, making the log level
configurable at runtime with this would be a very good thing. If
that by side effect makes you happy, then great!

Lennart
--
Lennart Poettering - Red Hat, Inc.
Daniel Poelzleithner
2011-02-14 17:12:03 UTC
Post by Lennart Poettering
I am not sure I actually buy that. I kinda like the fact that sending a
signal to the gnome-shell process group delivers a signal to all
processes it spawned.
But every program that execs another process may call setpgid on its
child. Most shells, for example, do that, and console apps as well. Some
processes started by gnome-session run setsid and end up with different
task groups anyway, so you will never catch them all.
Post by Lennart Poettering
That said as soon as systemd takes over session management we actualy go
much further: systemd services not only run in their own process group,
but also in their own session.
That's perfect, like it should be :-)
Post by Lennart Poettering
Well, this remains a hack, and the higher your raise that value the more
obvious it becomes: since all your slices summed up need to be < your
period length you limit how many services/apps you can start. Since the
period length by default is 1s, sending the runtime length to 1ms means
you enforce a hard limit of 1000 groups controlled by you.
In short: this doesn't work. Fix the kernel. Don't try to work around
broken kernel APIs in userspace.
As long as it's not fixed, I don't see any other solution :-)
But I'm thinking about moving all RT processes into one group; that will
be enough, I think.
Post by Lennart Poettering
Uh, this sounds wrong. You apply "rules" asynchronously on processes?
How do you do that? With cn_proc? This is ugly and broken. Wasn't right
in Upstart and is still broken outside of Upstart.
I use cn_proc for notifications of new processes and exits. I don't use
it for tracking where a process originated from, so it works quite nicely
for this case. Using cgroups to find the origin is of course a much
better solution, and cn_proc will not work there. But as I currently
don't care about origins, and in fact only schedule a process when it is
at least one second old, the cn_proc interface is much better suited than
cgroups.

Rules only flag processes to help the scheduler make the best
decisions. How the flags are weighted depends on the scheduler
configuration. New processes are scheduled after a delay of 1 second in
the default configuration, as many processes die very quickly with
absolutely no gain in running them through the pipeline. So, most
processes started will just stay in the cgroup of their parent, and if
they stay longer, they will be optimized.
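The one-second delay policy described above could be sketched like this (the function and data shapes are illustrative, not ulatencyd's actual code):

```python
import time

SCHEDULE_DELAY = 1.0  # seconds a process must survive before being scheduled

def due_for_scheduling(processes, now=None):
    """Return pids of processes old enough to run through the rule pipeline.

    `processes` is an iterable of (pid, start_time) pairs; short-lived
    processes simply die in their parent's cgroup and are never touched.
    """
    now = time.time() if now is None else now
    return [pid for pid, started in processes
            if now - started >= SCHEDULE_DELAY]
```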
Post by Lennart Poettering
Post by Daniel Poelzleithner
Post by Lennart Poettering
The need for configuration is a bad thing, not a good thing. Where we
can we should create systems that just work.
I really think that "one configuration fits all" will just not work
until proven otherwise :-)
I believe we can get very very far without any heuristics and configured
rules. So far you haven't listed any valid usecase where we couldnt come
up with a generic logic that would work for everybody.
Example: I use the process group as the default grouping value, which
works extremely well. Processes that belong together are usually in one
group, so one group of processes can't overload the system.
gnome-panel & KDE don't set the process group, so I need a rule for
that. With the current configuration I can run a make -j 40 on the Linux
kernel on my dual core and do not notice any slowdown; it's still smooth
everywhere.
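The grouping-by-process-group default described above might be sketched as (illustrative only, not ulatencyd's actual implementation; the sample pids are made up):

```python
from collections import defaultdict

def group_by_pgrp(tasks):
    """Cluster tasks by process group id, the default grouping value:
    processes that belong together share one group, so a single group
    cannot overload the whole system."""
    groups = defaultdict(list)
    for pid, pgid in tasks:
        groups[pgid].append(pid)
    return dict(groups)
```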

Another example is the beloved swap of death. vlc leaks on some movies,
and in the middle of the movie it gets stuck; I look down and see only the
hard disk going mad. Sure, I can press Ctrl+Alt+F1, try to log in, and
after like 10 minutes I will be able to kill it. Even worse: running a
group of programs that spawn many small tasks, each allocating a
little bit of memory. No OOM killer will save you from that; instead it
will just kill some large important process like X :-)
And the desktop becomes very bad to use, as all the important stuff is
swapped out.
The current heuristics in ulatencyd protect me in both cases.

And again, of course it's better when everything is behaving well, but
if not, programs should never be able to bring down a system, or even make
an impact so large that the desktop no longer feels right.
Post by Lennart Poettering
Well, since you advocate workarounds your solutions are by definition
imperfect. I think reaching perfection goes by fixing problems where
they are...
The implementation may be imperfect, not the result. If I have to
implement a workaround for a badly behaving userspace program, that's fine
by me, but seeing the program get fixed is of course the better solution.
Post by Lennart Poettering
Well, i don't agree with your reasons, but yes, I would accept a patch
that makes some properties of the Manager object writable, including
defaultgroups.
okily dokily :-)

kind regards
Daniel
Lennart Poettering
2011-02-14 17:20:49 UTC
Post by Daniel Poelzleithner
Post by Lennart Poettering
I am not sure I actually buy that. I kinda like the fact that sending a
signal to the gnome-shell process group delivers a signal to all
processes it spawned.
But every program that execs another process may call setgrp on it's
child. Most shells for example do that, and console apps as well. Some
processes started by gnome-session run setsid and end up with different
task groups anyway, so you will never catch all.
Yeah, the chaos of what the various processes started by g-s do to
detach from g-s is awful. We hope to clean this up a little with
PR_SET_ANCHOR.
Post by Daniel Poelzleithner
Post by Lennart Poettering
Well, this remains a hack, and the higher your raise that value the more
obvious it becomes: since all your slices summed up need to be < your
period length you limit how many services/apps you can start. Since the
period length by default is 1s, sending the runtime length to 1ms means
you enforce a hard limit of 1000 groups controlled by you.
In short: this doesn't work. Fix the kernel. Don't try to work around
broken kernel APIs in userspace.
As long as it's not fixed, I don't see any other solution :-)
Well the solution is to fix it then.
Post by Daniel Poelzleithner
Post by Lennart Poettering
Uh, this sounds wrong. You apply "rules" asynchronously on processes?
How do you do that? With cn_proc? This is ugly and broken. Wasn't right
in Upstart and is still broken outside of Upstart.
I use cn_proc as notifications for new processes and exits. I don't use
it for tracking where a process originated from, so it works quite nice
for this case. Using cgroups to find the origin is of course a much
better solution and cn_proc will not work there. But as I currently
don't care about origins, and in fact only schedule it when a process is
at least one second old, the cn_proc interface is much more suited then
cgroups.
But that means you apply your cgroup magic only a posteriori. That is by
definition unsafe and unreliable. Using async interfaces for stuff like
this is always broken.
Post by Daniel Poelzleithner
Post by Lennart Poettering
Post by Daniel Poelzleithner
Post by Lennart Poettering
The need for configuration is a bad thing, not a good thing. Where we
can we should create systems that just work.
I really think that "one configuration fits all" will just not work
until proven otherwise :-)
I believe we can get very very far without any heuristics and configured
rules. So far you haven't listed any valid usecase where we couldnt come
up with a generic logic that would work for everybody.
Example: i use the process group as the default grouping value which
works extremely good. Processes that belong together are usually in one
group, so one group of processes can't overload the system.
gnome-panel & kde don't set the process group, so i need a rule for
that. With the current configuration i can run a make -j 40 on the linux
kernel on my dual core and do not notice any slowdown, it's still smooth
everywhere.
Well, my famous shell fragment already kinda did that. As well as the
autogrouping patch that got merged.

Lennart
--
Lennart Poettering - Red Hat, Inc.
Daniel Poelzleithner
2011-02-14 17:59:09 UTC
Post by Lennart Poettering
Well, my famous shell fragment already kinda did that. As well as the
autogrouping patch that got merged.
The autogrouping is broken, in my opinion. The session id is not
distributed well enough for good grouping. When I click on a browser
icon and the browser ends up in the same group as the UI starting it, it
is broken.
"ps xaw -O session,pgrp" shows that all tasks not started by a shell
(which does the right thing) end up in the wrong group.
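The session/pgrp mismatch described here can be illustrated with a toy check over (pid, session, pgrp, command) rows like those ps prints (the sample data and the function are hypothetical):

```python
def suspicious_autogroups(rows):
    """Flag tasks that still sit in their session leader's process group:
    with session-id-based autogrouping they land in the same group as the
    UI that launched them, which is the wrong grouping."""
    return [cmd for pid, session, pgrp, cmd in rows
            if pgrp == session and pid != session]
```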

And don't get me started on what else can be done better: per-group
swappiness for example, IO, ...

kind regards
daniel
