Discussion:
Possible race condition for setting cgroup sticky bit
Anders Olofsson
2013-03-26 12:43:18 UTC
Permalink
I'm seeing a problem with a service sometimes failing to start due to a missing cgroup.
After some debugging I've made the following observations:

After exec_spawn() forks, the child will set the sticky bit for the cgroup (in cg_set_task_access) but sometimes, the cgroup is missing (lstat returns "No such file or directory").

The cgroup is always created, but the main process will call cg_trim (from cgroup_bonding_trim <- cgroup_bonding_trim_list <- cgroup_notify_empty <- private_bus_message_filter ...) which will remove the cgroup if the sticky bit isn't set.

This seems to be a race condition.
If the child sets the sticky bit first, the parent will leave the cgroup alone. But if the main process gets to cg_trim first, the cgroup is removed and the child fails.

We're using systemd 197. I've tried using 198, but there the child dies with SIGSEGV, which makes it harder to debug what's happening.
The problem appeared when we switched from Linux 3.4 to 3.7, but since this looks like a race in systemd, I'm not sure whether our local kernel tree is to blame or whether the version bump just changed the timing enough to trigger the race in systemd.

Since I'm not familiar with the systemd internals and cgroups I would appreciate some help to resolve this.

I can reproduce this pretty easily, usually within 5-10 boots. It's always the same service that fails, and the services before it never fail.

/Anders
Anders Olofsson
2013-03-27 12:58:18 UTC
Permalink
I just tested it with systemd 199 and the problem still occurs.

However, it now fails with "Failed at step CGROUP spawning /etc/init.d/rc: No such file or directory", just like in 197, and not with a segfault as I saw (at least sometimes) with 198.

/Anders
_______________________________________________
systemd-devel mailing list
http://lists.freedesktop.org/mailman/listinfo/systemd-devel
Anders Olofsson
2013-04-03 14:33:26 UTC
Permalink
I would really appreciate some help with this from someone who's familiar with the systemd internals.

What mechanism is supposed to prevent cg_trim from removing a cgroup before the newly forked child has completed cg_set_task_access?

I've created bug 63080 for this as well.

/Anders
Lennart Poettering
2013-04-05 19:11:22 UTC
Permalink
On Tue, 26.03.13 13:43, Anders Olofsson (***@axis.com) wrote:

heya, sorry for the delay.
Post by Anders Olofsson
I'm seeing a problem with a service sometimes failing to start due to a missing cgroup.
After exec_spawn() forks, the child will set the sticky bit for the
cgroup (in cg_set_task_access) but sometimes, the cgroup is missing
(lstat returns "No such file or directory").
The cgroup is always created, but the main process will call cg_trim
(from cgroup_bonding_trim <- cgroup_bonding_trim_list <-
cgroup_notify_empty <- private_bus_message_filter ...) which will
remove the cgroup if the sticky bit isn't set.
Hmm, cg_trim() will ignore groups with the sticky bit set, and the
kernel won't allow us to remove groups that currently have a process in
them.

The code dealing with forked-off service processes in execute.c looks
like this: after forking, we first create a group, then add ourselves to
it, and then set the sticky bit on it. Now, there's a tiny window of
opportunity (and we should fix it...) where cg_trim from PID 1 could run
in between creating the group and adding ourselves to it. But normally,
if that fails then the execution of the service should be aborted right
away. But that's not what you are seeing?

I will now add some code which avoids the race I pointed out, but I am
not sure that's the same one that you are actually encountering...
Post by Anders Olofsson
This seems to be a race condition. If the child sets the sticky bit
first, the parent will leave the cgroup alone. But if the main process
gets to cg_trim first, the cgroup is removed and the child fails.
We're using systemd 197. I've tried using 198, but there the child
dies with SIGSEGV, which makes it harder to debug what's happening. The
problem appeared when we switched from Linux 3.4 to 3.7, but since this
looks like a race in systemd, I'm not sure whether our local kernel tree
is to blame or whether the version bump just changed the timing enough
to trigger the race in systemd.
Since I'm not familiar with the systemd internals and cgroups I would
appreciate some help to resolve this.
I can reproduce this pretty easily, usually within 5-10 boots. It's
always the same service that fails, and the services before it never
fail.
A temporary work-around could be to precreate the cgroup dir early and
set the sticky bit on it, so that systemd won't kill it ever...

Lennart
--
Lennart Poettering - Red Hat, Inc.
Anders Olofsson
2013-04-05 20:04:48 UTC
Permalink
Post by Lennart Poettering
Post by Anders Olofsson
I'm seeing a problem with a service sometimes failing to start due to a
missing cgroup.
Post by Anders Olofsson
After exec_spawn() forks, the child will set the sticky bit for the
cgroup (in cg_set_task_access) but sometimes, the cgroup is missing
(lstat returns "No such file or directory").
The cgroup is always created, but the main process will call cg_trim
(from cgroup_bonding_trim <- cgroup_bonding_trim_list <-
cgroup_notify_empty <- private_bus_message_filter ...) which will
remove the cgroup if the sticky bit isn't set.
Hmm, cg_trim() will ignore groups with the sticky bit set, and the
kernel won't allow us to remove groups that currently have a process in
them.
I've dumped data from cg_trim and the sticky bit is not set when this occurs. In fact, the state of the sticky bit as seen by cg_trim seems to be the major difference between a proper boot and a broken one.
Post by Lennart Poettering
The code dealing with forked-off service processes in execute.c looks
like this: after forking, we first create a group, then add ourselves to
it, and then set the sticky bit on it. Now, there's a tiny window of
opportunity (and we should fix it...) where cg_trim from PID 1 could run
in between creating the group and adding ourselves to it. But normally,
if that fails then the execution of the service should be aborted right
away. But that's not what you are seeing?
I will now add some code which avoids the race I pointed out, but I am
not sure that's the same one that you are actually encountering...
The cgroup that fails is named after the service. But the service is configured to use the same cgroup as several other services (ControlGroup= is set in the service file).
In this setup, is the child created in the default cgroup and then moved to the configured one? Otherwise, why does the default-named cgroup exist at all and get handled?

I've noticed that cgroups always exist for all services, regardless of whether they are overridden to use another one.
Post by Lennart Poettering
A temporary work-around could be to precreate the cgroup dir early and
set the sticky bit on it, so that systemd won't kill it ever...
Thanks, I'll try that. I guess it can't be added to ExecStartPre of the same service though without risking the same problem.

/Anders
Lennart Poettering
2013-04-08 12:18:43 UTC
Permalink
Post by Anders Olofsson
Post by Lennart Poettering
Post by Anders Olofsson
I'm seeing a problem with a service sometimes failing to start due to a
missing cgroup.
Post by Anders Olofsson
After exec_spawn() forks, the child will set the sticky bit for the
cgroup (in cg_set_task_access) but sometimes, the cgroup is missing
(lstat returns "No such file or directory").
The cgroup is always created, but the main process will call cg_trim
(from cgroup_bonding_trim <- cgroup_bonding_trim_list <-
cgroup_notify_empty <- private_bus_message_filter ...) which will
remove the cgroup if the sticky bit isn't set.
Hmm, cg_trim() will ignore groups with the sticky bit set, and the
kernel won't allow us to remove groups that currently have a process in
them.
I've dumped data from cg_trim and the sticky bit is not set when this
occurs. In fact, the state of the sticky bit as seen by cg_trim seems
to be the major difference between a proper boot and a broken one.
Well, but as long as there is a process in the group, the kernel should
already refuse deletion of the group. The sticky bit is hence useful
only for *empty* cgroups, which is what I don't grok here... In your
case the child should have created the group and made itself a member of
it immediately (with a tiny window in between where the group could be
removed, but that should result in immediate total failure of the
fork, not just a missing cgroup).
Post by Anders Olofsson
Post by Lennart Poettering
The code dealing with forked-off service processes in execute.c looks
like this: after forking, we first create a group, then add ourselves to
it, and then set the sticky bit on it. Now, there's a tiny window of
opportunity (and we should fix it...) where cg_trim from PID 1 could run
in between creating the group and adding ourselves to it. But normally,
if that fails then the execution of the service should be aborted right
away. But that's not what you are seeing?
I will now add some code which avoids the race I pointed out, but I am
not sure that's the same one that you are actually encountering...
The cgroup that fails is named after the service. But the service is
configured to use the same cgroup as several other services
(ControlGroup= is set in the service file). In this setup, is the
child created in the default cgroup and then moved to the configured
one? Otherwise, why does the default-named cgroup exist at all and get
handled?
No, if you configured a cgroup name then no "default" cgroup naming
is ever attempted.

Hmm, which hierarchy are you talking about, BTW? Note that cgroup
memberships in the various hierarchies are pretty much orthogonal on the
kernel side of things. And systemd allows you that, too.
Post by Anders Olofsson
I've noticed that cgroups always exist for all services,
regardless of whether they are overridden to use another one.
Really? Maybe in different hierarchies?

It would certainly be a bug if systemd ever created a cgroup in the "cpu"
hierarchy that is not the one you configured for the "cpu"
hierarchy.

Any chance you can explain in a bit more detail how your cgroups are set
up and what unit configuration switches you use for that?

Lennart
--
Lennart Poettering - Red Hat, Inc.
Anders Olofsson
2013-04-08 14:57:38 UTC
Permalink
Post by Lennart Poettering
Post by Anders Olofsson
Post by Lennart Poettering
Post by Anders Olofsson
I'm seeing a problem with a service sometimes failing to start due to a
missing cgroup.
Post by Anders Olofsson
After exec_spawn() forks, the child will set the sticky bit for the
cgroup (in cg_set_task_access) but sometimes, the cgroup is missing
(lstat returns "No such file or directory").
The cgroup is always created, but the main process will call cg_trim
(from cgroup_bonding_trim <- cgroup_bonding_trim_list <-
cgroup_notify_empty <- private_bus_message_filter ...) which will
remove the cgroup if the sticky bit isn't set.
Hmm, cg_trim() will ignore groups with the sticky bit set, and the
kernel won't allow us to remove groups that currently have a process in
them.
I've dumped data from cg_trim and the sticky bit is not set when this
occurs. In fact, the state of the sticky bit as seen by cg_trim seems
to be the major difference between a proper boot and a broken one.
Well, but as long as there is a process in the group, the kernel should
already refuse deletion of the group. The sticky bit is hence useful
only for *empty* cgroups, which is what I don't grok here... In your
case the child should have created the group and made itself a member of
it immediately (with a tiny window in between where the group could be
removed, but that should result in immediate total failure of the
fork, not just a missing cgroup).
I've never seen the fork fail; the error message displayed is always "Failed at step CGROUP spawning /etc/init.d/rc: No such file or directory", which comes from the failure in cg_set_task_access.
Post by Lennart Poettering
Post by Anders Olofsson
Post by Lennart Poettering
The code dealing with forked-off service processes in execute.c looks
like this: after forking, we first create a group, then add ourselves to
it, and then set the sticky bit on it. Now, there's a tiny window of
opportunity (and we should fix it...) where cg_trim from PID 1 could run
in between creating the group and adding ourselves to it. But normally,
if that fails then the execution of the service should be aborted right
away. But that's not what you are seeing?
I will now add some code which avoids the race I pointed out, but I am
not sure that's the same one that you are actually encountering...
The cgroup that fails is named after the service. But the service is
configured to use the same cgroup as several other services
(ControlGroup= is set in the service file). In this setup, is the
child created in the default cgroup and then moved to the configured
one? Otherwise, why does the default-named cgroup exist at all and get
handled?
No, if you configured a cgroup name then no "default" cgroup naming
is ever attempted.
Hmm, which hierarchy are you talking about, BTW? Note that cgroup
memberships in the various hierarchies are pretty much orthogonal on the
kernel side of things. And systemd allows you that, too.
Post by Anders Olofsson
I've noticed that cgroups always exist for all services,
regardless of whether they are overridden to use another one.
Really? Maybe in different hierarchies?
It would certainly be a bug if systemd ever creates a cgroup in the "cpu"
hierachy that is not the one you you configured for the "cpu"
hierarchy.
Any chance you can explain in a bit more detail how your cgroups are set
up and what unit configuration switches you use for that?
Ok, let's see if I can explain what we've done here.

To introduce systemd in our system, we've started by just wrapping rc and all the old initscripts, so we can get systemd running first and then start converting to native services afterwards.
The boot is basically two services: legacy_rcS.service (which runs "/etc/init.d/rc S") and legacy_rc3.service (which runs "/etc/init.d/rc 3"). There is also a legacy_rc4.service (wanted by upgrade.target) used for firmware upgrades and similar special system actions.
Journal, udev and syslog run as separate services outside these wrappers, and the idea is to migrate the boot scripts to services a few at a time until the legacy wrappers are empty and can be dropped.

The following is the service file for the runlevel 3 wrapper:
[Unit]
Description=Legacy runlevel 3
Wants=legacy_rcS.service
After=legacy_rcS.service
Conflicts=legacy_rc4.service
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/etc/init.d/rc 3
StandardOutput=tty
Environment=RUNLEVEL=3
Environment=PREVLEVEL=X
ControlGroup=systemd:/system/legacy_rc.service
ControlGroupPersistent=true
KillMode=none

The same cgroup is configured for all the legacy services (rcS, rc3 and rc4).

When looking in sysfs, I see cgroups for all the legacy services, even though the rcS and rc3 services use the configured generic cgroup.
The following is from a working system; when a failure happens, rc and rcS are present but not rc3:
# ls -d /sys/fs/cgroup/systemd/system/legacy*
/sys/fs/cgroup/systemd/system/legacy_rc.service
/sys/fs/cgroup/systemd/system/legacy_rc3.service
/sys/fs/cgroup/systemd/system/legacy_rcS.service

This is what I meant by "cgroups for all services" existing even though the name has been overridden.
Without the ControlGroup= setting, legacy_rc3 and legacy_rcS would have used the cgroups with those names. But since we've specified that we want a different name, I'm wondering why I still see the default names that we don't want to use.

/Anders
Lennart Poettering
2013-04-08 16:51:07 UTC
Permalink
Post by Anders Olofsson
Ok, let's see if I can explain what we've done here.
To introduce systemd in our system, we've started by just wrapping rc and all the old initscripts, so we can get systemd running first and then start converting to native services afterwards.
The boot is basically two services: legacy_rcS.service (which runs "/etc/init.d/rc S") and legacy_rc3.service (which runs "/etc/init.d/rc 3"). There is also a legacy_rc4.service (wanted by upgrade.target) used for firmware upgrades and similar special system actions.
Journal, udev and syslog run as separate services outside these wrappers, and the idea is to migrate the boot scripts to services a few at a time until the legacy wrappers are empty and can be dropped.
[Unit]
Description=Legacy runlevel 3
Wants=legacy_rcS.service
After=legacy_rcS.service
Conflicts=legacy_rc4.service
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/etc/init.d/rc 3
StandardOutput=tty
Environment=RUNLEVEL=3
Environment=PREVLEVEL=X
ControlGroup=systemd:/system/legacy_rc.service
ControlGroupPersistent=true
KillMode=none
The same cgroup is configured for all the legacy services (rcS, rc3 and rc4).
Ah, that's the issue. You can't really do manipulations like that in
systemd's own hierarchy: sticking multiple services into the same cgroup
in the name=systemd hierarchy will break things heavily (and I am
surprised you didn't run into that problem earlier). It is OK to stick
multiple services into the same group in all other hierarchies, but not in
systemd's own. In fact, you shouldn't really fiddle with systemd's private
hierarchy at all. We need it to keep track of our own service state
(i.e. for checking whether a service is still running), we will use it
to kill services and so on, and we take the liberty to remove groups in that
hierarchy as we see fit... and for that we need to keep the groups
separate.

This is actually documented in the man pages:

"It is not recommended to manipulate the service control group path in
the systemd named hierarchy." (see systemd.exec(5) the part about
ControlGroup=)

I have now changed the man page to be a bit stronger here, and say that
you might get undefined behaviour if you change systemd's own hierarchy.
Post by Anders Olofsson
# ls -d /sys/fs/cgroup/systemd/system/legacy*
/sys/fs/cgroup/systemd/system/legacy_rc.service
/sys/fs/cgroup/systemd/system/legacy_rc3.service
/sys/fs/cgroup/systemd/system/legacy_rcS.service
This is what I meant by "cgroups for all services" existing even
though the name has been overridden. Without the ControlGroup= setting,
legacy_rc3 and legacy_rcS would have used the cgroups with those
names. But since we've specified that we want a different name, I'm
wondering why I still see the default names that we don't want to use.
Well, the systemd hierarchy is special. We have special semantics for it,
and you shouldn't alter it. You are free to rearrange cgroups in all
other hierarchies and drop as many services into the same cgroup as you
wish there, but not in systemd's own name=systemd hierarchy.

I hope this makes sense,

Lennart
--
Lennart Poettering - Red Hat, Inc.
Anders Olofsson
2013-04-09 11:20:23 UTC
Permalink
Post by Lennart Poettering
Post by Anders Olofsson
To introduce systemd in our system, we've started by just wrapping rc
and all the old initscripts, so we can get systemd running first and then
start converting to native services afterwards.
Post by Anders Olofsson
The boot is basically two services: legacy_rcS.service (which runs
"/etc/init.d/rc S") and legacy_rc3.service (which runs "/etc/init.d/rc 3").
There is also a legacy_rc4.service (wanted by upgrade.target) used for
firmware upgrades and similar special system actions.
Post by Anders Olofsson
Journal, udev and syslog run as separate services outside these wrappers,
and the idea is to migrate the boot scripts to services a few at a time until
the legacy wrappers are empty and can be dropped.
Ah, that's the issue. You can't really do manipulations like that in
systemd's own hierarchy: sticking multiple services into the same cgroup
in the name=systemd hierarchy will break things heavily (and I am
surprised you didn't run into that problem earlier). It is OK to stick
multiple services into the same group in all other hierarchies, but not in
systemd's own. In fact, you shouldn't really fiddle with systemd's private
hierarchy at all. We need it to keep track of our own service state
(i.e. for checking whether a service is still running), we will use it
to kill services and so on, and we take the liberty to remove groups in that
hierarchy as we see fit... and for that we need to keep the groups
separate.
"It is not recommended to manipulate the service control group path in
the systemd named hierarchy." (see systemd.exec(5) the part about
ControlGroup=)
I have now changed the man page to be a bit stronger here, and say that
you might get undefined behaviour if you change systemd's own hierarchy.
Post by Anders Olofsson
When looking in sysfs, I see cgroups for all the legacy services, even though
the rcS and rc3 services use the configured generic cgroup.
The following is from a working system; when a failure happens, rc and rcS
are present but not rc3:
# ls -d /sys/fs/cgroup/systemd/system/legacy*
/sys/fs/cgroup/systemd/system/legacy_rc.service
/sys/fs/cgroup/systemd/system/legacy_rc3.service
/sys/fs/cgroup/systemd/system/legacy_rcS.service
This is what I meant by "cgroups for all services" existing even
though the name has been overridden. Without the ControlGroup= setting,
legacy_rc3 and legacy_rcS would have used the cgroups with those
names. But since we've specified that we want a different name, I'm
wondering why I still see the default names that we don't want to use.
Well, the systemd hierarchy is special. We have special semantics for it,
and you shouldn't alter it. You are free to rearrange cgroups in all
other hierarchies and drop as many services into the same cgroup as you
wish there, but not in systemd's own name=systemd hierarchy.
I hope this makes sense,
Yes, thank you for your help.

A follow-up question: is there some other way to accomplish what we were trying to do here?

The reason for doing this is that we have a remote shell (e.g. telnet) that runs as a separate service (socket activation), and if a user (or an automated test) logs in and (re)starts a daemon that still belongs to the legacy blob, it will end up running in the telnet group instead of in the legacy group where it is supposed to be. When the client disconnects, the daemon will either be killed, or, if KillMode=none is set, the process will keep running but still belong to the cgroup from an old telnet session.
Is there a way to have the telnet sessions run as part of the legacy group instead?

/Anders
Lennart Poettering
2013-04-09 15:11:09 UTC
Permalink
Post by Anders Olofsson
Post by Lennart Poettering
Well, the systemd hierarchy is special. We have special semantics for it,
and you shouldn't alter it. You are free to rearrange cgroups in all
other hierarchies and drop as many services into the same cgroup as you
wish there, but not in systemd's own name=systemd hierarchy.
I hope this makes sense,
Yes, thank you for your help.
A follow up question. Is there some other way to accomplish what we were trying to do here?
The reason for doing this is that we have a remote shell (e.g. telnet)
that runs as a separate service (socket activation), and if a user (or
an automated test) logs in and (re)starts a daemon that still belongs
to the legacy blob, it will end up running in the telnet group instead
of in the legacy group where it is supposed to be. When the client
disconnects, the daemon will either be killed, or, if KillMode=none is
set, the process will keep running but still belong to the cgroup from
an old telnet session. Is there a way to have the telnet sessions run
as part of the legacy group instead?
Usually pam_systemd + logind are used to make sure every login session
gets its own cgroup...
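The wiring for that is a PAM session module; a sketch, assuming the telnet service authenticates through PAM (the /etc/pam.d/telnet file name is an assumption for this setup):

```
# /etc/pam.d/telnet (illustrative path): register each login with logind,
# which then places the session in its own cgroup under the user tree
# (e.g. /sys/fs/cgroup/systemd/user/...) instead of the service's group.
session  optional  pam_systemd.so
```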

Lennart
--
Lennart Poettering - Red Hat, Inc.