Discussion:
bpfilter blocks root unmount during shutdown
Andrei Borzenkov
2018-09-23 07:38:02 UTC
The dracut /shutdown script first tries to kill all processes still
running off the old root. Unfortunately this fails for the special user
process that runs bpfilter, because it has no reference to /oldroot in
the places that dracut examines in killall_proc_mountpoint():

10:~ # ps -ef | fgrep '[none]'
root 984 2 0 09:46 ? 00:00:00 [none]

/proc/984:
total 0
dr-xr-xr-x 2 root 0 0 Sep 23 10:11 attr
-r-------- 1 root 0 0 Sep 23 10:11 auxv
-r--r--r-- 1 root 0 0 Sep 23 10:11 cgroup
--w------- 1 root 0 0 Sep 23 10:11 clear_refs
-r--r--r-- 1 root 0 0 Sep 23 10:10 cmdline
-rw-r--r-- 1 root 0 0 Sep 23 10:11 comm
-rw-r--r-- 1 root 0 0 Sep 23 10:11 coredump_filter
-r--r--r-- 1 root 0 0 Sep 23 10:11 cpuset
lrwxrwxrwx 1 root 0 0 Sep 23 10:11 cwd -> /
-r-------- 1 root 0 0 Sep 23 10:11 environ
lrwxrwxrwx 1 root 0 0 Sep 23 10:11 exe -> / (deleted)
-rw-r--r-- 1 root 0 0 Sep 23 10:11 fail-nth
dr-x------ 2 root 0 0 Sep 23 10:11 fd
dr-x------ 2 root 0 0 Sep 23 10:11 fdinfo
-rw-r--r-- 1 root 0 0 Sep 23 10:11 gid_map
-r-------- 1 root 0 0 Sep 23 10:11 io
-r--r--r-- 1 root 0 0 Sep 23 10:11 latency
-r--r--r-- 1 root 0 0 Sep 23 10:11 limits
-rw-r--r-- 1 root 0 0 Sep 23 10:11 loginuid
-rw-r--r-- 1 root 0 0 Sep 23 10:11 make-it-fail
dr-x------ 2 root 0 0 Sep 23 10:11 map_files
-r--r--r-- 1 root 0 0 Sep 23 10:10 maps
-rw------- 1 root 0 0 Sep 23 10:11 mem
-r--r--r-- 1 root 0 0 Sep 23 10:11 mountinfo
-r--r--r-- 1 root 0 0 Sep 23 10:11 mounts
-r-------- 1 root 0 0 Sep 23 10:11 mountstats
dr-xr-xr-x 6 root 0 0 Sep 23 10:11 net
dr-x--x--x 2 root 0 0 Sep 23 10:11 ns
-r--r--r-- 1 root 0 0 Sep 23 10:11 numa_maps
-rw-r--r-- 1 root 0 0 Sep 23 10:11 oom_adj
-r--r--r-- 1 root 0 0 Sep 23 10:11 oom_score
-rw-r--r-- 1 root 0 0 Sep 23 10:11 oom_score_adj
-r-------- 1 root 0 0 Sep 23 10:11 pagemap
-r-------- 1 root 0 0 Sep 23 10:11 patch_state
-r-------- 1 root 0 0 Sep 23 10:11 personality
-rw-r--r-- 1 root 0 0 Sep 23 10:11 projid_map
lrwxrwxrwx 1 root 0 0 Sep 23 10:11 root -> /
-rw-r--r-- 1 root 0 0 Sep 23 10:11 sched
-r--r--r-- 1 root 0 0 Sep 23 10:11 schedstat
-r--r--r-- 1 root 0 0 Sep 23 10:11 sessionid
-rw-r--r-- 1 root 0 0 Sep 23 10:11 setgroups
-r--r--r-- 1 root 0 0 Sep 23 10:11 smaps
-r--r--r-- 1 root 0 0 Sep 23 10:11 smaps_rollup
-r-------- 1 root 0 0 Sep 23 10:11 stack
-r--r--r-- 1 root 0 0 Sep 23 10:10 stat
-r--r--r-- 1 root 0 0 Sep 23 10:11 statm
-r--r--r-- 1 root 0 0 Sep 23 10:10 status
-r-------- 1 root 0 0 Sep 23 10:11 syscall
dr-xr-xr-x 3 root 0 0 Sep 23 10:11 task
-r--r--r-- 1 root 0 0 Sep 23 10:11 timers
-rw-rw-rw- 1 root 0 0 Sep 23 10:11 timerslack_ns
-rw-r--r-- 1 root 0 0 Sep 23 10:11 uid_map
-r--r--r-- 1 root 0 0 Sep 23 10:11 wchan

/proc/984/fd:
total 0
lr-x------ 1 root 0 64 Sep 23 10:11 0 -> pipe:[19409]
l-wx------ 1 root 0 64 Sep 23 10:11 1 -> pipe:[19410]
lrwx------ 1 root 0 64 Sep 23 10:11 2 -> /oldsys/dev/console


But it does contain references to /oldroot in its list of mapped
libraries (/proc/984/maps):

563b63002000-563b63003000 r--p 00000000 00:05 19404 / (deleted)
563b63003000-563b63004000 r-xp 00001000 00:05 19404 / (deleted)
563b63004000-563b63005000 r--p 00002000 00:05 19404 / (deleted)
563b63005000-563b63006000 r--p 00002000 00:05 19404 / (deleted)
563b63006000-563b63007000 rw-p 00003000 00:05 19404 / (deleted)
563b63fb4000-563b63fd5000 rw-p 00000000 00:00 0 [heap]
7fa3a46cc000-7fa3a4882000 r-xp 00000000 00:2a 7728 /oldroot/lib64/libc-2.27.so
7fa3a4882000-7fa3a4a82000 ---p 001b6000 00:2a 7728 /oldroot/lib64/libc-2.27.so
7fa3a4a82000-7fa3a4a86000 r--p 001b6000 00:2a 7728 /oldroot/lib64/libc-2.27.so
7fa3a4a86000-7fa3a4a88000 rw-p 001ba000 00:2a 7728 /oldroot/lib64/libc-2.27.so
7fa3a4a88000-7fa3a4a8c000 rw-p 00000000 00:00 0
7fa3a4a8c000-7fa3a4ab1000 r-xp 00000000 00:2a 7720 /oldroot/lib64/ld-2.27.so
7fa3a4ca7000-7fa3a4ca9000 rw-p 00000000 00:00 0
7fa3a4cb1000-7fa3a4cb2000 r--p 00025000 00:2a 7720 /oldroot/lib64/ld-2.27.so
7fa3a4cb2000-7fa3a4cb3000 rw-p 00026000 00:2a 7720 /oldroot/lib64/ld-2.27.so
7fa3a4cb3000-7fa3a4cb4000 rw-p 00000000 00:00 0
7ffea03b4000-7ffea03d5000 rw-p 00000000 00:00 0 [stack]
7ffea03df000-7ffea03e2000 r--p 00000000 00:00 0 [vvar]
7ffea03e2000-7ffea03e4000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]

So the quick fix would be to extend the check for root references to
also look into /proc/$PID/maps. Something like this (verified):

--- dracut-lib.sh.orig 2018-09-18 13:24:49.000000000 +0300
+++ dracut-lib.sh 2018-09-23 10:31:13.300054544 +0300
@@ -118,7 +118,7 @@ killall_proc_mountpoint() {
         esac
         [ -e "/proc/$_pid/exe" ] || continue
         [ -e "/proc/$_pid/root" ] || continue
-        strstr "$(ls -l -- "/proc/$_pid" "/proc/$_pid/fd" 2>/dev/null)" "$1" && kill -9 "$_pid"
+        strstr "$(ls -l -- "/proc/$_pid" "/proc/$_pid/fd" 2>/dev/null; cat "/proc/$_pid/maps" 2> /dev/null)" "$1" && kill -9 "$_pid"
     done
 }


Note that other places use a similar check (the most obvious being the
/shutdown script itself) and likely need the same fix. If there are no
objections, I would introduce a helper function for the check and use
it everywhere instead of open coding it.
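
A minimal sketch of what such a helper might look like, assuming a POSIX
shell as in the dracut scripts. The name proc_refers_to and its optional
third argument (an alternative procfs root, only there so the function
can be tried against a fake tree) are hypothetical and not part of
dracut; a case pattern stands in for dracut's strstr.

```sh
# Hypothetical helper: succeed if process $1 references path $2 through
# its cwd/root/exe links, its open file descriptors, or its mappings.
# $3 is an optional procfs root (defaults to /proc), for testing only.
proc_refers_to() {
    _prt_dir="${3:-/proc}/$1"
    # Collect the symlink targets and the mapped file names in one string;
    # missing files are ignored, as in the patch above.
    _prt_refs="$(
        ls -l -- "$_prt_dir" "$_prt_dir/fd" 2>/dev/null
        cat "$_prt_dir/maps" 2>/dev/null
    )" || :
    case "$_prt_refs" in
        *"$2"*) return 0 ;;
    esac
    return 1
}
```

Callers such as killall_proc_mountpoint() could then reduce their check
to something like: proc_refers_to "$_pid" "$1" && kill -9 "$_pid"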
Lennart Poettering
2018-09-24 13:20:47 UTC
Post by Andrei Borzenkov
Dracut /shutdown script first tries to kill all processes still running
off old root. Unfortunately this fails for special user process that
runs bpfilter because it does not include reference to /oldroot in
places where dracut looks for in killall_proc_mountpoint()
Hmm, when we invoke the /shutdown executable we already executed our
process killing spree as part of systemd-shutdown. How come your
processes even survive that long? What am I missing?

Lennart
--
Lennart Poettering, Red Hat
Andrei Borzenkov
2018-09-24 16:30:02 UTC
Post by Lennart Poettering
Post by Andrei Borzenkov
Dracut /shutdown script first tries to kill all processes still running
off old root. Unfortunately this fails for special user process that
runs bpfilter because it does not include reference to /oldroot in
places where dracut looks for in killall_proc_mountpoint()
Hmm, when we invoke the /shutdown executable we already executed our
process killing spree as part of systemd-shutdown. How come your
processes even survive that long?

        p = procfs_file_alloca(pid, "cmdline");
        f = fopen(p, "re");
        if (!f)
                return true; /* not really, but has the desired effect */

        count = fread(&c, 1, 1, f);

        /* Kernel threads have an empty cmdline */
        if (count <= 0)
                return true;


This process is spawned as a special kernel thread, even though it is
otherwise a normal user process.
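
The effect of the empty-cmdline check can be reproduced from a shell.
This is only a sketch of the logic quoted above, not systemd code; the
function name would_be_skipped and its optional procfs-root argument
are made up for illustration.

```sh
# Mimics the empty-cmdline heuristic: succeed (meaning "skip this task,
# it looks like a kernel thread") when /proc/PID/cmdline cannot be read
# or yields no bytes.
# $2 is an optional procfs root (defaults to /proc), for testing only.
would_be_skipped() {
    _wbs_file="${2:-/proc}/$1/cmdline"
    # An unreadable file is treated like the fopen() failure path.
    [ -r "$_wbs_file" ] || return 0
    # Try to read a single byte, like fread(&c, 1, 1, f).
    _wbs_count="$(head -c 1 "$_wbs_file" 2>/dev/null | wc -c)"
    [ "$_wbs_count" -eq 0 ]
}
```

On the affected system, would_be_skipped 984 should succeed, since the
bpfilter helper has an empty cmdline even though it runs user code.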

net/bpfilter/bpfilter_kern.c:load_umh():

        /* fork usermode process */
        err = fork_usermode_blob(&bpfilter_umh_start,
                                 &bpfilter_umh_end - &bpfilter_umh_start,
                                 &info);
        if (err)
                return err;
        pr_info("Loaded bpfilter_umh pid %d\n", info.pid);
Lennart Poettering
2018-09-24 16:52:55 UTC
Post by Andrei Borzenkov
Post by Lennart Poettering
Post by Andrei Borzenkov
Dracut /shutdown script first tries to kill all processes still running
off old root. Unfortunately this fails for special user process that
runs bpfilter because it does not include reference to /oldroot in
places where dracut looks for in killall_proc_mountpoint()
Hmm, when we invoke the /shutdown executable we already executed our
process killing spree as part of systemd-shutdown. How come your
processes even survive that long?
p = procfs_file_alloca(pid, "cmdline");
f = fopen(p, "re");
if (!f)
return true; /* not really, but has the desired effect */
count = fread(&c, 1, 1, f);
/* Kernel threads have an empty cmdline */
if (count <= 0)
return true;
This process is spawned as special kernel thread, even though it is
otherwise normal user process.
I am sorry, what? Are you saying there's now a third kind of task?
real kernel threads, real userspace processes, and weird shit running
kernel code that in turn runs userspace supplied programs, and all
that under user control?

If so, yuck...

Under which parent PID do they show up? kthreadd or somewhere further
down?

Do these processes report PF_KTHREAD in /proc/$PID/stat? i.e. do they
pass the recently reworked is_kernel_thread() tests?

We might want to update killall.c then so that it does not make
assumptions on /proc/$PID/cmdline validity anymore, but strictly uses
is_kernel_thread(). That should fix things properly for you, no? That
way dracut won't even see this new kind of process at all...

Lennart
--
Lennart Poettering, Red Hat
Andrei Borzenkov
2018-09-24 17:17:28 UTC
Post by Lennart Poettering
Post by Andrei Borzenkov
Post by Lennart Poettering
Post by Andrei Borzenkov
Dracut /shutdown script first tries to kill all processes still running
off old root. Unfortunately this fails for special user process that
runs bpfilter because it does not include reference to /oldroot in
places where dracut looks for in killall_proc_mountpoint()
Hmm, when we invoke the /shutdown executable we already executed our
process killing spree as part of systemd-shutdown. How come your
processes even survive that long?
p = procfs_file_alloca(pid, "cmdline");
f = fopen(p, "re");
if (!f)
return true; /* not really, but has the desired effect */
count = fread(&c, 1, 1, f);
/* Kernel threads have an empty cmdline */
if (count <= 0)
return true;
This process is spawned as special kernel thread, even though it is
otherwise normal user process.
I am sorry, what? Are you saying there's now a third kind of task?
real kernel threads, real userspace processes, and weird shit running
kernel code that in turn runs userspace supplied programs, and all
that under user control?
No, it is not exactly "user control". It runs an executable embedded in
a kernel module, so it is not arbitrary code. In this particular case at
least.
Post by Lennart Poettering
If so, yuck...
Under which parent PID do they show up? kthreadd or somewhere further
down?
I showed it in the original post.

10:~ # ps -ef | fgrep '[none]'
root 984 2 0 09:46 ? 00:00:00 [none]

Yes, this is kthreadd.
Post by Lennart Poettering
Do these processes report PF_KTHREAD in /proc/$PID/stat? i.e. do they
pass the recently reworked is_kernel_thread() tests?
No. The flags are 4194560 == 0x400100 == PF_RANDOMIZE|PF_SUPERPRIV.
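
The flag arithmetic above can be double-checked with a few lines of
shell. This is a sketch, not the systemd is_kernel_thread()
implementation; the PF_* values are the ones defined in the kernel's
include/linux/sched.h, and the function names are hypothetical.

```sh
PF_SUPERPRIV=$((0x00000100))   # used super-user privileges
PF_KTHREAD=$((0x00200000))     # kernel thread
PF_RANDOMIZE=$((0x00400000))   # randomized virtual address space

# Succeeds if the flags value $1 (decimal, as printed in /proc/PID/stat)
# has the PF_KTHREAD bit set.
flags_have_kthread() {
    [ $(( $1 & $PF_KTHREAD )) -ne 0 ]
}

# Prints the flags field (field 9) of a /proc/PID/stat line read from
# stdin. The comm field may contain spaces and parentheses, so everything
# up to the last ") " is dropped before counting fields.
stat_flags() {
    IFS= read -r _sf_line || [ -n "$_sf_line" ] || return 1
    set -- ${_sf_line##*") "}
    # now: $1=state $2=ppid $3=pgrp $4=session $5=tty_nr $6=tpgid $7=flags
    printf '%s\n' "$7"
}
```

With these definitions, flags_have_kthread "$(stat_flags < /proc/984/stat)"
would fail for the value 4194560 reported here, because that value is
exactly PF_RANDOMIZE|PF_SUPERPRIV and has no PF_KTHREAD bit.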

And sorry, I cannot comment on "these processes"; I have seen only one
concrete example. I have no idea how widespread use of this facility is.
Post by Lennart Poettering
We might want to update killall.c then so that it does not make
assumptions on /proc/$PID/cmdline validity anymore, but strictly uses
is_kernel_thread(). That should fix things properly for you, no? That
way dracut won't even see these new kind processes at all...
Well, I suppose there could be corner cases where the executable and
libraries are on different filesystems, but that can wait for a
real-life example.
Lennart Poettering
2018-09-24 17:22:05 UTC
Post by Andrei Borzenkov
Post by Lennart Poettering
I am sorry, what? Are you saying there's now a third kind of task?
real kernel threads, real userspace processes, and weird shit running
kernel code that in turn runs userspace supplied programs, and all
that under user control?
No, it is not exactly "user control". It runs executable embedded into
kernel module. So it is not arbitrary code. In this particular case at
least.
By "user control" I meant that they are kill()-able by users (kernel
threads generally are not).
Post by Andrei Borzenkov
Post by Lennart Poettering
Do these processes report PF_KTHREAD in /proc/$PID/stat? i.e. do they
pass the recently reworked is_kernel_thread() tests?
No. The flags are 4194560 == 0x400100 == PF_RANDOMIZE|PF_SUPERPRIV.
And sorry, I cannot comment on "these processes"; I have seen only one
concrete example. I have no idea how widespread use of this facility is.
Post by Lennart Poettering
We might want to update killall.c then so that it does not make
assumptions on /proc/$PID/cmdline validity anymore, but strictly uses
is_kernel_thread(). That should fix things properly for you, no? That
way dracut won't even see these new kind processes at all...
Well, I suppose there could be corner cases when executable and
libraries are from different filesystems, but this better waits for real
life example then.
I prepped this PR:

https://github.com/systemd/systemd/pull/10159

I think this should fix your issue, could you test? (using PF_KTHREAD
checking is more correct anyway, hence regardless this should really
be the right way and be merged)

Lennart
--
Lennart Poettering, Red Hat
Cristian Rodríguez
2018-09-26 00:35:57 UTC
Post by Andrei Borzenkov
This process is spawned as special kernel thread, even though it is
otherwise normal user process.
WUT? So how is this new kind of task supposed to be handled by
userspace? Looks like a kernel bug to me.

Olivier Brunel
2018-09-24 14:55:31 UTC
On Mon, 24 Sep 2018 15:20:47 +0200
Post by Lennart Poettering
Post by Andrei Borzenkov
Dracut /shutdown script first tries to kill all processes still
running off old root. Unfortunately this fails for special user
process that runs bpfilter because it does not include reference
to /oldroot in places where dracut looks for in
killall_proc_mountpoint()
Hmm, when we invoke the /shutdown executable we already executed our
process killing spree as part of systemd-shutdown. How come your
processes even survive that long? What am I missing?
I believe it's because the bpfilter helper process is identified as a
kernel thread - since it has an empty command line - and therefore not
killed.

I personally feel this is a bug (in the kernel), but apparently
this whole bpfilter thing isn't quite ready yet and shouldn't be
used for the moment -- so hopefully it'll improve/be fixed in the
meantime.
You can see this thread[1] about the issue.

Cheers,



[1] https://www.spinics.net/lists/netdev/msg520030.html