Discussion: question on special configuration case
Hebenstreit, Michael
2016-06-07 15:13:56 UTC
Sorry for directing this question here, but I did not find any mailing list that would be a better fit.

Problem: I'm running an HPC benchmarking cluster. We are evaluating RH7/CentOS7/OL7 and have a problem with system noise generated by the systemd components (v 219-19.0.2, see below).

Background: All cores of the CPU (up to 288) are utilized 99.99% by the application. Because of the tight node-to-node coupling (of programs running on 200+ nodes), every time an OS process wakes up it automatically delays EVERY process on EVERY node. As those small interruptions are not synchronized across the cluster, the overall effect on effective performance is "time of the single delay" times "number of nodes in the job". Therefore we need to keep the OS of our systems stripped down to an absolute bare minimum.

a) we have no use for any type of logging. The only log we have is kernel dmesg
b) there is only a single user at any time on the system (logging in via ssh).
c) The only daemons running are those necessary for NFS, ntp and sshd.
d) we do not run Gnome or similar desktop.

Goal: For these reasons we want to shut down dbus-daemon, systemd-journald, systemd-logind and, after startup, also systemd-udevd. In our special case they serve no purpose. Unfortunately the basic configuration options do not allow this.
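(For illustration, "shutting down" these in systemd terms means masking rather than just stopping, so socket or bus activation cannot pull them back in - a sketch, assuming the default EL7 unit names:

systemctl stop systemd-logind.service dbus.service dbus.socket
systemctl mask systemd-logind.service dbus.service dbus.socket
# after boot has settled, udev and its activation sockets as well
systemctl stop systemd-udevd.service systemd-udevd-control.socket systemd-udevd-kernel.socket
systemctl mask systemd-udevd.service systemd-udevd-control.socket systemd-udevd-kernel.socket
# journald is part of the goal too, but see the replies below before masking it

)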

Questions:
Can you provide any guidance?
Will PID 1 (systemd) continue to do its work (first tests were already successful)?
What are security implications when shutting down systemd-logind?
Is there any mailing list better suited that you can point me to?


Thanks for any help you can provide
Michael





Installed:
systemd-networkd-219-19.0.2.el7_2.9.x86_64
systemd-219-19.0.2.el7_2.9.x86_64
systemd-devel-219-19.0.2.el7_2.9.x86_64
systemd-sysv-219-19.0.2.el7_2.9.x86_64
systemd-libs-219-19.0.2.el7_2.9.x86_64
systemd-python-219-19.0.2.el7_2.9.x86_64
systemd-resolved-219-19.0.2.el7_2.9.x86_64

------------------------------------------------------------------------
Michael Hebenstreit                 Senior Cluster Architect
Intel Corporation, MS: RR1-105/H14  Software and Services Group/DCE
4100 Sara Road          Tel.:   +1 505-794-3144
Rio Rancho, NM 87124
UNITED STATES                       E-mail: ***@intel.com
Jóhann B. Guðmundsson
2016-06-07 15:50:49 UTC
Post by Hebenstreit, Michael
we need to keep the OS of our systems stripped down to an absolute bare minimum.
If you need an absolute bare minimum systemd [¹] then you need to
create/maintain your entire distribution for that (for example, you
would build systemd with just what you need, use systemd's built-in
networkd instead of NetworkManager and timesyncd instead of ntp, change
sshd to be socket activated, install only the components necessary for
the application to run, kernel modules and so forth).

If you need to perform benchmarks on Red Hat and its derivatives/clones,
then wouldn't disabling these skew the benchmark output on those?

Can you clarify how dbus-daemon, systemd-journald, systemd-logind,
systemd-udevd are causing issues/impacting the above setup, beyond
"I don't think we need it, hence we want to disable it"?

JBG

1. https://freedesktop.org/wiki/Software/systemd/MinimalBuilds/
Lennart Poettering
2016-06-07 16:23:28 UTC
Post by Hebenstreit, Michael
Sorry for directing this question here, but I did not find any
mailing list that would be a better fit.
Problem: I'm running an HPC benchmarking cluster. We are evaluating
RH7/CentOS7/OL7 and have a problem with system noise generated by
the systemd components (v 219-19.0.2, see below).
Background: All cores of the CPU (up to 288) are utilized 99.99% by
the application. Because of the tight coupling node to node (of
programs running on 200+ nodes) every time an OS process wakes up
this automatically delays EVERY process on EVERY node. As those
small interruptions are not synchronized over the cluster, the
overall effect on the effective performance is "time of the single
delay" times "number of nodes in the job". Therefore we need to keep
the OS of our systems are stripped down to an absolute bare minimum.
a) we have no use for any type of logging. The only log we have is kernel dmesg
b) there is only a single user at any time on the system (logging in via ssh).
c) The only daemons running are those necessary for NFS, ntp and sshd.
d) we do not run Gnome or similar desktop.
Goal: For these reasons we want to shut down dbus-daemon,
systemd-journald, systemd-logind and after startup also
systemd-udevd. In our special case they do not serve any
purpose. Unfortunately the basic configuration options do not allow
this.
This is simply not supported on systemd. Systems without journald and
udevd are explicitly not supported, and systems without dbus-daemon
are only really supported for early boot schemes.

You can of course ignore what we support and what not, but then you
really should know what you are doing, and you are basically on your
own.

Note that you can connect the journal to kmsg, if you like, and turn
off local storage, via ForwardToKMsg= and Storage= in journald.conf.
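(For example, a minimal journald.conf sketch using just those two options:

[Journal]
Storage=none
ForwardToKMsg=yes

)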
Post by Hebenstreit, Michael
Can you provide any guidance?
Will PID 1 (systemd) continue to do its work (first tests were already successful)?
No, it will not. The only daemon of those listed you can realistically
do without is logind, and if you do that, then you basically roll your
own distro.
Post by Hebenstreit, Michael
What are security implications when shutting down
systemd-logind?
Well, there's no tracking of sessions anymore, i.e. polkit and all
that stuff won't work reasonably anymore, nor will anything else that
involves anything graphical and so on.

If I were you I'd actually look at what wakes up the system IRL instead
of just trying to blanket-remove everything. If you do that, then you
are going to have to invest a lot of time dealing with the fallout
yourself.

Lennart
--
Lennart Poettering, Red Hat
Hebenstreit, Michael
2016-06-07 22:17:13 UTC
Thanks for the answers
Well, there's no tracking of sessions anymore, i.e. polkit and all that stuff won't work reasonably anymore, nor will anything else that involves anything graphical and so on.
Nothing listed is in any way used on our system, as already laid out in the original mail. Your answer implies, though, that there is no real security issue (like sshd not working or being exploitable to gain access to other accounts) - is this correct?
If I were you I'd actually look at what wakes up the system IRL instead of just trying to blanket-remove everything.
Can you clarify how dbus-daemon, systemd-journald, systemd-logind, systemd-udevd are causing issues/impacting the above setup, beyond "I don't think we need it, hence we want to disable it"?
The approach "if you do not need it, do not run it" works for this case pretty well. Systemd demons take up cycles without doing anything useful for us. We do not do any logging, we do not change the hardware during runtime - so no matter how little time those unit consumes, it impacts scalability. As explained this is not acceptable in our environment.
If you need to perform benchmarks on Red Hat and its derivatives/clones, then wouldn't disabling these skew the benchmark output on those?
Not if you have some "easy" steps to duplicate the environment.
If you need an absolute bare minimum systemd [¹] then you need to create/maintain your entire distribution for that
I would not call it a distribution - but yes, building/configuring a new OS out of the basic components supplied by RH/CentOS is similar to a new distro.


I understand this usage model cannot be compared to laptops or web servers. But basically you are saying systemd is not usable for our High Performance Computing use case and I might be better off replacing it with SysV init. I was hoping for some simpler solution, but if it's not possible then that's life. Will certainly make an interesting topic at HPC conferences :P

Regards
Michael

Jóhann B. Guðmundsson
2016-06-07 23:10:07 UTC
Post by Hebenstreit, Michael
I understand this usage model cannot be compared to laptops or web servers. But basically you are saying systemd is not usable for our High Performance Computing use case and I might be better off replacing it with SysV init. I was hoping for some simpler solution, but if it's not possible then that's life. Will certainly make an interesting topic at HPC conferences :P
I personally would be interested in comparing your legacy SysV init
setup to a systemd one, since systemd is widely deployed on embedded
devices with minimal builds (systemd, udevd and journald) where
systemd's footprint and resource usage have been significantly reduced.

Given that I have pretty much crawled through the entire mud bath that
makes up the core/baseOS layer in Fedora (which RHEL and its clones
derive from) when I was working on integrating systemd into the
distribution, I'm also interested in how you plan on making a minimal
targeted base image which installs and uses just what you need from
that (dependency) mess without having to rebuild those components
first. (I would think systemd "tweaking" came after you had solved that
problem, along with rebuilding the kernel, if your plan is to use just
what you need.)

JBG
Hebenstreit, Michael
2016-06-07 23:50:36 UTC
The base system is actually pretty large (currently 1200 packages) - I hate that myself. Still, performance-wise the packages are not the issue. The SSDs used can easily handle that, and library loads only happen once at startup (where the difference can be measured, but if the runtime is 24h, a startup time of 1s is not an issue). The kernel is tweaked, but those changes are relatively small.

The single biggest problem is OS noise, i.e. every cycle that the CPU(s) spend on anything but the application. This is caused by a combination of "large number of nodes" and "tightly coupled job processes".

Our current (RH6-based) system runs with a minimal number of daemons, none of them taking up any CPU time unless they are used. Systemd processes are not so well behaved. After a few hours of running they are already at a few seconds. On a single system - or on systems working independently, like server farms - that is not an issue. On our systems each second lost is multiplied by the number of nodes in the job (let's say 200, but it could also be up to 10000 or more on large installations) due to tight coupling. If 3 daemons use 1s a day each (and this is realistic on Xeon Phi Knights Landing systems), that slows down performance by almost 1% (3 * 200 / 86400 = 0.7% to be exact). And - we do not gain anything from those daemons after initial startup!
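(For illustration, those accumulated seconds can be read per daemon with standard tools - a sketch, using the daemon names given later in this thread:

# print each daemon's cumulative CPU time since it started
for d in dbus-daemon systemd-journald systemd-logind; do
    ps -o comm=,cputime= -p "$(pidof "$d")"
done

)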

My worst experience with such issues was on a cluster that lost 20% application performance due to a badly configured crond daemon. Now, I do not expect systemd to have such a negative impact, but even 1%, or even 0.5%, of expected loss is too much in our case.


Greg KH
2016-06-08 01:53:55 UTC
Post by Hebenstreit, Michael
The base system is actually pretty large (currently 1200 packages) - I
hate that myself. Still performance wise the packages are not the
issue. The SSDs used can easily handle that, and library loads are
only happening once at startup (where the difference can be measured,
but if the runtime is 24h, a startup time of 1s is not an issue). Kernel
is tweaked, but those changes are relatively small.
The single biggest problem is OS noise, i.e. every cycle that
the CPU(s) spend on anything but the application. This is caused
by a combination of "large number of nodes" and "tightly coupled job
processes".
Then bind your applications to the cpus and don't let anything else run
on them, including the kernel. That way you will not get any jitter or
latencies and can use the CPUs to their max, without having to worry
about anything. Leave one CPU alone to have the kernel be able to
manage its housekeeping tasks (you seem to be ignoring that issue when
looking at systemd, which is odd to me as it's more noise than anything
else), and also let everything else run there as well.

That what "most" other system designers in your situation do :)
Post by Hebenstreit, Michael
Our current (RH6-based) system runs with a minimal number of daemons,
none of them taking up any CPU time unless they are used. Systemd
processes are not so well behaved. After a few hours of running they are
already at a few seconds.
What processes are showing up in your count? Perhaps it's just a bug
that needs to be fixed.
Post by Hebenstreit, Michael
On a single system - or on systems working independently, like server
farms - that is not an issue. On our systems each second lost is
multiplied by the number of nodes in the job (let's say 200, but it
could also be up to 10000 or more on large installations) due to tight
coupling. If 3 daemons use 1s a day each (and this is realistic on Xeon
Phi Knights Landing systems), that slows down performance by almost 1%
(3 * 200 / 86400 = 0.7% to be exact). And - we do not gain anything
from those daemons after initial startup!
Your kernel is eating more CPU time than those 1s numbers, why you
aren't complaining about that seems strange to me :)
Post by Hebenstreit, Michael
My worst experience with such issues was on a cluster that lost 20%
application performance due to a badly configured crond daemon.
That's not the issue here though.

Again, what tasks are causing CPU time for "no good reason"? Let's see
if we can just fix them.

thanks,

greg k-h
Hebenstreit, Michael
2016-06-08 02:04:48 UTC
Post by Greg KH
That's not the issue here though.
Nope, but it's an example of how bad things can get.
Post by Greg KH
What processes are showing up in your count? Perhaps it's just a bug that needs to be fixed.
/bin/dbus-daemon
/usr/lib/systemd/systemd-journald
/usr/lib/systemd/systemd-logind

I understand from the previous mails those are necessary to make systemd work - but here they are doing nothing more than talking to each other!
Post by Greg KH
That what "most" other system designers in your situation do :)
Unfortunately I cannot reserve a CPU for the OS - I'd like to, but the app developers insist on using all 254 cores available
Post by Greg KH
Your kernel is eating more CPU time than those 1s numbers, why you aren't complaining about that seems strange to me :)
I also checked the kernel - the last time I looked on RH6, all kernel threads taking up clock ticks were actually doing work ^^
No time yet to do the same on the RH7 kernel


Greg KH
2016-06-08 04:00:53 UTC
Post by Hebenstreit, Michael
Post by Greg KH
What processes are showing up in your count? Perhaps it's just a
bug that needs to be fixed.
/bin/dbus-daemon
/usr/lib/systemd/systemd-journald
/usr/lib/systemd/systemd-logind
I understand from the previous mails those are necessary to make
systemd work - but here they are doing nothing more than talking to
each other!
Really? No journal messages are getting created at all? No users
logging in/out? What does strace show on those processes?
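(For instance - attach only briefly, since strace itself perturbs timing:

# timestamp every syscall the daemon wakes up for
strace -f -tt -p "$(pidof systemd-logind)"

)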
Post by Hebenstreit, Michael
Post by Greg KH
That what "most" other system designers in your situation do :)
Unfortunately I cannot reserve a CPU for the OS - I'd like to, but the
app developers insist on using all 254 cores available
So you are hurting all 253 cores because you can't spare 1? If you do
the math I think you will find you will get increased throughput. But
what do I know... :)
Post by Hebenstreit, Michael
Post by Greg KH
Your kernel is eating more CPU time than those 1s numbers, why you
aren't complaining about that seems strange to me :)
I also checked the kernel - the last time I looked on RH6, all kernel
threads taking up clock ticks were actually doing work ^^ No time yet
to do the same on the RH7 kernel
Again, that's not the issue; you can't see the time the kernel is using
to do its work, but it is there (interrupts, scheduling, housekeeping,
etc.). So get it out of the way entirely and see how much faster your
application runs without it even present on those CPUs, if you really
have CPU-bound processes. That's what the feature was made for - people
in your situation - so ignoring it and going after something else seems
very strange to me.

greg k-h
Hebenstreit, Michael
2016-06-08 06:43:04 UTC
Really? No journal messages are getting created at all? No users logging in/out? What does strace show on those processes?
Yes, messages are created - but I'm not interested in them. Maybe a user logs in for a 6h job - that's already tracked by the cluster software. There are virtually no daemons running, no changes to the hardware - so all those daemons are doing is looking out for themselves. Not really productive
So you are hurting all 253 cores because you can't spare 1?
The situation is a bit more complex. I have 64 physical cores, with 4 units each for integer operations and 2 floating point units. So if I reserve one integer unit for the OS, due to cache hierarchies and other oddities, I essentially take down 4 cores. The applications typically scale best if they run on a power-of-2 number of cores.
Again, that's not the issue, you can't see the time the kernel is using to do its work, but it is there (interrupts, scheduling, housekeeping,
etc.)

Shouldn't that show up in the time for worker threads? And I'm not arguing that you are wrong. We should minimize that and, if possible, keep all OS work on an extra core. That does not invalidate my argument that those daemons are doing nothing more than housekeeping themselves in a very complicated fashion, and they are wasting resources.



Greg KH
2016-06-08 16:09:53 UTC
Post by Hebenstreit, Michael
Really? No journal messages are getting created at all? No users logging in/out? What does strace show on those processes?
Yes, messages are created - but I'm not interested in them. Maybe a
user logs in for a 6h job - that's already tracked by the cluster
software. There are virtually no daemons running, no changes to the
hardware - so all those daemons are doing is looking out for
themselves. Not really productive
If messages are created, they have to go somewhere; to think that they
would be "free" is crazy :)
Post by Hebenstreit, Michael
So you are hurting all 253 cores because you can't spare 1?
The situation is a bit more complex. I have 64 physical cores, with 4
units each for integer operations and 2 floating point units. So if I
reserve one integer unit for the OS, due to cache hierarchies and other
oddities, I essentially take down 4 cores. The applications typically
scale best if they run on a power-of-2 number of cores.
You can still run the applications on the "non-reserved" core, it's just
that the kernel can't get access to any of the others. So you only take
the hit of any potential wakeups and other kernel housekeeping on that
one core.

Again, try it, you might be pleasantly surprised as your workload is
_exactly_ what that feature was created for. To ignore it without
testing seems bizarre to me. If it doesn't work for you, then either
that kernel feature needs to be fixed, or maybe we can just rip it out,
so you need to tell the kernel developers about it.
Post by Hebenstreit, Michael
Again, that's not the issue, you can't see the time the kernel is using to do its work, but it is there (interrupts, scheduling, housekeeping,
etc.)
shouldn't that show up in the time for worker threads?
How do you account for interrupts, I/O, scheduler processing time, etc?
:)
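(That time is visible only in the aggregate cpu line of /proc/stat; its irq and softirq columns in particular are charged to no process:

# fields after "cpu": user nice system idle iowait irq softirq steal guest guest_nice
head -1 /proc/stat

)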
Post by Hebenstreit, Michael
And I'm not arguing that you are wrong. We should minimize that and,
if possible, keep all OS work on an extra core. That does not
invalidate my argument that those daemons are doing nothing more than
housekeeping themselves in a very complicated fashion, wasting
resources.
Again, I think you are wasting more resources than you realize just
because you can't see it :)

And as others have pointed out, turn off watchdogs and you should be
fine from a systemd point of view.

thanks,

greg k-h

Lennart Poettering
2016-06-08 12:57:41 UTC
Post by Greg KH
Post by Hebenstreit, Michael
Post by Greg KH
What processes are showing up in your count? Perhaps it's just a
bug that needs to be fixed.
/bin/dbus-daemon
/usr/lib/systemd/systemd-journald
/usr/lib/systemd/systemd-logind
I understand from the previous mails those are necessary to make
systemd work - but here they are doing nothing more than talking to
each other!
Really? No journal messages are getting created at all? No users
logging in/out? What does strace show on those processes?
It's the "watchdog" logic most likely. i.e. systemd has a per-service
setting WatchdogSec=. If that's set the daemons have to ping back PID
1 in regular intervals, or otherwise are assumed hanging.

On top of that, PID 1 actually talks to hardware watchdogs by default,
if there are any.

If both of those are turned off, then there should really be zero
wakeups...
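(A sketch of turning both off - the drop-in path is illustrative:

# per-service: e.g. /etc/systemd/system/systemd-journald.service.d/no-watchdog.conf
[Service]
WatchdogSec=0

# hardware watchdog: /etc/systemd/system.conf
[Manager]
RuntimeWatchdogSec=0

Then run systemctl daemon-reload and restart the affected services.)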

Lennart
--
Lennart Poettering, Red Hat
Andrew Thompson
2016-06-08 04:06:25 UTC
On Tue, Jun 7, 2016 at 7:04 PM, Hebenstreit, Michael
Post by Hebenstreit, Michael
Post by Greg KH
That what "most" other system designers in your situation do :)
Unfortunately I cannot reserve a CPU for OS - I'd like to, but the app developers insist to use all 254 cores available
Tough situation. Use Forth, it's the only way, my friend.
Simon McVittie
2016-06-08 13:43:00 UTC
Post by Hebenstreit, Michael
Post by Greg KH
What processes are showing up in your count? Perhaps it's just a bug that needs to be fixed.
/bin/dbus-daemon
/usr/lib/systemd/systemd-journald
/usr/lib/systemd/systemd-logind
dbus-daemon will wake up when there are D-Bus messages to be delivered,
or when D-Bus-related data in /usr/share/dbus-1/ changes. If there is
nothing emitting D-Bus messages then it shouldn't normally wake up.

In dbus >= 1.10 you can run "dbus-monitor --system" as root, and you'll
see any D-Bus message that goes past. Unfortunately this use-case for
monitoring didn't really work in previous versions.

If you want it to stay off the majority of your CPU cores, Greg's
recommendation to set up CPU affinity seems wise. dbus-daemon is
single-threaded (or 2-threaded if SELinux and the audit subsystem are
active), so it will normally only run on one CPU at a time anyway.
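(If you do pin it, a drop-in along these lines would work - the path is illustrative, assuming core 0 is the housekeeping core:

# /etc/systemd/system/dbus.service.d/affinity.conf
[Service]
CPUAffinity=0

)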
--
Simon McVittie
Collabora Ltd. <http://www.collabora.com/>
Mantas Mikulėnas
2016-06-08 04:34:49 UTC
This sounds like you could start by unsetting WatchdogSec= for those
daemons. Other than the watchdog, they shouldn't be using any CPU unless
explicitly contacted.
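(The value a service currently has can be checked before overriding it, e.g.:

systemctl show systemd-journald -p WatchdogUSec

)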

Hebenstreit, Michael
2016-06-08 06:51:33 UTC
Thanks for this and the other suggestions!

So for starters we’ll disable logind and dbus, increase WatchdogSec= and see where the footprint is – before disabling journald, if necessary, in a next step.

Regards
Michael


Jóhann B. Guðmundsson
2016-06-08 13:28:53 UTC
Post by Hebenstreit, Michael
Thanks for this and the other suggestions!
So for starters we’ll disable logind and dbus, increase WatchdogSec=
and see where the footprint is – before disabling journald, if
necessary, in a next step.
You cannot disable the journal, but you can reduce it; the following
should give the least amount of logging in all potential scenarios and
usage ;)

Just create "/etc/systemd/journald.conf.d/10-hpc-tweaks.conf" which contains:

[Journal]
Storage=none
MaxLevelStore=emerg
MaxLevelSyslog=emerg
MaxLevelKMsg=emerg
MaxLevelConsole=emerg
MaxLevelWall=emerg
TTYPath=/dev/null

Then restart the journal ( systemctl restart systemd-journald )

JBG
Lennart Poettering
2016-06-08 12:50:30 UTC
Post by Hebenstreit, Michael
Thanks for the answers
Well, there's no tracking of sessions anymore, i.e. polkit and all that stuff won't work reasonably anymore, nor will anything else that involves anything graphical and so on.
Nothing listed is in any way used on our system, as already laid out
in the original mail. Your answer implies, though, that there is no
real security issue (like sshd not working or being exploitable to
gain access to other accounts) - is this correct?
Yes, that's correct.
Post by Hebenstreit, Michael
If I were you I'd actually look at what wakes up the system IRL
instead of just trying to blanket-remove everything. Can you
clarify how dbus-daemon, systemd-journald, systemd-logind,
systemd-udevd are causing issues/impacting the above setup,
beyond "I don't think we need it, hence we want to disable
it"?
The approach "if you do not need it, do not run it" works for this
case pretty well. Systemd demons take up cycles without doing
anything useful for us. We do not do any logging, we do not change
the hardware during runtime - so no matter how little time those
unit consumes, it impacts scalability. As explained this is not
acceptable in our environment.
Well, they really shouldn't take up cycles when idle, except for the
watchdog stuff, which is easy to disable... It sounds like a much
better idea to track this down and fix it in the individual case.

Lennart
--
Lennart Poettering, Red Hat