A server says: "System is going down." But never does.
Yuri Kanivetsky
2017-11-28 15:02:22 UTC
Hi,

This mailing list is the only place where I expect to get some
helpful feedback, but feel free to suggest other places. I'd like to
investigate the situation I'm in now, find out what went wrong, and
prevent it from happening again if possible. Your help is appreciated.

As the subject says, the server reports that it's going down when I
ssh to it as root. When I ssh in as a non-root user, it prints the same
message and then closes the connection.
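
One thing that may be worth checking: the behaviour described above (root is let in with a message, other users get the message and are disconnected) matches what pam_nologin does when a nologin file exists. The sketch below only assumes that the message comes from /run/nologin or /etc/nologin, which nothing in this report confirms; show_nologin is a hypothetical helper name.

```shell
# Print the contents of any nologin file that pam_nologin would honour.
# Paths can be passed as arguments so the helper can be tried on test files;
# with no arguments it looks at the two conventional locations (assumption).
show_nologin() {
    [ "$#" -gt 0 ] || set -- /run/nologin /etc/nologin
    found=0
    for f in "$@"; do
        if [ -e "$f" ]; then
            printf '%s: %s\n' "$f" "$(cat "$f")"
            found=1
        fi
    done
    [ "$found" -eq 1 ] || echo "no nologin file found"
}
```

Running show_nologin with no arguments on the affected host would show whether such a file is present and what it says. If it is there while no shutdown is actually in progress, that would at least explain the login behaviour, though not why the file was left behind.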

In the journal I see a lot of this:


Nov 28 16:22:01 st2 systemd-journal[353]: Journal stopped
Nov 28 16:22:01 st2 systemd-journal[494]: Runtime journal is using 624.0M (max allowed 642.1M, trying to leave 963.1M free of 5.6G available → current limit 642.1M).
Nov 28 16:22:01 st2 systemd-journal[494]: Runtime journal is using 624.0M (max allowed 642.1M, trying to leave 963.1M free of 5.6G available → current limit 642.1M).
Nov 28 16:22:01 st2 systemd-journal[494]: Journal started
Nov 28 16:22:01 st2 systemd[1]: systemd-journald.service watchdog timeout (limit 1min)!
Nov 28 16:22:01 st2 systemd-journald[353]: Received SIGTERM from PID 1 (systemd).
Nov 28 16:22:01 st2 systemd[1]: Unit systemd-journald.service entered failed state.
Nov 28 16:22:01 st2 systemd[1]: systemd-journald.service has no holdoff time, scheduling restart.
Nov 28 16:22:01 st2 systemd[1]: Stopping Journal Service...
Nov 28 16:22:01 st2 systemd[1]: Starting Journal Service...
Nov 28 16:22:01 st2 systemd[1]: Started Journal Service.
Nov 28 16:22:01 st2 systemd[1]: Starting Trigger Flushing of Journal to Persistent Storage...
Nov 28 16:22:01 st2 systemd[1]: systemd-journal-flush.service: main process exited, code=exited, status=1/FAILURE
Nov 28 16:22:01 st2 systemd[1]: Failed to start Trigger Flushing of Journal to Persistent Storage.
Nov 28 16:22:01 st2 systemd[1]: Unit systemd-journal-flush.service entered failed state.


Nov 28 16:22:52 st2 systemd[1]: systemd-timesyncd.service start operation timed out. Terminating.
Nov 28 16:22:52 st2 systemd[1]: Failed to start Network Time Synchronization.
Nov 28 16:22:52 st2 systemd[1]: Unit systemd-timesyncd.service entered failed state.
Nov 28 16:22:53 st2 systemd[1]: systemd-timesyncd.service has no holdoff time, scheduling restart.
Nov 28 16:22:53 st2 systemd[1]: Stopping Network Time Synchronization...
Nov 28 16:22:53 st2 systemd[1]: Starting Network Time Synchronization...


Nov 28 16:23:02 st2 systemd-journal[494]: Journal stopped
Nov 28 16:23:02 st2 systemd-journal[632]: Runtime journal is using 624.0M (max allowed 642.1M, trying to leave 963.1M free of 5.6G available → current limit 642.1M).
Nov 28 16:23:02 st2 systemd-journal[632]: Runtime journal is using 624.0M (max allowed 642.1M, trying to leave 963.1M free of 5.6G available → current limit 642.1M).
Nov 28 16:23:02 st2 systemd-journal[632]: Journal started
Nov 28 16:23:02 st2 systemd[1]: systemd-journald.service watchdog timeout (limit 1min)!
Nov 28 16:23:02 st2 systemd-journald[494]: Received SIGTERM from PID 1 (systemd).
Nov 28 16:23:02 st2 systemd[1]: Unit systemd-journald.service entered failed state.
Nov 28 16:23:02 st2 systemd[1]: systemd-journald.service has no holdoff time, scheduling restart.
Nov 28 16:23:02 st2 systemd[1]: Stopping Journal Service...
Nov 28 16:23:02 st2 systemd[1]: Starting Journal Service...
Nov 28 16:23:02 st2 systemd[1]: Started Journal Service.
Nov 28 16:23:02 st2 systemd[1]: Starting Trigger Flushing of Journal to Persistent Storage...
Nov 28 16:23:02 st2 systemd[1]: systemd-journal-flush.service: main process exited, code=exited, status=1/FAILURE
Nov 28 16:23:02 st2 systemd[1]: Failed to start Trigger Flushing of Journal to Persistent Storage.
Nov 28 16:23:02 st2 systemd[1]: Unit systemd-journal-flush.service entered failed state.


It repeats itself every minute.
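
To confirm the once-a-minute pattern over a longer stretch of the log, the watchdog timeouts can be tallied per minute. This is a minimal sketch that assumes syslog-style lines ("Mon DD HH:MM:SS host ...") with one message per line; watchdog_per_minute is a hypothetical helper name.

```shell
# Count journald watchdog timeouts per minute in syslog-style text on stdin.
# Assumes the "Mon DD HH:MM:SS host ..." layout, one message per line.
watchdog_per_minute() {
    awk '/systemd-journald\.service watchdog timeout/ {
             n[$1 " " $2 " " substr($3, 1, 5)]++
         }
         END { for (k in n) print k, n[k] }' | sort
}
```

It could be fed with something like `journalctl --no-pager | watchdog_per_minute`. Note that the output is sorted as plain text, so entries from different months will not necessarily appear in chronological order.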

systemctl doesn't work:


# systemctl
Failed to get D-Bus connection: Connection refused
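
"Connection refused" on a Unix socket generally means the socket path exists but nothing is accepting connections on it. As root, systemctl normally talks to PID 1 over /run/systemd/private, so a rough first check is what sits at that path and what is actually running as PID 1. The helper names below are hypothetical, and the socket path is an assumption about this systemd version.

```shell
# Report what sits at a path: "socket", "other", or "missing".
path_kind() {
    if [ -S "$1" ]; then echo socket
    elif [ -e "$1" ]; then echo other
    else echo missing
    fi
}

# Name of the process running as PID 1 (expected here: systemd).
# The optional argument exists only so the helper can be tried on a test file.
pid1_name() {
    cat "${1:-/proc/1/comm}"
}
```

On the affected host this would be run as `path_kind /run/systemd/private` followed by `pid1_name`.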


I have 16 lxc containers running on the server:


# lxc-ls -f | grep RUNNING | wc -l
16


and 15 dbus-daemons (the count of 16 below includes the grep process
itself, so supposedly at least one dbus-daemon is missing):


# ps -ef | grep dbus
message+ 845 1 0 Feb15 ? 00:09:56 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
systemd+ 1615 579 0 Jun13 ? 00:00:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
root 1673 28602 0 16:26 pts/31 00:00:00 grep dbus
systemd+ 3761 3461 0 Feb15 ? 00:00:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
systemd+ 4635 3436 0 Feb15 ? 00:00:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
systemd+ 4767 3527 0 Feb15 ? 00:00:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
systemd+ 5344 3597 0 Feb15 ? 00:00:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
systemd+ 5714 3664 0 Feb15 ? 00:00:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
systemd+ 5793 3750 0 Feb15 ? 00:00:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
systemd+ 7856 7198 0 Oct18 ? 00:00:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
systemd+ 9477 8848 0 Oct18 ? 00:00:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
systemd+ 10930 10322 0 Oct18 ? 00:00:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
systemd+ 13130 10717 0 Jun27 ? 00:00:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
systemd+ 13300 11339 0 Apr03 ? 00:00:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
systemd+ 19689 19360 0 Jul28 ? 00:00:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
systemd+ 21045 20562 0 Oct19 ? 00:00:01 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation


# ps -ef | grep dbus | wc -l
16
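
Note that `ps -ef | grep dbus | wc -l` also counts the grep process itself (the `grep dbus` line in the listing above), so the listing actually shows 15 dbus-daemon processes. A count that matches on the command column avoids this. The sketch below assumes the usual `ps -ef` column layout (command in field 8, PPID in field 3); count_dbus_daemons is a hypothetical helper name.

```shell
# Summarise dbus-daemon processes from `ps -ef` output on stdin.
# Matches on the command column (field 8), so the grep process is not counted.
# Also reports how many have PPID 1 (field 3).
count_dbus_daemons() {
    awk '$8 ~ "(^|/)dbus-daemon$" {
             total++
             if ($3 == 1) ppid1++
         }
         END { printf "total=%d ppid1=%d\n", total, ppid1 }'
}
```

Usage would be `ps -ef | count_dbus_daemons`. The ppid1 figure only counts daemons whose parent, as seen from the host, is PID 1.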


My conjecture is that the first dbus-daemon belongs to the physical
host, since its PPID is 1 and it runs as user messagebus.

In the log for Nov 21 I can see what looks like a restart, starting with:


Nov 21 19:55:27 st2 systemd[320]: systemd 215 running in system mode. (+PAM +AUDIT +SELINUX +IMA +SYSVINIT +LIBCRYPTSETUP +GCRYPT +ACL +XZ -SECCOMP -APPARMOR)

https://gist.github.com/x-yuri/8dfe9e561327ad445b1713749cd83252


But I don't understand what triggered it.

Different tools report different times for the last reboot:


# last reboot
reboot system boot 3.16.0-4-amd64 Tue Nov 21 19:55 - 19:56 (00:01)

wtmp begins Thu Nov 2 17:22:02 2017

# who -b
system boot 2017-11-21 19:55

# journalctl --list-boots
0 606cc0c448794f2a8573fcdc2ba8d163 Fri 2017-10-13 05:09:18 EEST—Tue 2017-11-28 16:56:18 EET

# uptime
16:57:21 up 286 days, 13:22, 1 user, load average: 3.19, 3.32, 3.33
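
These tools read different sources: uptime takes its figure from the kernel (/proc/uptime), while `last reboot` and `who -b` read boot records from wtmp and utmp. They can therefore disagree if a boot record was written without the kernel itself restarting. Going by the uptime output, 286 days and 13:22 before Nov 28 16:57 is Feb 15 at about 03:35, which is consistent with the Feb15 start time of the oldest processes in the ps listing earlier in this message. A minimal sketch for getting the kernel's view directly (boot_epoch is a hypothetical helper name):

```shell
# Kernel's view of the boot time, as seconds since the epoch:
# current time minus the first field of /proc/uptime (fraction dropped).
# Both arguments are optional and exist so the helper can be run on sample data.
boot_epoch() {
    uptime_file="${1:-/proc/uptime}"
    now="${2:-$(date +%s)}"
    read -r up rest < "$uptime_file"
    echo $(( now - ${up%%.*} ))
}
```

With GNU date, `date -d "@$(boot_epoch)"` prints the result in readable form.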


Is there anything I can check? Any suggestions are welcome.

P.S.,


# cat /etc/issue
Debian GNU/Linux 8 \n \l


Regards,
Yuri
