Discussion:
systemd-journald may crash during memory pressure
Kai Krakow
2018-02-08 22:50:45 UTC
Hello!

During memory pressure and/or high load, journald may crash. This is
probably due to its mmap-based design, but it really should not do this.

On 32-bit systems, we are seeing such crashes constantly even though
gigabytes of memory are still available (it's a 32-bit userland running
on a 64-bit kernel).


[82988.670323] systemd[1]: systemd-journald.service: Main process exited, code=dumped, status=6/ABRT
[82988.670684] systemd[1]: systemd-journald.service: Failed with result 'watchdog'.
[82988.685928] systemd[1]: systemd-journald.service: Service has no hold-off time, scheduling restart.
[82988.709575] systemd[1]: systemd-journald.service: Scheduled restart job, restart counter is at 2.
[82988.717390] systemd[1]: Stopped Flush Journal to Persistent Storage.
[82988.717411] systemd[1]: Stopping Flush Journal to Persistent Storage...
[82988.726303] systemd[1]: Stopped Journal Service.
[82988.844462] systemd[1]: Starting Journal Service...
[82993.633781] systemd-coredump[22420]: MESSAGE=Process 461 (systemd-journal) of user 0 dumped core.
[82993.633811] systemd-coredump[22420]: Coredump diverted to /var/lib/systemd/coredump/core.systemd-journal.0.3d492c866f254fb981f916c6c3918046.461.1518125377000000.lz4
[82993.633813] systemd-coredump[22420]: Stack trace of thread 461:
[82993.633814] systemd-coredump[22420]: #0 0x00007f940241d4dd journal_file_move_to_object (libsystemd-shared-237.so)
[82993.633815] systemd-coredump[22420]: #1 0x00007f940241e910 journal_file_find_data_object_with_hash (libsystemd-shared-237.so)
[82993.633816] systemd-coredump[22420]: #2 0x00007f940241fe81 journal_file_append_data (libsystemd-shared-237.so)
[82993.633817] systemd-coredump[22420]: #3 0x0000556a343ae9ea write_to_journal (systemd-journald)
[82993.633819] systemd-coredump[22420]: #4 0x0000556a343b0974 server_dispatch_message (systemd-journald)
[82993.633820] systemd-coredump[22420]: #5 0x0000556a343b24bb stdout_stream_log (systemd-journald)
[82993.633821] systemd-coredump[22420]: #6 0x0000556a343b2afe stdout_stream_line (systemd-journald)
[82993.723157] systemd-coredum: 7 output lines suppressed due to ratelimiting
[83002.830610] systemd-journald[22424]: File /var/log/journal/121b87ca633e8ac0016656680000001b/system.journal corrupted or uncleanly shut down, renaming and replacing.
[83014.774538] systemd[1]: Started Journal Service.
[83119.277143] systemd-journald[22424]: File /var/log/journal/121b87ca633e8ac0016656680000001b/user-500.journal corrupted or uncleanly shut down, renaming and replacing.
--
Regards,
Kai

Replies to list-only preferred.
v***@pengaru.com
2018-02-09 02:12:23 UTC
Note the logs you've pasted portray a watchdog timeout which resulted in
SIGABRT and a subsequent core dump.

This is not really a journald "crash", and you can increase the watchdog
timeout or disable it entirely to make it more tolerant of thrashing.

What I presume happened is the system was thrashing and a page fault in
the mapped journal took too long to complete.
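
As a rough sketch (assuming the watchdog in question is the WatchdogSec=
setting of systemd-journald.service, which is also what produced the
"Failed with result 'watchdog'" line above), a drop-in like this should
do it:

  # systemctl edit systemd-journald.service
  # -> creates /etc/systemd/system/systemd-journald.service.d/override.conf
  [Service]
  # raise the timeout; WatchdogSec=0 disables the watchdog entirely
  WatchdogSec=10min

  # then apply:
  # systemctl daemon-reload
  # systemctl restart systemd-journald

Whether raising it or disabling it is the better trade-off depends on how
much you care about being notified when journald really does hang.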

Regards,
Vito Caputo
Kai Krakow
2018-02-09 09:43:21 UTC
Post by v***@pengaru.com
Note the logs you've pasted portray a watchdog timeout which resulted in
SIGABRT and a subsequent core dump.
This is not really a journald "crash", and you can increase the watchdog
timeout or disable it entirely to make it more tolerant of thrashing.
What I presume happened is the system was thrashing and a page fault in
the mapped journal took too long to complete.
Oh thanks, this is a good pointer. I'll try that.
--
Regards,
Kai

Replies to list-only preferred.
Lennart Poettering
2018-02-09 11:41:50 UTC
Post by Kai Krakow
Hello!
During memory pressure and/or high load, journald may crash. This is
probably due to its mmap-based design, but it really should not do this.
On 32-bit systems, we are seeing such crashes constantly even though
gigabytes of memory are still available (it's a 32-bit userland running
on a 64-bit kernel).
[82988.670323] systemd[1]: systemd-journald.service: Main process exited, code=dumped, status=6/ABRT
[82988.670684] systemd[1]: systemd-journald.service: Failed with result 'watchdog'.
These last log lines indicate journald wasn't scheduled for a long
time, which caused the watchdog to hit, and journald was
aborted. Consider increasing the watchdog timeout if your system is
indeed that loaded and that's supposed to be an OK thing...

Lennart
--
Lennart Poettering, Red Hat
Uoti Urpala
2018-02-10 00:16:44 UTC
Post by Lennart Poettering
These last log lines indicate journald wasn't scheduled for a long
time, which caused the watchdog to hit, and journald was
aborted. Consider increasing the watchdog timeout if your system is
indeed that loaded and that's supposed to be an OK thing...
BTW I've seen the same behavior on a system with a single active
process that uses enough memory to trigger significant swap use. I
wonder if there has been a regression in the kernel causing misbehavior
when swapping? The problems aren't specific to journald - the desktop
environment can freeze completely too, etc.
Kai Krakow
2018-02-10 13:23:34 UTC
Post by Uoti Urpala
Post by Lennart Poettering
These last log lines indicate journald wasn't scheduled for a long
time, which caused the watchdog to hit, and journald was aborted.
Consider increasing the watchdog timeout if your system is indeed that
loaded and that's supposed to be an OK thing...
BTW I've seen the same behavior on a system with a single active
process that uses enough memory to trigger significant swap use. I
wonder if there has been a regression in the kernel causing misbehavior
when swapping? The problems aren't specific to journald - the desktop
environment can freeze completely too, etc.
This problem seems to have been there since kernel 4.9, which was a real
pita in this regard. It has progressively become better since kernel
4.10. Since then, the kernel seems to try to avoid swapping at any cost,
even at the cost of much higher latency and of pushing all cache out of
RAM.

The result is processes stuck for easily 30 seconds or more during
memory pressure. Sometimes I see the kernel loudly complaining in dmesg
about high wait times for allocating RAM, especially from the btrfs
module. Thus, the biggest problem may be that kernel threads themselves
get stuck in memory allocations and become victims of high latency.

Currently I'm running my user session in a slice capped at 80% of RAM,
which seems to help: it keeps the kernel from discarding all cache. I
also put some potentially high memory users (regarding cache and/or
resident memory) into slices with carefully selected memory limits
(backup and maintenance services). Slices limited in such a way will
start swapping before cache is discarded, and everything works better
again. Part of this problem may be that I have one process running which
mmaps and locks 1G of memory (bees, a btrfs deduplicator).
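
For illustration, this is roughly what I mean (a sketch only; the unit
names and numbers are specific to my setup, and MemoryMax=/MemoryHigh=
need the unified cgroup hierarchy - on legacy cgroup v1 the rough
equivalent is MemoryLimit=):

  # /etc/systemd/system/user.slice.d/override.conf
  # cap all user sessions so some file cache always survives
  [Slice]
  MemoryMax=80%

  # hypothetical drop-in for a memory-hungry backup/maintenance service,
  # e.g. /etc/systemd/system/my-backup.service.d/override.conf
  [Service]
  MemoryHigh=2G
  MemoryMax=4G

  # the same can be tried at runtime without drop-in files:
  # systemctl set-property user.slice MemoryMax=80%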

This system has 16G of RAM, which is usually plenty, but I use tmpfs to
build packages in Gentoo, and while that worked wonderfully before 4.9,
I have to be really careful now. The kernel happily throws away cache
instead of swapping early. Changing vm.swappiness seems to have no
perceptible effect.

Software that uses mmap is the first latency victim of this new
behavior. As a result, systemd-journald also seems to be hit hard by it.

After the system has recovered from high memory pressure (which can take
10-15 minutes, resulting in a loadavg of 400+), it ends up with some
gigabytes of inactive memory in swap, which it will only swap back in
during shutdown (which then also takes some minutes).

The problem since 4.9 seems to be that the kernel tends to do swap storms
instead of constantly swapping out memory at low rates during usage. The
swap storms totally thrash the system.

Before 4.9, the kernel had no such latency spikes under memory pressure.
Swap usage would usually grow slowly over time, and the system felt
sluggish now and then but was still usable wrt latency. I usually ended
up with 5-8G of swap usage, and that was no problem. Now, swap only
grows significantly during swap storms that leave the system unusable
for many minutes, with latencies of 10+ seconds around twice per minute.

I haven't had a swap storm yet since the last boot, and swap usage is
around 16M now. Before kernel 4.9, this would already be much higher.
--
Regards,
Kai

Replies to list-only preferred.
Kai Krakow
2018-02-10 14:05:16 UTC
After some more research, I found that vm.watermark_scale_factor may be
the knob I am looking for. I'm going to watch behavior now with a higher
factor (default = 10, now 200).
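
In case someone wants to try the same, this is what I'm doing (the value
200 is just an experiment, not a recommendation, and the sysctl.d file
name is arbitrary):

  # try it at runtime:
  # sysctl -w vm.watermark_scale_factor=200

  # make it persistent, e.g. in /etc/sysctl.d/90-watermark.conf:
  vm.watermark_scale_factor = 200

As I understand it, a larger value widens the gap between the kernel's
memory watermarks, so kswapd starts background reclaim earlier instead
of letting allocations stall in direct reclaim.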
--
Regards,
Kai

Replies to list-only preferred.
v***@pengaru.com
2018-02-10 17:39:35 UTC
Post by Kai Krakow
After some more research, I found that vm.watermark_scale_factor may be
the knob I am looking for. I'm going to watch behavior now with a higher
factor (default = 10, now 200).
Have you reported this to the kernel maintainers? LKML?

While this is interesting to read on systemd-devel, it's not the right
venue. What you describe sounds like a regression that probably should
be improved upon.

Also, out of curiosity, are you running dmcrypt in this scenario? If
so, is swap on dmcrypt as well?

Regards,
Vito Caputo
Kai Krakow
2018-02-10 19:25:55 UTC
Post by v***@pengaru.com
Post by Kai Krakow
After some more research, I found that vm.watermark_scale_factor may be
the knob I am looking for. I'm going to watch behavior now with a
higher factor (default = 10, now 200).
Have you reported this to the kernel maintainers? LKML?
No, not yet. I think they are aware of the issues, as there's still
ongoing work on memory allocations within kernel threads, and there's a
perceptible improvement with every new kernel version. In particular,
btrfs has seen a few patches in this area.
Post by v***@pengaru.com
While this is interesting to read on systemd-devel, it's not the right
venue. What you describe sounds like a regression that probably should
be improved upon.
I know it's mostly off-topic. But the problem is most visible in
systemd-journald, and I think there are some users here who may have a
better understanding of the underlying problem, or may even have found
solutions to it.

One approach for me was using systemd-specific slices, so it may be
interesting to other people.
Post by v***@pengaru.com
Also, out of curiosity, are you running dmcrypt in this scenario? If
so, is swap on dmcrypt as well?
No, actually not. I'm using bcache for the rootfs, which may have
similar implications for memory allocations. Swap is just plain swap
distributed across 4 disks.

If I understand correctly, dmcrypt may expose this problem further
because it needs to "double buffer" memory while passing it further down
the storage stack.

I had zswap enabled previously, which may expose this problem too. I
have now disabled it and later re-enabled THP, and THP runs very well
again. It looks like zswap and THP don't play well together. OTOH, these
options were switched on and off across different kernel versions, so it
may also be an effect of fixes in newer kernels.
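
For reference, these are the switches I mean (paths as on my system;
zswap can also be turned off at boot with the zswap.enabled=0 kernel
command line parameter):

  # check and disable zswap at runtime:
  # cat /sys/module/zswap/parameters/enabled
  # echo 0 > /sys/module/zswap/parameters/enabled

  # re-enable THP (possible values: always / madvise / never):
  # echo always > /sys/kernel/mm/transparent_hugepage/enabled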
--
Regards,
Kai

Replies to list-only preferred.