Discussion:
Triggering the HW Watchdog
D.S. Ljungmark
2018-02-27 11:44:10 UTC
Hi list!

We're using systemd to control the hardware watchdog, and would want to
induce fail state to _verify_ that the shutdown/reboot process works as
expected.

How do we make systemd "fail" to ping the watchdog?

How do we control which states ( root fs not available, etc) cause
systemd to not ping the hardware watchdog?

//D.S.
--
8362 CB14 98AD 11EF CEB6 FA81 FCC3 7674 449E 3CFC
Lennart Poettering
2018-02-27 12:20:32 UTC
Post by D.S. Ljungmark
Hi list!
We're using systemd to control the hardware watchdog, and would want to
induce fail state to _verify_ that the shutdown/reboot process works as
expected.
How do we make systemd "fail" to ping the watchdog?
I figure you can send SIGSTOP to PID 1, no? (there are some signals
the kernel blocks for PID 1, but I think SIGSTOP is not among them,
please try)
Post by D.S. Ljungmark
How do we control which states ( root fs not available, etc) cause
systemd to not ping the hardware watchdog?
The watchdog is for detecting software hanging. Root fs not being
available does not really qualify as "software hanging". If you want
to reboot the machine if it fails to bring everything up, then use
JobTimeoutAction= on some suitable action, for example local-fs.target
or multi-user.target.

Lennart
--
Lennart Poettering, Red Hat
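
The JobTimeoutAction= suggestion above can be sketched as a drop-in for multi-user.target. This is a minimal sketch, not a tested configuration: the helper names, the drop-in file name `50-job-timeout.conf`, and the `10min` / `reboot-force` values are assumptions chosen for illustration. Only `JobTimeoutSec=` and `JobTimeoutAction=` are systemd `[Unit]` settings.

```python
from pathlib import Path


def render_job_timeout_dropin(timeout: str = "10min",
                              action: str = "reboot-force") -> str:
    """Return the text of a drop-in that reboots if the unit's job times out.

    The default values are illustrative assumptions, not recommendations.
    """
    return (
        "[Unit]\n"
        f"JobTimeoutSec={timeout}\n"
        f"JobTimeoutAction={action}\n"
    )


def dropin_path(unit: str, root: Path = Path("/etc/systemd/system")) -> Path:
    """Return where a drop-in for `unit` would live (file name is hypothetical)."""
    return root / f"{unit}.d" / "50-job-timeout.conf"


if __name__ == "__main__":
    # Print the drop-in instead of writing it, so nothing on the system changes.
    print(f"# {dropin_path('multi-user.target')}")
    print(render_job_timeout_dropin(), end="")
```

After placing such a file, `systemctl daemon-reload` is needed before the setting takes effect. The timeout should be chosen well above the normal boot time of the machine, otherwise a slow but healthy boot would be rebooted.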
D.S. Ljungmark
2018-02-27 14:12:57 UTC
( re-send as I forgot the list )

On 27/02/18 13:20, Lennart Poettering wrote:
Post by Lennart Poettering
Post by D.S. Ljungmark
Hi list!
We're using systemd to control the hardware watchdog, and would want to
induce fail state to _verify_ that the shutdown/reboot process works as
expected.
How do we make systemd "fail" to ping the watchdog?
I figure you can send SIGSTOP to PID 1, no? (there are some signals
the kernel blocks for PID 1, but I think SIGSTOP is not among them,
please try)
It seems that SIGSTOP is being filtered, because nothing appears to
happen, and the system certainly isn't rebooting.
Post by Lennart Poettering
Post by D.S. Ljungmark
How do we control which states ( root fs not available, etc) cause
systemd to not ping the hardware watchdog?
The watchdog is for detecting software hanging. Root fs not being
available does not really qualify as "software hanging". If you want
to reboot the machine if it fails to bring everything up, then use
JobTimeoutAction= on some suitable action, for example local-fs.target
or multi-user.target.
Lennart
Thanks,
I'm trying to get to a state where the machine fails over and triggers
watchdog on known things, rather than triggering the rescue shell or
similar.


I'll try with a jobtimeout on multi-user.

//D.S.
--
8362 CB14 98AD 11EF CEB6 FA81 FCC3 7674 449E 3CFC
Lennart Poettering
2018-02-27 14:21:00 UTC
Post by D.S. Ljungmark
Post by Lennart Poettering
I figure you can send SIGSTOP to PID 1, no? (there are some signals
the kernel blocks for PID 1, but I think SIGSTOP is not among them,
please try)
It seems that SIGSTOP is being filtered, because nothing appears to
happen, and the system certainly isn't rebooting.
You should be able to trigger an abort in PID 1 by sending it SIGABRT
or SIGQUIT or so. If PID 1 aborts it will actually enter a freeze loop
in which it stops pinging the hw watchdog.

Lennart
--
Lennart Poettering, Red Hat
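
The SIGABRT test described above can be sketched as a small script. This is a hedged sketch: the function name `abort_pid1` and the dry-run guard are assumptions added for safety, not part of the original advice. Whether the machine actually resets afterwards depends on the watchdog hardware and the configured timeout.

```python
import os
import signal


def abort_pid1(confirm: bool = False, pid: int = 1) -> str:
    """Send SIGABRT to `pid` (PID 1 by default), but only when confirm=True.

    Without confirm=True this is a dry run that only reports what it would do.
    """
    plan = f"send {signal.SIGABRT.name} to PID {pid}"
    if not confirm:
        return f"dry run: would {plan}"
    # Requires root. According to the thread, PID 1 then enters a freeze
    # loop and stops pinging the hardware watchdog.
    os.kill(pid, signal.SIGABRT)
    return f"done: {plan}"


if __name__ == "__main__":
    print(abort_pid1())
```

Call `abort_pid1(confirm=True)` as root only on a test machine, since a frozen PID 1 leaves the system unusable until it is reset.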
D.S. Ljungmark
2018-02-27 16:25:45 UTC
Post by Lennart Poettering
Post by D.S. Ljungmark
Post by Lennart Poettering
I figure you can send SIGSTOP to PID 1, no? (there are some signals
the kernel blocks for PID 1, but I think SIGSTOP is not among them,
please try)
It seems that SIGSTOP is being filtered, because nothing appears to
happen, and the system certainly isn't rebooting.
You should be able to trigger an abort in PID 1 by sending it SIGABRT
or SIGQUIT or so. If PID 1 aborts it will actually enter a freeze loop
in which it stops pinging the hw watchdog.
Lennart
ABRT works, or well..

systemd[1]: Caught <ABRT>, core dump failed (child 3844, code=killed,
status=6/ABRT).

And then a broadcast, freezing execution


And after that, what I was afraid of:

[25417.186351] watchdog: watchdog0: watchdog did not stop!


Well, that gives me a tool to debug this with, Thank you!


//D.S
--
8362 CB14 98AD 11EF CEB6 FA81 FCC3 7674 449E 3CFC
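
To see what the kernel reports about the watchdog before and after such a test, the device's sysfs attributes can be read. A minimal sketch, assuming the kernel exposes watchdog attributes under `/sys/class/watchdog/watchdog0` (this depends on the kernel configuration and the driver); the function name `read_watchdog` and the chosen attribute list are illustrative.

```python
from pathlib import Path

# Attributes commonly exposed by the kernel watchdog core; any that the
# running kernel or driver does not provide are reported as None.
ATTRIBUTES = ("identity", "state", "timeout", "timeleft", "nowayout")


def read_watchdog(device_dir: Path = Path("/sys/class/watchdog/watchdog0")) -> dict:
    """Return a dict of watchdog attribute name -> value (or None if unreadable)."""
    info = {}
    for name in ATTRIBUTES:
        try:
            info[name] = (device_dir / name).read_text().strip()
        except OSError:
            info[name] = None
    return info


if __name__ == "__main__":
    for key, value in read_watchdog().items():
        print(f"{key}: {value}")
```

Comparing `timeout` with the interval at which the machine actually resets helps distinguish a watchdog that never fires from one that fires later than expected.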
Mantas Mikulėnas
2018-02-27 16:34:37 UTC
Post by D.S. Ljungmark
Post by Lennart Poettering
Post by D.S. Ljungmark
Post by Lennart Poettering
I figure you can send SIGSTOP to PID 1, no? (there are some signals
the kernel blocks for PID 1, but I think SIGSTOP is not among them,
please try)
It seems that SIGSTOP is being filtered, because nothing appears to
happen, and the system certainly isn't rebooting.
You should be able to trigger an abort in PID 1 by sending it SIGABRT
or SIGQUIT or so. If PID 1 aborts it will actually enter a freeze loop
in which it stops pinging the hw watchdog.
Lennart
ABRT works, or well..
systemd[1]: Caught <ABRT>, core dump failed (child 3844, code=killed,
status=6/ABRT).
And then a broadcast, freezing execution
[25417.186351] watchdog: watchdog0: watchdog did not stop!
Isn't that exactly the result you asked for?
--
Mantas Mikulėnas
D.S. Ljungmark
2018-02-27 22:34:23 UTC
Partially.

It shows that systemd is handling the watchdog as I expect it to
here, but it also means that the "dysfunctional" times when the
system isn't resetting properly are _not_ due to the watchdog triggering;
according to systemd, the system is "normal" in those cases.

That is the worse case for me, since it's harder to debug.

So, conclusion:
systemd seems to handle the watchdog properly
systemd seems to not die when we expect it to, leaving us with
more debugging to do.

I hope that makes more sense than less.
Ray, Ian (GE Healthcare)
2018-02-27 14:19:39 UTC
Post by D.S. Ljungmark
( re-send as I forgot the list )
On 27/02/18 13:20, Lennart Poettering wrote:
Post by Lennart Poettering
Post by D.S. Ljungmark
Hi list!
We're using systemd to control the hardware watchdog, and would want to
induce fail state to _verify_ that the shutdown/reboot process works as
expected.
How do we make systemd "fail" to ping the watchdog?
I figure you can send SIGSTOP to PID 1, no? (there are some signals
the kernel blocks for PID 1, but I think SIGSTOP is not among them,
please try)
It seems that SIGSTOP is being filtered, because nothing appears to
happen, and the system certainly isn't rebooting.
This works for me: `gdb --pid 1'.
Post by D.S. Ljungmark
Post by Lennart Poettering
Post by D.S. Ljungmark
How do we control which states ( root fs not available, etc) cause
systemd to not ping the hardware watchdog?
The watchdog is for detecting software hanging. Root fs not being
available does not really qualify as "software hanging". If you want
to reboot the machine if it fails to bring everything up, then use
JobTimeoutAction= on some suitable action, for example local-fs.target
or multi-user.target.
Lennart
Thanks,
I'm trying to get to a state where the machine fails over and triggers
watchdog on known things, rather than triggering the rescue shell or
similar.
I'll try with a jobtimeout on multi-user.
//D.S.
--
8362 CB14 98AD 11EF CEB6 FA81 FCC3 7674 449E 3CFC
_______________________________________________
systemd-devel mailing list
https://lists.freedesktop.org/mailman/listinfo/systemd-devel