Discussion:
Erroneous detection of degraded array
Luke Pyzowski
2017-01-26 18:02:50 UTC
Hello,
I have a large RAID6 device with 24 local drives on CentOS 7.3. Randomly (around 50% of the time) systemd will unmount my RAID device after the mdadm-last-resort@.timer expires, thinking it is degraded, even though the device is working normally by all accounts, and I can mount it manually immediately upon boot completion. In the logs below, /share is the RAID device. I can increase the timer in /usr/lib/systemd/system/mdadm-last-resort@.timer from 30 to 60 seconds, but the problem can still occur randomly.

systemd[1]: Created slice system-mdadm\x2dlast\x2dresort.slice.
systemd[1]: Starting system-mdadm\x2dlast\x2dresort.slice.
systemd[1]: Starting Activate md array even though degraded...
systemd[1]: Stopped target Local File Systems.
systemd[1]: Stopping Local File Systems.
systemd[1]: Unmounting /share...
systemd[1]: Stopped (with error) /dev/md0.
systemd[1]: Started Activate md array even though degraded.
systemd[1]: Unmounted /share.

When the system boots normally the following is in the logs:
systemd[1]: Started Timer to wait for more drives before activating degraded array..
systemd[1]: Starting Timer to wait for more drives before activating degraded array..
...
systemd[1]: Stopped Timer to wait for more drives before activating degraded array..
systemd[1]: Stopping Timer to wait for more drives before activating degraded array..

The above occurs within the same second according to the timestamps, and the timer ends prior to mounting any local filesystems; it properly detects that the RAID is valid and everything continues normally. The other RAID device - a RAID1 of 2 disks containing swap and / - has never exhibited this failure.

My question is: under what conditions does systemd detect the RAID6 as being degraded? It seems to be a race condition somewhere, but I am not sure what configuration, if any, should be modified. If needed I can provide more verbose logs; just let me know if they might be useful.

Many thanks,
Luke Pyzowski
Reindl Harald
2017-01-26 18:09:51 UTC
Post by Luke Pyzowski
systemd[1]: Created slice system-mdadm\x2dlast\x2dresort.slice.
systemd[1]: Starting system-mdadm\x2dlast\x2dresort.slice.
systemd[1]: Starting Activate md array even though degraded...
systemd[1]: Stopped target Local File Systems.
systemd[1]: Stopping Local File Systems.
systemd[1]: Unmounting /share...
systemd[1]: Stopped (with error) /dev/md0.
systemd[1]: Started Activate md array even though degraded.
systemd[1]: Unmounted /share.
that also happens randomly in my Fedora 24 testing-vm with a RAID10 and
you can be sure that in a virtual machine drives don't disappear or take
long to appear
Andrei Borzenkov
2017-01-27 07:12:22 UTC
Post by Luke Pyzowski
Hello,
systemd[1]: Created slice system-mdadm\x2dlast\x2dresort.slice.
systemd[1]: Starting system-mdadm\x2dlast\x2dresort.slice.
systemd[1]: Starting Activate md array even though degraded...
systemd[1]: Stopped target Local File Systems.
systemd[1]: Stopping Local File Systems.
systemd[1]: Unmounting /share...
systemd[1]: Stopped (with error) /dev/md0.
systemd[1]: Started Activate md array even though degraded.
systemd[1]: Unmounted /share.
systemd[1]: Started Timer to wait for more drives before activating degraded array..
systemd[1]: Starting Timer to wait for more drives before activating degraded array..
...
systemd[1]: Stopped Timer to wait for more drives before activating degraded array..
systemd[1]: Stopping Timer to wait for more drives before activating degraded array..
The above occurs within the same second according to the timestamps and the timer ends prior to mounting any local filesystems, it properly detects that the RAID is valid and everything continues normally. The other RAID device - a RAID1 of 2 disks containing swap and / have never exhibited this failure.
My question is, what are the conditions where systemd detects the RAID6 as being degraded? It seems to be a race condition somewhere, but I am not sure what configuration should be modified if any. If needed I can provide more verbose logs, just let me know if they might be useful.
It is not directly related to systemd. When a block device that is part
of an MD array is detected by the kernel, a udev rule queries the array
to see whether it is complete. If it is, it starts the array (subject to
the general rules for which arrays are auto-started); if not, the udev
rule starts a timer to assemble the degraded array.

See udev-md-raid-assembly.rules in mdadm sources:

ACTION=="add|change", ENV{MD_STARTED}=="*unsafe*", ENV{MD_FOREIGN}=="no", ENV{SYSTEMD_WANTS}+="mdadm-last-resort@$env{MD_DEVICE}.timer"

So it looks like events for some array members either got lost or are
delivered late.

Note that there was a discussion on the openSUSE list where arrays would
not be auto-assembled on boot, even though triggering a device change
*after* initial boot would correctly run these rules. The situation was
triggered by adding an extra disk to the system (i.e. boot with 3 disks
worked, with 4 disks it did not). I could not find any hints even after
enabling full udev and systemd debug logs. The logs are available if
anyone wants to try it.
Martin Wilck
2017-01-27 08:25:02 UTC
Post by Luke Pyzowski
Hello,
I have a large RAID6 device with 24 local drives on CentOS7.3.
Randomly (around 50% of the time) systemd will unmount my RAID
device thinking it is degraded after the mdadm-last-resort@.timer
expires, however the device is working normally by all accounts,
and I can immediately mount it manually upon boot completion. In
the logs below /share is the RAID device. I can increase the timer
in /usr/lib/systemd/system/mdadm-last-resort@.timer from 30 to 60
seconds, but this problem can randomly still occur.
It seems to me that you rather need to decrease the timeout value, or
(more reasonably) increase x-systemd.device-timeout for the /share
mount point.
Unfortunately your log excerpt contains no time stamps, but I suppose
you're facing a race where the device times out before the "last
resort" timer starts it (and before the last devices appear).
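
For reference, a sketch of the fstab approach described above, using the
filesystem UUID Luke reports later in this thread; the 180-second value is
only an example, not a recommendation:

```
# /etc/fstab - give the md device extra time to appear before systemd
# gives up waiting for it (x-systemd.device-timeout is a systemd.mount(5)
# mount option; 180s here is an arbitrary example value)
UUID=2b9114be-3d5a-41d7-8d4b-e5047d223129  /share  ext4  defaults,x-systemd.device-timeout=180  0  2
```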

Martin
--
Dr. Martin Wilck <***@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Luke Pyzowski
2017-01-27 19:44:22 UTC
I've modified a number of settings to try to resolve this; so far no success.
I've created an explicit mount file for the RAID array: /etc/systemd/system/share.mount
Inside there I've experimented with TimeoutSec=

In /etc/systemd/system/mdadm-last-resort@.timer I've worked with
OnActiveSec=

I've also tried (without an explicit mount file) to add x-systemd.device-timeout to /etc/fstab for the mount.

Here are a few more system logs showing perhaps more detail. I've edited them to show only relevant details; the full pastebin of logs is at http://pastebin.com/sL8nKt7j
These logs were generated with TimeoutSec=120 in /etc/systemd/system/share.mount; the description of the mount in the logs is: "Mount /share RAID partition explicitly"
And OnActiveSec=30 in /etc/systemd/system/mdadm-last-resort@.timer
From blkid:
/dev/md0: UUID="2b9114be-3d5a-41d7-8d4b-e5047d223129" TYPE="ext4"
/dev/md0 is the /share partition.

From /etc/mdadm.conf:
ARRAY /dev/md/0 metadata=1.2 UUID=97566d2f:ae7a169b:966f5840:3e8267f9 name=lnxnfs01:0

Boot begins at Jan 27 11:33:10
+4 seconds from boot:
Jan 27 11:33:14 lnxnfs01 systemd[1]: Found device /dev/disk/by-uuid/283669e9-f32c-498d-b848-c6f91738c959.
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdc operational as raid disk 2
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdx operational as raid disk 23
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdu operational as raid disk 20
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdt operational as raid disk 19
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdo operational as raid disk 14
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdn operational as raid disk 13
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdd operational as raid disk 3
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdv operational as raid disk 21
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sda operational as raid disk 0
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdf operational as raid disk 5
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdm operational as raid disk 12
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sde operational as raid disk 4
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdp operational as raid disk 15
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdi operational as raid disk 8
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdl operational as raid disk 11
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdk operational as raid disk 10
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sds operational as raid disk 18
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdb operational as raid disk 1
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdj operational as raid disk 9
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdg operational as raid disk 6
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdr operational as raid disk 17
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdh operational as raid disk 7
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdq operational as raid disk 16
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdw operational as raid disk 22
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: allocated 25534kB
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: raid level 6 active with 24 out of 24 devices, algorithm 2
Jan 27 11:33:14 lnxnfs01 kernel: RAID conf printout:
Jan 27 11:33:14 lnxnfs01 kernel: --- level:6 rd:24 wd:24
Jan 27 11:33:14 lnxnfs01 kernel: disk 0, o:1, dev:sda
Jan 27 11:33:14 lnxnfs01 kernel: disk 1, o:1, dev:sdb
Jan 27 11:33:14 lnxnfs01 kernel: disk 2, o:1, dev:sdc
Jan 27 11:33:14 lnxnfs01 kernel: disk 3, o:1, dev:sdd
Jan 27 11:33:14 lnxnfs01 kernel: disk 4, o:1, dev:sde
Jan 27 11:33:14 lnxnfs01 kernel: disk 5, o:1, dev:sdf
Jan 27 11:33:14 lnxnfs01 kernel: disk 6, o:1, dev:sdg
Jan 27 11:33:14 lnxnfs01 kernel: disk 7, o:1, dev:sdh
Jan 27 11:33:14 lnxnfs01 kernel: disk 8, o:1, dev:sdi
Jan 27 11:33:14 lnxnfs01 kernel: disk 9, o:1, dev:sdj
Jan 27 11:33:14 lnxnfs01 kernel: disk 10, o:1, dev:sdk
Jan 27 11:33:14 lnxnfs01 kernel: disk 11, o:1, dev:sdl
Jan 27 11:33:14 lnxnfs01 kernel: disk 12, o:1, dev:sdm
Jan 27 11:33:14 lnxnfs01 kernel: disk 13, o:1, dev:sdn
Jan 27 11:33:14 lnxnfs01 kernel: disk 14, o:1, dev:sdo
Jan 27 11:33:14 lnxnfs01 kernel: disk 15, o:1, dev:sdp
Jan 27 11:33:14 lnxnfs01 kernel: disk 16, o:1, dev:sdq
Jan 27 11:33:14 lnxnfs01 kernel: disk 17, o:1, dev:sdr
Jan 27 11:33:14 lnxnfs01 kernel: disk 18, o:1, dev:sds
Jan 27 11:33:14 lnxnfs01 kernel: disk 19, o:1, dev:sdt
Jan 27 11:33:14 lnxnfs01 kernel: disk 20, o:1, dev:sdu
Jan 27 11:33:14 lnxnfs01 kernel: disk 21, o:1, dev:sdv
Jan 27 11:33:14 lnxnfs01 kernel: disk 22, o:1, dev:sdw
Jan 27 11:33:14 lnxnfs01 kernel: disk 23, o:1, dev:sdx
Jan 27 11:33:14 lnxnfs01 kernel: md0: detected capacity change from 0 to 45062020923392
Jan 27 11:33:14 lnxnfs01 systemd[1]: Found device /dev/disk/by-uuid/2b9114be-3d5a-41d7-8d4b-e5047d223129.
Jan 27 11:33:14 lnxnfs01 systemd[1]: Started udev Wait for Complete Device Initialization.
Jan 27 11:33:14 lnxnfs01 systemd[1]: Started Timer to wait for more drives before activating degraded array..
Jan 27 11:33:14 lnxnfs01 systemd[1]: Starting Timer to wait for more drives before activating degraded array..
Jan 27 11:33:14 lnxnfs01 systemd[1]: Created slice system-lvm2\x2dpvscan.slice.
Jan 27 11:33:14 lnxnfs01 systemd[1]: Starting system-lvm2\x2dpvscan.slice.
Jan 27 11:33:14 lnxnfs01 systemd[1]: Starting LVM2 PV scan on device 9:127...
Jan 27 11:33:14 lnxnfs01 systemd[1]: Starting Activation of DM RAID sets...
Jan 27 11:33:14 lnxnfs01 systemd[1]: Mounting Mount /share RAID partition explicitly...
Jan 27 11:33:14 lnxnfs01 systemd[1]: Starting File System Check on /dev/disk/by-uuid/283669e9-f32c-498d-b848-c6f91738c959...
Jan 27 11:33:14 lnxnfs01 systemd[1]: Activating swap /dev/mapper/vg_root-swap...
Jan 27 11:33:14 lnxnfs01 lvm[1494]: 2 logical volume(s) in volume group "vg_root" now active
Jan 27 11:33:14 lnxnfs01 systemd[1]: Started LVM2 PV scan on device 9:127.
Jan 27 11:33:14 lnxnfs01 systemd[1]: Activated swap /dev/mapper/vg_root-swap.
Jan 27 11:33:14 lnxnfs01 kernel: Adding 32763900k swap on /dev/mapper/vg_root-swap. Priority:-1 extents:1 across:32763900k FS
Jan 27 11:33:14 lnxnfs01 systemd-fsck[1499]: /dev/md126: clean, 345/64128 files, 47930/256240 blocks
Jan 27 11:33:14 lnxnfs01 systemd[1]: Reached target Swap.
Jan 27 11:33:14 lnxnfs01 systemd[1]: Starting Swap.
Jan 27 11:33:14 lnxnfs01 systemd[1]: Started File System Check on /dev/disk/by-uuid/283669e9-f32c-498d-b848-c6f91738c959.
Jan 27 11:33:14 lnxnfs01 systemd[1]: Mounting /boot...
Jan 27 11:33:14 lnxnfs01 systemd[1]: Mounted /boot.
Jan 27 11:33:14 lnxnfs01 kernel: EXT4-fs (md126): mounted filesystem with ordered data mode. Opts: (null)
Jan 27 11:33:21 lnxnfs01 systemd[1]: Started Activation of DM RAID sets.
Jan 27 11:33:21 lnxnfs01 systemd[1]: Reached target Encrypted Volumes.
Jan 27 11:33:21 lnxnfs01 systemd[1]: Starting Encrypted Volumes.
Jan 27 11:33:21 lnxnfs01 systemd[1]: Reached target Local File Systems.
Jan 27 11:33:21 lnxnfs01 systemd[1]: Starting Local File Systems.
...
Jan 27 11:33:21 lnxnfs01 kernel: EXT4-fs (md0): mounted filesystem with ordered data mode. Opts: (null)
Jan 27 11:33:21 lnxnfs01 systemd[1]: Mounted Mount /share RAID partition explicitly.

... + 31 seconds from disk initialization, expiration of 30 second timer from mdadm-last-resort@.timer

Jan 27 11:33:45 lnxnfs01 systemd[1]: Created slice system-mdadm\x2dlast\x2dresort.slice.
Jan 27 11:33:45 lnxnfs01 systemd[1]: Starting system-mdadm\x2dlast\x2dresort.slice.
Jan 27 11:33:45 lnxnfs01 systemd[1]: Stopped target Local File Systems.
Jan 27 11:33:45 lnxnfs01 systemd[1]: Stopping Local File Systems.
Jan 27 11:33:45 lnxnfs01 systemd[1]: Unmounting Mount /share RAID partition explicitly...
Jan 27 11:33:45 lnxnfs01 systemd[1]: Starting Activate md array even though degraded...
Jan 27 11:33:45 lnxnfs01 systemd[1]: Stopped (with error) /dev/md0.
Jan 27 11:33:45 lnxnfs01 systemd[1]: Started Activate md array even though degraded.
Jan 27 11:33:45 lnxnfs01 systemd[1]: Unmounted Mount /share RAID partition explicitly.

... + 121 seconds from disk initialization, expiration of 120 second timer from TimeoutSec=120 in /etc/systemd/system/share.mount

Jan 27 11:35:15 lnxnfs01 systemd[1]: Job dev-md-0.device/stop timed out.
Jan 27 11:35:15 lnxnfs01 systemd[1]: Timed out stoppping /dev/md/0.
Jan 27 11:35:15 lnxnfs01 systemd[1]: Job dev-md-0.device/stop failed with result 'timeout'.
Jan 27 11:35:15 lnxnfs01 systemd[1]: Job dev-disk-by\x2did-md\x2dname\x2dlnxnfs01:0.device/stop timed out.
Jan 27 11:35:15 lnxnfs01 systemd[1]: Timed out stoppping /dev/disk/by-id/md-name-lnxnfs01:0.
Jan 27 11:35:15 lnxnfs01 systemd[1]: Job dev-disk-by\x2did-md\x2dname\x2dlnxnfs01:0.device/stop failed with result 'timeout'.
Jan 27 11:35:15 lnxnfs01 systemd[1]: Job sys-devices-virtual-block-md0.device/stop timed out.
Jan 27 11:35:15 lnxnfs01 systemd[1]: Timed out stoppping /sys/devices/virtual/block/md0.
Jan 27 11:35:15 lnxnfs01 systemd[1]: Job sys-devices-virtual-block-md0.device/stop failed with result 'timeout'.
Jan 27 11:35:15 lnxnfs01 systemd[1]: Job dev-disk-by\x2did-md\x2duuid\x2d97566d2f:ae7a169b:966f5840:3e8267f9.device/stop timed out.
Jan 27 11:35:15 lnxnfs01 systemd[1]: Timed out stoppping /dev/disk/by-id/md-uuid-97566d2f:ae7a169b:966f5840:3e8267f9.
Jan 27 11:35:15 lnxnfs01 systemd[1]: Job dev-disk-by\x2did-md\x2duuid\x2d97566d2f:ae7a169b:966f5840:3e8267f9.device/stop failed with result 'timeout'.


In the logs above, the timer is started to wait for more devices, but it does not stop when the array comes up complete. It runs for the full 30 seconds and then kicks out my mount, believing the RAID is degraded.

As an experiment, I've increased the values for the mdadm-last-resort@.timer to OnActiveSec=300 and TimeoutSec=320. This gives me enough time to log in to the system. During that time I can view the RAID and everything appears proper, yet 300 seconds later the mdadm-last-resort@.timer expires with an error on /dev/md0.

Perhaps systemd is working normally, but then the question is why the RAID is not being assembled properly - which is what triggers /usr/lib/systemd/system/mdadm-last-resort@.service.

From your suggestion, should I next move to full udev debug logs? Can I delay the assembly of the RAID by the kernel, if this is indeed a race condition, as it appears it might be? Should that delay be early in the systemd startup process or elsewhere?

Thanks again,
Luke Pyzowski
Andrei Borzenkov
2017-01-28 17:34:26 UTC
On 27.01.2017 22:44, Luke Pyzowski wrote:
...
Post by Luke Pyzowski
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: raid level 6 active with 24 out of 24 devices, algorithm 2
...
Post by Luke Pyzowski
Jan 27 11:33:14 lnxnfs01 kernel: md0: detected capacity change from 0 to 45062020923392
Jan 27 11:33:14 lnxnfs01 systemd[1]: Found device /dev/disk/by-uuid/2b9114be-3d5a-41d7-8d4b-e5047d223129.
Jan 27 11:33:14 lnxnfs01 systemd[1]: Started udev Wait for Complete Device Initialization.
Jan 27 11:33:14 lnxnfs01 systemd[1]: Started Timer to wait for more drives before activating degraded array..
Jan 27 11:33:14 lnxnfs01 systemd[1]: Starting Timer to wait for more drives before activating degraded array..
...
Post by Luke Pyzowski
Jan 27 11:33:45 lnxnfs01 systemd[1]: Created slice system-mdadm\x2dlast\x2dresort.slice.
Jan 27 11:33:45 lnxnfs01 systemd[1]: Starting system-mdadm\x2dlast\x2dresort.slice.
Jan 27 11:33:45 lnxnfs01 systemd[1]: Stopped target Local File Systems.
Jan 27 11:33:45 lnxnfs01 systemd[1]: Stopping Local File Systems.
Jan 27 11:33:45 lnxnfs01 systemd[1]: Unmounting Mount /share RAID partition explicitly...
Jan 27 11:33:45 lnxnfs01 systemd[1]: Starting Activate md array even though degraded...
Jan 27 11:33:45 lnxnfs01 systemd[1]: Stopped (with error) /dev/md0.
Jan 27 11:33:45 lnxnfs01 systemd[1]: Started Activate md array even though degraded.
Jan 27 11:33:45 lnxnfs01 systemd[1]: Unmounted Mount /share RAID partition explicitly.
Here is my educated guess.

Both mdadm-last-resort@.timer and mdadm-last-resort@.service conflict
with the MD device:

***@bor-Latitude-E5450:~/src/systemd$ ls ../mdadm/systemd/
mdadm-grow-continue@.service  mdadm.shutdown     SUSE-mdadm_env.sh
mdadm-last-resort@.service    mdmonitor.service
mdadm-last-resort@.timer      mdmon@.service
***@bor-Latitude-E5450:~/src/systemd$ cat ../mdadm/systemd/mdadm-last-resort@.timer
[Unit]
Description=Timer to wait for more drives before activating degraded array.
DefaultDependencies=no
Conflicts=sys-devices-virtual-block-%i.device

[Timer]
OnActiveSec=30
***@bor-Latitude-E5450:~/src/systemd$ cat ../mdadm/systemd/mdadm-last-resort@.service
[Unit]
Description=Activate md array even though degraded
DefaultDependencies=no
Conflicts=sys-devices-virtual-block-%i.device

[Service]
Type=oneshot
ExecStart=BINDIR/mdadm --run /dev/%i

I presume the intention is to stop these units when the MD device is
finally assembled as complete. This is indeed what happens on my (test) system:

Jan 28 14:18:04 linux-ffk5 kernel: md: bind<vda1>
Jan 28 14:18:04 linux-ffk5 kernel: md: bind<vdb1>
Jan 28 14:18:05 linux-ffk5 kernel: md/raid1:md0: active with 2 out of 2 mirrors
Jan 28 14:18:05 linux-ffk5 kernel: md0: detected capacity change from 0 to 5363466240
Jan 28 14:18:06 linux-ffk5 systemd[1]: mdadm-last-resort@md0.timer: Installed new job mdadm-last-resort@md0.timer/start as 287
Jan 28 14:18:06 linux-ffk5 systemd[1]: mdadm-last-resort@md0.timer: Enqueued job mdadm-last-resort@md0.timer/start as 287
Jan 28 14:18:06 linux-ffk5 systemd[1]: dev-ttyS9.device: Changed dead -> plugged
Jan 28 14:18:07 linux-ffk5 systemd[1]: mdadm-last-resort@md0.timer: Changed dead -> waiting
Jan 28 14:18:12 linux-ffk5 systemd[1]: sys-devices-virtual-block-md0.device: Changed dead -> plugged
Jan 28 14:18:12 linux-ffk5 systemd[1]: mdadm-last-resort@md0.timer: Trying to enqueue job mdadm-last-resort@md0.timer/stop/replace
Jan 28 14:18:12 linux-ffk5 systemd[1]: mdadm-last-resort@md0.timer: Installed new job mdadm-last-resort@md0.timer/stop as 292
Jan 28 14:18:12 linux-ffk5 systemd[1]: mdadm-last-resort@md0.timer: Enqueued job mdadm-last-resort@md0.timer/stop as 292
Jan 28 14:18:12 linux-ffk5 systemd[1]: mdadm-last-resort@md0.timer: Changed waiting -> dead
Jan 28 14:18:12 linux-ffk5 systemd[1]: mdadm-last-resort@md0.timer: Job mdadm-last-resort@md0.timer/stop finished, result=done
Jan 28 14:18:12 linux-ffk5 systemd[1]: Stopped Timer to wait for more drives before activating degraded array..
Jan 28 14:19:34 10 systemd[1692]: dev-vda1.device: Changed dead -> plugged
Jan 28 14:19:34 10 systemd[1692]: dev-vdb1.device: Changed dead -> plugged


On your system, apparently the timer is not stopped when the md device
appears, so when the last-resort service later runs, it causes an
attempt to stop the md device (due to the conflict) and, transitively,
the mount on top of it.

Could you try a run with systemd.log_level=debug on the kernel command
line and upload the journal again? We can only hope that it will not
skew timings too much, and that it may prove my hypothesis.
Luke Pyzowski
2017-01-30 22:41:10 UTC
Post by NeilBrown
Does
systemctl list-dependencies sys-devices-virtual-block-md0.device
report anything interesting? I get
sys-devices-virtual-block-md0.device
● └─mdmonitor.service
Nothing interesting, the same output as you have above.
Post by NeilBrown
Could you try run with systemd.log_level=debug on kernel command line and upload journal again. We can only hope that it will not skew timings enough but it may prove my hypothesis.
I've uploaded the full debug logs to: https://gist.github.com/Kryai/8273322c8a61347e2300e476c70b4d05
In around 20 reboots, the error appeared only twice. It is certainly rarer with debug enabled, but it does still occur; as you correctly guessed, debug logging does affect the exhibition of the race condition.

Reminder of key things in the log:
# cat /etc/systemd/system/mdadm-last-resort@.timer
[Unit]
Description=Timer to wait for more drives before activating degraded array.
DefaultDependencies=no
Conflicts=sys-devices-virtual-block-%i.device

[Timer]
OnActiveSec=30



# cat /etc/systemd/system/share.mount
[Unit]
Description=Mount /share RAID partition explicitly
Before=nfs-server.service

[Mount]
What=/dev/disk/by-uuid/2b9114be-3d5a-41d7-8d4b-e5047d223129
Where=/share
Type=ext4
Options=defaults
TimeoutSec=120

[Install]
WantedBy=multi-user.target


Again, if any more information is needed please let me know I'll provide it.


Many thanks,
Luke Pyzowski
NeilBrown
2017-01-30 01:53:37 UTC
Post by Luke Pyzowski
Hello,
systemd[1]: Created slice system-mdadm\x2dlast\x2dresort.slice.
systemd[1]: Starting system-mdadm\x2dlast\x2dresort.slice.
systemd[1]: Starting Activate md array even though degraded...
systemd[1]: Stopped target Local File Systems.
systemd[1]: Stopping Local File Systems.
systemd[1]: Unmounting /share...
systemd[1]: Stopped (with error) /dev/md0.
This line perplexes me.

The last-resort.service (and .timer) files have a Conflicts= directive
against sys-devices-virtual-block-md$DEV.device.
Normally a Conflicts= directive means that if this service starts, that
one is stopped, and if that one starts, this is stopped.
However .device units cannot be stopped:

$ systemctl show sys-devices-virtual-block-md0.device | grep Can
CanStart=no
CanStop=no
CanReload=no
CanIsolate=no

so presumably the attempt to stop the device fails, so the Conflicts=
dependency cannot be met, and the last-resort service (or timer) doesn't
get started.
At least, that is what I see happening in my tests.

But your log doesn't mention sys-devices-virtual-block-md0; it
mentions /dev/md0.
How does systemd know about /dev/md0, or the connection it has with
sys-devices-virtual-block-md0?

Does
systemctl list-dependencies sys-devices-virtual-block-md0.device

report anything interesting? I get

sys-devices-virtual-block-md0.device
● └─mdmonitor.service

NeilBrown
Andrei Borzenkov
2017-01-30 03:40:09 UTC
Post by NeilBrown
Post by Luke Pyzowski
Hello,
systemd[1]: Created slice system-mdadm\x2dlast\x2dresort.slice.
systemd[1]: Starting system-mdadm\x2dlast\x2dresort.slice.
systemd[1]: Starting Activate md array even though degraded...
systemd[1]: Stopped target Local File Systems.
systemd[1]: Stopping Local File Systems.
systemd[1]: Unmounting /share...
systemd[1]: Stopped (with error) /dev/md0.
This line perplexes me.
The last-resort.service (and .timer) files have a Conflict= directive
against sys-devices-virtual-block-md$DEV.device
Normally a Conflicts= directive means that if this service starts, that
one is stopped, and if that one starts, this is stopped.
$ systemctl show sys-devices-virtual-block-md0.device | grep Can
CanStart=no
CanStop=no
CanReload=no
CanIsolate=no
so presumable the attempt to stop the device fails, so the Conflict=
dependency cannot be met, so the last-resort service (or timer) doesn't
get started.
As I explained in the other mail, to me it looks like the last-resort
timer does get started, and then the last-resort service is started,
which attempts to stop the device; and because the mount point depends
on the device, it also stops the mount point. So somehow we have bad
timing where both the device and the timer start without canceling each
other.

The fact that stopping the device itself fails is irrelevant here -
dependencies are evaluated at the time the job is submitted, so if
share.mount Requires dev-md0.device and you attempt to stop
dev-md0.device, systemd still queues a job to stop share.mount.
Post by NeilBrown
At least, that is what I see happening in my tests.
Yes, we have a race condition here; I cannot reproduce it either. That
does not mean it does not exist :) Let's hope debug logging will show
something more useful (it is entirely possible that with debug logging
turned on this race does not happen).
Post by NeilBrown
But your log doesn't mention sys-devices-virtual-block-md0, it
mentions /dev/md0.
How does systemd know about /dev/md0, or the connection it has with
sys-devices-virtual-block-md0 ??
By virtue of the "Following" attribute: dev-md0.device is Following
sys-devices-virtual-block-md0.device, so stopping the latter will also
stop the former.
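
That relationship can be inspected directly; on a system where /dev/md0 is
assembled, a console check along these lines (a diagnostic sketch, output
abbreviated to the one property discussed here) should show it:

```
$ systemctl show -p Following dev-md0.device
Following=sys-devices-virtual-block-md0.device
```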
Post by NeilBrown
Does
systemctl list-dependencies sys-devices-virtual-block-md0.device
report anything interesting? I get
sys-devices-virtual-block-md0.device
● └─mdmonitor.service
NeilBrown
2017-01-30 06:36:51 UTC
Post by Andrei Borzenkov
Post by NeilBrown
Post by Luke Pyzowski
Hello,
systemd[1]: Created slice system-mdadm\x2dlast\x2dresort.slice.
systemd[1]: Starting system-mdadm\x2dlast\x2dresort.slice.
systemd[1]: Starting Activate md array even though degraded...
systemd[1]: Stopped target Local File Systems.
systemd[1]: Stopping Local File Systems.
systemd[1]: Unmounting /share...
systemd[1]: Stopped (with error) /dev/md0.
This line perplexes me.
The last-resort.service (and .timer) files have a Conflict= directive
against sys-devices-virtual-block-md$DEV.device
Normally a Conflicts= directive means that if this service starts, that
one is stopped, and if that one starts, this is stopped.
$ systemctl show sys-devices-virtual-block-md0.device | grep Can
CanStart=no
CanStop=no
CanReload=no
CanIsolate=no
so presumable the attempt to stop the device fails, so the Conflict=
dependency cannot be met, so the last-resort service (or timer) doesn't
get started.
As I explained in other mail, to me it looks like last-resort timer does
get started, and then last-resort service is started which attempts to
stop device and because mount point depends on device it also stops
mount point. So somehow we have bad timing when both device and timer
start without canceling each other.
The fact that stopping of device itself fails is irrelevant here -
dependencies are evaluated at the time job is submitted, so if
share.mount Requires dev-md0.device and you attempt to Stop
dev-md0.device, systemd still queues job to Stop share.mount.
Post by NeilBrown
At least, that is what I see happening in my tests.
Yes, we have race condition here, I cannot reproduce this either. It
does not mean it does not exist :) Let's hope debug logging will show
something more useful (it is entirely possible that with debugging logs
turned on this race does not happen).
Post by NeilBrown
But your log doesn't mention sys-devices-virtual-block-md0, it
mentions /dev/md0.
How does systemd know about /dev/md0, or the connection it has with
sys-devices-virtual-block-md0 ??
By virtue of "Following" attribute. dev-md0.device is Following
sys-devices-virtual-block-md0.device so stopping the latter will also
stop the former.
Ahh... now I see why I never saw this.
Two reasons.
1/ My /etc/fstab has UUID=d1711227-c9fa-4883-a904-7cd7a3eb865c rather
than /dev/md0
systemd doesn't manage to intuit a 'Following' dependency between
the UUID and the mount point.
2/ I use partitions of md arrays: that UUID is actually /dev/md0p3.
systemd doesn't intuit that md0p3.device is Following md0.device.

So you only hit a problem if you have "/dev/md0" or similar in
/etc/fstab.

The race is, I think, the one I mentioned. If the md device is started
before udev tells systemd to start the timer, the Conflicts dependency
goes the "wrong" way and stops the wrong thing.

It would be nice to be able to reliably stop the timer when the device
starts, without risking having the device get stopped when the timer
starts, but I don't think we can reliably do that.

Changing the
Conflicts=sys-devices-virtual-block-%i.device
lines to
ConditionPathExists=/sys/devices/virtual/block/%i
might make the problem go away, without any negative consequences.
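
Applied to the mdadm-last-resort@.timer quoted earlier in this thread, the
proposed edit would look roughly like this (a sketch of the suggestion above,
not the unit that mdadm actually ships; the same substitution would go into
the .service file):

```ini
# mdadm-last-resort@.timer - proposed variant: replace the Conflicts=
# dependency on the .device unit with a path condition, so starting the
# timer can never queue a stop job against the device (and, transitively,
# against the mount on top of it)
[Unit]
Description=Timer to wait for more drives before activating degraded array.
DefaultDependencies=no
ConditionPathExists=/sys/devices/virtual/block/%i

[Timer]
OnActiveSec=30
```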

The primary purpose of having the 'Conflicts' directives was so that
systemd wouldn't log
Starting Activate md array even though degraded
after the array was successfully started.
Hopefully it won't do that when the Condition fails.

Thanks,
NeilBrown
Post by Andrei Borzenkov
Post by NeilBrown
Does
systemctl list-dependencies sys-devices-virtual-block-md0.device
report anything interesting? I get
sys-devices-virtual-block-md0.device
● └─mdmonitor.service
Reindl Harald
2017-01-30 07:10:57 UTC
Post by NeilBrown
Post by Andrei Borzenkov
By virtue of "Following" attribute. dev-md0.device is Following
sys-devices-virtual-block-md0.device so stopping the latter will also
stop the former.
Ahh.. I see why I never saw this now.
Two reasons.
1/ My /etc/fstab has UUID=d1711227-c9fa-4883-a904-7cd7a3eb865c rather
than /dev/md0
systemd doesn't manage to intuit a 'Following' dependency between
the UUID and the mount point.
2/ I use partitions of md arrays: that UUID is actually /dev/md0p3.
systemd doesn't intuit that md0p3.device is Following md0.device.
So you only hit a problem if you have "/dev/md0" or similar in
/etc/fstab.
the fstab in the virtual machine which had that issue around 3 times in
the last year is using UUID - so it's not hit often at all, but I
wouldn't have expected it to hit a VM of all things (the reason for that
raid is just testing RAID10 with the writemostly option)

UUID=f0b27a5c-7e3d-45ad-8b7f-617820860379 /mnt/raid10 ext4
defaults,commit=30,inode_readahead_blks=16,relatime,lazytime,barrier=0 0 1
Andrei Borzenkov
2017-01-30 07:29:27 UTC
On Mon, Jan 30, 2017 at 9:36 AM, NeilBrown <***@suse.com> wrote:
...
Post by NeilBrown
Post by Luke Pyzowski
systemd[1]: Created slice system-mdadm\x2dlast\x2dresort.slice.
systemd[1]: Starting system-mdadm\x2dlast\x2dresort.slice.
systemd[1]: Starting Activate md array even though degraded...
systemd[1]: Stopped target Local File Systems.
systemd[1]: Stopping Local File Systems.
systemd[1]: Unmounting /share...
systemd[1]: Stopped (with error) /dev/md0.
...
Post by NeilBrown
The race is, I think, the one I mentioned. If the md device is started
before udev tells systemd to start the timer, the Conflicts dependency
goes the "wrong" way and stops the wrong thing.
From the logs provided it is unclear whether it is the *timer* or the
*service*. If it is the timer, I do not understand why it is started
exactly 30 seconds after the device apparently appears. That would match
the service being started.

Yet another case where system logging is hopelessly unfriendly for
troubleshooting :(
Post by NeilBrown
It would be nice to be able to reliably stop the timer when the device
starts, without risking having the device get stopped when the timer
starts, but I don't think we can reliably do that.
Well, let's wait until we can get some more information about what happens.
Post by NeilBrown
Changing the
Conflicts=sys-devices-virtual-block-%i.device
lines to
ConditionPathExists=/sys/devices/virtual/block/%i
might make the problem go away, without any negative consequences.
Ugly, but yes, maybe this is the only way with current systemd.
Post by NeilBrown
The primary purpose of having the 'Conflicts' directives was so that
systemd wouldn't log
Starting Activate md array even though degraded
after the array was successfully started.
This looks like a cosmetic problem. What will happen if the last-resort
service is started when the array is fully assembled? Will it do any harm?
Post by NeilBrown
Hopefully it won't do that when the Condition fails.
NeilBrown
2017-01-30 22:19:48 UTC
Post by Andrei Borzenkov
...
Post by NeilBrown
Post by Luke Pyzowski
systemd[1]: Created slice system-mdadm\x2dlast\x2dresort.slice.
systemd[1]: Starting system-mdadm\x2dlast\x2dresort.slice.
systemd[1]: Starting Activate md array even though degraded...
systemd[1]: Stopped target Local File Systems.
systemd[1]: Stopping Local File Systems.
systemd[1]: Unmounting /share...
systemd[1]: Stopped (with error) /dev/md0.
...
Post by NeilBrown
The race is, I think, the one I mentioned. If the md device is started
before udev tells systemd to start the timer, the Conflicts dependency
goes the "wrong" way and stops the wrong thing.
From the logs provided it is unclear whether it is the *timer* or the
*service*. If it is the timer, I do not understand why it is started
exactly 30 seconds after the device apparently appears. That would match
the service being started.
My guess is that the timer is triggered immediately after the device is
started, but before it is mounted.
The Conflicts directive tries to stop the device, but it cannot stop the
device and there are no dependencies yet, so nothing happens.
After the timer fires (30 seconds later) the .service starts. It also
has a Conflicts directive, so systemd tries to stop the device again.
Now that it has been mounted, there is a dependency that can be
stopped, and the device gets unmounted.
Post by Andrei Borzenkov
Yet another case where system logging is hopelessly unfriendly for
troubleshooting :(
Post by NeilBrown
It would be nice to be able to reliably stop the timer when the device
starts, without risking having the device get stopped when the timer
starts, but I don't think we can reliably do that.
Well, let's wait until we can get some more information about what happens.
Post by NeilBrown
Changing the
Conflicts=sys-devices-virtual-block-%i.device
lines to
ConditionPathExists=/sys/devices/virtual/block/%i
might make the problem go away, without any negative consequences.
Ugly, but yes, maybe this is the only way with current systemd.
Post by NeilBrown
The primary purpose of having the 'Conflicts' directives was so that
systemd wouldn't log
Starting Activate md array even though degraded
after the array was successfully started.
This looks like a cosmetic problem. What will happen if the last-resort
service is started when the array is fully assembled? Will it do any harm?
Yes, it could be seen as cosmetic, but cosmetic issues can be important
too. Confusing messages in logs can be harmful.

In all likely cases, running the last-resort service won't cause any
harm.
If, during the 30 seconds, the array is started, then deliberately
stopped, then partially assembled again, the last-resort service might
do the wrong thing when it finally starts.
So it would be cleanest if the timer were killed as soon as the device
is started. But I don't think there is a practical concern.

I guess I could make a udev rule that fires when the array is started,
and that runs "systemctl stop mdadm-last-resort@md0.timer"
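[A hypothetical sketch of such a rule - the match keys here are an
editor's assumption, not an existing mdadm rule; the md rules shipped on
a given system may expose different properties:]

```
# Hypothetical udev rule (sketch): stop the last-resort timer as soon as
# udev reports the array active. The match keys are assumptions; check
# the md rules shipped with your mdadm before using anything like this.
SUBSYSTEM=="block", KERNEL=="md*", ACTION=="change", \
  ENV{SYSTEMD_READY}=="1", \
  RUN+="/usr/bin/systemctl --no-block stop mdadm-last-resort@%k.timer"
```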

NeilBrown
Post by Andrei Borzenkov
Post by NeilBrown
Hopefully it won't do that when the Condition fails.
Andrei Borzenkov
2017-01-31 20:17:03 UTC
Post by NeilBrown
Post by Andrei Borzenkov
...
Post by NeilBrown
Post by Luke Pyzowski
systemd[1]: Created slice system-mdadm\x2dlast\x2dresort.slice.
systemd[1]: Starting system-mdadm\x2dlast\x2dresort.slice.
systemd[1]: Starting Activate md array even though degraded...
systemd[1]: Stopped target Local File Systems.
systemd[1]: Stopping Local File Systems.
systemd[1]: Unmounting /share...
systemd[1]: Stopped (with error) /dev/md0.
...
Post by NeilBrown
The race is, I think, the one I mentioned. If the md device is started
before udev tells systemd to start the timer, the Conflicts dependency
goes the "wrong" way and stops the wrong thing.
From the logs provided it is unclear whether it is the *timer* or the
*service*. If it is the timer, I do not understand why it is started
exactly 30 seconds after the device apparently appears. That would match
the service being started.
My guess is that the timer is triggered immediately after the device is
started, but before it is mounted.
The Conflicts directive tries to stop the device, but it cannot stop the
device and there are no dependencies yet, so nothing happens.
After the timer fires (30 seconds later) the .service starts. It also
has a Conflicts directive, so systemd tries to stop the device again.
Now that it has been mounted, there is a dependency that can be
stopped, and the device gets unmounted.
Post by Andrei Borzenkov
Yet another case where system logging is hopelessly unfriendly for
troubleshooting :(
Post by NeilBrown
It would be nice to be able to reliably stop the timer when the device
starts, without risking having the device get stopped when the timer
starts, but I don't think we can reliably do that.
Well, let's wait until we can get some more information about what happens.
Not much more, but we at least have confirmed that it was indeed the
last-resort service which was fired off by the last-resort timer.
Unfortunately there is no trace of the timer itself.
Post by NeilBrown
Post by Andrei Borzenkov
Post by NeilBrown
Changing the
Conflicts=sys-devices-virtual-block-%i.device
lines to
ConditionPathExists=/sys/devices/virtual/block/%i
might make the problem go away, without any negative consequences.
Ugly, but yes, maybe this is the only way with current systemd.
This won't work. The sysfs node appears as soon as the very first array
member is found, while the array is still inactive; what we need is the
condition "array is active".

The Conflicts line works because the array is not announced to systemd
(SYSTEMD_READY) until it is active, which in turn is derived from the
content of md/array_state.
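[As an illustration only - an editor's sketch, not mdadm code - the
readiness decision described above amounts to:]

```shell
#!/bin/sh
# Sketch: the "is the array active?" test that SYSTEMD_READY encodes,
# derived from md/array_state as described above.
md_is_active() {
    # $1: the array's md/ sysfs directory, e.g. /sys/devices/virtual/block/md0/md
    state=$(cat "$1/array_state" 2>/dev/null) || return 1
    case "$state" in
        clear|inactive) return 1 ;;  # members present, array not started
        *)              return 0 ;;  # clean, active, write-pending, ...
    esac
}
```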
Post by NeilBrown
Post by Andrei Borzenkov
Post by NeilBrown
The primary purpose of having the 'Conflicts' directives was so that
systemd wouldn't log
Starting Activate md array even though degraded
after the array was successfully started.
Yes, I understand it.
Post by NeilBrown
Post by Andrei Borzenkov
This looks like a cosmetic problem. What will happen if the last-resort
service is started when the array is fully assembled? Will it do any harm?
Yes, it could be seen as cosmetic, but cosmetic issues can be important
too. Confusing messages in logs can be harmful.
In all likely cases, running the last-resort service won't cause any
harm.
If, during the 30 seconds, the array is started, then deliberately
stopped, then partially assembled again, the last-resort service might
do the wrong thing when it finally starts.
So it would be cleanest if the timer were killed as soon as the device
is started. But I don't think there is a practical concern.
I guess I could make a udev rule that fires when the array is started, and
Well ... what we really need is a unidirectional dependency. Actually,
the way Conflicts is used *is* unidirectional anyway - nobody seriously
expects that starting foo.service will stop a currently running
shutdown.target. But that is the semantics we have currently.

But this will probably do to mitigate the issue until something more
generic can be implemented.
NeilBrown
2017-02-08 04:10:57 UTC
Post by Andrei Borzenkov
Post by Andrei Borzenkov
Post by NeilBrown
Changing the
Conflicts=sys-devices-virtual-block-%i.device
lines to
ConditionPathExists=/sys/devices/virtual/block/%i
might make the problem go away, without any negative consequences.
Ugly, but yes, maybe this is the only way with current systemd.
This won't work. The sysfs node appears as soon as the very first array
member is found, while the array is still inactive; what we need is the
condition "array is active".
Of course, you are right.
A suitable "array is active" test is the existence of
.../md/sync_action
which appears when an array is activated (except for RAID0 and Linear,
which don't need last-resort support).
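[In shell terms - an editor's illustration, not part of the patch - the
proposed ConditionPathExists line amounts to this check:]

```shell
#!/bin/sh
# Sketch of the check behind
# ConditionPathExists=!/sys/devices/virtual/block/%i/md/sync_action:
# the last-resort service should only run while sync_action is absent,
# i.e. while the array has not yet been activated.
last_resort_wanted() {
    # $1: the array's md/ sysfs directory
    [ ! -e "$1/sync_action" ]
}
```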

So this is what I propose to post upstream. Could you please confirm
that it works for you? It appears to work for me.

Thanks,
NeilBrown

From: NeilBrown <***@suse.com>
Date: Wed, 8 Feb 2017 15:01:05 +1100
Subject: [PATCH] systemd/mdadm-last-resort: use ConditionPathExists instead of
Conflicts

Commit cec72c071bbe ("systemd/mdadm-last-resort: add Conflicts to .service file.")

added a 'Conflicts' directive to the mdadm-last-resort@.service file in
the hope that this would make sure the service didn't run after the device
was active, even if the timer managed to get started, which is possible in
race conditions.

This seemed to work in testing, but it isn't clear why, and it is known
to cause problems.
If systemd happens to know that the mentioned device is a dependency of a
mount point, the Conflicts can unmount that mountpoint, which is certainly
not wanted.

So remove the "Conflicts" and instead use
ConditionPathExists=!/sys/devices/virtual/block/%i/md/sync_action

The "sync_action" file exists for any array which requires last-resort
handling, and only appears when the array is activated. So it is safe
to rely on it to determine if the last-resort is really needed.

Fixes: cec72c071bbe ("systemd/mdadm-last-resort: add Conflicts to .service file.")
Signed-off-by: NeilBrown <***@suse.com>
---
systemd/mdadm-last-resort@.service | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/systemd/mdadm-last-resort@.service b/systemd/mdadm-last-resort@.service
index e93d72b2b45e..f9d4d12738a3 100644
--- a/systemd/mdadm-last-resort@.service
+++ b/systemd/mdadm-last-resort@.service
@@ -1,7 +1,7 @@
[Unit]
Description=Activate md array even though degraded
DefaultDependencies=no
-Conflicts=sys-devices-virtual-block-%i.device
+ConditionPathExists=!/sys/devices/virtual/block/%i/md/sync_action

[Service]
Type=oneshot
--
2.11.0