Discussion:
more verbose debug info than systemd.log_level=debug?
Chris Murphy
2017-03-17 04:19:22 UTC
I've got a Fedora 22, 23, 24, 25 bug where a systemd offline update of
the kernel results in an unbootable system, but only on XFS (/boot is a
directory); the system fails to boot to a grub menu. The details are in
this bug's comment:

https://bugzilla.redhat.com/show_bug.cgi?id=1227736#c39

The gist of it is that the file system is dirty following the offline
update, and grub.cfg is zero length. If the fs is mounted from a rescue
system, the XFS journal is replayed and cleans things up; there is then
a valid grub.cfg, and at the next reboot there is a grub menu as
expected, with the newly installed kernel.

That bug is on bare metal for another user, but I've reproduced it in
qemu-kvm, where I use the boot parameters systemd.log_level=debug
systemd.log_target=console console=ttyS0,38400 and virsh console to
capture what's going on during the offline update that results in the
dirty file system.

What I get is more confusing than helpful:



Sending SIGTERM to remaining processes...
Sending SIGKILL to remaining processes...
Process 304 (plymouthd) has been marked to be excluded from killing.
It is running from the root file system, and thus likely to block
re-mounting of the root file system to read-only. Please consider
moving it into an initrd file system instead.
Unmounting file systems.
Remounting '/tmp' read-only with options 'seclabel'.
Unmounting /tmp.
Remounting '/' read-only with options 'seclabel,attr2,inode64,noquota'.
Remounting '/' read-only with options 'seclabel,attr2,inode64,noquota'.
Remounting '/' read-only with options 'seclabel,attr2,inode64,noquota'.
All filesystems unmounted.
Deactivating swaps.
All swaps deactivated.
Detaching loop devices.
device-enumerator: scan all dirs
device-enumerator: scanning /sys/bus
device-enumerator: scanning /sys/class
All loop devices detached.
Detaching DM devices.
device-enumerator: scan all dirs
device-enumerator: scanning /sys/bus
device-enumerator: scanning /sys/class
All DM devices detached.
Spawned /usr/lib/systemd/system-shutdown/mdadm.shutdown as 8408.
/usr/lib/systemd/system-shutdown/mdadm.shutdown succeeded.
system-shutdown succeeded.
Failed to read reboot parameter file: No such file or directory
Rebooting.
[ 52.963598] Unregister pv shared memory for cpu 0
[ 52.965736] Unregister pv shared memory for cpu 1
[ 52.966795] sd 1:0:0:0: [sda] Synchronizing SCSI cache
[ 52.991220] reboot: Restarting system
[ 52.993119] reboot: machine restart
<no further entries, VM shuts down>

1. Why are there three remount read-only entries? Are these failing?
The same three entries happen when the file system is Btrfs, so it's
not an XFS-specific anomaly.

2. "All filesystems unmounted." What condition is required to generate
this message? I guess I'm asking whether it's reliable, or whether it's
possible that after three failed read-only remounts systemd gives up,
claims the file systems are unmounted, and then reboots.

There is an XFS-specific problem here, as the dirty fs problem only
happens on XFS; the file system is clean if it's ext4 or Btrfs.
Nevertheless it looks like something is holding up the remount, and no
return value from umount is logged.

Is there a way to get more information during shutdown than this? The
question at this point is why the XFS volume is dirty at reboot time,
but there's not much to go on: I get all the same console messages
with ext4 and Btrfs, which don't have a dirty fs at reboot following an
offline update.


Thanks,
--
Chris Murphy
Chris Murphy
2017-03-21 03:25:05 UTC
Any thoughts on this?

I've followed these instructions:
https://freedesktop.org/wiki/Software/systemd/Debugging/
(the "Shutdown Completes Eventually" section)

However, no additional information is being logged that gives any
answer to why there are three remount ro attempts, and why they aren't
succeeding.

https://github.com/systemd/systemd/blob/master/src/core/umount.c
line 409

This suggests three ro attempts shouldn't happen. And then line 413
says that / won't actually get umounted; reboot happens leaving it
mounted ro. So "All filesystems unmounted." doesn't tell us anything,
but it does seem like there should be a way to expose the exit code of
umount. I'm just not sure how to do it, or whether that means compiling
systemd myself.
--
Chris Murphy
Mantas Mikulėnas
2017-03-21 05:05:48 UTC
First thought: Even without the exit code or anything, it's going to be
-EBUSY like 99.999% of the time. Not much else can fail during umount.

And "Filesystem is busy" would perfectly fit the earlier error message
which you overlooked:

"Process 304 (plymouthd) has been marked to be excluded from killing.
It is running from the root file system, and thus likely to block
re-mounting of the root file system to read-only."

So you have a process holding / open (Plymouth is the boot splash screen
app) and the kernel doesn't allow it to be umounted due to that.
Post by Chris Murphy
Any thoughts on this?
https://freedesktop.org/wiki/Software/systemd/Debugging/
Shutdown Completes Eventually
However, no additional information is being logged that gives any
answer to why there are three remount ro attempts, and why they aren't
succeeding.
https://github.com/systemd/systemd/blob/master/src/core/umount.c
line 409
This suggests three ro attempts shouldn't happen. And then 413 says
that / won't actually get umounted, reboot happens leaving it ro
mounted. So the "All filesystems unmounted." doesn't tell us anything;
but it does seem like there should be a way to expose exit code for
umount. I'm just not sure how to do it, and if that means compiling
systemd myself.
--
Chris Murphy
--
Mantas Mikulėnas <***@gmail.com>
Sent from my phone
Chris Murphy
2017-03-21 06:04:56 UTC
Thanks for the reply.
Post by Mantas Mikulėnas
First thought: Even without the exit code or anything, it's going to be
-EBUSY like 99.999% of the time. Not much else can fail during umount.
And "Filesystem is busy" would perfectly fit the earlier error message
"Process 304 (plymouthd) has been marked to be excluded from killing.
It is running from the root file system, and thus likely to block
re-mounting of the root file system to read-only."
So you have a process holding / open (Plymouth is the boot splash screen
app) and the kernel doesn't allow it to be umounted due to that.
a. It seems flawed to have something that can block the remount to
read-only; that's either a flaw in Plymouth directly, or in running it
from the root fs rather than the initramfs.

b. This message occurs, as well as the three remount ro messages,
regardless of filesystem (volume format).

c. Only XFS is left in a dirty state following the reboot. Ext4 and Btrfs
are OK.

So I'm still left with why XFS is affected, and XFS devs want to know the
exit code.

At reboot/shutdown time, exactly what does systemd issue to the kernel to
do this?


Chris Murphy
Chris Murphy
2017-03-21 19:58:26 UTC
Post by Chris Murphy
c. Only XFS is left in a dirty state following the reboot. Ext4 and Btrfs
are OK.
This is incorrect. The problem affects ext4 as well; it's just that on
ext4, while the fs is left in a dirty state, the modified grub.cfg is
still readable and boot is possible. But boot after a pk offline update
always includes journal replay.

Basically these reboots are leaving file systems dirty. I can't tell
from the available information whether it's a systemd bug or a kernel
bug. The file system remount to read-only is failing, a umount isn't
attempted, and between systemd and the kernel the decision is made to
reboot anyway, resulting in this problem.
--
Chris Murphy
Chris Murphy
2017-03-21 21:10:29 UTC
OK, so I had the idea to uninstall plymouth, since that's ostensibly
what's holding up the remount read-only. But that's not it.

Sending SIGTERM to remaining processes...
Sending SIGKILL to remaining processes...
Unmounting file systems.
Remounting '/tmp' read-only with options 'seclabel'.
Unmounting /tmp.
Remounting '/' read-only with options 'seclabel,attr2,inode64,noquota'.
Remounting '/' read-only with options 'seclabel,attr2,inode64,noquota'.
Remounting '/' read-only with options 'seclabel,attr2,inode64,noquota'.
All filesystems unmounted.
Deactivating swaps.
All swaps deactivated.
Detaching loop devices.
device-enumerator: scan all dirs
device-enumerator: scanning /sys/bus
device-enumerator: scanning /sys/class
All loop devices detached.
Detaching DM devices.
device-enumerator: scan all dirs
device-enumerator: scanning /sys/bus
device-enumerator: scanning /sys/class
All DM devices detached.
Spawned /usr/lib/systemd/system-shutdown/mdadm.shutdown as 7058.
/usr/lib/systemd/system-shutdown/mdadm.shutdown succeeded.
system-shutdown succeeded.
Failed to read reboot parameter file: No such file or directory
Rebooting.
[ 47.288419] Unregister pv shared memory for cpu 0
[ 47.289140] Unregister pv shared memory for cpu 1
[ 47.290013] sd 1:0:0:0: [sda] Synchronizing SCSI cache
[ 47.315486] reboot: Restarting system
[ 47.316036] reboot: machine restart


There are still three attempts to remount read-only. Why? Separately
checking the file system following this reboot, the fs is clean, not
dirty, so one of those remounts must have worked this time. And the
file system is bootable.

There really isn't enough debugging within systemd to isolate
everything that's going on here.



Chris Murphy
Andrei Borzenkov
2017-03-22 03:48:41 UTC
Post by Chris Murphy
OK, so I had the idea to uninstall plymouth, since that's ostensibly
what's holding up the remount read-only. But that's not it.
Sending SIGTERM to remaining processes...
Sending SIGKILL to remaining processes...
Unmounting file systems.
Remounting '/tmp' read-only with options 'seclabel'.
Unmounting /tmp.
Remounting '/' read-only with options 'seclabel,attr2,inode64,noquota'.
Remounting '/' read-only with options 'seclabel,attr2,inode64,noquota'.
Remounting '/' read-only with options 'seclabel,attr2,inode64,noquota'.
All filesystems unmounted.
Could you show your /proc/self/mountinfo before starting shutdown (or
ideally just before systemd goes into umount-all)? This suggests that
"/" appears there three times.

Result code of "remount ro" is not evaluated or logged. systemd does

(void) mount(NULL, m->path, NULL, MS_REMOUNT|MS_RDONLY, options);

where "options" are those from /proc/self/mountinfo sans ro|rw.

Probably it should log it at least with debug level.
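For reference, the entries systemd works from can be inspected directly. A minimal Python sketch (my code, not systemd's) that parses /proc/self/mountinfo the same way, which would show whether "/" really appears three times:

```python
# Parse /proc/self/mountinfo: field 5 is the mount point, field 6 the
# per-mount options; superblock options sit after the "-" separator.
from collections import Counter

def read_mountinfo(path="/proc/self/mountinfo"):
    entries = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            sep = fields.index("-")  # separator between optional fields and fs info
            # (mount point, per-mount options, superblock options)
            entries.append((fields[4], fields[5], fields[sep + 3]))
    return entries

entries = read_mountinfo()
counts = Counter(mp for mp, _, _ in entries)
print(counts["/"])  # if this prints 3, "/" really is mounted three times
```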
Chris Murphy
2017-03-22 17:05:16 UTC
Post by Andrei Borzenkov
Post by Chris Murphy
OK, so I had the idea to uninstall plymouth, since that's ostensibly
what's holding up the remount read-only. But that's not it.
Sending SIGTERM to remaining processes...
Sending SIGKILL to remaining processes...
Unmounting file systems.
Remounting '/tmp' read-only with options 'seclabel'.
Unmounting /tmp.
Remounting '/' read-only with options 'seclabel,attr2,inode64,noquota'.
Remounting '/' read-only with options 'seclabel,attr2,inode64,noquota'.
Remounting '/' read-only with options 'seclabel,attr2,inode64,noquota'.
All filesystems unmounted.
Could you show your /proc/self/mountinfo before starting shutdown (or
ideally just before systemd goes into umount-all)? This suggests that
"/" appears there three times.
I'm too stupid to figure out how to get virsh console to attach to the
tty9/early debug shell, but here's a screenshot right as
pk-offline-update is done, maybe 2 seconds before the remounting and
reboot:
https://drive.google.com/open?id=0B_2Asp8DGjJ9NXRGTTFjSlVPSU0
Post by Andrei Borzenkov
Result code of "remount ro" is not evaluated or logged. systemd does
(void) mount(NULL, m->path, NULL, MS_REMOUNT|MS_RDONLY, options);
where "options" are those from /proc/self/mountinfo sans ro|rw.
Probably it should log it at least with debug level.
So I've asked over on the XFS list about this, and they suggest all of
this is expected behavior under the circumstances. The sync only means
data is committed to disk with an appropriate journal entry; it doesn't
mean fs metadata is up to date, and it's the fs metadata that GRUB
depends on, which isn't up to date yet. So the suggestion is that if
remount-ro fails, to freeze/unfreeze and then reboot. The difference
between freeze/unfreeze and remount-ro is that freeze/unfreeze will
update fs metadata even if something is preventing remount-ro.

If it's useful I'll file an issue with systemd on github to get a
freeze/unfreeze inserted. remount-ro isn't always successful, and
clearly it's not ok to reboot anyway if remount-ro fails.
--
Chris Murphy
Chris Murphy
2017-03-27 19:20:24 UTC
OK, so the dirty file system problem happens with all pk offline
updates on Fedora, using either ext4 or XFS with any layout, and it's
easy to reproduce.

1. Clean install any version of Fedora, defaults.
2. Once Gnome Software gives notification of updates, Restart & Install
3. System reboots, updates are applied, system reboots again.
4. Now check the journal, filtering for 'fsck', and you'll see it
replayed the journal; if using XFS, filter for "XFS" and you'll see
the kernel did journal replay at mount time.

Basically, even though the remount-ro fails multiple times (because
plymouth is running off the root fs and is exempt from being killed),
systemd reboots anyway, leaving the file system dirty. So it seems like
a flaw to me to allow an indefinite exemption from killing for a
process that's holding a volume rw and preventing remount-ro before
reboot.

So there's a risk that in other configurations this makes either ext4
or XFS systems unbootable following an offline update.

Chris Murphy
Mantas Mikulėnas
2017-03-27 19:27:16 UTC
Post by Chris Murphy
Ok so the dirty file system problem always happens with all pk offline
updates on Fedora using either ext4 or XFS with any layout; and it's
easy to reproduce.
1. Clean install any version of Fedora, defaults.
2. Once Gnome Software gives notification of updates, Restart & Install
3. System reboots, updates are applied, system reboots again.
4. Now check the journal filtering for 'fsck' and you'll see it
replayed the journals; if using XFS check the filter for "XFS" and
you'll see the kernel did journal replay at mount time.
Basically systemd is rebooting even though the remount-ro fails
multiple times, due to plymouth running off root fs and being exempt
from being killed, and then reboots anyway, leaving the file system
dirty. So it seems like a flaw to me to allow an indefinite exemption
from killing a process that's holding a volume rw, preventing
remount-ro before reboot.
So there's a risk that in other configurations this makes either ext4
and XFS systems unbootable following an offline update.
So on the one hand it's probably a Plymouth bug or misconfiguration (it
shouldn't mark itself exempt unless it runs off an in-memory initramfs).

But on the other hand, are filesystems really so fragile? Even though it's
after a system upgrade (which updated many files), I was sure systemd at
least tries to *sync* all remaining filesystems before reboot, doesn't it?
--
Mantas Mikulėnas <***@gmail.com>
Chris Murphy
2017-03-28 14:01:29 UTC
Post by Mantas Mikulėnas
Post by Chris Murphy
Ok so the dirty file system problem always happens with all pk offline
updates on Fedora using either ext4 or XFS with any layout; and it's
easy to reproduce.
1. Clean install any version of Fedora, defaults.
2. Once Gnome Software gives notification of updates, Restart & Install
3. System reboots, updates are applied, system reboots again.
4. Now check the journal filtering for 'fsck' and you'll see it
replayed the journals; if using XFS check the filter for "XFS" and
you'll see the kernel did journal replay at mount time.
Basically systemd is rebooting even though the remount-ro fails
multiple times, due to plymouth running off root fs and being exempt
from being killed, and then reboots anyway, leaving the file system
dirty. So it seems like a flaw to me to allow an indefinite exemption
from killing a process that's holding a volume rw, preventing
remount-ro before reboot.
So there's a risk that in other configurations this makes either ext4
and XFS systems unbootable following an offline update.
So on the one hand it's probably a Plymouth bug or misconfiguration (it
shouldn't mark itself exempt unless it runs off an in-memory initramfs).
OK. But does it even make sense to have a process exempt from killing
when it's going to get killed by the reboot anyway? It seems to me that
once we're at remount-ro or umount time, nothing is exempt: forcibly
kill everything, clean up the file system, and then reboot.
Post by Mantas Mikulėnas
But on the other hand, are filesystems really so fragile? Even though it's
after a system upgrade (which updated many files), I was sure systemd at
least tries to *sync* all remaining filesystems before reboot, doesn't it?
All sync does is flush data and the log to disk, not file system
metadata. While this is crash safe, rebooting without either a
remount-ro or a umount of the root fs is basically a crash as far as
the file system is concerned. So it has to do log recovery at the next
mount, which the bootloader can't do. The bootloader depends on file
system metadata being correct.
--
Chris Murphy
Mantas Mikulėnas
2017-03-28 16:41:04 UTC
Post by Chris Murphy
Post by Mantas Mikulėnas
Post by Chris Murphy
Ok so the dirty file system problem always happens with all pk offline
updates on Fedora using either ext4 or XFS with any layout; and it's
easy to reproduce.
1. Clean install any version of Fedora, defaults.
2. Once Gnome Software gives notification of updates, Restart & Install
3. System reboots, updates are applied, system reboots again.
4. Now check the journal filtering for 'fsck' and you'll see it
replayed the journals; if using XFS check the filter for "XFS" and
you'll see the kernel did journal replay at mount time.
Basically systemd is rebooting even though the remount-ro fails
multiple times, due to plymouth running off root fs and being exempt
from being killed, and then reboots anyway, leaving the file system
dirty. So it seems like a flaw to me to allow an indefinite exemption
from killing a process that's holding a volume rw, preventing
remount-ro before reboot.
So there's a risk that in other configurations this makes either ext4
and XFS systems unbootable following an offline update.
So on the one hand it's probably a Plymouth bug or misconfiguration (it
shouldn't mark itself exempt unless it runs off an in-memory initramfs).
OK. But does it even make sense to have a process exempt from killing,
when it's going to get killed by reboot? Seems to me once we're at
remount-ro or umount time, nothing is exempt, they need to be forcibly
killed, clean up the file system, and then reboot.
Processes are killed *before* the remount/unmount stage. The primary users
of kill-exemption would therefore be daemons which *provide* access to the
root filesystem, e.g. iscsid, rpc helper daemons, or even ntfs-3g.
(Naturally these are expected to be running from the initramfs.)

So the same applies to plymouth, IMO -- it should only mark itself exempt
if it runs from the initramfs and knows that it won't interfere.

(Unrelated, but I should also mention that systemd-shutdown has a "shutdown
initramfs" feature, where it can jump *back* to the initramfs and let its
"/shutdown" script do additional cleanup steps.)
--
Mantas Mikulėnas <***@gmail.com>
Chris Murphy
2017-03-28 17:31:31 UTC
Post by Mantas Mikulėnas
Post by Chris Murphy
Post by Mantas Mikulėnas
Post by Chris Murphy
Ok so the dirty file system problem always happens with all pk offline
updates on Fedora using either ext4 or XFS with any layout; and it's
easy to reproduce.
1. Clean install any version of Fedora, defaults.
2. Once Gnome Software gives notification of updates, Restart & Install
3. System reboots, updates are applied, system reboots again.
4. Now check the journal filtering for 'fsck' and you'll see it
replayed the journals; if using XFS check the filter for "XFS" and
you'll see the kernel did journal replay at mount time.
Basically systemd is rebooting even though the remount-ro fails
multiple times, due to plymouth running off root fs and being exempt
from being killed, and then reboots anyway, leaving the file system
dirty. So it seems like a flaw to me to allow an indefinite exemption
from killing a process that's holding a volume rw, preventing
remount-ro before reboot.
So there's a risk that in other configurations this makes either ext4
and XFS systems unbootable following an offline update.
So on the one hand it's probably a Plymouth bug or misconfiguration (it
shouldn't mark itself exempt unless it runs off an in-memory initramfs).
OK. But does it even make sense to have a process exempt from killing,
when it's going to get killed by reboot? Seems to me once we're at
remount-ro or umount time, nothing is exempt, they need to be forcibly
killed, clean up the file system, and then reboot.
Processes are killed *before* the remount/unmount stage. The primary users
of kill-exemption would therefore be daemons which *provide* access to the
root filesystem, e.g. iscsid, rpc helper daemons, or even ntfs-3g.
(Naturally these are expected to be running from the initramfs.)
OK, but it's obviously possible for a developer to run a process from
the root fs and mark it kill-exempt. That's the problem under
discussion: the developer is doing the wrong thing, and it's allowed.
And it's been going on for a very long time (at least 5 releases of
Fedora).
Post by Mantas Mikulėnas
So the same applies to plymouth, IMO -- it should only mark itself exempt if
it runs from the initramfs and knows that it won't interfere.
How is this exemption specified? Would it be part of the plymouth packaging?

I recognize the immediate bug is with plymouth, so to move this
forward I'm happy to assign the Fedora bug and leave it up to Fedora
devs to figure out whether plymouth should go in the initramfs or the
kill exemption should be removed. But long term, I still think there's
a role for systemd here: disregard a process's kill exemption if it's
running from a volume that's about to be remounted-ro or umounted.
Letting it block that is asking for big problems, as seen here.
--
Chris Murphy
Mantas Mikulėnas
2017-03-28 17:35:41 UTC
Post by Chris Murphy
Post by Mantas Mikulėnas
So the same applies to plymouth, IMO -- it should only mark itself
exempt if it runs from the initramfs and knows that it won't interfere.
How is this exemption specified? Would it be part of the plymouth packaging?
https://cgit.freedesktop.org/plymouth/commit/?id=9e5a276f322cfce46b5b2ed2125cb9ec67df7e9f
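For anyone else wondering: the commit above uses the convention from the RootStorageDaemons wiki page, where a process marks itself exempt by making its argv[0] begin with "@". A quick Python sketch (mine, not systemd code) that lists the processes currently carrying that marker:

```python
# Scan /proc for processes whose argv[0] starts with "@" -- the marker
# systemd-shutdown treats as "exclude from the final killing spree".
import os

def kill_exempt_pids():
    exempt = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/cmdline", "rb") as f:
                argv0 = f.read().split(b"\0", 1)[0]
        except OSError:
            continue  # process exited or is unreadable
        if argv0.startswith(b"@"):
            exempt.append(int(pid))
    return exempt

print(kill_exempt_pids())  # likely empty on a normal session; plymouthd's PID here
```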
--
Mantas Mikulėnas <***@gmail.com>
Lennart Poettering
2017-03-30 10:30:16 UTC
Post by Chris Murphy
OK but it's obviously possible for a developer to run a process from
root fs, and mark it kill exempt. That's the problem under discussion,
the developer is doing the wrong thing, and it's allowed. And it's
been going on for a very long time (at least 5 releases of Fedora)
We expect that people who use this functionality are careful with it,
and we made sure to document this all very explicitly:

https://www.freedesktop.org/wiki/Software/systemd/RootStorageDaemons/

We even say very clearly what the correct way is to detect whether we
are running from the initrd or the host system.
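(For the record, the check described there is simply the presence of /etc/initrd-release, which only the initramfs carries; a minimal sketch:)

```python
# Documented initrd detection: the initramfs ships /etc/initrd-release
# (analogous to /etc/os-release), so a daemon should only mark itself
# "@"-exempt when this file exists.
import os

def running_from_initrd() -> bool:
    return os.path.exists("/etc/initrd-release")

print(running_from_initrd())
```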

But anyway, I'd claim that the main culprit is XFS here.

Lennart
--
Lennart Poettering, Red Hat
Lennart Poettering
2017-03-30 10:28:10 UTC
Post by Mantas Mikulėnas
Post by Chris Murphy
Ok so the dirty file system problem always happens with all pk offline
updates on Fedora using either ext4 or XFS with any layout; and it's
easy to reproduce.
1. Clean install any version of Fedora, defaults.
2. Once Gnome Software gives notification of updates, Restart & Install
3. System reboots, updates are applied, system reboots again.
4. Now check the journal filtering for 'fsck' and you'll see it
replayed the journals; if using XFS check the filter for "XFS" and
you'll see the kernel did journal replay at mount time.
Basically systemd is rebooting even though the remount-ro fails
multiple times, due to plymouth running off root fs and being exempt
from being killed, and then reboots anyway, leaving the file system
dirty. So it seems like a flaw to me to allow an indefinite exemption
from killing a process that's holding a volume rw, preventing
remount-ro before reboot.
So there's a risk that in other configurations this makes either ext4
and XFS systems unbootable following an offline update.
So on the one hand it's probably a Plymouth bug or misconfiguration (it
shouldn't mark itself exempt unless it runs off an in-memory initramfs).
Correct. Plymouth shouldn't mark itself this way, unless it runs
from the initrd. The documentation says this very explicitly:

Again: if your code is being run from the root file system, then
this logic suggested above is NOT for you. Sorry. Talk to us, we
can probably help you to find a different solution to your problem.

See
https://www.freedesktop.org/wiki/Software/systemd/RootStorageDaemons/.

That said, a file system remaining mounted during shutdown is ugly
but shouldn't result in data loss, as we do call sync() before
reboot(), and so does any other init system (see other mail).

Hence, there are two bugs here:

a) an ugliness in plymouth (or in the way it is used by Fedora's
package update logic), resulting in something that is mostly a
cosmetic problem

b) XFS is simply broken: if we call sync() it should sync metadata;
this just happens to be triggered by a).

Lennart
--
Lennart Poettering, Red Hat
Lennart Poettering
2017-03-30 10:24:48 UTC
Post by Chris Murphy
Post by Andrei Borzenkov
Result code of "remount ro" is not evaluated or logged. systemd does
(void) mount(NULL, m->path, NULL, MS_REMOUNT|MS_RDONLY, options);
where "options" are those from /proc/self/mountinfo sans ro|rw.
Probably it should log it at least with debug level.
So I've asked over on the XFS list about this, and they suggest all of this
is expected behavior under the circumstances. The sync only means data
is committed to disk with an appropriate journal entry, it doesn't
mean fs metadata is up to date, and it's the fs metadata that GRUB is
depending on, but isn't up to date yet. So the suggestion is that if
remount-ro fails, to use freeze/unfreeze and then reboot. The
I am sorry, but XFS is really broken here. All init systems since time
began kinda did the same thing when shutting down:

a) try to unmount all fs that can be unmounted
b) for the remaining ones, try to remount ro (the root fs usually qualifies)
c) sync()
d) reboot()

That's how sysvinit does it, how Upstart does it, and systemd does it
the same way. (Well, if the initrd supports it we go one step further
though, and optionally pivot back to the initrd which can then unmount
the root file system, too. That's a systemd innovation however, and
only supported on initrd systems where the initrd supports it)
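The a)-d) sequence above, as a dry-run Python sketch (my illustration; only sync() is really invoked here, since the destructive steps need root and would take the machine down):

```python
import os

def shutdown_sequence(mounts):
    """Dry-run sketch of the classic init shutdown sequence."""
    for mp in sorted(mounts, key=len, reverse=True):  # deepest mount points first
        if mp != "/":
            print(f"umount {mp}")       # a) unmount everything that can be unmounted
    print("mount -o remount,ro /")      # b) remount the rest (the root fs) read-only
    os.sync()                           # c) sync(): flush data and metadata to disk
    print("reboot()")                   # d) reboot (printed here, not executed)

shutdown_sequence(["/", "/boot", "/tmp"])
```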

If the XFS devs think that the sync() before reboot() can be partially
ignored, then I am sorry for them, but that makes XFS pretty much
incompatible with every init system in existence.

Or to say this differently: if they expect us to invoke some magic
per-filesystem ioctl() before reboot(), then that's nonsense. No init
system calls that, and I am strongly against such hacks. They should
just fix their APIs.

Moreover, the man page of sync() is pretty clear on this:

"sync() causes all pending modifications to file system
metadata and cached file data to be written to the
underlying filesystems."

It explicitly mentions metadata. Any way you turn it, the XFS folks
are just confused if they really claim sync() doesn't have to sync
metadata. History says differently, and so does the man page
documentation.
Post by Chris Murphy
If it's useful I'll file an issue with systemd on github to get a
freeze/unfreeze inserted. remount-ro isn't always successful, and
clearly it's not ok to reboot anyway if remount-ro fails.
I don't think we'd merge such a patch. The XFS folks should implement
documented behaviour and that'll not just fix things with systemd, but
with any init system.

Lennart
--
Lennart Poettering, Red Hat
Mantas Mikulėnas
2017-03-30 10:37:04 UTC
Post by Lennart Poettering
Or to say this differently: if they expect us to invoke some magic
per-filesystem ioctl() before reboot(), then that's nonsense. No init
system calls that, and I am strongly against such hacks. They should
just fix their APIs.
On the other hand, no other init system generally supports exclusions from
the killing spree...

As for freezing, that feature seems to have been made generic in 2.6.28
(FIFREEZE/FITHAW), although I couldn't find much documentation on it. It
looks mainly meant for snapshots and backups, not for regular reboots.
--
Mantas Mikulėnas <***@gmail.com>
Michael Chapman
2017-03-30 12:07:53 UTC
On Thu, 30 Mar 2017, Lennart Poettering wrote:
[...]
Post by Lennart Poettering
I am sorry, but XFS is really broken here. All init systems since time
a) try to unmount all fs that can be unmounted
b) for the remaining ones, try to remount ro (the root fs usually qualifies)
c) sync()
d) reboot()
That's how sysvinit does it, how Upstart does it, and systemd does it
the same way. (Well, if the initrd supports it we go one step further
though, and optionally pivot back to the initrd which can then unmount
the root file system, too. That's a systemd innovation however, and
only supported on initrd systems where the initrd supports it)
If the XFS devs think that the sync() before reboot() can be partially
ignored, then I am sorry for them, but that makes XFS pretty much
incompatible with every init system in existence.
Or to say this differently: if they expect us to invoke some magic
per-filesystem ioctl() before reboot(), then that's nonsense. No init
system calls that, and I am strongly against such hacks. They should
just fix their APIs.
"sync() causes all pending modifications to file system
metadata and cached file data to be written to the
underlying filesystems."
It explicitly mentions metadata. Any way you turn it, the XFS folks
are just confused if they really claim sync() doesn't have to sync
metadata. History says differently, and so does the man page
documentation.
I am not a filesystem developer (IANAFD?), but I'm pretty sure they're
going to say "the metadata _is_ synced, it's in the journal". And it's
hard to argue with that. After all, the filesystem will be perfectly
valid the next time it is mounted, after the journal has been replayed,
and it will contain all data written prior to the sync call. It did
exactly what the manpage says it does.

The problem here seems to be that GRUB is an incomplete XFS
implementation, one which doesn't know about XFS journalling. That may
be a good argument that XFS shouldn't be used for /boot... but the
issue can really arise with just about any other journalled
filesystem, like ext3/4.

As Mantas Mikulėnas points out, the FIFREEZE ioctl is supported wherever
systemd is, and it's not just XFS-specific. I think it'd be smartest just
to use it because it's there, it's cheap, and it can't make things worse.
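A sketch of what that would look like (my code, not a proposed systemd patch; FIFREEZE needs root and a filesystem that supports freezing):

```python
# Freeze then immediately thaw a filesystem via the generic ioctls,
# forcing its metadata into a consistent on-disk state.
import fcntl
import os

# From linux/fs.h: _IOWR('X', 119, int) and _IOWR('X', 120, int)
FIFREEZE = 0xC0045877
FITHAW   = 0xC0045878

def quiesce(mount_point: str) -> None:
    fd = os.open(mount_point, os.O_RDONLY)
    try:
        fcntl.ioctl(fd, FIFREEZE, 0)  # blocks writers, flushes the fs clean
        fcntl.ioctl(fd, FITHAW, 0)    # thaw so the remaining shutdown writes proceed
    finally:
        os.close(fd)
```

This is essentially what xfs_freeze(8) does, but the ioctl pair works on any freezable filesystem.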
--
Michael
Chris Murphy
2017-04-03 04:56:42 UTC
I am not a filesystem developer (IANAFD?), but I'm pretty sure they're going
to say "the metadata _is_ synced, it's in the journal". And it's hard to
argue that. After all, the filesystem will be perfectly valid the next time
it is mounted, after the journal has been replayed, and it will contain all
data written prior to the sync call. It did exactly what the manpage says it
does.
That's their position.

Also, the same file system dirtiness and journal replay is needed on
ext4. The sample size is too small to say categorically that the same
problem can't happen on ext4 in the same situation. Maybe the grub.cfg
is readable, but maybe the kernel isn't, or the initramfs, or
something else.
The problem here seems to be that GRUB is an incomplete XFS implementation,
one which doesn't know about XFS journalling. It may be a good argument XFS
shouldn't be used for /boot... but the issue can really arise with just
about any other journalled filesystems, like Ext3/4.
I wondered about it at the start, and asked about it on the XFS list
in the first post about the problem. The developers nearly died
laughing at the idea of doing journal replay in 640KiB of memory. They
said categorically it's not possible.
As Mantas Mikulėnas points out, the FIFREEZE ioctl is supported wherever
systemd is, and it's not just XFS-specific. I think it'd be smartest just to
use it because it's there, it's cheap, and it can't make things worse.
I think getting mount/umount exit codes reported in the journal when
systemd.log_level=debug should be a higher priority. We really ought
to find out exactly what's going on, so we don't have to speculate,
and I think it's handy to have anyway if it's not a PITA to implement.

I've retested after removing plymouth, and the problem isn't
reproducible; I've nominated the plymouth non-killable behavior as a
Fedora 26 blocker. So it should get fixed and upstreamed.
--
Chris Murphy
Andrei Borzenkov
2017-04-04 17:55:27 UTC
Post by Chris Murphy
I am not a filesystem developer (IANAFD?), but I'm pretty sure they're going
to say "the metadata _is_ synced, it's in the journal". And it's hard to
argue that. After all, the filesystem will be perfectly valid the next time
it is mounted, after the journal has been replayed, and it will contain all
data written prior to the sync call. It did exactly what the manpage says it
does.
That's their position.
Also, the same file system dirtiness and journal replay is needed on
ext4. The sample size is too small to say categorically that the same
problem can't happen on ext4 in the same situation. Maybe the grub.cfg
is readable, but maybe the kernel isn't, or the initramfs, or
something else.
Yes, I have seen the same on ext4 which prompted me to play with journal
replay code. Unfortunately I do not know how to reliably trigger this
condition.
Post by Chris Murphy
The problem here seems to be that GRUB is an incomplete XFS implementation,
one which doesn't know about XFS journalling. It may be a good argument XFS
shouldn't be used for /boot... but the issue can really arise with just
about any other journalled filesystems, like Ext3/4.
I wondered about it at the start, and asked about it on the XFS list
in the first post about the problem. The developers nearly died
laughing at the idea of doing journal replay in 640KiB of memory. They
said categorically it's not possible.
grub2 is not limited to 640KiB. Actually, it will actively avoid using
low memory. It switches to protected mode as the very first thing and
can use up to 4GiB (and even this limit can probably be lifted on 64-bit
platforms). The real problem is the fact that grub is read-only, so every
time you access a file on a journaled partition it will need to replay the
journal again from scratch. This will likely be painfully slow (I
remember that grub legacy on reiser needed a couple of minutes to read
the kernel and much more to read the initrd, and that was when both were
smaller than now).
Chris Murphy
2017-04-08 17:28:00 UTC
Post by Andrei Borzenkov
Post by Chris Murphy
I am not a filesystem developer (IANAFD?), but I'm pretty sure they're going
to say "the metadata _is_ synced, it's in the journal". And it's hard to
argue that. After all, the filesystem will be perfectly valid the next time
it is mounted, after the journal has been replayed, and it will contain all
data written prior to the sync call. It did exactly what the manpage says it
does.
That's their position.
Also, the same file system dirtiness and journal replay is needed on
ext4. The sample size is too small to say categorically that the same
problem can't happen on ext4 in the same situation. Maybe the grub.cfg
is readable, but maybe the kernel isn't, or the initramfs, or
something else.
Yes, I have seen the same on ext4 which prompted me to play with journal
replay code. Unfortunately I do not know how to reliably trigger this
condition.
I can reliably trigger a dirty ext4 or XFS file system 100% of the
time with all recent Fedora installations when doing an offline
update. What's very non-deterministic is how this dirtiness will
manifest. Filesystem folks basically live in an alternate reality
where the farther in time a file system is from mkfs, the more
non-deterministically it behaves. *shrug*
Post by Andrei Borzenkov
Post by Chris Murphy
The problem here seems to be that GRUB is an incomplete XFS implementation,
one which doesn't know about XFS journalling. It may be a good argument XFS
shouldn't be used for /boot... but the issue can really arise with just
about any other journalled filesystems, like Ext3/4.
I wondered about it at the start, and asked about it on the XFS list
in the first post about the problem. The developers nearly died
laughing at the idea of doing journal replay in 640KiB of memory. They
said categorically it's not possible.
grub2 is not limited to 640KiB. Actually it will actively avoid using
low memory. It switches to protected mode as the very first thing and
can use up to 4GiB (and even this probably can be lifted on 64 bit
platform). The real problem is the fact that grub is read-only so every
time you access file on journaled partition it will need to replay
journal again from scratch. This will likely be painfully slow (I
remember that grub legacy on reiser needed couple of minutes to read
kernel and much more to read initrd, and that was when both were smaller
than now).
OK well that makes more sense; but yeah it still sounds like journal
replay is a non-starter. The entire fs metadata would have to be read
into memory and create something like a RAM based rw snapshot which is
backed by the ro disk version as origin, and then play the log against
the RAM snapshot. That could be faster than constantly replaying the
journal from scratch for each file access. But still - sounds overly
complicated.

I think this qualifies as "Doctor, it hurts when I do this." And the
doctor says, "So don't do that." And I'm referring to Plymouth
exempting itself from kill while also not running from the initramfs. So
I'll kindly make the case with the Plymouth folks to stop pressing this
particular hurt-me button.

But hey, pretty cool bug. It's not often you find such an old bug
that's so easily reproducible, but as near as I can tell only one person
was hitting it until I tried to reproduce it.
--
Chris Murphy
Michael Chapman
2017-04-09 00:11:56 UTC
Post by Chris Murphy
Post by Andrei Borzenkov
Post by Chris Murphy
I am not a filesystem developer (IANAFD?), but I'm pretty sure they're going
to say "the metadata _is_ synced, it's in the journal". And it's hard to
argue that. After all, the filesystem will be perfectly valid the next time
it is mounted, after the journal has been replayed, and it will contain all
data written prior to the sync call. It did exactly what the manpage says it
does.
That's their position.
Also, the same file system dirtiness and journal replay is needed on
ext4. The sample size is too small to say categorically that the same
problem can't happen on ext4 in the same situation. Maybe the grub.cfg
is readable, but maybe the kernel isn't, or the initramfs, or
something else.
Yes, I have seen the same on ext4 which prompted me to play with journal
replay code. Unfortunately I do not know how to reliably trigger this
condition.
I can reliably trigger a dirty ext4 or XFS file system 100% of the
time with all recent Fedora installations when doing an offline
update. What's very non-deterministic is how this dirtiness will
manifest. Filesystems folks basically live in an alternate reality
where the farther in time a file system is from mkfs time, the more
non-deterministic the file system behaves. *shrug*
They don't expect their filesystems to be used except through their own
filesystem code. It is perfectly deterministic behaviour when their
filesystem code is used. Their logic seems _very_ reasonable to me.

Don't forget, they've provided an interface for software to use if it
needs more than the guarantees provided by sync. Informally speaking, the
FIFREEZE ioctl is intended to place a filesystem into a "fully consistent"
state, not just a "fully recoverable" state. (Formally it's all a bit
hazy: POSIX really doesn't guarantee anything with sync.)

Currently systemd calls sync at shutdown. It doesn't need to do that;
it could have just assumed all other software is written correctly. It
calls sync as a courtesy to that other software.

I really do think systemd ought to freeze the filesystem at the same time,
for _exactly the same reasons_. That will solve this Plymouth problem, but
it will also solve every other software that somebody might run (possibly
accidentally, possibly not) during late shutdown.

This problem doesn't just affect GRUB, it could affect users of other
operating systems too. I was speaking to somebody who runs OpenBSD.
Apparently OpenBSD doesn't have an Ext3 driver, only an Ext2 one, so it is
somewhat common practice to use an Ext3 filesystem on Linux but mount it
as Ext2 on OpenBSD. That can only work correctly if the filesystem's
journal is completely flushed. systemd is the only thing that can do this
reliably, since it's the only thing running just before the reboot call.
Lennart Poettering
2017-04-09 11:17:18 UTC
Don't forget, they've provided an interface for software to use if it needs
more than the guarantees provided by sync. Informally speaking, the FIFREEZE
ioctl is intended to place a filesystem into a "fully consistent" state, not
just a "fully recoverable" state. (Formally it's all a bit hazy: POSIX
really doesn't guarantee anything with sync.)
If FIFREEZE is a generic ioctl() supported by a number of different
file systems, I figure I would be much more OK with calling it.

That said, are you sure FIFREEZE is really what we want there? it
appears to also pause any further writes to disk (until FITHAW is
called). Which isn't really what we are interested in here (note that
we return back to the initrd after the umount spree and it shall be
able to do the rest, and if it actually can do that, then the file
systems should be able to unmount and that usually results in writes
to disk...)

So, I am still puzzled why the file system people think that "sync()"
isn't supposed to actually sync things to disk... I mean, it appears
the call is pretty much useless, and its traditional usage (most
prominently in sysvinit before reboot()) appears to be broken by
their behaviour.

Why bother with sync() at all, if it implies no guarantees? This is
quite frankly bullshit...

It appears to me that using /boot on a file system with such broken
sync() semantics is really not a safe thing to do, and people should
probably only use something more reliable, i.e. ext2 or vfat, where
sync() actually works correctly...

Lennart
--
Lennart Poettering, Red Hat
Chris Murphy
2017-04-10 04:37:36 UTC
On Sun, Apr 9, 2017 at 5:17 AM, Lennart Poettering
Post by Lennart Poettering
That said, are you sure FIFREEZE is really what we want there? it
appears to also pause any further writes to disk (until FITHAW is
called).
So, I am still puzzled why the file system people think that "sync()"
isn't supposed to actually sync things to disk...
https://www.spinics.net/lists/linux-xfs/msg05113.html

The question isn't directly answered in there (it is part of the
thread on this very subject, though). My guess is that sync()
predates journaled file systems, and the expectations of sync() for a
journaled file system are basically just crash consistency, not that all
metadata is on disk. Fully writing all metadata is expensive; so is
checking and fixing it with an offline fsck. Both of those are reasons why
we have journaled filesystems. If sync() required all fs metadata to be
committed to stable media it would make file systems dog slow. Every damn
thing is doing fsyncs now. Before Btrfs had a log tree, workloads
with many fsyncs would hang the file system and the entire workload
as well, so my guess is that sync() meaning all fs metadata is committed on
ext4 and XFS would mean massive performance hits that no one would be
happy about.
Post by Lennart Poettering
Why bother with sync() at all, if it implies no guarantees? This is
quite frankly bullshit...
It appears to me that using /boot on a file system whith such broken
sync() semantics is really not a safe thing to do, and people should
probably only use something more reliable, i.e. ext2 or vfat where
sync() actually works correctly...
Oh god - that's the opposite direction to go in. There's not even
pretend crash safety with those file systems. If they're dirty, you
must use an fsck to get them back to consistency. Even if the toy fs
support found in firmware will tolerate the inconsistency, who knows
what blocks it actually ends up loading into memory, you can just get
a crash later at the bootloader, or the kernel, or initramfs. That so
much consumer hardware routinely lies about having committed data to
stable media following sync() makes those file systems even less
reliable for this purpose. Once corrupt, the file system has no fail
safe or fallback like a journaled or COW file system. It's busted
until fixed with fsck.
--
Chris Murphy
Tomasz Torcz
2017-04-10 05:14:41 UTC
Post by Chris Murphy
On Sun, Apr 9, 2017 at 5:17 AM, Lennart Poettering
Post by Lennart Poettering
That said, are you sure FIFREEZE is really what we want there? it
appears to also pause any further writes to disk (until FITHAW is
called).
So, I am still puzzled why the file system people think that "sync()"
isn't supposed to actually sync things to disk...
https://www.spinics.net/lists/linux-xfs/msg05113.html
So the “solution” seems to be adding FIFREEZE/FITHAW ioctls after sync()?
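Roughly, yes. A minimal sketch of what that sequence might look like from
userspace (function name is mine; real code needs CAP_SYS_ADMIN and should
tolerate filesystems that don't support freezing):

```python
import fcntl
import os

FIFREEZE = 0xC0045877  # _IOWR('X', 119, int), from linux/fs.h
FITHAW = 0xC0045878    # _IOWR('X', 120, int)

def flush_journal(mount_point: str, ioctl=fcntl.ioctl) -> None:
    """Force a journalled filesystem fully on-disk: sync(), then an
    immediate freeze/thaw pair. The freeze checkpoints the log into the
    main filesystem structures; the thaw resumes writes right away.
    (Sketch only; the ioctl parameter is injectable purely so the call
    sequence can be exercised in tests without privileges.)
    """
    os.sync()
    fd = os.open(mount_point, os.O_RDONLY)
    try:
        ioctl(fd, FIFREEZE, 0)
        ioctl(fd, FITHAW, 0)
    finally:
        os.close(fd)
```

This is essentially what `fsfreeze -f` followed by `fsfreeze -u` does from
the shell.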
--
Tomasz Torcz "Never underestimate the bandwidth of a station
xmpp: ***@chrome.pl wagon filled with backup tapes." -- Jim Gray
Michael Chapman
2017-04-10 06:14:17 UTC
Post by Chris Murphy
On Sun, Apr 9, 2017 at 5:17 AM, Lennart Poettering
Post by Lennart Poettering
That said, are you sure FIFREEZE is really what we want there? it
appears to also pause any further writes to disk (until FITHAW is
called).
So, I am still puzzled why the file system people think that "sync()"
isn't supposed to actually sync things to disk...
https://www.spinics.net/lists/linux-xfs/msg05113.html
Ah good, Dave actually suggests using a freeze there. A freeze without a
corresponding thaw should be OK if it's definitely after all processes
have been killed, since we're just about to reboot anyway. (Obviously we'd
want to avoid the whole lot when running in a container or when doing
kexec.)

I'll try to reproduce the problem (I don't use Plymouth, so I haven't seen
it myself yet) and come up with a patch.
Mantas Mikulėnas
2017-04-10 06:17:27 UTC
Post by Michael Chapman
Post by Chris Murphy
On Sun, Apr 9, 2017 at 5:17 AM, Lennart Poettering
That said, are you sure FIFREEZE is really what we want there? it
Post by Lennart Poettering
appears to also pause any further writes to disk (until FITHAW is
called).
So, I am still puzzled why the file system people think that "sync()"
Post by Lennart Poettering
isn't supposed to actually sync things to disk...
https://www.spinics.net/lists/linux-xfs/msg05113.html
Ah good, Dave actually suggests using a freeze there. A freeze without a
corresponding thaw should be OK if it's definitely after all processes have
been killed, since we're just about to reboot anyway. (Obviously we'd want
to avoid the whole lot when running in a container or when doing kexec.)
Or, I think, when pivoting back to the shutdown-initramfs. (Though then you
also need the shutdown-initramfs to run `fsfreeze`, I guess?)
--
Mantas Mikulėnas <***@gmail.com>
Michael Chapman
2017-04-10 07:21:42 UTC
Post by Mantas Mikulėnas
Post by Michael Chapman
Post by Chris Murphy
On Sun, Apr 9, 2017 at 5:17 AM, Lennart Poettering
That said, are you sure FIFREEZE is really what we want there? it
Post by Lennart Poettering
appears to also pause any further writes to disk (until FITHAW is
called).
So, I am still puzzled why the file system people think that "sync()"
Post by Lennart Poettering
isn't supposed to actually sync things to disk...
https://www.spinics.net/lists/linux-xfs/msg05113.html
Ah good, Dave actually suggests using a freeze there. A freeze without a
corresponding thaw should be OK if it's definitely after all processes have
been killed, since we're just about to reboot anyway. (Obviously we'd want
to avoid the whole lot when running in a container or when doing kexec.)
Or, I think, when pivoting back to the shutdown-initramfs. (Though then you
also need the shutdown-initramfs to run `fsfreeze`, I guess?)
No, I don't think it should be done then. If a filesystem is still in use,
then doing a freeze there would likely make any processes still using it
unkillable. And doing a freeze followed by a thaw doesn't gain us
much, we'd still need to do another freeze at the end of
shutdown-initramfs.

So I think we should only freeze any still-mounted filesystems *right*
before the reboot(2) call. That's the only time it's guaranteed to be safe
-- if there's still miraculously some other process hanging around, it's
about to disappear anyway.
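A rough sketch of that last step, assuming we walk /proc/self/mountinfo for
whatever is still mounted read-write (helper names are mine; real code would
also have to skip network and pseudo-filesystems):

```python
import fcntl
import os

FIFREEZE = 0xC0045877  # generic VFS ioctl, _IOWR('X', 119, int)

def writable_mounts(mountinfo: str) -> list:
    """Pick mount points still mounted read-write out of mountinfo text.

    In /proc/self/mountinfo, field 5 (0-indexed 4) is the mount point
    and field 6 holds the per-mount options, starting with rw or ro.
    """
    points = []
    for line in mountinfo.splitlines():
        fields = line.split()
        if len(fields) > 5 and fields[5].split(",")[0] == "rw":
            points.append(fields[4])
    return points

def freeze_all_before_reboot(ioctl=fcntl.ioctl) -> None:
    """Last step before reboot(2): freeze whatever is still writable.

    Deliberately no FITHAW: any surviving process is about to go away.
    Sketch only; real code must ignore filesystems that can't freeze.
    """
    with open("/proc/self/mountinfo") as f:
        mounts = writable_mounts(f.read())
    for point in mounts:
        fd = os.open(point, os.O_RDONLY)
        try:
            ioctl(fd, FIFREEZE, 0)
        except OSError:
            pass  # e.g. tmpfs or proc: freezing unsupported, skip
        finally:
            os.close(fd)
```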

On the topic of XFS filesystem freezing, I just found this slide deck:

http://oss.sgi.com/projects/xfs/training/xfs_slides_09_internals.pdf

Page 39 is of particular interest:

"""
sync(2)

* XFS implements an optimization to sync(2) of metadata:
- XFS will only force the log out, such that any dirty
metadata that is incore is written to the log only,
the metadata itself is not necessarily written
- This is safe, since all change is ondisk
- File data is guaranteed too (even barriers)

* Log and metadata are written to disk for
- freeze/thaw
- remount ro
- unmount

Applications like grub have been bitten in the past, but fixed nowadays
"""

I'm not sure what it's referring to with GRUB there, but at least this
confirms what the filesystem developers' intentions are with the sync(2)
call.
Lennart Poettering
2017-04-10 09:11:37 UTC
Post by Michael Chapman
Post by Mantas Mikulėnas
Or, I think, when pivoting back to the shutdown-initramfs. (Though then you
also need the shutdown-initramfs to run `fsfreeze`, I guess?)
No, I don't think it should be done then. If a filesystem is still in use,
then doing a freeze there would likely make any processes still using it
unkillable. And doing a freeze followed by a thaw doesn't gain us much, we'd
still need to do another freeze at the end of shutdown-initramfs.
Hmm? Are you saying that on XFS you might even see corruption on files
that weren't accessed for write since the last freeze if you forget to
freeze when shutting down?

I mean, unless the initrd hooks modify the boot loader having done
FIFREEZE once sounds safe enough, no?

Lennart
--
Lennart Poettering, Red Hat
Michael Chapman
2017-04-10 09:41:53 UTC
Post by Lennart Poettering
Post by Michael Chapman
Post by Mantas Mikulėnas
Or, I think, when pivoting back to the shutdown-initramfs. (Though then you
also need the shutdown-initramfs to run `fsfreeze`, I guess?)
No, I don't think it should be done then. If a filesystem is still in use,
then doing a freeze there would likely make any processes still using it
unkillable. And doing a freeze followed by a thaw doesn't gain us much, we'd
still need to do another freeze at the end of shutdown-initramfs.
Hmm? Are you saing that on XFS you might even see corruption on files
that weren't accessed for write since the last freeze if you forget to
freeze when shutting down?
No, I'm not saying that at all.
Post by Lennart Poettering
I mean, unless the initrd hooks modify the boot loader having done
FIFREEZE once sounds safe enough, no?
Mantas Mikulėnas suggested doing a freeze on the pivot back to the
shutdown-initramfs. But that's no good: any argv[0][0] == '@' processes
could be writing to the filesystems then.

This is why I've been stressing that the filesystem freezes (and thaws
too, if necessary) should only happen right before the reboot(2) syscall.
Lennart Poettering
2017-04-10 08:55:16 UTC
Post by Michael Chapman
Post by Chris Murphy
On Sun, Apr 9, 2017 at 5:17 AM, Lennart Poettering
Post by Lennart Poettering
That said, are you sure FIFREEZE is really what we want there? it
appears to also pause any further writes to disk (until FITHAW is
called).
So, I am still puzzled why the file system people think that "sync()"
isn't supposed to actually sync things to disk...
https://www.spinics.net/lists/linux-xfs/msg05113.html
Ah good, Dave actually suggests using a freeze there. A freeze without a
corresponding thaw should be OK if it's definitely after all processes have
been killed, since we're just about to reboot anyway. (Obviously we'd want
to avoid the whole lot when running in a container or when doing
kexec.)
No, there is no such guarantee. We support initrds that run userspace
stuff from the initrd at boot, which stays around in the background and is
only killed after we transition back into the initrd. And we really
don't control what they do; they can do anything they like, access any
file they want at any time. We added this primarily to support storage
services backing the root file system (think iscsid, nbd, ...), but it
can actually be anything that has the "feel" of a kernel component in
being around from before systemd initializes until after it has shut
down again, but is actually implemented in userspace.

In fact, this is precisely what plymouth is making use of: by marking
a process with argv[0][0] = '@' we permit any privileged process to be
excluded from the final killing spree, so that it will survive until
the initrd shutdown transition.

So no, "freeze" is not an option. That sounds like a recipe to make
shutdown hang. We need a sync() that actually does what is documented
and sync the file system properly.

Lennart
--
Lennart Poettering, Red Hat
Michael Chapman
2017-04-10 09:07:37 UTC
Post by Lennart Poettering
Post by Michael Chapman
Post by Chris Murphy
On Sun, Apr 9, 2017 at 5:17 AM, Lennart Poettering
Post by Lennart Poettering
That said, are you sure FIFREEZE is really what we want there? it
appears to also pause any further writes to disk (until FITHAW is
called).
So, I am still puzzled why the file system people think that "sync()"
isn't supposed to actually sync things to disk...
https://www.spinics.net/lists/linux-xfs/msg05113.html
Ah good, Dave actually suggests using a freeze there. A freeze without a
corresponding thaw should be OK if it's definitely after all processes have
been killed, since we're just about to reboot anyway. (Obviously we'd want
to avoid the whole lot when running in a container or when doing
kexec.)
No, there is no such guarantee. We support initrds that run userspace
stuff from the initrd at boot, that stays around in the background is
only killed after we transition back into the initrd. And we really
don't control what they do, they can do anything they like, access any
file they want at any time. We added this primarily to support storage
services backing the root file system (think iscsid, nbd, ...), but it
actually can be anything that hsa the "feel" of a kernel component in
being around since the time before systemd initialiazes until after
the time it shut down again, but is actually implemented in userspace.
In fact, this is precisely what plymouth is making use of: by marking
a process with argv[0][0] = '@' we permit any privileged process to be
excluded from the final killing spree, so that it will survive until
the initrd shutdown transition.
This is precisely why I intend to add it _just before_ the reboot(2) call.
Any processes that have survived that far are going to stop running a very
short moment later anyway; it doesn't matter if they get hung on a write.

Note that I am specifically NOT talking about doing a filesystem freeze
on the shutdown-initrd transition. That would be ludicrous.
Post by Lennart Poettering
So no, "freeze" is not an option. That sounds like a recipe to make
shutdown hang. We need a sync() that actually does what is documented
and sync the file system properly.
sync() is never going to work the way you want it to work. Let's make
systemd work correctly for the systems we have today, not some
hypothetical system of the future.

The filesystem developers have good reasons for sync()'s current
behaviour. I can only point out again that the way they've designed it
does *not* lose or corrupt data: all synced data is available as soon
as the filesystem journals have been flushed. We have to explicitly flush
the journals ourselves, one way or another, to ensure that GRUB and other
not-fully-Linux-compatible filesystem implementations work correctly.
Lennart Poettering
2017-04-10 10:44:22 UTC
Post by Michael Chapman
Post by Lennart Poettering
So no, "freeze" is not an option. That sounds like a recipe to make
shutdown hang. We need a sync() that actually does what is documented
and sync the file system properly.
sync() is never going to work the way you want it to work. Let's make
systemd work correctly for the systems we have today, not some hypothetical
system of the future.
It works the way I want on vfat, ext2. The problem you are having is
specific to XFS, no?
Post by Michael Chapman
The filesystem developers have good reasons for sync()'s current behaviour.
I can only point out again that the way they've designed it does *not* lose
or corrupt data: all synced data is available as soon as the filesystem
journals have been flushed. We have to explicitly flush the journals
ourselves, one way or another, to ensure that GRUB and other
not-fully-Linux-compatible filesystem implementations work correctly.
The data *is* lost from the perspective of a boot loader. And given
that /boot is pretty much exclusively about boot loading, that's kinda
major.

Note that these weird XFS semantics are not only a problem on systemd
btw: they are much worse on sysvinit and other simpler init systems,
since they generally don't have the kill/umount/remount/detach loop we
have, and don't support transitioning back into the initrd for
complete detaching/umounting of the root fs either.

Hence, any claims by the xfs folks that systemd doesn't disassemble
things the right way are very wrong: systemd is certainly the one
implementation that has a better chance of keeping xfs sane than any
other...

Lennart
--
Lennart Poettering, Red Hat
Chris Murphy
2017-04-11 02:20:54 UTC
On Mon, Apr 10, 2017 at 4:44 AM, Lennart Poettering
Post by Lennart Poettering
Post by Michael Chapman
Post by Lennart Poettering
So no, "freeze" is not an option. That sounds like a recipe to make
shutdown hang. We need a sync() that actually does what is documented
and sync the file system properly.
sync() is never going to work the way you want it to work. Let's make
systemd work correctly for the systems we have today, not some hypothetical
system of the future.
It works the way I want on vfat, ext2. The problem you are having is
specific to XFS, no?
ext3 and ext4 are also dirty after doing updates; it's just not
causing boot failure, because during startup fsck is fixing things up.

Btrfs doesn't complain, but btrfs-debug-tree immediately after the
offline update reboot (without mounting), compared to btrfs-debug-tree
following a mount (but not booting, reading, or modifying anything)
shows considerable changes are made to the file system just due to the
mount. So something was left stale, and I'm guessing it was sync()
causing things to get stuffed into the log tree, which is then cleaned
up at the next mount. It's not corruption, and it's not even really dirty
in Btrfs semantics, but functionally I guess you'd say it was fixing
itself back up, per design.

And BTW, this is in the XFS list thread, but it's not merely the
grub.cfg that's missing in action. It's a large pile of files
including the kernel and initramfs. None of those new files exist yet
from the perspective of the bootloader.
Post by Lennart Poettering
Post by Michael Chapman
The filesystem developers have good reasons for sync()'s current behaviour.
I can only point out again that the way they've designed it does *not* lose
or corrupt data: all synced data is available as soon as the filesystem
journals have been flushed. We have to explicitly flush the journals
ourselves, one way or another, to ensure that GRUB and other
not-fully-Linux-compatible filesystem implementations work correctly.
The data *is* lost from the perspective of a boot loader. And given
that /boot is pretty much exclusively about boot loading, that's kinda
major.
Right. So let's play the blame game for a sec:

1. The kernel update package is most responsible for the change in
boot state. It's changing kernel, modules, initramfs, and the
bootloader configuration file. So it could be argued, this is the
thing that should do freeze/thaw to make certain the bootloader will
still be happy at next boot.

2. Bootloader has no fallback. The bootloader configuration is
modified in a non-atomic way. In a sense, we should have
bootloader.old and bootloader.new and preferably use the new one, but
if it's not found, use the old (unmodified) one. At the least, we get a
normal boot with the old configuration and kernel, the kernel code
cleans up the file system, and the next boot has the updated kernel
and bootloader config.

3. Blame the thing that prevents umount and remount-ro: in the example
case it's plymouth.

4. Systemd, for not enforcing a limited kill exemption for programs
running from the initramfs, i.e. ignoring the kill exemption if the
program is running from anywhere other than the initramfs.

5. The OS installer. It might very well be we've passed the point
where it's safe for /boot to be a directory on rootfs. If almost
anything can someday pin the file system and prevent umount or
remount-ro, and thereby make kernel, initramfs, and bootloader config
file changes invisible to the bootloader - that's a good reason to
separate those files from a pinned file system.

This bug is interesting because all of these are valid to blame. But
which is the most convincing? It's sort of difficult. And in the end,
it might be that the least to blame is in the best position to just
clobber the problem, preventing it from happening for all use cases.
Post by Lennart Poettering
Note that these weird XFS semantics are not only a problem on systemd
btw: they are much worse on sysvinit and other simpler init systems,
since they generally don't have the kill/umount/remount/detach loop we
have, and don't support transitioning back into the initrd for
complete detaching/umounting of the root fs either.
Hence, any claims by the xfs folks that systemd doesn't disassemble
things the right way is very wrong: systemd is certainly the one
implementation that has a better chance to keep xfs sane than any
other...
Yes, I think that assertion made on the XFS list by one developer is
unconvincing.
--
Chris Murphy
Lennart Poettering
2017-04-17 11:06:37 UTC
Post by Chris Murphy
4. Systemd, for not limiting the kill exemption to processes running
from the initramfs, i.e. ignoring the exemption if the program is
running from anywhere other than the initramfs.
Well, we are not the police, and we do kill everything by default,
even though we have this explicit, privileged opt-out of this. If
people misuse it, then I am pretty sure it's on them, not us...

That said, I will subscribe to the request that systemd's shutdown
logic should go the safest way possible, and hence I am fine with
calling the generic FIFREEZE+FITHAW ioctls one after the other, if
that helps, even though I think this is really broken API.

Lennart
--
Lennart Poettering, Red Hat
Chris Murphy
2017-05-19 19:46:14 UTC
FYI the file system folks are discussing this. It is not just a
problem with XFS; it can affect ext4 too. And it's far from clear that
the fs folks have a solution that won't cause worse problems.


http://www.spinics.net/lists/linux-fsdevel/msg111058.html


Chris Murphy
Chris Murphy
2017-06-07 16:32:24 UTC
Post by Chris Murphy
FYI the file system folks are discussing this. It is not just a
problem with XFS; it can affect ext4 too. And it's far from clear that
the fs folks have a solution that won't cause worse problems.
OK, so this is what I got out of those conversations:

sync()   -> write data to disk, write metadata to the log.
FIFREEZE -> sync(), then write the log contents to the fs.
umount() -> sync(), then write the log contents to the fs.
reboot() -> sync(), then reboot.

Only on non-journaled file systems does sync() mean write data to
disk, write metadata to fs, because there's no log.

sync() only makes the file system crash safe. It doesn't mean the
bootloader can find files (configuration, kernel, initramfs) if they
are only sync()'d, because the bootloader has no idea how to read the
log. And the fs itself isn't up to date because the log is dirty.

The most central blame here goes to the bootloader: specifically that
which makes changes to the bootloader configuration in a manner that
(pretty much) guarantees the bootloader proper (the binary that
executes after POST) will not be able to find either the old or new
configuration. At the least if it found the old configuration, it
would boot the old kernel and initramfs, which would then cause the
journal to be replayed, the file system updated, and on next boot, the
new configuration, kernel, and initramfs would get used. Because the
bootloader has a special requirement, since it cannot read dirty logs,
the thing making bootloader related changes needs to make sure that
its updates are not merely crash safe, but are actually fully
committed to the file system. That requires fsfreeze.

That implicates grub-mkconfig (for GRUB), grubby (not related to GRUB,
used on Red Hat, CentOS, Fedora systems), and myriad kernel package
scripts that modify bootloader configurations, kernels, and initramfs
out in the wild. The first two: grub-mkconfig and grubby, probably
represent a fairly good chunk of deployments. But there's still a
bunch of non-Red Hat systems that do not use GRUB and do not use
grubby; they depend on the kernel package post-install scripts to make
bootloader changes, and those scripts are what would need to do the
fsfreeze.

Or systemd can help pick up some of the slack and make sure that one
of three things definitely happens before a reboot: umount,
remount-ro, or fsfreeze. Of course, not every distro uses systemd, so
on those distros only solving the central problem is a solution, but
in either case that's not systemd's responsibility.
--
Chris Murphy
Lennart Poettering
2017-04-10 08:42:51 UTC
Post by Chris Murphy
Oh god - that's the opposite direction to go in. There's not even
pretend crash safety with those file systems. If they're dirty, you
must use an fsck to get them back to consistency. Even if the toy fs
support found in firmware will tolerate the inconsistency, who knows
what blocks it actually ends up loading into memory, you can just get
a crash later at the bootloader, or the kernel, or initramfs. That so
much consumer hardware routinely lies about having committed data to
stable media following sync() makes those file systems even less
reliable for this purpose. Once corrupt, the file system has no fail
safe or fallback like a journaled or COW file system. It's busted
until fixed with fsck.
Well, note that in a systemd world where systemd manages the ESP
there's a pretty good chance the file system stays in a clean state,
since we unmount it 2s after each write, and only make it
available via autofs. So yeah, there's a chance for corruption only
in a short time frame around a boot loader update. Which is
certainly much better than a corruption on every disk change like on
XFS...

Lennart
--
Lennart Poettering, Red Hat
Mantas Mikulėnas
2017-04-10 06:16:33 UTC
Post by Mantas Mikulėnas
Post by Michael Chapman
Don't forget, they've provided an interface for software to use if it
needs more than the guarantees provided by sync. Informally speaking,
the FIFREEZE ioctl is intended to place a filesystem into a "fully
consistent" state, not just a "fully recoverable" state. (Formally
it's all a bit hazy: POSIX really doesn't guarantee anything with
sync.)
If FIFREEZE is a generic ioctl() supported by a number of different
file systems, I figure it would be much more OK to call it.
That said, are you sure FIFREEZE is really what we want there? it
appears to also pause any further writes to disk (until FITHAW is
called). Which isn't really what we are interested in here (note that
we return back to the initrd after the umount spree and it shall be
able to do the rest, and if it actually can do that, then the file
systems should be able to unmount and that usually results in writes
to disk...)
So, I am still puzzled why the file system people think that "sync()"
isn't supposed to actually sync things to disk... I mean, it appears
the call is pretty much useless, and its traditional usage (most
prominently in sysvinit before reboot()) appears to be broken by
their behaviour.
Why bother with sync() at all, if it implies no guarantees? This is
quite frankly bullshit...
It appears to me that using /boot on a file system with such broken
sync() semantics is really not a safe thing to do, and people should
probably only use something more reliable, i.e. ext2 or vfat where
sync() actually works correctly...
It does? My /boot is vfat due to UEFI requirements, and it becomes
unbootable if you as much as sneeze near it – I've already had to repair it
thrice, after a sync and everything.
--
Mantas Mikulėnas <***@gmail.com>
Lennart Poettering
2017-04-10 08:35:29 UTC
Post by Michael Chapman
Don't forget, they've provided an interface for software to use if it needs
more than the guarantees provided by sync. Informally speaking, the FIFREEZE
ioctl is intended to place a filesystem into a "fully consistent" state, not
just a "fully recoverable" state. (Formally it's all a bit hazy: POSIX
really doesn't guarantee anything with sync.)
FIFREEZE does considerably more than what you suggest: it also pauses
all further changes until FITHAW is called. And that's semantics we
really cannot have.

Lennart
--
Lennart Poettering, Red Hat
Michael Chapman
2017-04-10 08:45:56 UTC
Post by Lennart Poettering
Don't forget, they've provided an interface for software to use if it needs
more than the guarantees provided by sync. Informally speaking, the FIFREEZE
ioctl is intended to place a filesystem into a "fully consistent" state, not
just a "fully recoverable" state. (Formally it's all a bit hazy: POSIX
really doesn't guarantee anything with sync.)
FIFREEZE does considerably more than what you suggest: it also pauses
all further changes until FITHAW is called. And that's semantics we
really cannot have.
If systemd is just about to call reboot(2), why does it matter?

I do think we should attempt to remount readonly before doing the
FIFREEZE. I thought systemd did that, but it appears that it does not. A
readonly remount will do what we want so long as no remaining processes
have any files opened for writing on the filesystem. The FIFREEZE would
only be necessary when the remount fails.

Remember, all of this is because there *is* software that does the wrong
thing, and it *is* possible for software to hang and be unkillable. It
would be good for systemd to do the right thing even in the presence of
that kind of software.
Lennart Poettering
2017-04-10 09:04:45 UTC
Post by Michael Chapman
Post by Lennart Poettering
[...]
FIFREEZE does considerably more than what you suggest: it also pauses
all further changes until FITHAW is called. And that's semantics we
really cannot have.
If systemd is just about to call reboot(2), why does it matter?
Well, in the general case we don't actually call reboot(), because we
instead transition back into the initrd, which then eventually calls
that. At least that's what happens on the major general purpose
distros that have an initrd that does that (for example: Fedora/RHEL
with Dracut).

Moreover, on the kernel side, various bits and pieces hook into the
reboot() syscall too and do last-minute stuff before going down. Are
you sure that if you have a complex storage setup (let's say DM on top
of loop on top of XFS on top of something else), that having frozen a
lower-level file system is not going to make the kernel itself pretty
unhappy if it then tries to clean up something further above?

I am sorry, but just making all accesses hang is just broken. That
can't work.
Post by Michael Chapman
I do think we should attempt to remount readonly before doing the FIFREEZE.
I thought systemd did that, but it appears that it does not. A readonly
remount will do what we want so long as no remaining processes have any
files opened for writing on the filesystem. The FIFREEZE would only be
necessary when the remount fails.
We remount read-only everything that we cannot unmount
outright. But do note that we can't do that in all cases. Most
prominently: consider a process that is running from an executable
that has been updated on disk (specifically: whose binary got deleted
because it was replaced by a newer version). This process will keep
the file pinned, and will block all read-only remounts, as the kernel
wants to mark the file properly deleted first, but it can't since the
process is keeping it pinned.

This is specifically the case that happened for Plymouth: the binary
probably got updated, hence the process in memory references a deleted
file, which blocks the read-only remounting, in which case we can't do
anything, and sync and remount.

Note that systemd itself always reexecutes itself on shutdown, to
ensure that if itself got updated during runtime we'll stop pinning
the old file.
Post by Michael Chapman
Remember, all of this is because there *is* software that does the wrong
thing, and it *is* possible for software to hang and be unkillable. It would
be good for systemd to do the right thing even in the presence of that kind
of software.
Yeah, we do what we can.

But I seriously doubt FIFREEZE will make things better. It's just
going to make shutdowns hang every now and then.

Lennart
--
Lennart Poettering, Red Hat
Michael Chapman
2017-04-10 09:38:35 UTC
Post by Lennart Poettering
Post by Michael Chapman
[...]
If systemd is just about to call reboot(2), why does it matter?
Well, in the general case we don't actually call reboot(), because we
instead transition back into the initrd, which then eventually calls
that. At least that's what happens on the major general purpose
distros that have an initrd that does that (for example: Fedora/RHEL
with Dracut).
If it's not systemd _inside_ the initrd calling reboot(2), then there's
nothing systemd can do about it.
Post by Lennart Poettering
Moreover, on the kernel side, various bits and pieces hook into the
reboot() syscall too and do last-minute stuff before going down. Are
you sure that if you have a complex storage setup (let's say DM on top
of loop on top of XFS on top of something else), that having frozen a
lower-level file system is not going to make the kernel itself pretty
unhappy if it then tries to clean up something further above?
OK, that is a good point.
Post by Lennart Poettering
I am sorry, but just making all accesses hang is just broken. That
can't work.
Post by Michael Chapman
I do think we should attempt to remount readonly before doing the FIFREEZE.
I thought systemd did that, but it appears that it does not. A readonly
remount will do what we want so long as no remaining processes have any
files opened for writing on the filesystem. The FIFREEZE would only be
necessary when the remount fails.
We remount read-only everything that we cannot unmount
outright.
Ah, I see the code for that now. I was looking for something after the
umount call (specifically, if umount failed), not before.
Post by Lennart Poettering
But do note that we can't do that in all cases. Most
prominently: consider a process that is running from an executable
that has been updated on disk (specifically: whose binary got deleted
because it was replaced by a newer version). This process will keep
the file pinned, and will block all read-only remounts, as the kernel
wants to mark the file properly deleted first, but it can't since the
process is keeping it pinned.
This is specifically the case that happened for Plymouth: the binary
probably got updated, hence the process in memory references a deleted
file, which blocks the read-only remounting, in which case we can't do
anything, and sync and remount.
OK, so how about this. _After_ the unmount-everything loop we do a freeze
+ thaw for each remaining filesystem, one filesystem at a time. That won't
permanently block processes that are still writing to the filesystems (and
why would they be?!), it will ensure that all filesystems' journals are
fully flushed (which will make GRUB and other OSs happy), and it won't
block the kernel from doing any kind of reboot()-time cleanups you were
talking about earlier.
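The pass Michael proposes could be sketched like so (an illustration
only, not systemd code): walk the mount table and do a freeze
immediately followed by a thaw on each real filesystem, silently
skipping the ones that refuse (unsupported fs, or no CAP_SYS_ADMIN):

```c
/* Freeze+thaw every remaining block-device-backed mount so each
 * journal is fully checkpointed before reboot(). Failures are
 * tolerated per mount; the immediate thaw means nothing stays frozen. */
#include <fcntl.h>
#include <mntent.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>   /* FIFREEZE, FITHAW */

int freeze_thaw_all(void)
{
    FILE *m = setmntent("/proc/self/mounts", "r");
    if (!m)
        return -1;

    struct mntent *e;
    int n = 0;
    while ((e = getmntent(m))) {
        /* skip pseudo-filesystems that have no backing device */
        if (strncmp(e->mnt_fsname, "/dev/", 5) != 0)
            continue;

        int fd = open(e->mnt_dir, O_RDONLY);
        if (fd < 0)
            continue;
        if (ioctl(fd, FIFREEZE, 0) == 0) {
            ioctl(fd, FITHAW, 0);    /* thaw right away, no hang */
            n++;
        }
        close(fd);
    }
    endmntent(m);
    return n;   /* number of filesystems actually flushed */
}
```

Run unprivileged this flushes nothing (every FIFREEZE fails with
EPERM) but still terminates cleanly, which is the property being
argued for: it cannot hang shutdown.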
Post by Lennart Poettering
Note that systemd itself always reexecutes itself on shutdown, to
ensure that if itself got updated during runtime we'll stop pinning
the old file.
Post by Michael Chapman
Remember, all of this is because there *is* software that does the wrong
thing, and it *is* possible for software to hang and be unkillable. It would
be good for systemd to do the right thing even in the presence of that kind
of software.
Yeah, we do what we can.
But I seriously doubt FIFREEZE will make things better. It's just
going to make shutdowns hang every now and then.
To be honest, I think having systems unbootable is a more serious problem
than having shutdowns hang. But I also think with a freeze _and_ a thaw
for each filesystem, we won't have hangs.
Lennart Poettering
2017-04-10 10:29:35 UTC
Post by Michael Chapman
[...]
If it's not systemd _inside_ the initrd calling reboot(2), then there's
nothing systemd can do about it.
The initrd usually doesn't run a systemd environment anymore, PID 1 is
usually a shell script of some kind. It might use our "reboot" binary
and call it with "-ff" (which means it's really just a pure reboot()
wrapper), but we don't do a umount/kill/detach spree in that case. Or
in other words: if they do use our reboot utility then they use it in
pure sysvinit compat mode, where it won't do more than sync() +
reboot(), exactly the same way as sysvinit did.
Post by Michael Chapman
[...]
Ah, I see the code for that now. I was looking for something after the
umount call (specifically, if umount failed), not before.
Well, the scheme works like this: we kill, umount, remount, detach in
a loop until nothing changes anymore. It's a primitive but robust way
to deal with stacked storage, where running processes might pin file
systems, which might in turn pin devices, which might in turn pin
backend userspace services, and so on... Hence, yes, we do the umount
first, and the remount second, but then we'll try another umount
and another remount, until this stops being fruitful.
Post by Michael Chapman
[...]
OK, so how about this. _After_ the unmount-everything loop we do a freeze +
thaw for each remaining filesystem, one filesystem at a time. That won't
permanently block processes that are still writing to the filesystems (and
why would they be?!), it will ensure that all filesystems' journals are
fully flushed (which will make GRUB and other OSs happy), and it won't block
the kernel from doing any kind of reboot()-time cleanups you were talking
about earlier.
Well, I figure that might work, but it's also fricking ugly: it feels
like booking a plane ticket that includes free airplane food, just
because you are hungry: you get considerably more than just the food,
and you have to sit uncomfortably for too long, end up where you
didn't want to go, and the food is quite awful too. ("Chicken or
pasta?")

I'd prefer if the file system folks would simply provide sane
semantics here, and provide an fsync()-style syscall or ioctl that
does what is needed here.

Lennart
--
Lennart Poettering, Red Hat
Michael Chapman
2017-04-10 10:49:49 UTC
Post by Lennart Poettering
[...]
The initrd usually doesn't run a systemd environment anymore, PID 1 is
usually a shell script of some kind. It might use our "reboot" binary
and call it with "-ff" (which means it's really just a pure reboot()
wrapper), but we don't do a umount/kill/detach spree in that case. Or
in other words: if they do use our reboot utility then they use it in
pure sysvinit compat mode, where it won't do more than sync() +
reboot(), exactly the same way as sysvinit did.
OK, given that, there's really no point in pursuing this from the
systemd end.
Post by Lennart Poettering
[...]
Well, the scheme works like this: we kill, umount, remount, detach in
a loop until nothing changes anymore. It's a primitive but robust way
to deal with stacked storage, where running processes might pin file
systems, which might in turn pin devices, which might in turn pin
backend userspace services, and so on... Hence, yes, we do the umount
first, and the remount second, but then we'll try another umount
and another remount, until this stops being fruitful.
[...]
Well, I figure that might work, but it's also fricking ugly: it feels
like booking a plane ticket that includes free airplane food, just
because you are hungry: you get considerably more than just the food,
and you have to sit uncomfortably for too long, end up where you
didn't want to go, and the food is quite awful too. ("Chicken or
pasta?")
I'd prefer if the file system folks would simply provide sane
semantics here, and provide an fsync()-style syscall or ioctl that
does what is needed here.
Perhaps we might be able to convince them to make reboot() a full "unmount
all remaining filesystems" operation. To be honest, I'm a little
surprised it isn't... but I suppose it's got all the same problems with
ordering between filesystems within the kernel itself.
Kai Krakow
2017-04-10 11:43:29 UTC
Am Mon, 10 Apr 2017 11:04:45 +0200
Post by Lennart Poettering
[...]
Yeah, we do what we can.
But I seriously doubt FIFREEZE will make things better. It's just
going to make shutdowns hang every now and then.
It could simply thaw the FS again after freeze to somewhat improve on
that. At least everything that should be flushed is now flushed at that
point and grub et al should be happy.

But I wonder why filesystems don't just flush the journal on
remount-ro? It may take a while, but I think that can be perfectly
expected when remounting ro: at least I would expect that this forces
out all pending writes to the filesystem, hence flushing the journal.

Tho, readonly mounts do not guarantee the filesystem not modifying the
underlying storage device. For example, btrfs can modify the storage
even when mounting an unmounted fs in ro mode. It guarantees readonly
from user-space perspective - and I think that's totally on par with the
specs of "mount -o ro".

So a final freeze/thaw cycle is probably the only way to go? It
provides exactly what is needed here to be compatible with
configurations that involve grub on complex filesystems.

Then, what about underlying cache infrastructures like BBU-supported
RAID caches? We had systems that failed on reboot because the BBU was
in a relearning cycle at reboot, and the controller thus refused to
replay the write-cache during POST and instead discarded it. That can
really make a big mess, btw. Tho, I think that's a controller bug: the
writeback wasn't set to always-writeback but only when it's safe. But
this suggests that the reboot code should even force a cache flush
for those components.

Taking everything into account, it boils down to eventually not using
grub on XFS but only on simple filesystems, or depending on the ESP
only for booting. Everything else only means that systemd (and other
init systems) have to invent a huge complex mess to fix everything
that isn't done right by other involved software.
--
Regards,
Kai

Replies to list-only preferred.
Lennart Poettering
2017-04-10 11:54:27 UTC
Post by Kai Krakow
Am Mon, 10 Apr 2017 11:04:45 +0200
[...]
It could simply thaw the FS again after freeze to somewhat improve on
that. At least everything that should be flushed is now flushed at that
point and grub et al should be happy.
But I wonder why filesystems not just flush the journal on remount-ro?
It may take a while but I think that can be perfectly expected when
rmounting ro: At least I would expect that this forces out all pending
writes to the filesystem hence flushing the journal.
Well, the remount-ro doesn't succeed in the case this is all about:
the plymouth process appears to run off the root fs and keeps the
executable pinned, which was deleted because it was updated, and thus
the kernel will refuse the remount. See other mail.
Post by Kai Krakow
So a final freeze/thaw cycle is probably the only way to go? It
provides exactly what is needed here to be compatible with
configurations that involve grub on complex filesystems.
A FIFREEZE+FITHAW pair is likely to work, but it's frickin' ugly
(see other mails), and I'd certainly prefer if the fs folks would
provide a proper ioctl/syscall for the operation we need. Quite
frankly, it doesn't appear to be a particularly exotic operation; in
fact, the operation we need would probably be run much more often
than the operation that FIFREEZE/FITHAW was introduced for...

Lennart
--
Lennart Poettering, Red Hat
Kai Krakow
2017-04-10 12:16:38 UTC
Am Mon, 10 Apr 2017 13:54:27 +0200
Post by Lennart Poettering
[...]
Well, the remount-ro doesn't succeed in the case this is all about:
the plymouth process appears to run off the root fs and keeps the
executable pinned, which was deleted because it was updated, and thus
the kernel will refuse the remount. See other mail.
Ah okay, so in that case a journal flush isn't even attempted; it
fails right away. My first idea was that it should flush the journal
but could fail anyway. I didn't get that point, thus my assumption
that remount-ro doesn't flush the journal.
Post by Lennart Poettering
Post by Kai Krakow
So a final freeze/thaw cycle is probably the only way to go? It
provides exactly what is needed here to be compatible with
configurations that involve grub on complex filesystems.
A FIFREEZE+FITHAW pair is likely to work, but it's frickin' ugly
(see other mails), and I'd certainly prefer if the fs folks would
provide a proper ioctl/syscall for the operation we need. Quite
frankly, it doesn't appear to be a particularly exotic operation; in
fact, the operation we need would probably be run much more often
than the operation that FIFREEZE/FITHAW was introduced for...
Yes it's ugly and there should be a proper ioctl/syscall for the exact
semantics needed. Usually, working around such missing APIs only
results in the needed bits never implemented. I totally understand your
point. ;-)
--
Regards,
Kai

Replies to list-only preferred.
Chris Murphy
2017-04-11 01:30:35 UTC
On Mon, Apr 10, 2017 at 3:04 AM, Lennart Poettering
Post by Lennart Poettering
This is specifically the case that happened for Plymouth: the binary
probably got updated, hence the process in memory references a deleted
file, which blocks the read-only remounting, in which case we can't do
anything but sync and remount.
In my reproducer, the offline update contained only the kernel,
kernel-core, and kernel-modules packages. This triggers grubby to
modify the grub.cfg, which happens to live in /boot/grub2 on XFS.
Plymouth was definitely not being updated.
Post by Lennart Poettering
Post by Michael Chapman
Remember, all of this is because there *is* software that does the wrong
thing, and it *is* possible for software to hang and be unkillable. It would
be good for systemd to do the right thing even in the presence of that kind
of software.
Yeah, we do what we can.
But I seriously doubt FIFREEZE will make things better. It's just
going to make shutdowns hang every now and then.
My understanding is freeze isn't ignorable; it's expressly for the use
case where processes are actively writing and the fs must be made
completely consistent, e.g. prior to taking a snapshot. The thaw
immediately following the freeze would prevent any shutdown hang.

The point of freeze/thaw is that it causes the file system metadata
grub depends on to find the new grub.cfg to be committed to disk
prior to reboot. If some process is still hanging around with an open
write, it doesn't really matter.
--
Chris Murphy
Lennart Poettering
2017-04-17 11:02:48 UTC
Reply
Permalink
Raw Message
Post by Chris Murphy
Post by Lennart Poettering
Post by Michael Chapman
Remember, all of this is because there *is* software that does the wrong
thing, and it *is* possible for software to hang and be unkillable. It would
be good for systemd to do the right thing even in the presence of that kind
of software.
Yeah, we do what we can.
But I seriously doubt FIFREEZE will make things better. It's just
going to make shutdowns hang every now and then.
My understanding is freeze isn't ignorable; it's expressly for the use
case where processes are actively writing and the fs must be made
completely consistent, e.g. prior to taking a snapshot. The thaw
immediately following the freeze would prevent any shutdown hang.
The point of freeze/thaw is that it causes the file system metadata
grub depends on to find the new grub.cfg to be committed to disk
prior to reboot. If some process is still hanging around with an open
write, it doesn't really matter.
As mentioned: if you prep a patch that adds FIFREEZE+FITHAW when we
remount stuff read-only, then I'd merge it, even though I think the
kernel APIs for this are really broken, and it would be much
preferable to have a proper API for this, either exposed via the
well-understood sync() syscall, or through a new ioctl, if they must.

Lennart
--
Lennart Poettering, Red Hat
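The closest existing per-filesystem primitive today is syncfs(2), which flushes the filesystem containing a given fd without privileges; unlike a freeze, though, it gives no guarantee that the journal is checkpointed into the main fs structures in a form grub can read, which is exactly the gap discussed above. A hedged sketch via ctypes (Python has no stdlib wrapper for syncfs):

```python
import ctypes
import os

# Load the symbols of the running process (glibc on Linux), which
# exposes syncfs(2); there is no os.syncfs() in the stdlib.
libc = ctypes.CDLL(None, use_errno=True)

def sync_one_fs(path):
    """Flush the filesystem containing `path` with syncfs(2).
    Needs no privileges, but unlike FIFREEZE it does not promise
    that the journal is checkpointed where grub can see it."""
    fd = os.open(path, os.O_RDONLY)
    try:
        if libc.syncfs(fd) != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
    finally:
        os.close(fd)

sync_one_fs("/tmp")  # assumes /tmp exists, as on any Linux system
```

A hypothetical "sync and checkpoint" variant of this call is essentially the proper API the thread is asking the fs folks for.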
Holger Kiehl
2017-04-09 13:44:44 UTC
Reply
Permalink
Raw Message
Post by Chris Murphy
Post by Andrei Borzenkov
grub2 is not limited to 640KiB. Actually it will actively avoid using
low memory. It switches to protected mode as the very first thing and
can use up to 4GiB (and even this can probably be lifted on 64-bit
platforms). The real problem is the fact that grub is read-only, so
every time you access a file on a journaled partition it will need to
replay the journal again from scratch. This will likely be painfully
slow (I remember that grub legacy on reiser needed a couple of minutes
to read the kernel and much more to read the initrd, and that was when
both were smaller than now).
OK, well, that makes more sense; but yeah, it still sounds like
journal replay is a non-starter. The entire fs metadata would have to
be read into memory to create something like a RAM-based rw snapshot,
backed by the ro disk version as origin, and then the log played
against the RAM snapshot. That could be faster than constantly
replaying the journal from scratch for each file access. But still,
it sounds overly complicated.
I think this qualifies as "Doctor, it hurts when I do this." And the
doctor says, "So don't do that." And I'm referring to Plymouth
exempting itself from kill while also not running from the initramfs.
So I'll kindly make the case to the Plymouth folks to stop pressing
this particular hurt-me button.
But hey, pretty cool bug. Not often is it the case that you find such
an old bug so easily reproducible, but as near as I can tell only one
person was hitting it until I tried to reproduce it.
I too was hit by this bug on one of my systems. But what I did was
just remove all plymouth rpms, and everything was good from that
moment on.

Holger