Discussion:
[ANNOUNCE] systemd 219
Lennart Poettering
2015-02-16 22:59:56 UTC
Heya!

Many many improvements, in particular in the area of containers, btrfs
hookup, and networkd. Also, many bugfixes. Enjoy!

http://www.freedesktop.org/software/systemd/systemd-219.tar.xz

Note that this version is not available in Fedora F22/F23 yet. The
linker on ARM segfaults. Since the i386 and x86_64 versions built
fine, I decided to release 219 anyway.

CHANGES WITH 219:

* Introduce a new API "sd-hwdb.h" for querying the hardware
metadata database. With this minimal interface one can query
and enumerate the udev hwdb, decoupled from the old libudev
library. libudev's interface for this is now only a wrapper
around sd-hwdb. A new tool systemd-hwdb has been added to
interface with and update the database.

* When any of systemd's tools copies files (for example due to
tmpfiles' C lines) a btrfs reflink will be attempted first,
before bytewise copying is done.

* systemd-nspawn gained a new --ephemeral switch. When
specified a btrfs snapshot is taken of the container's root
directory, and immediately removed when the container
terminates again. Thus, a container can be started whose
changes never alter the container's root directory, and are
lost on container termination. This switch can also be used
for starting a container off the root file system of the
host without affecting the host OS. This switch is only
available on btrfs file systems.

* systemd-nspawn gained a new --template= switch. It takes the
path to a container tree to use as template for the tree
specified via --directory=, should that directory be
missing. This allows instantiating containers dynamically,
on first run. This switch is only available on btrfs file
systems.

* When a .mount unit refers to a mount point on which multiple
mounts are stacked, and the .mount unit is stopped all of
the stacked mount points will now be unmounted until no
mount point remains.

* systemd now has an explicit notion of supported and
unsupported unit types. Jobs enqueued for unsupported unit
types will now fail with an "unsupported" error code. More
specifically .swap, .automount and .device units are not
supported in containers, .busname units are not supported on
non-kdbus systems. .swap and .automount are also not
supported if their respective kernel compile time options
are disabled.

* machinectl gained support for two new "copy-from" and
"copy-to" commands for copying files from a running
container to the host or vice versa.

* machinectl gained support for a new "bind" command to bind
mount host directories into local containers. This is
currently only supported for nspawn containers.

* networkd gained support for configuring bridge forwarding
database entries (fdb) from .network files.

* A new tiny daemon "systemd-importd" has been added that can
download container images in tar, raw, qcow2 or dkr formats,
and make them available locally in /var/lib/machines, so
that they can run as nspawn containers. The daemon can GPG
verify the downloads (not supported for dkr, since it has no
provisions for verifying downloads). It will transparently
decompress bz2, xz, gzip compressed downloads if necessary,
and restore sparse files on disk. The daemon uses privilege
separation to ensure the actual download logic runs with
fewer privileges than the daemon itself. machinectl has
gained new commands "pull-tar", "pull-raw" and "pull-dkr" to
make the functionality of importd available to the
user. With this in place the Fedora and Ubuntu "Cloud"
images can be downloaded and booted as containers unmodified
(the Fedora images lack the appropriate GPG signature files
currently, so they cannot be verified, but this will change
soon, hopefully). Note that downloading images is currently
only fully supported on btrfs.

* machinectl is now able to list container images found in
/var/lib/machines, along with some metadata about sizes of
disk and similar. If the directory is located on btrfs and
quota is enabled, this includes quota display. A new command
"image-status" has been added that shows additional
information about images.

* machinectl is now able to clone container images
efficiently, if the underlying file system (btrfs) supports
it, with the new "machinectl clone" command. It also
gained commands for renaming and removing images, as well as
marking them read-only or read-write (supported also on
legacy file systems).

* networkd gained support for collecting LLDP network
announcements, from hardware that supports this. This is
shown in networkctl output.

* systemd-run gained support for a new -t (--pty) switch for
invoking a binary on a pty whose input and output is
connected to the invoking terminal. This allows executing
processes as system services while interactively
communicating with them via the terminal. Most interestingly
this is supported across container boundaries. Invoking
"systemd-run -t /bin/bash" is an alternative to running a
full login session, the difference being that the former
will not register a session, nor go through the PAM session
setup.

* tmpfiles gained support for a new "v" line type for creating
btrfs subvolumes. If the underlying file system is a legacy
file system, this automatically degrades to creating a
normal directory. Among others /var/lib/machines is now
created like this at boot, should it be missing.

* The directory /var/lib/containers/ has been deprecated and
been replaced by /var/lib/machines. The term "machines" has
been used in the systemd context as generic term for both
VMs and containers, and hence appears more appropriate for
this, as the directory can also contain raw images bootable
via qemu/kvm.

* systemd-nspawn when invoked with -M but without --directory=
or --image= is now capable of searching for the container
root directory, subvolume or disk image automatically, in
/var/lib/machines. systemd-nspawn@.service has been updated
to make use of this, thus allowing it to be used for raw
disk images, too.

* A new machines.target unit has been introduced that is
supposed to group all containers/VMs invoked as services on
the system. systemd-nspawn@.service has been updated to
integrate with that.

* machinectl gained a new "start" command, for invoking a
container as a service. "machinectl start foo" is mostly
equivalent to "systemctl start systemd-nspawn@foo.service",
but handles escaping in a nicer way.

* systemd-nspawn will now mount most of the cgroupfs tree
read-only into each container, with the exception of the
container's own subtree in the name=systemd hierarchy.

* journald now sets the special FS_NOCOW file flag for its
journal files. This should improve performance on btrfs, by
avoiding heavy fragmentation when journald's write-pattern
is used on COW file systems. It degrades btrfs' data
integrity guarantees for the files to the same levels as for
ext3/ext4 however. This should be OK though as journald does
its own data integrity checks and all its objects are
checksummed on disk. Also, journald should handle btrfs disk
full events a lot more gracefully now, by processing SIGBUS
errors, and not relying on fallocate() anymore.

* When journald detects that journal files it is writing to
have been deleted it will immediately start new journal
files.

* systemd now provides a way to store file descriptors
per-service in PID 1. This is useful for daemons to ensure
that fds they require are not lost during a daemon
restart. The fds are passed to the daemon on the next
invocation in the same way socket activation fds are
passed. This is now used by journald to ensure that the
various sockets connected to all the system's stdout/stderr
are not lost when journald is restarted. File descriptors
may be stored in PID 1 via the sd_pid_notify_with_fds() API,
an extension to sd_notify(). Note that a limit is enforced
on the number of fds a service can store in PID 1, and it
defaults to 0, so that no fds may be stored, unless this is
explicitly turned on.

* The default TERM variable to use for units connected to a
terminal, when no other value is explicitly set, is now
vt220 rather than vt102. This should be fairly safe still,
but allows PgUp/PgDn to work.

* The /etc/crypttab option header= as known from Debian is now
supported.

* "loginctl user-status" and "loginctl session-status" will
now show the last 10 lines of log messages of the
user/session following the status output. Similarly,
"machinectl status" will show the last 10 log lines
associated with a virtual machine or container
service. (Note that this is usually not the log messages
done in the VM/container itself, but simply what the
container manager logs. For nspawn this includes all console
output however.)

* "loginctl session-status" without further argument will now
show the status of the session of the caller. Similarly,
"lock-session", "unlock-session", "activate",
"enable-linger", "disable-linger" may now be called without
session/user parameter in which case they apply to the
caller's session/user.

* An X11 session scriptlet is now shipped that uploads
$DISPLAY and $XAUTHORITY into the environment of the systemd
--user daemon if a session begins. This should improve
compatibility with X11 enabled applications run as systemd
user services.

* Generators are now subject to masking via /etc and /run, the
same way as unit files.

* networkd .network files gained support for configuring
per-link IPv4/IPv6 packet forwarding as well as IPv4
masquerading. This is by default turned on for veth links to
containers, as registered by systemd-nspawn. This means that
nspawn containers run with --network-veth will now get
automatic routed access to the host's networks without any
further configuration or setup, as long as networkd runs on
the host.

* systemd-nspawn gained the --port= (-p) switch to expose TCP
or UDP ports of a container on the host. With this in place
it is possible to run containers with private veth links
(--network-veth), and have their functionality exposed on
the host as if their services were running directly on the
host.

* systemd-nspawn's --network-veth switch now gained a short
version "-n", since with the changes above it is now truly
useful out-of-the-box. The systemd-nspawn@.service has been
updated to make use of it too by default.

* systemd-nspawn will now maintain a per-image R/W lock, to
ensure that the same image is not started more than once
writable. (It's OK to run an image multiple times
simultaneously in read-only mode.)

* systemd-nspawn's --image= option is now capable of
dissecting and booting MBR and GPT disk images that contain
only a single active Linux partition. Previously it
supported only GPT disk images with proper GPT type
IDs. This allows running cloud images from major
distributions directly with systemd-nspawn, without
modification.

* In addition to collecting mouse dpi data in the udev
hardware database, there's now support for collecting angle
information for mouse scroll wheels. The database is
supposed to guarantee similar scrolling behavior on mice
that it knows about. There's also support for collecting
information about Touchpad types.

* udev's input_id built-in will now also collect touch screen
dimension data and attach it to probed devices.

* /etc/os-release gained support for a Distribution Privacy
Policy link field.

* networkd gained support for creating "ipvlan", "gretap",
"ip6gre", "ip6gretap" and "ip6tnl" network devices.

* systemd-tmpfiles gained support for "a" lines for setting
ACLs on files.

* systemd-nspawn will now mount /tmp in the container to
tmpfs, automatically.

* systemd now exposes the memory.usage_in_bytes cgroup
attribute and shows it for each service in the "systemctl
status" output, if available.

* When the user presses Ctrl-Alt-Del more than 7x within 2s an
immediate reboot is triggered. This is useful if shutdown is
hung and is unable to complete, to expedite the
operation. Note that this kind of reboot will still unmount
all file systems, and hence should not result in fsck being
run on next reboot.

* A .device unit for an optical block device will now be
considered active only when a medium is in the drive. Also,
mount units are now bound to their backing devices thus
triggering automatic unmounting when devices become
unavailable. With this in place systemd will now
automatically unmount left-over mounts when a CD-ROM is
ejected or a USB stick is yanked from the system.

* networkd-wait-online now has support for waiting for
specific interfaces only (with globbing), and for giving up
after a configurable timeout.

* networkd now exits when idle. It will be automatically
restarted as soon as interfaces show up, are removed or
change state. networkd will stay around as long as there is
at least one DHCP state machine or similar around that keeps
it non-idle.

* networkd may now configure IPv6 link-local addressing in
addition to IPv4 link-local addressing.

* The IPv6 "token" for use in SLAAC may now be configured for
each .network interface in networkd.

* Routes configured with networkd may now be assigned a scope
in .network files.

* networkd's [Match] sections now support globbing and lists
of multiple space-separated matches per item.
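
As a rough illustration of the last entry: an item holding several space-separated globs can be read as matching when any one of the patterns matches. The Python sketch below uses the standard fnmatch module as a stand-in for networkd's own matcher; the function name and the sample patterns are hypothetical.

```python
from fnmatch import fnmatchcase


def match_item(patterns: str, value: str) -> bool:
    """Return True if value matches any of the space-separated globs."""
    return any(fnmatchcase(value, pattern) for pattern in patterns.split())


if __name__ == "__main__":
    # Hypothetical value of a Name= line in a [Match] section.
    name_patterns = "en* wl*"
    for iface in ("enp3s0", "wlan0", "lo"):
        print(iface, match_item(name_patterns, iface))
```

An empty pattern list matches nothing in this sketch; how networkd treats an empty item is not covered here.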

Contributions from: Alban Crequy, Alin Rauta, Andrey Chaser,
Bastien Nocera, Bruno Bottazzini, Carlos Garnacho, Carlos
Morata Castillo, Chris Atkinson, Chris J. Arges, Christian
Kirbach, Christian Seiler, Christoph Brill, Colin Guthrie,
Colin Walters, Cristian Rodríguez, Daniele Medri, Daniel Mack,
Dave Reisner, David Herrmann, Djalal Harouni, Erik Auerswald,
Filipe Brandenburger, Frank Theile, Gabor Kelemen, Gabriel de
Perthuis, Harald Hoyer, Hui Wang, Ivan Shapovalov, Jan
Engelhardt, Jan Synacek, Jay Faulkner, Johannes Hölzl, Jonas
Ådahl, Jonathan Boulle, Josef Andersson, Kay Sievers, Ken
Werner, Lennart Poettering, Lucas De Marchi, Lukas Märdian,
Lukas Nykryn, Lukasz Skalski, Luke Shumaker, Mantas Mikulėnas,
Manuel Mendez, Marcel Holtmann, Marc Schmitzer, Marko
Myllynen, Martin Pitt, Maxim Mikityanskiy, Michael Biebl,
Michael Marineau, Michael Olbrich, Michal Schmidt, Mindaugas
Baranauskas, Moez Bouhlel, Naveen Kumar, Patrik Flykt, Paul
Martin, Peter Hutterer, Peter Mattern, Philippe De Swert,
Piotr Drąg, Rafael Ferreira, Rami Rosen, Robert Milasan, Ronny
Chevalier, Sangjung Woo, Sebastien Bacher, Sergey Ptashnick,
Shawn Landden, Stéphane Graber, Susant Sahani, Sylvain
Plantefève, Thomas Hindoe Paaboel Andersen, Tim JP, Tom
Gundersen, Topi Miettinen, Torstein Husebø, Umut Tezduyar
Lindskog, Veres Lajos, Vincent Batts, WaLyong Cho, Wieland
Hoffmann, Zbigniew Jędrzejewski-Szmek

-- Berlin, 2015-02-16

Lennart
--
Lennart Poettering, Red Hat
Andrei Borzenkov
2015-02-17 03:53:09 UTC
On Mon, 16 Feb 2015 23:59:56 +0100
Post by Lennart Poettering
* When a .mount unit refers to a mount point on which multiple
mounts are stacked, and the .mount unit is stopped all of
the stacked mount points will now be unmounted until no
mount point remains.
Does it mean that in either of the cases below

mount something-else /foo
systemctl start foo.mount

and

systemctl start foo.mount
mount something-else /foo

systemctl stop foo.mount will also unmount something-else?
Lennart Poettering
2015-02-17 10:08:57 UTC
Post by Andrei Borzenkov
On Mon, 16 Feb 2015 23:59:56 +0100
Post by Lennart Poettering
* When a .mount unit refers to a mount point on which multiple
mounts are stacked, and the .mount unit is stopped all of
the stacked mount points will now be unmounted until no
mount point remains.
Does it mean that in either of the cases below
mount something-else /foo
systemctl start foo.mount
In this case the second line is a NOP, since the first line already
mounted something on /foo, and thus made foo.mount active.

(Also, small hint, you can just write "systemctl start /foo", it will
be implicitly converted to "systemctl start foo.mount".)
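
The path-to-unit-name conversion mentioned in the hint can be sketched roughly as follows. This is a simplified Python model of the escaping that "systemd-escape --path" performs, not systemd's own code: it does not resolve "." or ".." components, and the function name is made up for illustration.

```python
import string

# Characters that are kept as-is; everything else becomes \xNN.
_PLAIN = set(string.ascii_letters + string.digits + ":_.")


def path_to_mount_unit(path: str) -> str:
    """Turn a mount point path into the name of its .mount unit."""
    parts = [p for p in path.split("/") if p]
    if not parts:
        # The root directory is special-cased.
        return "-.mount"
    out = []
    for i, byte in enumerate("/".join(parts).encode()):
        ch = chr(byte)
        if ch == "/":
            out.append("-")
        elif ch in _PLAIN and not (i == 0 and ch == "."):
            out.append(ch)
        else:
            # Includes "-" itself, so it cannot be confused with "/".
            out.append("\\x%02x" % byte)
    return "".join(out) + ".mount"


if __name__ == "__main__":
    for p in ("/foo", "/var/lib/machines", "/"):
        print(p, "->", path_to_mount_unit(p))
```
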
Post by Andrei Borzenkov
and
systemctl start foo.mount
mount something-else /foo
This one will result in two mounts, one on top of the other.
Post by Andrei Borzenkov
systemctl stop foo.mount will also unmount something-else?
Correct. In the first case a single mount is removed, in the second
case two mounts will actually be removed.
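
To see how many mounts are stacked on a mount point, one can look at the fifth field of /proc/self/mountinfo. The Python sketch below parses text in that format; the sample table and its device names are invented for illustration, and the assumption that later mounts appear later in the table is not guaranteed here.

```python
from typing import List

# Made-up excerpt in /proc/self/mountinfo format: two mounts on /foo.
SAMPLE_MOUNTINFO = """\
40 25 8:17 / /foo rw,relatime shared:20 - ext4 /dev/sdb1 rw
41 40 8:33 / /foo rw,relatime shared:21 - ext4 /dev/sdc1 rw
"""


def mounts_on(mountinfo: str, mount_point: str) -> List[str]:
    """List the sources mounted on mount_point, in table order."""
    sources = []
    for line in mountinfo.splitlines():
        fields = line.split()
        try:
            # Optional fields start at index 6 and end with a lone "-".
            sep = fields.index("-", 6)
        except ValueError:
            continue
        if len(fields) > sep + 2 and fields[4] == mount_point:
            sources.append(fields[sep + 2])
    return sources


if __name__ == "__main__":
    stacked = mounts_on(SAMPLE_MOUNTINFO, "/foo")
    print(len(stacked), "mount(s) on /foo:", stacked)
```

In the second case discussed above the list would have two entries, and stopping foo.mount would remove both.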

Lennart
--
Lennart Poettering, Red Hat
Colin Guthrie
2015-02-17 23:30:09 UTC
Post by Lennart Poettering
Post by Andrei Borzenkov
mount something-else /foo
systemctl start foo.mount
In this case the second line is a NOP, since the first line already
mounted something on /foo, and thus made foo.mount active.
So, even if foo.mount (the actual unit file) specifies it's
What=something (not What=something-else) the fact that *anything* is
mounted to /foo is sufficient to make the foo.mount unit active?

This seems somewhat counter-intuitive to me. I can understand why from
an implementation perspective - the mount units are all geared around
the mountpoint not the What=, but it's certainly not what I'd expect as
a user.

Wouldn't it be better if there was some other state - e.g. "conflict" if
something other than the desired device was mounted to the specified
destination?

Col
--
Colin Guthrie
gmane(at)colin.guthr.ie
http://colin.guthr.ie/

Day Job:
Tribalogic Limited http://www.tribalogic.net/
Open Source:
Mageia Contributor http://www.mageia.org/
PulseAudio Hacker http://www.pulseaudio.org/
Trac Hacker http://trac.edgewall.org/
Lennart Poettering
2015-02-18 10:03:39 UTC
Post by Colin Guthrie
Post by Lennart Poettering
Post by Andrei Borzenkov
mount something-else /foo
systemctl start foo.mount
In this case the second line is a NOP, since the first line already
mounted something on /foo, and thus made foo.mount active.
So, even if foo.mount (the actual unit file) specifies it's
What=something (not What=something-else) the fact that *anything* is
mounted to /foo is sufficient to make the foo.mount unit active?
Yes, and this always has been that way.
Post by Colin Guthrie
This seems somewhat counter-intuitive to me. I can understand why from
an implementation perspective - the mount units are all geared around
the mountpoint not the What=, but it's certainly not what I'd expect as
a user.
Well it's the only logic that can work really, already since the same
device node is usually known to the kernel by a different name than
to userspace. Trying to always map that is really nasty, as one can
see with the GPT generator complexity.
Post by Colin Guthrie
Wouldn't it be better if there was some other state - e.g. "conflict" if
something other than the desired device was mounted to the specified
destination?
I think it's really safe not to consider that a problem.

Lennart
--
Lennart Poettering, Red Hat
Zbigniew Jędrzejewski-Szmek
2015-02-18 14:11:52 UTC
Post by Lennart Poettering
Well it's the only logic that can work really, already since the same
device node is usually known to the kernel by a different name than
to userspace. Trying to always map that is really nasty, as one can
see with the GPT generator complexity.
Post by Colin Guthrie
Wouldn't it be better if there was some other state - e.g. "conflict" if
something other than the desired device was mounted to the specified
destination?
I think it's really safe not to consider that a problem.
Yes, especially since the administrator must take explicit manual
actions to reach this state. They should remember that they did that.

Zbyszek
Maciej Wereski
2015-02-17 16:13:08 UTC
Hello,
Post by Lennart Poettering
Note that this version is not available in Fedora F22/F23 yet. The
linker on ARM segfaults. Since the i386 and x86_64 versions built
fine, I decided to release 219 anyway.
I was able to build systemd v219 both on armv7l and aarch64. As a workaround I
had to disable Link Time Optimizations.

Tizen 3.0:
gcc 4.9.2
binutils 2.24.90

cheers,
--
Maciej Wereski
Samsung R&D Institute Poland
Samsung Electronics
***@partner.samsung.com
Lennart Poettering
2015-02-17 16:23:55 UTC
Post by Maciej Wereski
Hello,
Post by Lennart Poettering
Note that this version is not available in Fedora F22/F23 yet. The
linker on ARM segfaults. Since the i386 and x86_64 versions built
fine, I decided to release 219 anyway.
I was able to build systemd v219 both on armv7l and aarch64. As a workaround I
had to disable Link Time Optimizations.
Well, did it segfault for you if you had lto on?

This toolchain bug is tracked here btw:

https://bugzilla.redhat.com/show_bug.cgi?id=1193212

Lennart
--
Lennart Poettering, Red Hat
Maciej Wereski
2015-02-18 14:11:46 UTC
Post by Lennart Poettering
Post by Maciej Wereski
Hello,
Post by Lennart Poettering
Note that this version is not available in Fedora F22/F23 yet. The
linker on ARM segfaults. Since the i386 and x86_64 versions built
fine, I decided to release 219 anyway.
I was able to build systemd v219 both on armv7l and aarch64. As a
workaround I had to disable Link Time Optimizations.
Well, did it segfault for you if you had lto on?
https://bugzilla.redhat.com/show_bug.cgi?id=1193212
No, we have some issues rather specific to our buildsystem.
--
Maciej Wereski
Samsung R&D Institute Poland
Samsung Electronics
***@partner.samsung.com
Goffredo Baroncelli
2015-02-17 19:05:29 UTC
Hi Lennart,
Post by Lennart Poettering
* journald now sets the special FS_NOCOW file flag for its
journal files. This should improve performance on btrfs, by
avoiding heavy fragmentation when journald's write-pattern
is used on COW file systems. It degrades btrfs' data
integrity guarantees for the files to the same levels as for
ext3/ext4 however. This should be OK though as journald does
its own data integrity checks and all its objects are
checksummed on disk. Also, journald should handle btrfs disk
full events a lot more gracefully now, by processing SIGBUS
errors, and not relying on fallocate() anymore.
If I read the code correctly, FS_NOCOW is a temporary workaround, i.e.
when the file is closed (or rotated?) the FS_NOCOW flag is unset again.
Is that true?

If so, the time window in which a file is unprotected by the checksum is
quite small. I was worried not about corruption detection but about losing
the ability to recover the file from a good copy (if available) in case of corruption.
But this seems limited to the time when the file is in use (before the next rotation).

BR
G.Baroncelli
--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Zbigniew Jędrzejewski-Szmek
2015-02-18 00:14:44 UTC
Post by Goffredo Baroncelli
Hi Lennart,
Post by Lennart Poettering
* journald now sets the special FS_NOCOW file flag for its
journal files. This should improve performance on btrfs, by
avoiding heavy fragmentation when journald's write-pattern
is used on COW file systems. It degrades btrfs' data
integrity guarantees for the files to the same levels as for
ext3/ext4 however. This should be OK though as journald does
its own data integrity checks and all its objects are
checksummed on disk. Also, journald should handle btrfs disk
full events a lot more gracefully now, by processing SIGBUS
errors, and not relying on fallocate() anymore.
If I read the code correctly, FS_NOCOW is a temporary workaround, i.e.
when the file is closed (or rotated?) the FS_NOCOW flag is unset again.
Is that true?
Yes, but you miss the point in general. FS_NOCOW is set during the
entire time when the file is being written to, which could be months,
and then it is unset when the file will not be written to anymore. So
indeed, the file is not protected by btrfs checksums for the majority
of time, but journald does its own checksumming, so the contents are
protected in a different way.
Post by Goffredo Baroncelli
If so, the time window in which a file is unprotected by the checksum is
quite small. I was worried not about corruption detection but about losing
the ability to recover the file from a good copy (if available) in case of corruption.
But this seems limited to the time when the file is in use (before the next rotation).
Zbyszek
Andrei Borzenkov
2015-02-18 03:22:38 UTC
On Wed, 18 Feb 2015 01:14:44 +0100
Post by Zbigniew Jędrzejewski-Szmek
Post by Goffredo Baroncelli
Hi Lennart,
Post by Lennart Poettering
* journald now sets the special FS_NOCOW file flag for its
journal files. This should improve performance on btrfs, by
avoiding heavy fragmentation when journald's write-pattern
is used on COW file systems. It degrades btrfs' data
integrity guarantees for the files to the same levels as for
ext3/ext4 however. This should be OK though as journald does
its own data integrity checks and all its objects are
checksummed on disk. Also, journald should handle btrfs disk
full events a lot more gracefully now, by processing SIGBUS
errors, and not relying on fallocate() anymore.
If I read the code correctly, FS_NOCOW is a temporary workaround, i.e.
when the file is closed (or rotated?) the FS_NOCOW flag is unset again.
Is that true?
Yes, but you miss the point in general. FS_NOCOW is set during the
entire time when the file is being written to, which could be months,
and then it is unset when the file will not be written to anymore. So
indeed, the file is not protected by btrfs checksums for the majority
of time, but journald does its own checksumming, so the contents are
protected in a different way.
btrfs checksumming theoretically allows you to transparently recover
after media corruption if filesystem has redundancy (more than one copy
of data). Journald checksum will probably detect corruption, but can it
repair it?
Post by Zbigniew Jędrzejewski-Szmek
Post by Goffredo Baroncelli
If so, the time window in which a file is unprotected by the checksum is
quite small. I was worried not about corruption detection but about losing
the ability to recover the file from a good copy (if available) in case of corruption.
But this seems limited to the time when the file is in use (before the next rotation).
Zbyszek
_______________________________________________
systemd-devel mailing list
http://lists.freedesktop.org/mailman/listinfo/systemd-devel
Zbigniew Jędrzejewski-Szmek
2015-02-18 03:30:41 UTC
Post by Andrei Borzenkov
On Wed, 18 Feb 2015 01:14:44 +0100
Post by Zbigniew Jędrzejewski-Szmek
Post by Goffredo Baroncelli
Hi Lennart,
Post by Lennart Poettering
* journald now sets the special FS_NOCOW file flag for its
journal files. This should improve performance on btrfs, by
avoiding heavy fragmentation when journald's write-pattern
is used on COW file systems. It degrades btrfs' data
integrity guarantees for the files to the same levels as for
ext3/ext4 however. This should be OK though as journald does
its own data integrity checks and all its objects are
checksummed on disk. Also, journald should handle btrfs disk
full events a lot more gracefully now, by processing SIGBUS
errors, and not relying on fallocate() anymore.
If I read the code correctly, FS_NOCOW is a temporary workaround, i.e.
when the file is closed (or rotated?) the FS_NOCOW flag is unset again.
Is that true?
Yes, but you miss the point in general. FS_NOCOW is set during the
entire time when the file is being written to, which could be months,
and then it is unset when the file will not be written to anymore. So
indeed, the file is not protected by btrfs checksums for the majority
of time, but journald does its own checksumming, so the contents are
protected in a different way.
btrfs checksumming theoretically allows you to transparently recover
after media corruption if filesystem has redundancy (more than one copy
of data). Journald checksum will probably detect corruption, but can it
repair it?
No.

Zbyszek
Lennart Poettering
2015-02-18 10:07:38 UTC
Post by Andrei Borzenkov
On Wed, 18 Feb 2015 01:14:44 +0100
Post by Zbigniew Jędrzejewski-Szmek
Post by Goffredo Baroncelli
Hi Lennart,
Post by Lennart Poettering
* journald now sets the special FS_NOCOW file flag for its
journal files. This should improve performance on btrfs, by
avoiding heavy fragmentation when journald's write-pattern
is used on COW file systems. It degrades btrfs' data
integrity guarantees for the files to the same levels as for
ext3/ext4 however. This should be OK though as journald does
its own data integrity checks and all its objects are
checksummed on disk. Also, journald should handle btrfs disk
full events a lot more gracefully now, by processing SIGBUS
errors, and not relying on fallocate() anymore.
If I read the code correctly, FS_NOCOW is a temporary workaround, i.e.
when the file is closed (or rotated?) the FS_NOCOW flag is unset again.
Is that true?
Yes, but you miss the point in general. FS_NOCOW is set during the
entire time when the file is being written to, which could be months,
and then it is unset when the file will not be written to anymore. So
indeed, the file is not protected by btrfs checksums for the majority
of time, but journald does its own checksumming, so the contents are
protected in a different way.
btrfs checksumming theoretically allows you to transparently recover
after media corruption if filesystem has redundancy (more than one copy
of data). Journald checksum will probably detect corruption, but can it
repair it?
No it cannot.

But btrfs checksumming cannot fix things for you either if you lose
non-trivial amounts of data. It might be able to fix a few bits of
errors, but not non-trivial amounts. I mean, that's a simple property
of error correction codes: the more you want to be able to correct the
longer must your checksum be. Neither btrfs' nor journald's are
substantial enough to correct even a sector...

Lennart
--
Lennart Poettering, Red Hat
Joonas Sarajärvi
2015-02-18 10:13:29 UTC
Post by Lennart Poettering
Post by Andrei Borzenkov
btrfs checksumming theoretically allows you to transparently recover
after media corruption if filesystem has redundancy (more than one copy
of data). Journald checksum will probably detect corruption, but can it
repair it?
No it cannot.
But btrfs checksumming cannot fix things for you either if you lose
non-trivial amounts of data. It might be able to fix a few bits of
errors, but not non-trivial amounts. I mean, that's a simple property
of error correction codes: the more you want to be able to correct the
longer must your checksum be. Neither btrfs' nor journald's are
substantial enough to correct even a sector...
Lennart
My impression is that btrfs can fix the corruption in cases where
e.g. a RAID1 of btrfs is used. As journal performance has already been
sufficient for my needs on btrfs, I would prefer to be able to
configure journald so that it'd keep the journal files with default
flags.

-Joonas
Lennart Poettering
2015-02-18 10:19:40 UTC
Post by Joonas Sarajärvi
Post by Lennart Poettering
Post by Andrei Borzenkov
btrfs checksumming theoretically allows you to transparently recover
after media corruption if filesystem has redundancy (more than one copy
of data). Journald checksum will probably detect corruption, but can it
repair it?
No it cannot.
But btrfs checksumming cannot fix things for you either if you lose
non-trivial amounts of data. It might be able to fix a few bits of
errors, but not non-trivial amounts. I mean, that's a simple property
of error correction codes: the more you want to be able to correct the
longer must your checksum be. Neither btrfs' nor journald's are
substantial enough to correct even a sector...
Lennart
My impression is that btrfs can fix the corruption in cases where
e.g. a RAID1 of btrfs is used.
FS_NOCOW does not affect btrfs RAID settings. If you want this kind of
data redundancy then it will continue to be available even though we
set FS_NOCOW now.

Lennart
--
Lennart Poettering, Red Hat
Joonas Sarajärvi
2015-02-18 10:27:52 UTC
Post by Lennart Poettering
Post by Joonas Sarajärvi
Post by Lennart Poettering
Post by Andrei Borzenkov
btrfs checksumming theoretically allows you to transparently recover
after media corruption if filesystem has redundancy (more than one copy
of data). Journald checksum will probably detect corruption, but can it
repair it?
No it cannot.
But btrfs checksumming cannot fix things for you either if you lose
non-trivial amounts of data. It might be able to fix a few bits of
errors, but not non-trivial amounts. I mean, that's a simple property
of error correction codes: the more you want to be able to correct the
longer must your checksum be. Neither btrfs' nor journald's are
substantial enough to correct even a sector...
Lennart
My impression is that btrfs can fix the corruption in cases where
e.g. a RAID1 of btrfs is used.
FS_NOCOW does not affect btrfs RAID settings. If you want this kind of
data redundancy then it will continue to be available even though we
set FS_NOCOW now.
Thank you for the quick response.

Do you mean that btrfs scrub will be able to detect which of the
copies is correct, if one of the copies of a file flagged with
FS_NOCOW gets changed due to disk corruption? My impression is that
FS_NOCOW would result in the redundant copies of file data not having
checksums that'd be correctly maintained. So btrfs scrub could
possibly detect that the copies differ, but it would not be able to
decide which one to discard.

AFAIK btrfs would normally be able to do this: write a new copy of the
intact file data and discard the corrupt one.

-Joonas
Goffredo Baroncelli
2015-02-18 17:21:33 UTC
Hi Lennart
Post by Lennart Poettering
FS_NOCOW does not affect btrfs RAID settings. If you want this kind of
data redundancy then it will continue to be available even though we
set FS_NOCOW now.
Without a checksum, btrfs is unable to restore a good copy: in case of
RAID1 a flip of a bit makes the two copies different. Only the checksum
makes it possible to detect which is the good copy.
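
The argument can be made concrete with a toy model in Python. It is only a sketch: zlib's CRC-32 stands in for the file system's checksum, the "mirrors" are plain byte strings, and the function name is hypothetical.

```python
import zlib
from typing import List, Optional


def pick_good_copy(copies: List[bytes],
                   stored_crc: Optional[int]) -> Optional[bytes]:
    """Return a copy that can be trusted, or None if that cannot be decided."""
    if stored_crc is not None:
        # With a checksum, each mirror can be verified on its own.
        for data in copies:
            if zlib.crc32(data) == stored_crc:
                return data
        return None
    # Without a checksum the mirrors can only be compared with each other.
    if all(data == copies[0] for data in copies):
        return copies[0]
    return None


if __name__ == "__main__":
    good = b"journal entry"
    bad = b"journal entrz"  # one corrupted byte
    print(pick_good_copy([bad, good], zlib.crc32(good)))  # repairable
    print(pick_good_copy([bad, good], None))              # undecidable
```

With the checksum the intact mirror is identified; without it the two copies are merely known to differ.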

This was already discussed in the thread (see the answers from Zygo and
Chris Murphy, in addition to mine)

http://www.spinics.net/lists/linux-btrfs/msg41024.html
Post by Lennart Poettering
Lennart
Goffredo
--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Lennart Poettering
2015-02-18 10:05:05 UTC
Post by Goffredo Baroncelli
Hi Lennart,
Post by Lennart Poettering
* journald now sets the special FS_NOCOW file flag for its
journal files. This should improve performance on btrfs, by
avoiding heavy fragmentation when journald's write-pattern
is used on COW file systems. It degrades btrfs' data
integrity guarantees for the files to the same levels as for
ext3/ext4 however. This should be OK though as journald does
its own data integrity checks and all its objects are
checksummed on disk. Also, journald should handle btrfs disk
full events a lot more gracefully now, by processing SIGBUS
errors, and not relying on fallocate() anymore.
If I read the code correctly, FS_NOCOW is a temporary workaround, i.e.
when the file is closed (or rotated?) the FS_NOCOW flag is unset again.
Is that true?
Well, we try to unset it, but this is not allowed by btrfs. However,
given that it might be allowed one day, we do it anyway.

In effect this means FS_NOCOW is set for good once we turn it on.
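
For readers who want to experiment with the flag, here is a minimal Python sketch of setting it through the FS_IOC_GETFLAGS/FS_IOC_SETFLAGS ioctls. It is not journald's code. The flag value comes from <linux/fs.h>; the ioctl numbers are computed under the assumption of the common Linux request encoding (as on x86 and ARM), and the function name is made up.

```python
import fcntl
import os
import struct

FS_NOCOW_FL = 0x00800000  # "do not copy on write" inode flag

# Assumed encoding: direction << 30 | size << 16 | type << 8 | number.
_LONG = struct.calcsize("l")
FS_IOC_GETFLAGS = (2 << 30) | (_LONG << 16) | (ord("f") << 8) | 1
FS_IOC_SETFLAGS = (1 << 30) | (_LONG << 16) | (ord("f") << 8) | 2


def try_set_nocow(path: str) -> bool:
    """Try to set FS_NOCOW_FL on path; return False if that is refused."""
    fd = os.open(path, os.O_RDONLY)
    try:
        raw = fcntl.ioctl(fd, FS_IOC_GETFLAGS, struct.pack("i", 0))
        flags = struct.unpack("i", raw)[0]
        fcntl.ioctl(fd, FS_IOC_SETFLAGS,
                    struct.pack("i", flags | FS_NOCOW_FL))
        return True
    except OSError:
        # File systems without this flag typically reject the request.
        return False
    finally:
        os.close(fd)
```

According to chattr(1), on btrfs the flag should be set on new or empty files, so a test like this is best run on a freshly created file.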

Lennart
--
Lennart Poettering, Red Hat