Discussion:
Slow startup of systemd-journal on BTRFS
Goffredo Baroncelli
2014-06-11 21:28:54 UTC
Hi all,

I would like to share my experience with a slowness of systemd when used on BTRFS.

My boot time was very high (about 50 seconds); most of it was due to NetworkManager, which took about 30-40 seconds to start (this data came from "systemd-analyze plot").

I made several attempts to address this issue. I also noticed that the problem sometimes disappeared, but I was never able to understand why.

However this link

https://bugzilla.redhat.com/show_bug.cgi?id=1006386

suggested to me that the problem could be due to a bad interaction between systemd and btrfs; NetworkManager was innocent.
It seems that systemd-journald creates highly fragmented files when it stores its logs, and BTRFS is known to behave slowly when a file is highly fragmented.
This caused a slow startup of systemd-journald, which in turn blocked the services that depend on the logging system.

In fact, after I defragmented the files under /var/log/journal [*], my boot time decreased by about 20 seconds (from 50s to 30s).

Unfortunately I don't have any data to show. Next time I will try to collect more information, but I am quite sure that when the logs are highly fragmented, systemd-journald becomes very slow on BTRFS.

I don't know whether the problem is more on the systemd side or the btrfs side. What I do know is that both projects will likely be important in the near future, and both must work well together.

I know that I can "chattr +C" to avoid COW for some files, but I don't want to lose the checksum protection as well.

If someone can suggest how to measure the FRAGMENTATION of the log file, I can try to collect more scientific data.


BR
G.Baroncelli

[*]
# btrfs fi defrag /var/log/journal/*/*
--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Chris Murphy
2014-06-12 00:40:00 UTC
Post by Goffredo Baroncelli
If someone can suggest how to measure the FRAGMENTATION of the log file, I can try to collect more scientific data.
So long as you're not using compression, filefrag will show you the fragments of systemd-journald's journals. I can vouch for the behavior you're experiencing without xattr +C or autodefrag; beyond that, it also causes much slowness when reading journal contents. For example, if I want to search all boots for a particular error message to see how far back it started, it takes quite a bit longer than on other file systems. So far I'm not experiencing this problem with autodefrag, or any other negative side effects, but my understanding is that this code is still in flux.
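
(For anyone who wants to count extents programmatically, here is a minimal sketch of the same FIEMAP ioctl that filefrag uses; the program itself is my illustration, not filefrag's source.)

/* Print the number of extents in a file via FIEMAP. With
 * fm_extent_count set to 0 the kernel only returns the count. */
#include <fcntl.h>
#include <linux/fiemap.h>
#include <linux/fs.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv) {
        if (argc != 2) { fprintf(stderr, "usage: %s FILE\n", argv[0]); return 1; }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct fiemap fm;
        memset(&fm, 0, sizeof(fm));
        fm.fm_start = 0;
        fm.fm_length = FIEMAP_MAX_OFFSET;   /* map the whole file */
        fm.fm_flags = FIEMAP_FLAG_SYNC;     /* flush delayed allocation first */
        fm.fm_extent_count = 0;             /* just count, don't return extents */

        if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0) { perror("FS_IOC_FIEMAP"); return 1; }

        printf("%s: %u extents found\n", argv[1], fm.fm_mapped_extents);
        close(fd);
        return 0;
}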

Since the journals have their own checksumming I'm not overly concerned about setting xattr +C.

Chris Murphy
Chris Murphy
2014-06-12 02:32:52 UTC
Post by Goffredo Baroncelli
Hi all,
I would like to share my experience with a slowness of systemd when used on BTRFS.
My boot time was very high (about 50 seconds); most of it was due to NetworkManager, which took about 30-40 seconds to start (this data came from "systemd-analyze plot").
I made several attempts to address this issue. I also noticed that the problem sometimes disappeared, but I was never able to understand why.
However this link
https://bugzilla.redhat.com/show_bug.cgi?id=1006386
suggested to me that the problem could be due to a bad interaction between systemd and btrfs; NetworkManager was innocent.
systemd has a very stupid journal write pattern. It checks if there
is space in the file for the write, and if not it fallocates the
small amount of space it needs (it does *4 byte* fallocate calls!)
and then does the write to it. All this does is fragment the crap
out of the log files because the filesystems cannot optimise the
allocation patterns.
Yup, it fragments journal files on XFS, too.
http://oss.sgi.com/archives/xfs/2014-03/msg00322.html
IIRC, the systemd developers consider this a filesystem problem and
so refused to change the systemd code to be nice to the filesystem
allocators, even though they don't actually need to use fallocate...
Cheers,
Dave.
--
Dave Chinner
BTW, the systemd list is subscriber only, so they aren't going to
see anything that we comment on from a cross-post to the btrfs list.
Unless a subscriber finds something really interesting, quotes it, and cross-posts it.

Chris Murphy
Lennart Poettering
2014-06-15 22:34:21 UTC
systemd has a very stupid journal write pattern. It checks if there
is space in the file for the write, and if not it fallocates the
small amount of space it needs (it does *4 byte* fallocate calls!)
Not really the case.

http://cgit.freedesktop.org/systemd/systemd/tree/src/journal/journal-file.c#n354

We allocate 8 MiB at minimum.
and then does the write to it. All this does is fragment the crap
out of the log files because the filesystems cannot optimise the
allocation patterns.
Well, it would be good if you'd tell me what to do instead...

I am invoking fallocate() in advance, because we write those files with
mmap(), and that of course would normally trigger SIGBUS already for the
most boring of reasons, such as disk full/quota full or so. Hence,
before we do anything like that, we invoke fallocate() to ensure that
the space is actually available... As far as I can see, that is pretty much
in line with what fallocate() is supposed to be useful for; the man page
says this explicitly:

"...After a successful call to posix_fallocate(), subsequent writes
to bytes in the specified range are guaranteed not to fail because
of lack of disk space."

Happy to be informed that the man page is wrong.
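
For illustration, here is a minimal sketch of that reserve-then-write pattern (illustrative only, not journald's actual code; the file name is a placeholder):

/* Reserve space with posix_fallocate() before writing via mmap(), so
 * that a full disk or quota shows up as an error here rather than as
 * SIGBUS later, inside the mapping. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define FILE_SIZE_INCREASE (8 * 1024 * 1024)    /* 8 MiB, as in journald */

int main(void) {
        int fd = open("test.journal", O_RDWR | O_CREAT, 0640);
        if (fd < 0) { perror("open"); return 1; }

        int r = posix_fallocate(fd, 0, FILE_SIZE_INCREASE);
        if (r != 0) { fprintf(stderr, "posix_fallocate: %s\n", strerror(r)); return 1; }

        char *p = mmap(NULL, FILE_SIZE_INCREASE, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* Writes through the mapping can no longer fail for lack of space. */
        memcpy(p, "journal entry", 13);

        munmap(p, FILE_SIZE_INCREASE);
        close(fd);
        return 0;
}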

I am also happy to change our code, if it really is the wrong thing to
do. Note however that I generally favour correctness and relying on
documented behaviour, instead of nebulous optimizations whose effects
might change with different file systems or kernel versions...
Yup, it fragments journal files on XFS, too.
http://oss.sgi.com/archives/xfs/2014-03/msg00322.html
IIRC, the systemd developers consider this a filesystem problem and
so refused to change the systemd code to be nice to the filesystem
allocators, even though they don't actually need to use fallocate...
What? No need to be a dick. Nobody ever pinged me about this. And yeah, I
think I have a very good reason to use fallocate(). In fact, it's the only
reason the man page explicitly mentions.

Lennart
--
Lennart Poettering, Red Hat
Chris Murphy
2014-06-16 04:01:04 UTC
Post by Lennart Poettering
systemd has a very stupid journal write pattern. It checks if there
is space in the file for the write, and if not it fallocates the
small amount of space it needs (it does *4 byte* fallocate calls!)
Not really the case.
http://cgit.freedesktop.org/systemd/systemd/tree/src/journal/journal-file.c#n354
We allocate 8 MiB at minimum.
and then does the write to it. All this does is fragment the crap
out of the log files because the filesystems cannot optimise the
allocation patterns.
Well, it would be good if you'd tell me what to do instead...
I am invoking fallocate() in advance, because we write those files with
mmap(), and that of course would normally trigger SIGBUS already for the
most boring of reasons, such as disk full/quota full or so. Hence,
before we do anything like that, we invoke fallocate() to ensure that
the space is actually available... As far as I can see, that is pretty much
in line with what fallocate() is supposed to be useful for; the man page
"...After a successful call to posix_fallocate(), subsequent writes
to bytes in the specified range are guaranteed not to fail because
of lack of disk space."
Happy to be informed that the man page is wrong.
I am also happy to change our code, if it really is the wrong thing to
do. Note however that I generally favour correctness and relying on
documented behaviour, instead of nebulous optimizations whose effects
might change with different file systems or kernel versions...
Yup, it fragments journal files on XFS, too.
http://oss.sgi.com/archives/xfs/2014-03/msg00322.html
IIRC, the systemd developers consider this a filesystem problem and
so refused to change the systemd code to be nice to the filesystem
allocators, even though they don't actually need to use fallocate...
What? No need to be a dick. Nobody ever pinged me about this. And yeah, I
think I have a very good reason to use fallocate(). In fact, it's the only
reason the man page explicitly mentions.
Lennart
For what it's worth, I did not write what is attributed to me above. I was quoting Dave Chinner, and I've confirmed the original attribution correctly made it onto the systemd-devel@ list.

I don't know whether people on this distribution list are even subscribed to systemd-devel@, so those subsequent responses likely aren't being posted to systemd-devel@ but rather to linux-***@.


Chris Murphy
Tollef Fog Heen
2014-06-18 05:51:08 UTC
]] Lennart Poettering

(trimming Ccs)
Post by Lennart Poettering
I am invoking fallocate() in advance, because we write those files with
mmap(), and that of course would normally trigger SIGBUS already for the
most boring of reasons, such as disk full/quota full or so. Hence,
before we do anything like that, we invoke fallocate() to ensure that
the space is actually available... As far as I can see, that is pretty much
in line with what fallocate() is supposed to be useful for; the man page
OOI, why not use write + read-only mmap for reading? In testing
performance for another project, that's significantly faster than rw
mmap.
--
Tollef Fog Heen
UNIX is user friendly, it's just picky about who its friends are
Lennart Poettering
2014-06-20 13:08:39 UTC
Post by Tollef Fog Heen
]] Lennart Poettering
(trimming Ccs)
Post by Lennart Poettering
I am invoking fallocate() in advance, because we write those files with
mmap(), and that of course would normally trigger SIGBUS already for the
most boring of reasons, such as disk full/quota full or so. Hence,
before we do anything like that, we invoke fallocate() to ensure that
the space is actually available... As far as I can see, that is pretty much
in line with what fallocate() is supposed to be useful for; the man page
OOI, why not use write + read-only mmap for reading? In testing
performance for another project, that's significantly faster than rw
mmap.
Well, non-mmap access might indeed be quicker, but this isn't really
optimized for the last bit of speed; it's more about keeping the code
simple. And the code is quite simple now, as we can use the same code
paths to locate things for appending and for finding them...

Lennart
--
Lennart Poettering, Red Hat
Goffredo Baroncelli
2014-06-12 11:05:01 UTC
----Original message----
Date: 12/06/2014 2:40
Subject: Re: Slow startup of systemd-journal on BTRFS
Post by Goffredo Baroncelli
If someone can suggest how to measure the FRAGMENTATION of the log file, I can try to
collect more scientific data.
So long as you're not using compression, filefrag will show you the fragments of
systemd-journald's journals. I can vouch for the behavior
you're experiencing without xattr +C or autodefrag; beyond that, it also causes
much slowness when reading journal contents. For example, if I want to
search all boots for a particular error message to see how far back it
started, it takes quite a bit longer than on other file systems.
So far I'm not experiencing this problem with autodefrag, or any other
negative side effects, but my understanding is that this code is still in flux.
Since the journals have their own checksumming I'm not overly concerned about
setting xattr +C.

This is true, but it cannot be a general solution: the checksums of the data are
needed during a scrub and/or a RAID rebuild.

I want to investigate doing an explicit defrag once a week.
Chris Murphy
G.Baroncelli
Goffredo Baroncelli
2014-06-12 11:07:51 UTC
----Original message----
Date: 12/06/2014 3:18
Subject: Re: Slow startup of systemd-journal on BTRFS
Post by Goffredo Baroncelli
https://bugzilla.redhat.com/show_bug.cgi?id=1006386
suggested to me that the problem could be due to a bad interaction between
systemd and btrfs; NetworkManager was innocent. It seems that
systemd-journald creates highly fragmented files when it stores its
logs, and BTRFS is known to behave slowly when a file is highly
fragmented. This caused a slow startup of systemd-journald, which in
turn blocked the services that depend on the logging system.
On my BTRFS/systemd systems I edit /etc/systemd/journald.conf and put
"SystemMaxUse=50M". That doesn't solve the fragmentation problem but
reduces it enough that it doesn't bother me.
IIRC my log files are about 80-100MB, so I am not sure whether this would help.
I also want to investigate the option

MaxFileSec=1d

which rotates the log file once a day (or once a week).
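
For reference, both settings live in the [Journal] section of journald.conf; the values below are purely illustrative, not recommendations:

# /etc/systemd/journald.conf
[Journal]
SystemMaxUse=50M
MaxFileSec=1day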
--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
Goffredo Baroncelli
2014-06-12 11:13:26 UTC
Date: 12/06/2014 3:21
Subject: Re: Slow startup of systemd-journal on BTRFS
Post by Goffredo Baroncelli
Hi all,
I would like to share my experience with a slowness of systemd when used
on BTRFS.
Post by Goffredo Baroncelli
My boot time was very high (about 50 seconds); most of it was due to
NetworkManager, which took about 30-40 seconds to start (this data came from
"systemd-analyze plot").
Post by Goffredo Baroncelli
I made several attempts to address this issue. I also noticed that the
problem sometimes disappeared, but I was never able to understand why.
Post by Goffredo Baroncelli
However this link
https://bugzilla.redhat.com/show_bug.cgi?id=1006386
suggested to me that the problem could be due to a bad interaction between
systemd and btrfs; NetworkManager was innocent.
systemd has a very stupid journal write pattern. It checks if there
is space in the file for the write, and if not it fallocates the
small amount of space it needs (it does *4 byte* fallocate calls!)
and then does the write to it. All this does is fragment the crap
out of the log files because the filesystems cannot optimise the
allocation patterns.
I checked the code, and it seems to me that the fallocate() calls are
done in FILE_SIZE_INCREASE units (currently 8MB).
Yup, it fragments journal files on XFS, too.
http://oss.sgi.com/archives/xfs/2014-03/msg00322.html
IIRC, the systemd developers consider this a filesystem problem and
so refused to change the systemd code to be nice to the filesystem
allocators, even though they don't actually need to use fallocate...
If I manage to put together a decent test setup, I would like to experiment
with changing some parameters, such as:
- removing fallocate entirely (or only at the beginning?)
- increasing the fallocate allocation unit
- changing the log file size and rotation time
- defragmenting periodically
[...]
Cheers,
Dave.
--
Dave Chinner
Goffredo Baroncelli
2014-06-13 22:19:31 UTC
Hi Dave
systemd has a very stupid journal write pattern. It checks if there is
space in the file for the write, and if not it fallocates the small
amount of space it needs (it does *4 byte* fallocate calls!) and then
does the write to it. All this does is fragment the crap out of the log
files because the filesystems cannot optimise the allocation patterns.
I checked the code, and it seems to me that the fallocate() calls are done in
FILE_SIZE_INCREASE units (currently 8MB).
FWIW, either 4 byte or 8 MiB fallocate calls would be bad, I think
actually pretty much equally bad without NOCOW set on the file.
So maybe it's been fixed in systemd since the last time I looked.
http://cgit.freedesktop.org/systemd/systemd/commit/src/journal/journal-file.c?id=eda4b58b50509dc8ad0428a46e20f6c5cf516d58
The reason it was changed? To "save a syscall per append", not to
prevent fragmentation of the file, which was the problem everyone
was complaining about...
Thanks for pointing that out. However, I am performing my tests on a Fedora 20 with systemd-208, which seems to have this change.
Why? Because btrfs data blocks are 4 KiB. With COW, the effect for
either 4 byte or 8 MiB file allocations is going to end up being the
same, forcing (repeated until full) rewrite of each 4 KiB block into its
own extent.
I am reaching the conclusion that fallocate is not the problem. fallocate increases the file size by about 8MB, which is enough for a fair amount of logging, so it is not called very often.

I have to investigate further what happens when the logs are copied from /run to /var/log/journal: this is when journald seems to slow everything down.

I have prepared a PC which reboots continuously; I am collecting the time required to finish the boot vs. the fragmentation of the system.journal file vs. the number of boots. The results are dramatic: after 20 reboots, the boot time increases by 20-30 seconds. Defragmenting system.journal brings the boot time back down to the original, but after another 20 reboots, the boot time again takes 20-30 seconds more....

It is a slow PC, but I saw the same behavior on a more modern PC (an i5 with 8GB).

For both PCs the HD is a mechanical one...
And that's now a btrfs problem.... :/
Are you sure?

***@venice:/var/log$ sudo filefrag messages
messages: 29 extents found

***@venice:/var/log$ sudo filefrag journal/*/system.journal
journal/41d686199835445395ac629d576dfcb9/system.journal: 1378 extents found

So the old rsyslog creates files with fewer fragments. BTRFS (and, it seems, XFS too) certainly highlights this problem more than other filesystems do. But systemd also seems to create a lot of extents.

BR
G.Baroncelli
Cheers,
Dave.
--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Goffredo Baroncelli
2014-06-14 07:52:39 UTC
Goffredo Baroncelli posted on Sat, 14 Jun 2014 00:19:31 +0200 as
Post by Goffredo Baroncelli
FWIW, either 4 byte or 8 MiB fallocate calls would be bad, I think
actually pretty much equally bad without NOCOW set on the file.
So maybe it's been fixed in systemd since the last time I looked.
http://cgit.freedesktop.org/systemd/systemd/commit/src/journal/journal-
file.c?id=eda4b58b50509dc8ad0428a46e20f6c5cf516d58
Post by Goffredo Baroncelli
The reason it was changed? To "save a syscall per append", not to
prevent fragmentation of the file, which was the problem everyone was
complaining about...
Thanks for pointing that out. However, I am performing my tests on a Fedora
20 with systemd-208, which seems to have this change.
Why? Because btrfs data blocks are 4 KiB. With COW, the effect for
either 4 byte or 8 MiB file allocations is going to end up being the
same, forcing (repeated until full) rewrite of each 4 KiB block into
its own extent.
I am reaching the conclusion that fallocate is not the problem.
fallocate increases the file size by about 8MB, which is enough for a fair
amount of logging, so it is not called very often.
But...
If a file isn't (properly[1]) set NOCOW (and the btrfs isn't mounted with
nodatacow), then an fallocate of 8 MiB will increase the file size by 8
MiB and write that out. So far so good as at that point the 8 MiB should
be a single extent. But then, data gets written into 4 KiB blocks of
that 8 MiB one at a time, and because btrfs is COW, the new data in the
block must be written to a new location.
Which effectively means that by the time the 8 MiB is filled, each 4 KiB
block has been rewritten to a new location and is now an extent unto
itself. So now that 8 MiB is composed of 2048 new extents, each one a
single 4 KiB block in size.
Several people pointed to fallocate as the problem, but I don't understand the reason.
1) 8MB is quite a large value, so fallocate is called (at worst) once during the boot, and often never, because the logs are smaller than 8MB.
2) It is true that with fallocate btrfs "rewrites" each 4KB page almost twice. But the first time is one "big" write of 8MB, while the second write would happen in any case. What I mean is that even without the fallocate, journald would still make small writes.

To be honest, I struggle to see the gain of having an fallocate on a COW filesystem... maybe I don't understand the fallocate() call very well.
=:^(
Tho as I already stated, for file sizes up to 128 MiB or so anyway[2], the
btrfs autodefrag mount option should at least catch that and rewrite
(again), this time sequentially.
Post by Goffredo Baroncelli
I have to investigate further what happens when the logs are copied from
/run to /var/log/journal: this is when journald seems to slow everything down.
That's an interesting point.
At least in theory, during normal operation journald will write to
/var/log/journal, but there's a point during boot at which it flushes the
information accumulated during boot from the volatile /run location to
the non-volatile /var/log location. /That/ write, at least, should be
sequential, since there will be > 4 KiB of journal accumulated that needs
to be transferred at once. However, if it's being handled by the forced
pre-write fallocate described above, then that's not going to be the
case, as it'll then be a rewrite of already fallocated file blocks and
thus will get COWed exactly as I described above.
=:^(
Post by Goffredo Baroncelli
I have prepared a PC which reboots continuously; I am collecting the time
required to finish the boot vs. the fragmentation of the system.journal
file vs. the number of boots. The results are dramatic: after 20 reboots,
the boot time increases by 20-30 seconds. Defragmenting
system.journal brings the boot time back down to the original, but after
another 20 reboots, the boot time again takes 20-30 seconds more....
It is a slow PC, but I saw the same behavior on a more modern PC (an i5 with 8GB).
For both PCs the HD is a mechanical one...
The problem's duplicable. That's the first step toward a fix. =:^)
I hope so.
Post by Goffredo Baroncelli
And that's now a btrfs problem.... :/
Are you sure?
As they say, "Whoosh!"
[...]
Another alternative is that distros will start setting /var/log/journal
NOCOW in their setup scripts by default when it's btrfs, thus avoiding
the problem. (Altho if they do automated snapshotting they'll also have
to set it as its own subvolume, to avoid the
first-write-after-snapshot-is-COW problem.) Well, that, and/or set
autodefrag in the default mount options.
Note that this also removes the checksums, which are very useful in a RAID configuration.
Meanwhile, there's some focus on making btrfs behave better with such
rewrite-pattern files, but while I think the problem can be made /some/
better, hopefully enough that the defaults bother far fewer people in far
fewer cases, I expect it'll always be a bit of a sore spot because that's
just how the technology works, and as such, setting NOCOW for such files
and/or using autodefrag will continue to be recommended for an optimized
setup.
---
[1] "Properly" set NOCOW: Btrfs doesn't guarantee the effectiveness of
setting NOCOW (chattr +C) unless the attribute is set while the file is
still zero size, effectively, at file creation. The easiest way to do
that is to set NOCOW on the subdir that will contain the file, such that
when the file is created it inherits the NOCOW attribute automatically.
[2] File sizes up to 128 MiB ... and possibly up to 1 GiB. Under 128 MiB
should be fine, over 1 GiB is known to cause issues, between the two is a
gray area that depends on the speed of the hardware and the incoming
write-stream.
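
The "properly set NOCOW" recipe from footnote [1] can also be applied from code. Here is a minimal sketch, equivalent to running "chattr +C" on the directory (my illustration; FS_NOCOW_FL requires a reasonably recent linux/fs.h):

/* Set the NOCOW flag on the journal directory so that files created
 * inside it inherit the attribute at creation time. */
#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void) {
        int fd = open("/var/log/journal", O_RDONLY | O_DIRECTORY);
        if (fd < 0) { perror("open"); return 1; }

        int flags;
        if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0) { perror("FS_IOC_GETFLAGS"); return 1; }

        flags |= FS_NOCOW_FL;           /* the 'C' attribute */
        if (ioctl(fd, FS_IOC_SETFLAGS, &flags) < 0) { perror("FS_IOC_SETFLAGS"); return 1; }

        close(fd);
        return 0;
}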
--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Duncan
2014-06-15 05:43:06 UTC
Goffredo Baroncelli posted on Sat, 14 Jun 2014 09:52:39 +0200 as
Post by Goffredo Baroncelli
Goffredo Baroncelli posted on Sat, 14 Jun 2014 00:19:31 +0200 as
Post by Goffredo Baroncelli
Thanks for pointing that out. However, I am performing my tests on a Fedora
20 with systemd-208, which seems to have this change.
I am reaching the conclusion that fallocate is not the problem.
fallocate increases the file size by about 8MB, which is enough for a fair
amount of logging, so it is not called very often.
Right.
Post by Goffredo Baroncelli
But...
Exactly, _but_...
Post by Goffredo Baroncelli
[A]n fallocate of 8 MiB will increase the file size
by 8 MiB and write that out. So far so good as at that point the 8 MiB
should be a single extent. But then, data gets written into 4 KiB
blocks of that 8 MiB one at a time, and because btrfs is COW, the new
data in the block must be written to a new location.
Which effectively means that by the time the 8 MiB is filled, each 4
KiB block has been rewritten to a new location and is now an extent
unto itself. So now that 8 MiB is composed of 2048 new extents, each
one a single 4 KiB block in size.
Several people pointed to fallocate as the problem, but I don't understand the reason.
1) 8MB is quite a large value, so fallocate is called (at worst) once
during the boot, and often never, because the logs are smaller than 8MB.
2) It is true that with fallocate btrfs "rewrites" each 4KB page almost
twice. But the first time is one "big" write of 8MB, while the
second write would happen in any case. What I mean is that even without
the fallocate, journald would still make small writes.
To be honest, I struggle to see the gain of having an fallocate on a COW
filesystem... maybe I don't understand the fallocate() call very well.
The base problem isn't fallocate per se, tho it's the trigger in
this case. The base problem is that for COW-based filesystems, *ANY*
rewriting of existing file content results in fragmentation.

It just so happens that the only reason there's existing file content to
be rewritten (as opposed to simply appending) in this case, is because of
the fallocate. The rewrite of existing file content is the problem, but
the existing file content is only there in this case because of the
fallocate.

Taking a step back...

On a non-COW filesystem, allocating 8 MiB ahead and writing into it
rewrites into the already allocated location, thus guaranteeing extents
of 8 MiB each, since once the space is allocated it's simply rewritten in-
place. Thus, on a non-COW filesystem, pre-allocating in something larger
than single filesystem blocks when an app knows the data is eventually
going to be written in to fill that space anyway is a GOOD thing, which
is why systemd is doing it.

But on a COW-based filesystem fallocate is the exact opposite, a BAD
thing, because an fallocate forces the file to be written out at that
size, effectively filled with nulls/blanks. Then the actual logging
comes along and rewrites those nulls/blanks with actual data, but it's
now a rewrite, which on a COW, copy-on-write, based filesystem, the
rewritten block is copied elsewhere, it does NOT overwrite the existing
null/blank block, and "elsewhere" by definition means detached from the
previous blocks, thus in an extent all by itself.

Once the full 2048 original blocks composing that 8 MiB are filled in
with actual data, because they were rewrites from null/blank blocks that
fallocate had already forced to be allocated, that's now 2048 separate
extents, 2048 separate file fragments, where without the forced fallocate,
the writes would have all been appends, and there would have been at
least /some/ chance of some of those 2048 separate blocks being written
at close enough to the same time that they would have been written
together as a single extent. So while the 8 MiB might not have been a
single extent as opposed to 2048 separate extents, it might have been
perhaps 512 or 1024 extents, instead of the 2048 that it ended up being
because fallocate meant that each block was a rewrite into an existing
file, not a new append-write at the end of an existing file.
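
For the record, here is a hypothetical reproducer of that rewrite-into-preallocated-space pattern (my sketch, not journald's code; the file name is a placeholder). Run it on a btrfs mount and inspect the result with "filefrag testfile":

/* Preallocate 8 MiB, then rewrite it one 4 KiB block at a time,
 * syncing after every write so each rewrite lands in its own
 * transaction; with COW, each block ends up in a new extent. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
        const off_t size = 8 * 1024 * 1024;   /* 8 MiB, as journald preallocates */
        const size_t blk = 4096;              /* one btrfs data block */
        char buf[4096];
        memset(buf, 'x', sizeof(buf));

        int fd = open("testfile", O_RDWR | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        int r = posix_fallocate(fd, 0, size);
        if (r != 0) { fprintf(stderr, "posix_fallocate: %s\n", strerror(r)); return 1; }

        for (off_t off = 0; off < size; off += blk) {
                if (pwrite(fd, buf, blk, off) != (ssize_t) blk) { perror("pwrite"); return 1; }
                fsync(fd);
        }

        close(fd);
        return 0;
}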
Post by Goffredo Baroncelli
[...]
Another alternative is that distros will start setting /var/log/journal
NOCOW in their setup scripts by default when it's btrfs, thus avoiding
the problem. (Altho if they do automated snapshotting they'll also
have to set it as its own subvolume, to avoid the
first-write-after-snapshot-
is-COW problem.) Well, that, and/or set autodefrag in the default
mount options.
Note that this also removes the checksums, which are very useful
in a RAID configuration.
Well, it can be. But this is only log data, not executables or the like,
and (as Kai K points out) journald has its own checksumming method
in any case.

Besides which, you still haven't explained why you can't either set the
autodefrag mount option and be done with it, or run a systemd-timer-
triggered or cron-triggered defrag script to defrag them automatically at
hourly or daily or whatever intervals. Those don't disable btrfs
checksumming, but /should/ solve the problem.

(Tho if you're btrfs snapshotting the journals defrag has its own
implications, but they should be relatively limited in scope compared to
the fragmentation issues we're dealing with here.)
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Lennart Poettering
2014-06-15 22:39:39 UTC
Post by Duncan
The base problem isn't fallocate per se, tho it's the trigger in
this case. The base problem is that for COW-based filesystems, *ANY*
rewriting of existing file content results in fragmentation.
It just so happens that the only reason there's existing file content to
be rewritten (as opposed to simply appending) in this case, is because of
the fallocate. The rewrite of existing file content is the problem, but
the existing file content is only there in this case because of the
fallocate.
Taking a step back...
On a non-COW filesystem, allocating 8 MiB ahead and writing into it
rewrites into the already allocated location, thus guaranteeing extents
of 8 MiB each, since once the space is allocated it's simply rewritten in-
place. Thus, on a non-COW filesystem, pre-allocating in something larger
than single filesystem blocks when an app knows the data is eventually
going to be written in to fill that space anyway is a GOOD thing, which
is why systemd is doing it.
Nope, that's not why we do it. We do it to avoid SIGBUS on disk full...
Post by Duncan
But on a COW-based filesystem fallocate is the exact opposite, a BAD
thing, because an fallocate forces the file to be written out at that
size, effectively filled with nulls/blanks. Then the actual logging
comes along and rewrites those nulls/blanks with actual data, but it's
now a rewrite, which on a COW, copy-on-write, based filesystem, the
rewritten block is copied elsewhere, it does NOT overwrite the existing
null/blank block, and "elsewhere" by definition means detached from the
previous blocks, thus in an extent all by itself.
Well, quite frankly I am not entirely sure why fallocate() would be of any
use like that on COW file systems, if this is really how it is
implemented... I mean, as I understood fallocate() -- and as the man
page suggests -- it is something for reserving space on disk, not for
writing anything out. This is why journald is invoking it: to reserve
the space, so that later write accesses to it will not require any
reservation anymore, and hence are unlikely to fail.

Lennart
--
Lennart Poettering, Red Hat
Duncan
2014-06-17 08:05:55 UTC
Post by Lennart Poettering
Well, quite frankly I am not entirely sure why fallocate()
I was barking up the wrong tree with fallocate(). Sorry.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Lennart Poettering
2014-06-15 22:13:07 UTC
Post by Goffredo Baroncelli
Which effectively means that by the time the 8 MiB is filled, each 4 KiB
block has been rewritten to a new location and is now an extent unto
itself. So now that 8 MiB is composed of 2048 new extents, each one a
single 4 KiB block in size.
Several people pointed to fallocate as the problem, but I don't
understand the reason.
BTW, the reason we use fallocate() in journald is not about trying to
optimize anything. It's only used for one reason: to avoid SIGBUS on
disk/quota full, since we actually write everything to the files using
mmap(). I mean, writing things with mmap() is always problematic, and
handling write errors is awfully difficult, but at least two of the most
common reasons for failure we'd like to protect against in advance, under
the assumption that disk/quota full will be reported immediately by the
fallocate(), and the mmap writes later on will then necessarily succeed.

I am not really following why this trips up btrfs, though. I am
not sure I understand why this breaks btrfs COW behaviour. I mean,
fallocate() isn't necessarily supposed to write anything really, it's
mostly about allocating disk space in advance. I would claim that
journald's usage of it is very much within the entire reason why it
exists...

Anyway, happy to change these things around if necessary, but first I'd
like to have a very good explanation of why fallocate() wouldn't be the
right thing to invoke here, and a suggestion for what we should do instead
to cover this use case...

Lennart
--
Lennart Poettering, Red Hat
Lennart Poettering
2014-06-16 10:14:49 UTC
Post by Lennart Poettering
I am not really following why this trips up btrfs, though. I am
not sure I understand why this breaks btrfs COW behaviour. I mean,
fallocate() isn't necessarily supposed to write anything really, it's
mostly about allocating disk space in advance. I would claim that
journald's usage of it is very much within the entire reason why it
exists...
I don't believe that fallocate() makes any difference to fragmentation on
BTRFS. Blocks will be allocated when writes occur so regardless of an
fallocate() call the usage pattern in systemd-journald will cause
fragmentation.
journald's write pattern looks something like this: append something to
the end, make sure it is written, then update a few offsets stored at
the beginning of the file to point to the newly appended data. This is
of course not easy to handle for COW file systems. But then again, it's
probably not too different from access patterns of other database or
database-like engines...
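
Schematically, that pattern looks something like the sketch below (a toy model; the header layout and sizes are invented, not journald's real on-disk format):

/* Append a record at the end of the file, sync it, then update an
 * offset stored near the start of the file in place. On a COW
 * filesystem, that in-place header rewrite goes to a new extent. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
        int fd = open("test.journal", O_RDWR | O_CREAT | O_TRUNC, 0640);
        if (fd < 0) { perror("open"); return 1; }

        uint64_t tail = 4096;   /* pretend the header occupies the first 4 KiB */
        if (ftruncate(fd, tail) < 0) { perror("ftruncate"); return 1; }

        for (int i = 0; i < 1000; i++) {
                char entry[64];
                int n = snprintf(entry, sizeof(entry), "entry %d\n", i);

                /* 1. append the new record at the current end */
                if (pwrite(fd, entry, n, tail) != n) { perror("pwrite"); return 1; }
                tail += n;

                /* 2. make sure it is written */
                fsync(fd);

                /* 3. point the header at the newly appended data */
                if (pwrite(fd, &tail, sizeof(tail), 0) != (ssize_t) sizeof(tail)) {
                        perror("pwrite");
                        return 1;
                }
        }

        close(fd);
        return 0;
}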

Lennart
--
Lennart Poettering, Red Hat
Andrey Borzenkov
2014-06-16 11:56:25 UTC
On Mon, Jun 16, 2014 at 2:14 PM, Lennart Poettering
Post by Lennart Poettering
Post by Lennart Poettering
I am not really following why this trips up btrfs, though. I am
not sure I understand why this breaks btrfs COW behaviour. I mean,
fallocate() isn't necessarily supposed to write anything really, it's
mostly about allocating disk space in advance. I would claim that
journald's usage of it is very much within the entire reason why it
exists...
I don't believe that fallocate() makes any difference to fragmentation on
BTRFS. Blocks will be allocated when writes occur so regardless of an
fallocate() call the usage pattern in systemd-journald will cause
fragmentation.
journald's write pattern looks something like this: append something to
the end, make sure it is written, then update a few offsets stored at
the beginning of the file to point to the newly appended data. This is
of course not easy to handle for COW file systems. But then again, it's
probably not too different from access patterns of other database or
database-like engines...
... which traditionally experienced severe sequential read performance
degradation in this case. As I understand it, this is exactly what happens
- readahead attempts to preload the files, which gives us heavy random read
access.

The only real remedy was to defragment the files. It should work
relatively well for the journal, where files are mostly "write once", at the
expense of additional read/write activity.
Kai Krakow
2014-06-17 00:33:31 UTC
Post by Andrey Borzenkov
Post by Lennart Poettering
journald's write pattern looks something like this: append something to
the end, make sure it is written, then update a few offsets stored at
the beginning of the file to point to the newly appended data. This is
of course not easy to handle for COW file systems. But then again, it's
probably not too different from access patterns of other database or
database-like engines...
... which traditionally experienced severe sequential read performance
degradation in this case. As I understand this is exactly what happens
- readahead attempts to preload files which gives us heavy random read
access.
The only real remedy was to defragment files. It should work
relatively well for journal where files are mostly "write once" at the
expense of additional read/write activity.
This is an interesting point, because it means: readahead may hurt systemd
boot performance a lot if it hits a heavily fragmented file like the
affected journal files.

So, should readahead ignore these files? It may look obvious, but I
don't think so...

On my system, readahead will process system.journal and thus defragment it,
so it won't fragment too much and thereby won't slow the boot down. This may
also explain why in Goffredo's tests the readahead process never got around to
recording the system.journal - because the boot process was finished
(readahead-done) before any process started reading system.journal - but
still, the high fragmentation of the other journals causes very high IO and
the boot process becomes slower and slower.

The root fix should be to make journald not cause high fragmentation. But
I think Andrey's point should be given some thought, too.
--
Replies to list only preferred.
Martin
2014-06-16 19:52:07 UTC
Post by Lennart Poettering
Post by Lennart Poettering
I am not really following why this trips up btrfs, though. I am
not sure I understand why this breaks btrfs COW behaviour. I mean,
I don't believe that fallocate() makes any difference to fragmentation on
BTRFS. Blocks will be allocated when writes occur so regardless of an
fallocate() call the usage pattern in systemd-journald will cause
fragmentation.
journald's write pattern looks something like this: append something to
the end, make sure it is written, then update a few offsets stored at
the beginning of the file to point to the newly appended data. This is
of course not easy to handle for COW file systems. But then again, it's
probably not too different from access patterns of other database or
database-like engines...
Even though this appears to be a problem case for btrfs/COW, is there a
more favourable write/access sequence possible, easily implemented, that
works well for both ext4-like fs /and/ COW fs?

Database-like writing is known to be 'difficult' for filesystems: can a
data log be a simpler case?
Was waiting for you to show up before I said anything since most systemd
related emails always devolve into how evil you are rather than what is
actually happening.
Ouch! Hope you two know each other!! :-P :-)


[...]
since we shouldn't be fragmenting this badly.
Like I said, what you guys are doing is fine; if btrfs falls on its face
then it's not your fault. I'd just like an exact idea of when you guys
are fsync'ing so I can replicate it in a smaller way. Thanks,
Good if COW can be so resilient. I have about 2 GBytes of data logging
files, and I must defrag those as part of my backups to stop the system
grinding to a halt from fragmentation (I use "cp -a" to defrag the files
to a new area and restart the data logger software on that).


Random thoughts:

Would using a second small file just for the mmap-ed pointers help avoid
repeated rewriting of random offsets in the log file causing excessive
fragmentation?

Align the data writes to 16kByte or 64kByte boundaries/chunks?

Are mmap-ed files a similar problem to using a swap file and so should
the same "btrfs file swap" code be used for both?


Not looked over the code so all random guesses...

Regards,
Martin
Chris Murphy
2014-06-17 17:56:38 UTC
It's not a mmap problem, it's a small writes with an msync or fsync
after each one problem.
For the case of sequential writes (via write or mmap), padding writes
to page boundaries would help, if the wasted space isn't an issue.
Another approach, again assuming all other writes are appends, would
be to periodically (but frequently enough that the pages are still in
cache) read a chunk of the file and write it back in-place, with or
without an fsync. On the other hand, if you can afford to lose some
logs on a crash, not fsyncing/msyncing after each write will also
eliminate the fragmentation.
Normally I'd be willing to give up ~30 seconds of journal to not have fragmented journals. But then if I use systemd.log_level=debug I'd like that to trigger more frequent flushing to make sure as little of the journaling is lost as possible. Does that make sense?


Chris Murphy
Kai Krakow
2014-06-17 18:34:14 UTC
Post by Chris Murphy
It's not a mmap problem, it's a small writes with an msync or fsync
after each one problem.
For the case of sequential writes (via write or mmap), padding writes
to page boundaries would help, if the wasted space isn't an issue.
Another approach, again assuming all other writes are appends, would
be to periodically (but frequently enough that the pages are still in
cache) read a chunk of the file and write it back in-place, with or
without an fsync. On the other hand, if you can afford to lose some
logs on a crash, not fsyncing/msyncing after each write will also
eliminate the fragmentation.
Normally I'd be willing to give up ~30 seconds of journal to not have
fragmented journals. But then if I use systemd.log_level=debug I'd like
that to trigger more frequent flushing to make sure as little of the
journaling is lost as possible. Does that make sense?
AFAIR systemd-journald already does explicit flushing when "more important"
log entries hit the log. Well, of course "debug" is one of the least
important log levels in that chain. But using the same code paths, it should
be trivial to implement a configurable trigger which also forces explicit
flushing...
--
Replies to list only preferred.
Filipe Brandenburger
2014-06-17 18:46:37 UTC
For the case of sequential writes (via write or mmap), padding writes
to page boundaries would help, if the wasted space isn't an issue.
Another approach, again assuming all other writes are appends, would
be to periodically (but frequently enough that the pages are still in
cache) read a chunk of the file and write it back in-place, with or
without an fsync. On the other hand, if you can afford to lose some
logs on a crash, not fsyncing/msyncing after each write will also
eliminate the fragmentation.
I was wondering if something could be done in btrfs to improve
performance under this workload... something like a "defrag on demand"
for the case where mostly appends are happening.

When there are small appends with fsync/msync, they become new
fragments (as expected), but once the writes go past a block boundary,
btrfs could defragment the previous block in the background, since it's
not really expected to change again.

That could potentially achieve performance close to chattr +C without
the drawbacks of disabling copy-on-write.

Cheers,
Filipe
Kai Krakow
2014-06-17 21:02:14 UTC
Post by Filipe Brandenburger
For the case of sequential writes (via write or mmap), padding writes
to page boundaries would help, if the wasted space isn't an issue.
Another approach, again assuming all other writes are appends, would
be to periodically (but frequently enough that the pages are still in
cache) read a chunk of the file and write it back in-place, with or
without an fsync. On the other hand, if you can afford to lose some
logs on a crash, not fsyncing/msyncing after each write will also
eliminate the fragmentation.
I was wondering if something could be done in btrfs to improve
performance under this workload... Something like a "defrag on demand"
for a case where mostly appends are happening.
When there are small appends with fsync/msync, they become new
fragments (as expected), but once the writes go past a block boundary,
btrfs could defragment the previous block in background, since it's
not really expected to change again.
That could potentially achieve performance close to chattr +C without
the drawbacks of disabling copy-on-write.
I thought about something like that, too. I'm pretty sure it really doesn't
matter if your 500G image file is split across 10000 extents - as long as at
least chunks of extents are kept together and rebuilt as one extent. That
means, instead of letting autodefrag work on the whole file, just let it
operate on a chunk of it within some sane boundaries (maybe 8MB chunks),
of course without splitting existing extents that already cross a chunk
boundary. That way, it would still reduce head movements a lot while
maintaining good performance during defragmentation. Your idea would be the
missing companion to that (it is a sort of slow-growing-file detection).

If I remember correctly, MacOSX implements a similar adaptive defragmentation
strategy for its HFS+ filesystem, tho the underlying semantics are probably
quite different. And it acts upon opening the file instead of upon writing to
the file, so it is probably limited to smallish files only (which I don't
think makes much sense on its own; for small files, locality to
semantically similar files is much more important, i.e. files needed during
boot, files needed for starting a specific application).

If, after those strategies, it is still important to get your file's chunks
cleanly aligned one after the other, one could still run a manual defrag
which does the complete job.

BTW: Is it possible to physically relocate files in btrfs? I think that is
more important than defragmentation. Is such a thing possible with the
defrag IOCTL? My current understanding is that defragmenting a file just
rewrites it somewhere random as a contiguous block - which is not always the
best option, because it can hurt boot performance a lot and thus undo the
very effect one is trying to achieve. It feels a bit like playing the
lottery when I defragment my boot files, only to learn that the boot process
is now slower instead of faster. :-\
--
Replies to list only preferred.
Duncan
2014-06-18 02:03:23 UTC
Post by Kai Krakow
I'm pretty sure it really
doesn't matter if your 500G image file is split across 10000 extents -
as long as at least chunks of extents are kept together and rebuilt as
one extent.
It's worth noting that in btrfs terms, "chunk" has a specific meaning.
Btrfs allocates "chunks" from free space, 1 GiB at a time for data
chunks, 256 MiB at a time for metadata.[1] And that btrfs-specific
meaning does tend to distort your example case, somewhat, tho I expect
most folks understand what you mean. The problem is that we're running
out of generic terms that mean "chunk-like" and are thus experiencing
namespace collision!

Anyway, btrfs chunks, data chunks being of interest here, get allocated
on demand. The practical effect is thus that the maximum btrfs extent
size is 1 GiB. As such, the least-extents-possible ideal for that 500
GiB image file would be 500 extents.
Post by Kai Krakow
That means, instead of letting autodefrag work on the whole
file just let it operate on a chunk of it within some sane boundaries -
maybe 8MB chunks, - of course without splitting existing extents if
those already cross a chunk boundary.
As stated, "chunk" has a special meaning in btrfs. Data chunks are 1 GiB
in size and no extent can cross a btrfs chunk boundary.
Post by Kai Krakow
BTW: Is it possible to physically relocate files in btrfs?
Currently, ENOTIMPLEMENTED. AFAIK it's possible and is on the list, but
so are a number of other "nice-to-haves", so it might be awhile.
Actually just checked the wiki. The closest specific feature point
listed says "hot-data-tracking and moving to faster devices", noting that
there's actually some current work on the generic (not-btrfs-specific)
feature, to be made available via VFS. Your specific use-case will
probably be part of that general implementation.

---
[1] Btrfs chunk allocation: 1 GiB data chunks size, 256 MiB metadata
chunk size by default, but there are various exceptions. The first of
course is when space gets tight. The last allocation will be whatever is
left. Second, there's mixed-data/metadata mode, the default when the
filesystem is <= 1 GiB. Mixed chunks are (like metadata) 256 MiB but
contain data and metadata mixed, sacrificing speed efficiency for more
efficient space management. Third, it's worth noting that the single-
device-btrfs default is dup-mode-metadata (with the same default for
mixed-mode), so while they're 256 MiB each, two get allocated at once,
thus allocating metadata half a GiB at a time. Multi-device-btrfs
metadata defaults to raid1-mode, still allocated in chunk pairs but one
on each of two separate devices. Data mode always defaults to single,
tho on multi-device-btrfs both data and metadata can be (individually)
set to various raid modes, each with its own specific allocation pattern.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Kai Krakow
2014-06-18 05:21:00 UTC
Post by Duncan
Post by Kai Krakow
I'm pretty sure it really
doesn't matter if your 500G image file is split across 10000 extents -
as long as at least chunks of extents are kept together and rebuilt as
one extent.
It's worth noting that in btrfs terms, "chunk" has a specific meaning.
Btrfs allocates "chunks" from free space, 1 GiB at a time for data
chunks, 256 MiB at a time for metadata.[1] And that btrfs-specific
meaning does tend to distort your example case, somewhat, tho I expect
most folks understand what you mean. The problem is that we're running
out of generic terms that mean "chunk-like" and are thus experiencing
namespace collision!
Well, should I call that dingbat a "hunk", being a lump of chunks? ;-)

Of course I meant "chunk" as a contiguous group of "blocks" in a file,
not a "chunk" in the btrfs-specific sense of an "allocation
group". I think it's clear we are out of words, as you point out - so I hope
everyone got it right, because I tried to "redefine" the term for my purpose by
writing "chunks of extents"...
Post by Duncan
[...]
--
Replies to list only preferred.
Lennart Poettering
2014-06-17 21:12:44 UTC
So you are doing all the right things from what I can tell, I'm just
a little confused about when you guys run fsync.
Kai Krakow
2014-06-17 21:45:38 UTC
[...] Finally, if something is logged
at priorities EMERG, ALERT or CRIT we will sync immediately (which
actually should never happen in real-life, unless something is really
broken -- a simple way to check if anything like this got written is
"journalctl -p crit").
[...]
Hmm, the only way I see that happening is if a lot of stuff is
logged at these super-high log levels mentioned above. But then again,
that never really should happen in real life.
Unless you use those HP utilities:

# journalctl -p crit
-- Logs begin at Mo 2014-05-05 02:26:26 CEST, end at Di 2014-06-17 23:33:48
CEST. --
Mai 20 10:17:21 jupiter hp-systray[1366]: hp-systray[1366]: error: option -s
not recognized
-- Reboot --
Mai 22 22:37:15 jupiter hp-systray[1413]: hp-systray[1413]: error: option -s
not recognized
-- Reboot --
Jun 03 08:49:17 jupiter hp-systray[1501]: hp-systray[1501]: error: option -s
not recognized
-- Reboot --
Jun 09 13:45:31 jupiter hp-systray[1397]: hp-systray[1397]: error: option -s
not recognized
-- Reboot --
Jun 11 00:21:16 jupiter hp-systray[1405]: hp-systray[1405]: error: option -s
not recognized
-- Reboot --
Jun 14 00:30:49 jupiter hp-systray[1434]: hp-systray[1434]: error: option -s
not recognized
-- Reboot --
Jun 14 22:54:43 jupiter hp-systray[1416]: hp-systray[1416]: error: option -s
not recognized
-- Reboot --
Jun 14 23:11:11 jupiter hp-systray[1437]: hp-systray[1437]: error: option -s
not recognized
-- Reboot --
Jun 15 14:53:44 jupiter hp-systray[1447]: hp-systray[1447]: error: option -s
not recognized
-- Reboot --
Jun 15 16:53:39 jupiter hp-systray[1404]: hp-systray[1404]: error: option -s
not recognized
-- Reboot --
Jun 15 17:38:40 jupiter hp-systray[1397]: hp-systray[1397]: error: option -s
not recognized
-- Reboot --
Jun 15 18:34:37 jupiter hp-systray[1375]: hp-systray[1375]: error: option -s
not recognized
-- Reboot --
Jun 17 20:38:32 jupiter hp-systray[1406]: hp-systray[1406]: error: option -s
not recognized
Could anyone who's experiencing the slowdowns have a look at the
journalctl output mentioned above? Do you have more than a few lines
printed like that?
Nasty message, but no slowdowns. And it's printed just during desktop session
start, so it's no problem. Still, I wonder what this utility is trying to do
(it seems to put itself into autostart with an "-s" switch without
supporting it; probably a packaging and/or KDE issue). At least HP thinks:
Oh boy - that's really critical! Whoohoo. Why not let the system explode?
:~)
--
Replies to list only preferred.
Goffredo Baroncelli
2014-06-16 16:38:55 UTC
Hi Lennart,
Post by Lennart Poettering
I am not really following why this trips up btrfs, though. I am
not sure I understand why this breaks btrfs COW behaviour. I mean,
fallocate() isn't necessarily supposed to write anything really, it's
mostly about allocating disk space in advance. I would claim that
journald's usage of it is very much within the entire reason why it
exists...
I performed several tests, trying different setups [1]. One of these was replacing the posix_fallocate() with a truncate, to check where the problem lies. The conclusion was that *posix_fallocate() is NOT the problem*.

In another reply you stated that systemd-journald appends some data at the end of the file, then updates some data in the middle. I think this is the reason why the file quickly becomes fragmented.


[1] Let me revise the English, then I will post the results.
--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Goffredo Baroncelli
2014-06-16 18:47:49 UTC
Hi all,

in this blog post [1] I collected all the results of the tests I performed in order to investigate this performance problem between systemd and btrfs. I had to put the results in a blog post because there are several images. Below is a brief summary.

I took an old machine (a P4 2.5GHz with 512MB of RAM) with a fresh installation of Fedora 20, and I measured the boot time over several reboots (up to 70). I tested the following scenarios:

1) standard (without defragmenting any file, without readahead)
2) defragment the journal file at the end of the boot
3) defragment the journal file before the flushing
4) mark as NOCOW the journald log file
5) enable the systemd-readahead
6) remove the fsync(2) call from journald
7) remove the posix_fallocate(3) call from journald
8) do a defrag when posix_fallocate(3) is called
9) do a defrag when the journal log file is opened

Test #1 highlights the problem. It shows that the boot time may reach up to 50 seconds. Over the reboots, the number of extents of the system.journal file increases to up to 6000. Defragmenting the system.journal file decreases the boot time by ~20 seconds. My conclusion is that on BTRFS the fragmentation of this file increases the boot time.

Tests #6 and #7 suggest that the fsync(2) and posix_fallocate(3) calls aren't the root cause of the problem. Even without them, the system.journal file still fragments.

Test #4 suggests that marking the system.journal file NOCOW reduces its fragmentation and thus the boot time.

Tests #2, #3, and #9 suggest that performing a periodic defrag significantly reduces the fragmentation of system.journal and thus the boot time.

Test #5 revealed that the readahead capability of systemd was not effective: it seems that the system.journal file was unaffected (though other *.journal files were). Further investigation is required.

BR
G.Baroncelli



[1] http://kreijack.blogspot.it/2014/06/btrfs-and-systemd-journal.html
Post by Goffredo Baroncelli
Hi Lennart,
Post by Lennart Poettering
I am not really following why this trips up btrfs, though. I am
not sure I understand why this breaks btrfs COW behaviour. I mean,
fallocate() isn't necessarily supposed to write anything really, it's
mostly about allocating disk space in advance. I would claim that
journald's usage of it is very much within the entire reason why it
exists...
I performed several tests, trying different setups [1]. One of these was replacing the posix_fallocate() call with a truncate, to check where the problem is. The conclusion was that *posix_fallocate() is NOT the problem*.
In another reply you stated that systemd-journald appends some data at the end of the file, then updates some data in the middle. I think this is why the file becomes fragmented so quickly.
[1] Let me revise the English, then I will post the results.
--
gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Martin
2014-06-16 22:35:16 UTC
Permalink
Post by Goffredo Baroncelli
Hi all,
in this blog post [1] I collected all the results of the tests I
performed to investigate this performance problem
between systemd and btrfs. I had to put the results in a blog post
because there are several images. Below is a brief summary.
Test #1 highlights the problem. It shows that the boot time may
require up to 50 seconds. Across the reboots the number of extents of
the system.journal file increases up to 6000. Defragmenting the
system.journal file decreases the boot time by ~20 seconds. My
conclusion is that in BTRFS the fragmentation of this file increases
the boot time.
Tests #6 and #7 suggested that the fsync(2) and posix_fallocate(3)
calls aren't the root cause of the problem. Even without them the
system.journal file still fragments.
[1]
http://kreijack.blogspot.it/2014/06/btrfs-and-systemd-journal.html
Very good demonstration and summary, thanks.

The charts very clearly show the correspondence between time-to-boot and
the level of fragmentation.

... And I thought Linux was getting ever faster to boot!


Regards,
Martin
Duncan
2014-06-17 07:52:07 UTC
Permalink
Goffredo Baroncelli posted on Mon, 16 Jun 2014 20:47:49 +0200 as
Post by Goffredo Baroncelli
Tests #6 and #7 suggested that the fsync(2) and posix_fallocate(3)
calls aren't the root cause of the problem. Even without them the
system.journal file still fragments.
[1] http://kreijack.blogspot.it/2014/06/btrfs-and-systemd-journal.html
Thanks for injecting some concrete facts in the form of test results into
things. =:^)

So I was barking up the wrong tree with fallocate.

But I had the base problem correct -- fragmentation -- and good
recommendations for working around it -- NOCOW and/or (auto)defrag.
(And for that matter, for those like me where /var/log is a dedicated
partition, simply putting /var/log on something other than btrfs would
nicely work around the issue too.)

The experts are working on it now, so I'll step out for the most part as
what else I could add at this point would be simply noise. But I did
consider it important to acknowledge that I had been wrong about the
fallocate.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Kai Krakow
2014-06-14 10:59:31 UTC
Permalink
As they say, "Whoosh!"
At least here, I interpreted that remark as primarily sarcastic
commentary on the systemd devs' apparent attitude, which can be
(controversially) summarized as: "Systemd doesn't have problems because
it's perfect. Therefore, any problems you have with systemd must instead
be with other components which systemd depends on."
Come on, sorry, but this is FUD. Really... ;-)
IOW, it's a btrfs problem now in practice, not because it is so in a
technical sense, but because systemd defines it as such and is unlikely
to budge, so the only way to achieve progress is for btrfs to deal with
it.
I think that systemd is even one of the early supporters of btrfs because it
will defragment readahead files on boot from btrfs. I'd suggest the problem
is to be found in the different semantics with COW filesystems. And if
someone loudly complains to the systemd developers about how bad they are at doing
their stuff - hmm, well, I would be disappointed/offended, too, as a
programmer, because much very well done work has been put into systemd and
I'd start ignoring such people. In Germany we have a saying for this: "Wie
man in den Wald hineinruft, so schallt es heraus." ("As you shout into the
forest, so it echoes back.") [1] They are doing many things right that the
legacy init systems have not brought to modern systems in the last twenty
years (or so).

So let's start with my journals, on btrfs:

$ sudo filefrag *
***@0004fad12dae7676-98627a3d7df4e35e.journal~: 2 extents found
***@0004fae8ea4b84a4-3a2dc4a93c5f7dc9.journal~: 2 extents found
***@806cd49faa074a49b6cde5ff6fca8adc-000000000008e4cc-0004f82580cdcb45.journal:
5 extents found
***@806cd49faa074a49b6cde5ff6fca8adc-0000000000097959-0004f89c2e8aff87.journal:
5 extents found
***@806cd49faa074a49b6cde5ff6fca8adc-00000000000a166d-0004f98d7e04157c.journal:
5 extents found
***@806cd49faa074a49b6cde5ff6fca8adc-00000000000aad59-0004fa379b9a1fdf.journal:
5 extents found
***@ec16f60db38f43619f8337153a1cc024-0000000000000001-0004fae8e5057259.journal:
5 extents found
***@ec16f60db38f43619f8337153a1cc024-00000000000092b1-0004fb59b1d034ad.journal:
5 extents found
system.journal: 9 extents found
user-***@e4209c6628ed4a65954678b8011ad73f-0000000000085b7a-0004f77d25ebba04.journal:
2 extents found
user-***@e4209c6628ed4a65954678b8011ad73f-000000000008e7fb-0004f83c7bf18294.journal:
2 extents found
user-***@e4209c6628ed4a65954678b8011ad73f-0000000000097fe4-0004f8ae69c198ca.journal:
2 extents found
user-***@e4209c6628ed4a65954678b8011ad73f-00000000000a1a7e-0004f9966e9c69d8.journal:
2 extents found
user-500.journal: 2 extents found

I don't think these are too bad values, eh?

Well, how did I accomplish that?

First, I've set the journal directories nocow. Of course, systemd should do
this by default. I'm not sure if this is a packaging or systemd code issue,
tho. But I think the systemd devs agree that for COW filesystems the journal
directories should be set nocow. After all, the journal is a transactional
database - it does not need COW protection at all costs. And I think they
have their own checksumming protection. So, why let systemd bother with
that? A lot of other software has the same semantic problems with btrfs, too
(e.g. MySQL), yet nobody shouts about the "inabilities" of those programmers. So
why for systemd? Just because it's intrusive by its nature, being a
radically new init system, and thus requires some learning by
its users/admins/packagers? Really? Come on... As admin and/or packager you
have to keep up with current technologies and developments anyway. It's only
important to hide the details from the users.

Back to the extents counts: What I did next was to implement a defrag job
that regularly defrags the journal (actually, the complete log directory, as
other log files suffer the same problem):

$ cat /usr/local/sbin/defrag-logs.rb
#!/bin/sh
exec btrfs filesystem defragment -czlib -r /var/log

It can be easily converted into a timer job with systemd. This is left as an
exercise for the reader.
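(For the record, a minimal sketch of such a timer pair; the unit names and the daily schedule are my own choice, nothing systemd ships:)

# /etc/systemd/system/defrag-logs.service
[Unit]
Description=Defragment /var/log

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/defrag-logs.rb

# /etc/systemd/system/defrag-logs.timer
[Unit]
Description=Run defrag-logs daily

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target

Then "systemctl enable defrag-logs.timer" plus "systemctl start defrag-logs.timer" activates it.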

BTW: Actually, that job isn't currently executed on my system which makes
the numbers above pretty impressive... However, autodefrag is turned on
which may play into the mix. I'm not sure. I stopped automatically running
those defrag jobs a while ago (I have a few more).
An arguably fairer and more impartial assessment of this particular
situation suggests that neither btrfs, which as a COW-based filesystem,
like all COW-based filesystems, has the existing-file-rewrite as a major
technical challenge that it must deal with /somehow/, nor systemd, which
in choosing to use fallocate is specifically putting itself in that
existing-file-rewrite class, is entirely at fault.
This challenge is not only affecting systemd but also a lot of other
packages which do not play nice with btrfs semantics. But - as you correctly
write - you cannot point your finger at just one party. FS and user space
have to come together to evaluate and fix the problems on both sides.
But that doesn't matter if one side refuses to budge, because then the
other side must do so regardless of where the fault was, if there is to
be any progress at all.
Meanwhile, I've predicted before and do so here again, that as btrfs
moves toward mainstream and starts supplanting ext* as the assumed Linux
default filesystem, some of these problems will simply "go away", because
at that point, various apps are no longer optimized for the assumed
default filesystem, and they'll either be patched at some level (distro
level if not upstream) to work better on the new default filesystem, or
will be replaced by something that does. And neither upstream nor distro
level does that patching, then at some point, people are going to find
that said distro performs worse than other distros that do that patching.
Another alternative is that distros will start setting /var/log/journal
NOCOW in their setup scripts by default when it's btrfs, thus avoiding
the problem. (Altho if they do automated snapshotting they'll also have
to set it as its own subvolume, to avoid the first-write-after-snapshot-
is-COW problem.) Well, that, and/or set autodefrag in the default mount
options.
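(A sketch of what such a setup script might do on a fresh install; the commands are illustrative, not taken from any distro:)

# btrfs subvolume create /var/log/journal
# chattr +C /var/log/journal

The dedicated subvolume keeps the journal out of snapshots of the parent, avoiding the first-write-after-snapshot COW, and the +C flag makes journal files created inside it NOCOW from the start.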
Meanwhile, there's some focus on making btrfs behave better with such
rewrite-pattern files, but while I think the problem can be made /some/
better, hopefully enough that the defaults bother far fewer people in far
fewer cases, I expect it'll always be a bit of a sore spot because that's
just how the technology works, and as such, setting NOCOW for such files
and/or using autodefrag will continue to be recommended for an optimized
setup.
Goffredo Baroncelli
2014-06-14 13:30:49 UTC
Permalink
On 06/14/2014 12:59 PM, Kai Krakow wrote:
[...]
Post by Kai Krakow
I think that systemd is even one of the early supporters of btrfs because it
will defragment readahead files on boot from btrfs.
I know that systemd does readahead, but it is the first time I have heard that it does defrag too. Could you elaborate?
Post by Kai Krakow
I'd suggest the problem
is to be found in the different semantics with COW filesystems.
[--cut--cut--cut--]

The problem is due to the interaction between BTRFS and systemd. Both behave differently from their predecessors.

[--cut--cut--cut--]
Post by Kai Krakow
$ sudo filefrag *
5 extents found
5 extents found
5 extents found
5 extents found
5 extents found
5 extents found
system.journal: 9 extents found
2 extents found
2 extents found
2 extents found
2 extents found
user-500.journal: 2 extents found
I don't think these are too bad values, eh?
Well, how did I accomplish that?
First, I've set the journal directories nocow.
If you use nocow, you lose the btrfs ability to rebuild a RAID array by discarding the wrong sector. With the systemd journal checksum, you can tell that data is wrong, but BTRFS with its checksum (when used in RAID mode) is able to *correct* the data too.
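(Concretely, that is the difference between detecting and repairing. A sketch, assuming the journal lives on a btrfs RAID mount: a scrub rewrites bad copies from a good mirror, which journald's own checksums cannot do.)

# btrfs scrub start /var/log
# btrfs scrub status /var/log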

[...]
Post by Kai Krakow
Back to the extents counts: What I did next was implementing a defrag job
that regularly defrags the journal (actually, the complete log directory as
$ cat /usr/local/sbin/defrag-logs.rb
#!/bin/sh
exec btrfs filesystem defragment -czlib -r /var/log
I think this is a more viable solution; what seems strange to me is that the fragmentation level of my rsyslog file is a lot lower.

[...]
--
gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Kai Krakow
2014-06-14 14:03:10 UTC
Permalink
Post by Goffredo Baroncelli
[...]
Post by Kai Krakow
I think that systemd is even one of the early supporters of btrfs because
it will defragment readahead files on boot from btrfs.
I know that systemd does readahead, but it is the first time I have heard
that it does defrag too. Could you elaborate?
Look at src/readahead/readahead-collect.c. It works for btrfs and ext4. With
ext4 it will also relocate the files. Not sure if it does for btrfs. If it
does, the question is: where to relocate in a multi-device file system?

That means, you have to enable systemd-readahead-collect, tho.
--
Replies to list only preferred.
Goffredo Baroncelli
2014-06-14 14:45:50 UTC
Permalink
Post by Kai Krakow
Post by Goffredo Baroncelli
[...]
Post by Kai Krakow
I think that systemd is even one of the early supporters of btrfs because
it will defragment readahead files on boot from btrfs.
I know that systemd does readahead, but it is the first time I have heard
that it does defrag too. Could you elaborate?
Look at src/readahead/readahead-collect.c. It works for btrfs and ext4. With
ext4 it will also relocate the files. Not sure if it does for btrfs.
I am looking at the source, and yes, it does. To be honest it seems to defragment only on btrfs.
Post by Kai Krakow
If it
does, the question is: where to relocate in a multi-device file system?
Systemd uses the defrag capability of btrfs.
Post by Kai Krakow
That means, you have to enable systemd-readahead-collect, tho.
I have to admit that I disabled it. I will run some tests with readahead enabled too.
--
gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Kai Krakow
2014-06-14 20:13:15 UTC
Permalink
Post by Goffredo Baroncelli
I am looking at the source, and yes, it does. To be honest it seems to
defragment only on btrfs.
Oh that seems true... Then defragmenting and relocating is probably a todo
item. I'm sure I've read about it somewhere.
Post by Goffredo Baroncelli
Post by Kai Krakow
If it
does, the question is: where to relocate in a multi-device file system?
Systemd uses the defrag capability of btrfs.
man:systemd-readahead says it relocates and defrags if supported. Scanning
through the source, it only defrags.
Post by Goffredo Baroncelli
Post by Kai Krakow
That means, you have to enable systemd-readahead-collect, tho.
I have to admit that I disabled it. I will run some tests with readahead enabled too.
Take care to enable all needed services to enable defrag... If your services
make use of the journal files, loading them should also become part of the
process. You can check with "/usr/lib/systemd/systemd-readahead analyze".
The state is stored in /.readahead.
--
Replies to list only preferred.
Goffredo Baroncelli
2014-06-14 21:04:24 UTC
Permalink
Post by Kai Krakow
Post by Goffredo Baroncelli
I am looking at the source, and yes, it does. To be honest it seems to
defragment only on btrfs.
Oh that seems true... Then defragmenting and relocating is probably a todo
item. I'm sure I've read about it somewhere.
Post by Goffredo Baroncelli
Post by Kai Krakow
If it
does, the question is: where to relocate in a multi devices file system?
Systemd uses the defrag capability of btrfs.
man:systemd-readahead says it relocates and defrags if supported. Scanning
through the source, it only defrags.
Post by Goffredo Baroncelli
Post by Kai Krakow
That means, you have to enable systemd-readahead-collect, tho.
I have to admit that I disabled it. I will run some tests with readahead enabled too.
Take care to enable all needed services to enable defrag... If your services
make use of the journal files, loading them should also become part of the
process. You can check with "/usr/lib/systemd/systemd-readahead analyze".
The state is stored in /.readahead.
I have enabled all the services (collect, replay, done), but I was unable to see any gain.

I don't know why but system.journal is not considered by readahead:

# /usr/lib/systemd/systemd-readahead analyze | grep journal


100% ( 1) 770: /etc/systemd/journald.conf
50% ( 1) 4194304: /var/log/journal/36f10f5379ec4a1398ac303a0ce20fd0/user-997.journal
50% ( 1) 4194304: /var/log/journal/36f10f5379ec4a1398ac303a0ce20fd0/user-1000.journal
--
gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Kai Krakow
2014-06-14 21:37:04 UTC
Permalink
Post by Goffredo Baroncelli
Post by Kai Krakow
Take care to enable all needed services to enable defrag... If your
services make use of the journal files, loading them should also become
part of the process. You can check with
"/usr/lib/systemd/systemd-readahead analyze". The state is stored in
/.readahead.
I have enabled all the services (collect, replay, done), but I was unable to see any gain.
I don't think you can actually enable the "done" service because it is
static... But if it placed a symlink in /etc/systemd that is probably wrong
and you should undo it. It is automatically activated a few seconds after
booting is done to store the collected data for replaying on next reboot.

Check with "find /etc/systemd -name '*readahead*' -type l"...
Post by Goffredo Baroncelli
# /usr/lib/systemd/systemd-readahead analyze | grep journal
100% ( 1) 770: /etc/systemd/journald.conf
/var/log/journal/36f10f5379ec4a1398ac303a0ce20fd0/user-997.journal
/var/log/journal/36f10f5379ec4a1398ac303a0ce20fd0/user-1000.journal
I tried that on my system and it looks vice-versa somehow:

$ sudo /usr/lib/systemd/systemd-readahead analyze|fgrep journal
100% ( 1) 812: /etc/systemd/journald.conf
100% ( 1) 8388608:
/var/log/journal/121b87ca633e8ac0016656680000001b/system.journal

Strange... ;-)

To get back to your performance problem: Did you try "systemctl enable
bootchart" and then (after a reboot) look at the png generated in /run/log
(at least in my system it is placed there, look at the configs if in doubt).
It can give some interesting clues about why a service starts late or takes
long to execute.
--
Replies to list only preferred.
Goffredo Baroncelli
2014-06-15 09:40:42 UTC
Permalink
Post by Kai Krakow
Post by Goffredo Baroncelli
Post by Kai Krakow
Take care to enable all needed services to enable defrag... If your
services make use of the journal files, loading them should also become
part of the process. You can check with
"/usr/lib/systemd/systemd-readahead analyze". The state is stored in
/.readahead.
I have enabled all the services (collect, replay, done), but I was unable
to see any gain.
I don't think you can actually enable the "done" service because it is
static... But if it placed a symlink in /etc/systemd that is probably wrong
and you should undo it. It is automatically activated a few seconds after
booting is done to store the collected data for replaying on next reboot.
I checked the output of systemd-analyze: systemd-readahead-done is called at the end of the chart. The sym-links exist only for the collect and replay services, which were called at the beginning.
Post by Kai Krakow
Check with "find /etc/systemd -name '*readahead*' -type l"...
Post by Goffredo Baroncelli
# /usr/lib/systemd/systemd-readahead analyze | grep journal
100% ( 1) 770: /etc/systemd/journald.conf
/var/log/journal/36f10f5379ec4a1398ac303a0ce20fd0/user-997.journal
/var/log/journal/36f10f5379ec4a1398ac303a0ce20fd0/user-1000.journal
$ sudo /usr/lib/systemd/systemd-readahead analyze|fgrep journal
100% ( 1) 812: /etc/systemd/journald.conf
/var/log/journal/121b87ca633e8ac0016656680000001b/system.journal
Strange... ;-)
Maybe this is a reason for my degraded performance. However, I tried on two machines (both Fedora) and after several reboots; I also tried removing the /.readahead file...
Nothing.


On another fedora machine, I found

$ sudo /lib/systemd/systemd-readahead analyze | grep journal
100% ( 1) 220384: /usr/lib/systemd/systemd-journald
100% ( 1) 770: /etc/systemd/journald.conf
100% ( 1) 24544: /usr/lib64/rsyslog/imjournal.so
100% ( 1) 122880: /usr/lib64/libsystemd-journal.so.0.11.3
50% ( 1) 4194304: /var/log/journal/189323cd4cc348159b9fd5b32b566b05/user-***@0004f1b1019bc549-09d9ae048ff8eeef.journal~
100% ( 1) 8388608: /var/log/journal/189323cd4cc348159b9fd5b32b566b05/user-***@0004f1b7dcfd728c-38950862a134194d.journal~
50% ( 1) 4194304: /var/log/journal/189323cd4cc348159b9fd5b32b566b05/user-***@0004f1c3fa738010-b0829bd837e30f61.journal~
50% ( 1) 4194304: /var/log/journal/189323cd4cc348159b9fd5b32b566b05/user-995.journal
100% ( 1) 8388608: /var/log/journal/189323cd4cc348159b9fd5b32b566b05/***@0004f6ed6604ebc7-bb2f3b30ac6eb32c.journal~
100% ( 1) 8388608: /var/log/journal/189323cd4cc348159b9fd5b32b566b05/user-***@0004f6ed66a05b08-9d244b02cc294baf.journal~
77% ( 1) 6508544: /var/log/journal/189323cd4cc348159b9fd5b32b566b05/***@0004f6ed6d02e69f-d9c815aa06264d7a.journal~
100% ( 1) 8388608: /var/log/journal/189323cd4cc348159b9fd5b32b566b05/user-***@0004f6ed6db6dd7d-cc84da57d5b01680.journal~
100% ( 1) 8388608: /var/log/journal/189323cd4cc348159b9fd5b32b566b05/user-1000.journal
100% ( 1) 121: /var/lib/rsyslog/imjournal.state


So some journal files are present, but not the current one (system.journal)
Post by Kai Krakow
To get back to your performance problem: Did you try "systemctl enable
bootchart" and then (after a reboot) look at the png generated in /run/log
(at least in my system it is placed there, look at the configs if in doubt).
It can give some interesting clues about why a service starts late or takes
long to execute.
Unfortunately bootchart doesn't work on fedora, because CONFIG_SCHEDSTATS is not enabled in the kernel.
--
gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Kai Krakow
2014-06-15 10:50:55 UTC
Permalink
Post by Goffredo Baroncelli
Post by Kai Krakow
Post by Goffredo Baroncelli
I have enabled all the services (collect, replay, done), but I was
unable to see any gain.
I don't think you can actually enable the "done" service because it is
static... But if it placed a symlink in /etc/systemd that is probably
wrong and you should undo it. It is automatically activated a few seconds
after booting is done to store the collected data for replaying on next
reboot.
I checked the output of systemd-analyze: systemd-readahead-done is called
at the end of the chart. The sym-links exist only for the collect and
replay services, which were called at the beginning.
Then all is good.
Post by Goffredo Baroncelli
Post by Kai Krakow
Post by Goffredo Baroncelli
# /usr/lib/systemd/systemd-readahead analyze | grep journal
100% ( 1) 770: /etc/systemd/journald.conf
/var/log/journal/36f10f5379ec4a1398ac303a0ce20fd0/user-997.journal
/var/log/journal/36f10f5379ec4a1398ac303a0ce20fd0/user-1000.journal
$ sudo /usr/lib/systemd/systemd-readahead analyze|fgrep journal
100% ( 1) 812: /etc/systemd/journald.conf
/var/log/journal/121b87ca633e8ac0016656680000001b/system.journal
Strange... ;-)
Maybe this is a reason for my degraded performance. However, I tried on two
machines (both Fedora) and after several reboots; I also tried removing the
/.readahead file... Nothing.
I'm not sure but generally it is said that readahead utilities may need a
few reboots before fully working. Did you try that? Will it take the journal
into account then?

Maybe there is something special about your setup and your system.journal
does not become touched during boot because of this? You said you also use
rsyslogd. Maybe logs are redirected there instead of written to both logging
mechanisms? Maybe most logging by system daemons is turned off or they write
directly to log files?

I tuned all my daemons and services to not write to any log files at all,
instead log everything to stderr/stdout (which is captured by systemd) or
into the syslogger which is journald as the only mechanism in my system. I
just prefer to have everything in one place, and journalctl is a great way
to filter the logs or create custom views. Maybe that's a reason why it is
taken into account by readahead on my system? I can only guess but maybe
worth a try for you...
Post by Goffredo Baroncelli
Post by Kai Krakow
To get back to your performance problem: Did you try "systemctl enable
bootchart" and then (after a reboot) look at the png generated in
/run/log (at least in my system it is placed there, look at the configs
if in doubt). It can give some interesting clues about why a service
starts late or takes long to execute.
Unfortunately bootchart doesn't work on fedora, because CONFIG_SCHEDSTATS
is not enabled in the kernel.
Well, I'm on Gentoo so my kernel is custom built by design. But I checked my
config; I don't even have a "CONFIG_SCHEDSTATS" option, yet bootchart works
for me:

$ zfgrep SCHED /proc/config.gz
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_RT_GROUP_SCHED=y
CONFIG_SCHED_AUTOGROUP=y
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
CONFIG_CFQ_GROUP_IOSCHED=y
CONFIG_IOSCHED_BFQ=y
CONFIG_DEFAULT_IOSCHED="deadline"
CONFIG_SCHED_OMIT_FRAME_POINTER=y
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
CONFIG_SCHED_HRTICK=y
# CONFIG_NET_SCHED is not set
CONFIG_USB_EHCI_TT_NEWSCHED=y

BTW: After migrating to systemd, I had a boot time of 15s to DM on
rotational with several heavy services enabled. I did not measure time to
desktop back then. But after I did a btrfs balance (I use multidevice
btrfs), the time to DM went up to around 25s and never went back no matter
what I tried (even when I disabled some services). I switched to autologin
DM since then so at least systemd-readahead catches some of the KDE
services, too, now. But it still feels slow. I'm probably redoing my btrfs
in the future, switching to 16k leaf size in that process. But I'll wait
with that until I migrate to bigger hard disks. Currently I have still
plenty of space left. According to bootchart I never had a problem with time
spent processing the journal. But since a few kernel versions, auto-suspend
works great for me, so I usually do not switch the system off, just leave it
running and it suspends to RAM after 2 hours. As a nice side effect, it
wakes up instantly (especially since kernel 3.15), and I save a lot of
power. Still waiting for the systemd auto-wakeup facility to run timer jobs
like a backup. ;-)
--
Replies to list only preferred.
Goffredo Baroncelli
2014-06-15 11:12:24 UTC
Permalink
Post by Kai Krakow
Post by Goffredo Baroncelli
Post by Kai Krakow
Post by Goffredo Baroncelli
I have enabled all the services (collect, replay, done), but I was
unable to see any gain.
I don't think you can actually enable the "done" service because it is
static... But if it placed a symlink in /etc/systemd that is probably
wrong and you should undo it. It is automatically activated a few seconds
after booting is done to store the collected data for replaying on next
reboot.
I checked the output of systemd-analyze: systemd-readahead-done is called
at the end of the chart. The sym-links exist only for the collect and
replay services, which were called at the beginning.
Then all is good.
Post by Goffredo Baroncelli
Post by Kai Krakow
Post by Goffredo Baroncelli
# /usr/lib/systemd/systemd-readahead analyze | grep journal
100% ( 1) 770: /etc/systemd/journald.conf
/var/log/journal/36f10f5379ec4a1398ac303a0ce20fd0/user-997.journal
/var/log/journal/36f10f5379ec4a1398ac303a0ce20fd0/user-1000.journal
$ sudo /usr/lib/systemd/systemd-readahead analyze|fgrep journal
100% ( 1) 812: /etc/systemd/journald.conf
/var/log/journal/121b87ca633e8ac0016656680000001b/system.journal
Strange... ;-)
May be this is a reason of my degraded performance. However I tried on two
machines (all fedora) and after several reboot; I tried also removing the
/.readahead file... Nothing.
I'm not sure but generally it is said that readahead utilities may need a
few reboots before fully working. Did you try that? Will it take the journal
into account then?
I hope to publish my results today. Anyway, I performed about 50-70 reboots for each configuration.
Post by Kai Krakow
Maybe there is something special about your setup and your system.journal
does not become touched during boot because of this? You said you also use
rsyslogd. Maybe logs are redirected there instead of written to both logging
mechanisms? Maybe most logging by system daemons is turned off or they write
directly to log files?
I tuned all my daemons and services to not write to any log files at all,
instead log everything to stderr/stdout (which is captured by systemd) or
into the syslogger which is journald as the only mechanism in my system. I
just prefer to have everything in one place, and journalctl is a great way
to filter the logs or create custom views. Maybe that's a reason why it is
taken into account by readahead on my system? I can only guess but maybe
worth a try for you...
I don't have any idea... I will definitely investigate a bit.
Post by Kai Krakow
Post by Goffredo Baroncelli
Post by Kai Krakow
To get back to your performance problem: Did you try "systemctl enable
bootchart" and then (after a reboot) look at the png generated in
/run/log (at least in my system it is placed there, look at the configs
if in doubt). It can give some interesting clues about why a service
starts late or takes long to execute.
Unfortunately bootchart doesn't work on fedora, because CONFIG_SCHEDSTATS
is not enabled in the kernel.
Well, I'm on Gentoo so my kernel is custom built by design. But I checked my
config; I don't even have a "CONFIG_SCHEDSTATS" option, yet bootchart works
I am talking about systemd-bootchart. I have to check whether Fedora has the classic bootchart.
Post by Kai Krakow
$ zfgrep SCHED /proc/config.gz
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_RT_GROUP_SCHED=y
CONFIG_SCHED_AUTOGROUP=y
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
CONFIG_CFQ_GROUP_IOSCHED=y
CONFIG_IOSCHED_BFQ=y
CONFIG_DEFAULT_IOSCHED="deadline"
CONFIG_SCHED_OMIT_FRAME_POINTER=y
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
CONFIG_SCHED_HRTICK=y
# CONFIG_NET_SCHED is not set
CONFIG_USB_EHCI_TT_NEWSCHED=y
BTW: After migrating to systemd, I had a boot time of 15s to DM on
rotational with several heavy services enabled. I did not measure time to
desktop back then. But after I did a btrfs balance (I use multidevice
btrfs), the time to DM went up to around 25s and never went back no matter
what I tried (even when I disabled some services). I switched to autologin
DM since then so at least systemd-readahead catches some of the KDE
services, too, now. But it still feels slow. I'm probably redoing my btrfs
in the future, switching to 16k leaf size in that process. But I'll wait
with that until I migrate to bigger hard disks. Currently I have still
plenty of space left. According to bootchart I never had a problem with time
spent processing the journal. But since a few kernel versions, auto-suspend
works great for me, so I usually do not switch the system off, just leave it
running and it suspends to RAM after 2 hours. As a nice side effect, it
wakes up instantly (especially since kernel 3.15), and I save a lot of
power. Still waiting for the systemd auto-wakeup facility to run timer jobs
like a backup. ;-)
If readahead is able to defrag too, and if it also handles the journal file, your data seems reasonable.
--
gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Kai Krakow
2014-06-15 11:57:20 UTC
Permalink
Post by Goffredo Baroncelli
Post by Kai Krakow
Post by Goffredo Baroncelli
Post by Kai Krakow
To get back to your performance problem: Did you try "systemctl enable
bootchart" and then (after a reboot) look at the png generated in
/run/log (at least in my system it is placed there, look at the configs
if in doubt). It can give some interesting clues about why a service
starts late or takes long to execute.
Unfortunately bootchart doesn't work on fedora, because
CONFIG_SCHEDSTATS is not enabled in the kernel.
Well, I'm on Gentoo so my kernel is custom built by design. But I checked
my config, I even have no config option "CONFIG_SCHEDSTATS" but bootchart
I am talking about systemd-bootchart. I have to check whether Fedora has the classic bootchart.
I simply did "systemctl enable bootchart.service" which is probably part of
bootchart2. It worked for me. I will look into systemd-bootchart; I never
realized it existed as a standalone solution... Thanks for the pointer.
--
Replies to list only preferred.
Goffredo Baroncelli
2014-06-17 17:25:14 UTC
Permalink
Hi Kai,


I investigated a bit why readahead doesn't defrag the journal.
I have also put the mailing list in CC, for the record.

On 06/17/2014 03:33 AM, Kai Krakow wrote:
[...]
Instead, for me, the readahead collector catches access to my system
journal and thus defragments it. That's not the case for your system
which explains why it won't be defragmented for you when enabling
readahead.
The readahead program skips files greater than 10M. If you look at its man page, it documents the switch --file-size-max. The default value is READAHEAD_FILE_SIZE_MAX, which is equal to 10M.
So readahead defrags system.journal only when it is smaller than this value.
But this happens only a few times.

This behavior seems reasonable to me: it avoids blocking the system when a big file is touched during boot.
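(If one wanted larger journals collected anyway, the limit looks overridable with a service drop-in; a sketch, where the 16M value is arbitrary:)

# /etc/systemd/system/systemd-readahead-collect.service.d/size.conf
[Service]
ExecStart=
ExecStart=/usr/lib/systemd/systemd-readahead --file-size-max=16777216 collect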

BR
G.Baroncelli
--
gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Kai Krakow
2014-06-17 18:26:05 UTC
Permalink
Post by Goffredo Baroncelli
I investigated a bit why readahead doesn't defrag the journal.
I have also put the mailing list in CC, for the record.
[...]
Instead, for me, the readahead collector catches access to my system
journal and thus defragments it. That's not the case for your system
which explains why it won't be defragmented for you when enabling
readahead.
The readahead program skips files greater than 10M. If you look at its man
page, it documents the switch --file-size-max. The default value is
READAHEAD_FILE_SIZE_MAX, which is equal to 10M. So readahead defrags
system.journal only when it is smaller than this value. But
this happens only a few times.
This behavior seems reasonable to me: it avoids blocking the system when a
big file is touched during boot.
Ah okay, good catch... Well, systemd does not generate massive amounts of
logs for me. Maybe you want to play around with "SystemMaxUse" in
journald.conf to let the file size stay below 10M?
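(A journald.conf sketch of that idea: journald defaults SystemMaxFileSize to one eighth of SystemMaxUse, so capping the latter at 80M should keep each journal file around 10M, under readahead's default limit.)

# /etc/systemd/journald.conf
[Journal]
SystemMaxUse=80M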

Actually, for servers using syslog, I set the log file rotation to kick in
after 10M worth of logging (or after 7 days, whichever comes first) so our
rsync-diffed backups stay small for log files; otherwise it would start
creating huge diffs when I do backups the "rsync --backup" way.

So that may actually be a not too bad option. I'm not using it with my home
system because I do snapshot-based backups to btrfs and diffs are just much
smaller (using rsync --no-whole-file --inplace).
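(A sketch of such an invocation, with example paths:)

# rsync -a --no-whole-file --inplace /var/log/ /mnt/backup/var-log/

--inplace rewrites changed blocks inside the existing destination file instead of creating a new copy, so snapshot-based or block-level diffs stay small.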

We recently switched to zfs-based backup storage for our servers, and in
that process switched to using "rsync --no-whole-file --inplace", too. So we
won't need the small log rotation scheme any longer, but I feel comfortable
with it, so I stick to it.

BTW: I checked our systemd-enabled servers and fragmentation of journal
files is very very high there, too. Way above 5000 extents. The journals are
on xfs so it suffers that problem, too (which has already been pointed out
somewhere in this thread multiple times). But to my surprise, this does not
affect the performance of journalctl there. The bad performance is only for
fragmentation on btrfs, not xfs. IMHO, this pushes the pointer a little bit
more towards systemd, tho I couldn't suggest how to fix it. Time for the
filesystem hackers to kick in.

BTW: The servers are virtualized so by default, systemd-readahead does
nothing there because of "ConditionVirtual" in the service files. OTOH, it
won't defragment there anyway because xfs is not supported for this in
systemd, although it would probably be possible because there is "xfs_fsr".
But my feeling is that xfs_fsr is more or less just locking the file,
copying it, then replacing the original. It cannot work on files in use. This
is different for btrfs which only chokes on defragmenting files currently
mmap'ed (e.g., binaries being executed, read: busy text files). Following
that thought, I wonder why it seems to defragment system.journal for me
because after all I learned here, systemd writes to the files mmap'ed.
--
Replies to list only preferred.
Zbigniew Jędrzejewski-Szmek
2014-06-16 03:27:04 UTC
Permalink
Post by Goffredo Baroncelli
Unfortunately bootchart doesn't work on fedora, because
CONFIG_SCHEDSTATS is not enabled in the kernel.
It's being investigated, unfortunately with no progress since April:

https://bugzilla.redhat.com/show_bug.cgi?id=1013225

Zbyszek
Kai Krakow
2014-06-14 14:31:22 UTC
Permalink
Post by Goffredo Baroncelli
Post by Kai Krakow
First, I've set the journal directories nocow.
If you use nocow, you lose the btrfs ability to rebuild a RAID array by
discarding the wrong sector. With the systemd journal checksum, you can
tell that data is wrong, but BTRFS with its checksum (when used in RAID
mode) is able to *correct* the data too.
The decision is up to you. For me it was easy:

First: I do not care much if some of my logs become broken. I have backups
anyway with a backlog several weeks long.

Second: I don't use data RAID (at least not yet), just data striping. So the
btrfs checksum won't help me much - it has about the same meaning as the
journal checksums.

YMMV... For me, nocow'ing the journal was the clear winner.
Post by Goffredo Baroncelli
Post by Kai Krakow
Back to the extents counts: What I did next was to implement a defrag job
that regularly defrags the journal (actually, the complete log directory
$ cat /usr/local/sbin/defrag-logs.rb
#!/bin/sh
exec btrfs filesystem defragment -czlib -r /var/log
I think this is a more viable solution; what seems strange to me is that
the fragmentation level of my rsyslog file is a lot
lower.
Are you sure? As Duncan has pointed out on many occasions, if your journal file
is compressed by btrfs (for whatever reason; it depends on your mount options
and defragging habits), then filefrag does not show correct values. It
shows the count of compression blocks. Only if you are sure about that is it
worth considering further steps.
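(One way to check, sketched below: filefrag -v prints the extent list, so the regular ~128K compressed-extent pattern is easy to tell apart from genuine fragmentation.)

# filefrag -v /var/log/journal/*/system.journal | head -n 20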

If you are sure about the journals not being compressed (and the rsyslog
file, too), you can compare values. I think, only then, you could say that
systemd's allocation strategy for journal files does no good on btrfs. Maybe
then it's time to fix that by patching systemd to use another strategy,
either by configuration or auto-detection.

Most of those strategies will probably result in coalescing writes to the
journal into bigger blocks, which means buffering, which in turn is usually
not wanted for log files. That in turn means that cowed btrfs is probably
the wrong medium for storing journal files. You may want to play with
SyncIntervalSec in journald.conf...
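(A journald.conf sketch; 5m is the documented default, so raising it coalesces more writes per sync at the cost of losing more on a crash:)

# /etc/systemd/journald.conf
[Journal]
SyncIntervalSec=5m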

BTW: I'm not sure if autodefrag would kick in on journal files because those
appends are not random writes. I think I remember that autodefrag was about
detecting random writes and then rewriting the file as a whole to reduce
extent fragmentation. Read: I'm not sure if autodefrag would have an effect
on this. A btrfs dev may comment on that.
--
Replies to list only preferred.
Lennart Poettering
2014-06-15 22:43:34 UTC
Permalink
Post by Kai Krakow
As they say, "Whoosh!"
At least here, I interpreted that remark as primarily sarcastic
commentary on the systemd devs' apparent attitude, which can be
(controversially) summarized as: "Systemd doesn't have problems because
it's perfect. Therefore, any problems you have with systemd must instead
be with other components which systemd depends on."
Come on, sorry, but this is FUD. Really... ;-)
Interestingly, I never commented on anything in this area, and neither
did anybody else from the systemd side AFAICS. The entire btrfs defrag
thing I wasn't aware of before this thread started on the systemd ML a
few days ago. I am not sure where you take your ideas about our
"attitude" from. God, with behaviour like that you just make us ignore
you, Duncan.

Lennart
--
Lennart Poettering, Red Hat
Duncan
2014-06-17 08:17:08 UTC
Permalink
Post by Lennart Poettering
Post by Kai Krakow
At least here, I interpreted that remark as primarily sarcastic
commentary on the systemd devs' apparent attitude, which can be
(controversially) summarized as: "Systemd doesn't have problems
because it's perfect. Therefore, any problems you have with systemd
must instead be with other components which systemd depends on."
Come on, sorry, but this is FUD. Really... ;-)
Interestingly, I never commented on anything in this area, and neither
did anybody else from the systemd side AFAICS. The entire btrfs defrag
thing I wasn't aware of before this thread started on the systemd ML a
few days ago. I am not sure where you take your ideas about our
"attitude" from. God, with behaviour like that you just make us ignore
you, Duncan.
Sorry.

As you'll note, I said "can be (controversially) summarized"...

I never stated that *I* held that position personally, as I was taking
the third-person observer position. As such, I've seen that attitude
expressed by others in multiple threads when systemd comes up (as Josef
alluded to as well), and I simply interpreted the remark to which I
was alluding ("And that's now a btrfs problem.... :/") as a sarcastic
reference to that attitude... which again I never claimed as my own.

Thus the "whoosh", in reference to what I interpreted as sarcasm, which
certainly doesn't require full agreement in order to understand.
(Tho as I mentioned elsewhere, I can certainly see where they're coming
from given the recent kernel debug kerfuffle and the like... and I'm
certainly not alone there... but I'm /trying/ to steer a reasonably
neutral path while appreciating both sides.)

In fact, I specifically stated elsewhere that I recently switched
to systemd myself -- by choice, as I'm on Gentoo, which still defaults to
OpenRC. Certainly I would not have done so if I believed systemd was
as bad as all that, and the fact that I HAVE done so definitely implies a
rather large amount of both trust and respect in the systemd devs, or I'd
not be willing to run their code.

But nevertheless I can see the viewpoint from both sides now, and do
try to maintain a reasonable neutrality. I guess I should have made that
more explicit in the original post, but as they say, hindsight is 20/20.
=:^\
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Martin
2014-06-17 12:14:29 UTC
Permalink
On 17/06/14 09:17, Duncan wrote:
[...]
Post by Duncan
But nevertheless I can see the viewpoint from both sides now, and do
try to maintain a reasonable neutrality. I guess I should have made that
more explicit in the original post, but as they say, hindsight is 20/20.
=:^\
Hey! All good for a giggle and excellent for friendly attention to get
things properly fixed :-)

And this thread may even hit the news for Phoronix and LWN and others,
all in true FLOSS open development and thorough debugging :-P

And... And... We get a few bits of systemd, btrfs, and filesystem
interactions fixed for how things /should/ work best.


Thanks (and for others also) for stirring a rapid chase! ;-)

Regards,
Martin

(A fellow Gentoo-er ;-) )