Discussion:
Fragile journal interleaving
Uoti Urpala
2017-12-10 22:36:21 UTC
The code that tries to determine which of two journal entries is
"earlier" (similar logic in journal_file_comparare_locations() and
compare_with_location()) does this by trying different progressively
less reliable ways to compare - first by seqnum if seqnum id matches,
then by monotonic time if boot id matches, then by realtime timestamp.
Unfortunately, while this method could be said to give the "right"
answer more often than just comparing by the always-available realtime
timestamp, it does not give a valid ordering relation. As in, unless
you can already rely on all those comparisons giving the same
consistent order (when you could probably just use realtime anyway),
it's possible that you get a relation like A<B<C<A.

For example, these would have such a relation (assume that the omitted
monotonic values never determine order):
A: seqnum_id=0, seqnum=0, time=2
B: seqnum_id=0, seqnum=1, time=0
C: seqnum_id=1, seqnum=0, time=1

A<B by seqnum, B<C by time, C<A by time.
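
For concreteness, here is a minimal sketch of that comparison cascade
(simplified stand-in types, not the actual journal structures; the
monotonic/boot-id step is left out since it never decides anything in
the example above):

#include <stdint.h>

struct loc {
        uint64_t seqnum_id;   /* stands in for the 128-bit seqnum ID */
        uint64_t seqnum;
        uint64_t realtime;
};

static int cmp(const struct loc *a, const struct loc *b) {
        /* first try seqnums, but only if both entries share a seqnum ID */
        if (a->seqnum_id == b->seqnum_id) {
                if (a->seqnum < b->seqnum) return -1;
                if (a->seqnum > b->seqnum) return 1;
        }
        /* otherwise fall back to the realtime timestamp */
        if (a->realtime < b->realtime) return -1;
        if (a->realtime > b->realtime) return 1;
        return 0;
}

Feeding the three example entries above through cmp() gives exactly
that A<B<C<A cycle, so the relation is not transitive.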

Such ordering means that the code interleaving entries from different
files can give rather arbitrary results depending on exactly which
order the entries are tested in. It triggers at least one issue that I
consider a clear bug: there's code in next_beyond_location() which
skips the next entry in a file if it's not in the expected direction
from the previous globally iterated entry, and this can discard valid
entries. A comment there says it's meant to discard duplicate entries
which were somehow recorded in multiple journal files (which I'd assume
to compare equal), but it also discards non-duplicate entries which
compare backwards from the previously shown one.

The reason I originally looked into this code was a set of old journal
files for which plain "journalctl" only showed 5 days of entries
despite the files containing months of data. This was triggered by a
user journal file which matched seqnum ids with some system journal
files, but had smaller seqnums from a much later time (the earliest
logs were old enough that the file could have been initialized by
journald before commit cbd671772c, so the exact same way to trigger
issues probably won't occur with files created by later versions). The
code searching for earlier entries first found the early system
journal, then moved to the user journal because it had smaller seqnums,
and finally moved to some other file that started 5 days before the end
(earlier than the user journal and with a different seqnum id - there
didn't happen to be other earlier files later in the iteration order).
After printing one entry from there, the next_beyond_location()
"duplicate" code then discarded all the earlier valid entries.

I'm not sure about the best way to fix these issues - are there
supposed to be any guarantees about interleaving? At the very least the
"duplicate" check should be fixed to not throw away arbitrary amounts
of valid entries. Any other code relying on assumptions about valid
ordering? Is the interleaving order supposed to be stable independent
of things like in which order the files are iterated over?
Lennart Poettering
2017-12-12 16:09:10 UTC
Post by Uoti Urpala
The code that tries to determine which of two journal entries is
"earlier" (similar logic in journal_file_comparare_locations() and
compare_with_location()) does this by trying different progressively
less reliable ways to compare - first by seqnum if seqnum id matches,
then by monotonic time if boot id matches, then by realtime timestamp.
Unfortunately, while this method could be said to give the "right"
answer more often than just comparing by the always-available realtime
timestamp, it does not give a valid ordering relation. As in, unless
you can already rely on all those comparisons giving the same
consistent order (when you could probably just use realtime anyway),
it's possible that you get a relation like A<B<C<A.
For example, these would have such a relation (assume that the omitted
monotonic values never determine order):
A: seqnum_id=0, seqnum=0, time=2
B: seqnum_id=0, seqnum=1, time=0
C: seqnum_id=1, seqnum=0, time=1
A<B by seqnum, B<C by time, C<A by time.
Such ordering means that the code interleaving entries from different
files can give rather arbitrary results depending on exactly which
order the entries are tested in. It triggers at least one issue that I
consider a clear bug: there's code in next_beyond_location() which
skips the next entry in a file if it's not in the expected direction
from the previous globally iterated entry, and this can discard valid
entries. A comment there says it's meant to discard duplicate entries
which were somehow recorded in multiple journal files (which I'd assume
to compare equal), but it also discards non-duplicate entries which
compare backwards from the previously shown one.
Note that two entries will only compare as fully identical if their
"xor_hash" is equal too. The xor_hash is the XOR combination of the
hashes of all of the entry's fields. That means realistically only
records that actually are identical should be considered as such.
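
Roughly like this (sketch only; hash64() and the field struct here are
just stand-ins for whatever the journal actually uses per entry item):

#include <stdint.h>
#include <stddef.h>

struct field { const void *data; size_t size; };  /* "NAME=value" payload */

static uint64_t entry_xor_hash(const struct field *fields, size_t n,
                               uint64_t (*hash64)(const void *d, size_t l)) {
        uint64_t x = 0;

        /* XOR a hash of every field together; the result is independent of
         * field order, and realistically only identical entries collide */
        for (size_t i = 0; i < n; i++)
                x ^= hash64(fields[i].data, fields[i].size);

        return x;
}
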
Post by Uoti Urpala
The reason I originally looked into this code was a set of old journal
files for which plain "journalctl" only showed 5 days of entries
despite the files containing months of data. This was triggered by a
user journal file which matched seqnum ids with some system journal
files, but had smaller seqnums from a much later time (the earliest
logs were old enough that the file could have been initialized by
journald before commit cbd671772c, so the exact same way to trigger
issues probably won't occur with files created by later versions). The
code searching for earlier entries first found the early system
journal, then moved to the user journal because it had smaller seqnums,
and finally moved to some other file that started 5 days before the end
(earlier than the user journal and with a different seqnum id - there
didn't happen to be other earlier files later in the iteration order).
After printing one entry from there, the next_beyond_location()
"duplicate" code then discarded all the earlier valid entries.
I'm not sure about the best way to fix these issues - are there
supposed to be any guarantees about interleaving? At the very least the
"duplicate" check should be fixed to not throw away arbitrary amounts
of valid entries. Any other code relying on assumptions about valid
ordering? Is the interleaving order supposed to be stable independent
of things like in which order the files are iterated over?
So, the current "duplicate" check is not just that actually, it's also
a check that guarantees that we progress monotonically, and never go
backwards in time, even when journal files are added or removed while
we iterate through the files.

I am not entirely sure what we can do here. Maybe this can work: beef
up the comparison logic so that it returns not just
smaller/equal/larger but also a special value "ambiguous". And when
that is returned we don't enforce monotonicity strictly but instead go
record-by-record, if you follow what I mean?
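
Roughly (names made up, purely to sketch the idea):

typedef enum {
        COMPARE_SMALLER,
        COMPARE_EQUAL,
        COMPARE_LARGER,
        COMPARE_AMBIGUOUS,   /* e.g. neither seqnum ID nor boot ID shared */
} compare_result_t;

and only enforce the "must be beyond the current location" check when
the result is SMALLER/EQUAL/LARGER, falling back to plain per-file
order when it is AMBIGUOUS.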

Lennart
--
Lennart Poettering, Red Hat
Uoti Urpala
2017-12-12 19:00:31 UTC
Post by Lennart Poettering
Post by Uoti Urpala
consider a clear bug: there's code in next_beyond_location() which
skips the next entry in a file if it's not in the expected direction
from the previous globally iterated entry, and this can discard valid
entries. A comment there says it's meant to discard duplicate entries
which were somehow recorded in multiple journal files (which I'd assume
to compare equal), but it also discards non-duplicate entries which
compare backwards from the previously shown one.
Note that two entries will only compare as fully identical if their
"xor_hash" is equal too. The xor_hash is the XOR combination of the
hashes of all of the entry's fields. That means realistically only
records that actually are identical should be considered as such.
I assume that would be suitable for handling the case of actual
duplicates? How do those happen anyway?
Post by Lennart Poettering
Post by Uoti Urpala
code searching for earlier entries first found the early system
journal, then moved to the user journal because it had smaller seqnums,
and finally moved to some other file that started 5 days before the end
(earlier than the user journal and with a different seqnum id - there
didn't happen to be other earlier files later in the iteration order).
After printing one entry from there, the next_beyond_location()
"duplicate" code then discarded all the earlier valid entries.
I'm not sure about the best way to fix these issues - are there
supposed to be any guarantees about interleaving? At the very least the
"duplicate" check should be fixed to not throw away arbitrary amounts
of valid entries. Any other code relying on assumptions about valid
ordering? Is the interleaving order supposed to be stable independent
of things like in which order the files are iterated over?
So, the current "duplicate" check is not just that actually, it's also
a check that guarantees that we progress monotonically, and never go
backwards in time, even when journal files are added or removed while
we iterate through the files.
If journal files can be added while iterating, I assume that could be
handled by special code run when such an event happens. IMO it's
important to have a basic guarantee that all entries in the journal
files will be shown at least in _some_ order when you try to iterate
through them all, such as with plain "journalctl". And thus the code
should make sure not to discard any entry unless it's confirmed to be a
full duplicate.
Post by Lennart Poettering
I am not entirely sure what we can do here. Maybe this can work: beef
up the comparison logic so that it returns not just
smaller/equal/larger but also a special value "ambiguous". And when
that is returned we don't enforce monotonicity strictly but instead go
record-by-record, if you follow what I mean?
I don't see how that would help, at least not without some extra
assumptions/changes. In my example problem case above, the ambiguous
comparisons happen when deciding which file to get the first entry
from. There's no natural default "first file", so even if you only know
it's ambiguous you have to pick some anyway. If you pick the one the
current code does, the following discard check is not ambiguous - it's
discarding entries with earlier realtime and non-comparable other
values. Or do you mean that if an ambiguous comparison was EVER seen,
monotonicity would be permanently disabled? I don't really see an
advantage for that over just not enforcing monotonicity at all, and
handling any added-file special cases separately.
Lennart Poettering
2017-12-12 20:38:06 UTC
Post by Uoti Urpala
Post by Lennart Poettering
Post by Uoti Urpala
consider a clear bug: there's code in next_beyond_location() which
skips the next entry in a file if it's not in the expected direction
from the previous globally iterated entry, and this can discard valid
entries. A comment there says it's meant to discard duplicate entries
which were somehow recorded in multiple journal files (which I'd assume
to compare equal), but it also discards non-duplicate entries which
compare backwards from the previously shown one.
Note that two entries will only compare as fully identical if their
"xor_hash" is equal too. The xor_hash is the XOR combination of the
hashes of all of the entry's fields. That means realistically only
records that actually are identical should be considered as such.
I assume that would be suitable for handling the case of actual
duplicates? How do those happen anyway?
it's supposed to handle people copying things around between machines,
backups, rsync/scp and so on... it's also supposed to deal with remote
push/pull getting out of sync and so on.
Post by Uoti Urpala
Post by Lennart Poettering
I am not entirely sure what we can do here. Maybe this can work: beef
up the comparison logic so that it returns not just
smaller/equal/larger but also a special value "ambiguous". And when
that is returned we don't enforce monotonicity strictly but instead go
record-by-record, if you follow what I mean?
I don't see how that would help, at least not without some extra
assumptions/changes. In my example problem case above, the ambiguous
comparisons happen when deciding which file to get the first entry
from. There's no natural default "first file", so even if you only know
it's ambiguous you have to pick some anyway. If you pick the one the
current code does, the following discard check is not ambiguous - it's
discarding entries with earlier realtime and non-comparable other
values. Or do you mean that if an ambiguous comparison was EVER seen,
monotonicity would be permanently disabled? I don't really see an
advantage for that over just not enforcing monotonicity at all, and
handling any added-file special cases separately.
Hmm, I see, the problem is that the ambiguity is not obvious when just
looking at two of the entries...

Maybe the approach needs to be that we immediately increase the read
record ptr of a specific file by one when we read it, so that we know
we monotonically progress through the file. And then change the logic
that looks for the next entry across all files to honour that, and
then simply skip over fully identical entries, but not insist on
monotonic timers otherwise.

With that approach we can be sure that we never enter a loop, and
we'll mostly filter out duplicates (well, except if duplicate entries
show up in multiple files in different orders, but that's OK then,
dunno).
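
In rough code the loop could look something like this (made-up names,
not the actual sd-journal internals, just to sketch the shape):

#include <stdbool.h>
#include <stddef.h>

/* hypothetical helpers standing in for the real journal machinery */
struct file;
struct entry;
struct entry *peek_at_cursor(struct file *f);   /* next entry in file order */
void advance_cursor(struct file *f);
int compare_entries(const struct entry *a, const struct entry *b);
bool fully_identical(const struct entry *a, const struct entry *b);

static const struct entry *next_interleaved(struct file **files, size_t n_files,
                                            const struct entry **last) {
        for (;;) {
                struct entry *best = NULL;
                struct file *best_f = NULL;

                /* pick the "earliest" candidate across all files, using
                 * each file's own cursor position */
                for (size_t i = 0; i < n_files; i++) {
                        struct entry *e = peek_at_cursor(files[i]);

                        if (e && (!best || compare_entries(e, best) < 0)) {
                                best = e;
                                best_f = files[i];
                        }
                }

                if (!best)
                        return NULL;            /* all files exhausted */

                /* always move forward within the file we took the entry
                 * from, so we can never loop */
                advance_cursor(best_f);

                /* drop only entries fully identical to the one we just
                 * returned - no monotonicity requirement beyond that */
                if (*last && fully_identical(best, *last))
                        continue;

                *last = best;
                return best;
        }
}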

Lennart
--
Lennart Poettering, Red Hat
Uoti Urpala
2017-12-12 22:43:53 UTC
Post by Lennart Poettering
Maybe the approach needs to be that we immediately increase the read
record ptr of a specific file by one when we read it, so that we know
we monotonically progress through the file. And then change the logic
that looks for the next entry across all files to honour that, and
then simply skip over fully identical entries, but not insist on
monotonic timers otherwise.
Isn't this pretty much what the code already does (except for the
"fully identical entries" part)? The next_beyond_location() function
already gives the next entry from a given file, and seems to use the
internal order of the file when doing that. So the only change would be
making the duplicate check only discard actually identical entries. And
now that I checked the surrounding code, it looks like even in the new-
file-added case you mentioned next_beyond_location() would call
find_location_with_matches(), which should seek the new file to the
correct position, so I don't see why the monotonicity discard would be
needed for that case either?
Post by Lennart Poettering
With that approach we can be sure that we never enter a loop, and
we'll mostly filter out duplicates (well, except if duplicate entries
show up in multiple files in different orders, but that's OK then,
dunno).
You could probably make the code loop by removing and adding back files
with the right timing, but that's not a very realistic case. So unless
there's some other part of the code I've missed, it looks like just
changing the discard to check for equality rather than non-monotonicity
would be an improvement. That would still leave issues like the order
being "non-deterministic" (it depends on the directory read order of the
files, for example), at least.
Lennart Poettering
2017-12-13 17:40:20 UTC
Post by Uoti Urpala
Post by Lennart Poettering
Maybe the approach needs to be that we immediately increase the read
record ptr of a specific file by one when we read it, so that we know
we monotonically progress through the file. And then change the logic
that looks for the next entry across all files to honour that, and
then simply skip over fully identical entries, but not insist on
monotonic timers otherwise.
Isn't this pretty much what the code already does (except for the
"fully identical entries" part)? The next_beyond_location() function
already gives the next entry from a given file, and seems to use the
internal order of the file when doing that. So the only change would be
making the duplicate check only discard actually identical entries. And
now that I checked the surrounding code, it looks like even in the new-
file-added case you mentioned next_beyond_location() would call
find_location_with_matches(), which should seek the new file to the
correct position, so I don't see why the monotonicity discard would be
needed for that case either?
I figure you are right.

Any chance you can prep a patch for this change?

I'd be lying if I claimed to fully see the effect of such a change,
but I think what you propose should work.

Thanks,

Lennart
--
Lennart Poettering, Red Hat
Uoti Urpala
2017-12-14 01:56:26 UTC
Post by Lennart Poettering
Post by Uoti Urpala
correct position, so I don't see why the monotonicity discard would be
needed for that case either?
I figure you are right.
Any chance you can prep a patch for this change?
A patch with a fairly long commit message explaining the rationale and
related issues attached.

BTW current sd_journal_next() documentation claims "The journal is
strictly ordered by reception time, and hence advancing to the next
entry guarantees that the entry then pointing to is later in time than
then previous one, or has the same timestamp." Isn't this OBVIOUSLY
false when talking about realtime timestamps if you don't include
caveats about changing the machine time?
Lennart Poettering
2017-12-14 09:35:20 UTC
Post by Uoti Urpala
Post by Lennart Poettering
Post by Uoti Urpala
correct position, so I don't see why the monotonicity discard would be
needed for that case either?
I figure you are right.
Any chance you can prep a patch for this change?
A patch with a fairly long commit message explaining the rationale and
related issues attached.
Could you submit through github please?
Post by Uoti Urpala
BTW current sd_journal_next() documentation claims "The journal is
strictly ordered by reception time, and hence advancing to the next
entry guarantees that the entry then pointing to is later in time than
then previous one, or has the same timestamp." Isn't this OBVIOUSLY
false when talking about realtime timestamps if you don't include
caveats about changing the machine time?
Well, but the philosophical concept of time beyond the implementation
detail of Linux CLOCK_REALTIME is of course monotonic, hence I think
it's actually correct what is written there. That said a patch that
clarifies this would of course be very welcome...
Post by Uoti Urpala
for (;;) {
        bool found;
-       if (j->current_location.type == LOCATION_DISCRETE) {
-               int k;
-
-               k = compare_with_location(f, &j->current_location);
-
-               found = direction == DIRECTION_DOWN ? k > 0 : k < 0;
-       } else
+       if (j->current_location.type == LOCATION_DISCRETE)
+               // only == 0 or not matters
Please use C comments, /* */, see CODING_STYLE.
Post by Uoti Urpala
+               found = compare_with_location(f, &j->current_location);
this returns an int, and you assign it to a bool. If the "int" was
actually used in a "boolean" way that would be OK, but it actually
returns < 0, 0, > 0, hence we should make the conversion explicit here
and write this as "found = c_w_l(…) != 0".
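
i.e. (with the call from the patch spelled out):

found = compare_with_location(f, &j->current_location) != 0;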

Thanks!

Lennart
--
Lennart Poettering, Red Hat
Uoti Urpala
2017-12-14 19:14:16 UTC
Post by Lennart Poettering
Post by Uoti Urpala
BTW current sd_journal_next() documentation claims "The journal is
strictly ordered by reception time, and hence advancing to the next
entry guarantees that the entry then pointing to is later in time than
then previous one, or has the same timestamp." Isn't this OBVIOUSLY
false when talking about realtime timestamps if you don't include
caveats about changing the machine time?
Well, but the philosophical concept of time beyond the implementation
detail of Linux CLOCK_REALTIME is of course monotonic, hence I think
it's actually correct what is written there. That said a patch that
clarifies this would of course be very welcome...
Well if you talk about actual timestamp values then it's not true due
to time changes. If you talk about the "physical reality" time when it
was logged, then it's true for a single journal file, but not true when
interleaving multiple journal files, as the interleaving process may
have to use the timestamps and they're not perfect (even ignoring the
issues discussed in this thread).
Post by Lennart Poettering
Post by Uoti Urpala
+ if (j->current_location.type == LOCATION_DISCRETE)
+ // only == 0 or not matters
Please use C comments, /* */, see CODING_STYLE.
That is a standard C comment... IMO it's quite silly to insist on this
kind of "This is not the C I first learned! I refuse to update the way
I do things!" attitude for systemd. "//" comments and free placement of
variable declarations are normal C and no more questionable than any
other random feature. In practice, the ONLY reason anyone opposes them
is that they were added at a later point. This is exactly the same kind
of resistance to change that underlies most "sysvinit worked just fine
for me, systemd is horrible" attitudes. So systemd in particular really
should do better.
Post by Lennart Poettering
Post by Uoti Urpala
+               found = compare_with_location(f, &j->current_location);
this returns an int, and you assign it to a bool. If the "int" was
actually used in a "boolean" way that would be OK, but it actually
returns < 0, 0, > 0, hence we should make the conversion explicit here
and write this as "found = c_w_l(…) != 0".
Well that int/bool distinction was basically why I added the comment
above. I guess adding a redundant comparison is an alternative, matter
of taste mostly...
Lennart Poettering
2017-12-14 19:34:27 UTC
Post by Uoti Urpala
Post by Lennart Poettering
Post by Uoti Urpala
+ if (j->current_location.type == LOCATION_DISCRETE)
+ // only == 0 or not matters
Please use C comments, /* */, see CODING_STYLE.
That is a standard C comment... IMO it's quite silly to insist on this
kind of "This is not the C I first learned! I refuse to update the way
I do things!" attitude for systemd. "//" comments and free placement of
variable declarations are normal C and no more questionable than any
other random feature. In practice, the ONLY reason anyone opposes them
is that they were added at a later point. This is exactly the same kind
of resistance to change that underlies most "sysvinit worked just fine
for me, systemd is horrible" attitudes. So systemd in particular really
should do better.
Well, the idea is that /* */ is used for explanatory comments and //
is left for local, temporary, non-committable stuff.

I.e. if you are testing some stuff, and want to comment out some bits
briefly, use //, but if you add explanations for code that shall be
committed then please use /* */. That way you can see very easily what
is temporary stuff you should fix up before committing and what should
go in.

Different projects have different rules there. Some people use "FIXME"
or "XXX" for this purpose (and in fact it's up to you if you do, after
all this kind of stuff should never end up being committed upstream),
but so far we just used C vs. C++ comments for this, and I think that
makes sense.
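
For example (illustrative only; the function name is made up):

/* Explains the code below and is meant to be committed. */
check_entry_order(f);

// temporarily disabled while testing, remove before committing:
// check_entry_order(f);
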
Post by Uoti Urpala
Post by Lennart Poettering
Post by Uoti Urpala
+               found = compare_with_location(f, &j->current_location);
this returns an int, and you assign it to a bool. If the "int" was
actually used in a "boolean" way that would be OK, but it actually
returns < 0, 0, > 0, hence we should make the conversion explicit here
and write this as "found = c_w_l(…) != 0".
Well that int/bool distinction was basically why I added the comment
above. I guess adding a redundant comparison is an alternative, matter
of taste mostly...
Yeah, we try not to rely too much on "magic" implicit C casts, and
try to use the language to make it clear in the code that we are aware
of the different types.

Lennart
--
Lennart Poettering, Red Hat
Uoti Urpala
2017-12-14 21:28:18 UTC
Post by Lennart Poettering
Well, the idea is that /* */ is used for explanatory comments and //
is left for local, temporary, non-committable stuff.
I.e. if you are testing some stuff, and want to comment out some bits
briefly, use //, but if you add explanations for code that shall be
committed then please use /* */. That way you can see very easily what
is temporary stuff you should fix up before committing and what should
go in.
Different projects have different rules there. Some people use "FIXME"
or "XXX" for this purpose (and in fact it's up to you if you do, after
all this kind of stuff should never end up being committed upstream),
but so far we just used C vs. C++ comments for this, and I think that
makes sense.
It doesn't really make sense. Most comments run to the end of a line,
and '//' is the natural match for that; it's what the comment syntax in
most other languages does too. '/*' comments take more space, are more
effort to type, and cause extra problems with unterminated comments.
There can be variation in whether you use '/*' for larger block
comments etc, but IMO it's not at all reasonable to say that the
standard comment marker is suddenly reserved for "TODO" stuff only in
one project. I consider such banning of "random" features about as
reasonable as "in our project you must not use the newfangled a->b
operator, you have to write (*a).b" - that is, not reasonable at all.

I'm also generally opposed to arguments like "different projects have
different rules" for completely arbitrary rules with no real rationale.
Standardization and best practices have value. Again, this is IMO
comparable to the position of the systemd project. "Our project just
happens to require you do stuff differently or require old C dialects"
is analogous to "well our Linux distro just happens to use different
paths etc for no real reason beyond historical accident, you should
just deal with that".

I checked CODING_STYLE, and the entries that look like they're
ultimately purely the result of resistance to change without any valid
justification are:

- trying to prohibit // comments
- trying to prohibit free declaration placement - it's often better for
readable code, and the reason against it is really just not being used
to reading such code
- trying to prohibit variable declarations together with function
invocations. "int res = func(foo);" is perfectly reasonable code.
Again, just a matter of getting used to reading such code.

These could be replaced with more modern best practices, for example:

// This form implies you use the value of "i" after the loop
for (i = 0; i < n; i++)
        ...;

// Use this form if the value of "i" after the loop does not matter
for (int i = 0; i < n; i++)
        ...;
