Discussion: systemd-resolved continuously switching DNS servers
Kai Krakow
2017-05-04 23:01:07 UTC
Permalink
Hello!

Why is systemd-resolved switching DNS servers all day long? This
doesn't seem to be right...

Mai 05 00:52:46 jupiter systemd-resolved[658]: Switching to DNS server 192.168.4.254 for interface enp5s0.
Mai 05 00:52:53 jupiter systemd-resolved[658]: Switching to DNS server fe80::b248:7aff:fee7:f438%2 for interface enp5s0.
Mai 05 00:52:53 jupiter systemd-resolved[658]: Switching to DNS server 2a02:8109:1ec0:6f5:5667:51ff:feea:385f for interface enp5s0.
Mai 05 00:52:58 jupiter systemd-resolved[658]: Switching to DNS server 192.168.4.254 for interface enp5s0.
Mai 05 00:52:59 jupiter systemd-resolved[658]: Switching to DNS server fe80::b248:7aff:fee7:f438%2 for interface enp5s0.
Mai 05 00:53:02 jupiter systemd-resolved[658]: Switching to DNS server 2a02:8109:1ec0:6f5:5667:51ff:feea:385f for interface enp5s0.
Mai 05 00:53:06 jupiter systemd-resolved[658]: Switching to DNS server 192.168.4.254 for interface enp5s0.
Mai 05 00:53:07 jupiter systemd-resolved[658]: Switching to DNS server fe80::b248:7aff:fee7:f438%2 for interface enp5s0.
Mai 05 00:53:12 jupiter systemd-resolved[658]: Switching to DNS server 2a02:8109:1ec0:6f5:5667:51ff:feea:385f for interface enp5s0.
Mai 05 00:53:33 jupiter systemd-resolved[658]: Switching to DNS server 192.168.4.254 for interface enp5s0.
Mai 05 00:53:35 jupiter systemd-resolved[658]: Switching to DNS server fe80::b248:7aff:fee7:f438%2 for interface enp5s0.
Mai 05 00:53:40 jupiter systemd-resolved[658]: Switching to DNS server 2a02:8109:1ec0:6f5:5667:51ff:feea:385f for interface enp5s0.
Mai 05 00:54:01 jupiter systemd-resolved[658]: Switching to DNS server 192.168.4.254 for interface enp5s0.
Mai 05 00:54:02 jupiter systemd-resolved[658]: Switching to DNS server fe80::b248:7aff:fee7:f438%2 for interface enp5s0.
Mai 05 00:54:08 jupiter systemd-resolved[658]: Switching to DNS server 2a02:8109:1ec0:6f5:5667:51ff:feea:385f for interface enp5s0.

Also, name resolution seems to be very slow. When I use these name
servers directly in resolv.conf instead of going through the stub
resolver, there are no slowdowns and these log messages stop.
--
Regards,
Kai

Replies to list-only preferred.
Lennart Poettering
2017-05-05 18:18:41 UTC
Permalink
Post by Kai Krakow
Hello!
Why is systemd-resolved switching DNS servers all day long? This
doesn't seem to be right...
If you turn on debug logging, you should see an explanation right
before each switch. I figure we should choose the log levels more
carefully, so that whenever we switch we also log the reason at the
same level...


Lennart
--
Lennart Poettering, Red Hat
Kai Krakow
2017-05-06 12:22:21 UTC
Permalink
Am Fri, 5 May 2017 20:18:41 +0200
Post by Lennart Poettering
Post by Kai Krakow
Hello!
Why is systemd-resolved switching DNS servers all day long? This
doesn't seem to be right...
If you turn on debug logging, you should see an explanation right
before each switch. I figure we should choose the log levels more
carefully, so that whenever we switch we also log the reason at the
same level...
It looks like this all has to do with timeouts:

Mai 06 14:17:09 jupiter systemd-resolved[5585]: Cache miss for ssl.gstatic.com IN AAAA
Mai 06 14:17:09 jupiter systemd-resolved[5585]: Transaction 54375 for <ssl.gstatic.com IN AAAA> scope dns on enp5s0/*.
Mai 06 14:17:09 jupiter systemd-resolved[5585]: Using feature level UDP for transaction 54375.
Mai 06 14:17:09 jupiter systemd-resolved[5585]: Using DNS server fe80::b248:7aff:fee7:f438%2 for transaction 54375.
Mai 06 14:17:09 jupiter systemd-resolved[5585]: Sending query packet with id 54375.
Mai 06 14:17:09 jupiter systemd-resolved[5585]: Timeout reached on transaction 33004.
Mai 06 14:17:09 jupiter systemd-resolved[5585]: Retrying transaction 33004.
Mai 06 14:17:09 jupiter systemd-resolved[5585]: Switching to DNS server 2a02:8109:1ec0:6f5:5667:51ff:feea:385f for interface enp5s0.
Mai 06 14:17:09 jupiter systemd-resolved[5585]: Cache miss for ssl.gstatic.com IN A
Mai 06 14:17:09 jupiter systemd-resolved[5585]: Transaction 33004 for <ssl.gstatic.com IN A> scope dns on enp5s0/*.
Mai 06 14:17:09 jupiter systemd-resolved[5585]: Using feature level UDP for transaction 33004.
Mai 06 14:17:09 jupiter systemd-resolved[5585]: Using DNS server 2a02:8109:1ec0:6f5:5667:51ff:feea:385f for transaction 33004.
Mai 06 14:17:09 jupiter systemd-resolved[5585]: Sending query packet with id 33004.
Mai 06 14:17:09 jupiter systemd-resolved[5585]: Processing incoming packet on transaction 33004. (rcode=SUCCESS)
Mai 06 14:17:09 jupiter systemd-resolved[5585]: Not validating response for 33004, used server feature level does not support DNSSEC.
Mai 06 14:17:09 jupiter systemd-resolved[5585]: Added positive unauthenticated cache entry for ssl.gstatic.com IN A 143s on */INET6/2a02:8109:1ec0:6f5:5667:51ff:feea:385f
Mai 06 14:17:09 jupiter systemd-resolved[5585]: Transaction 33004 for <ssl.gstatic.com IN A> on scope dns on enp5s0/* now complete with <success> from network (unsigned).
Mai 06 14:17:09 jupiter systemd-resolved[5585]: Sending response packet with id 42127 on interface 1/AF_INET.
Mai 06 14:17:09 jupiter systemd-resolved[5585]: Sending response packet with id 22131 on interface 1/AF_INET.
Mai 06 14:17:09 jupiter systemd-resolved[5585]: Processing incoming packet on transaction 54375. (rcode=SUCCESS)
Mai 06 14:17:09 jupiter systemd-resolved[5585]: Not validating response for 54375, used server feature level does not support DNSSEC.
Mai 06 14:17:09 jupiter systemd-resolved[5585]: Added positive unauthenticated cache entry for ssl.gstatic.com IN AAAA 203s on enp5s0/INET6/fe80::b248:7aff:fee7:f438
Mai 06 14:17:09 jupiter systemd-resolved[5585]: Transaction 54375 for <ssl.gstatic.com IN AAAA> on scope dns on enp5s0/* now complete with <success> from network (unsigned).
Mai 06 14:17:09 jupiter systemd-resolved[5585]: Freeing transaction 33004.
Mai 06 14:17:09 jupiter systemd-resolved[5585]: Sent message type=method_return sender=n/a destination=:1.352 object=n/a interface=n/a member=n/a cookie=234 reply_cookie=2 error=n/a
Mai 06 14:17:09 jupiter systemd-resolved[5585]: Sent message type=method_call sender=n/a destination=org.freedesktop.DBus object=/org/freedesktop/DBus interface=org.freedesktop.DBus member=RemoveMatch cookie=235 reply_cookie=0 erro
Mai 06 14:17:09 jupiter systemd-resolved[5585]: Got message type=method_return sender=org.freedesktop.DBus destination=:1.273 object=n/a interface=n/a member=n/a cookie=181 reply_cookie=235 error=n/a
Mai 06 14:17:09 jupiter systemd-resolved[5585]: Freeing transaction 54375.

I just don't understand why, because all these nameservers work
perfectly well when used directly and not through the stub resolver.
--
Regards,
Kai

Replies to list-only preferred.
Kai Krakow
2017-05-08 22:42:43 UTC
Permalink
Am Sat, 6 May 2017 14:22:21 +0200
Post by Kai Krakow
Am Fri, 5 May 2017 20:18:41 +0200
Post by Lennart Poettering
Post by Kai Krakow
Hello!
Why is systemd-resolved switching DNS servers all day long? This
doesn't seem to be right...
If you turn on debug logging, you should see an explanation right
before each switch. I figure we should choose the log levels more
carefully, so that whenever we switch we also log the reason at the
same level...
Fixed by restarting the router. The cable modem seems to become buggy
with UDP packets after a lot of uptime: it simply drops UDP packets
silently at regular intervals. The WebUI was also very slow, probably
a CPU issue.

I'll follow up on this with the cable provider.

When the problem starts to show up, systemd-resolved is affected by it
more than direct resolving is. I don't know if there's something that
could be optimized in systemd-resolved to handle such issues better,
but I don't consider this a bug in systemd-resolved; it was a local
problem.

Thanks,
Kai
Post by Kai Krakow
[...]
I just don't understand why, because all these nameservers work
perfectly well when used directly and not through the stub resolver.
--
Regards,
Kai

Replies to list-only preferred.
Lennart Poettering
2017-05-09 18:37:16 UTC
Permalink
Post by Kai Krakow
Am Sat, 6 May 2017 14:22:21 +0200
Post by Kai Krakow
Am Fri, 5 May 2017 20:18:41 +0200
Post by Lennart Poettering
Post by Kai Krakow
Hello!
Why is systemd-resolved switching DNS servers all day long? This
doesn't seem to be right...
If you turn on debug logging, you should see an explanation right
before each switch. I figure we should choose the log levels more
carefully, so that whenever we switch we also log the reason at the
same level...
Fixed by restarting the router. The cable modem seems to become buggy
with UDP packets after a lot of uptime: it simply drops UDP packets
silently at regular intervals. The WebUI was also very slow, probably
a CPU issue.
I'll follow up on this with the cable provider.
When the problem starts to show up, systemd-resolved is affected by it
more than direct resolving is. I don't know if there's something that
could be optimized in systemd-resolved to handle such issues better,
but I don't consider this a bug in systemd-resolved; it was a local
problem.
Normally, configured DNS servers should be equivalent, so switching
between them on each retry should not come at any cost. Hence, besides
the extra log output, do you experience any real issues?

Lennart
--
Lennart Poettering, Red Hat
Kai Krakow
2017-05-10 20:12:37 UTC
Permalink
Am Tue, 9 May 2017 20:37:16 +0200
Post by Lennart Poettering
Post by Kai Krakow
Am Sat, 6 May 2017 14:22:21 +0200
Post by Kai Krakow
Am Fri, 5 May 2017 20:18:41 +0200
[...]
[...]
[...]
Post by Kai Krakow
Fixed by restarting the router. The cable modem seems to become buggy
with UDP packets after a lot of uptime: it simply drops UDP packets
silently at regular intervals. The WebUI was also very slow, probably
a CPU issue.
I'll follow up on this with the cable provider.
When the problem starts to show up, systemd-resolved is affected by it
more than direct resolving is. I don't know if there's something that
could be optimized in systemd-resolved to handle such issues better,
but I don't consider this a bug in systemd-resolved; it was a local
problem.
Normally, configured DNS servers should be equivalent, so switching
between them on each retry should not come at any cost. Hence, besides
the extra log output, do you experience any real issues?
Since I restarted the router, these logs no longer show up, except for
maybe a few entries per day (fewer than four).

But while those logs were spamming the journal, the real problem was
that the DNS resolver took about 10 seconds, roughly once per minute,
to resolve a website address - which really was a pain.

But well, what could systemd-resolved have done about it when the real
problem was some network equipment?

I just wonder why this was less visible when using those DNS servers
directly. Since DNS must have been designed with occasional packet
loss in mind (because it uses UDP), there must be a way to handle this
better. So I read a little bit of https://www.ietf.org/rfc/rfc1035.txt.

RFC1035 section 4.2.1 suggests that the retransmission interval for
queries should be 2-5 seconds, depending on statistics of previous
queries. To me, "retransmission" means the primary DNS server should
not be switched on every single query timeout (while it is still fine
to hand the same request over to the next available server).
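
To illustrate what I mean, here is a minimal sketch - the names and the
retry limit are my own invention, not code from systemd-resolved or
anything mandated by the RFC:

#include <stddef.h>

#define RETRIES_PER_SERVER 2   /* illustrative limit, my own choice */

typedef struct Query {
        size_t server_index;   /* position in the per-link server list */
        unsigned attempts;     /* transmissions sent to the current server */
} Query;

/* Called when the retransmission interval (2-5 s per RFC1035
 * section 4.2.1) expires without a reply. */
static void on_retransmit_timer(Query *q, size_t n_servers) {
        if (q->attempts < RETRIES_PER_SERVER) {
                q->attempts++;  /* retransmit to the same server first */
                return;
        }

        /* only after repeated silence move the query to the next server */
        q->server_index = (q->server_index + 1) % n_servers;
        q->attempts = 0;
}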

RFC1035 section 7 discusses a suggested implementation of the resolver
and covers retransmission and server selection algorithms:

It suggests recording the average response time of each server the
resolver queries, so that the faster-responding servers are selected
first. Without any query history, the selection algorithm should
assume a response time of 5-10 seconds.
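
In C, such bookkeeping could look roughly like this (a sketch of the
RFC's idea only, not the actual resolved code; the 7/8 smoothing
factor and the 7 s default are assumptions of mine):

#include <stddef.h>

#define INITIAL_RTT_USEC (7 * 1000000ULL)  /* no history: assume 7 s, inside the RFC's 5-10 s */

typedef struct DnsServerStats {
        unsigned long long srtt_usec;  /* smoothed response time */
        int have_history;              /* any responses recorded yet? */
} DnsServerStats;

/* Fold a new response-time sample into the running average
 * (simple exponential smoothing: srtt = 7/8 * srtt + 1/8 * sample). */
static void record_response_time(DnsServerStats *s, unsigned long long sample_usec) {
        if (!s->have_history) {
                s->srtt_usec = sample_usec;
                s->have_history = 1;
        } else
                s->srtt_usec = (7 * s->srtt_usec + sample_usec) / 8;
}

/* Estimated response time, falling back to the pessimistic default. */
static unsigned long long estimated_rtt(const DnsServerStats *s) {
        return s->have_history ? s->srtt_usec : INITIAL_RTT_USEC;
}

/* Pick the server expected to answer fastest. */
static size_t pick_fastest(const DnsServerStats *servers, size_t n) {
        size_t best = 0;

        for (size_t i = 1; i < n; i++)
                if (estimated_rtt(&servers[i]) < estimated_rtt(&servers[best]))
                        best = i;

        return best;
}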

It also suggests switching the primary server only when there was some
"bizarre" error or a server error reply. However, I don't think it
should actually remove servers from the list as suggested there, since
we are a client resolver, not a server resolver that can refresh its
peer list from neighbor servers. We could instead reset a server's
query time statistics to move it to the end of the list, and/or
blacklist it for a while.

Somewhere else in that document I've read that it is perfectly
permitted to interleave multiple parallel requests across the DNS
servers in the list. So I guess it would be nice, and allowed, if
systemd-resolved used more than just one DNS server at a time,
alternating between them on each request - maybe taking the two best
according to the query time statistics.
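
Reusing the DnsServerStats sketch from above, the alternation could
look like this (again just my illustration, not existing code):

/* Round-robin between the two servers with the best estimated
 * response times; fall back to the single best one if there is
 * only one server on the link. */
static size_t pick_alternating(const DnsServerStats *servers, size_t n, unsigned *rr_counter) {
        size_t best = 0, second = 0;
        unsigned long long best_rtt = ~0ULL, second_rtt = ~0ULL;

        for (size_t i = 0; i < n; i++) {
                unsigned long long rtt = estimated_rtt(&servers[i]);

                if (rtt < best_rtt) {
                        second = best;
                        second_rtt = best_rtt;
                        best = i;
                        best_rtt = rtt;
                } else if (rtt < second_rtt) {
                        second = i;
                        second_rtt = rtt;
                }
        }

        if (n < 2)
                return best;

        return ((*rr_counter)++ % 2) ? second : best;
}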

I also guess it should use shorter timeouts for queries, since there
may be more than one DNS server: the initial query time statistic
should assume 5-10 seconds, while the retransmission interval suggests
2-5 seconds.

I think it would work to use "10 seconds divided by the server count",
or 2 seconds, whichever is bigger, as the timeout before rotating a
query to the next server. But a late reply should still be accepted,
as pointed out in section 7.3, even when the query has already been
rotated to the next DNS server. With only a single DNS server all of
this logic can be skipped, as there is no rotation, and a timeout of
10 seconds would work.
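
As code, the timeout rule I have in mind is simply
max(2 s, 10 s / server count):

#include <stddef.h>

#define USEC_PER_SEC 1000000ULL

/* Timeout before rotating a query to the next server. With a single
 * server there is no rotation, so the full 10 s apply. */
static unsigned long long rotation_timeout_usec(size_t n_servers) {
        unsigned long long t;

        if (n_servers <= 1)
                return 10 * USEC_PER_SEC;

        t = 10 * USEC_PER_SEC / n_servers;
        return t > 2 * USEC_PER_SEC ? t : 2 * USEC_PER_SEC;
}

With my 4 servers this gives 10 s / 4 = 2.5 seconds per attempt, so a
single lost UDP packet would stall resolution for about 2.5 seconds
instead of 10.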

That way, systemd-resolved would "learn" to use only the fastest DNS
server, and when that one becomes too slow (which means 5-10 seconds,
based on the RFC), it would switch to the next server. If parallel
requests come in, it would use more DNS servers from the list in
parallel, auto-sorted by query reply time. The initial order is the
one given by the administrator (or by whatever provides the DNS server
list).

Applied to my UDP packet loss (which seems to be loss of single
packets, as an immediately following request would get a reply), this
would mean the systemd resolver gives me an address after 2-3 seconds
instead of 5 or 10, because I had 4 DNS servers on that link. This is
more or less the behavior I saw earlier when I switched back to direct
resolving instead of going through systemd-resolved.

What do you think? I could imagine this as an implementation
improvement project in the GitHub issue tracker. I would be willing to
work on such an improvement, provided the existing code is not too
difficult to understand, since I'm not a C pro (my last bigger project
was about 10 years ago). Naturally, such an improvement could only be a
spare-time project for me and would take some time.

Of course, all of this only works when all DNS servers on the same
link can resolve the same zones. Otherwise you will get a lot of
timeouts and switching. I know that some people prefer to configure
DNS server IPs that resolve different zones (or can only resolve some
of them). I never understood why one would set up such a
misconfiguration - it is mostly seen in the Windows world - but well:
it exists. A warning in the log when such a situation is detected
could address this. After all, systemd-resolved is able to correctly
handle per-link DNS servers.
--
Regards,
Kai

Replies to list-only preferred.
Kai Krakow
2017-05-13 12:44:09 UTC
Permalink
Am Wed, 10 May 2017 22:12:37 +0200
Post by Kai Krakow
[...]
What do you think? I could imagine this as an implementation
improvement project in the GitHub issue tracker. I would be willing to
work on such an improvement, provided the existing code is not too
difficult to understand, since I'm not a C pro (my last bigger project
was about 10 years ago). Naturally, such an improvement could only be a
spare-time project for me and would take some time.
[...]
I've prepared a patch and added a pull request:
https://github.com/systemd/systemd/pull/5953

I saw that the basic infrastructure as laid out in RFC1035 is there,
but the current implementation only ever increases timeouts and never
lowers them, so timeouts end up hard-limited to 5 seconds all the
time. That way, occasional UDP packet drops always result in 5-second
stalls. With bursts of packet drops, you would easily get DNS timeouts
in clients (though this seems to have been fixed in systemd-resolved
already, as I no longer saw that effect), or at least stalls of 5, 10,
or 15 seconds, which is very annoying.

With this patch, after a DNS server recovers, the timeouts are lowered
again so we get faster failover to the next server in case a packet is
lost again.
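
The general idea is simple; this is only a sketch of the principle,
not the actual code from the pull request, and the 0.5 s floor is an
assumption of mine:

#define DNS_TIMEOUT_MIN_USEC (500 * 1000ULL)   /* assumed floor of 0.5 s */
#define DNS_TIMEOUT_MAX_USEC (5 * 1000000ULL)  /* the 5 s ceiling mentioned above */

/* On a lost packet: back off exponentially up to the ceiling. */
static void on_timeout(unsigned long long *resend_usec) {
        *resend_usec *= 2;
        if (*resend_usec > DNS_TIMEOUT_MAX_USEC)
                *resend_usec = DNS_TIMEOUT_MAX_USEC;
}

/* On a successful reply: decay back towards the floor, so a packet
 * lost later fails over quickly instead of stalling for 5 s again. */
static void on_success(unsigned long long *resend_usec) {
        *resend_usec /= 2;
        if (*resend_usec < DNS_TIMEOUT_MIN_USEC)
                *resend_usec = DNS_TIMEOUT_MIN_USEC;
}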

It works pretty well for web browsing here. I didn't test the effects
on LLMNR and mDNS, as I don't have a proper test scenario for them
here.
--
Regards,
Kai

Replies to list-only preferred.