Discussion:
[Unbound-users] Unbound stops answering after ADSL-line bounce
Jan-Piet Mens
2012-01-23 07:29:25 UTC
Hello,

I'm running unbound-1.4.14 behind an ADSL line which is cycled once in a
while. This can take anything from several seconds to four/five minutes
(router reboot) before the Internet is visible again. At this point,
Unbound replies with SERVFAIL to all queries, as though it has marked
the DNS servers as down and stopped sending requests to them.

Is this a known issue?

(Somebody else has also experienced this problem with a vanilla Unbound
on what I call DAP [1], and he's reported this to me privately with a
pcap and verbosity=5 logfile I can submit privately if required.)

Regards,

-JP

[1]: http://jpmens.net/pages/dns/dnssec-appliance/
W.C.A. Wijngaards
2012-01-23 08:21:08 UTC
Hi Jan-Piet,

Yes, it marks servers as down and stops sending queries. This case was
rewritten for better handling, so if you run an older version an
update may be useful. However, all versions scale back the packets
they send to hosts that are down.

This has a timeout of x minutes (1-15). After that your service
should re-enable again. If the downtime was under two minutes, you
have to wait that amount of time for service to resume (exponential
backoff used here). If the downtime was longer, you may have to wait
15 minutes (the infra-host-ttl config option) before service resumes to hosts
that were probed to be 'down'.

If you want service to instantly go down and up with the line
downtime, and you can notice the line-bounce via some other method,
then you could use e.g. unbound-control flush_infra to resume traffic.

Best regards,
Wouter
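
Wouter's suggestion of noticing the line bounce "via some other method" and then running flush_infra could be automated roughly as below. This is only a sketch under stated assumptions: the probe command, the probe address (192.0.2.1 is a documentation placeholder), the polling interval, and the function names are all illustrative and not from the thread. The `all` argument asks unbound-control to empty the whole infra cache.

```shell
#!/bin/sh
# Sketch: flush Unbound's infra cache when the WAN link comes back.
# PROBE_CMD and FLUSH_CMD are illustrative defaults, not from the thread.
# 192.0.2.1 is a placeholder; use a host that only answers when the
# line is up.
PROBE_CMD=${PROBE_CMD:-"ping -c 1 192.0.2.1"}
FLUSH_CMD=${FLUSH_CMD:-"unbound-control flush_infra all"}

link_state=up   # state seen on the previous probe

# Probe once; run the flush only on a down -> up transition.
check_link() {
    if $PROBE_CMD >/dev/null 2>&1; then
        new_state=up
    else
        new_state=down
    fi
    if [ "$link_state" = down ] && [ "$new_state" = up ]; then
        $FLUSH_CMD
    fi
    link_state=$new_state
}

# Poll forever; $1 is the interval in seconds (default 10).
watch_link() {
    while :; do
        check_link
        sleep "${1:-10}"
    done
}
```

Calling `watch_link 10` from an init script would poll every ten seconds. Flushing only on the down-to-up transition keeps the infra cache (and its round-trip timing data) intact while the line is healthy.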
Jan-Piet Mens
2012-01-23 12:04:28 UTC
Hello Wouter,
Post by W.C.A. Wijngaards
This has a timeout of x minutes (1-15). After that your service
should re-enable again. If the downtime was under two minutes, you
have to wait that amount of time for service to resume (exponential
backoff used here). If the downtime was longer, you may have to wait
15 minutes (the infra-host-ttl config option) before service resumes to hosts
that were probed to be 'down'.
If I understand you (and the man page) correctly, setting
`infra-host-ttl' to something like 10 seconds would mean that at most 10
seconds would elapse before Unbound starts querying dropped servers.
Maybe that would be a tolerable setting for a relatively low-volume SOHO
environment.
Post by W.C.A. Wijngaards
If you want service to instantly go down and up with the line
downtime, and you can notice the line-bounce via some other method,
then you could use e.g. unbound-control flush_infra to resume traffic.
I'll certainly try that next time, although this will be difficult to
automate without continuously monitoring the line status.

Regards,

-JP
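
As an unbound.conf fragment, the setting JP describes might look like the sketch below. The 10-second value is his example figure, not a tested recommendation; the option's default is 900 seconds, which matches the 15 minutes Wouter mentions.

```
server:
    # Expire infra-cache host entries after 10 seconds instead of the
    # default 900. Servers marked as down are retried sooner, at the
    # cost of forgetting timing data for healthy servers just as fast.
    infra-host-ttl: 10
```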
Paul Taylor
2012-01-23 18:40:18 UTC
Wouter,

Hi - I'm the DAP user that JP mentioned.

As a side note, I'm extremely impressed with the performance of Unbound.
We are looking at using Unbound at my job and have been doing a bit of
testing. Using ResPerf to stress test with a cleared cache resulted in
a peak of about 23,500 queries per second with Unbound doing DNSSEC.
This was on a Dell 2850 server with two dual-core Xeons running at 2.8
GHz under Ubuntu 12.04 alpha. We also tested Unbound with DNSSEC
disabled and got over 35,000 queries per second. A 3rd party Windows
DNS server (not performing DNSSEC validation) peaked at around 1250
queries per second under Windows 2003 on similar hardware.

Back to my home issue, though. The first time I experienced this issue,
my internet connection had gone down for about an hour around 2 AM. It
was about 7 AM before I noticed the problem (sleep has to happen
sometime). I restarted Unbound, and it recovered.


The 2nd time this happened, I had about 3 bounces in about 10 minutes
during the afternoon. I believe each bounce took a minute or so to
recover. I was at work at the time and my wife and kids couldn't get
anywhere on the Internet. I got home a few hours later and DNS
resolution was not working until I restarted Unbound.

So, in these two cases I've had outages of various lengths, but hours
have passed without DNS resolution working.

Since most people using Unbound are probably using it for the DNSSEC
capability, perhaps my configuration has to do with the issue I'm having
recovering? In my environment, Unbound isn't configured to go direct,
but rather forward to various DNS servers. I have about 10-12 domains
(mostly CDNs) that I'm forwarding to my ISP's DNS servers so I get DNS
replies directing me to close servers. Theoretically, this should help
me have a better experience with Netflix at home. After the forwarder
definitions for all the CDNs, I have a forwarder defined for "." to send
everything else to OpenDNS. This is to help keep my family from getting
to websites I don't want little eyes to run across.

Is it possible that this type of config might cause Unbound to recover
differently?

Thanks,
Paul
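
For readers following along, the forwarding layout Paul describes might look roughly like this in unbound.conf. It is a sketch, not his actual configuration: the CDN zone name and the ISP resolver address (192.0.2.53, a documentation placeholder) are hypothetical, and only one of his 10-12 CDN zones is shown. The two addresses in the "." zone are OpenDNS's public resolvers.

```
# Hypothetical CDN domain sent to the ISP's resolver (placeholder address).
forward-zone:
    name: "cdn.example."
    forward-addr: 192.0.2.53

# Everything else goes to OpenDNS.
forward-zone:
    name: "."
    forward-addr: 208.67.222.222
    forward-addr: 208.67.220.220
```

With a "." forward zone in place, Unbound never contacts authority servers directly, so every lookup depends on a small set of forwarder addresses being considered reachable.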
W.C.A. Wijngaards
2012-01-24 16:06:04 UTC
Hi Paul,

Nice that the performance looks good :-)

If you are running unbound under windows, there are some things to be
aware of. On windows, unbound has reduced capacity because it cannot
open a lot (thousands) of file descriptors. Windows simply lacks an
API that makes this possible, unless you spawn thousands of threads or
something similar. So, if the performance you see is based on
recursion, then using Linux (or FreeBSD) on a similar box should have
more capacity (you can configure unbound to have extra capacity). If
the performance you see is based on cache-responses, then the move to
Linux makes less of a difference. With capacity I mean recursing a
lot of user queries at the same time, with thousands of sockets open.

The capacity on windows today is more relevant to small workgroups or
desktop environments. With some code changes it could be improved,
e.g. with polling behaviour the number of sockets can be increased to
very large numbers. Today unbound sleeps the process nicely when not
busy in WSAWaitForMultipleEvents.

The easiest way today to get more capacity on Windows, by the way, is
to increase the number of workers (num-threads) to 4 (or so).

You note unbound was down for a lengthy time; can you upgrade or, if
this was already a recent version, send me more details? It should
really recover fully from anything after 15 minutes, I believe.

Best regards,
Wouter
l***@kwsoft.de
2012-01-27 10:39:57 UTC
Post by Paul Taylor
Since most people using Unbound are probably using it for the DNSSEC
capability, perhaps my configuration has to do with the issue I'm having
recovering? In my environment, Unbound isn't configured to go direct,
but rather forward to various DNS servers. I have about 10-12 domains
(mostly CDNs) that I'm forwarding to my ISP's DNS servers so I get DNS
replies directing me to close servers. Theoretically, this should help
me have a better experience with Netflix at home. After the forwarder
definitions for all the CDNs, I have a forwarder defined for "." to send
everything else to OpenDNS. This is to help keep my family from getting
to websites I don't want little eyes to run across.
Is it possible that with this type of config that it might cause Unbound
to recover differently?
This reminds me of the issues we had when using Unbound with DNSSEC
validation *and* a forwarder. For some time Unbound used BIND 9.7.4 as
its parent, but it also happened with a second Unbound instance as
parent: Unbound stopped resolving any names because of some obscure
validation failure. We have "solved" the problem by setting the
internal Unbound to not validate and letting the forwarder do the
DNSSEC work.

Regards

Andreas
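
Andreas does not say how validation was switched off on the internal instance. One common way, shown here only as an assumption about his setup, is to leave the validator out of the module list in unbound.conf:

```
server:
    # Run only the iterator module. Without "validator" in this list
    # the instance does no DNSSEC validation of its own and relies on
    # the upstream forwarder for it.
    module-config: "iterator"
```

The default is "validator iterator", so this is a deliberate trade: clients of the internal resolver then trust whatever the forwarder returns.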
Jan-Piet Mens
2012-01-27 10:54:27 UTC
Post by l***@kwsoft.de
We have "solved" the problem by
setting the internal Unbound to not validate and let the forwarder
do the DNSSEC work.
That would be a neat feature for DNSSEC-Trigger: detect that the
upstream forwarder is Unbound (version.bind chaos txt) and disable the
validator. Well, maybe not. :-)

-JP
l***@kwsoft.de
2012-01-27 12:57:37 UTC
Post by Jan-Piet Mens
Post by l***@kwsoft.de
We have "solved" the problem by
setting the internal Unbound to not validate and let the forwarder
do the DNSSEC work.
That would be a neat feature for DNSSEC-Trigger: detect that the
upstream forwarder is Unbound (version.bind chaos txt) and disable the
validator. Well, maybe not. :-)
In our case it doesn't matter because both resolvers are managed by
us, but for sure this should not be done automatically. Basically it
looks like there are "rough-edges" when cascaded resolvers all try to
do DNSSEC validation.

Regards

Andreas
W.C.A. Wijngaards
2012-02-10 10:05:19 UTC
Hi Andreas,
Post by Jan-Piet Mens
Post by l***@kwsoft.de
We have "solved" the problem by
setting the internal Unbound to not validate and let the forwarder
do the DNSSEC work.
That would be a neat feature for DNSSEC-Trigger: detect that the
upstream forwarder is Unbound (version.bind chaos txt) and disable the
validator. Well, maybe not. :-)
In our case it doesn't matter because both resolvers are managed by us,
but for sure this should not be done automatically. Basically it looks
like there are "rough-edges" when cascaded resolvers all try to do
DNSSEC validation.
This was with unbound at an older version? In 1.4.11 there has been a
fix that should help cascading validators. The issue is that the
downstream validator sends CD=1 queries to the upstream. Now, suppose
an authority server is outdated but another is not. Then the downstream
validator cannot perform failover to the other authority server, because
it has to talk to the upstream validator. The upstream validator cannot
perform failover to the other authority server because with CD=1 it is
not validating the query. The fix in 1.4.11 is to make the upstream
validator perform failover to the other authority server for CD=1
queries as well.

Best regards,
Wouter
Paul Taylor
2012-02-10 13:28:27 UTC
On the original topic of this thread, I have another incident to report.
After experiencing some strangeness with my NAS (where unbound was
running previously), I moved Unbound to an installation of pfSense
running on an old net4801. I believe pfSense is still on version 1.4.14
of Unbound. I configured it pretty much identically to my NAS
installation of Unbound. By that, I mean that I have numerous
forwarders added for various CDNs, with a "." forwarder pointing to
OpenDNS. DNSSEC validation is disabled. About two weeks had passed
with no further problems, until this morning.

Just before I was about to leave home for work (just after 7 AM), my
daughter told me that the internet was down. I checked my router and
saw that the internet connection went down last night for a little over
an hour. It recovered about 3:15 AM. So, it had been up and
operational for almost 4 hours by the time I started looking at the
issue. A quick nslookup showed SERVFAIL replies. Since I had to leave
for work, I didn't have time to do much in the way of troubleshooting.
I recycled the service via pfSense's Services page (I think it just
kills and restarts the service), and DNS was resolving properly again.

Unfortunately, since it's on an embedded box, I didn't have logging
enabled, and I don't know what commands, if any, I could run that let
you see the "state" Unbound is stuck in when this happens.
W.C.A. Wijngaards
2012-02-10 14:34:05 UTC
Hi Paul,
Post by Paul Taylor
On the original topic of this thread, I have another incident to report.
After experiencing some strangeness with my NAS (where unbound was
running previously), I moved Unbound to an installation of pfSense
running on an old net4801. I believe pfSense is still on version 1.4.14
of Unbound. I configured it pretty much identically to my NAS
installation of Unbound. By that, I mean that I have numerous
forwarders added for various CDNs, with a "." forwarder pointing to
OpenDNS. DNSSEC validation is disabled. About two weeks had passed
with no further problems, until this morning.
Just before I was about to leave home for work (just after 7 AM), my
daughter told me that the internet was down. I checked my router and
saw that the internet connection went down last night for a little over
an hour. It recovered about 3:15 AM. So, it had been up and
operational for almost 4 hours by the time I started looking at the
issue. A quick nslookup showed SERVFAIL replies. Since I had to leave
for work, I didn't have time to do much in the way of troubleshooting.
I recycled the service via pfSense's Services page (I think it just
kills and restarts the service), and DNS was resolving properly again.
It should not be down for that long; 15 minutes really.
Post by Paul Taylor
Unfortunately, since it's on an embedded box, I didn't have logging
enabled, and I don't know what commands, if any, I could run that let
you see the "state" Unbound is stuck in when this happens.
unbound-control verbosity 4 ; then nslookup and capture the logs (which
are then plentiful).

unbound-control dump_infra > tofile.txt
that shows the state of the infrastructure cache.

Best regards,
Wouter
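
The two commands above could be wrapped into one capture step to run while Unbound is stuck. This is a sketch under assumptions: the function name, the output file names, the resolver address 127.0.0.1, the test name example.com, and resetting verbosity to 1 afterwards are all illustrative choices, not from the thread. The verbose log itself ends up wherever Unbound is configured to log.

```shell
#!/bin/sh
# Sketch: capture Unbound's state while it is stuck answering SERVFAIL.
# UC, LOOKUP, RESOLVER and the file names are illustrative assumptions.
UC=${UC:-unbound-control}
LOOKUP=${LOOKUP:-nslookup}
RESOLVER=${RESOLVER:-127.0.0.1}

# capture_state DIR [NAME]
capture_state() {
    dir=$1
    name=${2:-example.com}
    stamp=$(date +%Y%m%d-%H%M%S)
    mkdir -p "$dir" || return 1

    # More detailed logging for the test query (written to Unbound's log).
    $UC verbosity 4

    # One query against the local resolver; keep the reply even on failure.
    $LOOKUP "$name" "$RESOLVER" > "$dir/nslookup-$stamp.txt" 2>&1 || true

    # Infrastructure cache: per-server probe and timeout state.
    $UC dump_infra > "$dir/infra-$stamp.txt"

    # Assumed normal level; adjust to whatever unbound.conf sets.
    $UC verbosity 1
}
```

For example, `capture_state /var/tmp/unbound-debug` before restarting the service would preserve the evidence that a restart otherwise wipes.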
Paul Taylor
2012-02-10 15:50:27 UTC
Thank you - I'll file these commands away for future reference.

Previously, I tried recreating the problem a few times, but after
waiting 15 minutes (per your previous advice) after the WAN recovery,
DNS has worked. I've not tried leaving my internet connection down for
much more than about 10 minutes, though.

Paul Taylor
2012-01-25 13:00:55 UTC
Wouter,

Unfortunately, I've not had a chance to do further
testing. I should be able to test this weekend. I plan to take my
connection down for 10 minutes, then bring it back up and wait 20
minutes to see if Unbound will recover. While doing this, should I have
a verbosity level 5 log file going? Would a pcap file at my router
(filtered on the IP of the Unbound box) be helpful?

Thanks,
Paul
W.C.A. Wijngaards
2012-01-26 10:45:04 UTC
Hi Paul,
Post by Paul Taylor
Wouter,
Unfortunately, I’ve not had a chance to do further testing. I
should be able to test this weekend. I plan to take my connection
down for 10 minutes, then bring it back up and wait 20 minutes to
see if Unbound will recover. While doing this, should I have a
verbosity level 5 log file going? Would a pcap file at my router
(filtered on the IP of the Unbound box) be helpful?
Packet info would be included in verbosity 5 already, the pcap dump
may be useful as a different format, but not really needed.

What would be nice and interesting is a look at the output of
unbound-control dump_infra. This prints a text list of the probe
status of IPs. Maybe run it at the start (>file1 to store it), when
resolution has failed, and, if it stays down, again afterwards.

Best regards,
Wouter
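
The start / failed / afterwards snapshots Wouter asks for could be taken and compared along these lines. This is a sketch only: the function names, the labels, and the file naming are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: labelled dump_infra snapshots that can be compared later.
# UC, the labels and the file names are illustrative assumptions.
UC=${UC:-unbound-control}

# snapshot_infra DIR LABEL   (LABEL such as start, failed, after)
snapshot_infra() {
    mkdir -p "$1" || return 1
    $UC dump_infra > "$1/infra-$2.txt"
}

# compare_infra DIR LABEL_A LABEL_B
# Prints the differences; like diff, returns non-zero when they differ.
compare_infra() {
    diff "$1/infra-$2.txt" "$1/infra-$3.txt"
}
```

Running `snapshot_infra /var/tmp/infra start` at startup and `snapshot_infra /var/tmp/infra failed` during an outage, then `compare_infra /var/tmp/infra start failed`, shows which server entries changed state.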
Paul Taylor
2012-01-26 15:15:48 UTC
Post by W.C.A. Wijngaards
Packet info would be included in verbosity 5 already, the pcap dump
may be useful as a different format, but not really needed.
What would be nice and interesting is a look at the output of
unbound-control dump_infra. This prints a text list of the probe
status of IPs. Maybe run it at the start (>file1 to store it), when
resolution has failed, and, if it stays down, again afterwards.
Wouter,

I had an unexpected chance to test last night... When I tested before,
I didn't know that it could take up to 15 minutes to recover, so I
assumed that I had successfully recreated the problem after bringing the
WAN interface back up and finding that Unbound wasn't sending the
requests out within the next few minutes. Last night after waiting
15-20 minutes after bringing the WAN back up, everything recovered as
you had explained that it should. I did this twice, and it recovered
fine both times, so the problem I experienced before is not as easy to
re-create as I previously thought. I've not seen an actual occurrence
of this problem since last week. It happened twice within about a week
previously.

So, I'll try to turn on a verbosity level 5 log and then run the
dump_infra command at startup, and again when it gets to a failed state
(as in, failed for more than 15 minutes after a WAN link recovery). It
looks like I'll have to wait until there's an actual natural occurrence
of the problem again.

Thanks,
Paul