DNS timeout
Posted: Wed Mar 04, 2015 10:52 am
by jpipitone
My current resolv.conf looks like this:
Code:
options timeout:1
search domain1.com
search domain2.com
nameserver 10.9.X.XXX
nameserver 10.9.X.XXX
nameserver 10.9.X.XX
If we reboot the primary name server (Windows updates, etc.), Nagios reports that several hosts and services appear to be down. I recently added options timeout:1 and did some testing, but Nagios still reports hosts and services as down. When the primary DC comes back up, Nagios recovers.
Is there any way to cut back on the number of false notifications that we have hosts and services down if our primary DC (DNS) goes down? Shouldn't it be failing over to the next DNS server in line within 1 second?
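One detail in the resolv.conf above may be worth checking. According to the resolv.conf(5) man page, when more than one search (or domain) line is present, only the last one is used, so the two separate search lines would leave domain2.com as the only search domain. Listing both on one line (search domain1.com domain2.com) keeps both. Whether this has anything to do with the false alerts is not established in this thread. Below is a minimal Python sketch of that "last one wins" rule. The function name parse_resolv_conf and the nameserver addresses are placeholders invented for illustration, since the real addresses are masked in the post.

```python
def parse_resolv_conf(text):
    """Return the effective settings from resolv.conf-style text.

    Simplified sketch: handles nameserver, search/domain and options only.
    """
    nameservers, search, options = [], [], {}
    for raw in text.splitlines():
        line = raw.strip()
        # Blank lines and lines starting with '#' or ';' are ignored.
        if not line or line[0] in "#;":
            continue
        keyword, *args = line.split()
        if keyword == "nameserver" and args:
            nameservers.append(args[0])
        elif keyword in ("search", "domain"):
            # A later search/domain line replaces any earlier one.
            search = args
        elif keyword == "options":
            for opt in args:
                name, _, value = opt.partition(":")
                options[name] = int(value) if value.isdigit() else True
    return {"nameservers": nameservers, "search": search, "options": options}


# Placeholder addresses; the originals are masked in the post.
SAMPLE = """\
options timeout:1
search domain1.com
search domain2.com
nameserver 10.9.0.1
nameserver 10.9.0.2
nameserver 10.9.0.3
"""

cfg = parse_resolv_conf(SAMPLE)
print(cfg["search"])   # ['domain2.com']
print(cfg["options"])  # {'timeout': 1}
```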
Re: DNS timeout
Posted: Wed Mar 04, 2015 11:26 am
by ssax
It looks to be set up properly. If you are still getting DNS timeouts, you may want to consider setting up a local DNS cache.
You can follow the guide here:
http://www.g-loaded.eu/2010/09/18/cachi ... g-dnsmasq/
Re: DNS timeout
Posted: Wed Mar 04, 2015 3:04 pm
by jpipitone
Thanks - I have it configured and we're performing lookups for 2 internal domains, as well as external resources.
Re: DNS timeout
Posted: Wed Mar 04, 2015 3:43 pm
by jdalrymple
Hi jpipitone,
In a quick lab buildup I couldn't reproduce your results:
Code:
[jdalrymple@localhost ~]$ sudo tcpdump -i eth0 port 53
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
14:33:49.232512 IP 192.168.145.128.48886 > dnsserver1.domain: 54133+ A? www.google.com. (32)
14:33:50.233511 IP 192.168.145.128.51837 > dnsserver2: 54133+ A? www.google.com. (32)
14:33:50.919923 IP 192.168.145.128.46243 > dnsserver1.domain: 54866+ A? www.google.com. (32)
14:33:51.234160 IP 192.168.145.128.51837 > dnsserver2: 54133+ A? www.google.com. (32)
14:33:51.257732 IP dnsserver2 > 192.168.145.128.51837: 54133 5/0/0 A 173.194.46.115, A 173.194.46.116, A 173.194.46.114, A 173.194.46.113, A 173.194.46.112 (112)
14:33:51.305594 IP dnsserver2 > 192.168.145.128.58609: 48947 NXDomain 0/0/0 (46)
14:33:51.925405 IP 192.168.145.128.38145 > dnsserver2: 54866+ A? www.google.com. (32)
14:33:51.949905 IP dnsserver2 > 192.168.145.128.38145: 54866 5/0/0 A 173.194.46.116, A 173.194.46.114, A 173.194.46.113, A 173.194.46.112, A 173.194.46.115 (112)
14:33:52.926330 IP 192.168.145.128.38145 > dnsserver2: 54866+ A? www.google.com. (32)
14:33:52.946569 IP dnsserver2 > 192.168.145.128.38145: 54866 5/0/0 A 173.194.46.114, A 173.194.46.113, A 173.194.46.112, A 173.194.46.115, A 173.194.46.116 (112)
Each time I ran my host check, it tried dns1 first and, when that was unreachable, failed over to dns2 right away. As expected, my host check took about 1 second longer.
As an aside, when I made dns1 available again, it never failed over to dns2.
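The failover delay can be read straight off the capture above by pairing each query ID's first request to dnsserver1 with its first request to dnsserver2. The following Python sketch does that. The function name failover_delays is made up for this example, and the regular expression assumes the exact tcpdump line layout and server labels shown in the capture, so it may need adjusting for other output.

```python
import re
from datetime import datetime

# Matches outgoing query lines such as:
# 14:33:49.232512 IP 192.168.145.128.48886 > dnsserver1.domain: 54133+ A? ...
# Responses have no '+' after the query ID, so they are skipped.
QUERY_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{6}) IP \S+ > (\S+): (\d+)\+ ")


def failover_delays(lines, primary="dnsserver1.domain", secondary="dnsserver2"):
    """Map query ID -> seconds between first query to primary and to secondary."""
    first_primary = {}
    delays = {}
    for line in lines:
        match = QUERY_RE.match(line)
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%H:%M:%S.%f")
        server, query_id = match.group(2), match.group(3)
        if server == primary:
            first_primary.setdefault(query_id, stamp)
        elif server == secondary and query_id in first_primary:
            if query_id not in delays:
                delays[query_id] = (stamp - first_primary[query_id]).total_seconds()
    return delays


# Four query lines copied from the capture above.
CAPTURE = [
    "14:33:49.232512 IP 192.168.145.128.48886 > dnsserver1.domain: 54133+ A? www.google.com. (32)",
    "14:33:50.233511 IP 192.168.145.128.51837 > dnsserver2: 54133+ A? www.google.com. (32)",
    "14:33:50.919923 IP 192.168.145.128.46243 > dnsserver1.domain: 54866+ A? www.google.com. (32)",
    "14:33:51.925405 IP 192.168.145.128.38145 > dnsserver2: 54866+ A? www.google.com. (32)",
]

for query_id, seconds in failover_delays(CAPTURE).items():
    print(query_id, round(seconds, 3))
```

For these lines both delays come out at roughly one second, which matches timeout:1.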
What check command are you using for host availability?
Do any hosts or services proceed normally when your primary DNS goes offline? It might be useful to find some patterns if so.
Re: DNS timeout
Posted: Wed Mar 04, 2015 4:13 pm
by jpipitone
Just a simple ping for host availability, and checking a few other services. It depends on the host.
After I set up dnsmasq, the problem went away.
Re: DNS timeout
Posted: Wed Mar 04, 2015 4:26 pm
by jdalrymple
After I posted my reply, it occurred to me that my simulation isn't quite the same as your real-world DNS failure. I was blocking the port with a firewall, whereas your Windows server's DNS service was dying a slow death via Microsoft's shutdown process; possibly port 53 was "kind of there" but not responding in a timely fashion. I wonder whether that difference could affect the behavior of the timeout specified in /etc/resolv.conf; it's hard to be sure. Either way, I'm glad your problem is solved. Can we lock the thread?
Re: DNS timeout
Posted: Wed Mar 04, 2015 4:49 pm
by jpipitone
jdalrymple wrote: After I posted my reply, it occurred to me that my simulation isn't quite the same as your real-world DNS failure. I was blocking the port with a firewall, whereas your Windows server's DNS service was dying a slow death via Microsoft's shutdown process; possibly port 53 was "kind of there" but not responding in a timely fashion. I wonder whether that difference could affect the behavior of the timeout specified in /etc/resolv.conf; it's hard to be sure. Either way, I'm glad your problem is solved. Can we lock the thread?
Absolutely - thank you!
DNS timeout - it's back
Posted: Mon Mar 09, 2015 9:33 am
by jpipitone
I've set up a dnsmasq caching DNS server on our Nagios XI server. It appears to be reporting false positives very frequently, more so than when we were using remote DNS servers for lookups.
It was recommended that we set up a caching server. Are there any recommendations for tuning dnsmasq for Nagios?
I've run "time nslookup service.com" and the replies appear to be cached. I'm not sure why Nagios would be reporting so many false positives.
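One caveat about that test: nslookup ships with its own resolver code, so it may not behave the same way as lookups made through the system resolver. As a rough cross-check, the Python sketch below times repeated lookups through socket.getaddrinfo, which on Linux normally goes through the system resolver and /etc/resolv.conf. The function name time_lookups and its resolver parameter are invented for this example, and service.com is just the placeholder name from the post.

```python
import socket
import time


def time_lookups(hostname, repeats=3, resolver=socket.getaddrinfo):
    """Return the wall-clock duration in seconds of each lookup attempt."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        try:
            resolver(hostname, None)
        except OSError:
            # Failed lookups (socket.gaierror is an OSError) are still
            # timed, since a slow failure is what we are looking for.
            pass
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    for attempt, seconds in enumerate(time_lookups("service.com"), start=1):
        print(f"lookup {attempt}: {seconds:.3f}s")
```

If the cache is working, later attempts should generally be much faster than the first one; attempts that sit near a whole number of seconds would hint at resolver timeouts.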
Re: DNS timeout
Posted: Mon Mar 09, 2015 4:11 pm
by tgriep
You may want to adjust the dnsmasq settings dns-forward-max and cache-size, and test them in your environment.
Did you change the /etc/resolv.conf file to only have the following in it?