Nagios intermittently nulls all services

Post by ashykneecaps »

Hi,

We're using Nagios Core 3.5.1 and every once in a while we experience a terrible bit of confusion.
All of the services on every host return "(null)" without any rhyme or reason. An occasional "No address associated with hostname" also appears.

Here's a small snippet of our service history as the above was happening:
[2015-05-25 14:17:42] SERVICE ALERT: hosta;HTTPS;CRITICAL;SOFT;1;No address associated with hostname
[2015-05-25 14:17:42] SERVICE ALERT: hostb;Swap status Quiet;WARNING;SOFT;3;(null)
[2015-05-25 14:17:42] SERVICE ALERT: hostc;Swap status;WARNING;SOFT;1;(null)
[2015-05-25 14:17:42] SERVICE ALERT: hostd;Ping some.site.com;WARNING;SOFT;1;(null)
[2015-05-25 14:17:42] SERVICE ALERT: hoste;Current Users;WARNING;SOFT;1;(null)
About 10-20 minutes later the services all start to show signs of recovery and we get a mixture of warning and recovery notifications. A little while after that, everything goes back to being okay.

This happens every few months, without any changes to our configs.
The logs do not show anything beyond what is already shown in the plugin output. Also, running the affected checks manually, either through check_nrpe or by executing the plugin directly as the nagios user, returns sane values.

We would like to know:
A. What causes this, or how can we find out what causes it?
and
B. If this is a known issue, what are the ways to deal with it?

Like I said, the logs do not show anything useful and what adds to the confusion is that when we execute any of the services experiencing issues, they return OK values. We have yet to be able to manually reproduce a "(null)" output.

Does anyone have any advice regarding what I can do next time this happens to try and find the cause and implement a fix? Any and all tips are welcome.

Re: Nagios intermittently nulls all services

Post by Box293 »

Do your host objects have IP addresses or DNS names for the address directive? Does the problem only affect hosts defined with a DNS name, or does it affect all hosts, including those defined with an IP address?
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.

Re: Nagios intermittently nulls all services

Post by ashykneecaps »

Hi Box293,

Thanks for your response.
All our hosts use DNS names for their address directive.
If it is a DNS issue that keeps popping up, it would make sense since we use Digital Ocean quite a bit and occasionally they do give us a bit of grief relating to this.

But on second thought, this would then show up when the services are manually executed and we would get the "(null)" ourselves.
Digital Ocean status https://status.digitalocean.com/ also shows no hiccups during the time (2015-05-25) we experienced our issue, although not sure how reliable that status thing is.

Without manually running the services, how can I confirm whether or not it is a DNS issue? Is there a best practice or sure-fire method?
Besides DNS issues, is there anything else that could be causing this?

Re: Nagios intermittently nulls all services

Post by jdalrymple »

ashykneecaps wrote:
[2015-05-25 14:17:42] SERVICE ALERT: hosta;HTTPS;CRITICAL;SOFT;1;No address associated with hostname
I think this is what has Box293 wondering about DNS - it definitely looks like some sort of a name resolution issue. Do you see a lot of "No address associated with hostname" errors?

Maybe it would behoove you to set up a check_dns service for a few hosts to see how that service behaves when your other checks are failing.
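
For what it's worth, a check_dns service could look roughly like the sketch below. The command name check_dns_lookup, the generic-service template, the localhost host object and hosta.example.com are all placeholders I'm assuming here, not anything taken from your config; $USER1$ is the usual plugin-directory macro from resource.cfg.

```
# Hypothetical command name; check_dns -H takes the name to look up.
define command{
        command_name    check_dns_lookup
        command_line    $USER1$/check_dns -H $ARG1$
        }

# Attached to the Nagios server itself so the lookup uses the same resolver as the other checks.
define service{
        use                     generic-service         ; assumes the stock sample template exists
        host_name               localhost               ; assumes a host object named localhost
        service_description     DNS lookup hosta
        check_command           check_dns_lookup!hosta.example.com
        }
```

If this service goes critical at the same moment everything else turns to "(null)", that points fairly strongly at name resolution on the Nagios server.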

Re: Nagios intermittently nulls all services

Post by Box293 »

The one suggestion I have is that you enable debug logging and find exactly what nagios is doing when this problem occurs.

Code:

sed -i 's/.*debug_level=.*/debug_level=-1/g' /usr/local/nagios/etc/nagios.cfg
service nagios restart

Additional logging is now in /usr/local/nagios/var/nagios.debug

When you are finished, this turns debugging off:

Code:

sed -i 's/.*debug_level=.*/debug_level=0/g' /usr/local/nagios/etc/nagios.cfg
service nagios restart

The problem is, if this is happening every few months, it's going to be hard to track down.
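
Because the window is so rare, something cheap that runs all the time might complement the debug log. Below is a minimal sketch of a resolver logger, assuming getent is available on the Nagios server; the function name check_names, the RESOLVE_CMD variable and any file paths you pass in are made up for illustration. It only records names that fail to resolve, with a timestamp you can line up against the SERVICE ALERT entries.

```shell
#!/bin/sh
# Hypothetical helper: log every hostname that fails to resolve, with a timestamp.
# RESOLVE_CMD is overridable (an assumption of this sketch) so the loop can be
# exercised without real DNS; by default it uses the system resolver via getent.
RESOLVE_CMD="${RESOLVE_CMD:-getent hosts}"

# check_names HOSTFILE LOGFILE
# Reads one hostname per line from HOSTFILE and appends a line to LOGFILE
# for each name whose lookup returns a non-zero exit status.
check_names() {
    hostfile="$1"
    logfile="$2"
    while IFS= read -r name; do
        # Skip blank lines and comment lines.
        case "$name" in ''|\#*) continue ;; esac
        if ! $RESOLVE_CMD "$name" </dev/null >/dev/null 2>&1; then
            printf '%s DNS lookup failed: %s\n' \
                "$(date '+%Y-%m-%d %H:%M:%S')" "$name" >> "$logfile"
        fi
    done < "$hostfile"
}
```

Calling check_names with a file of your host addresses and a log path from a once-a-minute cron job as the nagios user would leave an empty log most of the time, and a burst of entries if resolution really does fall over during the next incident.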

Re: Nagios intermittently nulls all services

Post by ashykneecaps »

Hi guys,

Thanks for your suggestions. I will be implementing both the check_dns service and debug logging to try and find out what exactly is going on.
The only "downside" right now is that it may be a while before I actually get to use my new toys.

I guess now all I really need to do is wait...

I'll report back as soon as I have anything new.
Thanks again for your suggestions, really appreciate it.

Re: Nagios intermittently nulls all services

Post by jdalrymple »

By default the debug max file size is only 1MB. That can be overwhelmed in under a second in a very busy environment with debug level set to -1. Consider increasing it. Here are the debug logging directives that are important - all in main nagios.cfg:

Code:

debug_file=/usr/local/nagios/var/nagios.debug
debug_level=0
debug_verbosity=1
max_debug_file_size=1000000

You can see mine is still the default - 1000000 bytes. Also there is more information to be had if you switch the debug_verbosity to 2. Since you only get one shot every month or so I suggest changing it to get as much data as you possibly can.
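
In the same sed style as earlier in the thread, here is a rough sketch that sets all three directives in one go. The function names set_directive and enable_full_debug are invented for this example, and 100000000 bytes is just an arbitrary larger size I picked, so adjust it to whatever disk space you can spare. It assumes GNU sed (for -i) and directives written as name=value at the start of a line.

```shell
#!/bin/sh
# set_directive FILE NAME VALUE
# Rewrites an existing NAME=... line in FILE, or appends one if none exists.
set_directive() {
    file="$1"; name="$2"; value="$3"
    if grep -q "^${name}=" "$file"; then
        sed -i "s|^${name}=.*|${name}=${value}|" "$file"
    else
        printf '%s=%s\n' "$name" "$value" >> "$file"
    fi
}

# enable_full_debug CFG
# Turns on everything: all debug categories, highest verbosity, bigger file.
enable_full_debug() {
    cfg="$1"
    set_directive "$cfg" debug_level -1
    set_directive "$cfg" debug_verbosity 2
    set_directive "$cfg" max_debug_file_size 100000000   # arbitrary value, an assumption
}
```

You would call enable_full_debug /usr/local/nagios/etc/nagios.cfg and then restart Nagios as shown earlier for the change to take effect.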