(null) attacks

grimsniffer · Post by **grimsniffer** » Thu Aug 29, 2013 3:56 am

Hey there,

This problem has been bothering me for a while and I haven't yet managed to properly debug it or Google up a solution.

The gist of it: every now and then - on average once a week - all services in our Nagios installation return CRITICAL with an output of (null). The only thing that solves this is restarting the Nagios process. Since this is a production environment, there isn't much time for debugging or investigating this issue except after the fact.

More details:

We have several Nagios instances (all of them independent, this isn't a distributed system) with similar configuration but varying degrees of monitoring complexity. The above problem happens most often in our most complex environment, but sometimes also happens in one of the other Nagios instances. The CPU load on the Nagios machine is rather high, averaging around 2.0 (that's after taking the number of CPUs into account, of course). But there's no particular peak around the time this happens or after it's solved by restarting the Nagios process.

This problem almost always starts with a particular service we have on all Linux hosts, which is a custom perl script to check the disk IO. In my attempts to debug this issue, I wrapped the script inside the capture_plugin script in order to try and get some sort of output other than "(null)". Doing this, after the last incident, I received the following*:

Can't locate XSLoader.pm in @INC (@INC contains: /usr/local/nagios/libexec /usr/lib64/perl5/site_perl/5.8.8/x86_64-linux-thread-multi /usr/lib/perl5/site_perl/5.8.8 /usr/lib/perl5/site_perl /usr/lib64/perl5/vendor_perl/5.8.8/x86_64-linux-thread-multi /usr/lib/perl5/vendor_perl/5.8.8 /usr/lib/perl5/vendor_perl /usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi /usr/lib/perl5/5.8.8 .) at /usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi/IO.pm line 5.

An after-the-fact locate for XSLoader.pm returns:
/usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi/XSLoader.pm
...Which is obviously within @INC.

Needless to say, the mount is local.
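One way to confirm this at the moment of failure, rather than after the fact, is to test every @INC directory for the file as the user Nagios runs its checks under. The following is only a sketch: `check_module_paths` is a hypothetical helper name, not part of Nagios or capture_plugin.

```shell
#!/bin/sh
# Hypothetical helper (an illustration, not part of Nagios or capture_plugin).
# Usage: check_module_paths MODULE_FILE DIR...
# Prints one line per directory: readable, unreadable, or missing.
check_module_paths() {
    module=$1
    shift
    for dir in "$@"; do
        path="$dir/$module"
        if [ -r "$path" ]; then
            echo "readable $path"
        elif [ -e "$path" ]; then
            # The file exists but the current user cannot read it.
            echo "unreadable $path"
        else
            # Not present, or not visible to the current user.
            echo "missing $path"
        fi
    done
}
```

It could be called from inside the wrapped command while the problem is occurring, for example `check_module_paths XSLoader.pm $(perl -e 'print "@INC"')`, so that the result reflects what the check process itself can see rather than what an interactive shell sees later.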

At this point, I'll also mention that the Nagios machine is a VMware virtual machine - however, nothing unusual seems to happen on the machine around the time of these incidents.

/var/log/messages shows nothing aside from the Nagios (null) errors.

Additional info that might help:
We've got several "monitor generators", running as independent services once a day and generating monitors according to dynamic external information. These scripts verify and reload the Nagios configuration. On a hunch that they might have something to do with this problem, I consolidated them all into one script - and the frequency of these "(null)" attacks decreased. It hasn't stopped though.

Nagios Core version: Version 3.4.1
Kernel: 2.6.18-308.16.1.el5 #1 SMP Tue Oct 2 22:01:43 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux
Distribution: CentOS release 5.8 (Final)

I'd be glad to provide any more information - and would very much appreciate any ideas.

Thanks!

*Oddly enough, after I wrapped the capture_plugin around the script, I get this output in Nagios as well in place of the (null) - all other services still return (null), however.

sreinhardt · Post by **sreinhardt** » Thu Aug 29, 2013 11:22 am

You are seeing this because Nagios is set to accept anything from standard out. Since this plugin is multi-threaded, or at least includes a module that is, could it be exhausting the number of threads or network connections available on the system? Restarting the Nagios process would kill all of the processes it is currently running. Are you seeing anything in /var/log/messages related to resource exhaustion or processes hanging?

grimsniffer · Post by **grimsniffer** » Sun Sep 01, 2013 6:32 am

Interesting point.
I just added a service to monitor this. I'll wait for the next time this happens to correlate, and update here.

Thanks!
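For anyone wanting to track the same thing, a thread-count check could look roughly like the sketch below. The function names and thresholds are assumptions for illustration, not the service actually added here; `ps -eL` is the procps way of listing one line per thread on Linux.

```shell
#!/bin/sh
# Count all threads on the system: with procps, "ps -eL" prints one line per
# thread (LWP) and "-o lwp=" suppresses the header line.
count_threads() {
    ps -eL -o lwp= | wc -l | tr -d ' '
}

# Map a thread count to a Nagios status line and return code
# (0 = OK, 1 = WARNING, 2 = CRITICAL). Thresholds are supplied by the caller.
thread_status() {
    count=$1
    warn=$2
    crit=$3
    perfdata="threads=$count;$warn;$crit"
    if [ "$count" -ge "$crit" ]; then
        echo "THREADS CRITICAL - $count threads | $perfdata"
        return 2
    elif [ "$count" -ge "$warn" ]; then
        echo "THREADS WARNING - $count threads | $perfdata"
        return 1
    fi
    echo "THREADS OK - $count threads | $perfdata"
    return 0
}
```

A plugin script built on this would end with something like `thread_status "$(count_threads)" 50000 100000; exit $?`, where both thresholds are placeholder values. The performance data after the pipe is what allows the count to be graphed and lined up against the time of the next incident.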

sreinhardt · Post by **sreinhardt** » Tue Sep 03, 2013 10:37 am

Sounds good, I'll be interested to see what comes up!

grimsniffer · Post by **grimsniffer** » Sun Sep 15, 2013 2:31 am

Okay - it finally happened again. The graph shows a spike of up to ~70k threads, but it's nowhere near the 140k kernel limit.
Might it still be the case? If not, any other ideas? I must admit I'm at a loss here.

sreinhardt · Post by **sreinhardt** » Mon Sep 16, 2013 10:02 am

This is strange. It seems that the total number of threads on your machine slowly but continually increases from 500-1000 to 35-40k, then starts spiking severely. I also note a couple of other, smaller spikes at a few points in the graph. Do you have a cron job, running hourly or so, that might take about 15 minutes to complete? Also, why don't you post your nagios.cfg so we can take a look and be sure nothing strange is configured there?

grimsniffer · Post by **grimsniffer** » Tue Sep 17, 2013 3:39 am

It happened again in a different environment - one in which I could afford a couple hours of downtime before restarting. I had a chance to study this a little, and just got more confused in the process.

It turns out this issue does not happen for all services. The common ground I could find for all of them is that these are services that read and/or write to/from the local HD in one way or another.
I have quite a few of these plugins (for example, plugins that check the DISK IO of a remote machine, save the result, check it again five minutes later and compare the result in order to produce the IO/sec).
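The save-and-compare pattern these plugins follow can be sketched as below. This is not the actual plugin: `io_rate`, the state-file format, and the use of return code 3 (Nagios UNKNOWN) on the first run are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical sketch of a rate calculation backed by a local state file.
# Usage: io_rate STATE_FILE CURRENT_COUNTER NOW_EPOCH_SECONDS
# Prints the integer rate (counter units per second) since the previous
# sample, or returns 3 when no usable previous sample exists.
io_rate() {
    state_file=$1
    current=$2
    now=$3
    prev_count=
    prev_time=
    if [ -r "$state_file" ]; then
        # State file format (assumed): "<counter> <epoch seconds>"
        read -r prev_count prev_time < "$state_file" || true
    fi
    # Save the new sample via a temp file and rename, so a concurrent
    # reader never sees a half-written state file.
    tmp="$state_file.$$"
    printf '%s %s\n' "$current" "$now" > "$tmp" && mv "$tmp" "$state_file"
    if [ -z "$prev_count" ] || [ -z "$prev_time" ] || [ "$now" -le "$prev_time" ]; then
        echo "UNKNOWN - no previous sample to compare against"
        return 3
    fi
    echo $(( (current - prev_count) / (now - prev_time) ))
}
```

Every run both reads and rewrites a file on the local disk, which is the property all of the affected services have in common.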

The behavior of Nagios right around this issue is downright odd:
If I reschedule these services, they reschedule properly but return the same result.
If I manually run the exact command these services are running, I get proper output.
If I try to acknowledge these services, I do not see the EXTERNAL_COMMAND in SYSLOG or the Nagios log, and the problem isn't acknowledged. Same goes for a service comment.
Submitting a passive check DOES work, however upon rescheduling - I get the (null) response.

The number of threads on this machine during this whole time was ~3K - so I don't think that's the issue after all.

If needed, I'll post the contents of my nagios.cfg - but before that: does the above behavior ring any bells by any chance?

grimsniffer · Post by **grimsniffer** » Tue Sep 17, 2013 4:59 am

More info:

Reloading Nagios apparently doesn't solve the problem (only restarting does), so I was able to do a little debugging:
Adding the "capture_plugin" to the relevant service commands changes the output from (null) to:

Can't locate Module/Runtime.pm in @INC (@INC contains: /usr/local/nagios/libexec /usr/lib64/perl5/site_perl/5.8.8/x86_64-linux-thread-multi /usr/lib/perl5/site_perl/5.8.8 /usr/lib/perl5/site_perl /usr/lib64/perl5/vendor_perl/5.8.8/x86_64-linux-thread-multi /usr/lib/perl5/vendor_perl/5.8.8 /usr/lib/perl5/vendor_perl /usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi /usr/lib/perl5/5.8.8 .) at /usr/lib/perl5/site_perl/5.8.8/Module/Implementation.pm line 9.
BEGIN failed--compilation aborted at /usr/lib/perl5/site_perl/5.8.8/Module/Implementation.pm line 9.
Compilation failed in require at /usr/lib/perl5/site_perl/5.8.8/Params/Validate.pm line 12.
BEGIN failed--compilation aborted at /usr/lib/perl5/site_perl/5.8.8/Params/Validate.pm line 12.
Compilation failed in require at /usr/lib/perl5/site_perl/5.8.8/Nagios/Plugin/Functions.pm line 11.
BEGIN failed--compilation aborted at /usr/lib/perl5/site_perl/5.8.8/Nagios/Plugin/Functions.pm line 11.
Compilation failed in require at /usr/lib/perl5/site_perl/5.8.8/Nagios/Plugin.pm line 4.
BEGIN failed--compilation aborted at /usr/lib/perl5/site_perl/5.8.8/Nagios/Plugin.pm line 4.
Compilation failed in require at /usr/local/nagios/libexec/SearsMon/NagiosWrapper.pm line 7.
BEGIN failed--compilation aborted at /usr/local/nagios/libexec/SearsMon/NagiosWrapper.pm line 7.
Compilation failed in require at /usr/local/nagios/libexec/check_win_cpu_vm.pl line 7.
BEGIN failed--compilation aborted at /usr/local/nagios/libexec/check_win_cpu_vm.pl line 7.

The .pm file is in fact on the HD (a simple "locate Runtime.pm" finds it without issue); going by the trace above, it is pulled in as a dependency of the Nagios::Plugin Perl package.
If I comment out the Nagios::Plugin requirement and just have the plugin exit with "OK", it does so perfectly fine.
Other service checks that are not exhibiting the "(null)" behavior also use Nagios::Plugin (and so, naturally, the Runtime.pm module), and they are working perfectly fine.

If I didn't know better I'd think I'm going insane.

abrist · Post by **abrist** » Tue Sep 17, 2013 1:23 pm

I have a few guesses.
1. Perl issues? Do any non-perl checks return null as well?
2. Any segfaults?

    grep seg /var/log/messages

3. There were a number of small bug fixes from 3.4.1 to 3.5.x. Have you considered upgrading?

sreinhardt · Post by **sreinhardt** » Tue Sep 17, 2013 1:34 pm

This really looks like a disk I/O issue or, if the machine is connected to a SAN, possibly a connection bottleneck. I may be way off here, but I would be very interested to see whether moving your check results, and possibly performance data, to a RAM disk resolves this.
http://assets.nagios.com/downloads/nagi ... giosXI.pdf

Nagios Support Forum
