(null) attacks
Posted: Thu Aug 29, 2013 3:56 am
Hey there,
This problem has been bothering me for a while and I haven't yet managed to properly debug it or Google up a solution.
The gist of it: every now and then - on average once a week - all services in our Nagios installation return CRITICAL with an output of (null). The only thing that solves this is restarting the Nagios process. Since this is a production environment, there isn't much time for debugging or investigating this issue except after the fact.
More details:
We have several Nagios instances (all of them independent, this isn't a distributed system) with similar configuration but varying degrees of monitoring complexity. The above problem happens most often in our most complex environment, but sometimes also happens in one of the other Nagios instances. The CPU load on the Nagios machine is rather high, averaging around 2.0 (that's after taking the number of CPUs into account, of course). But there's no particular peak around the time this happens or after it's solved by restarting the Nagios process.
This problem almost always starts with a particular service we have on all Linux hosts, which is a custom Perl script to check the disk IO. In my attempts to debug this issue, I wrapped the script inside the capture_plugin script in order to try and get some sort of output other than "(null)". Doing this, after the last incident, I received the following*:
Can't locate XSLoader.pm in @INC (@INC contains: /usr/local/nagios/libexec /usr/lib64/perl5/site_perl/5.8.8/x86_64-linux-thread-multi /usr/lib/perl5/site_perl/5.8.8 /usr/lib/perl5/site_perl /usr/lib64/perl5/vendor_perl/5.8.8/x86_64-linux-thread-multi /usr/lib/perl5/vendor_perl/5.8.8 /usr/lib/perl5/vendor_perl /usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi /usr/lib/perl5/5.8.8 .) at /usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi/IO.pm line 5.
An after-the-fact locate for XSLoader.pm returns:
/usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi/XSLoader.pm
...which is obviously within @INC.
Needless to say, the mount is local.
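To find out why the file can't be opened while an incident is in progress (rather than afterwards, when locate finds it without trouble), something like the following could be run from the same environment as the failing check. This is only a rough sketch, assuming Python 3 is available on the machine; probe_module is a made-up helper name and the directory list is a subset copied from the @INC output above. It tries to open the module in each directory and records the errno name of any failure, so a missing file (ENOENT) can be told apart from other causes such as permissions or resource limits.

```python
import errno
import os


def probe_module(module_file, inc_dirs):
    """Try to open module_file in each directory and record the outcome:
    'ok' on success, otherwise the errno name of the failure."""
    results = []
    for directory in inc_dirs:
        path = os.path.join(directory, module_file)
        try:
            with open(path, "rb"):
                results.append((path, "ok"))
        except OSError as exc:
            # errorcode maps numeric errno values to names such as ENOENT
            name = errno.errorcode.get(exc.errno, str(exc.errno))
            results.append((path, name))
    return results


if __name__ == "__main__":
    # Subset of the directories listed in the @INC error message above.
    inc_dirs = [
        "/usr/local/nagios/libexec",
        "/usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi",
        "/usr/lib/perl5/5.8.8",
    ]
    for path, status in probe_module("XSLoader.pm", inc_dirs):
        print("%-8s %s" % (status, path))
```

If the directory that really holds XSLoader.pm reports anything other than "ok" during an incident, the errno name should narrow down where to look next.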
At this point, I'll also mention the Nagios machine is a VMware virtual machine - however, the machine doesn't seem to have any special incidents during this time.
/var/log/messages shows nothing aside from the Nagios (null) errors.
Additional info that might help:
We've got several "monitor generators", running as independent services once a day and generating monitors according to dynamic external information. These scripts verify and reload the Nagios configuration. On a hunch that they might have something to do with this problem, I consolidated them all into one script - and the frequency of these "(null)" attacks decreased. It hasn't stopped though.
Nagios Core version: 3.4.1
Kernel: 2.6.18-308.16.1.el5 #1 SMP Tue Oct 2 22:01:43 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux
Distribution: CentOS release 5.8 (Final)
I'd be glad to provide any more information - and would very much appreciate any ideas.
Thanks!
*Oddly enough, after I wrapped the capture_plugin around the script, I get this output in Nagios as well in place of the (null) - all other services still return (null), however.
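For anyone unfamiliar with the wrapping approach mentioned above, here is a minimal sketch of the idea. It is not the capture_plugin script itself, just an illustration assuming Python 3; run_and_capture and the log path are hypothetical names. It runs a check command, appends the exit code, stdout and stderr to a log file, and hands the exit code and stdout back so a wrapper can pass them on to Nagios unchanged.

```python
import subprocess
import time


def run_and_capture(command, log_path):
    """Run a plugin command, append its exit code, stdout and stderr to
    log_path, and return (exit_code, stdout) for the caller to pass on."""
    proc = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output as text
    )
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    with open(log_path, "a") as log:
        log.write("%s cmd=%r rc=%d\n" % (stamp, command, proc.returncode))
        log.write("stdout: %s\n" % proc.stdout.rstrip())
        log.write("stderr: %s\n" % proc.stderr.rstrip())
    return proc.returncode, proc.stdout
```

A wrapper would call it with the real check command as a list of arguments, print the returned stdout, and exit with the returned code, so Nagios sees the same result while stderr ends up in the log.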