Nagios Crash/Performance Issues
Posted: Mon Oct 14, 2013 12:04 pm
Last weekend, we had to shut down a large number of our monitored hosts. Shortly after these hosts went into the "down" state, Nagios crashed in an interesting manner, and I'd like some help identifying how we can prevent this in the future.
During the time that Nagios was not functioning, there was a huge increase in log file messages like these:
Oct 12 10:11:10 monitor01 nagios: Warning: The check of service 'Perf - NIC win' on host 'xa-fe02' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service...
We also do not have any performance data for that time period for ANY monitored host, even if it was still online.
We were able to fix Nagios by a full reboot of the system. Simply restarting the Nagios process through the GUI did not get it out of its failure state. The hosts were back online by this point, but Nagios did not recover on its own.
What can we tune to prevent this problem in the future? Any recommendations on how to monitor for this failure state? Monitoring that the Nagios process is running is obviously not enough...
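As a starting point for the monitoring question, here is a rough sketch of one possible approach, not a tested setup: count the orphaned-check warnings in the syslog output and alert when there are too many. The regular expression is based only on the log line quoted above. The function names and the threshold of 50 are hypothetical and would need tuning.

```python
import re

# Pattern derived from the warning quoted in this post; other Nagios
# versions may word the message differently (assumption).
ORPHAN_RE = re.compile(
    r"The check of service '(?P<service>[^']*)' on host '(?P<host>[^']*)' "
    r"looks like it was orphaned"
)


def orphaned_checks(lines):
    """Return a list of (host, service) pairs for each orphaned-check warning."""
    found = []
    for line in lines:
        match = ORPHAN_RE.search(line)
        if match:
            found.append((match.group("host"), match.group("service")))
    return found


def too_many_orphans(lines, threshold=50):
    """Return True when the number of orphan warnings reaches the threshold.

    The default threshold is a placeholder, not a recommended value.
    """
    return len(orphaned_checks(lines)) >= threshold
```

This could be fed the most recent lines of whatever file syslog writes the nagios messages to, and run from cron or from a second monitoring host so that it does not depend on the Nagios scheduler it is meant to watch.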
Here are our ulimit settings:
[root@monitor01 security]# ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 256236
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 256236
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Thanks!