Nagios Crash/Performance Issues

Posted: Mon Oct 14, 2013 12:04 pm
by uidaho
Last weekend, we had to shut down a large number of our monitored hosts. Shortly after these hosts went into the "down" state, Nagios crashed in an interesting manner, and I'd like some help identifying how we can prevent this in the future.

During the time that Nagios was not functioning, there was a huge increase in log file messages like these:
Oct 12 10:11:10 monitor01 nagios: Warning: The check of service 'Perf - NIC win' on host 'xa-fe02' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service...

We also do not have any performance data for that time period for ANY monitored host, even if it was still online.


We were able to fix Nagios by a full reboot of the system. Simply restarting the Nagios process through the GUI did not get it out of its failure state. The hosts were back online by this point, but Nagios did not recover on its own.

What can we tune to prevent this problem in the future? Any recommendations on how to monitor for this failure state? Monitoring that the Nagios process is running is obviously not enough...
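One way to watch for this failure state, sketched here as an assumption rather than a tested recipe, is to count the "orphaned" warnings per minute in syslog and alert when the rate spikes. The function names and the threshold are hypothetical; the regular expression is modeled only on the log line quoted above.

```python
import re
from collections import Counter

# Modeled on the syslog line quoted in this post, e.g.
# "Oct 12 10:11:10 monitor01 nagios: Warning: The check of service ...
#  looks like it was orphaned (results never came back)."
ORPHAN_RE = re.compile(
    r"^(?P<minute>\w{3}\s+\d+\s+\d{2}:\d{2}):\d{2}\s+\S+\s+nagios: "
    r"Warning: The check of .* looks like it was orphaned"
)


def count_orphans_per_minute(lines):
    """Return a Counter mapping 'Mon DD HH:MM' to the number of orphan warnings."""
    counts = Counter()
    for line in lines:
        match = ORPHAN_RE.match(line)
        if match:
            counts[match.group("minute")] += 1
    return counts


def minutes_over_threshold(counts, threshold):
    """Return the minutes (in log order) with at least `threshold` warnings."""
    return [minute for minute, n in counts.items() if n >= threshold]
```

To use it, open whichever file receives the nagios syslog messages on your system, pass the file object to count_orphans_per_minute(), and raise an alert if minutes_over_threshold() returns anything. The threshold that separates normal noise from a stuck engine would have to be found by experiment.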

nagios-engine-status-101213.jpg

Here are our ulimit settings:
[root@monitor01 security]# ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 256236
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 256236
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
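Note that the output above shows the limits of root's shell, which are not necessarily the limits the running Nagios daemon inherited. On Linux, the per-process values can be read from /proc/<pid>/limits. The sketch below parses that file and compares the soft "open files" limit with the number of descriptors in use. The helper names and the 20% headroom figure are hypothetical, and whether the 1024 limit played any role in the crash is not established here.

```python
import os


def parse_max_open_files(limits_text):
    """Extract (soft, hard) 'Max open files' from /proc/<pid>/limits text.

    A value of None means 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            fields = line.split()
            # fields: ['Max', 'open', 'files', soft, hard, 'files']
            soft, hard = fields[3], fields[4]
            return tuple(None if v == "unlimited" else int(v) for v in (soft, hard))
    raise ValueError("no 'Max open files' line found")


def count_open_fds(pid):
    """Count open file descriptors of a process (Linux only, needs permission)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def has_fd_headroom(open_fds, soft_limit, min_free_fraction=0.2):
    """True if at least min_free_fraction of the soft limit is still unused."""
    if soft_limit is None:
        return True
    return soft_limit - open_fds >= soft_limit * min_free_fraction
```

Given the PID of the Nagios process, read /proc/<pid>/limits, pass its text to parse_max_open_files(), and feed the soft limit together with count_open_fds(pid) into has_fd_headroom().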


Thanks!

Re: Nagios Crash/Performance Issues

Posted: Mon Oct 14, 2013 12:28 pm
by slansing
Your system may have locked up because too many checks were scheduled at the same time. That, plus the results coming in from them, could have really bogged you down. Have you taken a look at this page?

http://support.nagios.com/wiki/index.ph ... g_Orphaned

Re: Nagios Crash/Performance Issues

Posted: Mon Oct 14, 2013 3:32 pm
by uidaho
I applied the changes. The limits.conf changes made no appreciable difference. I'll watch to see whether the nagios.cfg changes help.

I have realized this is a totally different problem from our previous post with XI issues. I will go ahead and update that ticket instead. This problem can be considered resolved, for now.

Re: Nagios Crash/Performance Issues

Posted: Mon Oct 14, 2013 3:41 pm
by slansing
Sounds good, we will keep this thread open, let us know if this returns.

Re: Nagios Crash/Performance Issues

Posted: Mon Nov 18, 2013 12:06 pm
by uidaho
It did return, with the same "orphaned services" error messages. However, this time Nagios was able to recover without a restart. Nagios got backed up while system backups were taking place, and the system was slow for about thirty minutes. Nagios eventually caught up once the backups completed. Nagios also got into this state when we were patching about 20 servers, which were intermittently offline and online.

What else can I look at to help identify the problem?

I'm concerned that a slight slowdown of the system is pushing Nagios over the threshold. Could we have too many checks running too frequently?
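A rough way to reason about that question is to estimate how many checks must be in flight at once. The sketch below applies Little's law (average number in flight = arrival rate x average duration). All input numbers are hypothetical placeholders, not figures from this installation.

```python
def checks_per_second(num_checks, interval_minutes):
    """Average rate at which checks must start to keep the schedule."""
    return num_checks / (interval_minutes * 60.0)


def required_concurrency(rate_per_second, avg_check_seconds):
    """Little's law: average number of checks running simultaneously."""
    return rate_per_second * avg_check_seconds


# Hypothetical example: 3000 service checks on a 5-minute interval.
rate = checks_per_second(3000, 5)
# With checks that normally finish in 2 seconds:
normal = required_concurrency(rate, 2.0)
# With checks that run into a 30-second timeout because hosts are down:
degraded = required_concurrency(rate, 30.0)
```

In this made-up example, the same schedule needs fifteen times as many concurrent checks once every check runs into its timeout. That would be consistent with trouble appearing only when hosts are down or the system is slow, though it is only an illustration.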

Re: Nagios Crash/Performance Issues

Posted: Mon Nov 18, 2013 12:14 pm
by abrist
Looking at your numbers in the screenshots above, I don't think you are doing too many checks. What are the hardware/provisioned specs of the XI server?

Re: Nagios Crash/Performance Issues

Posted: Mon Nov 18, 2013 8:09 pm
by uidaho
It is a Dell R720. The system has 32GB memory, 2 Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz (8 cores each, 32 virtual CPUs after hyperthreading is accounted for). The nagios volume is on a dedicated four disk RAID10 volume.

Surely that is a big enough server for this deployment?

Re: Nagios Crash/Performance Issues

Posted: Tue Nov 19, 2013 10:04 am
by slansing
Do you still have that ticket open? Is this being handled there?

Re: Nagios Crash/Performance Issues

Posted: Tue Nov 19, 2013 11:21 am
by uidaho
I don't recall if we made this an official support ticket. I'll create one now.

Re: Nagios Crash/Performance Issues

Posted: Tue Nov 19, 2013 11:55 am
by abrist
Great. See you soon.