Nagios Crash/Performance Issues

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
uidaho
Posts: 89
Joined: Tue Feb 12, 2013 11:58 am

Nagios Crash/Performance Issues

Post by uidaho »

Last weekend, we had to shut down a large number of our monitored hosts. Shortly after these hosts went into the "down" state, Nagios crashed in an interesting manner, and I'd like some help identifying how we can prevent this in the future.

During the time that Nagios was not functioning, there was a huge increase in log file messages like these:
Oct 12 10:11:10 monitor01 nagios: Warning: The check of service 'Perf - NIC win' on host 'xa-fe02' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service...

We also do not have any performance data for that time period for ANY monitored host, even if it was still online.


We were only able to recover Nagios with a full reboot of the system. Simply restarting the Nagios process through the GUI did not get it out of its failure state. The hosts were back online by this point, but Nagios did not recover on its own.

What can we tune to prevent this problem in the future? Any recommendations on how to monitor for this failure state? Monitoring that the Nagios process is running is obviously not enough.
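
One possible way to watch for this failure state, offered only as a rough sketch: count the "orphaned" warnings in the log and raise an alarm when they spike. The regular expression below is modelled on the single warning quoted above, and the names count_orphans and orphan_alert, as well as the threshold of 50, are made up for illustration and would need tuning for a real deployment.

```python
import re
from collections import Counter

# Pattern modelled on the one warning line quoted in this post (an assumption:
# other Nagios versions may word the message differently).
ORPHAN_RE = re.compile(
    r"Warning: The check of service '(?P<service>[^']+)' "
    r"on host '(?P<host>[^']+)' looks like it was orphaned"
)


def count_orphans(lines):
    """Count orphaned-check warnings per (host, service) pair."""
    hits = Counter()
    for line in lines:
        match = ORPHAN_RE.search(line)
        if match:
            hits[(match.group("host"), match.group("service"))] += 1
    return hits


def orphan_alert(lines, threshold=50):
    """Return (alarm, total); alarm is True once total reaches the threshold.

    The default threshold is a hypothetical value, not a Nagios recommendation.
    """
    total = sum(count_orphans(lines).values())
    return total >= threshold, total
```

Feeding this the last few minutes of the log on a schedule would flag the state described here, where the Nagios process is still running but check results have stopped coming back.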

nagios-engine-status-101213.jpg

Here are our ulimit settings:
[root@monitor01 security]# ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 256236
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 256236
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
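
Resource limits are per process, so the values shown by a root shell are not necessarily the ones the Nagios process is running with. As a small sketch, the following reads the open-files limit of whatever process runs it, using Python's standard resource module. The minimum of 4096 used in is_low is a hypothetical value chosen for illustration, not a figure from Nagios documentation.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-files limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def describe(limit):
    """Render a limit the way ulimit does, with 'unlimited' for no limit."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


def is_low(soft, minimum=4096):
    """True if the soft limit is below a chosen minimum (hypothetical value)."""
    return soft != resource.RLIM_INFINITY and soft < minimum


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print("open files: soft=%s hard=%s" % (describe(soft), describe(hard)))
    if is_low(soft):
        print("soft limit is below the chosen minimum")
```

Running it as the same user that runs Nagios gives a closer picture than a root shell does. On Linux, /proc/<pid>/limits shows the limits of an already running process.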


Thanks!
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: Nagios Crash/Performance Issues

Post by slansing »

Your system may have locked up because too many checks were scheduled at the same time. That, combined with the results coming back from them, could have really bogged you down. Have you taken a look at this page?

http://support.nagios.com/wiki/index.ph ... g_Orphaned
uidaho

Re: Nagios Crash/Performance Issues

Post by uidaho »

I applied the changes. The limits.conf changes made no appreciable difference. I'll watch to see whether the nagios.cfg changes help.

I have realized this is a totally different problem from our previous post with XI issues. I will go ahead and update that ticket instead. This problem can be considered resolved, for now.
slansing

Re: Nagios Crash/Performance Issues

Post by slansing »

Sounds good. We will keep this thread open; let us know if this returns.
uidaho

Re: Nagios Crash/Performance Issues

Post by uidaho »

It did return, with the same "orphaned services" error messages. However, this time Nagios was able to recover without a restart. Nagios got backed up while system backups were taking place, and the system was slow for about thirty minutes. Nagios eventually caught up once the system backups completed. Nagios also got into this state when we were patching about 20 servers, which were intermittently offline and online.

What else can I look at to help identify the problem?

I'm concerned that a slight slowdown of the system is pushing Nagios over the threshold. Could we have too many checks running too frequently?
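
A back-of-the-envelope way to reason about that question, as a sketch only: estimate how many checks start per second, then how many are in flight at once for a given average check duration. Every number below is hypothetical, since the actual check counts are only in the screenshot, and the function names are made up for illustration.

```python
def checks_per_second(num_checks, interval_minutes):
    """Average rate at which checks start when spread evenly over the interval."""
    return num_checks / (interval_minutes * 60)


def checks_in_flight(rate_per_second, avg_duration_seconds):
    """Average number of checks running at once (rate multiplied by duration)."""
    return rate_per_second * avg_duration_seconds


# Hypothetical deployment: 6000 service checks on a 5 minute interval.
rate = checks_per_second(6000, 5)

# Healthy targets: assume each check returns in about 2 seconds.
normal = checks_in_flight(rate, 2)

# Unreachable or slow targets: assume each check runs to a 60 second timeout.
degraded = checks_in_flight(rate, 60)

print(rate, normal, degraded)
```

The point of the sketch is that the same schedule can need far more concurrent check processes when targets are down or slow, because each check then runs until it times out instead of returning quickly.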
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: Nagios Crash/Performance Issues

Post by abrist »

Looking at your numbers in the screenshots above, I don't think you are doing too many checks. What are the hardware/provisioned specs of the XI server?
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
uidaho

Re: Nagios Crash/Performance Issues

Post by uidaho »

It is a Dell R720. The system has 32 GB of memory and two Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz processors (8 cores each, 32 virtual CPUs once hyperthreading is accounted for). The Nagios volume is on a dedicated four-disk RAID10 volume.

Surely that is a big enough server for this deployment?
slansing

Re: Nagios Crash/Performance Issues

Post by slansing »

Do you still have that ticket open? Is this being handled there?
uidaho

Re: Nagios Crash/Performance Issues

Post by uidaho »

I don't recall if we made this an official support ticket. I'll create one now.
abrist

Re: Nagios Crash/Performance Issues

Post by abrist »

Great. See you soon.