Last weekend, we had to shut down a large number of our monitored hosts. Shortly after these hosts went into the "down" state, Nagios crashed in an interesting manner, and I'd like some help identifying how we can prevent this in the future.
During the time that Nagios was not functioning, there was a huge increase in log file messages like these:
Oct 12 10:11:10 monitor01 nagios: Warning: The check of service 'Perf - NIC win' on host 'xa-fe02' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service...
We also do not have any performance data for that time period for ANY monitored host, even if it was still online.
We were able to fix Nagios by a full reboot of the system. Simply restarting the Nagios process through the GUI did not get it out of its failure state. The hosts were back online by this point, but Nagios did not recover on its own.
What can we tune to prevent this problem in the future? Any recommendations on how to monitor for this failure state? Monitoring that the Nagios process is running is obviously not enough...
Here are our ulimit settings:
[root@monitor01 security]# ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 256236
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 256236
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Thanks!
Nagios Crash/Performance Issues
-
slansing
- Posts: 7698
- Joined: Mon Apr 23, 2012 4:28 pm
- Location: Travelling through time and space...
Re: Nagios Crash/Performance Issues
Your system may have locked up because too many checks were scheduled at the same time. Those checks, plus the results coming back from them, could have really bogged you down. Have you taken a look at this page?
http://support.nagios.com/wiki/index.ph ... g_Orphaned
Re: Nagios Crash/Performance Issues
I applied the changes. There was no appreciable difference with the limits.conf changes. I'll watch to see if the nagios.cfg changes will work.
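When limits.conf changes appear to make no difference, one thing that may be worth confirming is whether the running Nagios process actually picked them up, since the `ulimit -a` output of a root shell is not necessarily what the daemon is running with. The following is only a minimal sketch, assuming a Linux host where `/proc/<pid>/limits` is available; the function names are hypothetical and the PID has to be supplied by you.

```python
def open_files_limit(limits_text):
    """Return (soft, hard) for 'Max open files' from /proc/<pid>/limits text.

    Values are returned as strings because either one may be 'unlimited'.
    Returns None if the row is not present.
    """
    label = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(label):
            # The remainder of the row is: soft limit, hard limit, units.
            soft, hard = line[len(label):].split()[:2]
            return soft, hard
    return None


def read_limits(pid):
    """Read the limits table for a running process (Linux only)."""
    with open(f"/proc/{pid}/limits") as fh:
        return fh.read()
```

For example, `open_files_limit(read_limits(pid))` with the PID of the Nagios daemon would show the effective open-files limit of that process, which can then be compared against the value set in limits.conf.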
I have realized this is a totally different problem from our previous post with XI issues. I will go ahead and update that ticket instead. This problem can be considered resolved, for now.
-
slansing
- Posts: 7698
- Joined: Mon Apr 23, 2012 4:28 pm
- Location: Travelling through time and space...
Re: Nagios Crash/Performance Issues
Sounds good, we will keep this thread open, let us know if this returns.
Re: Nagios Crash/Performance Issues
It did return with the same "orphaned services" error messages. However, this time Nagios was able to recover without a restart. Nagios got backed up while system backups were taking place, and the system was slow for about thirty minutes. Nagios eventually caught up once the system backups completed. Nagios also got into this state when we were patching about 20 servers, which were intermittently offline and online.
What else can I look at to help identify the problem?
I'm concerned that a slight slowdown of the system is pushing Nagios over the threshold. Could we have too many checks running too frequently?
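One way to see how the slowdowns line up with backups or patching is to count the "orphaned" warnings per minute in syslog. The following is a rough sketch, not an official Nagios tool: the regular expression is modelled only on the log line quoted earlier in this thread, the function name is hypothetical, and other syslog formats or Nagios versions may need a different pattern.

```python
import re
from collections import Counter

# Pattern based on the single syslog line quoted in this thread (assumption):
#   "Oct 12 10:11:10 monitor01 nagios: Warning: The check of service '...'
#    on host '...' looks like it was orphaned ..."
# An optional "[pid]" after the program name is tolerated.
ORPHAN_RE = re.compile(
    r"^(?P<minute>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}):\d{2}\s+\S+\s+"
    r"nagios(?:\[\d+\])?: Warning: The check of service '.*' on host '.*' "
    r"looks like it was orphaned"
)


def orphans_per_minute(lines):
    """Count orphaned-service warnings, keyed by 'Mon DD HH:MM'."""
    counts = Counter()
    for line in lines:
        match = ORPHAN_RE.match(line)
        if match:
            counts[match.group("minute")] += 1
    return counts
```

Passing an open syslog file to `orphans_per_minute` gives one count per minute in log order. A sustained non-zero count could also serve as an alert condition for the "process is running but not keeping up" state, with a threshold chosen for your own environment.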
Re: Nagios Crash/Performance Issues
Looking at your numbers in the screenshots above, I don't think you are doing too many checks. What are the hardware/provisioned specs of the XI server?
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
Re: Nagios Crash/Performance Issues
It is a Dell R720. The system has 32GB memory, 2 Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz (8 cores each, 32 virtual CPUs after hyperthreading is accounted for). The nagios volume is on a dedicated four disk RAID10 volume.
Surely that is a big enough server for this deployment?
-
slansing
- Posts: 7698
- Joined: Mon Apr 23, 2012 4:28 pm
- Location: Travelling through time and space...
Re: Nagios Crash/Performance Issues
Do you still have that ticket open? Is this being handled there?
Re: Nagios Crash/Performance Issues
I don't recall if we made this an official support ticket. I'll create one now.
Re: Nagios Crash/Performance Issues
Great. See you soon.
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.