Page 1 of 2
Nagios 4.1.1 and 4.2.3 cause very high loads on server
Posted: Sun Nov 27, 2016 3:25 pm
by termcap
Hello,
I am currently running Nagios 4.2.3 with 250+ hosts and around 7,000 service checks. The server has 8 cores and 10 GB of RAM, running CentOS 6. Unfortunately I am seeing severe load spikes, with the load average going as high as 50 at times.
I was facing a similar problem on Nagios 4.1.1, where the load would shoot up every 10 to 15 minutes. I upgraded to Nagios 4.2.3 after reading that a similar issue had been resolved. Post-upgrade, things were good for around 6 to 7 hours, but after that I started seeing spikes reaching as high as 50 or 60!
I observe this spike 90% of the time when the check-cisco.pl plugin runs, which is available here ->
https://exchange.nagios.org/directory/P ... st/details
But I cannot blame the plugin for this, because the load remains stable for hours and only starts to spike whenever this plugin appears. Another thing I've noticed is that Nagios tends to run check-cisco.pl in large batches.
I am using a ramdisk for status.dat as well as tmp, and have made sure that I am not getting blocked on writes anywhere. Any pointers on where to look next?
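For reference, my ramdisk setup is roughly the following. The mount point and size are examples from my box; the directive names are the ones from the stock nagios.cfg, but your paths will depend on your install:

```
# /etc/fstab -- tmpfs mount for the Nagios hot files (size is an example)
tmpfs  /usr/local/nagios/var/ramdisk  tmpfs  defaults,size=256m  0 0

# nagios.cfg -- point the frequently written files at the ramdisk
status_file=/usr/local/nagios/var/ramdisk/status.dat
temp_file=/usr/local/nagios/var/ramdisk/nagios.tmp
check_result_path=/usr/local/nagios/var/ramdisk/checkresults
```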
Re: Nagios 4.1.1 and 4.2.3 cause very high loads on server
Posted: Mon Nov 28, 2016 9:59 am
by rkennedy
termcap wrote:I observe this spike 90% of the time when the check-cisco.pl plugin runs, which is available here ->
https://exchange.nagios.org/directory/P ... st/details
But I cannot blame the plugin for this, because the load remains stable for hours and only starts to spike whenever this plugin appears. Another thing I've noticed is that Nagios tends to run check-cisco.pl in large batches.
I believe this is the issue, actually. The check does not run in batches; instead it stays open until a timeout is reached if you have failing checks, which explains why you see it open for a continuous stretch of time.
Are a lot of your check-cisco checks failing, or do they take a while to respond?
Re: Nagios 4.1.1 and 4.2.3 cause very high loads on server
Posted: Mon Nov 28, 2016 12:03 pm
by termcap
rkennedy wrote:I believe this is the issue, actually. The check does not run in batches; instead it stays open until a timeout is reached if you have failing checks, which explains why you see it open for a continuous stretch of time.
Are a lot of your check-cisco checks failing, or do they take a while to respond?
None of the check-cisco.pl checks are failing to run, and the maximum runtime I have seen for them is around 83 ms. I have verified this manually by running them by hand to make sure that none are failing.
In fact, today I changed a lot of my check-cisco.pl checks and ported them to the check_snmp that comes with the install, and I still get load spikes as described above. I have also experimented with commenting out the check-cisco.pl based checks entirely, and I still get the load spikes.
Interestingly, if I use the experimental auto_scheduler option, my setup seems to run smoothly with only a couple of spikes in a 24-hour cycle.
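For what it's worth, this is roughly how I time the checks by hand. The echo below is just a placeholder standing in for a real plugin invocation such as check-cisco.pl; substitute your own plugin and arguments:

```shell
#!/bin/sh
# Time a single check by hand, the way Nagios would run it.
# The echo is a placeholder; substitute the real plugin, e.g.
#   /usr/local/nagios/libexec/check-cisco.pl -H <host> -C <community>
plugin="echo OK - fan speed normal"
start=$(date +%s)
out=$($plugin)
status=$?
end=$(date +%s)
echo "output:  $out"
echo "exit:    $status"
echo "runtime: $((end - start))s"
```

Running this as the nagios user (e.g. via sudo -u nagios) matches the environment the daemon uses more closely than running it as root.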
Re: Nagios 4.1.1 and 4.2.3 cause very high loads on server
Posted: Mon Nov 28, 2016 2:07 pm
by termcap
Is there any more information I could provide from my server logs, or any commands I could run, that would help in debugging this issue?
Re: Nagios 4.1.1 and 4.2.3 cause very high loads on server
Posted: Mon Nov 28, 2016 4:27 pm
by rkennedy
termcap wrote:None of the check-cisco.pl checks are failing to run, and the maximum runtime I have seen for them is around 83 ms. I have verified this manually by running them by hand to make sure that none are failing.
Are you running them as the nagios user? Can you show us the output with time prepended to the command? The only reason the process would stay open is if it wasn't exiting properly.
A few more questions:
- Does it happen at the same time every day, or is it random?
- Is this a VM or a bare metal install?
- Anything running in conjunction with Nagios?
I am interested to see what a cron job running the following script every minute will show us -
Code: Select all
#!/bin/bash
# Log a timestamp, the load averages, the top 10 processes by total
# resident memory, and the top CPU consumers on every run.
log=/tmp/logdata.txt
date >> "$log"
uptime >> "$log"
# Sum RSS per command name, then keep the 10 largest, printed in MB.
ps axo rss,comm \
    | awk '{ rss[$2] += $1 } END { for (p in rss) printf("%d\t%s\n", rss[p], p) }' \
    | sort -n | tail -n 10 | sort -rn \
    | awk '{ $1 /= 1024; printf("%.0fMB\t%s\n", $1, $2) }' >> "$log"
ps -eo pcpu,args --sort=-%cpu | head >> "$log"
echo >> "$log"
Could you set one up to run every minute, and then post the log along with the corresponding nagios.log file for us to review after the issue happens again?
Re: Nagios 4.1.1 and 4.2.3 cause very high loads on server
Posted: Tue Dec 06, 2016 9:48 am
by termcap
Sorry for the delay in responding to this thread. Before posting any further info, I am trying to make sure I have first eliminated all the other causes of high load you mentioned, like slow plugins.
Re: Nagios 4.1.1 and 4.2.3 cause very high loads on server
Posted: Tue Dec 06, 2016 12:42 pm
by dwhitfield
No problem, we'll be here.
4.2.4 should be out shortly too. My understanding is that it will just be security fixes, but it's still worth keeping an eye on.
Re: Nagios 4.1.1 and 4.2.3 cause very high loads on server
Posted: Fri Dec 09, 2016 5:32 am
by termcap
Hi,
So I have made sure that I have no failing or slow plugins. I have followed nagios.log to confirm that I am not having plugins failing or timing out, and I have manually executed the plugins with the time command.
Here is what I have observed, and I feel this may be a scheduling issue: Nagios does not seem to be spreading the checks around.
I run an infinite loop around the command ps aux | grep [c]heck (all my plugins have names of the form check_[string]). Depending on when I run it, the output is either totally quiet, or I see two or three small bunches of around 5 to 10 checks running. Then everything goes quiet, and suddenly a huge bunch appears that scrolls my screen several times; it looks like 60 to 70% of all my checks are running in this one bunch. This is when the load shoots up. Everything then goes quiet until the next small bunches appear, and the cycle continues.
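The loop I use is essentially the following, except that here it samples only a few times and counts the processes instead of printing the full ps lines:

```shell
#!/bin/sh
# Sample how many check processes are currently running, once per second.
# All my plugins have names starting with "check", hence the pattern.
for i in 1 2 3; do
    n=$(ps axo comm | grep -c '^check' || true)
    printf '%s running checks: %s\n' "$(date '+%H:%M:%S')" "$n"
    sleep 1
done
```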
On the other hand, when I set auto_reschedule_checks = 1, the check scheduling behavior changes: rather than bunches of checks like before, I now see a constant stream of checks running with almost negligible quiet time. The load has been under control for the past three days and has hardly ever crossed 1.0!
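The relevant lines in my nagios.cfg look like this. The directive names are the ones shipped in the stock sample config; the interval and window values shown are the sample-config defaults, which I left in place:

```
# nagios.cfg -- experimental check rescheduling
auto_reschedule_checks=1
auto_rescheduling_interval=30
auto_rescheduling_window=180
```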
One more observation: I have 7,000+ checks but only 10 or 12 distinct plugins. Watching the checks execute, it appears that Nagios schedules all the checks that use the same plugin at around the same time, so I see bunches of check_cisco, bunches of check_fan, and so on. Do the smart scheduling settings take the alphabetical order of plugin names into account? Could that be what's causing this?
Re: Nagios 4.1.1 and 4.2.3 cause very high loads on server
Posted: Fri Dec 09, 2016 1:56 pm
by dwhitfield
termcap wrote:
Interestingly, if I use the experimental auto_scheduler option, my setup seems to run smoothly with only a couple of spikes in a 24-hour cycle.
Aside from the couple of spikes in a 24-hour cycle, do you have any problems running the auto_scheduler? Are you using anything like mod_gearman to offload? FWIW, we should have our own offloader coming in a future version. I can't say yet specifically which version.
4.2.4 is out with an important security fix. Please upgrade.
Re: Nagios 4.1.1 and 4.2.3 cause very high loads on server
Posted: Sat Dec 10, 2016 12:51 am
by termcap
dwhitfield wrote:
Aside from the couple of spikes in a 24-hour cycle, do you have any problems running the auto_scheduler? Are you using anything like mod_gearman to offload? FWIW, we should have our own offloader coming in a future version. I can't say yet specifically which version.
4.2.4 is out with an important security fix. Please upgrade.
No, I am not using an offloader like mod_gearman or anything similar; my install is the standard one downloaded from the Nagios download link.
Some days back I saw spikes with the auto_scheduler, but lately I have noticed no spikes with it either.
Considering the auto_scheduler is an experimental feature, is it OK to keep it enabled on a production box?
I will upgrade at the earliest.