Nagios Server Performance Issues

Fred Kroeger
Posts: 588
Joined: Wed Oct 19, 2011 11:36 pm
Location: Perth, Western Australia

Nagios Server Performance Issues

Post by Fred Kroeger »

Yes... it's the old Nagios Server Performance issues post again......
I've implemented all the good things like RAMDisk, followed all the performance tuning tips, etc., and all has been working really well.
However, recently, we've had excessive CPU utilisation on a Nagios VM.
It also appears that this is related to when the Nagios service is restarted after a config change.
I've attached a graph showing the CPU utilisation. At about 10am yesterday, Nagios was restarted: CPU "User" shot up and CPU "Idle" dropped to almost zero.
I restarted Nagios again at 08:00 this morning, and "User" went back down to its normal level while "Idle" returned to normal.

Looking at the process stats on the server, I can't find any process that is using this extra CPU which is really frustrating. I thought perhaps that the MySQL database was responsible, but of course that doesn't get restarted when the Nagios service is restarted. I even ran a DB repair but it made no difference.

I know that this is difficult for you to fault find especially as it isn't always present. I guess what I'm asking for is any clues to look for or for tips if anyone has had similar issues.

I am running Nagios XI 2012R2.8c with 280 hosts and 2,300 services.

regards... Fred
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: Nagios Server Performance Issues

Post by tmcdonald »

I'm not seeing the attachment, but I would take a look at the type of checks you are running. ESX and WMI checks can be hogs, and check_by_ssh doesn't always play nice either.
Fred Kroeger

Re: Nagios Server Performance Issues

Post by Fred Kroeger »

Sorry about the attachment - just tried to upload it again & discovered that I can't upload pdf files.
Yes I'm already "renicing" any CPU hog - but the problem is that when this issue starts, I can't identify any particular process that could be responsible.
I would also expect that this issue would always be consistent as the same monitors are always being scheduled. So restarting the Nagios service shouldn't change any of the monitors.
As you can see from the CPU graph it is so obvious when the Nagios service gets restarted.

regards... Fred
sreinhardt
-fno-stack-protector
Posts: 4366
Joined: Mon Nov 19, 2012 12:10 pm

Re: Nagios Server Performance Issues

Post by sreinhardt »

Would we be correct in understanding that the blue area on the left is a single restart, and the one(s) in the middle are from multiple restarts? To be fair, I would expect load to increase for a little bit when Nagios is restarted, as it has to recompile all of your configs and figure out templating and inheritance, both of which can take a while and use some resources. However, I did want to start by confirming that this is not happening for 6 hours straight due to one restart.
Fred Kroeger

Re: Nagios Server Performance Issues

Post by Fred Kroeger »

The start and the end of each of the "blue" sections corresponds to a single Nagios restart.

The left and right edges of the graph are what it looks like normally.
During the period of high usr CPU, there is no indication via top that a Nagios process is hogging all the resources. It was only when I applied a config change and then noticed that the utilisation had returned to normal that I started to suspect the Nagios service. As I mentioned previously, I did try restarting the mysqld service (as it had used a large amount of CPU time), but that didn't make any difference.

I appreciate that this may be impossible to diagnose as I haven't had this experience on any of the other 9 Nagios servers I'm running. One even is on the same ESX host and shares the same SAN. I'm just putting this "out there" in case anyone else has had a similar experience.

To me it would appear that some Nagios process is not completing, or is looping, and causing this continuous CPU utilisation. However, it does not affect the monitoring, which still runs normally.

regards... Fred
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: Nagios Server Performance Issues

Post by scottwilkerson »

Fred Kroeger wrote: I appreciate that this may be impossible to diagnose as I haven't had this experience on any of the other 9 Nagios servers I'm running. One even is on the same ESX host and shares the same SAN. I'm just putting this "out there" in case anyone else has had a similar experience.
You certainly may be correct.

One thing I'll throw out there is that I have seen this type of behavior when a Nagios process somehow gets stuck running and you end up with multiple processes running at the same time.

In this case, the following usually fixes the problem by giving the process a little more time to exit on restart:
http://support.nagios.com/wiki/index.ph ... ely_manner
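
To check for the "multiple processes" situation described above, something like the following Python sketch could be run on the server. It is only an illustration: the function name `find_processes` and the process name `"nagios"` are assumptions, not anything taken from the linked article, and it relies on the Linux `/proc/<pid>/comm` files.

```python
import os


def find_processes(name, proc_root="/proc"):
    """Return the sorted PIDs whose command name equals `name`.

    Assumption: a Linux-style proc filesystem where each numeric
    directory holds a `comm` file containing the command name.
    """
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                if f.read().strip() == name:
                    pids.append(int(entry))
        except OSError:
            continue  # the process exited while we were scanning
    return sorted(pids)


if __name__ == "__main__":
    # "nagios" is an assumed process name; adjust it for your install.
    print(find_processes("nagios"))
```

Nagios normally forks short-lived children while running checks, so several PIDs are not proof of a problem by themselves. PIDs that are still present after the service has been stopped are the ones worth a closer look.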
Fred Kroeger

Re: Nagios Server Performance Issues

Post by Fred Kroeger »

Well... it turned out to be an issue with the plugin and not the Nagios processes.
The CPU plugin reads the CPU values in /proc/stat, waits for an interval, then rereads the /proc/stat values.
Unfortunately the default interval is 1 sec, which really doesn't reflect any sort of average utilisation.
So changing the default interval to a larger number now shows CPU utilisation consistent with what I see using top, etc.
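
For anyone following along, here is a minimal Python sketch of the two-sample approach described above. It is not the plugin's actual code: the function names, the 10-second default, and the choice to count iowait as idle time are all assumptions made for illustration.

```python
import time


def parse_cpu_line(line):
    """Return (idle_ticks, total_ticks) from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Use at most the first eight counters: user, nice, system, idle,
    # iowait, irq, softirq, steal. Guest time is already part of user.
    values = [int(v) for v in fields[1:9]]
    # Assumption: iowait is treated as idle time.
    idle = values[3] + (values[4] if len(values) > 4 else 0)
    return idle, sum(values)


def utilisation(before, after):
    """Percentage of non-idle time between two (idle, total) samples."""
    idle_delta = after[0] - before[0]
    total_delta = after[1] - before[1]
    if total_delta <= 0:
        return 0.0
    return 100.0 * (1.0 - idle_delta / total_delta)


def read_sample(path="/proc/stat"):
    with open(path) as f:
        return parse_cpu_line(f.readline())


def measure(interval_seconds=10.0):
    """Sample, wait, sample again. A longer interval averages out short bursts."""
    first = read_sample()
    time.sleep(interval_seconds)
    return utilisation(first, read_sample())


if __name__ == "__main__":
    print(f"CPU utilisation: {measure():.1f}%")
```

With a 1-second interval, a check that happens to sample during a burst of activity reports that burst as if it were the steady state. Widening the window is what brings the figure back in line with the averages that top shows.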

thanks for your assistance..... Fred