Yes... it's the old Nagios Server Performance issues post again......
I've implemented all the good things like RAMDisk, followed all the performance tuning tips, etc. and all has been working really well.
However, recently, we've had excessive CPU utilisation on a Nagios VM.
It also appears that this is related to when the Nagios service is restarted after a config change.
I've attached a graph showing the CPU utilisation. At about 10am yesterday, Nagios was restarted. CPU "User" shoots up & CPU Idle goes to almost zero.
I restarted Nagios again at 08:00 this morning and User goes down to its normal level and Idle increases to normal.
Looking at the process stats on the server, I can't find any process that is using this extra CPU which is really frustrating. I thought perhaps that the MySQL database was responsible, but of course that doesn't get restarted when the Nagios service is restarted. I even ran a DB repair but it made no difference.
I know that this is difficult for you to fault find especially as it isn't always present. I guess what I'm asking for is any clues to look for or for tips if anyone has had similar issues.
I am running NagiosXI 2012R2.8c with 280 Hosts and 2,300 Services
regards... Fred
Nagios Server Performance Issues
-
Fred Kroeger
- Posts: 588
- Joined: Wed Oct 19, 2011 11:36 pm
- Location: Perth, Western Australia
- Contact:
Re: Nagios Server Performance Issues
I'm not seeing the attachment, but I would take a look at the type of checks you are running. ESX and WMI checks can be hogs, and check_by_ssh doesn't always play nice either.
Former Nagios employee
-
Fred Kroeger
- Posts: 588
- Joined: Wed Oct 19, 2011 11:36 pm
- Location: Perth, Western Australia
- Contact:
Re: Nagios Server Performance Issues
Sorry about the attachment - just tried to upload it again & discovered that I can't upload pdf files.
Yes I'm already "renicing" any CPU hog - but the problem is that when this issue starts, I can't identify any particular process that could be responsible.
I would also expect that this issue would always be consistent as the same monitors are always being scheduled. So restarting the Nagios service shouldn't change any of the monitors.
As you can see from the CPU graph it is so obvious when the Nagios service gets restarted.
regards... Fred
-
sreinhardt
- -fno-stack-protector
- Posts: 4366
- Joined: Mon Nov 19, 2012 12:10 pm
Re: Nagios Server Performance Issues
Would we be correct in understanding that the blue area on the left is a single restart, and the one(s) in the middle are from multiple restarts? To be fair, I would expect load to increase for a little while when Nagios is restarted, as it has to recompile all of your configs and figure out templating and inheritance, both of which can take some time and resources. However, I did want to start by confirming that this is not happening for 6 hours straight due to one restart.
Nagios-Plugins maintainer exclusively, unless you have other C language bugs with open-source nagios projects, then I am happy to help! Please pm or use other communication to alert me to issues as I no longer track the forum.
-
Fred Kroeger
- Posts: 588
- Joined: Wed Oct 19, 2011 11:36 pm
- Location: Perth, Western Australia
- Contact:
Re: Nagios Server Performance Issues
The start and the end of each of the "blue" sections corresponds to a single Nagios restart.
The left and right edges of the graph are what it looks like normally.
During the period of high usr CPU, there is no indication via top that a nagios process is hogging all the resources. It was only when I applied a config change and then noticed that the utilisation had returned to normal that I started to suspect the nagios service. As I mentioned previously, I did try to restart the mysqld service (as it had used a large amount of CPU time) but that didn't make any difference.
I appreciate that this may be impossible to diagnose as I haven't had this experience on any of the other 9 Nagios servers I'm running. One even is on the same ESX host and shares the same SAN. I'm just putting this "out there" in case anyone else has had a similar experience.
To me it would appear that some nagios process is not completing, or is looping, and causing this continuous CPU utilisation. However, it does not affect the monitoring, which still runs normally.
regards... Fred
-
scottwilkerson
- DevOps Engineer
- Posts: 19396
- Joined: Tue Nov 15, 2011 3:11 pm
- Location: Nagios Enterprises
- Contact:
Re: Nagios Server Performance Issues
Fred Kroeger wrote: I appreciate that this may be impossible to diagnose as I haven't had this experience on any of the other 9 Nagios servers I'm running. One even is on the same ESX host and shares the same SAN. I'm just putting this "out there" in case anyone else has had a similar experience.
You certainly may be correct.
One thing I'll throw out there, is that I have seen this type of behavior if somehow a nagios process gets stuck running, and you end up with multiple processes running at the same time.
In this case the following usually fixes the problem by giving the process a little more time to exit on restart
http://support.nagios.com/wiki/index.ph ... ely_manner
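For anyone who wants to check for the "multiple processes" condition described above, here is a minimal sketch in Python. It is not part of Nagios; the function names are made up for illustration. It assumes a Linux /proc filesystem and that the daemon's command name is "nagios". Nagios forks children to run checks, so several short-lived nagios processes are normal; the sketch only reports processes whose parent is not itself a nagios process, on the assumption that more than one of those suggests an old daemon that never exited.

```python
import os

def read_pid_info(pid_dir):
    """Parse <pid_dir>/stat into (pid, comm, ppid)."""
    with open(os.path.join(pid_dir, "stat")) as f:
        data = f.read()
    # comm is wrapped in parentheses and may contain spaces,
    # so split on the LAST ')' rather than on whitespace.
    left, right = data.rsplit(")", 1)
    pid_str, comm = left.split("(", 1)
    fields = right.split()
    # fields[0] is the process state, fields[1] is the parent PID
    return int(pid_str), comm, int(fields[1])

def top_level_processes(name="nagios", proc_root="/proc"):
    """Return sorted PIDs named `name` whose parent is not also named `name`."""
    procs = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'self' or 'meminfo'
        try:
            pid, comm, ppid = read_pid_info(os.path.join(proc_root, entry))
        except (OSError, ValueError, IndexError):
            continue  # process exited mid-scan, or the stat file was unreadable
        if comm == name:
            procs[pid] = ppid
    return sorted(pid for pid, ppid in procs.items() if ppid not in procs)
```

Calling `top_level_processes()` after a restart and getting back more than one PID would be a hint (not proof) that an earlier instance is still running.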
-
Fred Kroeger
- Posts: 588
- Joined: Wed Oct 19, 2011 11:36 pm
- Location: Perth, Western Australia
- Contact:
Re: Nagios Server Performance Issues
well.... it turned out to be an issue with the plugin and not the Nagios processes.
The CPU plugin reads the CPU values in /proc/stat, waits for an interval, then rereads the /proc/stat values.
Unfortunately the default interval is 1 sec which really doesn't reflect any sort of average utilisation.
So changing the default interval to a larger number now shows CPU utilisation consistent with what I see using top, etc.
thanks for your assistance..... Fred
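To illustrate the read, wait, reread approach described above, here is a rough Python sketch. It is not the plugin's actual code (the plugin is not named in this thread); the function names and the `stat_path` parameter are invented for the example. It assumes a Linux /proc/stat whose aggregate "cpu" line carries at least the user, nice, system, idle and iowait counters.

```python
import time

def read_cpu_times(stat_path="/proc/stat"):
    """Return the aggregate CPU counters (in jiffies) from the 'cpu' line.

    Only the first eight counters are kept (user, nice, system, idle,
    iowait, irq, softirq, steal); guest time is already counted in user.
    """
    with open(stat_path) as f:
        for line in f:
            fields = line.split()
            if fields and fields[0] == "cpu":
                return [int(v) for v in fields[1:9]]
    raise RuntimeError("no aggregate 'cpu' line found in " + stat_path)

def utilisation(before, after):
    """Percentage of non-idle time between two counter samples."""
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    if total <= 0:
        return 0.0  # no time elapsed between the samples
    idle = deltas[3] + deltas[4]  # idle + iowait
    return 100.0 * (total - idle) / total

def sample(interval=1.0, stat_path="/proc/stat"):
    """Read the counters, wait `interval` seconds, read again, and compare."""
    before = read_cpu_times(stat_path)
    time.sleep(interval)
    after = read_cpu_times(stat_path)
    return utilisation(before, after)
```

For example, `sample(1.0)` only looks at a one-second window, so it can land on a momentary spike, while `sample(30.0)` averages over thirty seconds and should sit closer to what top shows over time.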