Windows server CPU monitoring discrepancies

cstarr · Post by **cstarr** » Tue Dec 12, 2017 4:46 pm

We have been using Nagios to monitor our mostly Windows server environment. Until about a month ago I was able to understand and tune my alerts to match our network's baseline, but a few weeks ago several of the servers started reporting steady increases in %CPU utilization. When I investigated, the servers themselves were not reporting any particularly high CPU usage in perfmon, but Nagios (using NCPA) keeps reporting regular spikes in CPU usage from 0 to 100%. The capacity planning reports make it look as if these few servers started running some CPU-intensive application, but nothing on the servers themselves shows this.

Suggestions? Ideas?

npolovenko · Post by **npolovenko** » Tue Dec 12, 2017 5:46 pm

Hello, @cstarr. It's unlikely, but could NCPA have been running a lot of checks on your Windows server at the time, causing the CPU load to spike? Can you also let us know what version of the NCPA agent you're using? Have you changed any settings recently? Can you show us the check_cpu command that you have on the Nagios server?

cstarr · Post by **cstarr** » Wed Dec 13, 2017 3:03 pm

No changes have recently occurred to either the targeted servers or the Nagios server. The version of NCPA agent is 2.0.2. The check I'm running is the following:

check_xi_ncpa!-t 'nagiosncpa' -P 5693 -M cpu/percent -w 50 -c 80 -q 'aggregate=avg'

It checks every 5 minutes and rechecks every 2 minutes 5 times before generating an alert.

npolovenko · Post by **npolovenko** » Wed Dec 13, 2017 5:18 pm

@cstarr, the command looks fine. I also wanted to make sure that you pass the aggregate=avg argument in the command, and you do. Would you be able to upgrade NCPA to the latest version, and the Linux plugin as well? Here's the link to the latest Windows agent: https://www.nagios.org/ncpa/getting-started.php#windows and here's the download link to the latest check_ncpa plugin: https://assets.nagios.com/downloads/ncp ... cpa.tar.gz

cstarr · Post by **cstarr** » Thu Dec 14, 2017 2:06 pm

I updated the check_ncpa plugin and the client on two of the servers exhibiting symptoms, but no luck. Nagios continues to report the server CPU utilization going to 100% for about 5 minutes then dropping down to nearly zero for 5-20 minutes, but the server's internal performance counters show nearly zero CPU utilization the entire time.

npolovenko · Post by **npolovenko** » Thu Dec 14, 2017 3:34 pm

@cstarr, NCPA uses psutil to check the counters. It runs on Python, which loads its libraries in order to run the checks, so NCPA's readings won't always match the system performance counter readings. But this still doesn't really explain 100% CPU spikes. How many CPU cores does your server have? Also, it may be helpful to rebuild the performance counters: https://support.microsoft.com/en-us/hel ... ver-2008-6
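
For anyone following along, here is a rough sketch of why the core count matters when the check passes aggregate=avg. It assumes the agent collects one percentage per logical core and then collapses the list; the `aggregate_cpu` helper and the sample readings are made up for illustration and are not NCPA's actual code.

```python
def aggregate_cpu(per_core, how="avg"):
    """Collapse per-core CPU percentages into one number.

    `per_core` is a list of percentages, one per logical core. This helper
    and its `how` values are illustrative only, not NCPA's implementation.
    """
    if not per_core:
        raise ValueError("need at least one per-core reading")
    if how == "avg":
        return sum(per_core) / len(per_core)
    if how == "max":
        return max(per_core)
    raise ValueError(f"unsupported aggregate: {how}")


# Hypothetical readings: one busy core out of four.
four_cores = [100.0, 2.0, 1.0, 1.0]
print(aggregate_cpu(four_cores, "avg"))  # average across all cores
print(aggregate_cpu(four_cores, "max"))  # busiest core only

# With a single vCPU there is nothing to average away.
print(aggregate_cpu([100.0], "avg"))
```

On a multi-core box, averaging dilutes one pegged core, while a single-vCPU guest reports whatever that one core shows, so short bursts would be much more visible there.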

cstarr · Post by **cstarr** » Tue Dec 19, 2017 11:25 am

So we have several servers that are showing different levels of CPU utilization.

On our backup DC, which is virtualized with 4 GB of RAM and one 3.07 GHz vCPU, CPU usage goes from near zero to 100% very frequently and sometimes stays maxed out. On the server itself the performance counters show near zero utilization, and the GUI doesn't show any signs of a maxed-out CPU.

Our primary DC went from near zero to about 80% around the same time. It's physical and has two 1.8 GHz Xeons with 16 GB of RAM.

There are several application servers experiencing this as well, most of which are virtualized but have a variety of RAM and CPU core configurations.

All of them began doing this about a month ago, although not all at the EXACT same time.

I could rebuild the performance counters but the servers themselves don't feel particularly sluggish in their GUI when Nagios is showing such high CPU utilization.

npolovenko · Post by **npolovenko** » Tue Dec 19, 2017 12:00 pm

@cstarr, thank you for the detailed explanation. I passed this information to our developers. We will look closer into this issue and try to replicate it in our environment. In the meantime, you can open a GitHub issue for this problem if you'd like:
https://github.com/NagiosEnterprises/ncpa/issues

cstarr · Post by **cstarr** » Wed Dec 27, 2017 11:20 am

I just noticed something odd that may relate to this issue. When I view the performance graphs for CPU usage, I can see the usage spike up to 100% on several servers at the 4 hr, 24 hr, 7 day, and 30 day intervals. When I go out to the 365 day interval, the spikes are not as high and only appear to reach around 60%. There's still a clear jump from near nothing to 60%, but the difference between the two charts makes me wonder whether there is some sort of database corruption or whether they're reading from different datasets.

npolovenko · Post by **npolovenko** » Wed Dec 27, 2017 12:48 pm

@cstarr, this has to do with how the Round Robin Database (RRD) works. It compresses the data by averaging it over periods of time, so a graph covering a longer period is built from more heavily averaged data and its peaks look lower. This was done to keep storage consumption in check and prevent the DB from bloating over time. It wouldn't explain the "spiking issue", though.
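
A minimal sketch of that averaging effect, for illustration only. The `consolidate` function and the sample values are hypothetical and only mimic the idea of averaging several fine-grained samples into one coarser data point; they are not RRDtool's actual code or your real data.

```python
def consolidate(samples, step):
    """Average each run of `step` consecutive samples into one data point."""
    return [
        sum(samples[i:i + step]) / len(samples[i:i + step])
        for i in range(0, len(samples), step)
    ]


# Hypothetical fine-grained readings that flip between 0% and 100%.
raw = [100, 0, 0, 100, 100, 0, 100, 0, 100, 100, 0, 100]

# The same data after averaging every four samples into one point.
coarse = consolidate(raw, 4)

print("peak in raw data:    ", max(raw))
print("peak after averaging:", max(coarse))
```

The averaged series still shows a clear jump from nothing, but its peak is lower than the raw 100%, which is the same kind of difference seen between the 30 day and 365 day graphs.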

Nagios Support Forum
