Windows server CPU monitoring discrepancies

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
cstarr
Posts: 13
Joined: Thu Feb 16, 2017 11:18 am

Windows server CPU monitoring discrepancies

Post by cstarr »

We have been using Nagios to monitor our mostly Windows server environment. Until about a month ago I was able to understand and tune my alerts to match our network's baseline, but a few weeks ago several of the servers started reporting steady increases in CPU utilization. When I investigated, the servers themselves were not reporting any particularly high CPU usage in perfmon, but Nagios (using NCPA) keeps reporting regular spikes from 0 to 100% CPU usage. Judging by the capacity planning reports, it looks as if these few servers started running some kind of CPU-intensive application, but nothing on the servers themselves shows this.

Suggestions? Ideas?
npolovenko
Support Tech
Posts: 3457
Joined: Mon May 15, 2017 5:00 pm

Re: Windows server CPU monitoring discrepancies

Post by npolovenko »

Hello, @cstarr. It's unlikely, but could NCPA have been running a lot of checks on your Windows server at the time, causing the CPU load to spike? Can you also let us know what version of the NCPA agent you're using? Have you changed any settings recently? Can you show us the check_cpu command that you have on the Nagios server?
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
cstarr
Posts: 13
Joined: Thu Feb 16, 2017 11:18 am

Re: Windows server CPU monitoring discrepancies

Post by cstarr »

No changes have recently been made to either the targeted servers or the Nagios server. The version of the NCPA agent is 2.0.2. The check I'm running is the following:

check_xi_ncpa!-t 'nagiosncpa' -P 5693 -M cpu/percent -w 50 -c 80 -q 'aggregate=avg'

It checks every 5 minutes and rechecks every 2 minutes 5 times before generating an alert.
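
For reference, the "recheck several times before alerting" behaviour described above can be modelled roughly as below. This is a simplified sketch of Nagios-style soft/hard state handling, not Nagios Core's actual code; `hard_alert_index` and its parameters are made-up names, and it assumes the first non-OK result counts as attempt 1 of 5.

```python
def hard_alert_index(results, max_attempts=5):
    """Return the index of the check result that would trigger a notification,
    or None if the problem never lasts for max_attempts consecutive checks.

    results: sequence of booleans, True meaning the check came back non-OK.
    """
    consecutive = 0
    for i, bad in enumerate(results):
        # Any OK result resets the attempt counter (back to a clean state).
        consecutive = consecutive + 1 if bad else 0
        if consecutive == max_attempts:
            return i
    return None


# A spike that clears before the fifth consecutive non-OK result never alerts.
short_spike = [False, True, True, True, False, False]
sustained = [False, True, True, True, True, True, True]

print(hard_alert_index(short_spike))  # None
print(hard_alert_index(sustained))    # 5
```

Under that assumption, short spikes that recover between rechecks would show up in the graphs without ever producing a notification.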
npolovenko
Support Tech
Posts: 3457
Joined: Mon May 15, 2017 5:00 pm

Re: Windows server CPU monitoring discrepancies

Post by npolovenko »

@cstarr, the command looks fine. I also wanted to make sure that you pass the aggregate=avg argument in the command, and you do. Would you be able to upgrade NCPA to the latest version, and the Linux plugin as well? Here's the link to the latest Windows agent: https://www.nagios.org/ncpa/getting-started.php#windows and here's the download link to the latest check_ncpa plugin: https://assets.nagios.com/downloads/ncp ... cpa.tar.gz
cstarr
Posts: 13
Joined: Thu Feb 16, 2017 11:18 am

Re: Windows server CPU monitoring discrepancies

Post by cstarr »

I updated the check_ncpa plugin and the client on two of the servers exhibiting symptoms, but no luck. Nagios continues to report the server CPU utilization going to 100% for about 5 minutes then dropping down to nearly zero for 5-20 minutes, but the server's internal performance counters show nearly zero CPU utilization the entire time.
npolovenko
Support Tech
Posts: 3457
Joined: Mon May 15, 2017 5:00 pm

Re: Windows server CPU monitoring discrepancies

Post by npolovenko »

@cstarr, NCPA uses psutil to read the counters. It runs on Python, which loads its libraries in order to run the checks, so NCPA's readings won't always match the system performance counter readings. But this still doesn't really explain 100% CPU spikes. How many CPU cores does your server have? Also, it may be helpful to rebuild the performance counters: https://support.microsoft.com/en-us/hel ... ver-2008-6
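
As a rough illustration of why two tools can disagree: psutil's `cpu_percent()` derives utilization from the change in cumulative CPU times between two samples, so the length of the sampling window matters a great deal. The sketch below uses plain Python and invented counter values (it is not NCPA's or psutil's code) to show the same brief burst reading as 100% over a short window and under 1% over a long one.

```python
def cpu_percent(prev, cur):
    """Percent busy between two cumulative (busy_seconds, idle_seconds) samples."""
    busy = cur[0] - prev[0]
    idle = cur[1] - prev[1]
    total = busy + idle
    if total <= 0:
        return 0.0  # no time elapsed between samples
    return 100.0 * busy / total


# Hypothetical cumulative counters: (busy, idle) in seconds.
samples = [(10.0, 90.0), (10.5, 90.0), (10.5, 149.5)]

short_window = cpu_percent(samples[0], samples[1])  # 0.5 s window, all busy
long_window = cpu_percent(samples[0], samples[2])   # 60 s window, mostly idle

print(round(short_window, 1), round(long_window, 1))  # 100.0 0.8
```

Whether this is what happens on the affected servers is only a guess; it just shows that a short measurement window can report a spike that a longer-averaged counter never shows.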
cstarr
Posts: 13
Joined: Thu Feb 16, 2017 11:18 am

Re: Windows server CPU monitoring discrepancies

Post by cstarr »

So we have several servers that are showing different levels of CPU utilization.

Our backup DC is virtualized, with 4 GB of RAM and one 3.07 GHz vCPU. Its CPU usage goes from near zero to 100% very frequently, and sometimes stays maxed out. On the server itself the performance counters show near-zero utilization, and the GUI shows no signs of a maxed-out CPU.

Our primary DC went from near zero to about 80% around the same time. It's physical and has two 1.8 GHz Xeons with 16 GB of RAM.

There are several application servers experiencing this as well, most of which are virtualized but have a variety of RAM and CPU core configurations.

All of them began doing this about a month ago, although not all at the EXACT same time.

I could rebuild the performance counters but the servers themselves don't feel particularly sluggish in their GUI when Nagios is showing such high CPU utilization.
npolovenko
Support Tech
Posts: 3457
Joined: Mon May 15, 2017 5:00 pm

Re: Windows server CPU monitoring discrepancies

Post by npolovenko »

@cstarr, thank you for the detailed explanation. I passed this information to our developers. We will look closer into this issue and try to replicate it in our environment. In the meantime, you can open a GitHub issue for this problem if you'd like:
https://github.com/NagiosEnterprises/ncpa/issues
cstarr
Posts: 13
Joined: Thu Feb 16, 2017 11:18 am

Re: Windows server CPU monitoring discrepancies

Post by cstarr »

I just noticed something odd that may relate to this issue. When I view the performance graphs for CPU usage, I can see the usage spike up to 100% on several servers at the 4 hr, 24 hr, 7 day, and 30 day intervals. When I go out to the 365 day interval, the spikes are not as high; they only appear to hit around 60%. There's still a clear jump from near nothing to 60%, but the difference between the two charts makes me wonder whether there is some sort of database corruption, or whether they're reading from different datasets.
npolovenko
Support Tech
Posts: 3457
Joined: Mon May 15, 2017 5:00 pm

Re: Windows server CPU monitoring discrepancies

Post by npolovenko »

@cstarr, this has to do with how the Round Robin Database (RRD) works. It compresses the data by averaging it over periods of time, so a graph covering a longer period shows more heavily averaged data and therefore lower peaks. That is done to limit storage consumption and prevent the DB from bloating over time. It wouldn't explain the "spiking issue", though.
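
To make the averaging effect concrete, here is a minimal sketch. It is not RRDtool's implementation; `consolidate_average` and the sample values are invented for illustration, loosely mimicking an AVERAGE consolidation that folds several fine-grained samples into one coarser data point.

```python
def consolidate_average(values, step):
    """Average consecutive groups of `step` values into one data point each."""
    groups = [values[i:i + step] for i in range(0, len(values), step)]
    return [sum(group) / len(group) for group in groups]


# Hypothetical 5-minute CPU samples that flip between 100% and 0%.
raw = [100, 100, 0, 0, 0, 100, 100, 100, 0, 0, 100, 0]

# Fold every 6 samples (30 minutes) into a single averaged point.
coarse = consolidate_average(raw, 6)

print(max(raw))  # 100
print(coarse)    # [50.0, 50.0]
```

The short-range data still peaks at 100, while the consolidated series never rises above 50, which is the same kind of gap as 100% on the 30 day graph versus roughly 60% on the 365 day graph.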