Nagios Support Forum

Posted: **Wed Apr 05, 2017 3:34 pm**

Hello,

I have the following business needs for each server:

Alert on CPU usage at standard thresholds (80% warn, 90% crit, 5-minute check, 1-minute recheck, notify after 5 checks above threshold)
Also alert on CPU usage on 2-core VMs to identify hung processes (40% warn, 50% crit, 5-minute check, 5-minute recheck, notify after 6 checks above threshold)
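To make the timing of the two rules concrete, here is a rough sketch of how long each one waits before notifying. It assumes "notify after N checks" maps to something like Nagios' max_check_attempts, that the first check above threshold counts as attempt 1, and that every later attempt runs one recheck interval after the previous one. The function name is made up for illustration.

```python
def minutes_until_notification(max_check_attempts: int, retry_interval_min: float) -> float:
    """Minutes from the first failing check to the check that triggers a notification.

    Assumption: the first check above threshold is attempt 1, and each later
    attempt runs exactly one retry interval after the previous one.
    """
    if max_check_attempts < 1:
        raise ValueError("max_check_attempts must be at least 1")
    return (max_check_attempts - 1) * retry_interval_min


# Standard rule: 1-minute recheck, notify after 5 checks above threshold
standard_delay = minutes_until_notification(5, 1)
# Hung-process rule: 5-minute recheck, notify after 6 checks above threshold
hung_process_delay = minutes_until_notification(6, 5)

print(standard_delay, hung_process_delay)
```

Under those assumptions the standard rule notifies about 4 minutes after the first bad check, and the hung-process rule about 25 minutes after it.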

There is also a rule for 4-core VMs, but I wanted to keep this question simple. So, for each server, I have two active CPU monitors, each with different alerting thresholds. This lets us identify hung processes sooner while minimizing the notifications we receive (so we pay attention to the ones we do get, because they're important). I'm going to have to tweak the metrics display to keep the extra monitors from showing up there, but that will be easy enough. My question is: is this how people typically satisfy this business requirement? Or am I missing some way to do this with a single CPU monitor?

Thank you for any input you can provide.

-Greg

Posted: **Wed Apr 05, 2017 4:36 pm**

To be honest I'm not sure how that helps identify a hung process unless you're referring to something like a livelock situation where a process is consuming 100% of the CPU time.

I am not aware of anyone else using multiple CPU checks. It almost sounds more like you want a plugin that detects whether a process has been using 100% CPU for an extended time, or perhaps I've misunderstood your request.

So maybe something along these lines?

https://exchange.nagios.org/directory/P ... es/details
https://exchange.nagios.org/directory/P ... sh/details

Posted: **Wed Apr 05, 2017 5:25 pm**

Yes, I'm referring to a situation in which a process consumes approximately 100% of one CPU core, which translates to 40-50% CPU usage on a 2-core server and 20-25% CPU usage on a 4-core server. These are Windows servers so I want to stick with a Windows-native solution and not have to install anything extra to run Linux shell scripts. There can be multiple processes that may do this sort of thing (TiWorker.exe, vendor apps, etc.) so I don't need to monitor any specific process. I'd love to get a "top 5" list of current CPU-hogging processes to add to the notifications. I may just end up having to write my own PowerShell script to do this but using the two CPU monitors should work for us in the meantime. Thank you for the suggestions.
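The core-count arithmetic above can be sketched as follows. The 0.8 warning factor is an assumption inferred from the 40/50 and 20/25 figures in this thread, and the function names are hypothetical.

```python
def single_core_share(cores: int) -> float:
    """Total CPU % reported when one process saturates exactly one core."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return 100.0 / cores


def hung_process_thresholds(cores: int, warn_factor: float = 0.8) -> tuple[float, float]:
    """Return (warn, crit) percentages for a one-core-pegged condition.

    Assumption: crit is the full single-core share and warn is 80% of it,
    which reproduces the 40/50 (2-core) and 20/25 (4-core) figures.
    """
    crit = single_core_share(cores)
    warn = round(crit * warn_factor, 1)
    return warn, crit


for cores in (2, 4):
    print(cores, hung_process_thresholds(cores))
```

The same calculation extends to other core counts, although the share of a single core gets small enough on larger machines that it becomes hard to tell apart from ordinary load.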

Posted: **Thu Apr 06, 2017 10:59 am**

I know check_wmi_plus (included with XI) has some options for monitoring CPU usage on a per-process basis. Here are the technical bits from the module itself:

Code:

#----------------------------------------------------------
[checkproc cpuabove]
requires=1.48
inihelp=<<EOT
Check for processes using more than a specified CPU utilisation. To make this work as intended you need to specify some
warning/critical criteria eg -w 50 for warning when a process uses more than 50% CPU. You probably also want to remove 
all processes with low CPU from the results. Do this using something like -exc _AvgCPU=@0:5 (which will exclude processes that have CPU utilisation between 0 and 5%)
ARG1  The processname to look for. Use % for wildcards.
   The process name typically only includes the actual file name minus its suffix eg firefox, svchost
   If there are multiple instances eg svchost, then some versions of Windows have them named all the same while others
   such as Windows 2008 Server, have them numbered eg svchost#1, svchost#2, svchost#3. To get all svchost processes you
   need to set ARG1 to svchost%
   To view all processes set ARG1 to "%" and the full process list will be included in the plugin output.

Note:  Use --nodatamode and/or NODATAEXIT settings to control what happens if no matching process is found.
EOT

aligndata=Name,IDProcess
query=select Name,IDProcess,PercentProcessorTime,Timestamp_Sys100NS from Win32_PerfRawData_PerfProc_Process WHERE Name like "{_arg1}" and Name != "Idle"  and Name != "_Total"

# run 2 WMI queries, 5 seconds apart. The delay only applies if using --nokeepstate
samples=2
delay=5

customfield=_AvgCPU,PERF_100NSEC_TIMER,PercentProcessorTime,%.1f,100

test=_AvgCPU
test=_ItemCount

# fields to display before we list out all the CPU data
predisplay=_DisplayMsg||~|~| - ||
predisplay=_ItemCount||Total Process Count|||| (Process details on next line)\n

display=_DisplayMsg||~|~| - ||
display=_AvgCPU|%|CPU for {Name} (PID={IDProcess})||||\n

# need to include the {Name} so that performance data is unique to each instance
perf=_ItemCount||Process Count
# perf=_AvgCPU|%|Avg Utilisation CPU_{Name} - don't really need perfdata for each process for this check - use checkproc cpu if you want that

That's probably the easiest option. You'll need to enable your Windows environment for WMI monitoring though. We have docs for that:
https://assets.nagios.com/downloads/nag ... ios-XI.pdf

You'll need to include the .ini in your check_wmi_plus.pl as well. That's a one-line change from the default one we distribute (I think):

Code:

our $wmi_ini_file='$conf_file_dir/check_wmi_plus.ini';

Though there's nothing to allow you to say "only show the top 5" without modifying the plugin itself, you could say "only show me processes using 10% or greater" using -exc like so:

Code:

[root@xi-stable rw]# /usr/local/nagios/libexec/check_wmi_plus.pl -H 192.168.67.99 -u admin -p welcome123 -m checkproc -s cpuabove -a % -exc _AvgCPU=@0:9
OK (Sample Period 2 sec) - Total Process Count=1 (Process details on next line)|'Process Count'=1;
OK - CPU for wscript (PID=30260)=54.1%

The -exc _AvgCPU=@0:9 bit basically says "exclude processes with % usage between 0% and 9%". That would help narrow things down a bit, but you would also trigger a CRITICAL if no processes matched those criteria and, in turn, no results were returned (the --nodatamode and NODATAEXIT settings mentioned in the help text above control what happens in that case). I believe you could then add -w 20 -c 25 to warn at 20% and go critical at 25%.
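Since the plugin has no built-in "top 5" option, one workaround is to post-process its output. The sketch below is not part of check_wmi_plus. It assumes each process appears on its own line in the same shape as the sample output above ("CPU for NAME (PID=N)=X%"), which this thread only shows for a single process. The function name and regex are hypothetical.

```python
import re

# Matches lines shaped like the sample: "OK - CPU for wscript (PID=30260)=54.1%"
PROCESS_LINE = re.compile(
    r"CPU for (?P<name>.+?) \(PID=(?P<pid>\d+)\)=(?P<cpu>\d+(?:\.\d+)?)%"
)


def top_cpu_processes(plugin_output: str, limit: int = 5) -> list[tuple[str, int, float]]:
    """Return up to `limit` (name, pid, cpu_percent) tuples, highest CPU first."""
    rows = []
    for line in plugin_output.splitlines():
        match = PROCESS_LINE.search(line)
        if match:
            rows.append(
                (match.group("name"), int(match.group("pid")), float(match.group("cpu")))
            )
    # Sort by CPU percentage, largest first, then keep only the top entries
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows[:limit]


sample = (
    "OK (Sample Period 2 sec) - Total Process Count=1 "
    "(Process details on next line)|'Process Count'=1;\n"
    "OK - CPU for wscript (PID=30260)=54.1%"
)
print(top_cpu_processes(sample))
```

A wrapper script could run the plugin, pass its output through a function like this, and put the resulting list in the notification text.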

Posted: **Thu Apr 06, 2017 12:35 pm**

Thank you for this detailed response, mcapra. I will have to take a look at check_wmi_plus as that's something I haven't investigated yet. I'm glad to see in the guide that you aren't recommending adding the wmiagent user to the local Administrators group like most vendors recommend out of laziness and a complete disregard for security. Please feel free to lock this thread. Thanks!

Posted: **Thu Apr 06, 2017 4:04 pm**

Sure thing! Feel free to open a new thread if you have additional WMI questions.

Multi-threshold CPU monitor recommendation
