Multi-threshold CPU monitor recommendation
Posted: Wed Apr 05, 2017 3:34 pm
by gmackey
Hello,
I have the following business needs for each server:
- Alert on CPU usage at standard thresholds (80% warn, 90% crit, 5-minute check, 1-minute recheck, notify after 5 checks above threshold)
- Also alert on CPU usage on 2-core VMs to identify hung processes (40% warn, 50% crit, 5-minute check, 5-minute recheck, notify after 6 checks above threshold)
There is a rule for 4-core VMs also, but I wanted to keep this question simple. So basically, for each server I have two active CPU monitors, each with different alerting thresholds. This allows us to identify hung processes in a more timely manner while minimizing the notifications we receive (so that we pay attention to the ones we do get, because they're important). I'm going to have to tweak the metrics display to keep the extra monitors from showing up there, but that will be easy enough. My question is: is this how people typically satisfy this business requirement? Or am I missing some way that I could do this with a single CPU monitor?
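For comparison, here is a rough sketch of how long each of the two monitors would wait before notifying. It assumes "notify after N checks" means the first failing check counts as attempt 1 and every later attempt follows one recheck interval later; the function name and parameters are made up for illustration and are not part of any Nagios API.

```python
def minutes_until_notification(recheck_interval_min, checks_before_notify):
    """Minutes between the first failing check and the notification.

    Assumption: the first failing check is attempt 1, and each further
    attempt runs one recheck interval after the previous one.
    """
    if checks_before_notify < 1:
        raise ValueError("checks_before_notify must be at least 1")
    return (checks_before_notify - 1) * recheck_interval_min


# Standard monitor: 1-minute recheck, notify after 5 checks.
standard = minutes_until_notification(1, 5)
# Hung-process monitor: 5-minute recheck, notify after 6 checks.
hung_process = minutes_until_notification(5, 6)
print(standard, hung_process)  # 4 25
```

Under that assumption the standard monitor notifies about 4 minutes after the first bad check, and the lower-threshold monitor about 25 minutes after.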
Thank you for any input you can provide.
-Greg
Re: Multi-threshold CPU monitor recommendation
Posted: Wed Apr 05, 2017 4:36 pm
by avandemore
To be honest I'm not sure how that helps identify a hung process unless you're referring to something like a livelock situation where a process is consuming 100% of the CPU time.
I am not aware of anyone else using multiple CPU checks. It almost sounds more like you want a plugin that detects if a process is using 100% CPU for an extended time, or perhaps I've misunderstood your request.
So maybe something along these lines?
https://exchange.nagios.org/directory/P ... es/details
https://exchange.nagios.org/directory/P ... sh/details
Re: Multi-threshold CPU monitor recommendation
Posted: Wed Apr 05, 2017 5:25 pm
by gmackey
Yes, I'm referring to a situation in which a process consumes approximately 100% of one CPU core, which translates to 40-50% CPU usage on a 2-core server and 20-25% CPU usage on a 4-core server. These are Windows servers so I want to stick with a Windows-native solution and not have to install anything extra to run Linux shell scripts. There can be multiple processes that may do this sort of thing (TiWorker.exe, vendor apps, etc.) so I don't need to monitor any specific process. I'd love to get a "top 5" list of current CPU-hogging processes to add to the notifications. I may just end up having to write my own PowerShell script to do this but using the two CPU monitors should work for us in the meantime. Thank you for the suggestions.
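The arithmetic behind those numbers can be sketched as follows. This is only an illustration: the function name is hypothetical, and the warning level of 80% of one core is an assumption chosen because it reproduces the 40/50 and 20/25 figures mentioned above.

```python
def single_core_thresholds(cores, warn_percent_of_core=80):
    """Return (warning, critical) as percentages of total CPU.

    One fully busy core on an N-core machine shows up as 100/N percent
    of total CPU, so that value is used as the critical threshold.
    The warning threshold is an assumed fraction of one core.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    critical = 100 / cores
    warning = warn_percent_of_core / cores
    return warning, critical


print(single_core_thresholds(2))  # (40.0, 50.0)
print(single_core_thresholds(4))  # (20.0, 25.0)
```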
Re: Multi-threshold CPU monitor recommendation
Posted: Thu Apr 06, 2017 10:59 am
by mcapra
I know check_wmi_plus (included with XI) has some options for monitoring CPU usage on a per-process basis. Here are the technical bits from the module itself:
Code: Select all
#----------------------------------------------------------
[checkproc cpuabove]
requires=1.48
inihelp=<<EOT
Check for processes using more than a specified CPU utilisation. To make this work as intended you need to specify some
warning/critical criteria eg -w 50 for warning when a process uses more than 50% CPU. You probably also want to remove
all processes with low CPU from the results. Do this using something like -exc _AvgCPU=@0:5 (which will exclude processes that have CPU utilisation between 0 and 5%)
ARG1 The processname to look for. Use % for wildcards.
The process name typically only includes the actual file name minus its suffix eg firefox, svchost
If there are multiple instances eg svchost, then some versions of Windows have them named all the same while others
such as Windows 2008 Server, have them numbered eg svchost#1, svchost#2, svchost#3. To get all svchost processes you
need to set ARG1 to svchost%
To view all processes set ARG1 to "%" and the full process list will be included in the plugin output.
Note: Use --nodatamode and/or NODATAEXIT settings to control what happens if no matching process is found.
EOT
aligndata=Name,IDProcess
query=select Name,IDProcess,PercentProcessorTime,Timestamp_Sys100NS from Win32_PerfRawData_PerfProc_Process WHERE Name like "{_arg1}" and Name != "Idle" and Name != "_Total"
# run 2 WMI queries, 5 seconds apart. The delay only applies if using --nokeepstate
samples=2
delay=5
customfield=_AvgCPU,PERF_100NSEC_TIMER,PercentProcessorTime,%.1f,100
test=_AvgCPU
test=_ItemCount
# fields to display before we list out all the CPU data
predisplay=_DisplayMsg||~|~| - ||
predisplay=_ItemCount||Total Process Count|||| (Process details on next line)\n
display=_DisplayMsg||~|~| - ||
display=_AvgCPU|%|CPU for {Name} (PID={IDProcess})||||\n
# need to include the {Name} so that performance data is unique to each instance
perf=_ItemCount||Process Count
# perf=_AvgCPU|%|Avg Utilisation CPU_{Name} - don't really need perfdata for each process for this check - use checkproc cpu if you want that
That's probably the easiest option. You'll need to enable your Windows environment for WMI monitoring though. We have docs for that:
https://assets.nagios.com/downloads/nag ... ios-XI.pdf
You'll need to include the .ini in your check_wmi_plus.pl as well. That's a one-line change from the default one we distribute (I think):
Code: Select all
our $wmi_ini_file='$conf_file_dir/check_wmi_plus.ini';
Though there's nothing to allow you to say "only show the top 5" without modifying the plugin itself, you could say "only show me processes using 10% or greater" using -exc like so:
Code: Select all
[root@xi-stable rw]# /usr/local/nagios/libexec/check_wmi_plus.pl -H 192.168.67.99 -u admin -p welcome123 -m checkproc -s cpuabove -a % -exc _AvgCPU=@0:9
OK (Sample Period 2 sec) - Total Process Count=1 (Process details on next line)|'Process Count'=1;
OK - CPU for wscript (PID=30260)=54.1%
The -exc _AvgCPU=@0:9 bit basically says "exclude processes with % usage between 0% and 9%". That would help narrow things down a bit, but you would also trigger a CRITICAL if no processes matched those criteria and, in turn, no results were returned. I believe you could then add -w 20 -c 25 to warn at 20% and go critical at 25%.
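Since the plugin itself has no "top 5" option, one possible workaround is to post-process its text output in a wrapper. The sketch below is not part of check_wmi_plus; it assumes the per-process lines always look like the sample output above ("CPU for NAME (PID=N)=X%"), and the function and pattern names are made up for illustration.

```python
import re

# Pattern inferred from the sample line:
#   "OK - CPU for wscript (PID=30260)=54.1%"
# This format is an assumption based on that one sample.
PROCESS_LINE = re.compile(
    r"CPU for (?P<name>.+?) \(PID=(?P<pid>\d+)\)=(?P<cpu>[\d.]+)%"
)


def top_processes(plugin_output, limit=5):
    """Return up to `limit` (name, pid, cpu) tuples, highest CPU first."""
    rows = [
        (m.group("name"), int(m.group("pid")), float(m.group("cpu")))
        for m in PROCESS_LINE.finditer(plugin_output)
    ]
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows[:limit]


sample = (
    "OK (Sample Period 2 sec) - Total Process Count=1 "
    "(Process details on next line)|'Process Count'=1;\n"
    "OK - CPU for wscript (PID=30260)=54.1%"
)
print(top_processes(sample))  # [('wscript', 30260, 54.1)]
```

A wrapper like this could run the plugin, pass its output through top_processes, and append the result to the notification text.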
Re: Multi-threshold CPU monitor recommendation
Posted: Thu Apr 06, 2017 12:35 pm
by gmackey
Thank you for this detailed response, mcapra. I will have to take a look at check_wmi_plus as that's something I haven't investigated yet. I'm glad to see in the guide that you aren't recommending adding the wmiagent user to the local Administrators group like most vendors recommend out of laziness and a complete disregard for security. Please feel free to lock this thread. Thanks!
Re: Multi-threshold CPU monitor recommendation
Posted: Thu Apr 06, 2017 4:04 pm
by mcapra
Sure thing! Feel free to open a new thread if you have additional WMI questions.