Metrics for nix CPU

Post by **BanditBBS** » Mon Aug 19, 2013 10:51 am

I know it doesn't currently work, but is there a reason for CPU Stats not being able to show under metrics for nix based systems? I have performance data being returned and we'd much rather look at average CPU utilization instead of Load. Any easier way to see this? We really want to see the top X number of nix hosts with the highest CPU average.

Thanks!

Post by **sreinhardt** » Mon Aug 19, 2013 11:03 am

It's entirely possible; it's just a matter of finding a good counter/location to check, depending on the distro you are running. Are these all CentOS/RHEL?

Edit: Ah, you mean in the metrics component... refer to abrist.

Post by **abrist** » Mon Aug 19, 2013 11:04 am

The primary reason it is missing is that *nix server performance is predominantly measured by "Load". As many *nix distributions are quite aggressive in caching, read-ahead, preprocessing, etc., "Load over time" is usually a superior metric for gauging a server's performance health. Once load is over 1.0 per CPU core, you may start to have issues due to wait, and that is the important business metric.
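The "1.0 per CPU core" rule of thumb above can be sketched as follows. This is only an illustration: the function names and the default threshold are assumptions, and `os.getloadavg()` is available on Unix-like systems only.

```python
import os


def load_per_core(load_avg, cores):
    """Normalize a load average by the number of CPU cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_avg / cores


def is_overloaded(load_avg, cores, threshold=1.0):
    """True when the per-core load exceeds the rule-of-thumb threshold."""
    return load_per_core(load_avg, cores) > threshold


if __name__ == "__main__":
    # 1-, 5- and 15-minute load averages as reported by the kernel.
    one, five, fifteen = os.getloadavg()
    cores = os.cpu_count() or 1
    print(f"5 min load per core: {load_per_core(five, cores):.2f}")
    print("overloaded" if is_overloaded(five, cores) else "ok")
```

A host with 4 cores and a 5-minute load of 6.0 would be flagged, while the same load on 8 cores would not.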

I understand the desire for a CPU average though (especially for EC2 and other cloud computing). Do you have a custom script returning these metrics?

The metrics component is an odd beast. It deals with the output of many different plugins and tries to normalize them. You may be able to look at its PHP and add your metric, though just a warning: the component is a bit complex due to how many differently formatted sources it pulls from. Obviously, custom development is an option as well...

Post by **BanditBBS** » Mon Aug 19, 2013 11:27 am

abrist,

That reasoning is a pretty big generalization. None of our AIX admins care one bit about load and they only really care about CPU usage if it is "cooking" for over an hour or so.

The two important parts of the check I use:

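Code:

# Sample vmstat four times at 1-second intervals; the egrep filters drop the header rows.
open(PS, "/usr/bin/vmstat 1 4 | egrep -v '[a-z,A-Z]|-' | egrep '[0-9]' |") || return 1;
my $tidle = 0;
while (<PS>) {
	# Each data row starts with whitespace, so split() yields an empty first
	# field and the idle column lands at index 16.
	my $idle = (split(/[\t \n]+/))[16];
	$tidle += $idle;
}
close(PS);
# Average idle over the four samples, then convert to usage.
$usage = 100 - ($tidle / 4);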

and

Code:

# print is used instead of printf so the literal "%" is not read as a format conversion.
if ($usage >= $crit) {
	print "CRITICAL - CPU usage at $usage%|Percent=$usage\n";
	exit($STATUSCODE{"CRITICAL"});
}
elsif ($usage >= $warn) {
	print "WARNING - CPU usage at $usage%|Percent=$usage\n";
	exit($STATUSCODE{"WARNING"});
}
else {
	print "OK - CPU usage at $usage%|Percent=$usage\n";
	exit($STATUSCODE{"OK"});
}

So it is just returning a simple percentage, and we'd love to be able to see an average for the host group, sorted by average usage. I guess I'll look at the PHP... gulp
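For readers who do not read Perl, the two excerpts above can be sketched in Python. This is a rough equivalent under stated assumptions, not the actual plugin: the function names are hypothetical, `idle_index=15` assumes the 17-column vmstat layout implied by the Perl field count, and the exit codes 0/1/2 are the standard Nagios values, assumed to match `%STATUSCODE`.

```python
def cpu_usage_from_vmstat(lines, idle_index=15):
    """Average the idle column over the numeric sample rows; return 100 - idle."""
    idles = []
    for line in lines:
        fields = line.split()
        # Skip header and separator rows: keep rows made entirely of numbers.
        if not fields or not all(f.replace(".", "", 1).isdigit() for f in fields):
            continue
        idles.append(float(fields[idle_index]))
    if not idles:
        raise ValueError("no numeric vmstat samples found")
    return 100 - sum(idles) / len(idles)


def nagios_result(usage, warn, crit):
    """Return (exit_code, plugin_output) for the given thresholds."""
    if usage >= crit:
        state, code = "CRITICAL", 2
    elif usage >= warn:
        state, code = "WARNING", 1
    else:
        state, code = "OK", 0
    # Text before "|" is the status line; text after it is the perfdata.
    return code, f"{state} - CPU usage at {usage:g}%|Percent={usage:g}"
```

Unlike the Perl excerpt, the sketch divides by the number of samples it actually parsed rather than a fixed 4, so a short read does not skew the average.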

Post by **abrist** » Mon Aug 19, 2013 11:39 am

BanditBBS wrote:That reasoning is a pretty big generalization. None of our AIX admins care one bit about load and they only really care about CPU usage if it is "cooking" for over an hour or so.

Fair enough. I meant no offense; I just wanted to explain the decision behind the component's use of load instead of CPU utilization for *nix boxes.
I am looking into the PHP to see if there is an easy way to include your check results' performance data in the component.

Post by **BanditBBS** » Mon Aug 19, 2013 11:45 am

abrist wrote:
BanditBBS wrote:That reasoning is a pretty big generalization. None of our AIX admins care one bit about load and they only really care about CPU usage if it is "cooking" for over an hour or so.
Fair enough. I meant no offense; I just wanted to explain the decision behind the component's use of load instead of CPU utilization for *nix boxes.
I am looking into the PHP to see if there is an easy way to include your check results' performance data in the component.

I wasn't yelling, I was just calling it a generalization.

I'm looking into it as well.

Post by **BanditBBS** » Mon Aug 19, 2013 12:14 pm

What conditions need to be true for the metrics component to treat the information as CPU Usage stats?

Post by **abrist** » Mon Aug 19, 2013 12:55 pm

I just tested this. As you are returning a percentage, you will want to use the "CPU Usage" (Windows CPU metric) so that sorts work correctly.

1. Name the service check description "CPU Usage"
2. Make sure the perfdata ds label is named "5 min avg Load" or, better yet, change the PHP file: /usr/local/nagiosxi/html/includes/utils-metrics.inc.php

Lines 166 and 219:
From:

Code:

if(preg_match("/5 min avg Load/",$perfdata)>0)

To:

Code:

if(preg_match("/(5 min avg Load|<your perfdata ds label here>)/",$perfdata)>0)

EDIT: Essentially, the component grabs services with a specific name and then greps the performance datasource names for "5 min avg Load". We need to make the regex an "or" statement and then include your performance datasource label so that you do not have to lose your perfdata or change the checks.
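The "or" regex described above can be illustrated outside the component with a small Python sketch. It is not the component's code: the helper names are hypothetical, and "Percent" is used as the second label only because that is the perfdata label the check posted earlier in this thread emits.

```python
import re


def build_label_pattern(labels):
    """Build one alternation pattern that matches any of the given ds labels."""
    # Escape each label so characters such as "." or "(" are matched literally.
    return re.compile("(" + "|".join(re.escape(label) for label in labels) + ")")


# The stock label plus the custom plugin's label.
PATTERN = build_label_pattern(["5 min avg Load", "Percent"])


def is_cpu_metric(perfdata):
    """True when the perfdata string contains one of the accepted labels."""
    return PATTERN.search(perfdata) is not None
```

Because the match is a plain substring search, a short label such as "Percent" will also match any other perfdata string that happens to contain that word, so a more specific label is safer where one is available.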

Post by **BanditBBS** » Mon Aug 19, 2013 2:05 pm

abrist wrote:I just tested this. As you are returning a percentage, you will want to use the "CPU Usage" (Windows CPU metric) so that sorts work correctly.

1. Name the service check description "CPU Usage"
2. Make sure the perfdata ds label is named "5 min avg Load" or, better yet, change the PHP file: /usr/local/nagiosxi/html/includes/utils-metrics.inc.php

Lines 166 and 219:
From:
Code:
if(preg_match("/5 min avg Load/",$perfdata)>0)
To:
Code:
if(preg_match("/(5 min avg Load|<your perfdata ds label here>)/",$perfdata)>0)
EDIT: Essentially, the component grabs services with a specific name and then greps the performance datasource names for "5 min avg Load". We need to make the regex an "or" statement and then include your performance datasource label so that you do not have to lose your perfdata or change the checks.

Works like a champ. I was just about to correct you and say line 219, not 229, but I see you already corrected that.

FYI - The service is labeled CPU Stats not CPU Usage for the AIX servers and they show up fine. Just modifying that PHP file has given me the desired effect! My last question for you: this is only showing the current numbers, correct? We can't say "show me avg CPU % over past week"? That is not a function of the metrics, right?

Post by **abrist** » Mon Aug 19, 2013 2:12 pm

BanditBBS wrote:Works like a champ. I was just about to correct you and say line 219, not 229, but I see you already corrected that.

Yeah, I think I edited it like 4 times . .

BanditBBS wrote:That is not a function of the metrics, right?

Nope. As the data is just pulled from the most recent check result in the RRD, that is all you get. This could be changed, but it would be a deeper edit to the PHP. Unfortunately, I am sure that it would require custom development.
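Outside the component, one possible way to get a weekly figure is to average the values that `rrdtool fetch <file> AVERAGE --start -1w` prints for the service's RRD. The sketch below only parses that text output; the function name and the column index are hypothetical, the RRD path is deliberately left out, and none of this reflects how the metrics component itself works.

```python
import math


def average_from_fetch(output, column=0):
    """Average one data column of `rrdtool fetch` text output, skipping NaN rows."""
    values = []
    for line in output.splitlines():
        timestamp, sep, rest = line.partition(":")
        # Data rows look like "<epoch>: <value> ..."; skip the header and blanks.
        if not sep or not timestamp.strip().isdigit():
            continue
        fields = rest.split()
        if column >= len(fields):
            continue
        value = float(fields[column])  # unknown samples parse as NaN
        if not math.isnan(value):
            values.append(value)
    return sum(values) / len(values) if values else None
```

The function returns `None` when the window holds no known samples, which lets the caller distinguish "no data" from a genuine 0% average.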

BanditBBS wrote:FYI - The service is labeled CPU Stats not CPU Usage for the AIX servers and they show up fine

Interesting, I changed my service description to something else and it disappeared from the metrics ui. Maybe I was just impatient?

Nagios Support Forum
