
NagiosXI didn't trigger a notification

Posted: Thu Apr 08, 2021 8:02 am
by apteancloud
Hi Team,

NagiosXI didn't trigger a notification for a CPU load spike on one of our servers in Azure. Azure metrics show CPU utilization spiking up to 98%, and because of the spike the server hung for an hour and we had to reboot it. The Nagios plugin check we are using is shown below.

No alert was triggered in Nagios; for that time frame, the Nagios performance graph showed 17%.

Code:

[nagios@NagiosXIAzPrd ~]$ /usr/local/nagios/libexec/check_nt -H 10.179.1.33 -p 12489 -s "sprt575" -v CPULOAD -l 15,85,90
CPU Load 3% (15 min average) | '15 min avg Load'=3%;85;90;0;100

Attached are both the Azure metric graph and the Nagios performance graph for the same time frame. Please take a look.

PFA

Thanks in Advance

Re: NagiosXI didn't trigger a notification

Posted: Thu Apr 08, 2021 3:54 pm
by dchurch
How many CPUs are in the host?

Because the spike only reached ~20%, it seems to me that NSClient is calculating the load across all CPUs (with an absolute maximum of 100%), whereas I think your assumption was that it would be reported per individual CPU, e.g. 100% for 1 CPU pegged, 200% for 2 CPUs pegged, etc.
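
To make the two conventions concrete, here is a minimal Python sketch. It is only an illustration: the function names are hypothetical, and it assumes per-CPU utilization values are available as percentages, which is not something NSClient is confirmed to expose in this form.

```python
def load_averaged(per_cpu_pct):
    """Utilization averaged across CPUs: tops out at 100 regardless of CPU count."""
    return sum(per_cpu_pct) / len(per_cpu_pct)


def load_summed(per_cpu_pct):
    """Utilization summed across CPUs: tops out at 100 * number of CPUs."""
    return sum(per_cpu_pct)


# Hypothetical host with 4 CPUs, one of them fully pegged.
four_cpus_one_pegged = [100, 0, 0, 0]
print(load_averaged(four_cpus_one_pegged))
print(load_summed(four_cpus_one_pegged))
```

With one of four CPUs pegged, the averaged figure is 25% while the summed figure is 100%, so a threshold of 85 means very different things under the two conventions.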

You could try lowering the averaging window to, say, 5 minutes, and decreasing the check interval too, so the argument would become -l 5,85,90. With a 15 minute average, a CPU starting from idle would have to be pegged for roughly 7.5 minutes straight just to get the needle to move to 50%.
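
The averaging arithmetic can be sketched in Python as follows. This assumes the reported value is a plain mean over the window and that the CPU sits at a constant baseline outside the spike; how NSClient actually samples is not confirmed here, and the function names are hypothetical.

```python
def windowed_average(baseline_pct, spike_pct, spike_minutes, window_minutes):
    """Mean CPU % over a window where the CPU sat at spike_pct for
    spike_minutes and at baseline_pct for the rest of the window."""
    spike_minutes = min(spike_minutes, window_minutes)
    quiet_minutes = window_minutes - spike_minutes
    total = spike_pct * spike_minutes + baseline_pct * quiet_minutes
    return total / window_minutes


def minutes_to_reach(target_pct, baseline_pct, spike_pct, window_minutes):
    """Minutes the CPU must stay at spike_pct before the windowed mean
    reaches target_pct (assumes baseline_pct < target_pct <= spike_pct)."""
    return window_minutes * (target_pct - baseline_pct) / (spike_pct - baseline_pct)


# Pegged for half of a 15 minute window, idle otherwise.
print(windowed_average(0, 100, 7.5, 15))
# Time pegged needed to cross the 85% warning threshold, for each window size.
print(minutes_to_reach(85, 0, 100, 15))
print(minutes_to_reach(85, 0, 100, 5))
```

Under these assumptions, a fully pegged CPU needs 12.75 minutes to push a 15 minute average past the 85% warning threshold, but only 4.25 minutes with a 5 minute average.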

I'm not sure why the value differs between the Azure console and what Nagios captured. Perhaps NSClient is miscounting the CPUs? You could try decreasing the thresholds to 12% to work around this.
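
As a rough illustration of why the 17% reading never alerted, here is a Python sketch of a warning/critical comparison. It assumes the check fires when the value is greater than or equal to the threshold, which should be verified against the plugin itself; the function name is hypothetical, and the 90 used as the critical value in the second call is only a placeholder. The return values follow the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL).

```python
OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin exit codes


def classify(value_pct, warn_pct, crit_pct):
    """Map a CPU load reading to a plugin state.

    Assumption: a reading at or above a threshold triggers that state.
    """
    if value_pct >= crit_pct:
        return CRITICAL
    if value_pct >= warn_pct:
        return WARNING
    return OK


# The 17% reading from the performance graph against the original 85/90 thresholds.
print(classify(17, 85, 90))
# The same reading with the warning threshold lowered to 12%.
print(classify(17, 12, 90))
```

The first call returns OK, which matches the missing notification; the second returns WARNING.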

Really, though, NSClient is deprecated, insecure, and hasn't been maintained since 2014. I'd consider replacing it with NCPA or NSClient++. You may have better results with NCPA, since I know it actually gives you the option to report load either averaged across CPUs or summed.