Notification sent before threshold breach

conston_rd · Post by **conston_rd** » Wed Apr 15, 2020 8:20 am

Hi,

We are running Nagios XI 5.6.6 on CentOS 7.
The Core version is 4.4.3.
One of our sysadmins complained that they received a disk utilization notification while the actual utilization was below the threshold.

Looking at the Nagios XI performance graph, we could not see the utilization breaching the threshold.

/usr/local/nagios/var/nagios.log shows that the service state went directly from WARNING to "CRITICAL HARD", with no "CRITICAL SOFT" in between.

Kindly help in identifying the issue.

service definition:

######################################################

define service {
host_name XXXXXXXX
service_description Root Volume
check_period 24x7
check_command check_xi_hpe_ncpa_disk!-t 5nidNag -p 5693!disk/logical/!|!/used_percent!-w 80 -c 90!!!
contacts servicenow_integration
notification_period 24x7
initial_state o
importance 0
check_interval 1.000000
retry_interval 1.000000
max_check_attempts 3
is_volatile 0
parallelize_check 1
active_checks_enabled 1
passive_checks_enabled 1
obsess 1
event_handler_enabled 1
low_flap_threshold 0.000000
high_flap_threshold 0.000000
flap_detection_enabled 1
flap_detection_options a
freshness_threshold 0
check_freshness 0
notification_options r,w,c
notifications_enabled 1
notification_interval 480.000000
first_notification_delay 0.000000
stalking_options n
process_perf_data 1
retain_status_information 1
retain_nonstatus_information 1
_AG Platform#Alerts
}

####################################################

nagios.log entry

[1586934000] CURRENT SERVICE STATE: s1l00103g;Ncpa_Agent_Status;OK;HARD;1;HTTP OK: HTTP/1.1 200 OK - 184 bytes in 0.026 second response time
[1586934000] CURRENT SERVICE STATE: s1l00103g;Root Volume;WARNING;HARD;3;WARNING: Used_percent was 82.50 %
[1586934000] CURRENT SERVICE STATE: s1l00103g;SSH Service Monitoring;OK;HARD;1;SSH OK - OpenSSH_7.4 (protocol 2.0)
[1586934000] CURRENT SERVICE STATE: s1l00103g;Swap Usage;OK;HARD;1;OK: Used swap was 9.60 % (Total: 3.10 GiB, Used: 0.30 GiB, Free: 2.80 GiB)
[1586940698] SERVICE NOTIFICATION: servicenow_integration;s1l00103g;Root Volume;WARNING;notify_servicenow_service;WARNING: Used_percent was 82.50 %
[1586943165] SERVICE NOTIFICATION: servicenow_integration;s1l00103g;Root Volume;CRITICAL;notify_servicenow_service;CRITICAL: Used_percent was 91.50 %
[1586943165] SERVICE ALERT: s1l00103g;Root Volume;CRITICAL;HARD;3;CRITICAL: Used_percent was 91.50 %
[1586943224] SERVICE NOTIFICATION: servicenow_integration;s1l00103g;Root Volume;WARNING;notify_servicenow_service;WARNING: Used_percent was 82.80 %
[1586943224] SERVICE ALERT: s1l00103g;Root Volume;WARNING;HARD;3;WARNING: Used_percent was 82.80 %
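
For anyone reading the excerpt above, the epoch timestamps can be converted and compared with a few lines of Python. This is only a rough sketch: the two lines are copied from the log above, while the `parse_alert` helper is a hypothetical name introduced here, and it assumes the usual `[epoch] SERVICE ALERT: host;service;state;type;attempt;output` layout seen in this excerpt.

```python
from datetime import datetime, timezone

# Two SERVICE ALERT lines copied from the nagios.log excerpt above.
LOG_LINES = [
    "[1586943165] SERVICE ALERT: s1l00103g;Root Volume;CRITICAL;HARD;3;CRITICAL: Used_percent was 91.50 %",
    "[1586943224] SERVICE ALERT: s1l00103g;Root Volume;WARNING;HARD;3;WARNING: Used_percent was 82.80 %",
]


def parse_alert(line):
    """Split a SERVICE ALERT line into (epoch, state, state_type, attempt).

    Hypothetical helper; assumes the field layout shown in the excerpt.
    """
    stamp, rest = line.split("] ", 1)
    epoch = int(stamp.lstrip("["))
    _, payload = rest.split(": ", 1)
    host, service, state, state_type, attempt, output = payload.split(";", 5)
    return epoch, state, state_type, int(attempt)


alerts = [parse_alert(line) for line in LOG_LINES]
for epoch, state, state_type, attempt in alerts:
    when = datetime.fromtimestamp(epoch, tz=timezone.utc)
    print(when.isoformat(), state, state_type, attempt)

# Time between the CRITICAL alert and the return to WARNING.
critical_seconds = alerts[1][0] - alerts[0][0]
print("CRITICAL lasted about", critical_seconds, "seconds")
```

Going by these two lines alone, the CRITICAL state was logged for roughly one check interval before the service returned to WARNING.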

####################

jbrunkow · Post by **jbrunkow** » Wed Apr 15, 2020 11:08 am

The check should return that result in a soft state, then hard after the same result is returned a second time. I do not see evidence of that in the nagios.log you posted.

Can you please gather up the event log? That will tell us if the soft state is being skipped for some reason. Open your Nagios XI instance, navigate to the Reports page using the top navigation bar, then click Event Log under Available Reports or Legacy Reports to pull a report of all events. Make sure to choose a wide enough time range so that we can compare it against past behavior.

I do not see any bugs in that particular version that sound like what you are experiencing, but it may not hurt to update.
CHANGELOG=https://github.com/NagiosEnterprises/na ... /Changelog

conston_rd · Post by **conston_rd** » Wed Apr 15, 2020 12:22 pm

Thank you for your response. I have uploaded the event log for that server for the last 15 days.

jbrunkow · Post by **jbrunkow** » Wed Apr 15, 2020 3:17 pm

I'm sorry, but I asked for the wrong report! Silly me.

I meant to ask for the State History report. Can you please share that with me instead?

That report will tell us if the soft state is being skipped for some reason. Open your Nagios XI instance, navigate to the Reports page using the top navigation bar, then click State History under Available Reports to pull more state data. Make sure to choose a wide enough time range so that we can compare it against past behavior.

Thank you for your participation!

conston_rd · Post by **conston_rd** » Mon Apr 20, 2020 3:08 am

I have uploaded the state history screenshot covering a one-week timeframe.

Also, were you able to check why the performance report does not contain the data for the breached threshold?

Thanks

jbrunkow · Post by **jbrunkow** » Mon Apr 20, 2020 4:53 pm

That is actually expected behavior.

Nagios is designed to still trigger a notification after bobbing between states for a certain number of checks. This is so that administrators are notified of events like flapping. Unfortunately, this logic doesn't make quite as much sense in the context of disk usage.

Please refer to the following linked documentation about how Nagios 4 determines states.
https://assets.nagios.com/downloads/nag ... types.html
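
As a rough illustration of the state-type rules described in that documentation, here is a simplified Python model. It is a sketch under stated assumptions, not the Nagios Core implementation: `next_state` is a hypothetical function, and it only models the counting of check attempts plus the documented rule that a transition from one hard error state to another error state (e.g. WARNING to CRITICAL) is itself a hard state.

```python
OK = "OK"


def next_state(prev, new_status, max_check_attempts):
    """Return (status, state_type, attempt) after one check result.

    prev is the previous (status, state_type, attempt) tuple.
    Simplified model for illustration only, not Nagios source code.
    """
    prev_status, prev_type, prev_attempt = prev
    if new_status == OK:
        # Recovering from a SOFT problem is a soft recovery;
        # every other OK result is treated as HARD here.
        if prev_status != OK and prev_type == "SOFT":
            return (OK, "SOFT", 1)
        return (OK, "HARD", 1)
    if prev_status == OK:
        attempt = 1  # first non-OK result starts the retry count
    elif prev_type == "HARD":
        # One hard error state changing to another error state stays HARD.
        return (new_status, "HARD", max_check_attempts)
    else:
        attempt = prev_attempt + 1  # still retrying a SOFT problem
    state_type = "HARD" if attempt >= max_check_attempts else "SOFT"
    return (new_status, state_type, attempt)


# Replay a sequence like the one in the posted log, with max_check_attempts 3.
state = (OK, "HARD", 1)
history = []
for result in ["WARNING", "WARNING", "WARNING", "CRITICAL", "WARNING"]:
    state = next_state(state, result, 3)
    history.append(state)
    print(state)
```

Under this model, a service that is already in WARNING HARD goes straight to CRITICAL HARD on a single CRITICAL result, which is consistent with the `CRITICAL;HARD;3` line in the posted nagios.log.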

If you adjust the value passed to the --critical option on the command line, you can change the threshold that triggers a critical notification.

Please refer to the following linked document for more information on how to manage plugins in Nagios XI.
https://assets.nagios.com/downloads/nag ... ios-XI.pdf

conston_rd · Post by **conston_rd** » Tue Apr 21, 2020 11:31 am

Thank you for the info.

Can you also let us know why this spike does not appear in the performance graph?
According to the attached graph, the threshold was never breached.

Thanks

jbrunkow · Post by **jbrunkow** » Tue Apr 21, 2020 2:10 pm

It is probably worth mentioning that both of those options take either an integer or a percentage as an argument. Your command does not appear to include a percent sign, so the plugin could be interpreting your units incorrectly. Can you please try adding a percent sign to the value to see whether it is then recognized as a percentage?

Code: Select all

-w, --warning=INTEGER
    Exit with WARNING status if less than INTEGER units of disk are free
-w, --warning=PERCENT%
    Exit with WARNING status if less than PERCENT of disk space is free
-c, --critical=INTEGER
    Exit with CRITICAL status if less than INTEGER units of disk are free
-c, --critical=PERCENT%
    Exit with CRITICAL status if less than PERCENT of disk space is free

Another way to dig deeper into this problem would be to examine the script itself to see exactly how the options are handled and how the states are determined.

conston_rd · Post by **conston_rd** » Wed Apr 22, 2020 1:42 am

This check is working fine and we are receiving notifications.
My question is: when Nagios says the threshold has been breached, that should be visible in the graph, which didn't happen here.

We will have to justify the alert to the sysadmins using the graph data.

If there were an issue with the service check command, this service check would have had issues from day one, which is not the case.

Please look into this and let us know why the data is missing from the graphs.

If you want me to look into any other logs, let me know and I will share them.

jbrunkow · Post by **jbrunkow** » Wed Apr 22, 2020 9:22 am

The reason the graph doesn't exactly match the alert is that the data is ingested into a round robin database. This type of database allows data to be approximated over time, and alleviates the burden of storing data indefinitely. The spike that triggered your notification became a lower peak when combined with past data.

The Wikipedia page on the subject may clear things up.
ref= https://en.wikipedia.org/wiki/RRDtool

Is that what you need to know? Sorry I did not express that earlier.
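
To make the averaging effect concrete, here is a small Python sketch. It is only an illustration under assumptions: the 91.50 and 82.x values come from the log posted earlier in the thread, but the arrangement of the one-minute samples and the five-samples-per-row AVERAGE consolidation are hypothetical, since the actual RRD settings of this installation are not shown here. `consolidate_average` is a made-up helper, not an RRDtool API.

```python
def consolidate_average(samples, points_per_row):
    """Average consecutive groups of samples, similar in spirit to an
    AVERAGE consolidation in a round robin archive (illustrative only)."""
    rows = []
    for start in range(0, len(samples), points_per_row):
        chunk = samples[start:start + points_per_row]
        rows.append(sum(chunk) / len(chunk))
    return rows


# Hypothetical one-minute readings around the alert; only the 91.5 spike
# and the 82.x values are taken from the posted nagios.log.
one_minute_samples = [82.5, 82.5, 91.5, 82.8, 82.8]

# Assumed consolidation: five one-minute samples per stored row.
five_minute_rows = consolidate_average(one_minute_samples, 5)
print(five_minute_rows)
```

With these assumed numbers, the single stored value comes out at roughly 84.4, below the critical threshold of 90, even though one raw sample was 91.5. That is why a one-minute spike can trigger a notification and still not show on the graph.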

Nagios Support Forum
