
Missing notifications for certain alarms

Posted: Thu Sep 03, 2020 8:47 am
by nms
Hi
We have an OpsGenie setup hooked into our Nagios XI installation, and we're having issues with alarms that have cleared in Nagios not clearing on the OpsGenie side.
The main culprits are alarms resulting from connection issues between Nagios and the remote server, particularly alarms such as:

Code:

(Return code of 255 for service '403_S-VMaas-Mevo004-NbConnections' on host 'vip-whm-msd01-p_v-vmaas' was out of bounds)
We can see the OK status in the "Service History", but no OK is sent in the "Service Notifications".
Is this a known issue, and is there a recommended configuration to handle it?

Please let me know if you need any other information/details.

Thanks!
Martin

Re: Missing notifications for certain alarms

Posted: Thu Sep 03, 2020 5:10 pm
by benjaminsmith
Hi Martin,

If you can send me your system profile and the exact name of the main culprits, I'll take a look at the settings for you. Thanks! Benjamin

To send us your system profile:
Log in to the Nagios XI GUI using a web browser.
Click the "Admin" > "System Profile" menu.
Click the "Download Profile" button.
Save the profile.zip file and share it in a private message or upload it to the post/ticket, and then reply to this post to bring it up in the queue.

Re: Missing notifications for certain alarms

Posted: Fri Sep 04, 2020 2:32 am
by nms
Hi, I'll PM you the profile in 2 minutes.
Some of the services that raise this out of bounds alarm and then miss the recovery notification:
vip-whm-msd01-p_v-vmaas 510_S-ipops-SMSE011-SMPP-SubmitSent
vip-whm-msd01-p_v-vmaas 403_S-VMaas-Mevo004-NbConnections
bru-cub-vws01-p_v-ncc 400_T-NCC-SolSparc011-PortAvailability [Oracle]
Thanks

Re: Missing notifications for certain alarms

Posted: Fri Sep 04, 2020 5:30 pm
by benjaminsmith
Hi @nms,

The out of bounds error usually means the plugin did not exit with one of the expected exit codes. I noticed all of those services are using an SSH plugin. Can you run one of the failing checks from the command line, and then type the following so you can see the exit code?

Code:

echo $?
An exit code of 3 is for an unknown or unreachable state and you can set up your service to then send a notification to Opsgenie for this state condition.

https://nagios-plugins.org/doc/guidelines.html
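
As a rough illustration of the exit-code check described above, the sketch below maps a plugin's return code to the state names used in the plugin guidelines (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function name nagios_state is made up for this example, and the commented check_by_ssh path and arguments are placeholders, not values taken from the profile.

```shell
# Map a plugin exit code to the state name from the plugin guidelines.
# nagios_state is a hypothetical helper name, not part of Nagios.
nagios_state() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        3) echo "UNKNOWN" ;;
        *) echo "OUT OF BOUNDS" ;;   # anything else, e.g. 255
    esac
}

# Assumed usage: run the failing check by hand, then pass its exit code.
# The path and arguments below are placeholders for the real command line.
# /usr/local/nagios/libexec/check_by_ssh -H <host> -C "<remote command>"
# nagios_state "$?"
```

Any code outside 0 to 3 falls into the last branch, which corresponds to the "out of bounds" message Nagios logs.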

Re: Missing notifications for certain alarms

Posted: Mon Sep 07, 2020 3:03 am
by nms
Hi, thanks for the reply.

We do get these errors occasionally, particularly for monitored sites that are geographically very distant (we're in Europe and the site is in New Zealand, for example). It's not reproducible but happens quite often. If it raised exit code 3 that would be OK, as we can manage that, but it doesn't: we get critical alarms for it. We don't have access to the source for the "check_by_ssh" plugin and aren't aware of such an option, but do you know if it has something similar to the "check_nrpe" plugin, which has:
-u, --unknown-timeout Make connection problems return UNKNOWN instead of CRITICAL
All that is an issue we're trying to work out, but it's not the main problem I'm raising here.
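
For the out-of-range return code itself, one possible workaround (a sketch only, not a confirmed check_by_ssh option) is a small wrapper that runs the check and reports anything above 3 as UNKNOWN. The ssh client exits with 255 when the connection itself fails, which matches the return code in the alarm. The function name remap_exit is hypothetical.

```shell
# Hypothetical wrapper: run a check command and report any out-of-range
# exit code (greater than 3, e.g. 255 from a failed ssh connection) as UNKNOWN.
remap_exit() {
    "$@"                      # run the wrapped check with its arguments
    rc=$?
    if [ "$rc" -gt 3 ]; then
        echo "Wrapper: original return code was $rc, reporting UNKNOWN"
        return 3
    fi
    return "$rc"              # 0..3 are passed through unchanged
}
```

Wrapped this way, a connection failure would arrive in Nagios as UNKNOWN rather than as an out of bounds CRITICAL, so it could be filtered with the usual notification options.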


The problem here is that it raises an alert for whatever reason - at that point it sends it to OpsGenie no problem. But when it's cleared, more often than not no OK notification is sent, meaning that our OpsGenie gets out of sync.

Any thoughts on getting that notification please?

Thanks

Re: Missing notifications for certain alarms

Posted: Tue Sep 08, 2020 5:41 pm
by benjaminsmith
Hi @nms,
"The problem here is that it raises an alert for whatever reason - at that point it sends it to OpsGenie no problem. But when it's cleared, more often than not no OK notification is sent, meaning that our OpsGenie gets out of sync."
Ok, that makes perfect sense. Normally this is a configuration issue. I did check your configuration and you have enabled recovery notices, so those should be going out.

If you have the OpsGenie contact set up as an XI user account, make sure you have the proper preferences set, as there is another layer to the notification settings.

https://assets.nagios.com/downloads/nag ... ios-XI.pdf

Can you go to Home > Incident Management > Notifications and pull a report on the server in question and the time frame, to verify whether a notification was sent?

Additionally, go to Reports > State History and pull a report on this service and time frame to verify that it did go into a hard non-OK state and then recover. Be sure to select Both for State Type in the options panel.

Re: Missing notifications for certain alarms

Posted: Wed Sep 09, 2020 5:09 am
by nms
Hi
Thanks for the reply.
OpsGenie doesn't have a user, it's just a contact, so I don't think there are any other notification options?

For the service notifications and service state history, here is a recent example: the state history shows both the alarm and the clear, but there is only a notification for the alarm. I've put the examples below.
service state history.png
service notifications.png
Please let me know if you would like more examples, or what thoughts you have on a solution given what I've provided.

Thanks

Re: Missing notifications for certain alarms

Posted: Thu Sep 10, 2020 2:49 am
by nms
Hi Benjamin,
Wondering if you've had a chance to look at this please?

Thanks

Re: Missing notifications for certain alarms

Posted: Thu Sep 10, 2020 10:10 am
by benjaminsmith
Hi @nms,

Thank you for those reports. I've noticed a couple of items to take a look at. First, in the State History report, the service went OK at 4:38 but it was a Soft state, so it would not send the recovery email until it was in a hard OK state.

To figure this part out, it would be helpful to get the whole nagios.log files from that period. The nagios log is rotated into archives after 24 hours. You'll find these files in the following directory.

Code:

/usr/local/nagios/var/archives
If you can upload the files nagios-09-08-2020-00.log and nagios-09-09-2020-00.log, I can take a closer look to make sure the service did recover.
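
To narrow those logs down before uploading, something like the sketch below could pull only the state changes and notifications for one service. It assumes the standard log line layout, where SERVICE ALERT lines carry host;service;state;SOFT or HARD;attempt;output and SERVICE NOTIFICATION lines carry contact;host;service;state;command;output. The function name service_events is made up for this example.

```shell
# Print SERVICE ALERT and SERVICE NOTIFICATION lines for one host/service pair.
# Usage: service_events <host> <service> <logfile>...
# service_events is a hypothetical helper name.
service_events() {
    host=$1
    service=$2
    shift 2
    # -F treats the host;service; pair as a fixed string, so characters such
    # as [ ] in a service name are not read as a pattern; -h drops file names.
    grep -hF -- "${host};${service};" "$@" | grep -E 'SERVICE (ALERT|NOTIFICATION):'
}

# Example with a host and service named earlier in this thread:
# service_events vip-whm-msd01-p_v-vmaas 403_S-VMaas-Mevo004-NbConnections \
#     /usr/local/nagios/var/archives/nagios-09-09-2020-00.log
```

Comparing the SOFT/HARD field on each alert line against the notification lines around it should show whether the problem ever reached a hard state before the OK came in.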

Re: Missing notifications for certain alarms

Posted: Fri Sep 11, 2020 2:36 am
by nms
Hi Benjamin
Thanks for the reply - have to say I was wondering what was going on with the SOFT OK...
Attaching the requested logs.