Service Recovery is not triggering for some of the services

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
Locked
vishal313
Posts: 50
Joined: Wed Dec 18, 2019 10:23 pm

Service Recovery is not triggering for some of the services

Post by vishal313 »

Hi All,

We run Nagios XI 5.8.5 and monitor servers (Windows as well as Unix) for CPU, memory, disk, etc.
Nagios XI 5.8.5 is integrated with a ticketing tool that raises incidents when thresholds are breached.

Recently we found that some of the monitored services are not triggering "Service Recovery" notifications.
The service configuration has the "Recovery" option enabled for triggering a "Service Recovery" notification. This does not affect all services, only some, but the count is increasing over time.

This is a concern because when the Service Recovery notification is not triggered, the incident raised in the ticketing tool is not auto-resolved and we have to mark it resolved manually.

Can you please help me understand and fix why the Service Recovery notification is not triggered once the service comes back to the OK state?

Thanks in advance.

Regards
Vishal Dhote
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: Service Recovery is not triggering for some of the servi

Post by ssax »

Please PM me a copy of your profile.zip so that I can review some information. You can download it from Admin > System Profile by clicking the Download Profile button.

PM me the nagios archive file from the last time that this occurred as well:
- NOTE: The files are rotated at midnight the next day so if you're looking for data from the 15th, send the file named with the 16th.

/usr/local/nagios/var/archives/nagios-XX-XX-XXXX-00.log

Include the exact host name, service name, and contact name that should have been sent the recovery but wasn't.
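
To work out which archive file covers a given day, the rotation note above can be sketched in a few lines of Python. This is only a minimal sketch: it assumes the nagios-MM-DD-YYYY-00.log naming pattern and the archive directory shown above, and the helper name archive_for is made up for illustration.

```python
from datetime import date, timedelta

# Directory taken from the path quoted above.
ARCHIVE_DIR = "/usr/local/nagios/var/archives"


def archive_for(event_day: date) -> str:
    """Return the archive file expected to hold events from event_day.

    Assumes the log is rotated at midnight, so the file that contains
    a given day's events carries the NEXT day's date in its name.
    """
    rotated = event_day + timedelta(days=1)
    return f"{ARCHIVE_DIR}/nagios-{rotated:%m-%d-%Y}-00.log"


# Events from the 15th should be in the file named with the 16th.
print(archive_for(date(2021, 9, 15)))
```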

Please run the Reports > Notifications report for that time period against the host. Does it show an attempt to send the recovery notification to the contact (even though they didn't receive it)?
pbroste
Posts: 1288
Joined: Tue Jun 01, 2021 1:27 pm

Re: Service Recovery is not triggering for some of the servi

Post by pbroste »

Hello @vishal313

Thanks for reaching out. I have some suggestions that I want to go over.

The suggestion is to look at the 'eventman.log' and view the events logged during the 'Service Recovery Notification'.

Here is an example that you will see:
...........
[hoststate] => UP
[hoststateid] => 0
[hosteventid] => 709709
[hostproblemid] => 0
[servicestate] => OK
[servicestateid] => 0
[lastservicestate] => WARNING
[lastservicestateid] => 1
[servicestatetype] => SOFT or HARD
.......

To view the log:

less -SR /usr/local/nagiosxi/var/eventman.log

I suspect the service went directly to 'hard critical' while the host was 'soft down', and then the service recovered.
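
If you save an excerpt of the log, a short script can help pick out recoveries that happened from a SOFT state. The following is only a sketch: it assumes each event is dumped as `[key] => value` lines like the example above, and the function names parse_event and is_soft_recovery are hypothetical.

```python
import re

# Matches lines such as "[servicestate] => OK".
FIELD = re.compile(r"^\s*\[(\w+)\]\s*=>\s*(.*?)\s*$")


def parse_event(text: str) -> dict:
    """Collect `[key] => value` lines from one event dump into a dict."""
    fields = {}
    for line in text.splitlines():
        match = FIELD.match(line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


def is_soft_recovery(event: dict) -> bool:
    """True when a service returned to OK from a non-OK SOFT state."""
    return (
        event.get("servicestate") == "OK"
        and event.get("lastservicestate") not in (None, "OK")
        and event.get("servicestatetype") == "SOFT"
    )


sample = """
[servicestate] => OK
[servicestateid] => 0
[lastservicestate] => WARNING
[lastservicestateid] => 1
[servicestatetype] => SOFT
"""
print(is_soft_recovery(parse_event(sample)))
```

An event flagged this way recovered before reaching a HARD problem state, which is the case where no recovery notification would be expected.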

Here's a further explanation of the logic:
https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/4/en/statetypes.html
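
As a rough illustration of the SOFT/HARD logic described on that page, here is a simplified Python model. It is a sketch under assumptions, not Nagios code: the function name classify is made up, and it ignores the host state, so the "service goes straight to HARD while the host is down" case is not modeled.

```python
def classify(results, max_check_attempts=3):
    """Label each check result as SOFT or HARD (simplified model).

    results: service states in check order, e.g. ["WARNING", "OK"].
    The service is assumed to start in a HARD OK state.
    """
    labelled = []
    prev_state, prev_type, attempt = "OK", "HARD", 1
    for state in results:
        if state == "OK":
            # A recovery takes the type of the problem it ends:
            # SOFT problem -> soft recovery, HARD problem -> hard recovery.
            state_type = prev_type if prev_state != "OK" else "HARD"
            attempt = 1
        elif prev_state == "OK":
            # First failed check: SOFT unless only one attempt is allowed.
            attempt = 1
            state_type = "HARD" if max_check_attempts == 1 else "SOFT"
        else:
            # Still in a problem state: count retries until the limit.
            attempt = min(attempt + 1, max_check_attempts)
            state_type = "HARD" if attempt >= max_check_attempts else "SOFT"
        labelled.append((state, state_type))
        prev_state, prev_type = state, state_type
    return labelled


# Problem clears before max_check_attempts: only SOFT states are seen.
print(classify(["WARNING", "OK"]))
# Problem persists for three checks: HARD problem, then HARD recovery.
print(classify(["CRITICAL", "CRITICAL", "CRITICAL", "OK"]))
```

In the first case the service never reaches a HARD problem state, so neither a problem nor a recovery notification would be expected.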

Thanks,
Perry
vishal313
Posts: 50
Joined: Wed Dec 18, 2019 10:23 pm

Re: Service Recovery is not triggering for some of the servi

Post by vishal313 »

@ssax : I have shared the requested information with you. Please help with your inputs.
vishal313
Posts: 50
Joined: Wed Dec 18, 2019 10:23 pm

Re: Service Recovery is not triggering for some of the servi

Post by vishal313 »

Hi @pbroste : I have shared the files and information with you. Please help me with your inputs.
pbroste
Posts: 1288
Joined: Tue Jun 01, 2021 1:27 pm

Re: Service Recovery is not triggering for some of the servi

Post by pbroste »

Hello @vishal313

Thanks for sending over the System Profile and event logs.

We see that Host: tXXXXXXXX003 with Service: Memory Utilization goes to an 'UNKNOWN' state when the check times out. Per the config, no notification is sent when it goes back to an 'OK' state.

Your options are to find out why the service check is timing out and increase the timeout, or to include 'UNKNOWN' in the notification options so that it triggers a notification. I also want to point out that your 'service_check_timeout_state' in 'nagios.cfg' is set to 'UNKNOWN'. Typically the default is CRITICAL.
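
To illustrate why including 'UNKNOWN' matters, here is a simplified Python model of the rule as I understand it: a recovery is only sent if a problem notification went out earlier for that problem. This is a sketch, not Nagios code. The function name is hypothetical, the letters w, u, c, r stand for the service notification options, and it ignores time periods, downtime, flapping, and re-notification.

```python
def notifications(hard_states, notify_on):
    """Return the notifications a sequence of HARD states would produce.

    hard_states: HARD service states in order, e.g. ["UNKNOWN", "OK"].
    notify_on:   set of notification option letters (w, u, c, r).
    """
    letter = {"WARNING": "w", "UNKNOWN": "u", "CRITICAL": "c"}
    sent = []
    problem_notified = False  # was any problem notification sent?
    previous = "OK"
    for state in hard_states:
        if state == previous:
            continue  # no state change, nothing new to send
        if state == "OK":
            # Recovery is only sent if the problem itself was notified.
            if problem_notified and "r" in notify_on:
                sent.append("RECOVERY")
            problem_notified = False
        elif letter[state] in notify_on:
            sent.append(state)
            problem_notified = True
        previous = state
    return sent


# UNKNOWN is not a notified state: no problem notice, so no recovery.
print(notifications(["UNKNOWN", "OK"], {"w", "c", "r"}))
# With 'u' included, both the problem and the recovery are sent.
print(notifications(["UNKNOWN", "OK"], {"w", "u", "c", "r"}))
```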

We see that Host: bXXXXXXXX071 with Service: CPU Utilization goes to a SOFT state, and when it recovers from SOFT we do not see 'Event Notifications Sent'. 'Max_Check_Attempts' is never reached, so the service does not go to a HARD state. I found this site, which lays out an excellent example.

Thanks,
Perry

vishal313
Posts: 50
Joined: Wed Dec 18, 2019 10:23 pm

Re: Service Recovery is not triggering for some of the servi

Post by vishal313 »

Thank you @pbroste.
Could flapping of the service/host be one of the reasons for this situation?
What would happen if we disabled flap detection globally in Nagios XI? What impact would it have?


Regards
Vishal Dhote
pbroste
Posts: 1288
Joined: Tue Jun 01, 2021 1:27 pm

Re: Service Recovery is not triggering for some of the servi

Post by pbroste »

Hello @vishal313

Thanks for following up. There are 'Global Variables' and 'Object-Specific Variables' for 'Flap Detection Thresholds'.

Flap options:
low_flap_threshold: This directive is used to specify the low state change threshold used in flap detection for this service. If you set this directive to a value of 0, the program-wide value specified by the low_service_flap_threshold directive will be used.
high_flap_threshold: This directive is used to specify the high state change threshold used in flap detection for this service. If you set this directive to a value of 0, the program-wide value specified by the high_service_flap_threshold directive will be used.
flap_detection_enabled: This directive is used to determine whether or not flap detection is enabled for this service. Values: 0 = disable service flap detection, 1 = enable service flap detection.
flap_detection_options: This directive is used to determine what service states the flap detection logic will use for this service. Valid options are a combination of one or more of the following: o = OK states, w = WARNING states, c = CRITICAL states, u = UNKNOWN states.
You will want to set various flapping options to determine what works best for your environment and checks.
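
To give a feel for how the low and high thresholds interact, here is a simplified Python sketch. It is an assumption-laden model rather than the Nagios implementation: it uses an unweighted percent state change (Nagios weights recent changes more heavily), the function names are made up, and the 5.0 and 20.0 values are only example thresholds. As far as I know, other notifications for a service are suppressed while it is flagged as flapping, so this is worth ruling out.

```python
def percent_state_change(states):
    """Unweighted share of consecutive checks whose state differs."""
    if len(states) < 2:
        return 0.0
    changes = sum(1 for a, b in zip(states, states[1:]) if a != b)
    return 100.0 * changes / (len(states) - 1)


def update_flapping(is_flapping, percent, low=5.0, high=20.0):
    """Apply the low/high thresholds with hysteresis.

    Flapping starts above the high threshold and only stops once the
    value drops below the low threshold; in between, nothing changes.
    """
    if not is_flapping and percent > high:
        return True
    if is_flapping and percent < low:
        return False
    return is_flapping


# 21 stored results with a single blip: 2 changes out of 20 possible.
history = ["OK"] * 10 + ["CRITICAL"] + ["OK"] * 10
pct = percent_state_change(history)
print(pct, update_flapping(False, pct), update_flapping(True, pct))
```

A value between the two thresholds leaves the flapping flag as it was, which is why a service can stay flagged for a while after it settles down.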

Thanks,
Perry