Service Recovery is not triggering for some of the services
Posted: Wed Oct 20, 2021 4:00 am
by vishal313
Hi All,
We have Nagios XI 5.8.5. We are monitoring servers (Windows as well as Unix) for CPU, Memory, Disk, etc.
Nagios XI 5.8.5 is integrated with a ticketing tool for raising incidents based on threshold breached events.
Recently we found that "Service Recovery" notifications are not being triggered for some of the monitored services.
The service configuration has the "Recovery" option enabled so that a "Service Recovery" notification is triggered. This is not happening for all services, only some, and the count is increasing as time passes.
This is a concern: because the Service Recovery notification is not triggered, the incident raised in the ticketing tool is not auto-resolved, and we have to mark it resolved manually.
Can you please help me understand and solve why the Service Recovery notification is not triggered once the service comes back to an OK state?
Thanks in advance.
Regards
Vishal Dhote
Re: Service Recovery is not triggering for some of the servi
Posted: Wed Oct 20, 2021 2:04 pm
by ssax
Please PM a copy of your profile.zip so that I can review some information. You can download it from Admin > System Profile by clicking the Download Profile button.
PM me the nagios archive file from the last time that this occurred as well:
- NOTE: The files are rotated at midnight the next day so if you're looking for data from the 15th, send the file named with the 16th.
/usr/local/nagios/var/archives/nagios-XX-XX-XXXX-00.log
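To make the rotation note concrete, here is a minimal Python sketch that works out which archive file covers a given day. The MM-DD-YYYY order in the file name and the helper name `archive_path_for` are assumptions for illustration, so compare the result against the actual file names in your archives directory.

```python
from datetime import date, timedelta

ARCHIVE_DIR = "/usr/local/nagios/var/archives"

def archive_path_for(event_day: date, archive_dir: str = ARCHIVE_DIR) -> str:
    """Return the archive file expected to hold the log data for event_day.

    The log is rotated at midnight, so the file is named after the NEXT day.
    Assumes the XX-XX-XXXX part of the name is month-day-year.
    """
    rotated_on = event_day + timedelta(days=1)
    return f"{archive_dir}/nagios-{rotated_on:%m-%d-%Y}-00.log"

# Data from the 15th lives in the file named with the 16th.
print(archive_path_for(date(2021, 10, 15)))
# /usr/local/nagios/var/archives/nagios-10-16-2021-00.log
```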
Include the exact host name, service name, and contact name that should have been sent the recovery but wasn't.
Please run the Reports > Notifications report for that time period against the host. Does it list the recovery notification as trying to be sent to the contact (even though they didn't receive it)?
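If you want to cross-check the report against the raw log, a Python sketch along these lines can pull the alerts and notifications for one host and service out of an archive file. It assumes the usual semicolon-separated `SERVICE ALERT` and `SERVICE NOTIFICATION` log lines; the function name and the host, service, and contact names in the sample are made up.

```python
import re

# "[epoch] SERVICE ALERT: ..." or "[epoch] SERVICE NOTIFICATION: ..."
LINE_RE = re.compile(r"^\[(\d+)\] (SERVICE ALERT|SERVICE NOTIFICATION): (.*)$")

def service_timeline(lines, host, service):
    """Collect (timestamp, kind, state, detail) tuples for one host/service."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        ts, kind, fields = int(m.group(1)), m.group(2), m.group(3).split(";")
        if len(fields) < 4:
            continue
        if kind == "SERVICE ALERT":
            # host;service;state;state_type;attempt;output
            if fields[0] == host and fields[1] == service:
                events.append((ts, "ALERT", fields[2], fields[3]))
        else:
            # contact;host;service;state;command;output
            if fields[1] == host and fields[2] == service:
                events.append((ts, "NOTIFICATION", fields[3], fields[0]))
    return events

# Hypothetical sample lines, not taken from a real log.
sample = [
    "[1634270400] SERVICE ALERT: web01;CPU Utilization;CRITICAL;HARD;3;CPU at 97%",
    "[1634270400] SERVICE NOTIFICATION: ticketing;web01;CPU Utilization;CRITICAL;notify-service;CPU at 97%",
    "[1634270700] SERVICE ALERT: web01;CPU Utilization;OK;HARD;3;CPU at 12%",
    "[1634270700] SERVICE ALERT: db02;Memory Utilization;OK;HARD;3;Memory at 40%",
]
for event in service_timeline(sample, "web01", "CPU Utilization"):
    print(event)
```

In this sample the last event for web01 is a HARD OK alert with no OK notification after it, which is the pattern to look for. For a real file, pass `open(path)` as `lines`.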
Re: Service Recovery is not triggering for some of the servi
Posted: Wed Oct 20, 2021 2:15 pm
by pbroste
Hello @vishal313,
Thanks for reaching out. I have some suggestions that I want to go over.
The suggestion is to look at the 'eventman.log' and view the events logged during the 'Service Recovery Notification'.
Here is an example that you will see:
...........
[hoststate] => UP
[hoststateid] => 0
[hosteventid] => 709709
[hostproblemid] => 0
[servicestate] => OK
[servicestateid] => 0
[lastservicestate] => WARNING
[lastservicestateid] => 1
[servicestatetype] => SOFT or HARD
.......
cat /usr/local/nagiosxi/var/eventman.log | less -SR
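For a large log, paging by hand gets tedious. As a rough sketch, assuming each event is logged as `[key] => value` lines like the excerpt above, the fields can be read into a dictionary in Python and checked for a hard recovery. Both helper names and the "hard recovery" test are illustrative assumptions, not part of Nagios XI.

```python
import re

# Matches lines such as "[servicestate] => OK"
FIELD_RE = re.compile(r"^\s*\[(\w+)\] => (.*)$")

def parse_event_fields(text):
    """Turn '[key] => value' lines into a dict; other lines are ignored."""
    fields = {}
    for line in text.splitlines():
        m = FIELD_RE.match(line)
        if m:
            fields[m.group(1)] = m.group(2).strip()
    return fields

def is_hard_recovery(fields):
    """True when the service is OK now, was not OK before, and the state is HARD."""
    return (
        fields.get("servicestate") == "OK"
        and fields.get("lastservicestate") not in (None, "OK")
        and fields.get("servicestatetype") == "HARD"
    )

excerpt = """
[servicestate] => OK
[servicestateid] => 0
[lastservicestate] => WARNING
[lastservicestateid] => 1
[servicestatetype] => HARD
"""
fields = parse_event_fields(excerpt)
print(fields["lastservicestate"], "->", fields["servicestate"], is_hard_recovery(fields))
# WARNING -> OK True
```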
I speculate the service went directly to 'hard critical' when the host was 'soft down', then the service recovered.
Here's a further explanation of the logic:
https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/4/en/statetypes.html
Thanks,
Perry
Re: Service Recovery is not triggering for some of the servi
Posted: Fri Oct 22, 2021 4:14 am
by vishal313
@ssax : I have shared the requested information with you. Please help with your inputs.
Re: Service Recovery is not triggering for some of the servi
Posted: Fri Oct 22, 2021 4:24 am
by vishal313
Hi @pbroste : I have shared the files and information with you. Please help me with your inputs.
Re: Service Recovery is not triggering for some of the servi
Posted: Fri Oct 22, 2021 12:55 pm
by pbroste
Hello @vishal313,
Thanks for sending over the System Profile and event logs.
We see that Host: tXXXXXXXX003 with Service: Memory Utilization is going to an 'UNKNOWN' state when the check times out. Per your configuration, no notification is sent when it goes back to an 'OK' state.
Your options are to find out why the service check is timing out and increase the timeout, or to include 'UNKNOWN' in the states that trigger a notification. I also want to point out that 'service_check_timeout_state' in your 'nagios.cfg' is set to 'UNKNOWN'; typically the default is CRITICAL.
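To illustrate why the recovery goes missing in the UNKNOWN case, here is a simplified Python model. It assumes that a recovery notification only goes out if a notification was sent for the preceding problem, which is my understanding of how Nagios Core behaves. The function name is hypothetical, the option letters (w, u, c, r) are the standard notification options, and re-notification intervals are ignored.

```python
STATE_FLAGS = {"WARNING": "w", "UNKNOWN": "u", "CRITICAL": "c"}

def notifications_sent(hard_states, notification_options):
    """Return the notifications a sequence of HARD states would produce.

    notification_options is a string of option letters, e.g. "wcr".
    Simplified model: a RECOVERY is sent only if an earlier problem
    in the same outage was actually notified.
    """
    options = set(notification_options)
    sent = []
    problem_notified = False
    for state in hard_states:
        if state == "OK":
            if problem_notified and "r" in options:
                sent.append("RECOVERY")
            problem_notified = False
        elif STATE_FLAGS[state] in options:
            sent.append(state)
            problem_notified = True
    return sent

# Timed-out check goes UNKNOWN, then recovers.
print(notifications_sent(["UNKNOWN", "OK"], "wcr"))   # []
print(notifications_sent(["UNKNOWN", "OK"], "wucr"))  # ['UNKNOWN', 'RECOVERY']
```

Without `u` in the options nothing is sent for the UNKNOWN state, so there is nothing for the recovery to close out.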
We also see that Host: bXXXXXXXX071 with Service: CPU Utilization is going to a SOFT state, and when it recovers from SOFT we do not see 'Event Notifications Sent'. The 'Max_Check_Attempts' are not reached, so the service never goes to a HARD state. I found this site, which lays out an excellent example.
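The SOFT versus HARD behaviour can be sketched with a small Python simulation. This is a simplified model, not the Nagios scheduler: it treats every non-OK result as the same problem, and the function name and the `max_check_attempts=3` default are only for illustration.

```python
def simulate(results, max_check_attempts=3):
    """Walk a list of check results and describe what each one triggers."""
    events = []
    attempt = 0            # consecutive non-OK results in the current problem
    hard_problem = False   # True once max_check_attempts has been reached
    for state in results:
        if state != "OK":
            if hard_problem:
                continue   # already HARD and notified; nothing new in this model
            attempt += 1
            if attempt >= max_check_attempts:
                hard_problem = True
                events.append(f"HARD {state}: problem notification")
            else:
                events.append(
                    f"SOFT {state} (attempt {attempt}/{max_check_attempts}): no notification"
                )
        else:
            if hard_problem:
                events.append("HARD recovery: recovery notification")
            elif attempt > 0:
                events.append("SOFT recovery: no notification")
            attempt = 0
            hard_problem = False
    return events

# Recovers after two failed checks: never HARD, so nothing is sent.
for line in simulate(["CRITICAL", "CRITICAL", "OK"]):
    print(line)
```

With three failed checks in a row the service reaches HARD, and the following OK result produces a recovery notification.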
Thanks,
Perry
Re: Service Recovery is not triggering for some of the servi
Posted: Tue Oct 26, 2021 12:12 am
by vishal313
Thank you @pbroste.
Could flapping of the service/host be one of the reasons for this situation?
What would happen if we disabled flap detection globally in Nagios XI? What impact would it have?
Regards
Vishal Dhote
Re: Service Recovery is not triggering for some of the servi
Posted: Tue Oct 26, 2021 1:38 pm
by pbroste
Hello @vishal313,
Thanks for following up. There are 'Global Variables' and 'Object-Specific Variables' for 'Flap Detection Thresholds'.
Flap options:
low_flap_threshold: This directive is used to specify the low state change threshold used in flap detection for this service. If you set this directive to a value of 0, the program-wide value specified by the low_service_flap_threshold directive will be used.
high_flap_threshold: This directive is used to specify the high state change threshold used in flap detection for this service. If you set this directive to a value of 0, the program-wide value specified by the high_service_flap_threshold directive will be used.
flap_detection_enabled *: This directive is used to determine whether or not flap detection is enabled for this service. Values: 0 = disable service flap detection, 1 = enable service flap detection.
flap_detection_options: This directive is used to determine what service states the flap detection logic will use for this service. Valid options are a combination of one or more of the following: o = OK states, w = WARNING states, c = CRITICAL states, u = UNKNOWN states.
You will want to set various flapping options to determine what works best for your environment and checks.
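To get a feel for how the thresholds interact, here is a rough Python sketch of the percent state change calculation. It is unweighted, whereas Nagios gives newer state changes more weight, so real values will differ somewhat. The function names and the 20/30 threshold values are made up for the example.

```python
def percent_state_change(history):
    """Unweighted percent state change over a list of recent check states."""
    possible = len(history) - 1
    if possible <= 0:
        return 0.0
    changes = sum(1 for prev, cur in zip(history, history[1:]) if prev != cur)
    return 100.0 * changes / possible

def update_flapping(is_flapping, pct, low_threshold=20.0, high_threshold=30.0):
    """Start flapping above the high threshold, stop below the low one."""
    if not is_flapping and pct > high_threshold:
        return True
    if is_flapping and pct < low_threshold:
        return False
    return is_flapping

# 21 stored states with 7 changes between them.
history = (["OK"] * 3 + ["CRITICAL"] + ["OK"] + ["CRITICAL"] * 4 + ["OK"] * 3
           + ["CRITICAL"] * 4 + ["OK"] * 3 + ["CRITICAL"] * 2)
pct = percent_state_change(history)
print(pct, update_flapping(False, pct))
# 35.0 True
```

While a service is considered flapping, Nagios suppresses its normal problem and recovery notifications, so a recovery that happens during a flapping period will not reach the ticketing tool.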
Thanks,
Perry