Service notification - why?
Posted: Wed Aug 10, 2022 7:01 am
Hi
We use the check_by_ssh plugin to perform active checks on a number of hosts & services.
In this case the host was unavailable for roughly 7 minutes (12:07 to 12:14 in the log below), which is usually caused by a network problem.
Is anyone able to explain why the following service notification was triggered:
Code:
2022-08-10 00:00:00+01:00 CURRENT HOST STATE: host.example.com;UP;HARD;1;OK: 10-08-2022 @ 08:55:13 AEST
2022-08-10 00:00:00+01:00 CURRENT SERVICE STATE: host.example.com;Samba;OK;HARD;1;OK: smb.service unit is active - 10-08-2022 @ 08:55:55 AEST
2022-08-10 12:07:11+01:00 HOST ALERT: host.example.com;DOWN;SOFT;1;Remote command execution failed: ssh: connect to host host.example.com port 22: Connection refused
2022-08-10 12:10:50+01:00 SERVICE ALERT: host.example.com;Samba;UNKNOWN;HARD;1;Remote command execution failed: ssh: connect to host host.example.com port 22: Connection refused
2022-08-10 12:12:11+01:00 HOST ALERT: host.example.com;DOWN;SOFT;2;Remote command execution failed: ssh: connect to host host.example.com port 22: Connection refused
2022-08-10 12:14:05+01:00 HOST ALERT: host.example.com;UP;SOFT;1;OK: 10-08-2022 @ 21:14:05 AEST
2022-08-10 12:16:05+01:00 SERVICE NOTIFICATION: support;host.example.com;Samba;UNKNOWN;service_notification;UNKNOWN - Plugin timed out
2022-08-10 12:20:54+01:00 SERVICE ALERT: host.example.com;Samba;OK;SOFT;1;OK: smb.service unit is active - 10-08-2022 @ 21:20:54 AEST
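To make the state type and attempt number in the alert lines above easier to read, here is a minimal Python sketch that splits them into named fields. It assumes the semicolon-separated layout shown in this log (five fields for host alerts, six for service alerts); `parse_alert` is a hypothetical helper name, not part of Nagios.

```python
import re

# Matches only "HOST ALERT" / "SERVICE ALERT" lines in the layout shown above.
# Assumption: the timestamp is two whitespace-separated tokens (date, time+offset).
ALERT_RE = re.compile(
    r"^(?P<ts>\S+ \S+) (?P<kind>HOST|SERVICE) ALERT: (?P<fields>.*)$"
)


def parse_alert(line):
    """Return a dict for an alert line, or None for any other log line."""
    m = ALERT_RE.match(line)
    if m is None:
        return None
    kind = m.group("kind")
    # Host alerts carry 5 fields; service alerts add a service description.
    n_fields = 5 if kind == "HOST" else 6
    # Limit the split so semicolons inside the plugin output are preserved.
    parts = m.group("fields").split(";", n_fields - 1)
    if len(parts) != n_fields:
        return None
    if kind == "SERVICE":
        host, service, state, state_type, attempt, output = parts
    else:
        host, state, state_type, attempt, output = parts
        service = None
    return {
        "timestamp": m.group("ts"),
        "kind": kind,
        "host": host,
        "service": service,
        "state": state,
        "state_type": state_type,
        "attempt": int(attempt),
        "output": output,
    }
```

Applied to the 12:10:50 service alert above, this yields state `UNKNOWN`, state type `HARD` and attempt `1`.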
The Nagios host is running version 4.4.6 on Rocky Linux 8.
For both types of check we use the following settings:
Code:
max_check_attempts 13
retry_interval 5
There is no alert log entry for the first service check retry, but it looks like the notification was triggered after the first check retry failed, even though max_check_attempts is set to 13 and the host check was OK at this point.
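To put numbers on that expectation, here is a rough Python sketch comparing how long the retries should take with how quickly the notification actually arrived. It assumes `interval_length` is at its default of 60 seconds (not confirmed for this setup) and that a hard state is only reached after all 13 attempts; `time_to_hard_state` is a hypothetical helper name.

```python
from datetime import datetime, timedelta


def time_to_hard_state(max_check_attempts, retry_interval, interval_length=60):
    """Time from the first failed check to the last attempt.

    Assumption: interval_length is the Nagios default of 60 seconds,
    so retry_interval is effectively expressed in minutes.
    """
    # The first failure is attempt 1, so the remaining attempts are retries.
    retries = max_check_attempts - 1
    return timedelta(seconds=retries * retry_interval * interval_length)


expected = time_to_hard_state(max_check_attempts=13, retry_interval=5)

# Timestamps copied from the log: first service alert and the notification.
first_failure = datetime.fromisoformat("2022-08-10 12:10:50+01:00")
notification = datetime.fromisoformat("2022-08-10 12:16:05+01:00")
observed = notification - first_failure

print("expected:", expected)  # expected: 1:00:00
print("observed:", observed)  # observed: 0:05:15
```

So with these settings the notification would be expected about an hour after the first failure, but it arrived after roughly one retry_interval.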
If you need any more information, just let me know.
Thanks in advance.