service check timeout results in immediate hard state
Posted: Wed Jan 14, 2015 9:19 am
Hi,
I'm running Nagios® Core™ 3.4.1, installed from the backport "http://ppa.launchpad.net/joffe-hero/ppa/ubuntu precise main", on Ubuntu 12.04.
I noticed that Nagios sometimes sets the hard state too quickly when a service check times out.
Example 1:
[hard state set immediately]
Service Critical[2015-01-10 17:54:04] SERVICE ALERT: LIV2BC-CA-CA01;CA Healthcheck;CRITICAL;HARD;4;(Service Check Timed Out)
Service Critical[2015-01-10 17:54:04] SERVICE ALERT: LIV2BC-CA-CA01;CA Healthcheck;CRITICAL;SOFT;3;(Service Check Timed Out)
Service Critical[2015-01-10 17:54:04] SERVICE ALERT: LIV2BC-CA-CA01;CA Healthcheck;CRITICAL;SOFT;2;(Service Check Timed Out)
Service Critical[2015-01-10 17:54:04] SERVICE ALERT: LIV2BC-CA-CA01;CA Healthcheck;CRITICAL;SOFT;1;(Service Check Timed Out)
Example 2:
[hard set too quickly]
Service Critical[2015-01-12 15:44:37] SERVICE ALERT: SBX1BC-CA-CA01;CA Healthcheck;CRITICAL;HARD;4;(Service Check Timed Out)
Service Critical[2015-01-12 15:44:17] SERVICE ALERT: SBX1BC-CA-CA01;CA Healthcheck;CRITICAL;SOFT;3;(Service Check Timed Out)
Service Critical[2015-01-12 15:43:37] SERVICE ALERT: SBX1BC-CA-CA01;CA Healthcheck;CRITICAL;SOFT;2;(Service Check Timed Out)
Service Critical[2015-01-12 15:43:17] SERVICE ALERT: SBX1BC-CA-CA01;CA Healthcheck;CRITICAL;SOFT;1;(Service Check Timed Out)
On the other hand, for some other services that use the same generic service template definition, everything works as expected (there is a one-minute interval between soft states).
Example 3:
Service Critical[2015-01-14 10:57:57] SERVICE ALERT: SBX1BC-CA-CA01;SplunkForwarder;CRITICAL;HARD;4;(Service Check Timed Out)
Service Critical[2015-01-14 10:56:47] SERVICE ALERT: SBX1BC-CA-CA01;SplunkForwarder;CRITICAL;SOFT;3;(Service Check Timed Out)
Service Critical[2015-01-14 10:55:37] SERVICE ALERT: SBX1BC-CA-CA01;SplunkForwarder;CRITICAL;SOFT;2;(Service Check Timed Out)
Service Critical[2015-01-14 10:54:37] SERVICE ALERT: SBX1BC-CA-CA01;SplunkForwarder;CRITICAL;SOFT;1;(Service Check Timed Out)
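The gaps between attempts in the examples above can be checked with a small script. This is only a rough sketch in Python and not part of the Nagios setup: the names `retry_gaps` and `too_fast` are made up for illustration, and the 60-second expected gap is an assumption based on the one-minute retry interval described above.

```python
import re
from datetime import datetime

# Pulls the timestamp, state type and attempt number out of a SERVICE ALERT line.
ALERT_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] SERVICE ALERT: "
    r"(?P<host>[^;]+);(?P<service>[^;]+);(?P<state>[^;]+);"
    r"(?P<state_type>SOFT|HARD);(?P<attempt>\d+);"
)

# Log lines copied from Example 2 and Example 3 above.
EXAMPLE_2 = [
    "Service Critical[2015-01-12 15:44:37] SERVICE ALERT: SBX1BC-CA-CA01;CA Healthcheck;CRITICAL;HARD;4;(Service Check Timed Out)",
    "Service Critical[2015-01-12 15:44:17] SERVICE ALERT: SBX1BC-CA-CA01;CA Healthcheck;CRITICAL;SOFT;3;(Service Check Timed Out)",
    "Service Critical[2015-01-12 15:43:37] SERVICE ALERT: SBX1BC-CA-CA01;CA Healthcheck;CRITICAL;SOFT;2;(Service Check Timed Out)",
    "Service Critical[2015-01-12 15:43:17] SERVICE ALERT: SBX1BC-CA-CA01;CA Healthcheck;CRITICAL;SOFT;1;(Service Check Timed Out)",
]
EXAMPLE_3 = [
    "Service Critical[2015-01-14 10:57:57] SERVICE ALERT: SBX1BC-CA-CA01;SplunkForwarder;CRITICAL;HARD;4;(Service Check Timed Out)",
    "Service Critical[2015-01-14 10:56:47] SERVICE ALERT: SBX1BC-CA-CA01;SplunkForwarder;CRITICAL;SOFT;3;(Service Check Timed Out)",
    "Service Critical[2015-01-14 10:55:37] SERVICE ALERT: SBX1BC-CA-CA01;SplunkForwarder;CRITICAL;SOFT;2;(Service Check Timed Out)",
    "Service Critical[2015-01-14 10:54:37] SERVICE ALERT: SBX1BC-CA-CA01;SplunkForwarder;CRITICAL;SOFT;1;(Service Check Timed Out)",
]


def retry_gaps(lines):
    """Return (attempt, seconds since the previous attempt) pairs, oldest first."""
    events = []
    for line in lines:
        match = ALERT_RE.search(line)
        if match:
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
            events.append((int(match.group("attempt")), ts))
    events.sort()  # order by attempt number, then by timestamp
    return [
        (cur[0], int((cur[1] - prev[1]).total_seconds()))
        for prev, cur in zip(events, events[1:])
    ]


def too_fast(gaps, expected_seconds=60):
    """Return the attempts that arrived sooner than the assumed retry interval."""
    return [attempt for attempt, gap in gaps if gap < expected_seconds]


if __name__ == "__main__":
    for name, lines in (("Example 2", EXAMPLE_2), ("Example 3", EXAMPLE_3)):
        gaps = retry_gaps(lines)
        print(name, "gaps:", gaps, "too fast:", too_fast(gaps))
```

Run against Example 2, every retry is flagged as arriving in under a minute, while none of the retries in Example 3 are flagged.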
Sample of some variables in my config files which may be relevant to this issue:
generic-service_nagios2.cfg:
# generic service template definition
retry_check_interval 1
nagios.cfg:
service_check_timeout=60
interval_length=60
I found a quite old report (from 2006) of the same or a similar issue on the nagios users mailing list:
https://www.mail-archive.com/nagios-use ... 05634.html
I tested this issue by adding "sleep 100" at the beginning of the check script.
Also, I was able to reproduce this issue only on Nagios servers running in Docker containers and was unable to reproduce it on "normal" systems. However, as it's not reproducible for every service, I'm not sure whether Docker virtualization plays any role here.
Any ideas what could be the root cause of this issue and how to fix it?
Thanks in advance for your response,
Rafał Jankowski