service check timeout results in immediate hard state

jankowsr · Post by **jankowsr** » Wed Jan 14, 2015 9:19 am

Hi,

I'm running Nagios® Core™ 3.4.1, installed from the backport "http://ppa.launchpad.net/joffe-hero/ppa/ubuntu precise main", on Ubuntu 12.04.

I noticed that Nagios sometimes puts a service into a hard state too quickly when the service check times out.

Example 1:
[hard state set immediately]
Service Critical[2015-01-10 17:54:04] SERVICE ALERT: LIV2BC-CA-CA01;CA Healthcheck;CRITICAL;HARD;4;(Service Check Timed Out)
Service Critical[2015-01-10 17:54:04] SERVICE ALERT: LIV2BC-CA-CA01;CA Healthcheck;CRITICAL;SOFT;3;(Service Check Timed Out)
Service Critical[2015-01-10 17:54:04] SERVICE ALERT: LIV2BC-CA-CA01;CA Healthcheck;CRITICAL;SOFT;2;(Service Check Timed Out)
Service Critical[2015-01-10 17:54:04] SERVICE ALERT: LIV2BC-CA-CA01;CA Healthcheck;CRITICAL;SOFT;1;(Service Check Timed Out)

Example 2:
[hard set too quickly]
Service Critical[2015-01-12 15:44:37] SERVICE ALERT: SBX1BC-CA-CA01;CA Healthcheck;CRITICAL;HARD;4;(Service Check Timed Out)
Service Critical[2015-01-12 15:44:17] SERVICE ALERT: SBX1BC-CA-CA01;CA Healthcheck;CRITICAL;SOFT;3;(Service Check Timed Out)
Service Critical[2015-01-12 15:43:37] SERVICE ALERT: SBX1BC-CA-CA01;CA Healthcheck;CRITICAL;SOFT;2;(Service Check Timed Out)
Service Critical[2015-01-12 15:43:17] SERVICE ALERT: SBX1BC-CA-CA01;CA Healthcheck;CRITICAL;SOFT;1;(Service Check Timed Out)

On the other hand, for some other services that use the same generic service template definition, everything works as expected (there is roughly a one-minute interval between soft states).
Example 3:
Service Critical[2015-01-14 10:57:57] SERVICE ALERT: SBX1BC-CA-CA01;SplunkForwarder;CRITICAL;HARD;4;(Service Check Timed Out)
Service Critical[2015-01-14 10:56:47] SERVICE ALERT: SBX1BC-CA-CA01;SplunkForwarder;CRITICAL;SOFT;3;(Service Check Timed Out)
Service Critical[2015-01-14 10:55:37] SERVICE ALERT: SBX1BC-CA-CA01;SplunkForwarder;CRITICAL;SOFT;2;(Service Check Timed Out)
Service Critical[2015-01-14 10:54:37] SERVICE ALERT: SBX1BC-CA-CA01;SplunkForwarder;CRITICAL;SOFT;1;(Service Check Timed Out)
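
The gaps between attempts in the three excerpts above can be read off the timestamps. A minimal sketch, assuming the alert lines keep the bracketed timestamp format shown here (the raw nagios.log may use a different timestamp format); the function name `attempt_gaps` and the `EXAMPLE_2` list are illustrative only:

```python
import re
from datetime import datetime

# Matches alert lines in the format shown in the excerpts above:
# a bracketed timestamp, then semicolon-separated fields ending in
# state type (SOFT/HARD) and attempt number.
ALERT_RE = re.compile(
    r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] SERVICE ALERT: "
    r"[^;]+;[^;]+;[^;]+;(?:SOFT|HARD);(\d+);"
)


def attempt_gaps(lines):
    """Return the seconds elapsed between consecutive check attempts."""
    events = []
    for line in lines:
        match = ALERT_RE.search(line)
        if match:
            when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            events.append((int(match.group(2)), when))
    # Order by attempt number; the excerpts list the newest entry first.
    events.sort()
    return [int((b[1] - a[1]).total_seconds()) for a, b in zip(events, events[1:])]


# Lines copied from Example 2.
EXAMPLE_2 = [
    "Service Critical[2015-01-12 15:44:37] SERVICE ALERT: SBX1BC-CA-CA01;CA Healthcheck;CRITICAL;HARD;4;(Service Check Timed Out)",
    "Service Critical[2015-01-12 15:44:17] SERVICE ALERT: SBX1BC-CA-CA01;CA Healthcheck;CRITICAL;SOFT;3;(Service Check Timed Out)",
    "Service Critical[2015-01-12 15:43:37] SERVICE ALERT: SBX1BC-CA-CA01;CA Healthcheck;CRITICAL;SOFT;2;(Service Check Timed Out)",
    "Service Critical[2015-01-12 15:43:17] SERVICE ALERT: SBX1BC-CA-CA01;CA Healthcheck;CRITICAL;SOFT;1;(Service Check Timed Out)",
]

print(attempt_gaps(EXAMPLE_2))  # [20, 40, 20]
```

Applied to the other excerpts, the same calculation gives gaps of 0, 0 and 0 seconds for Example 1 and 60, 70 and 70 seconds for Example 3.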

Some settings from my config files that may be relevant to this issue:
generic-service_nagios2.cfg:

Code: Select all

# generic service template definition
retry_check_interval            1

nagios.cfg:

Code: Select all

service_check_timeout=60
interval_length=60

I found a fairly old report (from 2006) of the same or a similar issue on the nagios-users mailing list:
https://www.mail-archive.com/nagios-use ... 05634.html

I tested this by adding "sleep 100" at the beginning of the check script.

Also, I was able to reproduce this issue only on Nagios servers running in Docker containers; I could not reproduce it on "normal" systems. However, since it is not reproducible for every service, I'm not sure whether Docker virtualization plays any role here.

Any ideas what could be the root cause of this issue and how to fix it?

Thanks in advance for your response,
Rafał Jankowski

abrist · Post by **abrist** » Wed Jan 14, 2015 3:07 pm

jankowsr wrote: Any ideas what could be the root cause of this issue and how to fix it?

Not from what I see up there. Can you post the configs for those objects?

EDIT: I ask because certain settings like is_volatile will cause events to fire off with every check.

jankowsr · Post by **jankowsr** » Thu Jan 15, 2015 6:55 am

abrist,

I believe these are most likely the installation defaults:

Code: Select all

# generic service template definition
define service{
        name                            generic-service ; The 'name' of this service template
        active_checks_enabled           1       ; Active service checks are enabled
        passive_checks_enabled          1       ; Passive service checks are enabled/accepted
        parallelize_check               1       ; Active service checks should be parallelized (disabling this can lead to major performance problems)
        obsess_over_service             1       ; We should obsess over this service (if necessary)
        check_freshness                 0       ; Default is to NOT check service 'freshness'
        notifications_enabled           1       ; Service notifications are enabled
        event_handler_enabled           1       ; Service event handler is enabled
        flap_detection_enabled          0       ; Flap detection is disabled
        failure_prediction_enabled      1       ; Failure prediction is enabled
        process_perf_data               1       ; Process performance data
        retain_status_information       1       ; Retain status information across program restarts
        retain_nonstatus_information    1       ; Retain non-status information across program restarts
        notification_interval           15              ; Only send notifications on status change by default.
        is_volatile                     0
        check_period                    24x7
        normal_check_interval           5
        retry_check_interval            1
        max_check_attempts              4
        notification_period             24x7
        notification_options            w,u,c,r
        contact_groups                  admins
        register                        0       ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
        }

define service {
        use                     generic-service
        hostgroup_name          WindowsServers
        service_description     CA Healthcheck
        check_command           check_script!"$_VARIABLE$"!"parameter"
}
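
For reference, the nominal timing implied by these settings can be worked out as below. This is only a sketch of the arithmetic, assuming retries are spaced exactly one retry interval apart and ignoring check execution time and scheduler latency; the function name `nominal_retry_timing` is made up for illustration. The `interval_length` value comes from the nagios.cfg excerpt in the first post.

```python
def nominal_retry_timing(retry_check_interval, interval_length, max_check_attempts):
    """Return (seconds between retries, seconds from first attempt to hard state).

    Assumes each retry is scheduled exactly one retry interval after the
    previous attempt; real scheduling may differ.
    """
    retry_seconds = retry_check_interval * interval_length
    # The hard state is reached on the last attempt, i.e. after
    # (max_check_attempts - 1) retries following the first failure.
    seconds_to_hard = (max_check_attempts - 1) * retry_seconds
    return retry_seconds, seconds_to_hard


# Values from the template above and from nagios.cfg (interval_length=60).
print(nominal_retry_timing(retry_check_interval=1, interval_length=60, max_check_attempts=4))
# (60, 180)
```

Under these assumptions the four attempts would be spread over about three minutes, which is the pattern to compare the logged alerts against.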

abrist · Post by **abrist** » Thu Jan 15, 2015 3:07 pm

Could you post your full nagios.cfg?
Are there multiple core parent processes running concurrently?

Code: Select all

ps -aef | grep nagios.cfg

jankowsr · Post by **jankowsr** » Fri Jan 16, 2015 8:24 am

abrist wrote:Could you post your full nagios.cfg?

The logs in the first post come from two different Nagios servers (the first monitors LIV2BC-CA-CA01 and the other SBX1BC-CA-CA01). I have attached the nagios.cfg files for both of them, renamed after the monitored servers. They should actually be identical, but by mistake the one for SBX1BC-CA-CA01 got a config prepared for non-dockerized instances, so persistent data is stored in the standard /var location instead of /var/nagios3_mount. That's probably unrelated to our issue, as both servers are affected, so we can probably focus on just one of them.

abrist wrote: Are there multiple core parent processes running concurrently?
Code: Select all
ps -aef | grep nagios.cfg

Both of the Nagios servers mentioned above run in Docker containers. On the LIV2BC Nagios server there is just one Docker container, and on the SBX1BC Nagios server there are multiple Docker containers, so ps on the Docker host will obviously return multiple nagios processes. Inside each Docker container there should be just one Nagios server. However, I noticed that this is not quite the case. Please refer to the output below (both commands were issued inside Docker containers):

Code: Select all

root@liv2bc-ut-nagios-docker:/tmp# for i in {1..10000}; do ps -aef |grep nagios.cfg|grep -v grep |wc -l; done > nagios_procs
root@liv2bc-ut-nagios-docker:/tmp# grep 1 nagios_procs |wc -l
9494
root@liv2bc-ut-nagios-docker:/tmp# grep 2 nagios_procs |wc -l
467
root@liv2bc-ut-nagios-docker:/tmp# grep 3 nagios_procs |wc -l
34
root@liv2bc-ut-nagios-docker:/tmp# grep 4 nagios_procs |wc -l
3
root@liv2bc-ut-nagios-docker:/tmp# grep 5 nagios_procs |wc -l
2

Code: Select all

root@sbx1bc-ut-nagios-docker:~# for i in {1..10000}; do ps -aef |grep nagios.cfg|grep -v grep |wc -l; done > nagios_procs
root@sbx1bc-ut-nagios-docker:~# grep 1 nagios_procs |wc -l
9920
root@sbx1bc-ut-nagios-docker:~# grep 2 nagios_procs |wc -l
57
root@sbx1bc-ut-nagios-docker:~# grep 3 nagios_procs |wc -l
19
root@sbx1bc-ut-nagios-docker:~# grep 4 nagios_procs |wc -l
4
root@sbx1bc-ut-nagios-docker:~#

As you can see, most of the time there is only one nagios process running, but there are short moments when multiple nagios processes are running. At the moment I don't know what spawns these extra nagios processes, but I noticed similar behaviour on non-dockerized Nagios servers as well.
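
A note on the counting method: `grep 1` matches any line that contains the digit 1, so a sample of 10 or more processes would be counted under the wrong bucket. Here every sample is a single digit and each set of tallies sums to 10000, so the figures above hold. A minimal sketch of an exact tally follows, assuming a file with one integer per line like nagios_procs; the helper names `tally` and `share_above_one` are illustrative only.

```python
from collections import Counter


def tally(samples):
    """Count exact values from lines holding one integer each; skip blanks."""
    return Counter(int(s) for s in (line.strip() for line in samples) if s)


def share_above_one(counts):
    """Fraction of samples in which more than one process was seen."""
    total = sum(counts.values())
    extra = sum(n for value, n in counts.items() if value > 1)
    return extra / total if total else 0.0


# Tallies copied from the two outputs above.
LIV2BC = Counter({1: 9494, 2: 467, 3: 34, 4: 3, 5: 2})
SBX1BC = Counter({1: 9920, 2: 57, 3: 19, 4: 4})

print(round(share_above_one(LIV2BC) * 100, 2))  # 5.06 (percent)
print(round(share_above_one(SBX1BC) * 100, 2))  # 0.8 (percent)
```

To tally a real sample file, something like `tally(open("nagios_procs"))` would produce the same kind of counter.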

Post by **tgriep** » Fri Jan 16, 2015 2:55 pm

Could you post the config files for LIV2BC-CA-CA01 and SBX1BC-CA-CA01 services?
