service check timeout results in immediate hard state

Post by jankowsr »

Hi,

I'm running Nagios® Core™ 3.4.1, installed from the backport PPA "http://ppa.launchpad.net/joffe-hero/ppa/ubuntu precise main", on Ubuntu 12.04.

I noticed that Nagios sometimes reaches a hard state too quickly when a service check times out.

Example 1:
[hard state set immediately]
Service Critical[2015-01-10 17:54:04] SERVICE ALERT: LIV2BC-CA-CA01;CA Healthcheck;CRITICAL;HARD;4;(Service Check Timed Out)
Service Critical[2015-01-10 17:54:04] SERVICE ALERT: LIV2BC-CA-CA01;CA Healthcheck;CRITICAL;SOFT;3;(Service Check Timed Out)
Service Critical[2015-01-10 17:54:04] SERVICE ALERT: LIV2BC-CA-CA01;CA Healthcheck;CRITICAL;SOFT;2;(Service Check Timed Out)
Service Critical[2015-01-10 17:54:04] SERVICE ALERT: LIV2BC-CA-CA01;CA Healthcheck;CRITICAL;SOFT;1;(Service Check Timed Out)

Example 2:
[hard set too quickly]
Service Critical[2015-01-12 15:44:37] SERVICE ALERT: SBX1BC-CA-CA01;CA Healthcheck;CRITICAL;HARD;4;(Service Check Timed Out)
Service Critical[2015-01-12 15:44:17] SERVICE ALERT: SBX1BC-CA-CA01;CA Healthcheck;CRITICAL;SOFT;3;(Service Check Timed Out)
Service Critical[2015-01-12 15:43:37] SERVICE ALERT: SBX1BC-CA-CA01;CA Healthcheck;CRITICAL;SOFT;2;(Service Check Timed Out)
Service Critical[2015-01-12 15:43:17] SERVICE ALERT: SBX1BC-CA-CA01;CA Healthcheck;CRITICAL;SOFT;1;(Service Check Timed Out)

On the other hand, some other services that use the same generic service template definition behave as expected (there is roughly a one-minute interval between soft states).
Example 3:
Service Critical[2015-01-14 10:57:57] SERVICE ALERT: SBX1BC-CA-CA01;SplunkForwarder;CRITICAL;HARD;4;(Service Check Timed Out)
Service Critical[2015-01-14 10:56:47] SERVICE ALERT: SBX1BC-CA-CA01;SplunkForwarder;CRITICAL;SOFT;3;(Service Check Timed Out)
Service Critical[2015-01-14 10:55:37] SERVICE ALERT: SBX1BC-CA-CA01;SplunkForwarder;CRITICAL;SOFT;2;(Service Check Timed Out)
Service Critical[2015-01-14 10:54:37] SERVICE ALERT: SBX1BC-CA-CA01;SplunkForwarder;CRITICAL;SOFT;1;(Service Check Timed Out)
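
To make the spacing between attempts easier to compare, here is a rough sketch that prints the number of seconds between consecutive alert lines. It assumes GNU date (for `date -d`) and the bracketed timestamp format shown in the logs above; the function name `alert_gaps` is made up for illustration. The log excerpts are listed newest first, so the sketch sorts them before taking differences.

```shell
#!/usr/bin/env bash
# Print the seconds elapsed between consecutive alert lines read from stdin.
# Assumes GNU date and timestamps of the form [YYYY-MM-DD HH:MM:SS].
alert_gaps() {
    local ts epoch prev=""
    # Pull out the bracketed timestamp of each line and order them oldest first.
    sed -n 's/.*\[\([0-9-]* [0-9:]*\)\].*/\1/p' | sort | while read -r ts; do
        # Convert to epoch seconds; -u avoids local DST surprises.
        epoch=$(date -u -d "$ts" +%s)
        if [ -n "$prev" ]; then
            echo $(( epoch - prev ))
        fi
        prev=$epoch
    done
}

# Example 2 from above (newest first, as in the log view).
alert_gaps <<'EOF'
Service Critical[2015-01-12 15:44:37] SERVICE ALERT: SBX1BC-CA-CA01;CA Healthcheck;CRITICAL;HARD;4;(Service Check Timed Out)
Service Critical[2015-01-12 15:44:17] SERVICE ALERT: SBX1BC-CA-CA01;CA Healthcheck;CRITICAL;SOFT;3;(Service Check Timed Out)
Service Critical[2015-01-12 15:43:37] SERVICE ALERT: SBX1BC-CA-CA01;CA Healthcheck;CRITICAL;SOFT;2;(Service Check Timed Out)
Service Critical[2015-01-12 15:43:17] SERVICE ALERT: SBX1BC-CA-CA01;CA Healthcheck;CRITICAL;SOFT;1;(Service Check Timed Out)
EOF
```

For Example 2 this prints 20, 40 and 20; fed the Example 3 lines it prints 60, 70 and 70; for Example 1 every gap is 0.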

Here are some settings from my config files that may be relevant to this issue:
generic-service_nagios2.cfg:

Code: Select all

# generic service template definition
retry_check_interval            1

nagios.cfg:

Code: Select all

service_check_timeout=60
interval_length=60

I found a fairly old report (from 2006) of the same or a similar issue on the nagios-users mailing list:
https://www.mail-archive.com/nagios-use ... 05634.html

I tested this by adding "sleep 100" at the beginning of the check script, so that the check runs longer than service_check_timeout.
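
For anyone who wants to try the same thing, a stand-in check along these lines should be enough to force a timeout. This is only a sketch: the real check script is not shown in this thread, and the function name `slow_check` and its output text are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for a check script: wait, then report OK.
# The default delay of 100 seconds exceeds service_check_timeout=60,
# so Nagios would be expected to kill the check before it reports anything.
slow_check() {
    local delay=${1:-100}
    sleep "$delay"
    echo "OK - finished after ${delay}s"
    return 0   # plugin exit code 0 means OK
}
```

Calling `slow_check 5` from a shell finishes normally after five seconds, which is a quick way to confirm the script itself is fine before wiring it into a check_command.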

Also, I was only able to reproduce this issue on Nagios servers running in Docker containers, not on "normal" systems. However, since it's not reproducible for every service, I'm not sure whether Docker plays any role here.

Any ideas what could be the root cause of this issue and how to fix it?

Thanks in advance for your response,
Rafał Jankowski

Post by abrist »

jankowsr wrote: Any ideas what could be the root cause of this issue and how to fix it?
Not from what I see up there. Can you post the configs for those objects?

EDIT: I ask because certain settings like is_volatile will cause events to fire off with every check.

Post by jankowsr »

abrist,

I believe these are most likely the installation defaults:

Code: Select all

# generic service template definition
define service{
        name                            generic-service ; The 'name' of this service template
        active_checks_enabled           1       ; Active service checks are enabled
        passive_checks_enabled          1       ; Passive service checks are enabled/accepted
        parallelize_check               1       ; Active service checks should be parallelized (disabling this can lead to major performance problems)
        obsess_over_service             1       ; We should obsess over this service (if necessary)
        check_freshness                 0       ; Default is to NOT check service 'freshness'
        notifications_enabled           1       ; Service notifications are enabled
        event_handler_enabled           1       ; Service event handler is enabled
        flap_detection_enabled          0       ; Flap detection is disabled
        failure_prediction_enabled      1       ; Failure prediction is enabled
        process_perf_data               1       ; Process performance data
        retain_status_information       1       ; Retain status information across program restarts
        retain_nonstatus_information    1       ; Retain non-status information across program restarts
        notification_interval           15              ; Only send notifications on status change by default.
        is_volatile                     0
        check_period                    24x7
        normal_check_interval           5
        retry_check_interval            1
        max_check_attempts              4
        notification_period             24x7
        notification_options            w,u,c,r
        contact_groups                  admins
        register                        0       ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
        }

define service {
        use                     generic-service
        hostgroup_name          WindowsServers
        service_description     CA Healthcheck
        check_command           check_script!"$_VARIABLE$"!"parameter"
}

Post by abrist »

Could you post your full nagios.cfg?
Are there multiple core parent processes running concurrently?

Code: Select all

ps -aef | grep nagios.cfg

Post by jankowsr »

abrist wrote: Could you post your full nagios.cfg?
The logs in the first post come from two different Nagios servers: the first monitors LIV2BC-CA-CA01 and the other monitors SBX1BC-CA-CA01. I have attached the nagios.cfg files for both, named after the monitored servers. They should be identical, but by mistake the one for SBX1BC-CA-CA01 got the config prepared for non-dockerized instances, so persistent data is stored in the standard /var location instead of /var/nagios3_mount. That's probably unrelated to our issue since both servers are affected, so we can probably focus on just one of them.

abrist wrote: Are there multiple core parent processes running concurrently?

Code: Select all

ps -aef | grep nagios.cfg

Both of the Nagios servers mentioned above run in Docker containers. On the LIV2BC Nagios server there is just one Docker container, while on the SBX1BC one there are multiple containers, so ps output on the Docker host will obviously show multiple nagios processes. Inside each container there should be just one nagios process. However, I noticed that's not quite the case. Please see the output below (both commands were run inside the Docker containers):

Code: Select all

root@liv2bc-ut-nagios-docker:/tmp# for i in {1..10000}; do ps -aef |grep nagios.cfg|grep -v grep |wc -l; done > nagios_procs
root@liv2bc-ut-nagios-docker:/tmp# grep 1 nagios_procs |wc -l
9494
root@liv2bc-ut-nagios-docker:/tmp# grep 2 nagios_procs |wc -l
467
root@liv2bc-ut-nagios-docker:/tmp# grep 3 nagios_procs |wc -l
34
root@liv2bc-ut-nagios-docker:/tmp# grep 4 nagios_procs |wc -l
3
root@liv2bc-ut-nagios-docker:/tmp# grep 5 nagios_procs |wc -l
2

Code: Select all

root@sbx1bc-ut-nagios-docker:~# for i in {1..10000}; do ps -aef |grep nagios.cfg|grep -v grep |wc -l; done > nagios_procs
root@sbx1bc-ut-nagios-docker:~# grep 1 nagios_procs |wc -l
9920
root@sbx1bc-ut-nagios-docker:~# grep 2 nagios_procs |wc -l
57
root@sbx1bc-ut-nagios-docker:~# grep 3 nagios_procs |wc -l
19
root@sbx1bc-ut-nagios-docker:~# grep 4 nagios_procs |wc -l
4
root@sbx1bc-ut-nagios-docker:~#

As you can see, most of the time there is only one nagios process running, but there are short moments when several are running. At the moment I don't know what starts these extra nagios processes, but I have noticed similar behaviour on non-dockerized Nagios servers as well.
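
The same counts can be tallied in one pass, and listing the parent PID alongside each match may help tell short-lived children of the daemon apart from genuinely separate instances. The sketch below assumes a procps-style `ps` that accepts `-eo pid,ppid,etime,args`; the function names are made up for illustration. My guess that the extra processes are forked children running checks is only an assumption, which the PPID column should help confirm or rule out.

```shell
#!/usr/bin/env bash
# Summarise a file of per-sample process counts (one integer per line),
# such as the nagios_procs file produced by the loop above.
# Prints "<count>: <number of samples>" for each distinct count.
tally_counts() {
    sort -n "$1" | uniq -c | awk '{ print $2 ": " $1 }'
}

# List PID, parent PID, elapsed time and command line of every process whose
# command line mentions nagios.cfg. Writing [n] keeps awk from matching itself.
list_nagios_procs() {
    ps -eo pid,ppid,etime,args | awk '/[n]agios\.cfg/'
}
```

Running `tally_counts nagios_procs` replaces the separate grep calls, and `list_nagios_procs` shows whether the extra entries have the main daemon's PID as their parent.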
Attachments:
sbx1bc-nagios.cfg (43.59 KiB)
liv2bc-nagios.cfg (43.77 KiB)

Post by tgriep »

Could you post the config files for the LIV2BC-CA-CA01 and SBX1BC-CA-CA01 services?