[Nagios-devel] Reproducible bug scheduling second host check

Support forum for Nagios Core, Nagios Plugins, NCPA, NRPE, NSCA, NDOUtils and more. Engage with the community of users including those using the open source solutions.
Locked
Guest

[Nagios-devel] Reproducible bug scheduling second host check

Post by Guest »


----=_pe3dn3tqn1atl8stbn0ahg4pjguvcu0v1l.MFSBCHJLHS
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

I have, for some time, been suspicions that Nagios 3 appeared a bit slow
alerting on host failures. In order to diagnose this, I setup a minimal
config to try to show the problem using 3.0rc1 with a default nagios.cfg.
It consists of one host and one service (both checked by ping), with the
retries and timers set as below (the config file is attached in full):-

|define host{
| host_name hostname
| check_interval 10
| retry_interval 1
| max_check_attempts 5
|}
|define service{
| max_check_attempts 5
| normal_check_interval 5
| retry_check_interval 5
| host_name hostname
| service_description ping
|}

The host was up at 16:25 with the next host check scheduled for 16:35 and
the next service check for 16:30:-

|[2007-12-29 16:25:42] [1198945542.102847] [016.0] [pid=3660] Scheduling a non-forced, active check of service 'ping' on host 'hostname' @ Sat Dec 29 16:30:32 2007
|[2007-12-29 16:25:52] [1198945552.201392] [016.1] [pid=3660] Pre-handle_host_state() Host: hostname, Attempt=1/5, Type=HARD, Final State=0
|[2007-12-29 16:25:52] [1198945552.201763] [016.0] [pid=3660] Scheduling a non-forced, active check of host 'hostname' @ Sat Dec 29 16:35:52 2007

The host was then unplugged and the service was checked as expected at
16:30. As the service was down, the host (again as expected) was checked.
The next host check was scheduled for 16:32 as expected with the host
retry_interval of 1 minute:-

|[2007-12-29 16:30:32] [1198945832.178688] [016.0] [pid=3660] Attempting to run scheduled check of service 'ping' on host 'hostname': check options=0, latency=0.178000
|[2007-12-29 16:30:52] [1198945852.109282] [016.1] [pid=3660] HOST: hostname, SERVICE: ping, CHECK TYPE: Active, OPTIONS: 0, SCHEDULED: Yes, RESCHEDULE: Yes, EXITED OK: Yes, RETURN CODE: 2, OUTPUT: CRITICAL - Host Unreachable (m18.itconsult.net)
|[2007-12-29 16:30:52] [1198945852.109557] [016.1] [pid=3660] Service is in a non-OK state
|[2007-12-29 16:30:52] [1198945852.109579] [016.1] [pid=3660] Host is currently UP, so we'll recheck its state to make sure...
|[2007-12-29 16:30:52] [1198945852.109626] [016.0] [pid=3660] ** Running async check of host 'hostname'...
|[2007-12-29 16:30:52] [1198945852.127798] [016.0] [pid=3660] Scheduling a non-forced, active check of service 'ping' on host 'hostname' @ Sat Dec 29 16:35:32 2007
|[2007-12-29 16:31:02] [1198945862.209380] [016.1] [pid=3660] Pre-handle_host_state() Host: hostname, Attempt=1/5, Type=SOFT, Final State=1
|[2007-12-29 16:31:02] [1198945862.210709] [016.0] [pid=3660] Scheduling a non-forced, active check of host 'hostname' @ Sat Dec 29 16:32:02 2007
|[2007-12-29 16:31:02] [1198945862.210735] [016.2] [pid=3660] Found another host check event for this host...
|[2007-12-29 16:31:02] [1198945862.210760] [016.2] [pid=3660] Keeping original host check event (ignoring the new one).
|[2007-12-29 16:31:02] [1198945862.210831] [016.1] [pid=3660] ** Async check result for host 'hostname' handled: new state=1

However, although the CGIs continued to show the next host check for 16:32,
nothing happened until:-

|[2007-12-29 16:35:32] [1198946132.119030] [016.0] [pid=3660] Attempting to run scheduled check of service 'ping' on host 'hostname': check options=0, latency=0.118000
|[2007-12-29 16:35:42] [1198946142.223011] [016.0] [pid=3660] Scheduling a non-forced, active check of service 'ping' on host 'hostname' @ Sat Dec 29 16:40:32 2007
|[2007-12-29 16:35:52] [1198946152.053988] [016.0] [pid=3660] Attempting to run scheduled check of host 'hostname': check options=0, latency=0.053000
|[2007-12-29 16:35:52] [1198946152.054042] [016.0] [pid=3660] ** Running async check of host 'hostname'...
|[2007-12-29 16:36:02] [1198946162.152677] [016.2] [pid=3660] Host has no parents, so it is DOWN.
|[2007-12-29 16:36:02] [1198946162.152699] [016.1] [pid=3660]

...[email truncated]...


This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]
Locked