
Core 4.0.8 and check_interval not accurate

Posted: Fri Dec 05, 2014 12:39 am
by Adrian
Hi experts,

I have a weird problem with Nagios Core 4.0.8
OS Red Hat Enterprise Linux Server release 6.6 (Santiago)

I installed and configured it, and everything works as expected, with one exception.

The host check interval is not accurate. I set it in a template (templates.cfg) and, to verify further, also added check_interval to a single host (hosts.cfg).

In nagios.cfg I set interval_length=1; all other values are the defaults. The commands.cfg values are also the defaults.

The templates.cfg file has the contents shown below.
I set check_interval to 5, which means 5 seconds with interval_length=1. So every 5 seconds Nagios should send ICMP requests and get replies.

But when I check with tcpdump and in the web interface, the actual interval between checks takes seemingly random values.

Code:

define host{
        name                            generic-host    ; The name of this host template
        notifications_enabled           1               ; Host notifications are enabled
        event_handler_enabled           1               ; Host event handler is enabled
        flap_detection_enabled          1               ; Flap detection is enabled
        process_perf_data               1               ; Process performance data
        retain_status_information       1               ; Retain status information across program restarts
        retain_nonstatus_information    1               ; Retain non-status information across program restarts
        notification_period             24x7            ; Send host notifications at any time
        register                        0               ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
        }

# template - added by me

define host{
        name                   node                   ; The name of this template
        use                    generic-host           ; This template inherits other values from the generic-host template
        check_period           24x7                   ; By default Linux hosts are checked round the clock
        check_interval         5                      ; Actively check the host every 5 intervals (5 seconds with interval_length=1)
        retry_interval         1                      ; Schedule host check retries 1 interval apart (1 second with interval_length=1)
        max_check_attempts     2                      ; Check each Linux host 2 times (max)
        check_command          check-host-alive       ; Default command to check Linux hosts
        notification_period    workhours              ; Linux admins hate to be woken up, so we only notify during the day
        notification_interval  0                      ; 0 = do not resend notifications
        notification_options   d,u,r                  ; Only send notifications for specific host states
        contact_groups         admins                 ; Notifications get sent to the admins by default
        register               0                      ;
           }
The host check interval shows random values (last check time, followed by the next scheduled check):

12-05-2014 14:34:02
Next Scheduled Active Check: 12-05-2014 14:34:21



If I set check_interval on a specific host in hosts.cfg, I get a 4-second delay between checks.

Code:

define host{
         use               node
       
         check_interval    180
12-05-2014 14:36:17
Next Scheduled Active Check: 12-05-2014 14:39:21
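
As a quick sanity check on the two timestamps above, the gap between them can be compared with the configured interval. This is only a minimal Python sketch; it assumes interval_length=1 (so check_interval 180 means 180 seconds), and the helper name drift_seconds is made up for illustration.

```python
from datetime import datetime

# Timestamp format as displayed in the Nagios web interface output above.
FMT = "%m-%d-%Y %H:%M:%S"

def drift_seconds(last_check: str, next_check: str, interval_s: int) -> float:
    """Return how many seconds the observed gap exceeds the configured interval."""
    last = datetime.strptime(last_check, FMT)
    nxt = datetime.strptime(next_check, FMT)
    return (nxt - last).total_seconds() - interval_s

# Values copied from the post: check_interval 180, interval_length=1 (assumed).
drift = drift_seconds("12-05-2014 14:36:17", "12-05-2014 14:39:21", 180)
print(drift)  # 4.0
```

The observed gap is 184 seconds, which is 4 seconds more than the configured 180.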


Can you please tell me what is happening? I already googled it, but found no answer.
I appreciate your help.

Thank you.

Adrian

Re: Core 4.0.8 and check_interval not accurate

Posted: Fri Dec 05, 2014 3:44 pm
by tmcdonald
How many hosts and services do you have? Setting the interval length to 1 second instead of 1 minute means that *every* check is now running 60x as often, and your server is likely being overloaded. What sort of CPU load are you seeing?

Re: Core 4.0.8 and check_interval not accurate

Posted: Fri Dec 05, 2014 4:53 pm
by emislivec
Do you have auto_reschedule_checks=1? The rescheduling can cause the actual check_interval to differ from the configured one.

How many checks do you have configured?

Re: Core 4.0.8 and check_interval not accurate

Posted: Mon Dec 08, 2014 12:43 am
by Adrian
Hello all

Thanks for the reply.


Please find the answers to your questions:


--- How many hosts and services do you have?

27 hosts and 72 services.


--- What sort of CPU load are you seeing?

CPU looks OK.
Cpu(s): 0.6%us, 0.7%sy, 0.0%ni, 92.2%id, 6.4%wa,

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd    free   buff   cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 1174540 426520 1195052    0    0     6   509   10   85  1  1 92  6  0


--- Do you have auto_reschedule_checks=1?
No.

# WARNING: THIS IS AN EXPERIMENTAL FEATURE - IT CAN DEGRADE
# PERFORMANCE, RATHER THAN INCREASE IT, IF USED IMPROPERLY

auto_reschedule_checks=0



--- How many checks do you have configured?

72 services.


The debug output shows that Nagios spends 4 seconds processing each host check.
Can this be related to the 4-second delay I am seeing between host checks?

# cat nagios.debug |egrep "Exec Time"
[1418017225.979126] [016.2] [pid=32604] Exec Time: 4.003
[1418017228.012173] [016.2] [pid=32604] Exec Time: 4.004
[1418017229.983048] [016.2] [pid=32604] Exec Time: 4.007
[1418017233.985993] [016.2] [pid=32604] Exec Time: 4.004
[1418017235.028010] [016.2] [pid=32604] Exec Time: 4.004
[1418017235.063869] [016.2] [pid=32604] Exec Time: 4.002
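
For a longer debug file, the same "Exec Time" values can be pulled out and averaged with a few lines of Python. This is a rough sketch that assumes the line layout shown above; the sample lines are copied from this post, and the function name exec_times is hypothetical.

```python
import re

# Sample lines copied from the debug output above.
DEBUG_LINES = """\
[1418017225.979126] [016.2] [pid=32604] Exec Time: 4.003
[1418017228.012173] [016.2] [pid=32604] Exec Time: 4.004
[1418017229.983048] [016.2] [pid=32604] Exec Time: 4.007
[1418017233.985993] [016.2] [pid=32604] Exec Time: 4.004
[1418017235.028010] [016.2] [pid=32604] Exec Time: 4.004
[1418017235.063869] [016.2] [pid=32604] Exec Time: 4.002
"""

# Matches the trailing "Exec Time: <seconds>" field of a debug line.
EXEC_TIME = re.compile(r"Exec Time:\s*(\d+(?:\.\d+)?)")

def exec_times(text):
    """Return every Exec Time value (in seconds) found in the given log text."""
    return [float(m.group(1)) for m in EXEC_TIME.finditer(text)]

times = exec_times(DEBUG_LINES)
mean = sum(times) / len(times)
print(f"{len(times)} checks, mean exec time {mean:.3f} s")  # 6 checks, mean exec time 4.004 s
```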

Thank you for your support.

Re: Core 4.0.8 and check_interval not accurate

Posted: Mon Dec 08, 2014 6:00 pm
by emislivec
Adrian wrote: The debug output shows that Nagios spends 4 seconds processing each host check.
Can this be related to the 4-second delay I am seeing between host checks?
I think you are right, and I think this may be a bug with the rescheduling of host checks. This doesn't seem directly related to the interval_length=1 setting.

I see this for hosts checked with the default check-host-alive command every 4 minutes on a system with the default interval_length=60:

Code:

Last Check Time:	2014-12-08 14:11:49
Check Latency / Duration:	0.000 / 4.007 seconds
Next Scheduled Active Check:  	2014-12-08 14:15:53
But not on a test host using check_dummy every 5 minutes:

Code:

Last Check Time:	2014-12-08 14:17:42
Check Latency / Duration:	0.000 / 0.003 seconds
Next Scheduled Active Check:  	2014-12-08 14:22:42
A host that is down (check command timed out) doesn't show this problem:

Code:

Last Check Time:	2014-12-08 14:32:50
Check Latency / Duration:	0.000 / 30.007 seconds
Next Scheduled Active Check:  	2014-12-08 14:37:50
Services seem fine (this one is a ping service with a one minute check_interval):

Code:

Last Check Time:	2014-12-08 14:20:35
Check Latency / Duration:	0.000 / 4.009 seconds
Next Scheduled Check:  	2014-12-08 14:21:35
So it seems that host checks that complete successfully (normal exit, no timeout, I need to test WARNING/CRITICAL/UNKNOWN returns) are rescheduled like "now + check_interval" instead of "last scheduled time + check_interval". This would explain the 4 second additional delay.
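
To make the difference between the two rescheduling rules concrete, here is a toy simulation in Python. It is not the Nagios scheduler code, only a sketch of the hypothesis above; the function simulate and its parameters are invented for illustration, and the numbers (240-second interval, 4-second check duration) come from the first example in this post.

```python
def simulate(interval, exec_time, checks, from_completion):
    """Return the start times (seconds) of successive checks.

    from_completion=True  -> next start = completion time + interval
                             (the suspected behavior, "now + check_interval")
    from_completion=False -> next start = previous scheduled start + interval
                             ("last scheduled time + check_interval")
    """
    starts = []
    t = 0.0
    for _ in range(checks):
        starts.append(t)
        base = t + exec_time if from_completion else t
        t = base + interval
    return starts

drifting = simulate(interval=240, exec_time=4, checks=4, from_completion=True)
fixed = simulate(interval=240, exec_time=4, checks=4, from_completion=False)
print(drifting)  # [0.0, 244.0, 488.0, 732.0]
print(fixed)     # [0.0, 240.0, 480.0, 720.0]
```

Under the first rule the effective period is check_interval plus the check duration, which matches the 244-second gap between 14:11:49 and 14:15:53 shown above.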

I need to test some more configurations. Do you get more stable check times with a check_interval larger than 5 seconds? How about 10?

Re: Core 4.0.8 and check_interval not accurate

Posted: Tue Dec 09, 2014 5:17 pm
by Adrian
I tried bigger intervals (10 and 60) but got the same result.

If I set it to 56 so that it checks every 56 seconds, the Nagios web interface displays exactly 1 minute between checks. :D


Thanks for helping.

Re: Core 4.0.8 and check_interval not accurate

Posted: Wed Dec 10, 2014 5:43 pm
by abrist
I have seen this behavior on a few different systems as well. Hopefully the infamous "Eric" can get this bug fixed up!