Re: Checks always falling behind

Posted: Wed Jul 24, 2013 4:11 pm
by grimm26
abrist wrote:Are the checks running as well as queuing, or just queuing?
I don't think they are running, but some hosts say that a check has been run recently. Some are pending, but I think those are all hosts that do not have a service assigned to them.

Re: Checks always falling behind

Posted: Thu Jul 25, 2013 2:05 pm
by abrist
Are these actual host checks, or is there a chance they are service checks running a host-alive check or some other ICMP check?

Re: Checks always falling behind

Posted: Fri Jul 26, 2013 3:26 pm
by grimm26
abrist wrote:You will probably want to just pull information from the scheduling queue CGI and grab the topmost table entry for the next check time:

Code: Select all

http://<nagios server ip>/nagios/cgi-bin/extinfo.cgi?type=7
I came up with this one liner to get the time of the next check:

Code: Select all

curl -s -u 'nagiosadmin:<password>' 'http://<nagios server ip>/nagios/cgi-bin/extinfo.cgi?type=7' | grep -m 2 "<TR CLASS=" | tail -n1 | awk 'BEGIN { FS = "<TD CLASS=\047queueOdd\047>|<TD CLASS=\047queueEven\047>" } ; { print $4 }' | sed 's/<.*//'
Obviously, replace <password> and <nagios server ip> with their actual values for your environment. At this point you can compare the date reported to the current date of the nagios system and report it through a plugin script right to the XI interface:

Code: Select all

#!/bin/bash

# Get time/date from topmost entry in the schedule queue for the next check.  Returns 'CCYY-MM-DD hh:mm:ss'.
# The credentials and URL are quoted so the shell does not interpret '?', '<', or '>'.
NEXT=$(curl -s -u 'nagiosadmin:<password>' 'http://<nagios server ip>/nagios/cgi-bin/extinfo.cgi?type=7' \
    | grep -m 2 "<TR CLASS=" | tail -n1 \
    | awk 'BEGIN { FS = "<TD CLASS=\047queueOdd\047>|<TD CLASS=\047queueEven\047>" } ; { print $4 }' \
    | sed 's/<.*//' \
    | awk 'BEGIN { FS = " |-"};{ print $3,$1,$2,$4 }' \
    | sed 's/ /-/g' | sed 's/-/ /g3')

# Converts date time above to unix time.
NEXTUT=$(date -d "$NEXT" +%s)

# Get current unix time
CURRENT=$(date +%s)

# Subtract current time from next check time
OFFSET=$(($NEXTUT - $CURRENT))

# Echo offset string for nagios status data.
echo "The scheduler is currently Offset by $OFFSET seconds | offset=$OFFSET"

# Exit with 0 so that Nagios shows 'OK'  
exit 0
That was fun.
Thanks, that's a nice start. I'm interested in $5 of the first awk, though (the Next Check column). I also had to undo your manipulation of the date/time, since I already have Nagios set for iso8601. So I become concerned when your script starts showing a negative offset :). I'm putting this in now. I had this issue crop up again overnight, and I think it correlates with when I did a nagios reload instead of a restart. I'll avoid reloads from now on and see if that helps. If only I could get ndomod to load config data asynchronously at startup....
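Since a negative offset means the topmost queued check is already overdue, the plugin could report that as a state instead of always exiting 0. Here is a rough sketch of that idea. The function name classify_offset and the 60/300 second thresholds are my own assumptions for illustration; nothing in this thread specifies them.

```shell
#!/bin/bash

# Hypothetical helper: map a scheduler offset (seconds until the next
# queued check; negative means the check is overdue) to a Nagios plugin
# exit code. The default thresholds are illustrative assumptions.
classify_offset() {
    local offset=$1
    local warn=${2:-60}     # seconds overdue before WARNING
    local crit=${3:-300}    # seconds overdue before CRITICAL
    local behind=$(( -offset ))

    if (( behind >= crit )); then
        echo "CRITICAL - scheduler is ${behind}s behind | offset=${offset}"
        return 2
    elif (( behind >= warn )); then
        echo "WARNING - scheduler is ${behind}s behind | offset=${offset}"
        return 1
    fi
    echo "OK - scheduler offset is ${offset}s | offset=${offset}"
    return 0
}
```

In the script above, the final echo and exit 0 could then be swapped for a call to classify_offset "$OFFSET" followed by exit $?, so the service goes WARNING or CRITICAL once the queue falls far enough behind.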

Re: Checks always falling behind

Posted: Mon Jul 29, 2013 10:11 am
by grimm26
abrist wrote:Are the checks running as well as queuing, or just queuing?
Upon further review today, the host checks are definitely running. execute_host_checks is set to 0 in nagios.cfg but I am using a default template for my hosts which has checks enabled. That shouldn't matter if the master setting in nagios.cfg is set to 0, right?
I'm running Nagios 3.5.0 on RHEL6.4.

Re: Checks always falling behind

Posted: Mon Jul 29, 2013 7:31 pm
by scottwilkerson
grimm26 wrote:
abrist wrote:Are the checks running as well as queuing, or just queuing?
Upon further review today, the host checks are definitely running. execute_host_checks is set to 0 in nagios.cfg but I am using a default template for my hosts which has checks enabled. That shouldn't matter if the master setting in nagios.cfg is set to 0, right?
I'm running Nagios 3.5.0 on RHEL6.4.
Well, there is also another layer that takes precedence. If a command was submitted via the web UI or command pipe, it will override the setting in nagios.cfg.

The only way to know for sure would be to look in objects.cache.

Re: Checks always falling behind

Posted: Mon Jul 29, 2013 8:47 pm
by grimm26
scottwilkerson wrote:
grimm26 wrote:
abrist wrote:Are the checks running as well as queuing, or just queuing?
Upon further review today, the host checks are definitely running. execute_host_checks is set to 0 in nagios.cfg but I am using a default template for my hosts which has checks enabled. That shouldn't matter if the master setting in nagios.cfg is set to 0, right?
I'm running Nagios 3.5.0 on RHEL6.4.
Well, there is also another layer that takes precedence. If a command was submitted via the web UI or command pipe, it will override the setting in nagios.cfg.

The only way to know for sure would be to look in objects.cache.
Nope. I'm the only one using the UI or the CLI. Nagios is showing that host checks are disabled, but they are still queueing up and running.
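For anyone wanting to confirm what the running daemon believes, here is a small sketch that reads the program-wide flag out of the status file. It assumes the Nagios 3 status.dat layout, where a programstatus block carries an active_host_checks_enabled=0/1 line. The function name runtime_host_checks is made up for this example, and the default path is the Status File location reported by nagiostats on this box, so adjust it for other installs.

```shell
#!/bin/bash

# Print the runtime value of active_host_checks_enabled from the
# programstatus block of a Nagios 3 status file.
# Assumption: the default path below matches this install only.
runtime_host_checks() {
    local status_file=${1:-/var/log/nagios/status.dat}
    awk '
        /^programstatus/ { in_block = 1; next }   # start of the block
        in_block && /^[ \t]*[}]/ { exit }         # end of the block
        in_block && /active_host_checks_enabled=/ {
            split($0, kv, "=")
            print kv[2]
            exit
        }
    ' "$status_file"
}
```

A result of 0 here while host checks keep appearing in the scheduling queue would line up with what I'm seeing in the UI.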

Re: Checks always falling behind

Posted: Tue Jul 30, 2013 10:22 am
by grimm26
Got a chance to restart Nagios this morning and I added

Code: Select all

active_checks_enabled  0
to the generic-host template. Only after that are host checks not being scheduled and executed.
Bottom line, it seems like

Code: Select all

execute_host_checks=0
in nagios.cfg doesn't do anything.
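To double-check that the template change reached every host, something like the sketch below could list any host that still has active checks enabled in the object cache. It assumes the "define host { ... }" layout that Nagios 3 writes to its object cache file. The function name hosts_with_active_checks is hypothetical, and the cache path is passed in because it varies by install.

```shell
#!/bin/bash

# List hosts whose cached definition still has active checks enabled.
# Assumption: blocks look like "define host {" ... "}" with one
# whitespace-separated directive per line, as in a Nagios 3 object cache.
hosts_with_active_checks() {
    local cache_file=$1
    awk '
        $1 == "define" && $2 == "host" { in_host = 1; name = ""; active = ""; next }
        in_host && $1 == "host_name"             { name = $2 }
        in_host && $1 == "active_checks_enabled" { active = $2 }
        in_host && $1 == "}" {
            if (active == "1") print name
            in_host = 0
        }
    ' "$cache_file"
}
```

An empty result after the restart would mean no host picked up active checks from somewhere other than generic-host.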

Re: Checks always falling behind

Posted: Tue Jul 30, 2013 11:10 am
by grimm26
I opened issue 469

Re: Checks always falling behind

Posted: Tue Jul 30, 2013 2:36 pm
by abrist
Great. Thanks for the sleuthing. This directive should either be fixed or removed, at least from the documentation.

Re: Checks always falling behind

Posted: Thu Aug 15, 2013 2:38 pm
by grimm26
Anyway, service checks are still falling behind on this machine :). Here is what nagiostats tells me:

Code: Select all

Nagios Stats 3.5.0
Copyright (c) 2003-2008 Ethan Galstad (www.nagios.org)
Last Modified: 03-15-2013
License: GPL

CURRENT STATUS DATA
------------------------------------------------------
Status File:                            /var/log/nagios/status.dat
Status File Age:                        0d 0h 0m 3s
Status File Version:                    3.5.0

Program Running Time:                   0d 14h 39m 24s
Nagios PID:                             3360
Used/High/Total Command Buffers:        0 / 1927 / 8192

Total Services:                         24096
Services Checked:                       24095
Services Scheduled:                     6911
Services Actively Checked:              6912
Services Passively Checked:             17184
Total Service State Change:             0.000 / 36.780 / 0.143 %
Active Service Latency:                 0.000 / 550.965 / 528.919 sec
Active Service Execution Time:          0.000 / 190.632 / 1.134 sec
Active Service State Change:            0.000 / 36.780 / 0.353 %
Active Services Last 1/5/15/60 min:     531 / 2646 / 6911 / 6911
Passive Service Latency:                0.069 / 5.159 / 2.946 sec
Passive Service State Change:           0.000 / 11.320 / 0.058 %
Passive Services Last 1/5/15/60 min:    673 / 5122 / 17184 / 17184
Services Ok/Warn/Unk/Crit:              24010 / 4 / 71 / 11
Services Flapping:                      0
Services In Downtime:                   0
What is the difference between service latency and service execution time, and why is there such a big difference between the two? My service checks are all on 5 minute intervals and the max execution time fits within that. Why is the latency so high, then?
[edit] Oh, duh, because it doesn't fork 6911 checks at once. I'm working on getting the execution time down, but I may just have to split out into multiple instances.
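To put rough numbers on that, here is a back-of-envelope sketch using the figures from the nagiostats output above: 6911 active services, a 300 second interval, 2646 checks completed in the last 5 minutes, and a 1.134 second average execution time. The function name throughput_summary is made up, and this is only an estimate of average rates, not a model of how the scheduler actually spreads checks.

```shell
#!/bin/bash

# Back-of-envelope scheduler throughput from nagiostats figures.
# Arguments: active services, check interval (s), checks completed in
# the last 5 minutes, average execution time (s).
throughput_summary() {
    awk -v n="$1" -v interval="$2" -v done5="$3" -v exec_avg="$4" 'BEGIN {
        required = n / interval      # checks/s needed to stay on schedule
        observed = done5 / 300       # checks/s actually completed
        printf "required=%.1f/s observed=%.1f/s\n", required, observed
        printf "full_cycle=%.0fs\n", n / observed
        printf "avg_concurrency_needed=%.1f\n", required * exec_avg
    }'
}

# Figures taken from the nagiostats output in this post.
throughput_summary 6911 300 2646 1.134
```

With these inputs it works out to about 23 checks per second needed versus about 9 per second completed, so one pass through all active services takes roughly 784 seconds instead of 300. That gap is what shows up as latency, even though each individual check runs quickly.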