Passive checks STALE from a relatively newbie

estienne · Post by **estienne** » Fri Sep 19, 2014 9:38 am

Hi,

My environment: around 1000 Unix hosts, one Nagios server, ~18000 services, Nagios 4.0.7. Most checks are active checks.
OS backups (not data backups, which are handled differently) run once a week. I set up passive checks with a freshness timeout of 8 days. Most of the time it just works: when the backup finishes, it sends "OK" or "CRITICAL" using nsca.
But from time to time, one server or another gets a STALE check. In this case, the nsca event was received (the status text shows 2014/09/13, and I also found the nsca output in syslog), but the STALE popped up about 3.5 days later (2014/09/16).
Goal: to get an alert if either the backup fails (nsca event) or there is no nsca event (backup not scheduled for any reason).
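For readers following along, the backup-side submission described above can be sketched as below. This is a minimal illustration, not the poster's actual script: the helper name `nsca_service_line` is hypothetical, and the only thing assumed about send_nsca is its documented stdin format for service results (host, service description, return code and plugin output separated by tabs, one result per line).

```python
# Nagios plugin return codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def nsca_service_line(host, service, code, output):
    """Build one passive service result in the tab-separated form
    that send_nsca reads on stdin (hypothetical helper)."""
    if code not in (OK, WARNING, CRITICAL, UNKNOWN):
        raise ValueError(f"invalid return code: {code!r}")
    # Tabs or newlines inside the text would break the record, so flatten them.
    clean = output.replace("\t", " ").replace("\n", " ")
    return f"{host}\t{service}\t{code}\t{clean}\n"


if __name__ == "__main__":
    # Host and service names are taken from the thread; the text is illustrative.
    print(nsca_service_line("drpa8p00d", "OS Backup", OK,
                            "OS backup ok 2014/09/13 00:39:45"), end="")
```

The printed line would typically be piped into `send_nsca -H <nagios server> -c <send_nsca config>` at the end of the backup job. The service description has to match the `service_description` in the Nagios definition exactly, otherwise the result is dropped.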

From the service log entry:

Event Start Time	Event End Time	Event Duration	Event/State Type	Event/State Information
09-14-2014 00:00:00	09-14-2014 08:41:26	0d 8h 41m 26s	SERVICE OK (HARD)	OS backup ok 2014/09/13 00:39:45
09-15-2014 00:00:00	09-15-2014 08:41:27	0d 8h 41m 27s	SERVICE OK (HARD)	OS backup ok 2014/09/13 00:39:45
09-16-2014 00:00:00	09-16-2014 08:41:26	0d 8h 41m 26s	SERVICE OK (HARD)	OS backup ok 2014/09/13 00:39:45
09-16-2014 17:08:21	09-16-2014 18:41:25	0d 1h 33m 4s	SERVICE WARNING (HARD)	WARNING: STALE passive check. Please check.
09-17-2014 00:00:00	09-17-2014 01:17:00	0d 1h 17m 0s	SERVICE WARNING (HARD)	WARNING: STALE passive check. Please check.
09-18-2014 00:00:00	09-18-2014 01:17:01	0d 1h 17m 1s	SERVICE WARNING (HARD)	WARNING: STALE passive check. Please check.
09-19-2014 00:00:00	09-19-2014 08:41:30	0d 8h 41m 30s	SERVICE WARNING (HARD)	WARNING: STALE passive check. Please check.

Definition of the service:
define service{
        host_name                       drpa8p00d
        use                             generic-service
        check_command                   check_dummy
        normal_check_interval           11520 ; in minutes
        notification_interval           11520 ; in minutes
        service_description             OS Backup
        active_checks_enabled           0
        passive_checks_enabled          1
        max_check_attempts              1
        check_freshness                 1
        freshness_threshold             691200 ; in seconds (= 11520 minutes = 8 days)
        }

interval_length=60 in nagios.cfg (1 minute)
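The numbers in this definition can be checked with a little arithmetic. The sketch below assumes that `freshness_threshold` is in seconds and that interval values are multiplied by `interval_length`. It also takes the timestamp in the plugin output (2014/09/13 00:39:45) as the time of the last passive result, which is an approximation, since the time Nagios actually received the result is not shown in the thread.

```python
from datetime import datetime, timedelta

INTERVAL_LENGTH = 60           # seconds per time unit, from nagios.cfg
CHECK_INTERVAL_UNITS = 11520   # normal_check_interval from the definition
FRESHNESS_THRESHOLD = 691200   # freshness_threshold, assumed to be seconds


def interval_seconds(units, interval_length=INTERVAL_LENGTH):
    """Convert a Nagios interval expressed in time units to seconds."""
    return units * interval_length


def stale_deadline(last_result, threshold_seconds):
    """Earliest moment at which the result would be older than the threshold."""
    return last_result + timedelta(seconds=threshold_seconds)


# Timestamps taken from the service log in the post.
last_result = datetime(2014, 9, 13, 0, 39, 45)
went_stale = datetime(2014, 9, 16, 17, 8, 21)
deadline = stale_deadline(last_result, FRESHNESS_THRESHOLD)

if __name__ == "__main__":
    print("interval in seconds:", interval_seconds(CHECK_INTERVAL_UNITS))
    print("threshold in days:  ", timedelta(seconds=FRESHNESS_THRESHOLD).days)
    print("expected deadline:  ", deadline)
    print("age when STALE:     ", went_stale - last_result)
```

Under these assumptions the interval and the threshold agree (8 days), and the STALE state appeared well before the 8-day deadline, which suggests that something other than the freshness timer triggered the check.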

Nagios is reloaded frequently ("service nagios reload"), about every 2 hours. Can that interfere? The reloads run at minute 41 of hours 8, 10, 12, 14, 16, 18 and 23. The STALE event starts at 17:08, so it is not related to a reload (the end time, 18:41, is related, no surprise).

Why so many reloads? All the configuration is generated by scripts, a lot of them. Adding a new server, removing one, or even some tuning is done on the fly from the current inventory, and a reload is required.

Please note: most of the time the setup WORKS (1000 servers!). Maybe once a day or once a week, one random server shows this behavior.

Question:

1- Is my service definition OK?
2- If yes, is it a bug?

Note: we had a lot of problems with Nagios 4.0.1; upgrading to 4.0.7 solved most of them, mostly around passive checks / nsca.
Any help greatly appreciated.

tmcdonald · Post by **tmcdonald** » Mon Sep 22, 2014 4:31 pm

What sort of load are you experiencing on this server? Do the STALE results correlate with a high load? How many CPUs do you have?

estienne · Post by **estienne** » Tue Sep 23, 2014 9:32 am

Average CPU utilization is around 5%. There are 4 virtual CPUs (virtualization is OVM). The load is normally low (less than 10; mostly around 2-3).
The only peaks occur when the "auto-configure" scripts start, multiple times per day. Neither the nsca event nor the "expiration" happened during those times.

Does my configuration make sense? Did I miss something?

sreinhardt · Post by **sreinhardt** » Wed Sep 24, 2014 1:30 pm

Your configuration looks OK to me. There might be something in the template, but if that is the normal generic-service I wouldn't expect anything funny. The autoreconfigure stuff is there to help core restructure check scheduling, and do some other helper functions. Does there seem to be a time correlation between when the last successful check came in and when the checks go stale? Could you provide a section of the nagios log or /var/log/messages from a minute or two before and after one of the services goes stale?

estienne · Post by **estienne** » Fri Oct 03, 2014 8:28 am

I think I found it. Human error (maybe) + nagios behavior...
When a server goes down, all services associated with it go "critical" (cannot connect to nrpe). Often, instead of re-scheduling a check for each service one by one, we use "Schedule a check of all services on this host". This includes the passive check, which goes "STALE" when we do that.
Is there a way to NEVER allow an active check on a service? Or to skip that service when we use "Schedule a check of all services on this host"?

--> One weakness of Nagios, the way we use it, is the amount of time required to re-check multiple services, after a network problem for example: either you wait for the normal check interval (some services are scheduled every hour, so it takes a while), or you manually re-schedule all affected services. It would be nice to have a checkbox on each row of the "service problems" window and a "reschedule check for selected" or "reschedule check of all on this page" button or, even better, both of them.
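One possible workaround for the problem described above is to stop using the host-wide command and submit one check request per service, leaving out the passive-only ones. The sketch below only builds the external command lines. The function name and the `(description, active_checks_enabled)` input are hypothetical and would come from whatever inventory the configuration scripts already use. It relies on the documented `SCHEDULE_FORCED_SVC_CHECK;<host_name>;<service_description>;<check_time>` external command. As far as the external command documentation describes it, a forced check runs whether or not active checks are enabled for the service, which would explain why the host-wide command ends up running check_dummy on the passive service.

```python
import time


def forced_check_commands(host, services, now=None):
    """Return external command lines that re-check only the actively
    checked services of one host.

    services: iterable of (service_description, active_checks_enabled)
              pairs; this input shape is an assumption for illustration.
    now:      Unix timestamp to use; defaults to the current time.
    """
    ts = int(time.time()) if now is None else int(now)
    lines = []
    for description, active in services:
        if not active:
            # Passive-only service: forcing a check would run its
            # freshness command (check_dummy) and mark it STALE.
            continue
        lines.append(
            f"[{ts}] SCHEDULE_FORCED_SVC_CHECK;{host};{description};{ts}\n"
        )
    return lines


if __name__ == "__main__":
    # "Disk /" is a made-up service; "OS Backup" is the one from the thread.
    inventory = [("Disk /", True), ("OS Backup", False)]
    for line in forced_check_commands("drpa8p00d", inventory):
        print(line, end="")
```

The resulting lines would be appended to the Nagios external command file (its path is set by `command_file` in nagios.cfg). The same loop also covers the bulk case: calling it for every host affected by an outage avoids clicking through services one by one.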

abrist · Post by **abrist** » Fri Oct 03, 2014 11:58 am

We provide a bulk command tool for XI. But in core, this feature is lacking - I doubt it will get added until we move away from the current html generating cgis (whenever that will be . . .)

estienne · Post by **estienne** » Mon Oct 06, 2014 8:04 am

You're tempting me here

Question, not related to initial thread: do we have the source code of Nagios XI if we "buy" it?

Reason: for authentication, we use single sign-on (IBM WebSEAL), so instead of checking the "REMOTE_USER" variable in the HTTP transaction*, we use "HTTP_IV_USER", which works great. We just modified one line in "cgiauth.c". I want to check: if we go with Nagios XI, could we make this modification/hack, this way or another way?

* I'm not an HTTP/HTML expert; somebody else in our company found that! I may not have the right keyword, but you get the idea.
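The idea behind that one-line change can be illustrated outside of C. In a CGI environment, a request header such as `iv-user` is exposed to the program as the variable `HTTP_IV_USER`, while `REMOTE_USER` is set by the web server's own authentication. The sketch below is not the actual cgiauth.c patch; the function name and the fallback order are assumptions made for illustration.

```python
import os


def authenticated_user(environ=None, names=("HTTP_IV_USER", "REMOTE_USER")):
    """Return the first non-empty user name found in the CGI environment,
    trying the variables in the given order (hypothetical helper)."""
    env = os.environ if environ is None else environ
    for name in names:
        value = env.get(name, "").strip()
        if value:
            return value
    return None


if __name__ == "__main__":
    # Simulated CGI environment; the user names are made up.
    print(authenticated_user({"HTTP_IV_USER": "alice", "REMOTE_USER": "bob"}))
```

Trusting a request header for identity only makes sense when the web server is reachable exclusively through the single sign-on proxy, since any client that can connect directly could set the header itself.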

tmcdonald · Post by **tmcdonald** » Mon Oct 06, 2014 9:43 am

Most of the source code for XI is open, but some key portions of it are encrypted. However, for all the Nagios Core stuff you can still make the edits and recompile; just make sure to maintain the changes on update, as they might be overwritten. You would need to be careful with this step though, as XI compiles Core a bit differently than a standalone Core installation.

Nagios Support Forum
