My environment: around 1000 Unix hosts, one Nagios server, ~18000 services, Nagios 4.0.7. Most checks are active checks.
OS backups (not data backups, which are handled differently) run once a week. I set up passive checks with a timeout (freshness threshold) of 8 days. Most of the time it just works: when the backup finishes, it sends "OK" or "CRITICAL" using nsca.
But from time to time, one server or another gets a STALE check. In this case, the nsca event was received (as the text indicates, 2014/09/13; I also found the nsca output in syslog), but the STALE result popped up about 3.5 days later (2014/09/16).
Goal: to get an alert if either the backup fails (nsca event) or no nsca event arrives at all (backup not scheduled, for whatever reason).
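For reference, here is a minimal sketch of the passive result line that the backup script pipes to send_nsca. It assumes send_nsca's default tab-delimited input (host, service description, return code, plugin output). The helper name build_nsca_line is hypothetical; the host and service names are the ones used in the service definition in this post.

```python
# Standard Nagios plugin return codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def build_nsca_line(host, service, code, output):
    """Build one passive service check result line in the tab-delimited
    format read by send_nsca (hypothetical helper, not part of NSCA)."""
    for field in (host, service, output):
        if "\t" in field or "\n" in field:
            raise ValueError("fields must not contain tabs or newlines")
    return f"{host}\t{service}\t{code}\t{output}\n"

# The service description must match the Nagios definition exactly.
line = build_nsca_line("drpa8p00d", "OS Backup", OK,
                       "OS backup ok 2014/09/13 00:39:45")
```

The resulting string is what would be written to send_nsca's standard input; the sketch does not call send_nsca itself.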
From the service log entry:
Event Start Time     Event End Time       Event Duration  Event/State Type        Event/State Information
09-14-2014 00:00:00  09-14-2014 08:41:26  0d 8h 41m 26s   SERVICE OK (HARD)       OS backup ok 2014/09/13 00:39:45
09-15-2014 00:00:00  09-15-2014 08:41:27  0d 8h 41m 27s   SERVICE OK (HARD)       OS backup ok 2014/09/13 00:39:45
09-16-2014 00:00:00  09-16-2014 08:41:26  0d 8h 41m 26s   SERVICE OK (HARD)       OS backup ok 2014/09/13 00:39:45
09-16-2014 17:08:21  09-16-2014 18:41:25  0d 1h 33m 4s    SERVICE WARNING (HARD)  WARNING: STALE passive check. Please check.
09-17-2014 00:00:00  09-17-2014 01:17:00  0d 1h 17m 0s    SERVICE WARNING (HARD)  WARNING: STALE passive check. Please check.
09-18-2014 00:00:00  09-18-2014 01:17:01  0d 1h 17m 1s    SERVICE WARNING (HARD)  WARNING: STALE passive check. Please check.
09-19-2014 00:00:00  09-19-2014 08:41:30  0d 8h 41m 30s   SERVICE WARNING (HARD)  WARNING: STALE passive check. Please check.
Definition of the service:
define service{
    host_name               drpa8p00d
    use                     generic-service
    check_command           check_dummy
    normal_check_interval   11520 # in minutes
    notification_interval   11520 # in minutes
    service_description     OS Backup
    active_checks_enabled   0
    passive_checks_enabled  1
    max_check_attempts      1
    check_freshness         1
    freshness_threshold     691200 # 11520 minutes
}
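As a sanity check on the numbers above, the following sketch confirms that both values in the definition correspond to 8 days, and computes how long after the last received result the first STALE entry appeared. It assumes the timestamp in the plugin output (2014/09/13 00:39:45) is close to when Nagios accepted the passive result; the variable names are only illustrative.

```python
from datetime import datetime, timedelta

normal_check_interval_min = 11520   # from the service definition (minutes)
freshness_threshold_s = 691200      # from the service definition (seconds)

threshold = timedelta(seconds=freshness_threshold_s)
same_value = timedelta(minutes=normal_check_interval_min) == threshold

last_result = datetime(2014, 9, 13, 0, 39, 45)  # timestamp in the plugin output
stale_start = datetime(2014, 9, 16, 17, 8, 21)  # first STALE entry in the log
elapsed = stale_start - last_result

print(threshold)            # 8 days, 0:00:00
print(elapsed)              # 3 days, 16:28:36
print(elapsed < threshold)  # True: STALE fired well before the threshold
```

So the two settings agree with each other, and the result was flagged stale after less than half of the configured threshold.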
There is a frequent "service nagios reload" on the Nagios server, roughly every 2 hours. Can it interfere? Reloads run at hours 8, 10, 12, 14, 16, 18 and 23, always at minute 41. The STALE event start time is 17:08, which does not match any reload (the end time, 18:41, does match one, no surprise).
Why so many reloads: all the configuration is generated by scripts, a lot of them. Adding a new server, removing one, or even some tuning is done on the fly from the current inventory, and each change requires a reload.
Please note: most of the time the setup WORKS (1000 servers!). Maybe once a day or once a week, one random server shows this behavior.
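To make the timing relative to the reloads concrete, this small sketch computes how long after the previous reload the STALE entry started. It assumes each reload fires exactly at hh:41:00 on the hours listed above (the real start second is not known), and the variable names are only illustrative.

```python
from datetime import datetime, timedelta

reload_hours = [8, 10, 12, 14, 16, 18, 23]      # from the post; minute is always 41
stale_start = datetime(2014, 9, 16, 17, 8, 21)  # first STALE entry in the log

# Candidate reload times on the same day and on the day before
candidates = [
    datetime(2014, 9, 16, hour, 41) - timedelta(days=back)
    for back in (0, 1)
    for hour in reload_hours
]
previous_reload = max(t for t in candidates if t <= stale_start)
since_reload = stale_start - previous_reload

print(previous_reload)  # 2014-09-16 16:41:00
print(since_reload)     # 0:27:21
```

Under that assumption, the STALE result appeared about 27 minutes after the 16:41 reload, not at a reload itself.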
Questions:
1- Is my service definition OK?
2- If yes, is this a bug?
Note: we had a lot of problems with Nagios 4.0.1; upgrading to 4.0.7 solved most of them, mostly in passive checks / nsca.
Any help is greatly appreciated.