Re: [Nagios-devel] freshness_threshold bug - big problem
Posted: Thu Dec 16, 2010 8:59 pm
On 12/16/2010 12:03 PM, Rodney Ramos wrote:
> As I've said before I think that it is a Nagios Core bug. I've tested it
> with Nagios 3.2.1 and I found the same problem.
> I think it's a serious problem.
Oh, wow. 8-O I can confirm the effect on my 3.2.3, but there seems to be
*more* of a problem with host freshness checks. Test run with
check_interval 15, retry_interval 2, max_check_attempts 4; log excerpt:
18:23:55 Warning: Host 'Unfresh' has no services associated with it!
18:24:28 EXTERNAL COMMAND: PROCESS_HOST_CHECK_RESULT;Unfresh;0;Manual Init to UP|
18:24:35 PASSIVE HOST CHECK: Unfresh;0;Manual Init to UP
18:39:55 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 12s (threshold=0d 0h 15m 16s). I'm forcing an immediate check of the host.
18:40:05 HOST ALERT: Unfresh;DOWN;SOFT;1;(null)
18:51:12 Warning: Host 'Unfresh' has no services associated with it!
18:56:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 59s (threshold=0d 0h 15m 17s). I'm forcing an immediate check of the host.
18:56:23 HOST ALERT: Unfresh;DOWN;SOFT;2;(null)
19:00:12 Warning: Host 'Unfresh' has no services associated with it!
19:12:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 45s (threshold=0d 0h 15m 15s). I'm forcing an immediate check of the host.
19:12:23 HOST ALERT: Unfresh;DOWN;SOFT;2;CRITICAL: All life functions terminated
19:28:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 39s (threshold=0d 0h 15m 18s). I'm forcing an immediate check of the host.
19:28:23 HOST ALERT: Unfresh;DOWN;SOFT;3;CRITICAL: All life functions terminated
19:44:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 39s (threshold=0d 0h 15m 18s). I'm forcing an immediate check of the host.
19:44:23 HOST ALERT: Unfresh;DOWN;HARD;4;CRITICAL: All life functions terminated
20:00:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 39s (threshold=0d 0h 15m 18s). I'm forcing an immediate check of the host.
20:16:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 41s (threshold=0d 0h 15m 17s). I'm forcing an immediate check of the host.
20:32:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 39s (threshold=0d 0h 15m 18s). I'm forcing an immediate check of the host.
20:48:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 45s (threshold=0d 0h 15m 15s). I'm forcing an immediate check of the host.
21:04:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 45s (threshold=0d 0h 15m 15s). I'm forcing an immediate check of the host.
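A rough way to see the cadence of the forced checks is to difference the timestamps of the freshness warnings in the log excerpt. What follows is only a minimal sketch and was not part of the original test run: the helper names to_seconds and print_gaps are made up for illustration, and it assumes that all timestamps fall on the same day.

```shell
#!/bin/sh
# Convert an HH:MM:SS timestamp to seconds since midnight.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Print the gap in seconds between each pair of consecutive timestamps.
# Assumption: timestamps are given in ascending order and do not cross midnight.
print_gaps() {
    prev=""
    for ts in "$@"; do
        cur=$(to_seconds "$ts")
        if [ -n "$prev" ]; then
            echo "$ts: $((cur - prev))s since previous"
        fi
        prev=$cur
    done
}

# Timestamps of the first five "stale" warnings from the log excerpt.
print_gaps 18:39:55 18:56:13 19:12:13 19:28:13 19:44:13
```

For these five warnings the gaps come out as 978, 960, 960 and 960 seconds, so roughly 16 minutes each. That matches the logged threshold of a little over 15 minutes plus the reported staleness, and none of the gaps is anywhere near a retry_interval of 2 (assuming the intervals are counted in minutes).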
(The additional "no services" crud stems from my not getting the check
command right the first time 'round, and having to re-reload the config.)
I took excerpts of status.dat and retention.dat initially and after each of
the first nine active checks. Look at these current_attempt numbers:
# for FIL in *.dat* ; do echo -n "${FIL}: " | \
> sed -e 's/_[a-z]*-/-/' -e 's/\.[a-z]*: */:/' ; \
> egrep '(current_attempt|state_type|(current|last_hard)_state=)' \
> $FIL | sed -e 's/\([a-z][a-z][a-z]\)[a-z]*\([_=]\)/\1\2/g' | \
> tr '\n\t' ' ' ; echo "" ; done
retention.dat-OK: cur_sta=0 las_har_sta=0 cur_att=1 sta_typ=1
retention.dat-1: cur_sta=0 las_har_sta=0 cur_att=1 sta_typ=1
retention.dat-2: cur_sta=1 las_har_sta=0 cur_att=1 sta_typ=0
retention.dat-3: cur_sta=1 las_har_sta=0 cur_att=2 sta_typ=0
retention.dat-4: cur_sta=1 las_har_sta=0 cur_att=2 sta_typ=0
retention.dat-5: cur_sta=1 las_har_sta=0 cur_att=2 sta_typ=0
retention.dat-6: cur_sta=1 las_har_sta=0 cur_att=4 sta_typ=1
retention.dat-7: cur_sta=1 las_har_sta=0 cur_att=4 sta_typ=1
retention.dat-8: cur_sta=1 las_har_sta=0 cur_att=4 sta_typ=1
retention.dat-9: cur_sta=1 las_har_sta=0 cur_att=
...[email truncated]...
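For anyone wanting to repeat the comparison, the one-liner could probably be written more readably along the following lines. This is a sketch under assumptions, not the command that was actually run: the function name summarize_snapshot is made up, it prints the full field names instead of the three-letter abbreviations, it matches the field names exactly rather than by substring, and it assumes each snapshot file holds only the block for the one host of interest, with one name=value pair per line.

```shell
#!/bin/sh
# Print selected state fields of one status.dat/retention.dat excerpt on a single line.
# Assumption: the file contains only the block for the host being examined.
summarize_snapshot() {
    file=$1
    printf '%s:' "$file"
    awk -F= '
        # Strip the leading indentation from the field name.
        { sub(/^[ \t]+/, "", $1) }
        $1 == "current_state" || $1 == "last_hard_state" ||
        $1 == "current_attempt" || $1 == "state_type" {
            printf " %s=%s", $1, $2
        }
        END { print "" }
    ' "$file"
}

# Summarize every snapshot file named on the command line.
for snapshot in "$@"; do
    summarize_snapshot "$snapshot"
done
```

Saved under a name of your choice and called with the snapshot files as arguments, it prints one line per file, which makes a current_attempt value that stalls or jumps between snapshots easy to spot.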
This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]