I'm capturing host status and service status callbacks in a neb module,
and I'm not really clear about the logic of how current_state and
last_hard_state get set. Hopefully somebody else is. Below are some table
snippets to show what I'm seeing.
The columns are, in order:
- the unique id of this service/host check,
- when this state started,
- the seconds the states remained unchanged (null when they're the current values),
- the current_state (which I also call soft_state below),
- the last_hard_state,
- the current_attempt,
- the plugin_output.
Note that the current_attempt value gets updated in place when the states
don't change, instead of inserting a new row with the same states but a
different current_attempt, as you might expect. Also, the plugin_output
value is the value at the start of the state, not after the most recent
attempt.
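To make that recording rule concrete, here is a minimal sketch of it in Python. It is not the actual module (a real neb module would be written in C), and the names Row and record are made up for illustration. It assumes one history list per check and that a new row is started only when current_state or last_hard_state changes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Row:
    check_id: int
    started: int                 # epoch seconds when this state began
    duration: Optional[int]      # None while this is the current row
    current_state: int
    last_hard_state: int
    current_attempt: int
    plugin_output: str


def record(rows: List[Row], check_id: int, now: int, current_state: int,
           last_hard_state: int, current_attempt: int, plugin_output: str) -> None:
    """Apply one status callback to the history of a single check."""
    open_row = rows[-1] if rows else None
    if (open_row is not None
            and open_row.current_state == current_state
            and open_row.last_hard_state == last_hard_state):
        # Same states: only the attempt counter is refreshed in place.
        # plugin_output keeps the value from the start of the state.
        open_row.current_attempt = current_attempt
        return
    if open_row is not None:
        # States changed: close the old row by filling in its duration.
        open_row.duration = now - open_row.started
    rows.append(Row(check_id, now, None, current_state, last_hard_state,
                    current_attempt, plugin_output))


# Hypothetical callbacks: two warnings in a row, then a recovery.
history: List[Row] = []
record(history, 141, 0, 1, 0, 1, "PING WARNING (first)")
record(history, 141, 60, 1, 0, 2, "PING WARNING (second)")
record(history, 141, 123, 0, 0, 1, "PING OK")
```

The two warnings collapse into one row whose current_attempt ends up at 2 while its plugin_output still shows the first warning, which is the behavior described above.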
Clear as mud? Cool, here we go.....
484 | 2004-11-10 15:50:45-08 | | 0 | 0 | 1 | PING OK - Packet loss = 0%, RTA = 88.40 ms
This is pretty obvious and straightforward. A ping check succeeded on its
first try, and so the current_state is 0. It hasn't had any failures,
either, so the last_hard_state is also 0.
113 | 2004-11-10 15:06:59-08 | 29346 | 0 | 0 | 1 | PING OK - Packet loss = 0%, RTA = 32.21 ms
113 | 2004-11-10 23:16:05-08 | 86 | 1 | 0 | 1 | PING WARNING - Packet loss = 0%, RTA = 250.70 ms
113 | 2004-11-10 23:17:31-08 | | 0 | 0 | 1 | PING OK - Packet loss = 0%, RTA = 144.14 ms
Another simple example to verify foundations. We have a ping check that's
working fine for a long time, then blips with a warning, but 86 seconds
and another try later, we return to an ok state. We know there was only 1
try that resulted in a soft error state, because otherwise current_attempt
would have been greater than 1 on that middle row.
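As a sanity check on the duration column, the seconds can be recomputed from consecutive start times. This is only a sketch: parse_ts and seconds_between are names invented here, and it assumes the timestamps always carry an hours-only UTC offset like "-08".

```python
from datetime import datetime


def parse_ts(text: str) -> datetime:
    # The table prints the UTC offset as hours only ("-08"); strptime's %z
    # wants hours and minutes, so pad it to "-0800" before parsing.
    return datetime.strptime(text + "00", "%Y-%m-%d %H:%M:%S%z")


def seconds_between(start: str, end: str) -> int:
    return int((parse_ts(end) - parse_ts(start)).total_seconds())


# Start times of the three rows for check 113.
starts = [
    "2004-11-10 15:06:59-08",
    "2004-11-10 23:16:05-08",
    "2004-11-10 23:17:31-08",
]
durations = [seconds_between(a, b) for a, b in zip(starts, starts[1:])]
print(durations)  # [29346, 86]
```

Both values match the third column of the first two rows, so each duration is just the gap to the next row's start time.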
141 | 2004-11-10 15:32:44-08 | 54141 | 0 | 0 | 1 | PING OK - Packet loss = 0%, RTA = 51.94 ms
141 | 2004-11-11 06:35:05-08 | 59 | 1 | 0 | 1 | PING WARNING - Packet loss = 0%, RTA = 334.69 ms
141 | 2004-11-11 06:36:04-08 | 10801 | 0 | 0 | 1 | PING OK - Packet loss = 0%, RTA = 115.25 ms
141 | 2004-11-11 09:36:05-08 | 123 | 1 | 0 | 2 | PING WARNING - Packet loss = 0%, RTA = 280.24 ms
141 | 2004-11-11 09:38:08-08 | | 0 | 0 | 1 | PING OK - Packet loss = 0%, RTA = 51.70 ms
Here's a ping that starts off working, blips with a warning, returns to a
working state, blips twice with a warning, then returns again to a working
state before max_attempts=3 is reached. Nothing special here.
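The retry behavior I'm assuming in these simple cases can be written as a small model. This is my reading of the rows, not Nagios's actual code: simulate is a hypothetical function, and it assumes that any OK result resets current_attempt to 1 and that last_hard_state only picks up a problem state once current_attempt reaches max_attempts.

```python
from typing import List, Tuple

OK = 0


def simulate(results: List[int], max_attempts: int = 3) -> List[Tuple[int, int, int]]:
    """Return (current_state, last_hard_state, current_attempt) after each result."""
    current_state, last_hard_state, attempt = OK, OK, 1
    trace = []
    for result in results:
        if result == OK:
            # Assumed: any OK result recovers the check and resets the counter.
            last_hard_state, attempt = OK, 1
        elif current_state == OK:
            # First failure after an OK state starts a new retry sequence.
            attempt = 1
        else:
            attempt = min(attempt + 1, max_attempts)
        current_state = result
        if result != OK and attempt >= max_attempts:
            # Retries exhausted: the problem state becomes hard.
            last_hard_state = result
        trace.append((current_state, last_hard_state, attempt))
    return trace


# Check 141: ok, one warning, ok, two warnings, ok.
trace_141 = simulate([0, 1, 0, 1, 1, 0])
# A hypothetical run of three warnings, enough to go hard at max_attempts=3.
three_warnings = simulate([0, 1, 1, 1])
```

For check 141 the model never reaches a hard state and the second blip peaks at attempt 2, which agrees with the rows above once the in-place update of current_attempt is taken into account.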
Enough of the groundwork. Here's where the confusion starts:
5655 | 2004-11-10 15:03:30-08 | 58563 | 0 | 0 | 1 | HTTP ok: HTTP/1.1 200 Channel Listing - 0.041 second response time
5655 | 2004-11-11 07:19:33-08 | 553 | 2 | 0 | 5 | Socket timeout after 30 seconds
5655 | 2004-11-11 07:28:46-08 | 8322 | 2 | 2 | 5 | Socket timeout after 30 seconds
5655 | 2004-11-11 09:47:28-08 | 0 | 0 | 2 | 5 | HTTP ok: HTTP/1.1 200 Channel Listing - 1.150 second response time
5655 | 2004-11-11 09:47:28-08 | | 0 | 0 | 1 | HTTP ok: HTTP/1.1 200 Channel Listing - 1.150 second response time
We start off with an http check in a good state. Then it enters a critical
state (2), and stays in that soft error state for 5 attempts. At that
point, it enters a hard critical state and last_hard_state also gets set
to 2. It's still currently having problems, though, so
current_state also stays at 2. Then, at 9:47, it recovers, but somehow
manages to get 5 checks done in 0 seconds. That's my first point of
confusion. I would have thought that if soft_state was ok (0), then
regardless of last_hard_state, there would be no more attempts and the
service would recover. This might be a bug in nagios, where it's sending
the neb callback the wrong current_attempt number.
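To find other rows with the same oddity, a filter like the following could be run over the recorded history. It is only a sketch: Row and ok_with_stale_attempt are illustrative names, and the rule it encodes (an OK current_state should always come with current_attempt 1) is my expectation, not documented Nagios behavior.

```python
from typing import List, NamedTuple, Optional


class Row(NamedTuple):
    duration: Optional[int]
    current_state: int
    last_hard_state: int
    current_attempt: int


def ok_with_stale_attempt(rows: List[Row]) -> List[Row]:
    # Flag rows that claim an OK state but still carry a retry count above 1.
    return [r for r in rows if r.current_state == 0 and r.current_attempt > 1]


# The five rows for check 5655, minus the id, timestamp and plugin_output.
rows_5655 = [
    Row(58563, 0, 0, 1),
    Row(553, 2, 0, 5),
    Row(8322, 2, 2, 5),
    Row(0, 0, 2, 5),
    Row(None, 0, 0, 1),
]
flagged = ok_with_stale_attempt(rows_5655)
```

Only the zero-second row at 09:47:28 is flagged, so the puzzle is limited to that single callback.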
134 | 2004-11-10 15:06:59-08 | 51015 | 0 | 0 | 1 | PING OK - Packet loss = 0%, RTA = 65.07 ms
134 | 2004-11-11 05:17:14-08 | 132 | 1 | 0 | 2 | PING WARNING - Packet loss = 0%, RTA = 200.57 ms
134 | 2004-11-11 05:19:26-08 | 3615 | 1 | 1 | 3 | PING WARNING - Packet loss = 0%, RTA = 273.67 ms
134 | 2004-11-11 06:19:41-08 | | 0 | 0 | 1
...[email truncated]...
This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]