Nagios Support Forum

Posted: **Tue Feb 05, 2013 1:51 pm**

So, I'm using smokeping to monitor boxes, and I'm using the nagios plugin to query the smokeping RRDs - that all works great.

BUT

I get warning and critical alerts after only a single warning/critical condition.

Under the service, I have set [max_check_attempts 3] - and I was under the assumption that the service would be checked three times, and that it had to fall outside the warning/critical limits on all three checks before it generated a warning/critical/down notification.

[Yes, I know there are other filters that impact when a notification is sent out, but let's ignore those for now - in any case, we're talking about too permissive a notification, rather than too limited.]

So, why do I get a notification for this service when only a single sample/check has a value outside the warning/critical values?

Does the ICMP probe work the same way?

-Greg

PS. I don't believe any of these other settings should impact this issue, but I'll provide them just in case.

notification_interval 1
max_check_attempts 3
check_interval 1
retry_interval 1

Posted: **Tue Feb 05, 2013 4:08 pm**

Hmmm something isn't quite right here... you are correct in assuming that max_check_attempts 3 should require it to have checked 3 times before sending any notification.

Are you able to post the full configuration for that object and anything that it is inheriting from a template? Are you also able to confirm that under the service it says check attempt: 1/3 and then 2/3 when the notification is sent? My suspicion right now is that the value is being overwritten or hasn't been applied.
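For anyone following along, here is a minimal sketch of the soft/hard behaviour that max_check_attempts is expected to produce. It is a toy model, not how Nagios is implemented: the CheckTracker class and first_hard_check function are made-up names, and it ignores every other notification filter (notification periods, host state, escalations and so on).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CheckTracker:
    """Toy soft/hard state tracker for a single service (hypothetical helper)."""

    max_check_attempts: int = 3
    attempt: int = 0  # consecutive non-OK results, capped at max_check_attempts

    def record(self, ok: bool) -> str:
        """Feed in one check result and return a label such as 'SOFT 1' or 'HARD 3'."""
        if ok:
            self.attempt = 0
            return "OK"
        self.attempt = min(self.attempt + 1, self.max_check_attempts)
        kind = "HARD" if self.attempt == self.max_check_attempts else "SOFT"
        return f"{kind} {self.attempt}"


def first_hard_check(results: List[bool], max_check_attempts: int) -> Optional[int]:
    """Return the 1-based index of the first check that reaches a HARD problem state."""
    tracker = CheckTracker(max_check_attempts=max_check_attempts)
    for index, ok in enumerate(results, start=1):
        if tracker.record(ok).startswith("HARD"):
            return index
    return None


if __name__ == "__main__":
    tracker = CheckTracker(max_check_attempts=3)
    # One good check followed by three bad ones.
    for ok in [True, False, False, False]:
        print(tracker.record(ok))
```

In this model a problem notification would only be considered once record() returns a HARD label, so a single bad sample with max_check_attempts 3 should stay SOFT.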

Posted: **Tue Feb 05, 2013 7:27 pm**

Perhaps this IS what's wrong...

So, I see in the host template def that's inherited by this host, the max_check_attempts = 1
But the service is overridden as 3.

Does this mean that even though the service is not yet down [3 checks], since max_check_attempts=1 on the *host*, it will alert on a single bad check on the service?

[This is somewhat odd, though perhaps I understand - although I'm not sure I get the whole hierarchy thing. I essentially couldn't care a bit about the "host" but I don't think I can define a service to check without it.]

If so, what happens if it's the other way around? [i.e. Service with max_check_attempts 1, but the host is set to 10]
Will I get alerts on the service in one, but alerts on the host not until 10?

---
#atlas.ccast.cpe
define host{
use linux-server ; Name of host template to use
host_name abc.somehost.xyz
alias abc.somehost.xyz
address 127.0.0.1
contact_groups abc.admins
hostgroups abcd
}

define service{
use generic-service ;Inherit default values from a template
active_checks_enabled 1
passive_checks_enabled 1
notification_interval 1
max_check_attempts 3
check_interval 1
retry_interval 1
service_description smokeping
host_name abc.somehost.xyz
contact_groups
check_command check_smokeping!/var/lib/smokeping/abc/xyz.rrd!15!35!1000!10000!1000!10000
# so we're warning on 15% loss, critical on 35%, RTT 1000W 10000C, Jitter 1000W 10000C
#I don't much care about the actual values, so *ignore* them, even if they seem unreasonable.
contacts bogus.contact ;We'll handle everything through escalations
}

----
Inherited template

define service{
name generic-service ; The 'name' of this service template
active_checks_enabled 1 ; Active service checks are enabled
passive_checks_enabled 1 ; Passive service checks are enabled/accepted
parallelize_check 1 ; Active service checks should be parallelized (disabling this can lead to major performance problems)
obsess_over_service 1 ; We should obsess over this service (if necessary)
check_freshness 0 ; Default is to NOT check service 'freshness'
notifications_enabled 1 ; Service notifications are enabled
event_handler_enabled 1 ; Service event handler is enabled
flap_detection_enabled 0 ; Flap detection is disabled
failure_prediction_enabled 0 ; Failure prediction is disabled
process_perf_data 1 ; Process performance data
retain_status_information 1 ; Retain status information across program restarts
retain_nonstatus_information 1 ; Retain non-status information across program restarts
is_volatile 0 ; The service is not volatile
check_period 24x7 ; The service can be checked at any time of the day
max_check_attempts 3 ; Re-check the service up to 3 times in order to determine its final (hard) state
normal_check_interval 1 ; Check the service every minute under normal conditions
retry_check_interval 1 ; Re-check the service every minute until a hard state can be determined
contact_groups admins ; Notifications get sent out to everyone in the 'admins' group
notification_options w,u,c,r,f,s ; Send notifications about warning, unknown, critical, and recovery events, flap, scheduled
notification_interval 1 ; Re-notify about service problems every minute
notification_period 24x7 ; Notifications can be sent out at any time
register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
}

define host{
name linux-server ; The name of this host template
use generic-host ; This template inherits other values from the generic-host template
check_period 24x7 ; By default, Linux hosts are checked round the clock
check_interval 1 ; Actively check the host every minute
retry_interval 1 ; Schedule host check retries at 1 minute intervals
max_check_attempts 1 ; Check each Linux host once (max)
check_command check-host-alive ; Default command to check Linux hosts
notification_period 24x7 ; Notifications can be sent out at any time of day
; Note that the notification_period variable is being overridden from
; the value that is inherited from the generic-host template!
notification_interval 1 ; Resend notifications every minute
notification_options d,u,r,f,s ; Only send notifications for specific host states
contact_groups admins ; Notifications get sent to the admins by default
register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
}
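To double-check which max_check_attempts each object actually ends up with, here is a rough sketch of how the `use` chain resolves, assuming the usual rule that a value set directly on an object wins over one inherited from a template. The dictionaries are trimmed-down copies of the definitions above, effective_value is a made-up helper, generic-host was not posted so it is left empty here, and only a single template per `use` line is handled.

```python
from typing import Dict, Optional

# Trimmed-down, hypothetical copies of the definitions posted in this thread.
TEMPLATES: Dict[str, Dict[str, str]] = {
    "generic-service": {"max_check_attempts": "3"},
    "linux-server": {"use": "generic-host", "max_check_attempts": "1"},
    "generic-host": {},  # not shown in the thread; contents assumed empty
}

HOST = {"use": "linux-server", "host_name": "abc.somehost.xyz"}
SERVICE = {
    "use": "generic-service",
    "host_name": "abc.somehost.xyz",
    "service_description": "smokeping",
    "max_check_attempts": "3",
}


def effective_value(
    obj: Dict[str, str], directive: str, templates: Dict[str, Dict[str, str]]
) -> Optional[str]:
    """Walk the `use` chain; the first definition that sets the directive wins."""
    seen = set()
    current: Optional[Dict[str, str]] = obj
    while current is not None:
        if directive in current:
            return current[directive]
        parent = current.get("use")
        if parent is None or parent in seen:  # end of chain, or a loop
            return None
        seen.add(parent)
        current = templates.get(parent)
    return None


if __name__ == "__main__":
    print("host:", effective_value(HOST, "max_check_attempts", TEMPLATES))
    print("service:", effective_value(SERVICE, "max_check_attempts", TEMPLATES))
```

With the values from this thread, the host resolves to 1 (from linux-server) and the service to 3 (set on the service itself), so the two objects really do have different thresholds.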

Posted: **Tue Feb 05, 2013 10:45 pm**

As soon as I read your postulating at the top of your last post I had a little chuckle to myself. This trap gets nearly every new user; at the very least, it got me.

The host and the service are two distinctly separate entities, and each can alert on its own. What's happening here is that your service is NEVER alerting. Your host IS alerting. Go ahead and remove the ping service from a host entirely and then bring it down... you will still receive the notification.

Essentially the host has a check-host-alive command that is triggering the host down notification after one host attempt.

define host{
name linux-server ; The name of this host template
use generic-host ; This template inherits other values from the generic-host template
check_period 24x7 ; By default, Linux hosts are checked round the clock
check_interval 1 ; Actively check the host every minute
retry_interval 1 ; Schedule host check retries at 1 minute intervals
max_check_attempts 1 ; Check each Linux host once (max)
check_command check-host-alive ; Default command to check Linux hosts
notification_period 24x7 ; Notifications can be sent out at any time of day
; Note that the notification_period variable is being overridden from
; the value that is inherited from the generic-host template!
notification_interval 1 ; Resend notifications every minute
notification_options d,u,r,f,s ; Only send notifications for specific host states
contact_groups admins ; Notifications get sent to the admins by default
register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
}

Most people end up removing the ping service from all of their devices because it's redundant. You want to keep ping on the host because if the host goes down, the parenting structure of Nagios will prevent a notification storm for the attached services.

Hopefully this makes sense to you?

Posted: **Tue Feb 05, 2013 11:08 pm**

Well, in this case, for virtually all hosts, there's a 1:1 relationship between the host and its service. [Only one host for only one service.]

So, while I'm not exactly sure what's happening - if I look at the history, all the alerts are "service alert" - not hosts.
Thus, I'd guess I'm *not* alerting on the host.

In that view, I see something like the following.

...
Somedate/sometime SERVICE ALERT. some.service.name,smokeping,HARD,3 ...
Somedate/sometime SERVICE ALERT. some.service.name,smokeping,SOFT,2 ...
Somedate/sometime SERVICE ALERT. some.service.name,smokeping,SOFT,1 ...

Hmmm....

Natch! Dag nabbit!!
Now, perhaps I've been huffing the rubber cement again, but I see that it IS working properly - at least now.

I would have sworn I looked at this a day or two ago and it was creating a notification before it hit a hard entry. [Or at least before it had hit three checks.]

But now it looks like I was the one taking hits to the noggin' or something.
Sorry maties! I'll just be over here in the corner with the pointy hat on. Never mind me.
Nothing to see - move along....

[If I see it come back, I'll update the thread. Otherwise, paint me clueless or somethin'.]

Thanks for helping the deaf-dumb-and-blind kid.

-Greg

Posted: **Wed Feb 06, 2013 12:45 pm**

Yes, let us know if this persists. Don't be too hard on yourself; everyone learns these things at one time or another. If it was not clear to you that this was the case, it may be something we can tweak in our documentation. Any feedback is welcome!

icmp / smokeping warning/critical notifications