Service checks when host is down

[email protected] · Post by **[email protected]** » Thu Feb 28, 2013 5:08 am

I am trying to tidy up our Nagios view of hosts/devices by reducing the number of services marked as critical.

When a host is down, is it possible to selectively tell Nagios not to test the associated service(s) (and mark them as unknown)?

I say selectively, as I have one service which is a Wake on LAN service, i.e. it tests for ping and, if that fails, sends a WoL packet, so it would still need to operate when the host is down. It tries 3 times over about 10 minutes, then fails.

Looking forward to your response.

Thanks
Liam

scottwilkerson · Post by **scottwilkerson** » Thu Feb 28, 2013 5:47 pm

I think this is the document you are looking for

http://nagios.sourceforge.net/docs/3_0/ ... ncies.html

[email protected] · Post by **[email protected]** » Mon Mar 04, 2013 9:51 am

Thanks Scott,

The task of adding inter-dependencies to 2000 disparate devices is very daunting ... I was looking for a more general switch.

i.e. in most cases, if the host is down, service checks are unnecessary, wasteful of resources, and cause excessive clutter on Nagios monitors.
Any process which actively cleans up the view to enable support to home in on the actual faults/problems would be a good thing in my book.

Is there any way, either in the current product or as a possible enhancement, that the services would automatically return 'Unknown' if the host was down?

Thanks
Liam

slansing · Post by **slansing** » Mon Mar 04, 2013 11:22 am

They could return unreachable if a dependency relationship was set up, which is close to what Scott was getting at, I believe. Please see the following docs:

http://nagios.sourceforge.net/docs/3_0/ ... ility.html

You could use this same method to create a buffer between the hosts and services by setting up a hierarchy of Host > Host > Service: the parent host goes down, the child host becomes unreachable, and the service becomes unreachable. This is how network reachability works within Nagios, though it would take some tuning based on how your architecture is set up. You could use event handlers for this as well.

[email protected] · Post by **[email protected]** » Fri Mar 08, 2013 7:09 am

Sounds good.

I am unclear how I should create these relationships.

Could you define how I would go about this ... by way of referring to my example below ...

I have 70 Windows podia in lecture rooms.
The podia are all in one group called 'windows-podia'.

6 services are set up and run against the 'windows-podia' group.

If a podium goes down, all the services go red (i.e. fail). I would prefer they returned unknown or unreachable.

Typical host definition:

define host {
use windows-podium ;
host_name BUS-1025-26_Lab_Podium;
check_command check-host-alive;
alias Lab_Podium ;
address 192.168.91.96 ;
hostgroups windows-podiaIPs;
}

Each service is defined similarly to the following:
define service {
use generic-service
service_description CPU Detail
check_command check-wsc!cpu_detail!80%,90%
hostgroups windows-podiaIPs
}

and of course a group definition

define hostgroup{
hostgroup_name windows-podiaIPs ; The name of the hostgroup
alias Windows Podia Desktops ; Long name of the group
}

Looking forward to your reply
Liam

abrist · Post by **abrist** » Fri Mar 08, 2013 12:09 pm

Slansing's method would only work if all the podiums were children of a parent networking device, and even then it would only label the podiums as "UNREACHABLE" when the parent networking device was down.
Service dependencies may be the right tool for the job, though I do understand that making those changes is a giant task.
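
For reference, a parent/child relationship of the kind discussed above is declared with the `parents` directive on the child host. A minimal sketch, assuming a hypothetical switch called BUS-1025-switch sits in front of the podium (the switch name, its template and its address are placeholders, not taken from this thread):

```
# Hypothetical parent device; name, template and address are placeholders
define host {
    use         generic-host
    host_name   BUS-1025-switch
    alias       Lab switch
    address     192.0.2.1
}

# Podium host from the example, with the parent added
define host {
    use             windows-podium
    host_name       BUS-1025-26_Lab_Podium
    check_command   check-host-alive
    alias           Lab_Podium
    address         192.168.91.96
    hostgroups      windows-podiaIPs
    parents         BUS-1025-switch
}
```

With this in place the podium should be reported as UNREACHABLE instead of DOWN only while the switch itself is down. The directive changes how the host state is classified; it does not by itself stop the podium's service checks.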

[email protected] wrote:Is there any way either in current product or as a possible enhancement, that the services would auto return 'Unknown' if the host was down.?

Beyond service dependencies, you could use event handlers to turn checks on and off depending on host state, although this is probably just as much of a time sink to implement as service dependencies:
http://monitoringtt.blogspot.com/2011/0 ... -host.html
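
One way the event-handler approach might be wired up is a host event handler that submits the `DISABLE_HOST_SVC_CHECKS` or `ENABLE_HOST_SVC_CHECKS` external command when the host changes state. The sketch below covers only the configuration side. The command name and script path are hypothetical, the script itself would still need to be written, and external commands must be enabled (`check_external_commands=1`) for it to have any effect.

```
# Hypothetical handler command. The script is assumed to write
#   DISABLE_HOST_SVC_CHECKS;<host>  when the host goes DOWN (HARD state)
#   ENABLE_HOST_SVC_CHECKS;<host>   when the host returns to UP
# to the Nagios external command file.
define command {
    command_name    toggle-host-svc-checks
    command_line    $USER1$/eventhandlers/toggle_host_svc_checks.sh "$HOSTNAME$" "$HOSTSTATE$" "$HOSTSTATETYPE$"
}

# Podium host from the example, with the handler attached
define host {
    use             windows-podium
    host_name       BUS-1025-26_Lab_Podium
    check_command   check-host-alive
    alias           Lab_Podium
    address         192.168.91.96
    hostgroups      windows-podiaIPs
    event_handler   toggle-host-svc-checks
}
```

Putting `event_handler` on the windows-podium template instead would apply it to every podium at once. Because `DISABLE_HOST_SVC_CHECKS` switches off all services on the host, a service that must keep running while the host is down (such as the Wake on LAN check) would need to be re-enabled by the script, for example with `ENABLE_SVC_CHECK;<host>;<service>`.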

vgavara · Post by **vgavara** » Fri Mar 15, 2013 10:12 am

As far as I understand the documentation, that task can hardly be done using service dependencies, for two reasons:

- A service cannot be dependent on a host, only on one (or more) service(s), and what we're talking about is making one or more services dependent on their host.
- You could work around that by creating a service whose status is the same as its host's (using check_dummy $HOSTSTATUS$ as the service check). However, again based on the documentation, you might have to define, one by one, a service dependency rule between that service and all its "brothers" (the rest of the services associated with the host).

I believe that the solution explained at http://monitoringtt.blogspot.com/2011/0 ... -host.html is the easiest one, especially if you configure that handler as global_service_event_handler so that it is used by all your services. You can even program that event handler script to check whether a given inhibition user macro (say $DISCARD_HOSTSTATUS$) exists on the service or host, in order to avoid running it for certain special objects.
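
A rough sketch of how the global handler and the opt-out flag could be expressed in configuration. The handler name, the script path, and the "Wake on LAN" service with its `check_wol` command are hypothetical. The opt-out is written as a Nagios custom variable, which on a service is exposed to commands as `$_SERVICE<NAME>$`, so `_DISCARD_HOSTSTATUS` becomes `$_SERVICEDISCARD_HOSTSTATUS$`.

```
# In nagios.cfg: run the same handler for every service
global_service_event_handler=host-aware-svc-handler

# In an object file: hypothetical handler command and script path
define command {
    command_name    host-aware-svc-handler
    command_line    $USER1$/eventhandlers/host_aware_svc_handler.sh "$HOSTNAME$" "$SERVICEDESC$" "$HOSTSTATE$" "$_SERVICEDISCARD_HOSTSTATUS$"
}

# Hypothetical service that opts out, so it keeps running while the host is down
define service {
    use                     generic-service
    hostgroup_name          windows-podiaIPs
    service_description     Wake on LAN
    check_command           check_wol
    _DISCARD_HOSTSTATUS     1
}
```

The script would be expected to treat any value other than 1 in its last argument as "flag not set", since services without the custom variable will not pass a meaningful value there.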

slansing · Post by **slansing** » Mon Mar 18, 2013 10:49 am

Have you decided on or found a solution [email protected]?

[email protected] · Post by **[email protected]** » Mon Mar 18, 2013 1:55 pm

As suggested I raised an enhancement request...

vgavara wrote: One service cannot be dependent on one host, just on one (or more) service(s) and what we're talking about is making one or more services dependent on their host.

All the services I test with a given host are very specific to that host (i.e. most are WMI checks using check_wsc as the engine): memory utilisation, processor utilisation, service running, processes running, etc. It makes sense for us to consider services dependent on their host. In fact the vast majority of all the tests we have configured are of this type. The only tests outside of this are a small number of tests on a Microsoft Cluster, DHCP, DNS and AD Domain servers, which I agree would fall under this model.

Does this mean that a different service model or a modification of the current service model is required to satisfy these needs?

The model described in http://nagios.sourceforge.net/docs/3_0/ ... ncies.html is more complex than we use; our needs are much simpler.
In fact, are the checks I have listed above considered 'proper' services under this model? And is the current model suitable for what I am trying to achieve?

The only way I could foresee to make this happen in a large network would be to have a dynamic discovery mechanism which would create all the parent/child relationships and update the config files accordingly. I can't see this happening any time soon.

vgavara wrote: You could bypass the previous fact by creating a service whose status was the same of its host (using check_dummy $HOSTSTATUS$ as service check). However, and again based on documentation, you might define one by one a service dependency rule between that service and all their "brothers" (the rest of services associated to the host).

I found the link http://nagios.sourceforge.net/docs/3_0/ ... ncies.html confusing to follow for what I was trying to achieve. Using the host and services below, how would I code these to fit your solution? The objective is to force all services to an unknown state if the host is down.

define host {
use windows-podium ;
host_name BUS-1025-26_Lab_Podium;
check_command check-host-alive;
alias Lab_Podium ;
address 192.168.91.96 ;
hostgroups windows-podiaIPs;
}

# dummy service as suggested
define service {
use generic-service
check_command check_dummy $HOSTSTATUS$
hostgroups windows-podiaIPs
}

# example service.
define service {
use generic-service
service_description CPU Detail
check_command check-wsc!cpu_detail!80%,90%
hostgroups windows-podiaIPs
}

... What happens to the dummy service ... what state will it be in if the host is down?

Regards
Liam

abrist · Post by **abrist** » Tue Mar 19, 2013 10:52 am

[email protected] wrote:
define host {
use windows-podium ;
host_name BUS-1025-26_Lab_Podium;
check_command check-host-alive;
alias Lab_Podium ;
address 192.168.91.96 ;
hostgroups windows-podiaIPs;
}

# dummy service as suggested
define service {
use generic-service
service_description dummy-checker
check_command check_dummy $HOSTSTATUS$
hostgroups windows-podiaIPs
}

# example service.
define service {
use generic-service
service_description CPU Detail
check_command check-wsc!cpu_detail!80%,90%
hostgroups windows-podiaIPs
}

The service dependency for this setup would be as follows (note: I had to give a service_description for the dummy service):

define servicedependency {
    host_name                       BUS-1025-26_Lab_Podium
    service_description             dummy-checker
    dependent_service_description   CPU Detail
    execution_failure_criteria      w,u,c
    notification_failure_criteria   w,u,c
}
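
The definition above covers a single host, so it would have to be repeated for each podium. The Nagios 3 documentation describes "same host" dependencies, where dependent_host_name and dependent_hostgroup_name are left out so that each dependent service follows the master service on its own host. A sketch of how that might look for the whole group, assuming `hostgroup_name` is honoured this way on the version in use (worth confirming on a test host first); "Memory Usage" is a hypothetical second service added only to show the comma-separated list:

```
# Sketch only: same-host dependency applied across the hostgroup
define servicedependency {
    hostgroup_name                  windows-podiaIPs
    service_description             dummy-checker
    dependent_service_description   CPU Detail,Memory Usage
    execution_failure_criteria      w,u,c
    notification_failure_criteria   w,u,c
}
```

When the execution criteria match, checks of the dependent services are skipped and not forced to UNKNOWN, so those services would be expected to keep their last recorded state while the host is down.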

[email protected] wrote: ... What happens to the dummy service ... what state will it be in if the host is down?

It will reflect the $HOSTSTATUS$ macro, i.e., it will be DOWN.
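
A caveat on the macro: the standard host state macros are `$HOSTSTATE$` (the text UP, DOWN or UNREACHABLE) and `$HOSTSTATEID$` (0, 1 or 2), and `$HOSTSTATUS$` as written above does not appear in the standard macro list. The check_dummy plugin expects a numeric state as its first argument, and Nagios separates command arguments with `!`. A sketch under those assumptions, with a command definition included in case one is not already present:

```
# Assumed command definition; skip it if check_dummy is already defined
define command {
    command_name    check_dummy
    command_line    $USER1$/check_dummy $ARG1$
}

# Dummy service that mirrors the numeric host state
define service {
    use                     generic-service
    hostgroup_name          windows-podiaIPs
    service_description     dummy-checker
    check_command           check_dummy!$HOSTSTATEID$
}
```

Since a service can only be OK, WARNING, CRITICAL or UNKNOWN, this dummy service would show WARNING (1) while the host is DOWN and CRITICAL (2) while it is UNREACHABLE. Both states match the w,u,c failure criteria in the dependency above.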
