host in 2 groups with clashing service checks - overwrite ?
Posted: Mon Apr 20, 2015 12:11 pm
by stucky
Hello
I have a basic problem I have never been able to solve in Nagios. I have defined a hostgroup called "default_linux" which monitors a few basic mountpoints.
I associated a service check with that group to achieve that. Now I can add a host to the group and these will get monitored.
However, they all use the same threshold of -w 10% -c 5%. The problem is that I might have a subgroup of hosts in there where /var is always at 6% left and that's ok.
The service checks for /var on these hosts would always be in WARNING so I created a new group called "default_linux_du_var_5%_1%" with -w 5% -c 1%.
Except, if I add a host there, Nagios simply adds a second check for /var (one for the standard 10%/5% and another for 5%/1%), which defeats the purpose.
Now I have to remove this host from the "default_linux" group, but then it stops monitoring the other basic mount points. I guess the question is:
How do I tell Nagios how to deal with hosts that are in several groups with clashing service checks for the same service? One would have to be labeled "higher" than the other.
I thought templates were the answer, but even if one template is a child of the other, the moment I place a host into both groups Nagios simply adds the 2 clashing services to the monitoring list. This has always been a challenge. It seems I'd have to create a flat group for _every_ single possible scenario that I might need and add all relevant services to that group. Tomorrow I might need the same scenario but with different contact groups, so I'd have to create yet another group just for that. That cannot be the correct way, but I don't see how service templates help me here.
Do I have to associate each individual host with a set of services rather than hostgroups? What's the purpose of host groups then?
How are others dealing with this?
thx
Re: host in 2 groups with clashing service checks - overwrit
Posted: Mon Apr 20, 2015 12:45 pm
by jdalrymple
The behavior you're experiencing is desired for some people. Say you have Oracle database servers and MSSQL database servers. It's common for Oracle to fill its tablespace and run at or near 100% full disks all the time, but MSSQL is different. At the end of the day you'll want to aggregate those 2 hostgroups into 1 called "database servers" though, so that you can check some common shared services such as OS disk space, memory, load, etc. That is how hostgroups were intended to work.
The problem is that you don't have clashing services. Just because the check_command starts with the same binary doesn't mean that the 2 services are in any way related. The only way I can think to do what you're trying to achieve is with custom object variables:
http://nagios.sourceforge.net/docs/nagi ... tvars.html
You could define a custom object variable in your host definition such as:
Code: Select all
define host{
host_name server
_cfree 15%
_dfree 5%
...
}
Code: Select all
define service{
hostgroup_name servers
description Free Space C Drive
check_command check_disk!$HOSTADDRESS$!$_HOSTCFREE$
...
}
Does that make any sense?
-- edit --
Changed host_name to hostgroup_name as it makes more sense in this user's use-case.
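Applied to the /var case from the first post, the custom variable approach might look roughly like the sketch below. The template name linux-host-defaults, the variable names _VAR_WARN and _VAR_CRIT, the command name check_var_by_ssh, the host names, the addresses and the remote plugin path are all made up for illustration. generic-host and generic-service are assumed to be existing templates, and the default_linux hostgroup is assumed to be defined elsewhere. The defaults live in a host template, so only the odd host has to override them.

```
# Host template carrying the default thresholds (hypothetical names).
define host {
    name        linux-host-defaults
    use         generic-host          ; assumed existing template
    hostgroups  default_linux
    _VAR_WARN   10%
    _VAR_CRIT   5%
    register    0
}

# Ordinary host: inherits 10% / 5% from the template.
define host {
    host_name   linux-a
    use         linux-host-defaults
    address     192.0.2.10
}

# Host where /var normally sits near 6% free: override the defaults.
define host {
    host_name   linux-b
    use         linux-host-defaults
    address     192.0.2.11
    _VAR_WARN   5%
    _VAR_CRIT   1%
}

# The thresholds are read from the host's custom variables at check time.
define command {
    command_name  check_var_by_ssh
    command_line  $USER1$/check_by_ssh -H $HOSTADDRESS$ -C "/usr/local/nagios/libexec/check_disk -w $_HOSTVAR_WARN$ -c $_HOSTVAR_CRIT$ -p /var"
}

# One service for the whole hostgroup, so no per-host duplicates.
define service {
    hostgroup_name       default_linux
    service_description  du /var
    use                  generic-service   ; assumed existing template
    check_command        check_var_by_ssh
}
```

A host that has neither variable set would hand empty thresholds to check_disk, so it is worth keeping the defaults in a template that every host in the group uses.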
Re: host in 2 groups with clashing service checks - overwrit
Posted: Mon Apr 20, 2015 2:30 pm
by stucky
jda
thx for your input. I'm surprised though. Isn't this the most basic use case ?
Let's say I started out with a few simple groups of hosts that are alike. One day, a year later, one host starts acting up a bit and I need to overwrite one service check on that level.
I don't have any custom vars defined. People must have run into this before.
The only way I could fix that is to remove this host from the group that includes this service check and create a custom group just for this particular adjustment.
I'm curious to know what other folks typically do when they encounter this unexpectedly.
I will explore the custom var idea more though.
Re: host in 2 groups with clashing service checks - overwrit
Posted: Mon Apr 20, 2015 2:37 pm
by jolson
There is another way to approach this problem.
If another service is defined with the same name as your existing service, you can attach it to a host to overwrite the inherited service that comes from the hostgroup.
For example, if you had these two services:
2015-04-20 14_28_40-Nagios XI - Nagios Core Config Manager.png
The top service is in hostgroup "NLS-cluster", and the bottom one is not a part of any hostgroup.
I make a new host - "Nagios Log Server 2" - and assign it to the group "NLS-cluster", which causes it to inherit all of that hostgroup's services, which in this case includes the 'Check for Elasticsearch' service.
If I then take the bottom service and assign it to that same host:
2015-04-20 14_36_29-Nagios XI - Nagios Core Config Manager.png
It will overwrite the original inherited service.
The easiest way to get a service up and running would be to 'copy' it and rename it to match the other service exactly. You can add some sort of identifier in the 'Display Name' field if you want to.
I figured I would let you know that this is possible. Thanks!
Re: host in 2 groups with clashing service checks - overwrit
Posted: Mon Apr 20, 2015 3:08 pm
by stucky
Jda
I cannot confirm that. Here is my setup:
2 hosts:
define host {
host_name linux-test01.ps.am.sony.com
use xiwizard_generic_host
address linux-test01.ps.am.sony.com
register 1
}
define host {
host_name linux-test02.ps.am.sony.com
use xiwizard_generic_host
address linux-test02.ps.am.sony.com
register 1
}
And here is my service, which has 2 different varieties defined. The service is called "custom_check_by_ssh_du_var", and it is defined twice under the same name but with different descriptions.
[root@prdmgtmon01 ~]# cat /usr/local/nagios/etc/services/custom_check_by_ssh_du_var.cfg
define service {
service_description du /var w10 c5
use xiwizard_generic_service
hostgroup_name default_linux
check_command custom_check_xi_by_ssh!'./check_disk -w 10% -c 5% -p /var'!!!!!!!
register 1
}
define service {
host_name linux-test02.ps.am.sony.com
service_description du /var w20 c10
use xiwizard_generic_service
check_command custom_check_xi_by_ssh!'./check_disk -w 20% -c 10% -p /var'!!!!!!!
register 1
}
It still just adds both /var checks to linux-test02 as before.
Re: host in 2 groups with clashing service checks - overwrit
Posted: Mon Apr 20, 2015 3:33 pm
by jdalrymple
jolson wrote: If another service is defined with the same name as your existing service
stucky wrote:
Code: Select all
[root@prdmgtmon01 ~]# cat /usr/local/nagios/etc/services/custom_check_by_ssh_du_var.cfg
define service {
service_description du /var w10 c5
use xiwizard_generic_service
hostgroup_name default_linux
check_command custom_check_xi_by_ssh!'./check_disk -w 10% -c 5% -p /var'!!!!!!!
register 1
}
define service {
host_name linux-test02.ps.am.sony.com
service_description du /var w20 c10
use xiwizard_generic_service
check_command custom_check_xi_by_ssh!'./check_disk -w 20% -c 10% -p /var'!!!!!!!
register 1
}
You did not give your services the same name.
Hostgroups are intended to group LIKE-hosts. So you might have 1000 Windows hosts where you don't want the C drive to be over 80%, but on each of those the D drive requirements will be wildly different and as such they should be grouped differently:
Code: Select all
[ -- Fileservers Alert D at 98% full -- ][ -- SQLServers Alert D at 80% full -- ][ -- Webservers Alert D at 90% full -- ]
[ -- non App No load alert -- ][ -- App servers Alert when load > 90% -- ]
[ -- Windows Servers Alert at C 80% full -- ]
Note that groups can be nested. The idea is that if you find the services are being defined differently for different hosts, those hosts are probably not alike enough to justifiably be in that hostgroup. Again - use nesting so that they are in the same hostgroup where appropriate.
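As a rough sketch of the nesting idea for the /var case in this thread: a parent hostgroup carries the checks every host shares, and two sub-groups carry the differing /var thresholds. The sub-group names, the "du /" service and the check_disk_by_ssh command (assumed to take warning, critical and path as its arguments) are hypothetical, and generic-service is assumed to be an existing template. hostgroup_members is the standard directive for pulling sub-groups into a parent group.

```
# Sub-groups: each host sits in exactly one of them.
define hostgroup {
    hostgroup_name  linux_var_w10_c5
    alias           Linux hosts with /var at w10 c5
    members         linux-test01.ps.am.sony.com
}

define hostgroup {
    hostgroup_name  linux_var_w5_c1
    alias           Linux hosts with /var at w5 c1
    members         linux-test02.ps.am.sony.com
}

# Parent group: contains every host from both sub-groups.
define hostgroup {
    hostgroup_name     default_linux
    alias              All Linux hosts
    hostgroup_members  linux_var_w10_c5,linux_var_w5_c1
}

# Shared check, attached once to the parent group.
define service {
    hostgroup_name       default_linux
    service_description  du /
    use                  generic-service
    check_command        check_disk_by_ssh!10%!5%!/
}

# /var check, attached per sub-group with its own thresholds.
define service {
    hostgroup_name       linux_var_w10_c5
    service_description  du /var
    use                  generic-service
    check_command        check_disk_by_ssh!10%!5%!/var
}

define service {
    hostgroup_name       linux_var_w5_c1
    service_description  du /var
    use                  generic-service
    check_command        check_disk_by_ssh!5%!1%!/var
}
```

With this layout each host should end up with one "du /" and exactly one "du /var" service, provided no host is placed in both sub-groups.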
As for other solutions - I'll offer my own "personal" favorite - most checks with thresholds are remote checks (NRPE/NSclient/NCPA) and *personally* I prefer to hard code my thresholds in my host configs, so my check_command is just "check_disk" - no arguments whatsoever. You'll get plenty of differing opinions on favored ways of handling this.
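A minimal sketch of that approach with NRPE, assuming a check_nrpe command on the Nagios server that takes the remote command name as its first argument. The remote command name check_disk_var, the plugin path and the generic-service template are illustrative assumptions, not taken from anyone's actual setup.

```
# --- On the monitored host, in its nrpe.cfg ---
# Thresholds live here, so each host can carry its own values.
command[check_disk_var]=/usr/local/nagios/libexec/check_disk -w 5% -c 1% -p /var

# --- On the Nagios server ---
# The service only names the remote command; no thresholds are passed.
define service {
    hostgroup_name       default_linux
    service_description  du /var
    use                  generic-service   ; assumed existing template
    check_command        check_nrpe!check_disk_var
}
```

The trade-off is that the thresholds are no longer visible in the Nagios server's configuration; they have to be managed on each monitored host.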
Re: host in 2 groups with clashing service checks - overwrit
Posted: Mon Apr 20, 2015 3:35 pm
by jolson
My services are defined as follows:
Code: Select all
define service {
service_description Check for Elasticsearch
use generic-service
hostgroup_name NLS-cluster
check_command check_nrpe!check_for_elasticsearch!!!!!!!
max_check_attempts 5
check_interval 5
retry_interval 1
check_period xi_timeperiod_24x7
notification_interval 60
notification_period xi_timeperiod_24x7
contacts nagiosadmin
_xiwizard nrpe
register 1
}
define service {
host_name Nagios Log Server 2
service_description Check for Elasticsearch
use generic-service
check_command check_nrpe!check_for_elasticsearch!!!!!!!
max_check_attempts 6
check_interval 6
retry_interval 1
check_period xi_timeperiod_24x7
notification_interval 60
notification_period xi_timeperiod_24x7
contacts nagiosadmin
_xiwizard nrpe
register 1
}
Note that the description needs to match. The way to identify them from one another will need to be done through the display_name or Config Name variables. On the GUI those variables are referenced as 'Display name' and 'Config Name' respectively:
2015-04-20 15_33_18-Nagios XI - Nagios Core Config Manager.png
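Applied to the two /var services posted earlier in the thread, a matching pair might look like the sketch below. The shortened service_description and the display_name values are illustrative; everything else is copied from those definitions. Whether the host-level definition actually takes precedence over the hostgroup one may depend on the Nagios version, so this is something to verify on your own install.

```
# Hostgroup-wide default for /var.
define service {
    service_description  du /var
    display_name         du /var w10 c5
    use                  xiwizard_generic_service
    hostgroup_name       default_linux
    check_command        custom_check_xi_by_ssh!'./check_disk -w 10% -c 5% -p /var'!!!!!!!
    register             1
}

# Host-specific definition with the same service_description.
define service {
    host_name            linux-test02.ps.am.sony.com
    service_description  du /var
    display_name         du /var w20 c10
    use                  xiwizard_generic_service
    check_command        custom_check_xi_by_ssh!'./check_disk -w 20% -c 10% -p /var'!!!!!!!
    register             1
}
```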
Re: host in 2 groups with clashing service checks - overwrit
Posted: Mon Apr 20, 2015 3:58 pm
by stucky
Thank you guys for the great feedback so quickly !
OK, getting somewhere now. I thought the service name itself had to match. This works, but I noticed that the display name doesn't seem to show up anywhere in the GUI.
I use check_by_ssh, and the problem is that this plugin doesn't return the threshold values as part of the output, so in the GUI you don't know that one /var check is actually using different thresholds from the other /var check.
That's why I had added the thresholds to the descriptions, like "du /var w10 c5" vs. "du /var w20 c10". Now the GUI just shows the generic "du /var".
Q1. Where is the "Display Name" supposed to show up ? I would really like to see which "/var" check I'm looking at.
thx
Re: host in 2 groups with clashing service checks - overwrit
Posted: Mon Apr 20, 2015 4:43 pm
by jolson
Q1. Where is the "Display Name" supposed to show up ? I would really like to see which "/var" check I'm looking at.
I cannot think of a way to make the 'display_name' variable show up instead of the service description - but it is planned to be released in the next major version of XI.
There is a workaround that I found, using the Actions component. Please follow the screenshots below:
Access components:
2015-04-20 16_39_52-Nagios XI - Administration.png
Edit the 'Actions' component:
2015-04-20 16_40_02-Nagios XI - Administration.png
Use the variables in place below, and press 'Apply Settings':
2015-04-20 16_40_24-Nagios XI - Administration.png
You can now see the 'display_name' in the service details:
2015-04-20 16_40_39-Nagios XI.png
Re: host in 2 groups with clashing service checks - overwrit
Posted: Mon Apr 20, 2015 5:10 pm
by stucky
Not elegant, but I'll take it for now. I'm also playing with nested host groups and that appears to work as expected.
Now I just wish the nested relationship of host groups were in any way indicated graphically.
For now I'll try to make it work with specific naming but it'd be nice to have in the future.
Is this in the pipeline for later releases?