remove host from hostgroup but it still gets service checks

Posted: Tue Apr 28, 2015 2:00 pm
by tredlightly
Nagios version is Core: 4.0.8-2

I have defined a set of shared service checks using a hostgroup. I want to use the hostgroup so that I can remove a host from it and have the service checks continue for the remaining hosts. However, when I remove the host from the hostgroup, Nagios continues to monitor it and triggers my event handlers for specific services that are only defined for the hostgroup. My removal script removes the host from the list of members in the hostgroup and restarts Nagios. I've also tried prefixing that hostname with a bang '!' while leaving it within the hostgroup definition. Same issue: it still monitors and triggers event handlers for that host.

I have tried commenting out and unsetting a variety of settings in the nagios.cfg file in an attempt to make sure that nagios is not caching old information about this host and its hostgroup membership. Those changes are shown here:

custom]$ diff /etc/nagios/nagios.cfg /etc/nagios/nagios.cfg~
67c67
< ##object_cache_file=/u02/nagios/var/nagios/objects.cache
---
> object_cache_file=/u02/nagios/var/nagios/objects.cache
83c83
< ##precached_object_file=/u02/nagios/var/nagios/objects.precache
---
> precached_object_file=/u02/nagios/var/nagios/objects.precache
106c106
< ##status_file=/u02/nagios/var/nagios/status.dat
---
> status_file=/u02/nagios/var/nagios/status.dat
115c115
< ##status_update_interval=10
---
> status_update_interval=10
479c479
< ##cached_host_check_horizon=15
---
> cached_host_check_horizon=15
491c491
< ##cached_service_check_horizon=15
---
> cached_service_check_horizon=15
608c608
< retain_state_information=0
---
> retain_state_information=1
643c643
< use_retained_program_state=0
---
> use_retained_program_state=1
654c654
< use_retained_scheduling_info=0
---
> use_retained_scheduling_info=1

However, regardless of those changes the issue persists. Can I not remove a host from a hostgroup and subsequently have all of the service checks for that hostgroup no longer performed on the removed host?

Any direction here would be greatly appreciated. I have searched for others encountering the same issues and have come up empty.

Re: remove host from hostgroup but it still gets service che

Posted: Tue Apr 28, 2015 2:18 pm
by ssax
Please post your host and hostgroup definitions for the host that has been removed.

Re: remove host from hostgroup but it still gets service che

Posted: Tue Apr 28, 2015 2:54 pm
by tredlightly
Host definition, which I do not remove:

define host{
use linux-server ; Name of host template to use
; This host definition will inherit all variables that are defined
; in (or inherited by) the linux-server host template definition.
host_name nss-app2
alias nss-app2
; see if we can use /etc/hosts address 127.0.0.1
check_interval 1
retry_interval 1
}

Hostgroup definition:

define hostgroup{
hostgroup_name nss-app-servers ; The name of the hostgroup
alias Nss Application Servers ; Long name of the group
members nss-app1,nss-app2,nss-app3,nss-app4 ; Comma separated list of hosts that belong to this group
}

The current removal script would adjust the members such that if we were removing nss-app2, the subsequent hostgroup definition would look like:

define hostgroup{
hostgroup_name nss-app-servers ; The name of the hostgroup
alias Nss Application Servers ; Long name of the group
members nss-app1,nss-app3,nss-app4 ; Comma separated list of hosts that belong to this group
}
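
As a rough illustration of the member-list edit described above, here is a minimal shell sketch. The function name remove_member is hypothetical; this is not the poster's admin-app-down.sh, whose contents are not shown in the thread. It only rewrites a comma-separated list and does not touch any config file.

```shell
#!/bin/sh
# Hypothetical helper, not the poster's actual script.
# Usage: remove_member HOST "host1,host2,..."
remove_member() {
    printf '%s\n' "$2" |
        tr ',' '\n' |          # one host per line
        grep -Fxv -- "$1" |    # drop exact matches only (nss-app2, not nss-app20)
        paste -sd, -           # join the remaining hosts with commas again
}

remove_member nss-app2 "nss-app1,nss-app2,nss-app3,nss-app4"
# prints: nss-app1,nss-app3,nss-app4
```

Matching whole names with grep -Fx avoids the classic pitfall of a plain substitution also clipping a longer host name that merely starts with the same characters.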

I've also tried this (the bang 'solution') to no avail:

define hostgroup{
hostgroup_name nss-app-servers ; The name of the hostgroup
alias Nss Application Servers ; Long name of the group
members nss-app1,!nss-app2,nss-app3,nss-app4 ; Comma separated list of hosts that belong to this group
}


And an example service is defined as:

define service{
use generic-service ; Name of service template to use
hostgroup_name nss-app-servers
service_description Tomcat Processes
check_command check_nrpe!check_tomcat_procs
event_handler sendTomcatAppAlarm
check_interval 1
retry_interval 1
notifications_enabled 0
}

Re: remove host from hostgroup but it still gets service che

Posted: Tue Apr 28, 2015 2:57 pm
by jdalrymple
How are you verifying that Nagios is getting restarted? Look in your nagios.log to verify that it actually is restarting successfully. I suspect your removal script is breaking the syntax of the cfg file and Nagios isn't restarting properly.

Is your script successfully modifying the files?
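
One way to rule out a syntax problem is to run the Nagios pre-flight check (nagios -v with the main config file) and only restart when it passes. The sketch below assumes that approach; the names safe_restart, NAGIOS_BIN, NAGIOS_CFG and RESTART_CMD are made up for illustration, and the defaults are taken from the paths mentioned in this thread, so adjust them to the local install.

```shell
#!/bin/sh
# Sketch only: these variable names are assumptions, overridable from the
# environment. The defaults mirror the paths quoted in this thread.
NAGIOS_BIN=${NAGIOS_BIN:-nagios}
NAGIOS_CFG=${NAGIOS_CFG:-/etc/nagios/nagios.cfg}
RESTART_CMD=${RESTART_CMD:-/etc/init.d/nagios restart}

safe_restart() {
    # "nagios -v <main config>" verifies the configuration and exits
    # non-zero when it finds errors.
    if "$NAGIOS_BIN" -v "$NAGIOS_CFG" >/dev/null 2>&1; then
        # Left unquoted on purpose so "script arg" splits into two words.
        $RESTART_CMD
    else
        echo "config check failed; not restarting" >&2
        return 1
    fi
}

# Usage (run as a user allowed to restart Nagios):
#   safe_restart
```

If the check fails, running the same nagios -v command by hand without the output redirection shows which file and line it objects to.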

Re: remove host from hostgroup but it still gets service che

Posted: Tue Apr 28, 2015 3:04 pm
by tredlightly
It is restarting. It uses /etc/init.d/nagios restart

I've also tried /etc/init.d/nagios reload
and /etc/init.d/nagios force-reload

All to no avail. I grep for nss-app2 in this set of files. Here's the egrep for the bang solution:

sudo egrep 'members|nss-app2' /u02/nagios/var/nagios/retention.dat /u02/nagios/var/nagios/status.dat /etc/nagios/objects/custom/nss_app_hosts.cfg
/u02/nagios/var/nagios/retention.dat:host_name=nss-app2
/u02/nagios/var/nagios/retention.dat:host_name=nss-app2
/u02/nagios/var/nagios/retention.dat:host_name=nss-app2
/u02/nagios/var/nagios/retention.dat:host_name=nss-app2
/u02/nagios/var/nagios/retention.dat:host_name=nss-app2
/u02/nagios/var/nagios/retention.dat:host_name=nss-app2
/u02/nagios/var/nagios/retention.dat:host_name=nss-app2
/u02/nagios/var/nagios/retention.dat:host_name=nss-app2
/u02/nagios/var/nagios/retention.dat:host_name=nss-app2
/u02/nagios/var/nagios/retention.dat:host_name=nss-app2
/u02/nagios/var/nagios/retention.dat:host_name=nss-app2
/u02/nagios/var/nagios/retention.dat:host_name=nss-app2
/u02/nagios/var/nagios/retention.dat:host_name=nss-app2
/u02/nagios/var/nagios/retention.dat:host_name=nss-app2
/u02/nagios/var/nagios/retention.dat:host_name=nss-app2
/u02/nagios/var/nagios/retention.dat:host_name=nss-app2
/u02/nagios/var/nagios/retention.dat:host_name=nss-app2
/u02/nagios/var/nagios/retention.dat:host_name=nss-app2
/u02/nagios/var/nagios/retention.dat:host_name=nss-app2
/u02/nagios/var/nagios/retention.dat:host_name=nss-app2
/u02/nagios/var/nagios/retention.dat:host_name=nss-app2
/u02/nagios/var/nagios/retention.dat:host_name=nss-app2
/u02/nagios/var/nagios/status.dat: host_name=nss-app2
/etc/nagios/objects/custom/nss_app_hosts.cfg: host_name nss-app2
/etc/nagios/objects/custom/nss_app_hosts.cfg: alias nss-app2
/etc/nagios/objects/custom/nss_app_hosts.cfg: members !nss-app2,nss-app1,nss-app3,nss-app4 ; Comma separated list of hosts that belong to this group

Here it is for the removal solution:

The script:

admin-app-down.sh nss-app2
NEW_MEMBER_SET = nss-app1,nss-app3,nss-app4
Running configuration check...
Stopping nagios:No lock file found in /var/nagios/nagios.pid
Starting nagios: done.
NEW_MEMBER_SET = nss-app1,nss-app3,nss-app4
Running configuration check...
Stopping nagios:. done.
Starting nagios: done.

And the egrep, run three times. See the nss-app2 entries come back in status.dat?

[bmcs@cp8-nss-lb2 scripts]$ sudo egrep 'members|nss-app2' /u02/nagios/var/nagios/retention.dat /u02/nagios/var/nagios/status.dat /etc/nagios/objects/custom/nss_app_hosts.cfg
/u02/nagios/var/nagios/retention.dat:host_name=nss-app2
/u02/nagios/var/nagios/status.dat: host_name=nss-app2
/etc/nagios/objects/custom/nss_app_hosts.cfg: host_name nss-app2
/etc/nagios/objects/custom/nss_app_hosts.cfg: alias nss-app2
/etc/nagios/objects/custom/nss_app_hosts.cfg: members nss-app1,nss-app3,nss-app4 ; Comma separated list of hosts that belong to this group
[bmcs@cp8-nss-lb2 scripts]$ sudo egrep 'members|nss-app2' /u02/nagios/var/nagios/retention.dat /u02/nagios/var/nagios/status.dat /etc/nagios/objects/custom/nss_app_hosts.cfg
/u02/nagios/var/nagios/retention.dat:host_name=nss-app2
/u02/nagios/var/nagios/status.dat: host_name=nss-app2
/etc/nagios/objects/custom/nss_app_hosts.cfg: host_name nss-app2
/etc/nagios/objects/custom/nss_app_hosts.cfg: alias nss-app2
/etc/nagios/objects/custom/nss_app_hosts.cfg: members nss-app1,nss-app3,nss-app4 ; Comma separated list of hosts that belong to this group
[bmcs@cp8-nss-lb2 scripts]$ sudo egrep 'members|nss-app2' /u02/nagios/var/nagios/retention.dat /u02/nagios/var/nagios/status.dat /etc/nagios/objects/custom/nss_app_hosts.cfg
/u02/nagios/var/nagios/retention.dat:host_name=nss-app2
/u02/nagios/var/nagios/status.dat: host_name=nss-app2
/u02/nagios/var/nagios/status.dat: host_name=nss-app2
/u02/nagios/var/nagios/status.dat: plugin_output=connect to address nss-app2 and port 8080: Connection refused
/u02/nagios/var/nagios/status.dat: host_name=nss-app2
/u02/nagios/var/nagios/status.dat: host_name=nss-app2
/u02/nagios/var/nagios/status.dat: host_name=nss-app2
/u02/nagios/var/nagios/status.dat: host_name=nss-app2
/u02/nagios/var/nagios/status.dat: host_name=nss-app2
/u02/nagios/var/nagios/status.dat: host_name=nss-app2
/u02/nagios/var/nagios/status.dat: host_name=nss-app2
/u02/nagios/var/nagios/status.dat: host_name=nss-app2
/u02/nagios/var/nagios/status.dat: host_name=nss-app2
/u02/nagios/var/nagios/status.dat: host_name=nss-app2
/u02/nagios/var/nagios/status.dat: host_name=nss-app2
/u02/nagios/var/nagios/status.dat: host_name=nss-app2
/u02/nagios/var/nagios/status.dat: host_name=nss-app2
/u02/nagios/var/nagios/status.dat: host_name=nss-app2
/u02/nagios/var/nagios/status.dat: host_name=nss-app2
/u02/nagios/var/nagios/status.dat: host_name=nss-app2
/u02/nagios/var/nagios/status.dat: host_name=nss-app2
/u02/nagios/var/nagios/status.dat: host_name=nss-app2
/u02/nagios/var/nagios/status.dat: host_name=nss-app2
/u02/nagios/var/nagios/status.dat: host_name=nss-app2
/etc/nagios/objects/custom/nss_app_hosts.cfg: host_name nss-app2
/etc/nagios/objects/custom/nss_app_hosts.cfg: alias nss-app2
/etc/nagios/objects/custom/nss_app_hosts.cfg: members nss-app1,nss-app3,nss-app4 ; Comma separated list of hosts that belong to this group

Re: remove host from hostgroup but it still gets service che

Posted: Tue Apr 28, 2015 3:23 pm
by jdalrymple

sudo egrep 'members|nss-app2' /u02/nagios/var/nagios/retention.dat /u02/nagios/var/nagios/status.dat /etc/nagios/objects/custom/nss_app_hosts.cfg
The output from retention.dat and status.dat is useless to us. We understand that you're having a problem where the live configuration isn't consistent with what you want. No need to post the extra cruft to prove that to us.

There are a number of places where a host can be placed in a hostgroup: the hostgroup definition, the host definition, and a template definition. You need to start at your host and work backwards to find where in those definitions your host is being placed into the hostgroup. This should be a straightforward process. Also make sure you don't have a stray .cfg file somewhere that's getting read.
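
To make that backwards search concrete, the sketch below prints every "hostgroups" directive (used in host and host template definitions) and every "members" directive found in .cfg files under a directory, with file name and line number. The function name find_membership_lines is hypothetical. Note that "members" also appears in contactgroup and servicegroup definitions, so the output still needs reading by eye.

```shell
#!/bin/sh
# Usage: find_membership_lines CONFIG_DIR
# Hypothetical helper: lists the lines that can put a host into a hostgroup.
find_membership_lines() {
    # /dev/null is passed as an extra file so grep always prints the
    # file name, even when find hands it a single .cfg file.
    find "$1" -type f -name '*.cfg' -exec \
        grep -nE '^[[:space:]]*(hostgroups|members)[[:space:]]' /dev/null {} +
}

# Usage, with the directory named later in this thread:
#   find_membership_lines /etc/nagios/objects/custom
```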

There is no need to modify nagios.cfg. When you make changes to the nagios configuration files it obeys.

Re: remove host from hostgroup but it still gets service che

Posted: Tue Apr 28, 2015 4:52 pm
by tredlightly
Understood. Our configurations use the cfg_dir directive for this directory: /etc/nagios/objects/custom

We do use the 'install' command with its 'backup' option to implement the change, so we do have files like nss_app_hosts.cfg~ (with an appended tilde). My understanding was that Nagios would only read files ending in .cfg (and thus not .cfg~) when gathering configs from the directories designated by cfg_dir.
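
A quick way to sanity-check that understanding is to list only the files whose names end in .cfg, which is how the Nagios documentation describes cfg_dir processing (including subdirectories). The sketch below approximates that selection with find; list_cfg_files is a hypothetical name, and the demonstration uses a throwaway directory rather than the real config tree.

```shell
#!/bin/sh
# Usage: list_cfg_files CONFIG_DIR
# Approximates which files a cfg_dir directive would pick up.
list_cfg_files() {
    find "$1" -type f -name '*.cfg' | sort
}

# Throwaway demonstration directory (not the real /etc/nagios/objects/custom).
demo=$(mktemp -d)
touch "$demo/nss_app_hosts.cfg" "$demo/nss_app_hosts.cfg~"
list_cfg_files "$demo"    # lists nss_app_hosts.cfg but not the ~ backup
rm -rf "$demo"
```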

Toward that end, I removed the nss_app_hosts.cfg~ file and tried again, but no joy. We do leave the host definition for nss-app2 in the nss_app_hosts.cfg file, but we remove it from the nss-app-servers hostgroup. Here's all we have on it (all of the services are defined to use the hostgroup):

[bmcs@s4-nss-lb2 custom]$ grep nss-app2 *
nss_app_hosts.cfg: host_name nss-app2
nss_app_hosts.cfg: alias nss-app2
nss_app_hosts.cfg: members nss-app1,nss-app2,nss-app3,nss-app4 ; Comma separated list of hosts that belong to this group
nss_app_hosts.cfg~: host_name nss-app2
nss_app_hosts.cfg~: alias nss-app2
nss_app_hosts.cfg~: members nss-app1,nss-app2,nss-app3 ; Comma separated list of hosts that belong to this group

And when I remove it from the members list in nss_app_hosts.cfg you won't see that line from nss_app_hosts.cfg, but you would see it still in nss_app_hosts.cfg~. In the example above, it is nss-app4 that appears to have been added back to the operational config. Oddly enough, it works for that 1 server, nss-app4, but not for any of the others. Originally, nss-app4 would have been the last host in the members list.

Re: remove host from hostgroup but it still gets service che

Posted: Tue Apr 28, 2015 5:02 pm
by jdalrymple
Look in objects.cache to see if the hostgroup is defined under the host, or if the host is a member of the hostgroup. I expect it's the former.
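
For anyone following along, a small awk sketch like the one below can pull a single object block out of objects.cache so the membership is easy to read. The function name show_object is hypothetical, and it assumes blocks open with "define <type> {" and that the identifying directive is "<type>_name", which holds for host and hostgroup objects.

```shell
#!/bin/sh
# Usage: show_object CACHE_FILE TYPE NAME
#   e.g. show_object objects.cache hostgroup nss-app-servers
# Hypothetical helper: prints the "define TYPE { ... }" block whose
# TYPE_name directive equals NAME.
show_object() {
    awk -v type="$2" -v name="$3" '
        # Start collecting at the opening line of a block of this type.
        $0 ~ ("^define " type "[ \t]*[{]") { inblock = 1; wanted = 0; block = "" }
        inblock { block = block $0 "\n" }
        # Remember whether this block is the one being asked for.
        inblock && $1 == (type "_name") && $2 == name { wanted = 1 }
        # At the closing brace, print the block only if it matched.
        inblock && /^[ \t]*[}]/ {
            if (wanted) printf "%s", block
            inblock = 0
        }
    ' "$1"
}
```

Running it once with type hostgroup and once with type host gives the two blocks to compare without scrolling through the whole cache file.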

Re: remove host from hostgroup but it still gets service che

Posted: Tue Apr 28, 2015 6:09 pm
by tredlightly
We do have customized locations in case that factors in:

$ pwd
/u02/nagios/var/nagios
nagios]$ grep -n nss-app2 objects.cache
301: members nss-app1,nss-app2,nss-app3,nss-app4
401: host_name nss-app2

Hostgroup first, host afterward.


This is the block for the 301 line reference:

define hostgroup {
hostgroup_name nss-app-servers
alias Nss Application Servers
members nss-app1,nss-app2,nss-app3,nss-app4
}

This is the block for the 401 line reference:
define host {
host_name nss-app2
alias nss-app2
address nss-app2
check_period 24x7
check_command check-host-alive
contact_groups admins
notification_period workhours
initial_state o
importance 0
check_interval 1.000000
retry_interval 1.000000
max_check_attempts 10
active_checks_enabled 1
passive_checks_enabled 1
obsess 1
event_handler_enabled 1
low_flap_threshold 0.000000
high_flap_threshold 0.000000
flap_detection_enabled 1
flap_detection_options a
freshness_threshold 0
check_freshness 0
notification_options r,d,u
notifications_enabled 1
notification_interval 120.000000
first_notification_delay 0.000000
stalking_options n
process_perf_data 1
retain_status_information 1
retain_nonstatus_information 1
}

Re: remove host from hostgroup but it still gets service che

Posted: Wed Apr 29, 2015 9:41 am
by jdalrymple
This all adds up to Nagios not reloading properly, or it's finding that hostgroup definition somewhere else in an unexpected config file.
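
To check for an unexpected config file, it helps to list exactly which files and directories the main config pulls in. The sketch below prints the active cfg_file= and cfg_dir= lines from a nagios.cfg; list_config_sources is a hypothetical name, and commented-out lines are skipped because the pattern only matches lines that begin with the directive.

```shell
#!/bin/sh
# Usage: list_config_sources /path/to/nagios.cfg
# Hypothetical helper: shows every object file or directory the main
# config tells Nagios to read.
list_config_sources() {
    grep -E '^[[:space:]]*cfg_(file|dir)=' "$1"
}

# Usage, with the main config path from this thread:
#   list_config_sources /etc/nagios/nagios.cfg
```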

I have no need to replicate your configuration. In the decade that I've worked with Nagios I've never seen it simply disobey a configuration. I suggest creating a simple config like this one to prove the behavior of the configuration to yourself:

define host {
       name                                     simple-host
       check_command                            check-host-alive
       max_check_attempts                       1
       check_period                             24x7
       contacts                                 nagiosadmin
       notification_interval                    60
       notification_period                      24x7
       register                                 0
}

define host {
       use                                      simple-host
       host_name                                simple-host-a
       address                                  127.0.0.1
}

define host {
       use                                      simple-host
       host_name                                simple-host-b
       address                                  127.0.0.1
}

define host {
       use                                      simple-host
       host_name                                simple-host-c
       address                                  127.0.0.1
}

define service {
       name                                     simple-service
       service_description                      simple-service
       hostgroup_name                           simple-hostgroup
       check_command                            check_dummy!0
       max_check_attempts                       1
       check_interval                           1
       check_period                             24x7
       retry_interval                           1
       notification_interval                    60
       notification_period                      24x7
       contacts                                 nagiosadmin
       register                                 1
}

define hostgroup {
       hostgroup_name                           simple-hostgroup
       alias                                    simple-hostgroup
       members                                  simple-host-a, simple-host-b, simple-host-c
}

Add and remove hosts from the hostgroup at will and you'll see their associated services disappear from Nagios.