Nagios keeps checking services on offline hosts...

Posted: Thu Feb 19, 2015 12:46 pm
by melmoth
Hi all,
I'm fighting with a configuration problem and I hope someone here can shed some light. My /var/log/messages is being flooded with check timeout messages (and /usr/local/nagios/var/archives is huge, too):

...
nagios: Warning: Check of service '00 System - Battery voltage' on host 'MyremoteHostSrv1' timed out after 60.011s!
nagios: wproc: Core Worker 28947: job 57811 (pid=12721): Dormant child reaped
nagios: wproc: CHECK job 57810 from worker Core Worker 28946 timed out after 60.01s
nagios: wproc:   host=MyremoteHostSrv1; service=00 Info - Hostname;
nagios: wproc:   early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
...
While I know I can stop Nagios from duplicating its messages to my syslog, what I really want is to avoid generating the clutter in the first place :)
I'm monitoring 147 hosts, for a total of 2549 services:

Nagios Stats 4.0.8
Copyright (c) 2003-2008 Ethan Galstad (www.nagios.org)
Last Modified: 08-12-2014
License: GPL

CURRENT STATUS DATA
------------------------------------------------------
Status File:                            /usr/local/nagios/var/status.dat
Status File Age:                        0d 0h 0m 1s
Status File Version:                    4.0.8

Program Running Time:                   0d 5h 44m 50s
Nagios PID:                             28943

Total Services:                         2549
Services Checked:                       2549
Services Scheduled:                     2549
Services Actively Checked:              2549
Services Passively Checked:             0
Total Service State Change:             0.000 / 11.180 / 0.396 %
Active Service Latency:                 0.000 / 0.570 / 0.001 sec
Active Service Execution Time:          0.011 / 60.029 / 26.672 sec
Active Service State Change:            0.000 / 11.180 / 0.396 %
Active Services Last 1/5/15/60 min:     299 / 2504 / 2549 / 2549
Passive Service Latency:                0.000 / 0.000 / 0.000 sec
Passive Service State Change:           0.000 / 0.000 / 0.000 %
Passive Services Last 1/5/15/60 min:    0 / 0 / 0 / 0
Services Ok/Warn/Unk/Crit:              405 / 3 / 1118 / 1023
Services Flapping:                      0
Services In Downtime:                   0

Total Hosts:                            147
Hosts Checked:                          147
Hosts Scheduled:                        40
Hosts Actively Checked:                 147
Host Passively Checked:                 0
Total Host State Change:                0.000 / 8.680 / 0.094 %
Active Host Latency:                    0.000 / 1.021 / 0.008 sec
Active Host Execution Time:             0.244 / 30.007 / 8.890 sec
Active Host State Change:               0.000 / 8.680 / 0.094 %
Active Hosts Last 1/5/15/60 min:        76 / 131 / 133 / 134
Passive Host Latency:                   0.000 / 0.000 / 0.000 sec
Passive Host State Change:              0.000 / 0.000 / 0.000 %
Passive Hosts Last 1/5/15/60 min:       0 / 0 / 0 / 0
Hosts Up/Down/Unreach:                  50 / 64 / 33
Hosts Flapping:                         0
Hosts In Downtime:                      0

Active Host Checks Last 1/5/15 min:     171 / 848 / 2493
   Scheduled:                           162 / 823 / 2420
   On-demand:                           9 / 25 / 73
   Parallel:                            162 / 823 / 2420
   Serial:                              0 / 0 / 0
   Cached:                              9 / 25 / 73
Passive Host Checks Last 1/5/15 min:    0 / 0 / 0
Active Service Checks Last 1/5/15 min:  376 / 2626 / 7736
   Scheduled:                           376 / 2626 / 7736
   On-demand:                           0 / 0 / 0
   Cached:                              0 / 0 / 0
Passive Service Checks Last 1/5/15 min: 0 / 0 / 0

External Commands Last 1/5/15 min:      0 / 0 / 0

All services are active checks performed via check_snmp. Due to the nature of the hosts I cannot use any other means to perform the checks.
The hosts are installed at various locations, and each location has a router which acts as a VPN client connected to the Nagios server. I have established a parent-child relationship between each router and the hosts behind it.
The routers should always be up, but the hosts are down quite frequently, which is perfectly normal: they are used only a few hours a day. Unfortunately I have no way to predict when they will be in use, so I can't rely on a time period definition.
My problem is that the service checks are performed regardless of the router state (router down -> host unreachable) and regardless of the host state (host offline), so the checks end with a timeout error, of course.


My current config is as follows.

1-generic-host.cfg:

define host {
  name                           generic-host
  active_checks_enabled          1
  check_command                  check-host-alive-ping
  contact_groups                 systems
  event_handler_enabled          1
  flap_detection_enabled         1
  max_check_attempts             3
  notification_interval          10
  notification_options           d                                   
  notification_period            24x7
  notifications_enabled          1
  obsess_over_host               0
  passive_checks_enabled         1
  process_perf_data              1
  register                       0                                  
  retain_nonstatus_information   1
  retain_status_information      1
}

generic-service.cfg:

define service {
  name                           generic-service
  active_checks_enabled          1                                   
  check_freshness                0                                   
  check_interval                 5
  check_period                   24x7
  contact_groups                 systems
  event_handler_enabled          1                                   
  flap_detection_enabled         1                                   
  is_volatile                    0
  max_check_attempts             3
  notification_interval          30                                  
  notification_options           w,c                                 
  notification_period            24x7
  notifications_enabled          1                                   
  obsess_over_service            0                                   
  parallelize_check              1                                   
  passive_checks_enabled         1                                   
  process_perf_data              1                                   
  register                       0           
  retain_nonstatus_information   1                                   
  retain_status_information      1                                   
  retry_interval                 2
}

generic-router.cfg:

define host {
  name                           generic-router                            
  check_command                  check_ping!300.0,1%!500.0,1%        
  contact_groups                 admins
  event_handler_enabled          1                                   
  flap_detection_enabled         1                                   
  hostgroups                     generic-routers
  max_check_attempts             3
  notification_interval          10
  notification_options           d,r
  notification_period            24x7
  notifications_enabled          1                                   
  obsess_over_host               0
  process_perf_data              1                                   
  register                       0                                   
  retain_nonstatus_information   1                                   
  retain_status_information      1                                   
}

define hostgroup {
  hostgroup_name                 generic-routers
  alias                          Router group
}


generic-special-server.cfg:

define host {
  name                           special-srv
  use                            generic-host
  check_command                  check-host-alive                    
  check_interval                 0				
  check_period                   24x7
  contact_groups                 systems
  hostgroups                     special-servers
  max_check_attempts             3
  notification_interval          30  			
  register                       0				
  retry_interval                 1
}

define hostgroup {
  hostgroup_name                 special-servers
  alias                          My special servers
}

my-remotelocation-host.cfg:

define host {
  host_name                      RemoteLocationRouter
  address                        10.1.1.1
  use                            generic-router,pnp4nagios_host            
}


define host {
  host_name                      RemoteLocationSrv1
  address                        10.1.1.10
  parents                        RemoteLocationRouter
  use                            special-srv,pnp4nagios_host
}

Here are the performance figures, in case they are of interest:

OBJECT CONFIG PROCESSING TIMES      (* = Potential for precache savings with -u option)
----------------------------------
Read:                 0.004944 sec
Resolve:              0.000281 sec  *
Recomb Contactgroups: 0.000018 sec  *
Recomb Hostgroups:    0.000297 sec  *
Dup Services:         0.004701 sec  *
Recomb Servicegroups: 0.000022 sec  *
Duplicate:            0.000001 sec  *
Inherit:              0.001039 sec  *
Register:             0.005164 sec
Free:                 0.000422 sec
                      ============
TOTAL:                0.016889 sec  * = 0.001590 sec (9.41%) estimated savings


Timing information on configuration verification is listed below.

CONFIG VERIFICATION TIMES
----------------------------------
Object Relationships: 0.002919 sec
Circular Paths:       0.000257 sec
Misc:                 0.000169 sec
                      ============
TOTAL:                0.003345 sec


RETENTION DATA TIMES
----------------------------------
Read and Process:     0.186711 sec
                      ============
TOTAL:                0.186711 sec


EVENT SCHEDULING TIMES
-------------------------------------
Get service info:        0.006883 sec
Get host info info:      0.000349 sec
Get service params:      0.000018 sec
Schedule service times:  0.016055 sec
Schedule service events: 0.003675 sec
Get host params:         0.000001 sec
Schedule host times:     0.000216 sec
Schedule host events:    0.000070 sec
                         ============
TOTAL:                   0.027267 sec


Projected scheduling information for host and service checks
is listed below.  This information assumes that you are going
to start running Nagios with your current config files.

HOST SCHEDULING INFORMATION
---------------------------
Total hosts:                     147
Total scheduled hosts:           40
Host inter-check delay method:   SMART
Average host check interval:     300.00 sec
Host inter-check delay:          7.50 sec
Max host check spread:           30 min
First scheduled check:           Thu Feb 19 18:17:27 2015
Last scheduled check:            Thu Feb 19 18:22:19 2015


SERVICE SCHEDULING INFORMATION
-------------------------------
Total services:                     2549
Total scheduled services:           2549
Service inter-check delay method:   SMART
Average service check interval:     300.00 sec
Inter-check delay:                  0.12 sec
Interleave factor method:           SMART
Average services per host:          17.34
Service interleave factor:          18
Max service check spread:           30 min
First scheduled check:              Thu Feb 19 18:17:28 2015
Last scheduled check:               Thu Feb 19 18:22:27 2015


CHECK PROCESSING INFORMATION
----------------------------
Average check execution time:    26.10s
Estimated concurrent checks:     316 (158.00 per cpu core)
Max concurrent service checks:   Unlimited


PERFORMANCE SUGGESTIONS
-----------------------
* Aim for a max of 50 concurrent checks / cpu core (current: 158.00)

NOTE: These are just guidelines and *not* hard numbers.

Ultimately, only testing will tell if your settings and hardware are
suitable for the types and number of checks you're planning to run.

Any help is greatly appreciated.

Re: Nagios keeps checking services on offline hosts...

Posted: Thu Feb 19, 2015 4:42 pm
by lmiltchev
"My problem is that the service checks are performed regardless of the router state (router down -> host unreachable) and regardless of the host state (host offline), so the checks end with a timeout error, of course."
You can create a simple ping service check for the router, then use service dependencies (execution_failure_criteria = c,u). If the ping (the master service) is in a critical or unknown state, the dependent services won't be run. For more info on host/service dependencies, see this:

http://nagios.sourceforge.net/docs/nagi ... ncies.html
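
Since writing one dependency per service by hand does not scale, the definitions could be generated instead. What follows is only a minimal sketch under stated assumptions: the `TOPOLOGY` inventory, the `PING` master service description, and the `render_dependencies` helper are hypothetical, and the host and service names are simply borrowed from the first post for illustration. Only the `servicedependency` directives themselves are standard Nagios object syntax.

```python
# Hypothetical inventory: router -> {host behind it -> [service descriptions]}.
# Replace with your own data source; these names are illustrative only.
TOPOLOGY = {
    "RemoteLocationRouter": {
        "RemoteLocationSrv1": [
            "00 Info - Hostname",
            "00 System - Battery voltage",
        ],
    },
}

# Assumed description of the ping service defined on every router.
MASTER_SERVICE = "PING"

TEMPLATE = """define servicedependency {{
  host_name                      {router}
  service_description            {master}
  dependent_host_name            {host}
  dependent_service_description  {service}
  execution_failure_criteria     c,u
}}
"""


def render_dependencies(topology, master=MASTER_SERVICE):
    """Return one servicedependency definition per dependent service."""
    blocks = []
    for router, hosts in sorted(topology.items()):
        for host, services in sorted(hosts.items()):
            for service in services:
                blocks.append(
                    TEMPLATE.format(
                        router=router, master=master, host=host, service=service
                    )
                )
    return "\n".join(blocks)


if __name__ == "__main__":
    # Redirect the output into a .cfg file that Nagios includes.
    print(render_dependencies(TOPOLOGY))
```

The output can be written to a separate file and checked with the usual configuration verification before reloading Nagios.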

Re: Nagios keeps checking services on offline hosts...

Posted: Fri Feb 20, 2015 3:25 am
by melmoth
Thanks for the suggestion. I was thinking of service dependencies as well, but what is not clear to me is why Nagios keeps checking services on hosts which are in "unreachable" or "down" status... Shouldn't this be avoided when using parent/child relations between hosts? And moreover, if a host is offline, why keep checking its services, which obviously leads to a timeout in the check?
What I'm trying to figure out is whether this is the correct behavior or whether I've done something wrong in my config...
Besides, setting up service dependencies for 2500 services is a rather daunting task :)

Re: Nagios keeps checking services on offline hosts...

Posted: Fri Feb 20, 2015 2:40 pm
by jdalrymple
melmoth,

This is somewhat by design. It is very possible for check_icmp to fail because of a firewall or some such while the host services like httpd could still be alive. This is actually quite common on a default install of a modern Windows server.

In order to have the service checks stop, you will need to implement service dependencies.

Re: Nagios keeps checking services on offline hosts...

Posted: Thu Feb 26, 2015 6:40 am
by melmoth
Thank you for your help as well.
As things stand, the way checks are performed seems quite inefficient. The parent->child relationship doesn't seem to make much difference, apart from the fact that the child is identified as unreachable instead of down when the parent is offline...
I even tried implementing scheduled downtime: I set up a 1440-minute downtime once a day, but the checks are still performed anyway! It seems the downtime only affects the notifications.
The other option would be using timeperiods on check_period to at least mitigate the problem (I don't need checks performed overnight, for instance), but it seems there is a bug which prevents this from working in both 4.0.7 and 4.0.8...

BTW, implementing service dependencies on ~150 hosts with a total of 2500 services would be quite complicated anyway...

Boh!

As a side note, what is the most efficient way to generate a static web page to display just certain hostgroups' data? The scenario is this: a customer would like to have access to data pertaining to his hosts. He doesn't need the full Nagios/Thruk interface, nor am I willing to provide access to my Nagios server, and he doesn't need real-time info either. So I thought of pulling data from the Nagios status.dat and generating a static .htm page which can be hosted offsite and updated via a cron job every once in a while. Any suggestions?

Re: Nagios keeps checking services on offline hosts...

Posted: Thu Feb 26, 2015 9:46 am
by jdalrymple
I can understand your frustration, but on many points you're correct and again it's by design.

As mentioned, the parent/child relationship does indeed do what you suggested, and for many that's desirable.

With regard to maintenance mode, stopping the notifications is the only goal. For those people who are looking at a dashboard it is still useful information to know if a service is up or down during a maintenance window. A lot of people use that feature to know if they've properly recovered from their downtime; if the checks didn't resume until after the maintenance window ended, you wouldn't have that visibility until after the system is prepared to start alerting users again.

I understand that making those service dependencies is cumbersome, particularly if there is no uniformity in the environment. Playing right along with the next question though...
"what is the most efficient way to generate a static web page to display just certain hostgroups' data"
I don't care to turn this thread into a sales pitch, but this is a rather trivial thing to do with Nagios XI. You may be a good candidate to move up to that version of our product. As a former heavy user of Core I understand the beauty of the free version, but a lot of the things you want to do more easily are included in Nagios XI. I recommend downloading the trial and seeing whether the features it brings are worth the cost in your environment. Besides "the easy way" of buying XI, I recommend you review the following URL before creating your custom solution:
http://nagios.sourceforge.net/docs/3_0/cgiauth.html
It may be possible to achieve what you want just by modifying authorization parameters in the cgi.cfg file.

Re: Nagios keeps checking services on offline hosts...

Posted: Thu Feb 26, 2015 11:04 am
by melmoth
jdalrymple,
thank you for the clarifications. I will certainly take a look at the trial version of Nagios XI. It seems a nice product, indeed. The price is well beyond our means, though. I know, there are many considerations which can be made about this, but I'm not the one who makes these kinds of choices here...
"It may be possible to achieve what you want just by modifying authorization parameters in the cgi.cfg file."
The real problem is that I don't want to let customers access the main Nagios server. I'd like to post some data on an external website, like a customized dashboard hosted outside the Nagios server.

Re: Nagios keeps checking services on offline hosts...

Posted: Thu Feb 26, 2015 11:28 am
by jdalrymple
I understand, and again, trying not to sound like a salesperson (I promise I'm not), you might take a look at Nagios Fusion. It offers the ability to see Nagios data without interacting directly with the Nagios server. Again, though, it's a licensed product, so there is some financial investment involved. You can certainly take on the project of scraping the data from Nagios yourself; it just becomes a question of how much your time is worth.
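
For the do-it-yourself route, a rough sketch of such a scraper might look like the following. It assumes status.dat is made of `type { key=value ... }` blocks with `host_name`, `service_description`, `current_state` and `plugin_output` fields; the function names, the `SAMPLE` text and the explicit host list are hypothetical, and the host list is passed in by hand on the assumption that hostgroup membership has to come from somewhere other than status.dat (for example the object configuration).

```python
import html

# Service state codes as used by Nagios plugins.
STATE_NAMES = {"0": "OK", "1": "WARNING", "2": "CRITICAL", "3": "UNKNOWN"}


def parse_status(text):
    """Parse status.dat-style text into a list of (block_type, fields) tuples."""
    blocks, current, fields = [], None, {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if current is None and line.endswith("{"):
            # Start of a block such as "servicestatus {".
            current, fields = line[:-1].strip(), {}
        elif current is not None and line == "}":
            blocks.append((current, fields))
            current = None
        elif current is not None and "=" in line:
            # Split on the first "=" only; plugin output may contain more.
            key, _, value = line.partition("=")
            fields[key] = value
    return blocks


def render_page(blocks, wanted_hosts):
    """Build a bare HTML table of service states for the given hosts."""
    rows = []
    for kind, data in blocks:
        if kind == "servicestatus" and data.get("host_name") in wanted_hosts:
            cells = (
                data["host_name"],
                data.get("service_description", ""),
                STATE_NAMES.get(data.get("current_state"), "?"),
                data.get("plugin_output", ""),
            )
            tds = "".join("<td>" + html.escape(c) + "</td>" for c in cells)
            rows.append("<tr>" + tds + "</tr>")
    return "<html><body><table>\n" + "\n".join(rows) + "\n</table></body></html>\n"


# Made-up sample in the assumed layout, for illustration only.
SAMPLE = """
hoststatus {
	host_name=RemoteLocationSrv1
	current_state=1
	}

servicestatus {
	host_name=RemoteLocationSrv1
	service_description=00 Info - Hostname
	current_state=3
	plugin_output=(Service check timed out)
	}
"""

if __name__ == "__main__":
    print(render_page(parse_status(SAMPLE), {"RemoteLocationSrv1"}))
```

In real use the text would be read from the status file (/usr/local/nagios/var/status.dat in the stats shown earlier), and a cron job would write the result to an .htm file and copy it to the external web host.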

If we've covered all of the topics concerned, can I lock the topic and consider your problems "resolved"?

Re: Nagios keeps checking services on offline hosts...

Posted: Sat Feb 28, 2015 4:49 am
by melmoth
jdalrymple,
I can't really say I have resolved my issue, but you can certainly close this thread, thanks ;)