I'm fighting with a configuration problem and I hope someone here can shed some light. My /var/log/messages is being flooded with check timeout messages (and the /usr/local/nagios/var/archives directory is huge, too):
Code: Select all
...
nagios: Warning: Check of service '00 System - Battery voltage' on host 'MyremoteHostSrv1' timed out after 60.011s!
nagios: wproc: Core Worker 28947: job 57811 (pid=12721): Dormant child reaped
nagios: wproc: CHECK job 57810 from worker Core Worker 28946 timed out after 60.01s
nagios: wproc: host=MyremoteHostSrv1; service=00 Info - Hostname;
nagios: wproc: early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
...
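To see which hosts account for most of the noise, something like the following could tally the timeout warnings per host. This is only a rough sketch: the regular expression assumes the exact warning format shown in the excerpt above, and the function name `count_timeouts_per_host` is made up for illustration.

```python
import re
from collections import Counter

# Assumption: warnings look exactly like the "Check of service ... timed out" line above.
TIMEOUT_RE = re.compile(
    r"Warning: Check of service '(?P<service>[^']+)' "
    r"on host '(?P<host>[^']+)' timed out after (?P<seconds>[\d.]+)s!"
)


def count_timeouts_per_host(lines):
    """Count service-check timeout warnings per host in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group("host")] += 1
    return counts


# Two lines taken from the excerpt above; only the first is a timeout warning.
sample_lines = [
    "nagios: Warning: Check of service '00 System - Battery voltage' "
    "on host 'MyremoteHostSrv1' timed out after 60.011s!",
    "nagios: wproc: Core Worker 28947: job 57811 (pid=12721): Dormant child reaped",
]

for host, count in count_timeouts_per_host(sample_lines).most_common():
    print(count, host)
```

On the real system the same function could be fed an open file handle for /var/log/messages instead of `sample_lines`.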
I'm monitoring 147 hosts, for a total of 2549 services:
Code: Select all
Nagios Stats 4.0.8
Copyright (c) 2003-2008 Ethan Galstad (www.nagios.org)
Last Modified: 08-12-2014
License: GPL
CURRENT STATUS DATA
------------------------------------------------------
Status File: /usr/local/nagios/var/status.dat
Status File Age: 0d 0h 0m 1s
Status File Version: 4.0.8
Program Running Time: 0d 5h 44m 50s
Nagios PID: 28943
Total Services: 2549
Services Checked: 2549
Services Scheduled: 2549
Services Actively Checked: 2549
Services Passively Checked: 0
Total Service State Change: 0.000 / 11.180 / 0.396 %
Active Service Latency: 0.000 / 0.570 / 0.001 sec
Active Service Execution Time: 0.011 / 60.029 / 26.672 sec
Active Service State Change: 0.000 / 11.180 / 0.396 %
Active Services Last 1/5/15/60 min: 299 / 2504 / 2549 / 2549
Passive Service Latency: 0.000 / 0.000 / 0.000 sec
Passive Service State Change: 0.000 / 0.000 / 0.000 %
Passive Services Last 1/5/15/60 min: 0 / 0 / 0 / 0
Services Ok/Warn/Unk/Crit: 405 / 3 / 1118 / 1023
Services Flapping: 0
Services In Downtime: 0
Total Hosts: 147
Hosts Checked: 147
Hosts Scheduled: 40
Hosts Actively Checked: 147
Host Passively Checked: 0
Total Host State Change: 0.000 / 8.680 / 0.094 %
Active Host Latency: 0.000 / 1.021 / 0.008 sec
Active Host Execution Time: 0.244 / 30.007 / 8.890 sec
Active Host State Change: 0.000 / 8.680 / 0.094 %
Active Hosts Last 1/5/15/60 min: 76 / 131 / 133 / 134
Passive Host Latency: 0.000 / 0.000 / 0.000 sec
Passive Host State Change: 0.000 / 0.000 / 0.000 %
Passive Hosts Last 1/5/15/60 min: 0 / 0 / 0 / 0
Hosts Up/Down/Unreach: 50 / 64 / 33
Hosts Flapping: 0
Hosts In Downtime: 0
Active Host Checks Last 1/5/15 min: 171 / 848 / 2493
Scheduled: 162 / 823 / 2420
On-demand: 9 / 25 / 73
Parallel: 162 / 823 / 2420
Serial: 0 / 0 / 0
Cached: 9 / 25 / 73
Passive Host Checks Last 1/5/15 min: 0 / 0 / 0
Active Service Checks Last 1/5/15 min: 376 / 2626 / 7736
Scheduled: 376 / 2626 / 7736
On-demand: 0 / 0 / 0
Cached: 0 / 0 / 0
Passive Service Checks Last 1/5/15 min: 0 / 0 / 0
External Commands Last 1/5/15 min: 0 / 0 / 0
The hosts are installed at various locations, and each location has a router which acts as a VPN client connected to the Nagios server. I have established a parent-child relationship between each router and the hosts behind it.
The routers should always be up, but the hosts are down quite frequently, and that's perfectly normal: they are only used a few hours a day. Unfortunately I have no way to predict when they will be used, so I can't use a time period definition.
My problem is that the service checks are performed regardless of the router state (router down -> host unreachable) and regardless of the host state (host offline). So the checks end in a timeout error, of course.
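To make the behaviour I'm after concrete, here is a minimal sketch in Python. It is not Nagios code, and the function and state names are hypothetical; it only illustrates that a service check should be dispatched when both the host and its parent router are up, and skipped otherwise.

```python
from typing import Optional

UP = "UP"
DOWN = "DOWN"
UNREACHABLE = "UNREACHABLE"


def should_run_service_check(host_state: str, parent_state: Optional[str] = None) -> bool:
    """Return True only if the host and, if it has one, its parent router are UP."""
    if parent_state is not None and parent_state != UP:
        # Router down -> host unreachable, so a service check can only time out.
        return False
    # Host offline -> skip the service check as well.
    return host_state == UP


print(should_run_service_check(UP, UP))
print(should_run_service_check(DOWN, UP))
print(should_run_service_check(UNREACHABLE, DOWN))
```

The first call is the only case in which I would want the service checks to run.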
My current config is as follows.
1-generic-host.cfg:
Code: Select all
define host {
    name generic-host
    active_checks_enabled 1
    check_command check-host-alive-ping
    contact_groups systems
    event_handler_enabled 1
    flap_detection_enabled 1
    max_check_attempts 3
    notification_interval 10
    notification_options d
    notification_period 24x7
    notifications_enabled 1
    obsess_over_host 0
    passive_checks_enabled 1
    process_perf_data 1
    register 0
    retain_nonstatus_information 1
    retain_status_information 1
}
generic-service.cfg:
Code: Select all
define service {
    name generic-service
    active_checks_enabled 1
    check_freshness 0
    check_interval 5
    check_period 24x7
    contact_groups systems
    event_handler_enabled 1
    flap_detection_enabled 1
    is_volatile 0
    max_check_attempts 3
    notification_interval 30
    notification_options w,c
    notification_period 24x7
    notifications_enabled 1
    obsess_over_service 0
    parallelize_check 1
    passive_checks_enabled 1
    process_perf_data 1
    register 0
    retain_nonstatus_information 1
    retain_status_information 1
    retry_interval 2
}
generic-router.cfg:
Code: Select all
define host {
    name generic-router
    check_command check_ping!300.0,1%!500.0,1%
    contact_groups admins
    event_handler_enabled 1
    flap_detection_enabled 1
    hostgroups generic-routers
    max_check_attempts 3
    notification_interval 10
    notification_options d,r
    notification_period 24x7
    notifications_enabled 1
    obsess_over_host 0
    process_perf_data 1
    register 0
    retain_nonstatus_information 1
    retain_status_information 1
}
define hostgroup {
    hostgroup_name generic-routers
    alias Router group
}
generic-special-server.cfg:
Code: Select all
define host {
    name special-srv
    use generic-host
    check_command check-host-alive
    check_interval 0
    check_period 24x7
    contact_groups systems
    hostgroups special-servers
    max_check_attempts 3
    notification_interval 30
    register 0
    retry_interval 1
}
define hostgroup {
    hostgroup_name special-servers
    alias My special servers
}
my-remotelocation-host.cfg:
Code: Select all
define host {
    host_name RemoteLocationRouter
    address 10.1.1.1
    use generic-router,pnp4nagios_host
}
define host {
    host_name RemoteLocationSrv1
    address 10.1.1.10
    parents RemoteLocationRouter
    use special-srv,pnp4nagios_host
}
These are the performance and scheduling figures, in case they are of interest:
Code: Select all
OBJECT CONFIG PROCESSING TIMES (* = Potential for precache savings with -u option)
----------------------------------
Read: 0.004944 sec
Resolve: 0.000281 sec *
Recomb Contactgroups: 0.000018 sec *
Recomb Hostgroups: 0.000297 sec *
Dup Services: 0.004701 sec *
Recomb Servicegroups: 0.000022 sec *
Duplicate: 0.000001 sec *
Inherit: 0.001039 sec *
Register: 0.005164 sec
Free: 0.000422 sec
============
TOTAL: 0.016889 sec * = 0.001590 sec (9.41%) estimated savings
Timing information on configuration verification is listed below.
CONFIG VERIFICATION TIMES
----------------------------------
Object Relationships: 0.002919 sec
Circular Paths: 0.000257 sec
Misc: 0.000169 sec
============
TOTAL: 0.003345 sec
RETENTION DATA TIMES
----------------------------------
Read and Process: 0.186711 sec
============
TOTAL: 0.186711 sec
EVENT SCHEDULING TIMES
-------------------------------------
Get service info: 0.006883 sec
Get host info info: 0.000349 sec
Get service params: 0.000018 sec
Schedule service times: 0.016055 sec
Schedule service events: 0.003675 sec
Get host params: 0.000001 sec
Schedule host times: 0.000216 sec
Schedule host events: 0.000070 sec
============
TOTAL: 0.027267 sec
Projected scheduling information for host and service checks
is listed below. This information assumes that you are going
to start running Nagios with your current config files.
HOST SCHEDULING INFORMATION
---------------------------
Total hosts: 147
Total scheduled hosts: 40
Host inter-check delay method: SMART
Average host check interval: 300.00 sec
Host inter-check delay: 7.50 sec
Max host check spread: 30 min
First scheduled check: Thu Feb 19 18:17:27 2015
Last scheduled check: Thu Feb 19 18:22:19 2015
SERVICE SCHEDULING INFORMATION
-------------------------------
Total services: 2549
Total scheduled services: 2549
Service inter-check delay method: SMART
Average service check interval: 300.00 sec
Inter-check delay: 0.12 sec
Interleave factor method: SMART
Average services per host: 17.34
Service interleave factor: 18
Max service check spread: 30 min
First scheduled check: Thu Feb 19 18:17:28 2015
Last scheduled check: Thu Feb 19 18:22:27 2015
CHECK PROCESSING INFORMATION
----------------------------
Average check execution time: 26.10s
Estimated concurrent checks: 316 (158.00 per cpu core)
Max concurrent service checks: Unlimited
PERFORMANCE SUGGESTIONS
-----------------------
* Aim for a max of 50 concurrent checks / cpu core (current: 158.00)
NOTE: These are just guidelines and *not* hard numbers.
Ultimately, only testing will tell if your settings and hardware are
suitable for the types and number of checks you're planning to run.
Any help is greatly appreciated.