Nagios Restart and Service/Host check latency

wvl
Posts: 3
Joined: Fri Mar 24, 2017 3:46 am

Post by wvl »

Hi, we have a large setup with roughly 125,000 service checks, running Nagios 3 with mod_gearman.

Something I find strange is that after a restart, the average service check latency goes through the roof (15 minutes). It makes sense that Nagios does its best to schedule things in a way that is friendly to both the local and remote hosts. However, why is it 'rescheduling' checks as if they had never run before the restart, when we try to retain as much state as possible:

Code:

# Checks age management 
check_result_path=/srv/nagios/checkresults
max_check_result_file_age=3600
cached_host_check_horizon=15
cached_service_check_horizon=15
enable_predictive_host_dependency_checks=1
enable_predictive_service_dependency_checks=1

# AUTO RESCHEDULING OPTIONS (EXPERIMENTAL, use with CAUTION)
auto_reschedule_checks=0
auto_rescheduling_interval=30
auto_rescheduling_window=180

# SOME TIMINGS
sleep_time=0.25
service_check_timeout=60
host_check_timeout=30
event_handler_timeout=30
notification_timeout=30
ocsp_timeout=5
perfdata_timeout=5

# Performance tuning
service_inter_check_delay_method=s
max_service_check_spread=30
service_interleave_factor=s
host_inter_check_delay_method=s
max_host_check_spread=30
max_concurrent_checks=0
check_result_reaper_frequency=10
max_check_result_reaper_time=30

retain_state_information=1
state_retention_file=/var/log/nagios/retention.dat
retention_update_interval=1
use_retained_program_state=1
use_retained_scheduling_info=1
retained_host_attribute_mask=0
retained_service_attribute_mask=0
retained_process_host_attribute_mask=0
retained_process_service_attribute_mask=0
retained_contact_host_attribute_mask=0
retained_contact_service_attribute_mask=0

# Status file
status_file=/var/log/nagios/status.dat
status_update_interval=120
Check latency graph: not averaged, just the check latency of a few individual checks that should be running with check_interval = 1m.

[Image: check latency graph]

Am I missing anything here, or can Nagios 3 simply not avoid this behavior?
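
Before digging into the scheduler, it can be worth confirming mechanically that the retention options are actually enabled in the file Nagios reads. The following is a minimal sketch, not part of Nagios: the helper names (parse_main_config, disabled_retention_flags) are made up for illustration, and it assumes the main config is plain key=value lines with '#' comments, as in the excerpt above. Options that legitimately repeat (for example cfg_file) would be collapsed to their last value by this simple parser.

```python
# Hypothetical helper: parse nagios.cfg-style "key=value" text and report
# which retention-related flags are not set to 1.

SAMPLE_CONFIG = """
retain_state_information=1
state_retention_file=/var/log/nagios/retention.dat
retention_update_interval=1
use_retained_program_state=1
use_retained_scheduling_info=1
"""

# The three on/off retention switches that appear in the posted config.
RETENTION_FLAGS = (
    "retain_state_information",
    "use_retained_program_state",
    "use_retained_scheduling_info",
)


def parse_main_config(text):
    """Return a dict of option name -> value, skipping blanks and comments."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        options[key.strip()] = value.strip()
    return options


def disabled_retention_flags(options):
    """List the retention flags that are missing or not set to '1'."""
    return [name for name in RETENTION_FLAGS if options.get(name) != "1"]


options = parse_main_config(SAMPLE_CONFIG)
print("disabled retention flags:", disabled_retention_flags(options))
```

With the values posted above, all three flags are on, so the configuration itself does not explain the latency spike.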

Here's some more information from a nagios -s run:

Code:

Timing information on object configuration processing is listed
below.  You can use this information to see if precaching your
object configuration would be useful.

Object Config Source: Config files (uncached)

OBJECT CONFIG PROCESSING TIMES      (* = Potential for precache savings with -u option)
----------------------------------
Read:                 1.100689 sec
Resolve:              0.196292 sec  *
Recomb Contactgroups: 0.031873 sec  *
Recomb Hostgroups:    0.172127 sec  *
Dup Services:         0.163431 sec  *
Recomb Servicegroups: 0.009459 sec  *
Duplicate:            0.210780 sec  *
Inherit:              0.036253 sec  *
Recomb Contacts:      0.000000 sec  *
Sort:                 0.000000 sec  *
Register:             0.362647 sec
Free:                 0.063051 sec
                      ============
TOTAL:                2.346605 sec  * = 0.820218 sec (34.95%) estimated savings


RETENTION DATA TIMES
----------------------------------
Read and Process:     4.833108 sec
                      ============
TOTAL:                4.833108 sec


Timing information on configuration verification is listed below.

CONFIG VERIFICATION TIMES          (* = Potential for speedup with -x option)
----------------------------------
Object Relationships: 0.356153 sec
Circular Paths:       0.000000 sec  *
Misc:                 0.045195 sec
                      ============
TOTAL:                0.401348 sec  * = 0.000000 sec (0.0%) estimated savings


EVENT SCHEDULING TIMES
-------------------------------------
Get service info:        0.278038 sec
Get host info info:      0.000542 sec
Get service params:      0.000002 sec
Schedule service times:  0.261014 sec
Schedule service events: 46.465771 sec
Get host params:         0.000000 sec
Schedule host times:     0.008024 sec
Schedule host events:    9.417014 sec
                         ============
TOTAL:                   56.430405 sec


Projected scheduling information for host and service checks
is listed below.  This information assumes that you are going
to start running Nagios with your current config files.

HOST SCHEDULING INFORMATION
---------------------------
Total hosts:                     5368
Total scheduled hosts:           5367
Host inter-check delay method:   SMART
Average host check interval:     300.00 sec
Host inter-check delay:          0.06 sec
Max host check spread:           30 min
First scheduled check:           Fri Mar 24 11:08:52 2017
Last scheduled check:            Fri Mar 24 11:08:52 2017


SERVICE SCHEDULING INFORMATION
-------------------------------
Total services:                     126026
Total scheduled services:           87914
Service inter-check delay method:   SMART
Average service check interval:     1102.51 sec
Inter-check delay:                  0.01 sec
Interleave factor method:           SMART
Average services per host:          23.48
Service interleave factor:          17
Max service check spread:           30 min
First scheduled check:              Fri Mar 24 11:09:56 2017
Last scheduled check:               Fri Mar 24 11:27:26 2017


CHECK PROCESSING INFORMATION
----------------------------
Check result reaper interval:       20 sec
Max concurrent service checks:      Unlimited
(Note that I changed the check result reaper interval before I generated the output above. It is now back to the value posted in the config.)
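
The scheduling figures in the nagios -s output above can be cross-checked with a little arithmetic. This sketch assumes the SMART method divides the average check interval by the number of scheduled checks and caps the delay so that the initial spread fits within max_service_check_spread (or max_host_check_spread). That is my reading of how Nagios 3 derives the value, not something the output itself states, and the function name is made up for illustration.

```python
from datetime import datetime


def smart_inter_check_delay(avg_check_interval_s, scheduled_checks, max_spread_min):
    """Approximate the SMART inter-check delay in seconds (assumed formula)."""
    if scheduled_checks <= 0:
        return 0.0
    # Delay that spaces all checks evenly over one average check interval.
    even_delay = avg_check_interval_s / scheduled_checks
    # Largest delay that still fits every check into the max spread window.
    capped_delay = (max_spread_min * 60.0) / scheduled_checks
    return min(even_delay, capped_delay)


# Values copied from the "nagios -s" output in this post.
service_delay = smart_inter_check_delay(1102.51, 87914, 30)
host_delay = smart_inter_check_delay(300.00, 5367, 30)

# Time needed to start every scheduled service check once at that delay.
projected_spread_s = service_delay * 87914

# Window reported between the first and last scheduled service check.
first = datetime(2017, 3, 24, 11, 9, 56)
last = datetime(2017, 3, 24, 11, 27, 26)
reported_window_s = (last - first).total_seconds()

print(f"service inter-check delay: {service_delay:.4f} s")
print(f"host inter-check delay:    {host_delay:.4f} s")
print(f"projected service spread:  {projected_spread_s / 60:.1f} min")
print(f"reported service window:   {reported_window_s / 60:.1f} min")
```

Under that assumption the delays round to the 0.01 sec and 0.06 sec shown in the output, and the projected spread is in the same range as the reported 17.5 minute window between the first and last scheduled service check. That window is also close to the latency seen after a restart.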
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: Nagios Restart and Service/Host check latency

Post by tmcdonald »

Is this something that got worse as you added more and more checks, or did it start suddenly?

Without knowing more about your environment, the first thing I would have to suggest would be to upgrade to at least something in the 4.x branch. We completely changed how Core handles checks, and there is a lot less overhead involved (threads vs forking). If this issue got worse over time, this is the most likely fix, I would think.
Former Nagios employee
wvl
Posts: 3
Joined: Fri Mar 24, 2017 3:46 am

Re: Nagios Restart and Service/Host check latency

Post by wvl »

> did it start suddenly?

No, just a problem that was ignored for too long.

> I would have to suggest would be to upgrade to at least something in the 4.x branch

Working on it. Any tips? We use the livestatus and mod_gearman brokers, which, as far as my local testing goes, seem compatible when using the right versions.

> We completely changed how Core handles checks, and there is a lot less overhead involved (threads vs forking)

This would make me believe that active service/host checks would have less latency in general, but unless something has also changed with regard to keeping state and startup scheduling...

> If this issue got worse over time, this is the most likely fix I would think

I'm sure Nagios 4 is going to be better. But is there a logical explanation for this happening in Nagios 3? I'd think that with all the 'retain state' options we've set, we shouldn't have these extreme latency spikes right after restarts.
avandemore
Posts: 1597
Joined: Tue Sep 27, 2016 4:57 pm

Re: Nagios Restart and Service/Host check latency

Post by avandemore »

As @tmcdonald said, it's best to upgrade. There have been a lot of changes, and speculating on your setup without your entire config isn't going to help much. Seeing your whole config isn't that enticing either, since your version is so out of date. If you want to pursue that version's characteristics, you can profile it as you would any system application.

You've got a lot of moving parts, so I would upgrade one major component at a time. Then if there is an issue, it's easier to isolate and remedy it.
Previous Nagios employee
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises
Contact:

Re: Nagios Restart and Service/Host check latency

Post by scottwilkerson »

The check scheduler was completely rewritten in 4 to fix some of these issues. I cannot speak to the specifics, but once people reached environments of your size, it would often become a significant problem. It used to help somewhat to enable the tweaks described here:

https://assets.nagios.com/downloads/nag ... ion_tweaks

But still, the scheduling engine wasn't nearly as efficient as the rewritten one in Nagios Core 4, which came out 3.5 years ago.
Former Nagios employee
wvl
Posts: 3
Joined: Fri Mar 24, 2017 3:46 am

Re: Nagios Restart and Service/Host check latency

Post by wvl »

A few months and an upgrade to Nagios 4.2.4 later, this problem has completely gone away. Latency for some checks went from around 20 minutes after a restart/reload to around 2 minutes.
Case closed!
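
For anyone who wants to compare latency before and after such an upgrade, one option is to read the check_latency values that Nagios writes to its status file. This is a minimal sketch that assumes the usual status.dat layout of "servicestatus { ... }" blocks with one key=value per line. The sample data and the function name are made up for illustration.

```python
from statistics import mean

# Invented sample in the status.dat block layout; real files carry many more keys.
SAMPLE_STATUS = """
hoststatus {
    host_name=web01
    check_latency=0.250
    }

servicestatus {
    host_name=web01
    service_description=PING
    check_latency=1.500
    }

servicestatus {
    host_name=web01
    service_description=HTTP
    check_latency=120.000
    }
"""


def service_latencies(status_text):
    """Collect check_latency (seconds) from every servicestatus block."""
    latencies = []
    in_service = False
    for raw in status_text.splitlines():
        line = raw.strip()
        if line.endswith("{"):
            # A new block starts; remember whether it is a service block.
            in_service = line.split()[0] == "servicestatus"
        elif line == "}":
            in_service = False
        elif in_service and line.startswith("check_latency="):
            latencies.append(float(line.split("=", 1)[1]))
    return latencies


latencies = service_latencies(SAMPLE_STATUS)
print(f"services: {len(latencies)}")
print(f"average latency: {mean(latencies):.2f} s")
print(f"max latency:     {max(latencies):.2f} s")
```

To run it against a live system, read the file named by status_file (/var/log/nagios/status.dat in the config posted earlier) instead of the sample string. With status_update_interval=120, the figures can lag by up to two minutes.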
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: Nagios Restart and Service/Host check latency

Post by tmcdonald »

I'll be closing this thread now, but feel free to open another if you need anything in the future!
Former Nagios employee