Strange regular outages

Posted: Sat Oct 04, 2014 9:22 am
by dunderm
Hello,

We're using Nagios 4.0.7 with ~4,000 active checks and ~10,000 passive checks on a single machine (we plan to split them across two instances, but that is another story). What we observe is that there are regular, almost scheduled, periods when Nagios stops processing anything for about 9-10 minutes: both active and passive check throughput drops to 0%, and no notifications are sent either. After 9-10 minutes everything starts again.

We see this happen four times a day at exactly the same times, e.g. 01:27, 03:27, 13:27, and 22:27, day after day. From time to time the exact times shift, but the spacing between the outages stays the same: 2, 10, 9, and 3 hours. We could not find anything suspicious on the machine (no high CPU or the like), and there is no clue in the Nagios logs either. We saw the same behavior with Nagios 3.3.

Is such behavior known to anybody? Could you please advise what to check and how to continue our investigation?

Thanks!

Re: Strange regular outages

Posted: Mon Oct 06, 2014 10:25 am
by abrist
My first suggestion would be to upgrade to 4.0.8 as there were a number of improvements to the scheduler. From the release notes:
Re-implemented auto-rescheduling of checks (Eric Mislivec)
Avoid bunching of checks delayed due to timeperiod constraints (Eric Stanley)
Limit the number of autocalculated core workers to not spawn too many on large systems (Eric Mislivec, Janice Singh)
After upgrading, I would suggest decreasing the nagios.cfg directive "auto_rescheduling_window" from:

Code: Select all

auto_rescheduling_window=180
To:

Code: Select all

auto_rescheduling_window=45

Re: Strange regular outages

Posted: Mon Oct 06, 2014 11:47 am
by dunderm
Thanks, we'll try upgrading to the latest version.

Regarding auto-rescheduling - it is switched off:

Code: Select all

# AUTO-RESCHEDULING OPTION
# This option determines whether or not Nagios will attempt to
# automatically reschedule active host and service checks to
# "smooth" them out over time.  This can help balance the load on
# the monitoring server.  
# WARNING: THIS IS AN EXPERIMENTAL FEATURE - IT CAN DEGRADE
# PERFORMANCE, RATHER THAN INCREASE IT, IF USED IMPROPERLY

auto_reschedule_checks=0

Re: Strange regular outages

Posted: Mon Oct 06, 2014 11:59 am
by abrist
After the upgrade, turn it on, as it should help smooth the rescheduling of checks; just remember to also decrease the window to 45 seconds.
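
Putting both suggestions together, the relevant nagios.cfg directives would then look like this (45 is the suggested window; the shipped default is 180):

Code: Select all

auto_reschedule_checks=1
auto_rescheduling_window=45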

Re: Strange regular outages

Posted: Fri Oct 17, 2014 8:28 am
by dunderm
Applied the proposed changes yesterday; no visible improvement. There are still 10-15 minute periods when no metrics are processed and no notifications are sent. Then, all of a sudden, everything starts again.

Here is the state when everything is ok:

[image: check throughput in the normal state]

And when no metrics are being processed:

[image: check throughput during an outage]

I have a volatile service scheduled to check if everything is working. Here is the history of its notifications. You can spot the 15-min break:

[image: notification history showing the 15-minute break]

Note that the regular schedule is still there; the naps now take place at 00:10, 10:10, 19:10, and 22:10 each day.

Any suggestions on what to check next are welcome.

Thanks!

P.S. I hope you can see the images. They are just there to depict the situation in more detail; I think it is more or less clear even without them...

Re: Strange regular outages

Posted: Fri Oct 17, 2014 2:05 pm
by abrist
Are you running any backups, cron jobs, etc. at the times you mentioned? Are you collecting load metrics? If so, can you check them for those times?

Re: Strange regular outages

Posted: Sun Oct 19, 2014 2:40 pm
by dunderm
Could not find any. CPU stays low, so there is no indication of any parallel activity.
Of course we will keep searching for something scheduled, but we also wanted to check what Nagios itself is doing at that time, and why...

Re: Strange regular outages

Posted: Mon Oct 20, 2014 5:04 pm
by slansing
Is there any way someone could be on standby to watch the nagios log at that point? An alternative would be to pull an entire archive file and look at those time periods.
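
One way to scan an archive without staring at it live: look for silences between consecutive log entries. This is a sketch assuming the default log format, where every line starts with an epoch timestamp in square brackets, and the default install path for the log file (adjust both to your setup):

```shell
# Print every silence longer than 5 minutes in a Nagios log file.
# Assumes the default format: each line starts with "[<epoch>] ".
# /usr/local/nagios/var/nagios.log is the default source-install path.
awk -F'[][]' 'NR > 1 && $2 - prev > 300 {
    print "gap of", $2 - prev, "s between epochs", prev, "and", $2
} { prev = $2 }' /usr/local/nagios/var/nagios.log
```

On a setup this busy the log should normally never go quiet for long (passive check results alone would keep it chatty), so any reported gap should line up with one of the outage windows.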