Strange regular outages

Posted: Sat Oct 04, 2014 9:22 am
by dunderm
Hello,

We're using Nagios 4.0.7 with ~4,000 active checks and ~10,000 passive checks on a single machine (we plan to split them across two instances, but that is another story). What we observe is that there are regular, almost scheduled, periods when Nagios stops processing anything for about 9-10 minutes: both active and passive check throughput drops to 0%, and no notifications are sent either. After 9-10 minutes everything starts again.

We see this happen four times a day at exactly the same times, e.g. 01:27, 03:27, 13:27, and 22:27, day after day. From time to time the exact times shift, but the spacing between the outages stays the same: 2, 10, 9, and 3 hours. We could not find anything suspicious on the machine (no high CPU or the like), and there is no clue in the Nagios logs either. We saw the same behavior with Nagios 3.3.

Is such behavior known to anybody? Could you please advise what to check and how to continue our investigation?

Thanks!

Re: Strange regular outages

Posted: Mon Oct 06, 2014 10:25 am
by abrist
My first suggestion would be to upgrade to 4.0.8 as there were a number of improvements to the scheduler. From the release notes:
Re-implemented auto-rescheduling of checks (Eric Mislivec)
Avoid bunching of checks delayed due to timeperiod constraints (Eric Stanley)
Limit the number of autocalculated core workers to not spawn too many on large systems (Eric Mislivec, Janice Singh)
After upgrading, I would suggest decreasing the nagios.cfg directive "auto_rescheduling_window" from:

Code: Select all

auto_rescheduling_window=180
To:

Code: Select all

auto_rescheduling_window=45

Re: Strange regular outages

Posted: Mon Oct 06, 2014 11:47 am
by dunderm
Thanks, we'll try upgrading to the latest version.

Regarding auto-rescheduling - it is switched off:

Code: Select all

# AUTO-RESCHEDULING OPTION
# This option determines whether or not Nagios will attempt to
# automatically reschedule active host and service checks to
# "smooth" them out over time.  This can help balance the load on
# the monitoring server.  
# WARNING: THIS IS AN EXPERIMENTAL FEATURE - IT CAN DEGRADE
# PERFORMANCE, RATHER THAN INCREASE IT, IF USED IMPROPERLY

auto_reschedule_checks=0

Re: Strange regular outages

Posted: Mon Oct 06, 2014 11:59 am
by abrist
After the upgrade, turn it on, as it should help smooth the rescheduling of checks; just remember to also decrease the window to 45 seconds.
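
Putting both suggestions together, the relevant nagios.cfg directives would then look like this (45 is the suggested window; the shipped default is 180):

Code: Select all

auto_reschedule_checks=1
auto_rescheduling_window=45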

Re: Strange regular outages

Posted: Fri Oct 17, 2014 8:28 am
by dunderm
Applied the proposed changes yesterday; no visible improvement. There are still 10-15 minute periods when no metrics are processed and no notifications are sent. Then, all of a sudden, everything starts again.

Here is the state when everything is ok:

[image: check throughput in the normal state]

And when no metrics are being processed:

[image: check throughput during an outage]

I have a volatile service scheduled to check if everything is working. Here is the history of its notifications. You can spot the 15-min break:

[image: notification history showing the 15-minute break]

Note that the regular schedule is still there; the naps now take place at 00:10, 10:10, 19:10, and 22:10 each day.

Any suggestions on what to check next are welcome.

Thanks!

P.S. I hope you can see the images. They are just there to depict the situation in more detail; I think it is more or less clear even without them...

Re: Strange regular outages

Posted: Fri Oct 17, 2014 2:05 pm
by abrist
Are you running any backups, cron jobs, etc. at the times you mentioned? Are you collecting load metrics? If so, can you check them for those times?

Re: Strange regular outages

Posted: Sun Oct 19, 2014 2:40 pm
by dunderm
Could not find any. CPU stays low, so there is no indication of any parallel activity.
Of course we will keep searching for something scheduled, but we also wanted to check what Nagios itself is doing at that time, and why...

Re: Strange regular outages

Posted: Mon Oct 20, 2014 5:04 pm
by slansing
Is there any way someone could be on standby to watch the nagios log at that point? An alternative would be to pull an entire archive file and look at those time periods.
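
One way to scan an archive without staring at it live: look for silences between consecutive log entries. This is a sketch assuming the default log format, where every line starts with an epoch timestamp in square brackets, and the default install path for the log file (adjust both to your setup):

```shell
# Print every silence longer than 5 minutes in a Nagios log file.
# Assumes the default format: each line starts with "[<epoch>] ".
# /usr/local/nagios/var/nagios.log is the default source-install path.
awk -F'[][]' 'NR > 1 && $2 - prev > 300 {
    print "gap of", $2 - prev, "s between epochs", prev, "and", $2
} { prev = $2 }' /usr/local/nagios/var/nagios.log
```

On a setup this busy the log should normally never go quiet for long (passive check results alone would keep it chatty), so any reported gap should line up with one of the outage windows.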