Strange regular outages

dunderm
Posts: 4
Joined: Thu Oct 02, 2014 1:36 pm

Strange regular outages

Post by dunderm »

Hello,

We're using Nagios 4.0.7 with ~4,000 active checks and ~10,000 passive checks on a single machine (we plan to split them across two instances, but that is another story). We observe regular, seemingly scheduled periods when Nagios stops processing anything for about 9-10 minutes. Both active and passive checks drop to 0%, and no notifications are sent. After 9-10 minutes it starts again. This happens four times a day at exactly the same times, e.g. 01:27, 03:27, 13:27, 22:27, day after day. From time to time the exact times shift, but the intervals between outages stay at 2-10-9-3 hours. We could not see anything suspicious on the machine (such as high CPU), and there is no clue in the Nagios logs either. We had the same behavior with Nagios 3.3.

Is such behavior known to anybody? Could you please advise what to check and how to continue our investigation?

Thanks!
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: Strange regular outages

Post by abrist »

My first suggestion would be to upgrade to 4.0.8 as there were a number of improvements to the scheduler. From the release notes:
Re-implemented auto-rescheduling of checks (Eric Mislivec)
Avoid bunching of checks delayed due to timeperiod constraints (Eric Stanley)
Limit the number of autocalculated core workers to not spawn too many on large systems (Eric Mislivec, Janice Singh)
After upgrading, I would suggest decreasing the nagios.cfg directive "auto_rescheduling_window" from:

auto_rescheduling_window=180
To:

auto_rescheduling_window=45
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
dunderm
Posts: 4
Joined: Thu Oct 02, 2014 1:36 pm

Re: Strange regular outages

Post by dunderm »

Thanks, we'll try upgrading to the latest version.

Regarding auto-rescheduling - it is switched off:

# AUTO-RESCHEDULING OPTION
# This option determines whether or not Nagios will attempt to
# automatically reschedule active host and service checks to
# "smooth" them out over time.  This can help balance the load on
# the monitoring server.  
# WARNING: THIS IS AN EXPERIMENTAL FEATURE - IT CAN DEGRADE
# PERFORMANCE, RATHER THAN INCREASE IT, IF USED IMPROPERLY

auto_reschedule_checks=0
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: Strange regular outages

Post by abrist »

After the upgrade, turn it on (auto_reschedule_checks=1), as it should help smooth the rescheduling of checks, but remember to change auto_rescheduling_window to "45".
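To confirm which auto-rescheduling values are actually in effect after editing, a small script can pull them out of nagios.cfg. This is only a rough sketch: the function name `read_directives` is made up for illustration, it assumes plain `key=value` lines with `#` comments as in the snippet quoted above, and it keeps the last value if a directive appears twice (how Nagios itself treats duplicates is not checked here).

```python
# Directive names as they appear in nagios.cfg.
WANTED = (
    "auto_reschedule_checks",
    "auto_rescheduling_interval",
    "auto_rescheduling_window",
)


def read_directives(cfg_text, names=WANTED):
    """Return {directive: value} for the requested names found in cfg_text."""
    found = {}
    for raw in cfg_text.splitlines():
        line = raw.strip()
        # Skip blank lines, '#' comments, and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        if key in names:
            # Assumption of this sketch: a later duplicate overwrites an earlier one.
            found[key] = value.strip()
    return found


# Sample text standing in for the real file contents.
sample = """
# AUTO-RESCHEDULING OPTION
auto_reschedule_checks=1
auto_rescheduling_window=45
"""
print(read_directives(sample))
```

For the real file, pass the contents of your own nagios.cfg (its location depends on how Nagios was installed) instead of `sample`.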
dunderm
Posts: 4
Joined: Thu Oct 02, 2014 1:36 pm

Re: Strange regular outages

Post by dunderm »

Applied the proposed changes yesterday. No visible improvement. There are still 10-15 minute periods when no metrics are processed and no notifications are sent. Then, all of a sudden, everything starts again.

Here is the state when everything is ok:

Image

And when no metrics are being processed:

Image

I have a volatile service scheduled to check if everything is working. Here is the history of its notifications. You can spot the 15-min break:

Image

Note that the schedule is still there: now those naps take place at 00:10, 10:10, 19:10, 22:10 each day.
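The claim that the schedule is unchanged can be checked with a little arithmetic on the two sets of times reported in this thread. The sketch below (helper names `cyclic_gaps` and `same_cycle` are just illustrative) computes the hours between consecutive outages, wrapping past midnight, and tests whether one gap sequence is a rotation of the other.

```python
def cyclic_gaps(times):
    """Hours between consecutive HH:MM times in a day, wrapping past midnight."""
    minutes = sorted(int(t[:2]) * 60 + int(t[3:]) for t in times)
    gaps = []
    for i, current in enumerate(minutes):
        following = minutes[(i + 1) % len(minutes)]
        # Modulo one day handles the wrap from the last time to the first.
        gaps.append(((following - current) % (24 * 60)) / 60)
    return gaps


def same_cycle(a, b):
    """True if list b is a rotation of list a."""
    return len(a) == len(b) and any(a[i:] + a[:i] == b for i in range(len(a)))


before = cyclic_gaps(["01:27", "03:27", "13:27", "22:27"])
after = cyclic_gaps(["00:10", "10:10", "19:10", "22:10"])
print(before)                     # [2.0, 10.0, 9.0, 3.0]
print(after)                      # [10.0, 9.0, 3.0, 2.0]
print(same_cycle(before, after))  # True
```

Both sets give the same 2-10-9-3 hour cycle, just starting at a different point, so whatever triggers the pauses appears to follow a fixed 24-hour pattern whose offset moved.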

Any suggestions what to check next are welcome.

Thanks!

P.S. Hope you can see the images. They are just to depict the situation in more detail; I think more or less it is clear even without them...
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: Strange regular outages

Post by abrist »

Are you running any backup/cronjobs/etc that run at those mentioned times? Are you collecting load metrics? If so, can you check those times?
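One way to look for scheduled jobs at those times is to match crontab entries against a given hour and minute. The following is a simplified sketch, not a full cron parser: `field_matches` and `jobs_at` are hypothetical helpers, only the minute and hour fields are compared (day and month fields are ignored), and `@daily`-style shortcuts are skipped. The crontab lines in the sample are invented for illustration.

```python
def field_matches(field, value, lo, hi):
    """True if a numeric cron field such as '*/15', '0,30' or '1-5' matches value."""
    for part in field.split(","):
        rng, _, step_text = part.partition("/")
        step = int(step_text) if step_text else 1
        if rng == "*":
            start, end = lo, hi
        elif "-" in rng:
            first, last = rng.split("-", 1)
            start, end = int(first), int(last)
        else:
            start = end = int(rng)
        if start <= value <= end and (value - start) % step == 0:
            return True
    return False


def jobs_at(crontab_text, hour, minute):
    """Return crontab lines whose minute and hour fields match the given time."""
    hits = []
    for raw in crontab_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # environment settings such as MAILTO=root, or malformed lines
        try:
            if field_matches(fields[0], minute, 0, 59) and field_matches(fields[1], hour, 0, 23):
                hits.append(line)
        except ValueError:
            continue  # non-numeric fields (e.g. '@reboot') are not handled here
    return hits


# Invented sample entries; feed in the output of `crontab -l` or /etc/crontab instead.
sample = """MAILTO=root
10 0,10,19,22 * * * /usr/local/bin/example-backup.sh
*/5 * * * * /usr/local/bin/example-poll.sh
30 2 * * * /usr/local/bin/example-rotate.sh
"""
for job in jobs_at(sample, hour=10, minute=10):
    print(job)
```

Run it once per outage time and per user crontab; anything that shows up for all four times is worth a closer look.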
dunderm
Posts: 4
Joined: Thu Oct 02, 2014 1:36 pm

Re: Strange regular outages

Post by dunderm »

Could not find any. CPU usage is low, so there is no indication of any parallel activity.
Of course we continue to search for something scheduled, but we also wanted to check what Nagios is doing at that time, and why...
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: Strange regular outages

Post by slansing »

Is there any way someone could be on standby to trap the nagios log at that point? An alternative would be pulling an entire archive file and looking at those time periods.
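If nobody can watch the log live, an archived log can be scanned for silent periods instead. Nagios Core log lines start with a Unix epoch timestamp in square brackets, so a gap finder is short. This is a sketch under assumptions: `find_gaps` is a made-up helper, the 300-second threshold is arbitrary, and it only helps if the log really goes quiet during the outages.

```python
import re

# Matches the leading "[<epoch>]" of a Nagios log line.
_STAMP = re.compile(r"^\[(\d+)\]")


def find_gaps(lines, min_gap_seconds=300):
    """Yield (last_seen_epoch, next_seen_epoch) for pauses of at least min_gap_seconds."""
    previous = None
    for line in lines:
        match = _STAMP.match(line)
        if not match:
            continue  # ignore lines without a leading timestamp
        stamp = int(match.group(1))
        if previous is not None and stamp - previous >= min_gap_seconds:
            yield previous, stamp
        previous = stamp


# Invented sample lines in the "[epoch] message" layout.
sample = [
    "[1412251200] SERVICE ALERT: example line",
    "[1412251260] SERVICE ALERT: example line",
    "[1412251900] SERVICE ALERT: example line",
]
for start, end in find_gaps(sample):
    print(start, end, end - start)
```

Passing an open file object works as well, since it iterates line by line; the log entries just before and after each reported gap are the ones to inspect.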