Strange regular outages
Hello,
We're using Nagios 4.0.7 with ~4,000 active checks and ~10,000 passive checks on a single machine (we plan to split them across two instances, but that is another story). We observe regular, seemingly scheduled periods when Nagios stops processing anything for about 9-10 minutes. Both active and passive checks go to 0%, and no notifications are sent either. Then, after 9-10 minutes, it starts again. This happens four times a day at exactly the same times, e.g. 01:27, 03:27, 13:27, 22:27, and then again the next day. From time to time the exact times shift, but the gaps between outages stay at 2-10-9-3 hours... We could not see anything suspicious on the machine (such as high CPU), and there is no clue in the Nagios logs either. We had the same behavior with Nagios 3.3.
Is such behavior known to anybody? Could you please advise what to check and how to continue our investigation?
Thanks!
Re: Strange regular outages
My first suggestion would be to upgrade to 4.0.8 as there were a number of improvements to the scheduler. From the release notes:
Re-implemented auto-rescheduling of checks (Eric Mislivec)
Avoid bunching of checks delayed due to timeperiod constraints (Eric Stanley)
Limit the number of autocalculated core workers to not spawn too many on large systems (Eric Mislivec, Janice Singh)
After upgrading, I would suggest decreasing the nagios.cfg directive "auto_rescheduling_window" from:
auto_rescheduling_window=180
To:
auto_rescheduling_window=45
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
Re: Strange regular outages
Thanks, we'll try upgrading to the latest version.
Regarding auto-rescheduling - it is switched off:
# AUTO-RESCHEDULING OPTION
# This option determines whether or not Nagios will attempt to
# automatically reschedule active host and service checks to
# "smooth" them out over time. This can help balance the load on
# the monitoring server.
# WARNING: THIS IS AN EXPERIMENTAL FEATURE - IT CAN DEGRADE
# PERFORMANCE, RATHER THAN INCREASE IT, IF USED IMPROPERLY
auto_reschedule_checks=0
Re: Strange regular outages
After the upgrade, turn it on (auto_reschedule_checks=1), as it should help smooth out the rescheduling of checks, but remember to change the window (auto_rescheduling_window) to 45.
Former Nagios employee
Re: Strange regular outages
Applied the proposed changes yesterday. No visible improvement. There are still 10-15 min periods when no metrics are processed and no notifications are sent. Then, all of a sudden, everything starts again.
Here is the state when everything is ok:
And when no metrics are being processed:
I have a volatile service scheduled to check if everything is working. Here is the history of its notifications. You can spot the 15-min break:
Note that the schedule is still there: those naps now take place at 00:10, 10:10, 19:10, 22:10 each day.
Any suggestions on what to check next are welcome.
Thanks!
P.S. Hope you can see the images. They are just to depict the situation in more detail; I think more or less it is clear even without them...
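To double-check that the spacing really is unchanged, here is a minimal Python sketch that computes the gaps between the outage times from both reports and tests whether one sequence is a rotation of the other. The function names are made up for illustration; only the times come from this thread.

```python
def cyclic_gaps_hours(times):
    """Gaps in hours between consecutive HH:MM times, wrapping past midnight."""
    minutes = sorted(int(t[:2]) * 60 + int(t[3:]) for t in times)
    gaps = []
    for i, start in enumerate(minutes):
        nxt = minutes[(i + 1) % len(minutes)]
        gaps.append(((nxt - start) % (24 * 60)) / 60)
    return gaps


def same_cycle(a, b):
    """True if list b is a rotation of list a."""
    return len(a) == len(b) and any(a[i:] + a[:i] == b for i in range(len(a)))


# Outage times reported before and after the upgrade.
before = cyclic_gaps_hours(["01:27", "03:27", "13:27", "22:27"])
after = cyclic_gaps_hours(["00:10", "10:10", "19:10", "22:10"])

print(before)                     # [2.0, 10.0, 9.0, 3.0]
print(after)                      # [10.0, 9.0, 3.0, 2.0]
print(same_cycle(before, after))  # True
```

Both sets give the same 2-10-9-3 hour cycle, only shifted, which suggests a single recurring trigger rather than four unrelated ones.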
Re: Strange regular outages
Are you running any backups, cron jobs, etc. at the times mentioned? Are you collecting load metrics? If so, can you check those times?
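One way to search the crontabs systematically is sketched below in Python: it lists the entries whose minute and hour fields match a given time. This is a rough sketch under stated assumptions, not a full cron parser. It ignores the day, month, and weekday fields, skips "@daily"-style shortcuts, and for system crontabs with a user column the returned string includes the user name. The sample entries and script paths are hypothetical.

```python
def field_matches(field, value, low, high):
    """Check one crontab field (*, lists, ranges, /step) against value."""
    for part in field.split(","):
        rng, _, step = part.partition("/")
        step = int(step) if step else 1
        if rng == "*":
            start, end = low, high
        elif "-" in rng:
            a, b = rng.split("-")
            start, end = int(a), int(b)
        else:
            start = int(rng)
            # "n/step" is treated as "n-high/step"; plain "n" matches only n.
            end = high if "/" in part else start
        if start <= value <= end and (value - start) % step == 0:
            return True
    return False


def jobs_at(crontab_text, hour, minute):
    """Return commands whose minute and hour fields match hour:minute."""
    hits = []
    for line in crontab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(None, 5)
        # Skip environment assignments and @reboot-style shortcuts.
        if len(fields) < 6 or not (fields[0][0].isdigit() or fields[0][0] == "*"):
            continue
        if field_matches(fields[0], minute, 0, 59) and field_matches(fields[1], hour, 0, 23):
            hits.append(fields[5])
    return hits


# Hypothetical crontab content, for illustration only.
sample = """\
# made-up entries
MAILTO=root
27 1,3,13,22 * * * /usr/local/bin/hypothetical-backup.sh
*/5 * * * * /usr/local/bin/hypothetical-poller.sh
10 0 * * * /usr/local/bin/hypothetical-rotate.sh
"""

print(jobs_at(sample, 13, 27))
```

In practice the text would come from "crontab -l" for each user and from the system cron directories, checked against each of the outage times.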
Former Nagios employee
Re: Strange regular outages
Could not find any. CPU load is low, so there is no indication of any parallel activity.
Of course we will continue to look for something scheduled, but we also want to find out what Nagios is doing at that time, and why...
Re: Strange regular outages
Is there any way someone could be on standby to trap the nagios log at that point? An alternative would be pulling an entire archive file and looking at those time periods.
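If nobody can watch the log live, the archive can be scanned afterwards. The Python sketch below relies on Nagios log lines starting with a Unix timestamp in square brackets and reports every stretch where consecutive entries are at least a given number of seconds apart. The function name, the 300-second threshold, and the sample lines are made up for illustration. In real use the lines would be read from the archive file for the day in question.

```python
import re

# Nagios log entries begin with "[<unix timestamp>]".
LOG_LINE = re.compile(r"^\[(\d+)\]")


def find_gaps(lines, min_gap_seconds=300):
    """Return (previous_ts, next_ts) pairs at least min_gap_seconds apart."""
    gaps = []
    previous = None
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # ignore lines without a leading timestamp
        ts = int(match.group(1))
        if previous is not None and ts - previous >= min_gap_seconds:
            gaps.append((previous, ts))
        previous = ts
    return gaps


# Made-up sample entries; a real run would iterate over the archive file.
sample = [
    "[1418169600] SERVICE ALERT: hostA;svc;OK;HARD;1;made-up sample entry",
    "[1418169630] PASSIVE SERVICE CHECK: hostB;svc;0;made-up sample entry",
    "[1418170230] SERVICE ALERT: hostA;svc;CRITICAL;HARD;1;made-up sample entry",
]

for start, end in find_gaps(sample):
    print(f"silent for {end - start} s between {start} and {end}")
```

The last entries before each silent stretch and the first ones after it are the interesting lines to post here. A quiet log is not proof of a stall by itself, so the gaps should be compared against the known outage times.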