Re: CPU Load Spike daily

Posted: Tue Jul 22, 2014 8:49 am
by tylerhoadley
Updated results from load patch.

Before patch without the added 1000 HTTP checks

Image

After Patch and with 1000 added HTTP checks and retention status cleared and in the past...

Image

for the most part, looks balanced and more stable

Monitor queue view

Image

Re: CPU Load Spike daily

Posted: Tue Jul 22, 2014 8:57 am
by tylerhoadley
I can revert the snapshot and test out the new commit if preferred?

Let me know....

Cheers,

Re: CPU Load Spike daily

Posted: Tue Jul 22, 2014 10:45 am
by sreinhardt
It would be preferable to have the new commit get some testing as well, if you have a chance and don't mind!

Re: CPU Load Spike daily

Posted: Tue Jul 22, 2014 10:50 am
by BanditBBS
sreinhardt wrote: It would be preferable to have the new commit get some testing as well, if you have a chance and don't mind!
Looks like Tyler will be testing it. I will try to test it tonight while watching AGT. It'll give me a reason to not do actual work this evening!

Re: CPU Load Spike daily

Posted: Tue Jul 22, 2014 12:06 pm
by tylerhoadley
reverting snapshot....

Update... (couldn't wait to share check results)

1 min checks
--
Service State: Ok
Duration: 2m 19s
Service Stability: Unchanging (stable)
Last Check: 2014-07-22 13:41:37
Next Check: 2014-07-22 13:42:37


Service State: Ok
Duration: 3m 26s
Service Stability: Unchanging (stable)
Last Check: 2014-07-22 13:42:37
Next Check: 2014-07-22 13:43:37

Service State: Ok
Duration: 4m 36s
Service Stability: Unchanging (stable)
Last Check: 2014-07-22 13:43:37
Next Check: 2014-07-22 13:44:37

Service State: Ok
Duration: 5m 29s
Service Stability: Unchanging (stable)
Last Check: 2014-07-22 13:44:37
Next Check: 2014-07-22 13:45:37

Service State: Ok
Duration: 6m 18s
Service Stability: Unchanging (stable)
Last Check: 2014-07-22 13:45:37
Next Check: 2014-07-22 13:46:37

--

5 min checks
--
Service State: Ok
Duration: 2m 17s
Service Stability: Unchanging (stable)
Last Check: 2014-07-22 13:40:12
Next Check: 2014-07-22 13:45:12


Service State: Ok
Duration: 7m 15s
Service Stability: Unchanging (stable)
Last Check: 2014-07-22 13:45:12
Next Check: 2014-07-22 13:50:12

Service State: Ok
Duration: 10m 18s
Service Stability: Unchanging (stable)
Last Check: 2014-07-22 13:50:11
Next Check: 2014-07-22 13:55:11
--

Monitoring queue after retention.dat was removed, Nagios was started, and some time had passed.

Image

current monitoring queue

Image


I'll post the load screenshot in due time. The trend looks consistent with the commit testing I did earlier, but I will follow up tonight or tomorrow with the data.
Cheers,

Re: CPU Load Spike daily

Posted: Wed Jul 23, 2014 7:49 am
by tylerhoadley
Load trends from post-patch until now.

Image

Monitoring engine Queue

Image

Re: CPU Load Spike daily

Posted: Wed Jul 23, 2014 5:07 pm
by abrist
The value of the y-axis on your earlier posts was cropped. How does the sub-1.0 load average from the latest graph compare to the earlier ones?

Re: CPU Load Spike daily

Posted: Thu Jul 24, 2014 7:44 am
by tylerhoadley
Here is this morning's view on the patch.

24 Hour view on load

Image

latest queue monitor snapshot.

Image


Also, the value of the y-axis is 2 on both prior images.

Re: CPU Load Spike daily

Posted: Thu Jul 24, 2014 4:21 pm
by abrist
So it might have helped just a little bit?
Heads up: A few new commits for some of the rescheduling woes can be found in Eric's branch at:
https://github.com/NagiosEnterprises/na ... scheduling

Re: CPU Load Spike daily

Posted: Fri Jul 25, 2014 1:38 pm
by emislivec
The earlier changes from estanley address scheduling at startup or reload, and when (re)scheduling checks that shouldn't be run at their next check interval because of timeperiod constraints. The aim is to have a smoother schedule, and these changes seem to have helped somewhat.

The commits abrist mentioned relate to the auto-rescheduling of checks, and are intended to smooth out the schedule while running. I've fixed some of the arithmetic to be more precise so the reschedule will run when it is needed. Anyone testing should use the latest commit on that branch, a6470006691fba943334bf031d18f9ae2bf83645 at this time.

For the rescheduling to be applied, auto_reschedule_checks=1 needs to be set in nagios.cfg. Two other settings affect this:
auto_rescheduling_interval controls how often Core will check if auto-rescheduling is needed (the schedule is only changed if there are checks scheduled close to each other).
auto_rescheduling_window controls how far in the future the schedule will be examined and possibly changed if needed.
I've been using 30 and 90 seconds for these on a test system running mostly two minute checks.
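For anyone setting this up, the three directives above might look like this in nagios.cfg. This is only a sketch: it assumes the 30 and 90 second values map to the interval and the window in the order they were listed, and the right values for your system may differ.

```ini
# Enable auto-rescheduling of checks (off unless set to 1)
auto_reschedule_checks=1

# How often (seconds) Core checks whether rescheduling is needed
# (assumed to be the "30" from the post above)
auto_rescheduling_interval=30

# How far into the future (seconds) the schedule is examined
# (assumed to be the "90" from the post above)
auto_rescheduling_window=90
```

Restart Nagios after changing nagios.cfg so the settings take effect.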

There are some fixed thresholds used internally. Right now, the rescheduling won't be run if no checks are scheduled closer than 1/8 second to each other. On lightly loaded systems this might run rarely, if at all. On systems with many checks it may be too frequent. I do plan on changing this threshold to be calculated based on the checks that are actually configured. Additionally, the way host checks are handled when adjusting the schedule can lead to some gaps between checks. I will be correcting this as well.
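To make the 1/8 second threshold concrete, here is a rough Python sketch of the kind of test described above. It is not the Nagios Core code (Core is written in C), and the function and parameter names are made up for illustration. It only shows the idea: look at the checks due inside the window and ask whether any two neighbours are closer together than the threshold.

```python
def needs_reschedule(next_check_times, now, window, min_gap=0.125):
    """Return True if any two checks due within [now, now + window]
    are scheduled closer together than min_gap seconds.

    All names here are hypothetical; min_gap=0.125 mirrors the fixed
    1/8 second threshold mentioned in the post.
    """
    # Keep only the checks that fall inside the examined window, in time order.
    due = sorted(t for t in next_check_times if now <= t <= now + window)
    # Compare each check with the one scheduled right after it.
    return any(later - earlier < min_gap for earlier, later in zip(due, due[1:]))


if __name__ == "__main__":
    # Two checks 0.1 s apart inside a 90 s window: rescheduling would be considered.
    print(needs_reschedule([0.0, 0.1, 5.0], now=0.0, window=90.0))
    # Checks one second apart: nothing is close enough to trigger it.
    print(needs_reschedule([0.0, 1.0, 2.0], now=0.0, window=90.0))
```

On a lightly loaded system the second case is the common one, which matches the note that the rescheduling may rarely run there.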

In my testing I've seen these changes mostly eliminate the seven hour peaks, but wider testing is needed to confirm this in general. Anyone who has the time and a system to test would be greatly appreciated.