CPU Load Spike daily

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
tylerhoadley
Posts: 43
Joined: Tue Jul 02, 2013 1:41 pm

Re: CPU Load Spike daily

Post by tylerhoadley »

Updated results from load patch.

Before patch without the added 1000 HTTP checks

Image

After Patch and with 1000 added HTTP checks and retention status cleared and in the past...

Image

For the most part, it looks balanced and more stable.

Monitor queue view

Image
tylerhoadley
Posts: 43
Joined: Tue Jul 02, 2013 1:41 pm

Re: CPU Load Spike daily

Post by tylerhoadley »

I can revert the snapshot and test out the new commit if preferred?

Let me know....

Cheers,
sreinhardt
-fno-stack-protector
Posts: 4366
Joined: Mon Nov 19, 2012 12:10 pm

Re: CPU Load Spike daily

Post by sreinhardt »

It would be preferable to have the new commit get some testing as well, if you have a chance and don't mind!
Nagios-Plugins maintainer exclusively, unless you have other C language bugs with open-source nagios projects, then I am happy to help! Please pm or use other communication to alert me to issues as I no longer track the forum.
BanditBBS
Posts: 2474
Joined: Tue May 31, 2011 12:57 pm
Location: Scio, OH
Contact:

Re: CPU Load Spike daily

Post by BanditBBS »

sreinhardt wrote: It would be preferable to have the new commit get some testing as well, if you have a chance and don't mind!
Looks like Tyler will be testing it. I will try and test it tonight while watching AGT :ugeek: It'll give me a reason to not do actual work this evening!
2 of XI5.6.14 Prod/DR/DEV - Nagios LogServer 2 Nodes
See my projects on the Exchange at BanditBBS - Also check out my Nagios stuff on my personal page at Bandit's Home and at github
tylerhoadley
Posts: 43
Joined: Tue Jul 02, 2013 1:41 pm

Re: CPU Load Spike daily

Post by tylerhoadley »

reverting snapshot....

Update... (couldn't wait to share check results)

1 min checks
--
Service State: Ok
Duration: 2m 19s
Service Stability: Unchanging (stable)
Last Check: 2014-07-22 13:41:37
Next Check: 2014-07-22 13:42:37


Service State: Ok
Duration: 3m 26s
Service Stability: Unchanging (stable)
Last Check: 2014-07-22 13:42:37
Next Check: 2014-07-22 13:43:37

Service State: Ok
Duration: 4m 36s
Service Stability: Unchanging (stable)
Last Check: 2014-07-22 13:43:37
Next Check: 2014-07-22 13:44:37

Service State: Ok
Duration: 5m 29s
Service Stability: Unchanging (stable)
Last Check: 2014-07-22 13:44:37
Next Check: 2014-07-22 13:45:37

Service State: Ok
Duration: 6m 18s
Service Stability: Unchanging (stable)
Last Check: 2014-07-22 13:45:37
Next Check: 2014-07-22 13:46:37

--

5 mins checks
--
Service State: Ok
Duration: 2m 17s
Service Stability: Unchanging (stable)
Last Check: 2014-07-22 13:40:12
Next Check: 2014-07-22 13:45:12


Service State: Ok
Duration: 7m 15s
Service Stability: Unchanging (stable)
Last Check: 2014-07-22 13:45:12
Next Check: 2014-07-22 13:50:12

Service State: Ok
Duration: 10m 18s
Service Stability: Unchanging (stable)
Last Check: 2014-07-22 13:50:11
Next Check: 2014-07-22 13:55:11
--

Monitoring queue once Nagios was started after retention.dat was removed and some time had elapsed.

Image

current monitoring queue

Image


I'll post the load screenshot in due time... the trend looks consistent like the commit testing I did prior... but will follow up tonight or tomorrow with the data.
Cheers,
tylerhoadley
Posts: 43
Joined: Tue Jul 02, 2013 1:41 pm

Re: CPU Load Spike daily

Post by tylerhoadley »

Load trends from post-patch until now.

Image

Monitoring engine Queue

Image
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: CPU Load Spike daily

Post by abrist »

The value of the y-axis on your earlier posts was cropped. How does the sub-1.0 load average from the latest graph compare to the earlier ones?
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
tylerhoadley
Posts: 43
Joined: Tue Jul 02, 2013 1:41 pm

Re: CPU Load Spike daily

Post by tylerhoadley »

Here is this morning's view on the patch.

24 Hour view on load

Image

latest queue monitor snapshot.

Image


Also the value of the y-axis is 2 on both prior images.
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: CPU Load Spike daily

Post by abrist »

So it might have helped just a little bit?
Heads up: A few new commits for some of the rescheduling woes can be found in Eric[1]'s branch at:
https://github.com/NagiosEnterprises/na ... scheduling
emislivec
Posts: 52
Joined: Tue Feb 25, 2014 10:06 am

Re: CPU Load Spike daily

Post by emislivec »

The earlier changes from estanley address scheduling at startup or reload, and when (re)scheduling checks that shouldn't be run at their next check interval because of timeperiod constraints. The aim is to have a smoother schedule, and these changes seem to have helped somewhat.

The commits abrist mentioned relate to the auto-rescheduling of checks, and are intended to smooth out the schedule while running. I've fixed some of the arithmetic to be more precise so the reschedule will run when it is needed. Anyone testing should use the latest commit on that branch, a6470006691fba943334bf031d18f9ae2bf83645 at this time.

For the rescheduling to be applied, auto_reschedule_checks=1 needs to be set in nagios.cfg. Two other settings affect this:
auto_rescheduling_interval controls how often Core will check if auto-rescheduling is needed (the schedule is only changed if there are checks scheduled close to each other).
auto_rescheduling_window controls how far in the future the schedule will be examined and possibly changed if needed.
I've been using 30 and 90 seconds for these on a test system running mostly two minute checks.
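
For anyone who wants to try the same values, the three settings above might look like this in nagios.cfg. This is only a sketch of the excerpt described in this post; the 30 and 90 second values are the ones from my test system and are not a general recommendation.

```
# nagios.cfg (excerpt) - values taken from the test system described above
auto_reschedule_checks=1
auto_rescheduling_interval=30
auto_rescheduling_window=90
```

Nagios Core needs to be restarted after changing these for them to take effect.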

There are some fixed thresholds used internally. Right now, the rescheduling won't be run if no checks are scheduled closer than 1/8 second. On lightly loaded systems this might run rarely, if at all. On systems with many checks it may be too frequent. I do plan on changing this threshold to be calculated based on the actually configured checks. Additionally, the way host checks are handled when adjusting the schedule can lead to some gaps between checks; I will be correcting this as well.
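
To illustrate the 1/8 second threshold, here is a rough Python sketch of the idea. It is not the code from the branch (Core is written in C), and the names `needs_reschedule`, `now`, and `window` are made up for this example. It assumes check times are plain timestamps in seconds and only looks at checks that fall inside the rescheduling window.

```python
from typing import Iterable

# The fixed 1/8 second threshold described above.
MIN_GAP_SECONDS = 0.125


def needs_reschedule(
    check_times: Iterable[float],
    now: float,
    window: float,
    min_gap: float = MIN_GAP_SECONDS,
) -> bool:
    """Return True if any two checks inside the window are closer than min_gap.

    Hypothetical helper for illustration only; not taken from Nagios Core.
    """
    # Only consider checks scheduled between now and the end of the window.
    upcoming = sorted(t for t in check_times if now <= t <= now + window)
    # Compare each check with the next one in schedule order.
    return any(
        later - earlier < min_gap
        for earlier, later in zip(upcoming, upcoming[1:])
    )
```

With a 90 second window, `needs_reschedule([10.0, 10.05, 40.0], now=0.0, window=90.0)` finds two checks 0.05 seconds apart and reports that a reschedule is needed, while evenly spaced checks a second apart would leave the schedule alone.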

In my testing I've seen these changes mostly eliminate the seven hour peaks, but wider testing is needed to confirm this in general. Anyone who has the time and a system to test would be greatly appreciated.