CPU Load Spike daily

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
BanditBBS
Posts: 2474
Joined: Tue May 31, 2011 12:57 pm
Location: Scio, OH

CPU Load Spike daily

Post by BanditBBS »

I figured I'd start my own thread because this seems different from the "every 7 hours" issue.

Every day from 12:30 to 16:30, the load on my Nagios XI 2014r1.2 server spikes so badly that the server can't even be used. I rebooted the server once during a spike, and it worked fine from the moment it came back up until 12:30 the next day. This was already happening when nothing but the Nagios server itself was being monitored, and it has not gotten any worse as I have added hosts. My OS team looked at it in depth, and the spike seems to be PostgreSQL related. Is there anything in the database that is scheduled at that time, or on a once-per-day schedule? Today I restarted postgresql during the high load and BAM! The load instantly dropped back to normal. It started to climb again, so I stopped postgres and it fell again.

HELP!
EDIT: Before you ask: psql (PostgreSQL) 8.4.20
EDIT2: Maybe I spoke too soon. It spiked again, so I shut down postgres and left it stopped. Eventually the server spiked again. I shut down nagios and started postgres, and it hasn't spiked since. So now I am leaning towards the nagios process being the cause, and I am even more lost, as I have no clue how the nagios process could be spiking at the same time every day.
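Since stopping one service at a time gave mixed signals, one way to narrow this down is to record which commands are actually consuming CPU while the spike is in progress. The following is only a rough sketch: the `cpu_by_command` helper name is made up for illustration, and it assumes a procps-style `ps` that accepts `-o pcpu= -o comm=`.

```shell
#!/bin/sh
# Sum %CPU per command name from "pcpu command" lines on stdin,
# printing the highest consumer first.
# (cpu_by_command is a hypothetical helper, not part of Nagios XI.)
cpu_by_command() {
    awk '{ cpu[$2] += $1 }
         END { for (c in cpu) printf "%.1f %s\n", cpu[c], c }' |
        sort -rn
}

# Possible usage during the 12:30-16:30 window (one sample per minute):
#   while sleep 60; do
#       date
#       ps -eo pcpu= -o comm= | cpu_by_command | head -5
#   done >> /tmp/cpu-samples.log
```

Comparing a few samples from before 12:30 with samples taken during the spike should show whether the load follows the nagios processes, postgres, or something else entirely.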
2 of XI5.6.14 Prod/DR/DEV - Nagios LogServer 2 Nodes
See my projects on the Exchange at BanditBBS - Also check out my Nagios stuff on my personal page at Bandit's Home and at github
sreinhardt
-fno-stack-protector
Posts: 4366
Joined: Mon Nov 19, 2012 12:10 pm

Re: CPU Load Spike daily

Post by sreinhardt »

This is almost definitely related, although the timing seems quite different from the 7-hour and 1:45-hour increases. Could you increase debug_level in nagios.cfg to 12 and raise max_debug_file_size to 10 MB, then send the debug logs over after the system has spent some time in a bad state? Ideally I would like those logs to cover some time when the system is usable and some portion of when it becomes bogged down.
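For anyone following along, the change could be scripted roughly as below. This is a sketch under a few assumptions: the config lives at the default /usr/local/nagios/etc/nagios.cfg, the size directive is spelled max_debug_file_size as in the stock Nagios Core config (value in bytes, so 10 MB is written as 10485760), GNU `sed -i` is available, and the `set_cfg` and `enable_debug` helper names are invented for this example. debug_level is a bitmask, so 12 should correspond to 4 (process information) plus 8 (scheduled events).

```shell
#!/bin/sh
# Set key=value in a Nagios-style config file, replacing an existing
# uncommented line or appending one if the key is absent.
# (set_cfg and enable_debug are hypothetical helpers.)
set_cfg() {
    cfg=$1; key=$2; val=$3
    if grep -q "^${key}=" "$cfg"; then
        sed -i "s|^${key}=.*|${key}=${val}|" "$cfg"
    else
        printf '%s=%s\n' "$key" "$val" >> "$cfg"
    fi
}

# Apply the two settings requested above to the given config file.
enable_debug() {
    set_cfg "$1" debug_level 12
    set_cfg "$1" max_debug_file_size 10485760
}

# Possible usage (back up the file first, then restart Nagios):
#   cp /usr/local/nagios/etc/nagios.cfg /tmp/nagios.cfg.bak
#   enable_debug /usr/local/nagios/etc/nagios.cfg
```

Remember to set debug_level back to 0 once the logs have been collected, since debug logging adds its own overhead.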
Nagios-Plugins maintainer exclusively, unless you have other C language bugs with open-source nagios projects, then I am happy to help! Please pm or use other communication to alert me to issues as I no longer track the forum.

Re: CPU Load Spike daily

Post by BanditBBS »

sreinhardt wrote: This is almost definitely related, although the timing seems quite different from the 7-hour and 1:45-hour increases. Could you increase debug_level in nagios.cfg to 12 and raise max_debug_file_size to 10 MB, then send the debug logs over after the system has spent some time in a bad state? Ideally I would like those logs to cover some time when the system is usable and some portion of when it becomes bogged down.
You'll have it today good sir!

Dumb question #1: Where is the debug log stored?
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: CPU Load Spike daily

Post by slansing »

By default it should be at:

Code: Select all

/usr/local/nagios/var/nagios.debug

Re: CPU Load Spike daily

Post by BanditBBS »

Attached is the log. The issue started right on time as usual. I'll be able to use the system again in 4 hours.

Re: CPU Load Spike daily

Post by BanditBBS »

Also, here is a capture from mytop:

Code: Select all

MySQL on localhost (5.1.73)                                                                                                 up 2+22:15:09 [13:29:29]
 Queries: 2.3M   qps:    9 Slow:   107.0         Se/In/Up/De(%):    16/45/01/02
             qps now:    7 Slow qps: 0.0  Threads:   53 (   1/   0) 03/00/00/00
 Key Efficiency: 98.0%  Bps in/out:  7.3k/ 1.8k   Now in/out: 264.6/ 2.0k

      Id      User         Host/IP         DB      Time    Cmd Query or State
       --      ----         -------         --      ----    --- ----------
    98125  nagiosql       localhost   nagiosql         0  Query show full processlist
    29381  nagiosql       localhost   nagiosql         4  Sleep
    68374  nagiosql       localhost   nagiosql         4  Sleep
    97675  nagiosql       localhost   nagiosql         4  Sleep
    97719  nagiosql       localhost   nagiosql         4  Sleep
    97701  nagiosql       localhost   nagiosql         8  Sleep
    97679  nagiosql       localhost   nagiosql         9  Sleep
    97697  nagiosql       localhost   nagiosql         9  Sleep
    97718  nagiosql       localhost   nagiosql         9  Sleep
    97721  nagiosql       localhost   nagiosql        11  Sleep
    67286  nagiosql       localhost   nagiosql        14  Sleep
    97685  nagiosql       localhost   nagiosql        14  Sleep
    97695  nagiosql       localhost   nagiosql        14  Sleep
    30637  nagiosql       localhost   nagiosql        18  Sleep
    68384  nagiosql       localhost   nagiosql        19  Sleep
    68392  nagiosql       localhost   nagiosql        19  Sleep
    97683  nagiosql       localhost   nagiosql        19  Sleep
    67276  nagiosql       localhost   nagiosql        24  Sleep
    68383  nagiosql       localhost   nagiosql        24  Sleep
    97724  nagiosql       localhost   nagiosql        24  Sleep
    98129  nagiosql       localhost   nagiosql        29  Sleep
    98136  nagiosql       localhost   nagiosql        29  Sleep
    98139  nagiosql       localhost   nagiosql        29  Sleep
    98141  nagiosql       localhost   nagiosql        29  Sleep
    98143  nagiosql       localhost   nagiosql        29  Sleep
    97703  nagiosql       localhost   nagiosql        34  Sleep
lmiltchev
Former Nagios Staff
Posts: 13589
Joined: Mon May 23, 2011 12:15 pm

Re: CPU Load Spike daily

Post by lmiltchev »

Thanks, BanditBBS! Our developers will be looking at the debug info that you provided as soon as they can.
Be sure to check out our Knowledgebase for helpful articles and solutions!

Re: CPU Load Spike daily

Post by BanditBBS »

Just an FYI - I just implemented ramdisk. I don't see how this can help this weird issue, but anything is worth a shot, so I will report back tomorrow evening if the issue persists.

Edit: Ramdisk didn't help at all
Edit2: I was home today during the spike unlike yesterday and can verify the ramdisk didn't help make it usable during the spike either
emislivec
Posts: 52
Joined: Tue Feb 25, 2014 10:06 am

Re: CPU Load Spike daily

Post by emislivec »

Thanks for the log Bandit. It looks like we are seeing some issues here with check scheduling bunching things up instead of spreading them out. However this was happening before things blew up toward the end of the log, so I think we're looking at two separate issues (check scheduling and whatever postgres is doing) that may be interacting. Postgres may be ultimately responsible for things becoming unusable for you, but the bunching of checks can't be helping.

Eric[0] made some changes that specifically target bunching of checks at the start of timeframes, but also affect checks more generally. If you have a test system that's showing load spikes, it would be worth trying them out (hint, hint): https://github.com/NagiosEnterprises/na ... ac8dcb6f04
Spenser should have some install commands for you shortly.

One caveat to this fix is that you need to leave Core stopped long enough so that the next_check time for all (or most) checks in retention.dat is in the past (or greater than the check's interval time in the future) so that the new scheduling algorithm takes effect. So stopping Core for a bit over 5 minutes and then starting it should apply the new scheduling to the checks you're running.
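To get a feel for how long Core needs to stay stopped, something like the following could be run against retention.dat while Core is down. It is a simplified sketch: it only counts next_check values that are still in the future and ignores the "greater than the check's interval" case, the `count_future_checks` name is invented here, and /usr/local/nagios/var/retention.dat is assumed to be the default location.

```shell
#!/bin/sh
# Count how many next_check timestamps in a retention file are still
# in the future relative to "now" (epoch seconds, defaults to the
# current time). count_future_checks is a hypothetical helper.
count_future_checks() {
    file=$1
    now=${2:-$(date +%s)}
    awk -F= -v now="$now" '
        $1 ~ /^[ \t]*next_check$/ {
            total++
            if ($2 + 0 > now) future++
        }
        END {
            printf "%d of %d checks still scheduled in the future\n",
                   future, total
        }
    ' "$file"
}

# Possible usage while Core is stopped:
#   count_future_checks /usr/local/nagios/var/retention.dat
```

Once the first number has dropped to zero, or close to it, starting Core again should let the new scheduling apply to most checks.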

Running with debug_level=12 and sending the log again would be awesome. Also, can you post, PM or email your nagios.cfg from the earlier run?

Re: CPU Load Spike daily

Post by sreinhardt »

Here are some steps to test out the patch. This only reinstalls the current version of Core from 2014r1.2 with the scheduler patch applied. Sorry, this is the simplest way to do it, since this isn't a release.

Code: Select all

cd /tmp
rm -rf ./nagios* ./xi-*
wget http://assets.nagios.com/downloads/nagiosxi/xi-latest.tar.gz
tar xzf xi-latest.tar.gz
cd nagiosxi/subcomponents/nagioscore/
rm -f apply-patches

# Download the attached zip and move it into this directory

unzip schedulerpatch.zip
mv scheduler.patch patches/
chmod +x apply-patches
./upgrade