CPU Load Spike daily

Post by **BanditBBS** » Wed Jul 02, 2014 2:29 pm

I figured I'd start my own thread because this seems different from the "every 7 hours" issue.

Every day from 12:30 to 16:30, the load on my Nagios XI 2014r1.2 server spikes so badly that the server is unusable. I rebooted the server once during a spike, and it ran fine from the moment it came back up until 12:30 the next day. This was already happening when the Nagios server itself was the only thing being monitored, and it has not gotten any worse as I have added hosts. My OS team looked at it in depth, and the spike seems to be PostgreSQL-related. Is anything in the database scheduled at that time, or on a once-per-day schedule? Today I restarted postgresql during the high load and BAM! The load instantly dropped back to normal. It started to climb again, so I stopped postgres and it fell again.

HELP!
EDIT: Before you ask: psql (PostgreSQL) 8.4.20
EDIT2: Maybe I spoke too soon. It spiked again, so I shut down postgres and left it stopped. Eventually the server spiked again. I shut down nagios and started postgres, and it hasn't spiked since. So now I am leaning towards the nagios process being the cause, and I am even more lost, as I have no clue how the nagios process could be spiking at the same time every day.

Post by **sreinhardt** » Thu Jul 03, 2014 10:18 am

This is almost definitely related, although the timing is quite different from the 7-hour and 1:45-hour increases. Could you set debug_level to 12 in nagios.cfg and raise max_debug_file_size to 10 MB, then send the debug logs over after the system has spent some time in a bad state? Ideally the logs would cover both a period where the system is usable and part of the period where it becomes bogged down.
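For anyone following along, the requested change can be sketched as a small shell helper. This is only a sketch under assumptions: `set_directive` and `enable_scheduling_debug` are hypothetical names made up here, the size cap is assumed to be the `max_debug_file_size` directive from the stock nagios.cfg, and 10 MB is written as 10485760 bytes.

```shell
#!/bin/sh
# set_directive FILE KEY VALUE (hypothetical helper, not part of Nagios):
# rewrite an existing uncommented KEY= line, or append one if none exists.
set_directive() {
    file=$1; key=$2; value=$3
    if grep -q "^${key}=" "$file"; then
        tmp=$(mktemp) || return 1
        sed "s|^${key}=.*|${key}=${value}|" "$file" > "$tmp" &&
            cat "$tmp" > "$file"    # overwrite contents, keeping owner and mode
        rm -f "$tmp"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# enable_scheduling_debug FILE (hypothetical helper):
# back up the config, then apply the two values requested in this thread.
enable_scheduling_debug() {
    cfg=$1
    cp -p "$cfg" "$cfg.bak" || return 1
    set_directive "$cfg" debug_level 12
    # Assumed directive name; 10 MB expressed in bytes.
    set_directive "$cfg" max_debug_file_size 10485760
}
```

On a default install this would presumably be run as `enable_scheduling_debug /usr/local/nagios/etc/nagios.cfg` (path assumed), followed by a restart of the nagios service so the new values are read.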

Post by **BanditBBS** » Thu Jul 03, 2014 10:20 am

sreinhardt wrote: This is almost definitely related, although the timing is quite different from the 7-hour and 1:45-hour increases. Could you set debug_level to 12 in nagios.cfg and raise max_debug_file_size to 10 MB, then send the debug logs over after the system has spent some time in a bad state? Ideally the logs would cover both a period where the system is usable and part of the period where it becomes bogged down.

You'll have it today good sir!

Dumb question #1: Where is the debug log stored?

Post by **slansing** » Thu Jul 03, 2014 10:47 am

By default it should be at:

/usr/local/nagios/var/nagios.debug

Post by **BanditBBS** » Thu Jul 03, 2014 1:24 pm

Attached is the log. The issue started right on time as usual. I'll be able to use the system again in 4 hours.

Post by **BanditBBS** » Thu Jul 03, 2014 1:33 pm

Also, here is a capture from mytop:

MySQL on localhost (5.1.73)                                                                                                 up 2+22:15:09 [13:29:29]
 Queries: 2.3M   qps:    9 Slow:   107.0         Se/In/Up/De(%):    16/45/01/02
             qps now:    7 Slow qps: 0.0  Threads:   53 (   1/   0) 03/00/00/00
 Key Efficiency: 98.0%  Bps in/out:  7.3k/ 1.8k   Now in/out: 264.6/ 2.0k

      Id      User         Host/IP         DB      Time    Cmd Query or State
       --      ----         -------         --      ----    --- ----------
    98125  nagiosql       localhost   nagiosql         0  Query show full processlist
    29381  nagiosql       localhost   nagiosql         4  Sleep
    68374  nagiosql       localhost   nagiosql         4  Sleep
    97675  nagiosql       localhost   nagiosql         4  Sleep
    97719  nagiosql       localhost   nagiosql         4  Sleep
    97701  nagiosql       localhost   nagiosql         8  Sleep
    97679  nagiosql       localhost   nagiosql         9  Sleep
    97697  nagiosql       localhost   nagiosql         9  Sleep
    97718  nagiosql       localhost   nagiosql         9  Sleep
    97721  nagiosql       localhost   nagiosql        11  Sleep
    67286  nagiosql       localhost   nagiosql        14  Sleep
    97685  nagiosql       localhost   nagiosql        14  Sleep
    97695  nagiosql       localhost   nagiosql        14  Sleep
    30637  nagiosql       localhost   nagiosql        18  Sleep
    68384  nagiosql       localhost   nagiosql        19  Sleep
    68392  nagiosql       localhost   nagiosql        19  Sleep
    97683  nagiosql       localhost   nagiosql        19  Sleep
    67276  nagiosql       localhost   nagiosql        24  Sleep
    68383  nagiosql       localhost   nagiosql        24  Sleep
    97724  nagiosql       localhost   nagiosql        24  Sleep
    98129  nagiosql       localhost   nagiosql        29  Sleep
    98136  nagiosql       localhost   nagiosql        29  Sleep
    98139  nagiosql       localhost   nagiosql        29  Sleep
    98141  nagiosql       localhost   nagiosql        29  Sleep
    98143  nagiosql       localhost   nagiosql        29  Sleep
    97703  nagiosql       localhost   nagiosql        34  Sleep

Post by **lmiltchev** » Thu Jul 03, 2014 2:45 pm

Thanks, BanditBBS! Our developers will be looking at the debug info that you provided as soon as they can.

Post by **BanditBBS** » Thu Jul 03, 2014 11:17 pm

Just an FYI: I just implemented a ramdisk. I don't see how it can help with this weird issue, but anything is worth a shot, so I will report back tomorrow evening on whether the issue persists.

Edit: The ramdisk didn't help at all.
Edit2: Unlike yesterday, I was home during today's spike and can verify that the ramdisk didn't make the system usable during the spike either.

Post by **emislivec** » Mon Jul 07, 2014 12:39 pm

Thanks for the log Bandit. It looks like we are seeing some issues here with check scheduling bunching things up instead of spreading them out. However this was happening before things blew up toward the end of the log, so I think we're looking at two separate issues (check scheduling and whatever postgres is doing) that may be interacting. Postgres may be ultimately responsible for things becoming unusable for you, but the bunching of checks can't be helping.

Eric[0] made some changes that specifically target bunching of checks at the start of timeframes but also affect checks more generally. If you have a test system that's showing load spikes, it would be worth trying them out (hint, hint): https://github.com/NagiosEnterprises/na ... ac8dcb6f04
Spenser should have some install commands for you shortly.

One caveat to this fix is that you need to leave Core stopped long enough so that the next_check time for all (or most) checks in retention.dat is in the past (or greater than the check's interval time in the future) so that the new scheduling algorithm takes effect. So stopping Core for a bit over 5 minutes and then starting it should apply the new scheduling to the checks you're running.
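A rough way to check that condition before starting Core again is to count how many `next_check` timestamps in the retention file are still in the future. The sketch below assumes `next_check=<epoch seconds>` lines as they appear in retention.dat; `count_next_checks` is a hypothetical helper name, and it only covers the "in the past" case, not the "more than one interval in the future" case.

```shell
#!/bin/sh
# count_next_checks FILE NOW (hypothetical helper):
# FILE is a retention file, NOW is a reference time in epoch seconds.
# Prints how many next_check values are at or before NOW and how many are after it.
count_next_checks() {
    awk -F= -v now="$2" '
        /^[ \t]*next_check=/ {
            if ($2 + 0 > now + 0) future++; else past++
        }
        END { printf "past=%d future=%d\n", past, future }
    ' "$1"
}
```

With Core stopped, something like `count_next_checks /usr/local/nagios/var/retention.dat "$(date +%s)"` (default path assumed) should show `future` shrinking toward zero as the wait goes on.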

Running with debug_level=12 and sending the log again would be awesome. Also, can you post, PM or email your nagios.cfg from the earlier run?

Post by **sreinhardt** » Mon Jul 07, 2014 12:59 pm

Here are some steps to test out the patch. This will only reinstall the current version of Core from 2014r1.2 with the scheduler patch applied. Sorry, this is the simplest way to do it, since this isn't a release.

cd /tmp
rm -rf ./nagios* ./xi-*
wget http://assets.nagios.com/downloads/nagiosxi/xi-latest.tar.gz
tar xzf xi-latest.tar.gz
cd nagiosxi/subcomponents/nagioscore/
rm -f apply-patches

# Download the attached zip and move it into this directory, then:

unzip schedulerpatch.zip
mv scheduler.patch patches/
chmod +x apply-patches
./upgrade
