CPU Load Spike daily
Posted: Wed Jul 02, 2014 2:29 pm
by BanditBBS
I figured I'd start my own thread because this seems different from the "every 7 hours" issue.
Every day from 12:30 to 16:30 my NagiosXI 2014r1.2 server spikes so badly that it can't even be used. I rebooted the server one day during the load spike and it worked fine as soon as it came back up, until the next day at 12:30. This was happening with nothing but the Nagios server itself being monitored and has not gotten any worse as I have added hosts. My OS team looked at it in depth and the spike seems to be postgresql related. Is there anything in the database that is scheduled at that time, or on a once-per-day schedule? Today I restarted postgresql during the high load and BAM! The load instantly dropped back to normal. It started to climb again, so I stopped postgres and it fell again.
HELP!
EDIT: Before you ask: psql (PostgreSQL) 8.4.20
EDIT2: Maybe I spoke too soon. It spiked again, so I shut down postgres and left it stopped. Eventually the server spiked again. I shut down nagios and started postgres and it hasn't spiked since. So now I am leaning towards it being the nagios process that's doing it, and I am even more lost, as I have no clue how the nagios process could be spiking every day at the same time.
Re: CPU Load Spike daily
Posted: Thu Jul 03, 2014 10:18 am
by sreinhardt
This is almost definitely related, although the timing seems quite different from the 7 and 1:45 hour increases. Could you increase the debug level in nagios.cfg to 12, and extend the max_debug_log_size to 10 MB, then send the logs over after some time in a bad state? Ideally I would like the debug logs to cover some time where the system is usable and some portion of when it becomes bogged down.
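For anyone following along, here is a rough sketch of that change as a shell helper. It assumes the size directive is spelled `max_debug_size` (in bytes), as in the stock Nagios Core sample config, and that the config lives at /usr/local/nagios/etc/nagios.cfg. The function names `set_directive` and `enable_debug` are made up for illustration, and `sed -i` is used in its GNU form. Check your own nagios.cfg before applying anything.

```shell
# Hypothetical helper: set key=value in a Nagios-style config file,
# replacing an existing line or appending one if the key is absent.
set_directive() {
    cfg=$1 key=$2 value=$3
    if grep -q "^${key}=" "$cfg"; then
        sed -i "s|^${key}=.*|${key}=${value}|" "$cfg"
    else
        printf '%s=%s\n' "$key" "$value" >> "$cfg"
    fi
}

# debug_level 12 = 4 (process information) + 8 (scheduled events).
# The size is 10 MB expressed in bytes.
enable_debug() {
    set_directive "$1" debug_level 12
    set_directive "$1" max_debug_size $((10 * 1024 * 1024))
}

# Usage (work on a copy first, then restart Nagios Core yourself):
#   cp /usr/local/nagios/etc/nagios.cfg /tmp/nagios.cfg.test
#   enable_debug /tmp/nagios.cfg.test
```

Working on a copy first makes it easy to diff the result against the live file before restarting Core.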
Re: CPU Load Spike daily
Posted: Thu Jul 03, 2014 10:20 am
by BanditBBS
sreinhardt wrote: This is almost definitely related, although the timing seems quite different from the 7 and 1:45 hour increases. Could you increase the debug level in nagios.cfg to 12, and extend the max_debug_log_size to 10 MB, then send the logs over after some time in a bad state? Ideally I would like the debug logs to cover some time where the system is usable and some portion of when it becomes bogged down.
You'll have it today good sir!
Dumb question #1: Where is the debug log stored?
Re: CPU Load Spike daily
Posted: Thu Jul 03, 2014 10:47 am
by slansing
By default it should be at:
/usr/local/nagios/var/nagios.debug
Re: CPU Load Spike daily
Posted: Thu Jul 03, 2014 1:24 pm
by BanditBBS
Attached is the log. The issue started right on time as usual. I'll be able to use the system again in 4 hours.
Re: CPU Load Spike daily
Posted: Thu Jul 03, 2014 1:33 pm
by BanditBBS
Also, here is a capture from mytop:
MySQL on localhost (5.1.73) up 2+22:15:09 [13:29:29]
Queries: 2.3M qps: 9 Slow: 107.0 Se/In/Up/De(%): 16/45/01/02
qps now: 7 Slow qps: 0.0 Threads: 53 ( 1/ 0) 03/00/00/00
Key Efficiency: 98.0% Bps in/out: 7.3k/ 1.8k Now in/out: 264.6/ 2.0k
Id     User      Host/IP    DB        Time  Cmd    Query or State
--     ----      -------    --        ----  ---    ----------
98125  nagiosql  localhost  nagiosql     0  Query  show full processlist
29381  nagiosql  localhost  nagiosql     4  Sleep
68374  nagiosql  localhost  nagiosql     4  Sleep
97675  nagiosql  localhost  nagiosql     4  Sleep
97719  nagiosql  localhost  nagiosql     4  Sleep
97701  nagiosql  localhost  nagiosql     8  Sleep
97679  nagiosql  localhost  nagiosql     9  Sleep
97697  nagiosql  localhost  nagiosql     9  Sleep
97718  nagiosql  localhost  nagiosql     9  Sleep
97721  nagiosql  localhost  nagiosql    11  Sleep
67286  nagiosql  localhost  nagiosql    14  Sleep
97685  nagiosql  localhost  nagiosql    14  Sleep
97695  nagiosql  localhost  nagiosql    14  Sleep
30637  nagiosql  localhost  nagiosql    18  Sleep
68384  nagiosql  localhost  nagiosql    19  Sleep
68392  nagiosql  localhost  nagiosql    19  Sleep
97683  nagiosql  localhost  nagiosql    19  Sleep
67276  nagiosql  localhost  nagiosql    24  Sleep
68383  nagiosql  localhost  nagiosql    24  Sleep
97724  nagiosql  localhost  nagiosql    24  Sleep
98129  nagiosql  localhost  nagiosql    29  Sleep
98136  nagiosql  localhost  nagiosql    29  Sleep
98139  nagiosql  localhost  nagiosql    29  Sleep
98141  nagiosql  localhost  nagiosql    29  Sleep
98143  nagiosql  localhost  nagiosql    29  Sleep
97703  nagiosql  localhost  nagiosql    34  Sleep
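One way to summarise a capture like this, rather than eyeballing it, is to count the threads per command state. The sketch below assumes the capture was saved to a text file (or is piped in) and that every row has the User, Host/IP and DB columns filled in, so the command lands in the sixth whitespace-separated field. `summarize_threads` is a made-up name.

```shell
# Count connections per command state in a saved mytop/processlist capture.
# Rows are recognised by a numeric Id in column 1; the command is column 6.
# Reads the named files, or standard input when no file is given.
summarize_threads() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 6 { count[$6]++ }
         END { for (c in count) printf "%s %d\n", c, count[c] }' "$@" | sort
}

# Usage:
#   summarize_threads mytop-capture.txt
```

For the process list above this would report one Query thread (the mytop query itself) and 25 Sleep threads.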
Re: CPU Load Spike daily
Posted: Thu Jul 03, 2014 2:45 pm
by lmiltchev
Thanks, BanditBBS! Our developers will be looking at the debug info that you provided as soon as they can.
Re: CPU Load Spike daily
Posted: Thu Jul 03, 2014 11:17 pm
by BanditBBS
Just an FYI - I just implemented ramdisk. I don't see how this can help this weird issue, but anything is worth a shot, so I will report back tomorrow evening if the issue persists.
Edit: Ramdisk didn't help at all
Edit2: I was home today during the spike unlike yesterday and can verify the ramdisk didn't help make it usable during the spike either
Re: CPU Load Spike daily
Posted: Mon Jul 07, 2014 12:39 pm
by emislivec
Thanks for the log, Bandit. It looks like we are seeing some issues here with check scheduling bunching things up instead of spreading them out. However, this was happening before things blew up toward the end of the log, so I think we're looking at two separate issues (check scheduling and whatever postgres is doing) that may be interacting. Postgres may be ultimately responsible for things becoming unusable for you, but the bunching of checks can't be helping.
Eric[0] made some changes that specifically target bunching of checks at the start of timeframes, but also affect checks more generally. If you have a test system that's showing load spikes, it would be worth trying out (hint, hint):
https://github.com/NagiosEnterprises/na ... ac8dcb6f04
Spenser should have some install commands for you shortly.
One caveat to this fix is that you need to leave Core stopped long enough so that the next_check time for all (or most) checks in retention.dat is in the past (or greater than the check's interval time in the future) so that the new scheduling algorithm takes effect. So stopping Core for a bit over 5 minutes and then starting it should apply the new scheduling to the checks you're running.
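A quick way to judge how long Core needs to stay stopped is to look at the `next_check` timestamps in retention.dat. This is only a sketch: it assumes the file is at /usr/local/nagios/var/retention.dat and that each host/service block carries a `next_check=<epoch seconds>` line. It checks only the "in the past" half of the condition, ignoring the "further out than the check interval" case, and `pending_checks` is a made-up name.

```shell
# Report how many next_check timestamps in a retention file are still in
# the future, and how many seconds remain until the latest one has passed.
# Usage: pending_checks FILE [NOW_EPOCH]   (NOW_EPOCH defaults to "now")
pending_checks() {
    file=$1
    now=${2:-$(date +%s)}
    awk -F= -v now="$now" '
        $1 ~ /^[ \t]*next_check$/ {
            total++
            if ($2 + 0 > now) {
                future++
                if ($2 - now > wait) wait = $2 - now
            }
        }
        END {
            printf "%d of %d checks still scheduled in the future; latest is %d seconds away\n",
                   future, total, wait
        }
    ' "$file"
}

# Usage, with Core already stopped:
#   pending_checks /usr/local/nagios/var/retention.dat
```

Once it reports 0 checks in the future, every retained `next_check` is in the past and Core can be started again.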
Running with debug_level=12 and sending the log again would be awesome. Also, can you post, PM or email your nagios.cfg from the earlier run?
Re: CPU Load Spike daily
Posted: Mon Jul 07, 2014 12:59 pm
by sreinhardt
Here are some steps to test out the patch. This will only reinstall the current version of Core from 2014r1.2 with the scheduler patch applied. Sorry, this is the simplest way to do it since this isn't a release.
cd /tmp
rm -rf ./nagios* ./xi-*
wget http://assets.nagios.com/downloads/nagiosxi/xi-latest.tar.gz
tar xzf xi-latest.tar.gz
cd nagiosxi/subcomponents/nagioscore/
rm -f apply-patches
# Download the attached zip and move it into this directory
unzip schedulerpatch.zip
mv scheduler.patch patches/
chmod +x apply-patches
./upgrade