CPU Load Spike daily
I figured I'd start my own thread because this seems different from the "every 7 hours" issue.
Every day from 12:30 to 16:30, the load on my Nagios XI 2014r1.2 server spikes so badly that the server can't even be used. I rebooted the server during one of these spikes, and it worked fine as soon as it came back up, until 12:30 the next day. This was happening with nothing but the Nagios server itself being monitored, and it has not gotten any worse as I have added hosts. My OS team looked at it in depth, and the spike seems to be PostgreSQL related. Is there anything in the database that is scheduled at that time, or on a once-per-day schedule? Today I restarted PostgreSQL during the high load and BAM! The load instantly dropped back to normal. It started to climb again, so I stopped Postgres and it fell again.
HELP!
EDIT: Before you ask: psql (PostgreSQL) 8.4.20
EDIT2: Maybe I spoke too soon. It spiked again, so I shut down Postgres and left it stopped. Eventually the server spiked again. I shut down Nagios and started Postgres, and it hasn't spiked since. So now I am leaning towards the nagios process being the cause, which leaves me even more lost, as I have no clue how the nagios process could be spiking at the same time every day.
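Since the suspicion moves back and forth between Postgres and the nagios process, it may help to record which processes are actually busy while the spike is under way. The following is a minimal sketch, not something taken from this thread: the function names and the log path are hypothetical, and the `--sort` option assumes the procps version of `ps` found on typical Linux systems.

```shell
#!/bin/sh
# Sketch only: in_spike_window, snapshot and the log path are hypothetical.

# Succeeds if HHMM (for example 1230) falls inside the 12:30-16:30 window
# described in the post above.
in_spike_window() {
    [ "$1" -ge 1230 ] && [ "$1" -le 1630 ]
}

# Prints a timestamp, the load averages, and the five busiest processes
# (the ps header line plus five rows).
snapshot() {
    date '+%Y-%m-%d %H:%M:%S'
    uptime
    ps -eo pcpu,pid,user,comm --sort=-pcpu | head -n 6
}

# Example use from a once-a-minute cron job (left commented out so that
# sourcing this file has no side effects):
#   in_spike_window "$(date +%H%M)" && snapshot >> /tmp/load-snapshots.log
```

Comparing snapshots taken just before 12:30 with ones taken during the spike should show whether postgres, nagios, or something else is at the top of the list.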
2 of XI5.6.14 Prod/DR/DEV - Nagios LogServer 2 Nodes
See my projects on the Exchange at BanditBBS - Also check out my Nagios stuff on my personal page at Bandit's Home and at github
-
sreinhardt
- -fno-stack-protector
- Posts: 4366
- Joined: Mon Nov 19, 2012 12:10 pm
Re: CPU Load Spike daily
This is almost definitely related, although seemingly quite different time-wise from the 7 and 1:45 hour increases. Could you increase the debug level in nagios.cfg to 12, and extend max_debug_file_size to 10 MB, then send the debug logs over after some time in a bad state? Ideally I would like to see some time where the system is usable and some portion of when it becomes bogged down in those debug logs.
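One way to apply those two settings is sketched below. This is an illustration under assumptions rather than an official procedure: `set_cfg_value` is a hypothetical helper, the config path is the usual default and may differ on your system, and 10 MB is written out in bytes on the assumption that the size directive takes a byte count.

```shell
#!/bin/sh
# Hypothetical helper: set KEY=VALUE in a Nagios-style config file,
# replacing an existing KEY= line or appending one if the key is absent.
# Values containing "|" would break the sed expression; fine for numbers.
set_cfg_value() {
    file=$1; key=$2; value=$3
    tmp=$(mktemp) || return 1
    if grep -q "^${key}=" "$file"; then
        sed "s|^${key}=.*|${key}=${value}|" "$file" > "$tmp"
    else
        { cat "$file"; printf '%s=%s\n' "$key" "$value"; } > "$tmp"
    fi
    # Write back through the original file so its owner and mode are kept.
    cat "$tmp" > "$file" && rm -f "$tmp"
}

# Example use (commented out; path and restart command are assumptions):
#   cp /usr/local/nagios/etc/nagios.cfg /usr/local/nagios/etc/nagios.cfg.bak
#   set_cfg_value /usr/local/nagios/etc/nagios.cfg debug_level 12
#   set_cfg_value /usr/local/nagios/etc/nagios.cfg max_debug_file_size 10485760
#   service nagios restart
```

Remember to set debug_level back to 0 once the log has been captured, since debug logging adds overhead of its own.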
Nagios-Plugins maintainer exclusively, unless you have other C language bugs with open-source nagios projects, then I am happy to help! Please pm or use other communication to alert me to issues as I no longer track the forum.
Re: CPU Load Spike daily
sreinhardt wrote: This is almost definitely related, although seemingly quite different time-wise from the 7 and 1:45 hour increases. Could you increase the debug level in nagios.cfg to 12, and extend max_debug_file_size to 10 MB, then send the debug logs over after some time in a bad state? Ideally I would like to see some time where the system is usable and some portion of when it becomes bogged down in those debug logs.
You'll have it today, good sir!
Dumb question #1: Where is the debug log stored?
-
slansing
- Posts: 7698
- Joined: Mon Apr 23, 2012 4:28 pm
- Location: Travelling through time and space...
Re: CPU Load Spike daily
By default it should be at:
Code: Select all
/usr/local/nagios/var/nagios.debug
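If the file is not at the default location, the path can be read from the debug_file line in nagios.cfg. A minimal sketch, assuming the usual KEY=VALUE layout of that file; `get_cfg_value` is a hypothetical helper, and the config path in the usage comment is the common default rather than something confirmed in this thread.

```shell
#!/bin/sh
# Hypothetical helper: print the value of KEY from a Nagios-style
# KEY=VALUE config file. If the key appears more than once, this sketch
# simply prints the last occurrence.
get_cfg_value() {
    sed -n "s/^$2=//p" "$1" | tail -n 1
}

# Example use (commented out; the path is an assumption):
#   debug_log=$(get_cfg_value /usr/local/nagios/etc/nagios.cfg debug_file)
#   ls -l "$debug_log"
```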
Re: CPU Load Spike daily
Attached is the log. The issue started right on time as usual. I'll be able to use the system again in 4 hours.
Re: CPU Load Spike daily
Also, here is a capture from mytop:
Code: Select all
MySQL on localhost (5.1.73) up 2+22:15:09 [13:29:29]
Queries: 2.3M qps: 9 Slow: 107.0 Se/In/Up/De(%): 16/45/01/02
qps now: 7 Slow qps: 0.0 Threads: 53 ( 1/ 0) 03/00/00/00
Key Efficiency: 98.0% Bps in/out: 7.3k/ 1.8k Now in/out: 264.6/ 2.0k
Id User Host/IP DB Time Cmd Query or State
-- ---- ------- -- ---- --- ----------
98125 nagiosql localhost nagiosql 0 Query show full processlist
29381 nagiosql localhost nagiosql 4 Sleep
68374 nagiosql localhost nagiosql 4 Sleep
97675 nagiosql localhost nagiosql 4 Sleep
97719 nagiosql localhost nagiosql 4 Sleep
97701 nagiosql localhost nagiosql 8 Sleep
97679 nagiosql localhost nagiosql 9 Sleep
97697 nagiosql localhost nagiosql 9 Sleep
97718 nagiosql localhost nagiosql 9 Sleep
97721 nagiosql localhost nagiosql 11 Sleep
67286 nagiosql localhost nagiosql 14 Sleep
97685 nagiosql localhost nagiosql 14 Sleep
97695 nagiosql localhost nagiosql 14 Sleep
30637 nagiosql localhost nagiosql 18 Sleep
68384 nagiosql localhost nagiosql 19 Sleep
68392 nagiosql localhost nagiosql 19 Sleep
97683 nagiosql localhost nagiosql 19 Sleep
67276 nagiosql localhost nagiosql 24 Sleep
68383 nagiosql localhost nagiosql 24 Sleep
97724 nagiosql localhost nagiosql 24 Sleep
98129 nagiosql localhost nagiosql 29 Sleep
98136 nagiosql localhost nagiosql 29 Sleep
98139 nagiosql localhost nagiosql 29 Sleep
98141 nagiosql localhost nagiosql 29 Sleep
98143 nagiosql localhost nagiosql 29 Sleep
97703 nagiosql localhost nagiosql 34 Sleep
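A capture like this is easier to read as a tally of connections per command type. The sketch below is only an illustration: `tally_cmds` is a hypothetical helper, and it assumes the column layout shown above (Id, User, Host/IP, DB, Time, Cmd), with the DB column never empty.

```shell
#!/bin/sh
# Hypothetical helper: read a mytop-style process list on stdin and print
# one "Cmd count" line per command type. Only rows whose first field is a
# numeric connection Id are counted, so header lines are skipped.
tally_cmds() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 6 { n[$6]++ }
         END { for (c in n) print c, n[c] }' | sort
}

# Example use (commented out): save the capture to a file, then
#   tally_cmds < mytop-capture.txt
```

If the tally shows only sleeping connections plus the mytop query itself, as the capture above appears to, MySQL is probably not where the CPU time is going.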
Re: CPU Load Spike daily
Thanks, BanditBBS! Our developers will be looking at the debug info that you provided as soon as they can.
Be sure to check out our Knowledgebase for helpful articles and solutions!
Re: CPU Load Spike daily
Just an FYI - I just implemented ramdisk. I don't see how this can help this weird issue, but anything is worth a shot, so I will report back tomorrow evening if the issue persists.
Edit: Ramdisk didn't help at all
Edit2: I was home today during the spike unlike yesterday and can verify the ramdisk didn't help make it usable during the spike either
Re: CPU Load Spike daily
Thanks for the log Bandit. It looks like we are seeing some issues here with check scheduling bunching things up instead of spreading them out. However this was happening before things blew up toward the end of the log, so I think we're looking at two separate issues (check scheduling and whatever postgres is doing) that may be interacting. Postgres may be ultimately responsible for things becoming unusable for you, but the bunching of checks can't be helping.
Eric[0] made some changes that specifically target the bunching of checks at the start of timeframes but also affect checks more generally. If you have a test system that's showing load spikes, it would be worth trying out (hint, hint): https://github.com/NagiosEnterprises/na ... ac8dcb6f04
Spenser should have some install commands for you shortly.
One caveat to this fix is that you need to leave Core stopped long enough so that the next_check time for all (or most) checks in retention.dat is in the past (or greater than the check's interval time in the future) so that the new scheduling algorithm takes effect. So stopping Core for a bit over 5 minutes and then starting it should apply the new scheduling to the checks you're running.
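To judge whether Core has been stopped long enough, one rough check is to count how many next_check timestamps in retention.dat are still in the future. This is a sketch under assumptions: `count_future_checks` is a hypothetical helper, it relies on retention.dat holding next_check=<epoch seconds> lines, and it ignores the second condition above (checks scheduled further out than their interval). The path and service commands in the usage comment are the usual defaults and may differ on your system.

```shell
#!/bin/sh
# Hypothetical helper: print how many next_check timestamps in FILE are
# later than NOW (epoch seconds; defaults to the current time).
count_future_checks() {
    awk -F= -v now="${2:-$(date +%s)}" '
        $1 ~ /^[ \t]*next_check$/ && $2 + 0 > now + 0 { n++ }
        END { print n + 0 }
    ' "$1"
}

# Example use (commented out; path and service name are assumptions):
#   service nagios stop
#   sleep 330        # a bit over 5 minutes, as suggested above
#   count_future_checks /usr/local/nagios/var/retention.dat
#   service nagios start
```

A result of 0, or close to it, suggests that most checks will be picked up by the new scheduling when Core starts again.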
Running with debug_level=12 and sending the log again would be awesome. Also, can you post, PM or email your nagios.cfg from the earlier run?
-
sreinhardt
- -fno-stack-protector
- Posts: 4366
- Joined: Mon Nov 19, 2012 12:10 pm
Re: CPU Load Spike daily
Here are some steps to test out the patch. They only reinstall the current version of Core from 2014r1.2 with the scheduler patch applied. Sorry, this is the simplest way to do it, since this isn't a release.
Code: Select all
cd /tmp
rm -rf ./nagios* ./xi-*
wget http://assets.nagios.com/downloads/nagiosxi/xi-latest.tar.gz
tar xzf xi-latest.tar.gz
cd nagiosxi/subcomponents/nagioscore/
rm -f apply-patches
# Download the attached zip and move it into this directory
unzip schedulerpatch.zip
mv scheduler.patch patches/
chmod +x apply-patches
./upgrade
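Before running ./upgrade, a quick sanity check that the unzip and mv steps left the files where the commands above expect them can save a failed run. A minimal sketch: `check_patch_layout` is a hypothetical helper, and the file names are inferred from the steps above rather than from the contents of the zip itself.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the files used by the steps above
# are present in DIR (default: the current directory).
check_patch_layout() {
    dir=${1:-.}
    [ -f "$dir/patches/scheduler.patch" ] || {
        echo "missing patches/scheduler.patch" >&2; return 1; }
    [ -x "$dir/apply-patches" ] || {
        echo "apply-patches is missing or not executable" >&2; return 1; }
    [ -x "$dir/upgrade" ] || {
        echo "upgrade is missing or not executable" >&2; return 1; }
}

# Example use from nagiosxi/subcomponents/nagioscore/ (commented out):
#   check_patch_layout . && ./upgrade
```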