
Massive CPU Spikes for 15 min affecting performance

Posted: Wed Jul 01, 2015 8:28 pm
by iraba
Hi there

System Profile:
profile.zip
Linux CentOS 6.6:

Code:

root@nagios:~ $ uname -ra
Linux nagios 2.6.32-504.12.2.el6.i686 #1 SMP Wed Mar 11 19:05:53 UTC 2015 i686 i686 i386 GNU/Linux
Manual install, nothing strange that I know of. But I did inherit the system, so there may be 'modifications'.

I posted initially under a different account: https://support.nagios.com/forum/viewto ... =6&t=32785

I've freed up more space, so MySQL isn't throwing errors anymore. But I'm still getting the CPU spikes, just less frequently.

I hacked together a script to log the processes that are on the CPU every 2 secs when a spike starts:

Code:

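#!/bin/bash

# Log the on-CPU processes every 2 seconds while the 1-minute load average
# is above the threshold. Runs for 24 hours.
load_threshold=2
end=$((SECONDS+86400))

while [ "$SECONDS" -lt "$end" ]; do
        # Integer part of the 1-minute load average
        # (cut -c1 would turn a load of 12.30 into 1 and miss the spike)
        loadavg=$(awk '{print int($1)}' /proc/loadavg)
        if [ "$loadavg" -gt "$load_threshold" ]
        then
                date >> cpu.log
                # Skip the ps header line and keep user, %CPU, VSZ, start time, CPU time and command
                ps aruxw | awk 'NR>1 {print $1, $3, $5, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19}' >> cpu.log
        fi
        sleep 2s
done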
I've attached a partial log of the output called on-cpu.log
on-cpu.log
There are large gaps of 5 minutes where this script should have collected data but nothing responded. It looks like it may have something to do with Postgres? Or jbd2? But I'm not seeing any high iowait times...
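One thing that may be worth ruling out here: on Linux the load average counts tasks in uninterruptible sleep (state D) as well as runnable ones, while the `r` flag in `ps aruxw` limits the output to running processes. A minimal sketch of a filter that keeps both states, assuming a procps `ps` that accepts `-eo state,...`; the function name `filter_rd` is made up for illustration:

```bash
#!/bin/bash

# Hypothetical helper: keep only processes that are runnable (R) or in
# uninterruptible sleep (D). Expects "ps -eo state,pid,user,pcpu,args"
# style input on stdin, with the one-character state in the first column.
filter_rd() {
        # NR > 1 drops the ps header line
        awk 'NR > 1 && $1 ~ /^[RD]/'
}

# Example use inside the logging loop (not executed here):
#   ps -eo state,pid,user,pcpu,args | filter_rd >> cpu.log
```

If D-state entries such as jbd2 show up during a spike, that would point toward blocked I/O rather than pure CPU use, even when iowait looks low.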

Load graph from the Nagios GUI:
load spikes.png
As the 15-minute load jumps almost instantaneously at the start (and holds exactly the same values for 15 minutes), I think the check isn't running at all until the end of the spike, and the graph then 'fills in' the missing values from the current load.
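That reasoning rests on how the load average is computed: the 15-minute figure is an exponentially damped average, so it cannot step to a new value instantly. A rough sketch of the damping, assuming a 5-second sampling interval and a starting load of zero; the function name `simulate_load` is hypothetical, and the kernel's fixed-point arithmetic is simplified to floating point:

```bash
#!/bin/bash

# Sketch: simulate an exponentially damped load average.
# Arguments: window in seconds, number of runnable tasks, elapsed seconds.
# Assumes one sample every 5 seconds and a load of 0 at the start.
simulate_load() {
        awk -v window="$1" -v tasks="$2" -v elapsed="$3" 'BEGIN {
                decay = exp(-5 / window)
                load = 0
                for (t = 5; t <= elapsed; t += 5)
                        load = load * decay + tasks * (1 - decay)
                printf "%.2f\n", load
        }'
}

# 15-minute average with 10 runnable tasks after 1 minute, then after 15 minutes
simulate_load 900 10 60
simulate_load 900 10 900
```

Under these assumptions the 15-minute figure reaches only a small fraction of the task count after one minute, so a graph that steps straight to its final value is more consistent with missing samples than with a real change in load.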

Thanks,
Ira.

Re: Massive CPU Spikes for 15 min affecting performance

Posted: Thu Jul 02, 2015 9:46 am
by abrist
You have a large number of handle_nagioscore_event scripts running:

Code:

/usr/bin/php /usr/local/nagiosxi/scripts/handle_nagioscore_event.php --handler-type=service --host=xxx --service=LANDATA 
I assume there are plenty of these at all times:

Code:

ps -aef | grep '[h]andle_nagioscore_event' | wc -l
And they are generating a lot of load.
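To see which hosts and services those handlers are firing for, the process list can be grouped by its arguments. A minimal sketch, assuming the handlers are invoked with `--host=` and `--service=` arguments as in the command line above; the function name `count_handlers` is hypothetical, and names containing spaces would be cut at the first space:

```bash
#!/bin/bash

# Hypothetical helper: read process command lines on stdin and count
# handle_nagioscore_event invocations per host (and service, if present).
count_handlers() {
        grep 'handle_nagioscore_event' |
                grep -oE -- '--host=[^ ]*( --service=[^ ]*)?' |
                sort | uniq -c | sort -rn
}

# Example use (not executed here):
#   ps -eo args | count_handlers | head
```

A handful of host/service pairs dominating the output would suggest a few flapping checks, while an even spread would point more toward a global or template-level setting.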
Do you use a global event handler? Do you have event handlers set on many objects?
I have noticed some odd behavior in the past: if event handlers were enabled on a template but no event handler was selected, checks using that template would cause load spikes through the handle_nagioscore_event.php script. If you do not use event handlers, could you try turning them off on the templates xiwizard_generic_host and xiwizard_generic_service?
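The current settings can also be checked from the shell before changing anything. A minimal sketch, assuming the default source-install layout under /usr/local/nagios/etc (adjust the paths if this system differs); both function names are hypothetical:

```bash
#!/bin/bash

# Hypothetical helper: print the event-handler directives from the Nagios
# main configuration file given as the first argument.
show_event_handler_settings() {
        grep -E '^[[:space:]]*(enable_event_handlers|global_(host|service)_event_handler)[[:space:]]*=' "$1"
}

# Hypothetical helper: list object definitions under a directory that
# explicitly enable event handlers (file name and line number included).
find_enabled_handlers() {
        grep -rnE '^[[:space:]]*event_handler_enabled[[:space:]]+1' "$1"
}

# Example use (not executed here; paths are assumptions):
#   show_event_handler_settings /usr/local/nagios/etc/nagios.cfg
#   find_enabled_handlers /usr/local/nagios/etc
```

This only reads the configuration files, so it is safe to run on the live system.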