Massive CPU Spikes for 15 min affecting performance

iraba · Post by **iraba** » Wed Jul 01, 2015 8:28 pm

Hi there

System Profile:

profile.zip

Linux Centos 6.6:

root@nagios:~ $ uname -ra
Linux nagios 2.6.32-504.12.2.el6.i686 #1 SMP Wed Mar 11 19:05:53 UTC 2015 i686 i686 i386 GNU/Linux

Manual install, nothing strange that I know of. But I did inherit the system, so there may be 'modifications'.

I posted initially under a different account: https://support.nagios.com/forum/viewto ... =6&t=32785

I've freed up more space, so MySQL isn't throwing errors anymore. But I'm still getting the CPU spikes, just less frequently.

I hacked together a script that logs the processes on the CPU every 2 seconds while the load is above a threshold:

Code: Select all

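#!/bin/bash
# Log the on-CPU processes every 2 seconds while the 1-minute load average
# is above the threshold.

load_threshold=2
end=$((SECONDS+86400))   # stop after 24 hours

while [ "$SECONDS" -lt "$end" ]; do
        # Integer part of the 1-minute load average (first field of /proc/loadavg)
        loadavg=$(awk '{print int($1)}' /proc/loadavg)
        if [ "$loadavg" -gt "$load_threshold" ]
        then
                date >> cpu.log
                # r = running processes only; keep user, %CPU, VSZ, start, CPU time and command
                ps aruxw | awk 'NR>1 {print $1, $3, $5, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19}' >> cpu.log
        fi
        sleep 2
done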

I've attached a partial log of the output called on-cpu.log

on-cpu.log

There are large gaps of 5 minutes where this script should have collected data but nothing responded. It looks like it may have something to do with Postgres? Or jbd2? But I'm not seeing any high iowait times...
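To check the iowait question against the same timeline as the process log, the iowait share could be computed from two samples of the aggregate `cpu` line in /proc/stat. This is only a rough sketch: `iowait_pct` is a made-up helper name, and it assumes the usual field order (user, nice, system, idle, iowait, irq, softirq, steal).

```shell
#!/bin/bash
# Sketch: percentage of CPU time spent in iowait between two samples of the
# aggregate "cpu" line from /proc/stat.
# iowait_pct is a hypothetical helper, not part of the original script.

iowait_pct() {
        # $1 = earlier "cpu ..." line, $2 = later "cpu ..." line
        printf '%s\n%s\n' "$1" "$2" | awk '
                {
                        total = 0
                        # Fields 2..9 = user nice system idle iowait irq softirq steal
                        # (guest time is already counted in user, so it is skipped)
                        for (i = 2; i <= 9 && i <= NF; i++) total += $i
                        t[NR] = total   # all jiffies in this sample
                        w[NR] = $6      # iowait jiffies
                }
                END {
                        dt = t[2] - t[1]
                        if (dt <= 0) { print 0; exit }
                        printf "%d\n", (w[2] - w[1]) * 100 / dt
                }'
}

# Possible use inside the logging loop:
#   prev=$(head -n1 /proc/stat); sleep 2; cur=$(head -n1 /proc/stat)
#   echo "iowait: $(iowait_pct "$prev" "$cur")%" >> cpu.log
```

If the gaps line up with a high iowait percentage, that would point at disk (jbd2 is the ext4 journal thread); if not, the box is more likely starved of CPU.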

Load graph from the Nagios GUI:

load spikes.png

As the jump in the 15-minute load is almost instantaneous at the start (and the values are exactly the same for 15 minutes), I think the check isn't running at all until the end of the spike, and the graph then 'fills in' the missing values from the current load.

Thanks,
Ira.

abrist · Post by **abrist** » Thu Jul 02, 2015 9:46 am

You have a large number of handle_nagioscore_event scripts running:

Code: Select all

/usr/bin/php /usr/local/nagiosxi/scripts/handle_nagioscore_event.php --handler-type=service --host=xxx --service=LANDATA

I assume there are plenty of these at all times:

Code: Select all

ps -aef | grep '[h]andle_nagioscore_event' | wc -l

And they are generating a lot of load.
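To see which objects the handlers are firing for, the process list can be tallied by handler type and host. A minimal sketch, assuming the command lines look like the one quoted above; `count_event_handlers` is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: tally handle_nagioscore_event.php processes per handler type and host.
# Reads process command lines on stdin (for example from `ps -eo args`).
# count_event_handlers is a hypothetical helper name.

count_event_handlers() {
        grep 'handle_nagioscore_event\.php' |
        awk '{
                type = "?"; host = "?"
                for (i = 1; i <= NF; i++) {
                        # Strip the option prefix and keep only the value
                        if ($i ~ /^--handler-type=/) type = substr($i, 16)
                        if ($i ~ /^--host=/)         host = substr($i, 8)
                }
                count[type " " host]++
        }
        END { for (k in count) print count[k], k }' |
        sort -k1,1nr -k2
}

# Possible use:
#   ps -eo args | count_event_handlers | head
```

Each output line is a count followed by the handler type and host, busiest first. A handful of hosts dominating the list would suggest specific objects, while an even spread would point more toward a template or a global handler.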
Do you use a global event handler? Do you have event handlers set on many objects?
I have noticed some odd behavior in the past: if event handlers were enabled on a template but no event handler was selected, checks using that template would cause load spikes from the handle_nagioscore_event.php script. If you do not use event handlers, could you try turning them off on the templates xiwizard_generic_host and xiwizard_generic_service?
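To find candidates for that pattern, the object config files could be scanned for definitions that set `event_handler_enabled 1` without an `event_handler` directive in the same block. This is only a rough sketch: `find_handlerless_objects` is a made-up helper, it only sees explicit settings within a single `define` block (not inherited or default values), and it assumes each closing brace sits on its own line.

```shell
#!/bin/bash
# Sketch: list object definitions that explicitly enable event handlers
# but name no event handler command in the same block.
# find_handlerless_objects is a hypothetical helper name.

find_handlerless_objects() {
        # $@ = Nagios object config files to scan
        awk '
                /^[ \t]*define[ \t]+[a-z]+/ {
                        inblock = 1; name = "(unnamed)"; enabled = 0; handler = 0
                        next
                }
                inblock && $1 == "name"                  { name = $2 }
                inblock && $1 == "event_handler_enabled" { enabled = ($2 == 1) }
                inblock && $1 == "event_handler"         { handler = 1 }
                inblock && $1 == "}" {
                        if (enabled && !handler) print name
                        inblock = 0
                }
        ' "$@"
}

# Possible use (the path is an assumption, adjust to where your templates live):
#   find_handlerless_objects /usr/local/nagios/etc/*.cfg
```

Anything it prints is only a starting point for review in the template settings, since a handler may still be inherited from a parent template or set globally.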

Nagios Support Forum
