Massive CPU Spikes for 15 min affecting performance

iraba
Posts: 1
Joined: Sun Jun 28, 2015 6:47 pm


Post by iraba »

Hi there

System Profile:
profile.zip
Linux CentOS 6.6:

Code:

root@nagios:~ $ uname -ra
Linux nagios 2.6.32-504.12.2.el6.i686 #1 SMP Wed Mar 11 19:05:53 UTC 2015 i686 i686 i386 GNU/Linux
Manual install, nothing strange that I know of. But I did inherit the system, so there may be 'modifications'.

I posted initially under a different account: https://support.nagios.com/forum/viewto ... =6&t=32785

I've freed up more disk space, so MySQL isn't throwing errors anymore. But I'm still getting the CPU spikes, just less frequently.

I hacked together a script to log the processes that are on the CPU every 2 secs when a spike starts:

Code:

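#!/bin/bash
# Log the processes on the CPU every 2 seconds while the 1-minute load
# average is above the threshold. Runs for 24 hours (86400 seconds).

load_threshold=2
end=$((SECONDS+86400))

while [ "$SECONDS" -lt "$end" ]; do
        # Integer part of the 1-minute load average
        # (cut -c1 kept only the first character, so 12.40 was read as 1)
        loadavg=$(awk '{print int($1)}' /proc/loadavg)
        if [ "$loadavg" -gt "$load_threshold" ]
        then
                date >> cpu.log
                # ps "r" lists only running processes; NR>1 skips the header row
                ps aruxw | awk 'NR>1 {print $1, $3, $5, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19}' >> cpu.log
        fi
        sleep 2
done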
I've attached a partial log of the output called on-cpu.log
on-cpu.log
There are large gaps of 5 mins where this script should have collected data but nothing responded. It looks like it may have something to do with Postgres? Or jbd2? But I'm not seeing any high iowait times...
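To put a number on the iowait remark, one option is to sample the aggregate cpu line of /proc/stat twice and compare the two readings. This is only a minimal sketch, assuming the standard Linux field order (user, nice, system, idle, iowait, irq, softirq, steal); the function name iowait_pct is made up for illustration and is not part of the original script.

```shell
#!/bin/bash
# iowait_pct (hypothetical helper): takes two "cpu ..." summary lines from
# /proc/stat (earlier sample, later sample) and prints the whole-number
# percentage of CPU time spent in iowait between them.
iowait_pct() {
        printf '%s\n%s\n' "$1" "$2" | awk '
                {
                        # Sum user..steal (fields 2-9); field 6 is iowait
                        total = 0
                        for (i = 2; i <= 9 && i <= NF; i++) total += $i
                        t[NR] = total
                        w[NR] = $6
                }
                END {
                        d = t[2] - t[1]
                        if (d > 0) printf "%d\n", (w[2] - w[1]) * 100 / d
                        else print 0
                }'
}

# Usage: take two samples 2 seconds apart and compare them
#   before=$(grep '^cpu ' /proc/stat); sleep 2; after=$(grep '^cpu ' /proc/stat)
#   iowait_pct "$before" "$after"
```

If the machine stops scheduling user processes during a spike, a sampler like this may show the same gaps as the logging script, so the readings just before and just after a gap are probably the most useful ones.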

Load graph from the Nagios GUI:
load spikes.png
As the 15min load jump is almost instantaneous at the start (and the values are exactly the same for 15 mins), I think the check isn't running at all until the end of the spike, and the graph then gets 'filled in' with the load at that point.

Thanks,
Ira.
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: Massive CPU Spikes for 15 min affecting performance

Post by abrist »

You have a large number of handle_nagioscore_event scripts running:

Code:

/usr/bin/php /usr/local/nagiosxi/scripts/handle_nagioscore_event.php --handler-type=service --host=xxx --service=LANDATA 
I assume there are plenty of these at all times:

Code:

ps -aef | grep handle_nagioscore_event | wc -l
And they are generating a lot of load.
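One caveat with piping ps through grep: the grep process can show up in its own results and inflate the count by one. A small sketch that avoids this with the bracket trick (the function name count_handlers is just illustrative):

```shell
#!/bin/bash
# count_handlers (hypothetical helper): reads command lines on stdin and
# counts those that run handle_nagioscore_event.php. The pattern is written
# as "[h]andle..." so that the grep process itself, whose own arguments
# contain the literal text "[h]andle", is not matched.
count_handlers() {
        grep -c '[h]andle_nagioscore_event\.php'
}

# Usage: count the handler scripts currently running
#   ps -eo args | count_handlers
```

Running it a few times during a spike and a few times outside one should show whether the number of handlers actually tracks the load.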
Do you use a global event handler? Do you have event handlers set on many objects?
I have noticed some odd behavior in the past: if event handlers were enabled on a template but no event handler was selected, checks using that template would cause load spikes from the handle_nagioscore_event.php script. If you do not use event handlers, could you try turning them off on the templates xiwizard_generic_host and xiwizard_generic_service?
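A quick way to answer the global event handler question is to look at the relevant directives in the main Nagios Core config file. The sketch below assumes the config lives at /usr/local/nagios/etc/nagios.cfg, which may not be true on an inherited system; the function name is made up for illustration.

```shell
#!/bin/bash
# show_event_handler_settings (hypothetical helper): prints the uncommented
# event handler directives from a Nagios Core main config file.
# The default path below is an assumption; pass the real path as $1.
show_event_handler_settings() {
        local cfg=${1:-/usr/local/nagios/etc/nagios.cfg}
        grep -E '^[[:space:]]*(enable_event_handlers|global_host_event_handler|global_service_event_handler)[[:space:]]*=' "$cfg"
}

# Usage:
#   show_event_handler_settings
#   show_event_handler_settings /path/to/nagios.cfg
```

This only covers the global settings; event handlers enabled on individual templates or objects would still need to be checked in their own definitions.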
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.