very high CPU load spikes
Hi,
I've been experiencing frequent, brief, very high CPU load spikes since upgrading (XI 5.8.7 / CentOS 7.9) a few months ago, and the problem is getting worse. In the wake of these spikes I'm seeing timeouts and so many queued checks that the system stops processing and freezes up.
Load averages shoot up to 300 or higher as an "explosion" of Python instances starts; then, typically within 3-5 minutes, things drift back to normal.
I'm not seeing anything to explain this in the logs; I've searched the forums and not found anything useful or directly applicable to this situation.
Can I DM a profile.zip and other details to someone?
Rob
Re: very high CPU load spikes
Hi Rob,
How frequently is this occurring?
Go ahead and PM the profile to me. If you can generate it while the event is occurring, that would be even better.
Also, the next time you observe the event, please run the following as root and PM the file /tmp/info.txt to me as well:
Code:
ps -axef > /tmp/info.txt
printf "\n========================== `date` ==========================\n" >> /tmp/info.txt
ss -na >> /tmp/info.txt
printf "\n========================== `date` ==========================\n" >> /tmp/info.txt
sar -A >> /tmp/info.txt
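Since the spikes only last a few minutes, it may help to let the capture start by itself. The following is a minimal sketch, assuming a Linux host with /proc/loadavg; LOAD_THRESHOLD, the function names, and the timestamped file name are illustrative assumptions and not part of Nagios XI. It wraps the same three collectors and runs them only when the 1-minute load average crosses the threshold.

```shell
#!/bin/sh
# Sketch only: LOAD_THRESHOLD, the function names, and the output file
# naming are illustrative assumptions, not part of Nagios XI.
LOAD_THRESHOLD="${LOAD_THRESHOLD:-50}"

# Exit 0 when load average $1 is strictly greater than threshold $2.
load_exceeds() {
    awk -v load="$1" -v limit="$2" 'BEGIN { exit !(load + 0 > limit + 0) }'
}

# Print the 1-minute load average (first field of /proc/loadavg on Linux).
current_load() {
    cut -d ' ' -f 1 /proc/loadavg
}

# Run the three collectors from the post into a timestamped file
# and print the name of the file that was written.
capture_info() {
    out="/tmp/info-$(date +%Y%m%d-%H%M%S).txt"
    ps -axef > "$out"
    printf '\n========== %s ==========\n' "$(date)" >> "$out"
    ss -na >> "$out"
    printf '\n========== %s ==========\n' "$(date)" >> "$out"
    sar -A >> "$out"
    echo "$out"
}

# Poll every 10 seconds; after a capture, pause 60 seconds to limit
# the number of files written during a single spike.
watch_load() {
    while sleep 10; do
        if load_exceeds "$(current_load)" "$LOAD_THRESHOLD"; then
            capture_info
            sleep 60
        fi
    done
}
```

To use it, save the block to a file, source it as root, and call watch_load (for example inside a screen or tmux session). The default threshold of 50 is a guess and should be set well above the normal load of the server.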
Thanks and Best Regards,
Keith
Re: very high CPU load spikes
Hi Keith,
I have sent you a PM as discussed.
Rob
Re: very high CPU load spikes
Your system is showing IO wait spikes (taken from the top command output):
Code:
7.1 wa
It could be caused by a piece of security software such as CrowdStrike Falcon Sensor, which we see running on the system:
Code:
root 687 1 0 Feb09 ? 00:00:00 /opt/CrowdStrike/falcond
root 688 687 0 Feb09 ? 01:09:13 falcon-sensor
- I would try disabling it to see if that resolves the problem, as that would be my first guess at where the IO wait is coming from.
Anything over 5% will generally cause global performance issues on a system, because it means the CPU spends that percentage of its time waiting on storage/IO before it can continue with the next request. The symptoms you see are the CPU backing up (increasing CPU usage), the load average increasing, and so on. You would usually see other anomalies as well (checks timing out, etc.) that do not seem related but are.
Let's take a look at the size of some things. Please send the output of these commands, run as root:
Code:
ulimit -a
su -s /bin/bash -c 'ulimit -a' nagios
su -s /bin/bash -c 'ulimit -a' mysql
su -s /bin/bash -c 'ulimit -a' apache
mysql -uroot -pnagiosxi nagios -e 'SELECT COUNT(*) FROM nagios_objects;'
mysql -uroot -pnagiosxi --table -e "SELECT table_name AS 'Table', round(((data_length + index_length) / 1024 / 1024), 2) 'Size in MB' FROM information_schema.TABLES WHERE table_schema IN ('nagios', 'nagiosql', 'nagiosxi');"
- NOTE: You may need to adjust the -uroot and -pnagiosxi in the last two commands if you've changed the root mysql password
Since you are seeing the IO wait, here are some things I would recommend that can help:
1. Set up a RAM disk:
https://assets.nagios.com/downloads/nag ... giosXI.pdf
2. Edit your /usr/local/nagios/etc/nagios.cfg and set this:
Code:
use_syslog=0
- NOTE: This is duplicate data from /usr/local/nagios/var/nagios.log so you'll still have access to the logs
Then restart nagios:
Code:
systemctl restart nagios
3. Set ALL THREE Optimize Intervals to 300 or higher in Admin > Performance Settings > Databases tab.
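To check the 5% IO wait rule of thumb without watching top, the figure can be derived from two samples of the aggregate cpu line in /proc/stat. This is a minimal sketch, assuming a Linux kernel where the fields after the "cpu" label are user, nice, system, idle, iowait, irq, softirq, steal; the function names are hypothetical.

```shell
#!/bin/sh
# Sketch only: function names are illustrative assumptions.

# Print the aggregate "cpu" line from /proc/stat (Linux only).
sample_cpu() {
    grep '^cpu ' /proc/stat
}

# iowait_pct "<first cpu line>" "<second cpu line>"
# Prints the percentage of CPU time spent in iowait between the two samples.
# Awk fields: $2 user, $3 nice, $4 system, $5 idle, $6 iowait,
#             $7 irq, $8 softirq, $9 steal.
iowait_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        { total = 0; for (i = 2; i <= 9; i++) total += $i; io = $6 }
        NR == 1 { t0 = total; w0 = io }
        NR == 2 {
            d = total - t0
            if (d > 0) printf "%.1f\n", 100 * (io - w0) / d
            else print "0.0"
        }'
}

# iowait_over SECONDS: sample twice, SECONDS apart, and print the percentage.
iowait_over() {
    first=$(sample_cpu)
    sleep "$1"
    second=$(sample_cpu)
    iowait_pct "$first" "$second"
}
```

For example, iowait_over 5 prints the average IO wait percentage over the next five seconds. Running it before and after a change such as the RAM disk gives a rough comparison against the 5% figure.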
Re: very high CPU load spikes
I am working on this and will follow up in a day or two (notably our Security people have major issues with me interfering with their Crowdstrike), thank you for being patient.
Rob
Re: very high CPU load spikes
No problem, we'll keep an eye out for your update.
Re: very high CPU load spikes
Sent an update via PM
Re: very high CPU load spikes
I apologize, but can you get the output of this one while the CPU spike is occurring? The other command doesn't include the CPU/memory use of each process, and this one will give us the information we need.
Code:
ps -auxef > /tmp/info.txt
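In addition to the full listing above (not as a replacement for it), a per-command summary can make an "explosion" of identical processes easier to spot. This is a minimal sketch, assuming the procps version of ps found on CentOS; summarize_procs is a hypothetical helper name. It groups processes by command name and totals their CPU and memory percentages.

```shell
#!/bin/sh
# Sketch only: summarize_procs is an illustrative helper name.
# Reads lines of "<%cpu> <%mem> <command name>" on stdin and prints
# "<count> <total %cpu> <total %mem> <command name>", most numerous first.
summarize_procs() {
    awk '{ n[$3]++; cpu[$3] += $1; mem[$3] += $2 }
         END { for (c in n) printf "%d %.1f %.1f %s\n", n[c], cpu[c], mem[c], c }' |
        sort -k1,1nr -k4,4
}
```

During a spike it could be run as: ps -eo pcpu,pmem,comm --no-headers | summarize_procs | head -n 15 >> /tmp/info.txt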
Re: very high CPU load spikes
Hi Sean,
Certainly; I'll follow up after the next spike occurs.
Rob
Re: very high CPU load spikes
Thank you, received. I'll post an update shortly after this remote session I have.