localhost current load spikes to 200 on 1 minute average
Posted: Fri Jan 25, 2019 11:05 am
by acentek
OK, so recently we've noticed that the localhost Load alarm has been triggering pretty regularly.
What can I look for that might be causing this on the Nagios XI server?
This is a VM in ESXi.
Linux Distribution and version?
32 or 64bit?
VMware Image or Manual Install of XI?
Are there special configurations on your system, i.e., is Gnome installed? Are you using a proxy? Are you using SSL?
**If you are encountering multiple issues that may not be related, start a thread for each issue
1) CentOS Linux release 7.6.1810 (Core)
2) 64bit
3) Manual Install of XI
4) SSL is being used for https
        total   used   free   shared   buff/cache   available
Mem:    15867   3747    371      800        11747       10810
Swap:    3967    259   3708
24 vCPUs
processor : 23
vendor_id : GenuineIntel
cpu family : 6
model : 63
model name : Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz
Installed Version: 5.5.8
I have the load log attached
Re: localhost current load spikes to 200 on 1 minute average
Posted: Fri Jan 25, 2019 2:32 pm
by tgriep
Without knowing which process is generating the load, it is hard to guess what the next step is.
You can run the top command in a shell and watch it to see what is causing the load.
Or, you can run this command in a shell. It runs top once a minute and appends the output, followed by a time stamp, to the /tmp/top.txt file.
while true; do top -b -n1 >>/tmp/top.txt ; date >>/tmp/top.txt ; sleep 60; done
You can run that for a while until the load increases and then you can see which application is causing the issue.
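If the spikes are weeks apart, logging every minute produces a very large file. A minimal sketch of a variant that only records a snapshot while the 1-minute load is above a limit is shown here. The THRESHOLD value of 50 and the function names are assumptions for illustration, not something from this thread, and it assumes a Linux host with /proc/loadavg.

```shell
# Sketch: log a batch-mode top snapshot only while the 1-minute load is high.
# THRESHOLD and LOG are illustrative defaults (assumptions), adjust as needed.
THRESHOLD="${THRESHOLD:-50}"
LOG="${LOG:-/tmp/top.txt}"

# Succeeds when the first field of a /proc/loadavg-style line ($1)
# is greater than the limit ($2).
load_exceeds() {
    printf '%s\n' "$1" | awk -v limit="$2" '{ exit !($1 > limit) }'
}

# Append a time stamp and one top snapshot if the load is over the threshold.
capture_if_busy() {
    if load_exceeds "$(cat /proc/loadavg)" "$THRESHOLD"; then
        { date; top -b -n1; } >>"$LOG"
    fi
}

# Usage (runs until interrupted):
#   while true; do capture_if_busy; sleep 60; done
```

The check is kept in its own function so it can be tried against a sample line before leaving the loop running.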
Re: localhost current load spikes to 200 on 1 minute average
Posted: Mon Jan 28, 2019 10:15 am
by acentek
I'm glad you responded. I will collect the output and post my findings.
Re: localhost current load spikes to 200 on 1 minute average
Posted: Mon Jan 28, 2019 10:24 am
by tgriep
Let us know what you find out.
Re: localhost current load spikes to 200 on 1 minute average
Posted: Mon Jan 28, 2019 10:43 am
by acentek
I was also looking at running the following, but with the output written to /tmp/processes.txt:
watch "ps -eo pcpu,pid,user,args | sort -k1 -r -n | head -20"
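Since watch redraws the screen rather than producing a clean log, one way to get the same listing into /tmp/processes.txt might look like the sketch below. The function names busiest and snapshot are made up for illustration, and the loop in the usage comment is only a suggestion.

```shell
# Keep the N busiest processes (default 20) from "pcpu pid user args"
# lines read on stdin, highest CPU first.
busiest() {
    sort -k1,1 -r -n | head -n "${1:-20}"
}

# Append one time-stamped snapshot to a log file
# (default: /tmp/processes.txt).
snapshot() {
    {
        date '+=== %F %T ==='
        # The trailing "=" on each column suppresses the header line.
        ps -eo pcpu=,pid=,user=,args= | busiest 20
    } >>"${1:-/tmp/processes.txt}"
}

# Usage (runs until interrupted):
#   while true; do snapshot; sleep 60; done
```

Each snapshot starts with a "===" line, so the entries around a spike can be found by searching for the time.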
Like I said, this is crazy, but since we ran the update it has only spiked to a load of 50 once, so it may be 30 days before it happens again.
Re: localhost current load spikes to 200 on 1 minute average
Posted: Mon Jan 28, 2019 2:26 pm
by tgriep
Well, we may not be able to find out what was causing it, but if you find something, let us know.
Re: localhost current load spikes to 200 on 1 minute average
Posted: Sun Feb 10, 2019 2:06 pm
by acentek
The issue has cropped up again. I am running the above and outputting to /tmp/top.txt, and will post the results the next time the load spikes.
Re: localhost current load spikes to 200 on 1 minute average
Posted: Sun Feb 10, 2019 9:13 pm
by acentek
Here is the output during the event this afternoon.
Attached is a PNG of the localhost load over the last 4 hours.
https://pastebin.com/YTA4Tatq
Above is the pastebin output of top.txt during the event.
Re: localhost current load spikes to 200 on 1 minute average
Posted: Mon Feb 11, 2019 12:38 pm
by npolovenko
@acentek, here are the top resource consumers:
  PID USER     PR  NI    VIRT    RES   SHR S  %CPU %MEM     TIME+ COMMAND
 9338 mysql    20   0   10.4g   1.3g  6608 S 106.2  8.7   2549:28 mysqld
19930 nagios   20   0  221520  20864  3976 R 100.0  0.1   0:00.24 python
19883 nagios   20   0  221520  20856  3956 S  81.2  0.1   0:00.23 python
19854 nagios   20   0  221520  20848  3956 S  62.5  0.1   0:00.20 python
10487 root     20   0  348212 225164  2088 S  50.0  1.4   0:03.71 mrtg
A spike in MySQL load could mean that the server was running a scheduled backup and utilizing the database.
Let's check which Python plugins are taking up the CPU. You can look up a process by its PID number.
For example:
ps -ef | grep 19930
This should give us more info about this process:
19930 nagios 20 0 221520 20864 3976 R 100.0 0.1 0:00.24 python
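One caveat with grep is that it also matches its own command line and any other line containing the same digits. A small sketch that reads the details straight from /proc instead is shown below. The function name describe_pid is made up for illustration, and it assumes a Linux system with /proc mounted. A plugin with only a fraction of a second of CPU time may already have exited by the time you look, in which case the function reports that.

```shell
# Print the parent PID and full command line of one process, read from /proc.
# describe_pid is a hypothetical helper name, not a standard tool.
describe_pid() {
    pid="$1"
    if [ ! -d "/proc/$pid" ]; then
        echo "PID $pid is no longer running" >&2
        return 1
    fi
    # PPid shows which process launched it (for example a Nagios worker).
    ppid="$(awk '/^PPid:/ { print $2 }' "/proc/$pid/status")"
    # cmdline is NUL-separated, so turn the separators into spaces.
    cmd="$(tr '\0' ' ' <"/proc/$pid/cmdline")"
    printf 'pid=%s ppid=%s cmd=%s\n' "$pid" "$ppid" "$cmd"
}

# Usage:
#   describe_pid 19930
```

The full command line usually includes the plugin path and its arguments, which is what identifies the check behind the process.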
Also, please go over the mrtg configuration folder /var/lib/mrtg/ and delete any .cfg files that are no longer used. Each .cfg file carries the name of a host, so if you find that some of these hosts (switches and routers in particular) no longer exist, you can remove the corresponding .cfg files. That should lower the system load.
https://support.nagios.com/kb/article.php?id=510
Re: localhost current load spikes to 200 on 1 minute average
Posted: Mon Feb 11, 2019 4:23 pm
by SteveBeauchemin
I also run on ESX. I had really bad performance for a couple months until I noticed something.
The other VMs sharing the ESX host I run on were causing my problems. There were mail servers (many disk drives) and MSSQL systems (high CPU load) that were stealing all my resources. I felt it worst when the nightly backups ran, but also when applying changes.
To solve this we implemented ESX affinity rules stating that no mail or SQL servers are allowed on the ESX host that Nagios is running on.
The best indicator of this was that the Nagios Server Statistics dashlet would show I/O Wait in yellow or red.
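If you want to check the same thing from a shell, I/O wait can be worked out from two samples of the aggregate "cpu" line in /proc/stat. This is only a rough sketch. The function names wait_pct and sample_wait are made up for illustration, and the steal figure may simply read zero on some hypervisors and kernels, so treat it as a hint at most.

```shell
# Print iowait and steal as a percentage of all CPU time that elapsed
# between two samples of the aggregate "cpu" line from /proc/stat.
# Fields after the "cpu" label: user nice system idle iowait irq softirq steal
wait_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        { total = 0
          for (i = 2; i <= 9; i++) total += $i
          t[NR] = total; w[NR] = $6; s[NR] = $9 }
        END { d = t[2] - t[1]
              if (d <= 0) exit 1
              printf "iowait=%.1f%% steal=%.1f%%\n",
                     100 * (w[2] - w[1]) / d, 100 * (s[2] - s[1]) / d }'
}

# Take two samples a few seconds apart (default 5) and report the result.
sample_wait() {
    first="$(grep '^cpu ' /proc/stat)"
    sleep "${1:-5}"
    wait_pct "$first" "$(grep '^cpu ' /proc/stat)"
}

# Usage:
#   sample_wait 10
```

Running it during a spike and again when things are quiet gives a quick comparison.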
This may or may not help you.
Just remember, in ESX you are sharing resources. So don't be afraid to look at other systems for resource consumption.
Good Luck chasing this.
Steve B