Recently we've noticed that the localhost Load alarm has been triggering pretty regularly.
What should I look for that could be causing this on the Nagios XI server?
This is a VM in ESXi.
Linux Distribution and version?
32 or 64bit?
VMware Image or Manual Install of XI?
Are there special configurations on your system, e.g., is Gnome installed? Are you using a proxy? Are you using SSL?
If you are encountering multiple issues that may not be related, start a thread for each issue.
1) CentOS Linux release 7.6.1810 (Core)
2) 64bit
3) Manual Install of XI
4) SSL is being used for https
              total        used        free      shared  buff/cache   available
Mem:          15867        3747         371         800       11747       10810
Swap:          3967         259        3708
24 vCPUs
processor : 23
vendor_id : GenuineIntel
cpu family : 6
model : 63
model name : Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz
Installed Version: 5.5.8
I have the load log attached
localhost current load spikes to 200 on 1 minute average
Re: localhost current load spikes to 200 on 1 minute average
Without knowing which process is generating the load, it is hard to guess what the next step is.
You can run the top command in a shell and watch it to see what is causing the load.
Or you can run this command in a shell, which runs top once a minute and appends the output, followed by a timestamp, to /tmp/top.txt (the -b flag puts top in batch mode so its output can be written to a file):
while true; do top -b -n1 >>/tmp/top.txt; date >>/tmp/top.txt; sleep 60; done
Let that run until the load increases, and then you can see which application is causing the issue.
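A minute-by-minute log grows quickly when spikes are weeks apart. As a rough sketch only, the loop could record a snapshot just when the 1-minute load is above a threshold. The function names load_exceeds and snapshot_if_busy and the threshold of 24 (chosen to match the 24 vCPUs mentioned above) are assumptions for illustration, not something from the original posts.

```shell
#!/bin/sh
# Hypothetical helpers: capture top output only while the load is high.

# load_exceeds THRESHOLD [LOADAVG_FILE]
# Succeeds when the 1-minute load average is greater than THRESHOLD.
load_exceeds() {
    threshold=$1
    file=${2:-/proc/loadavg}
    # The first field of /proc/loadavg is the 1-minute load average.
    load=$(cut -d' ' -f1 "$file")
    awk -v l="$load" -v t="$threshold" 'BEGIN { exit !(l > t) }'
}

# snapshot_if_busy THRESHOLD OUTFILE [LOADAVG_FILE]
# Appends a timestamp and the first 30 lines of top to OUTFILE when busy.
snapshot_if_busy() {
    threshold=$1
    out=$2
    file=${3:-/proc/loadavg}
    if load_exceeds "$threshold" "$file"; then
        {
            date
            top -b -n1 | head -n 30
        } >>"$out"
    fi
}
```

It would be driven by the same kind of loop, for example: while true; do snapshot_if_busy 24 /tmp/top.txt; sleep 60; done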
Be sure to check out our Knowledgebase for helpful articles and solutions!
Re: localhost current load spikes to 200 on 1 minute average
I'm glad you responded. I will collect the output and post my findings.
Re: localhost current load spikes to 200 on 1 minute average
Let us know what you find out.
Re: localhost current load spikes to 200 on 1 minute average
I was also looking at running this, but with the output going to /tmp/processes.txt:
watch "ps -eo pcpu,pid,user,args | sort -k1 -r -n | head -20"
Like I said, this is crazy, but since we ran the update the load has only spiked to 50 once, so it may be 30 days before it happens again.
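Since watch is meant for an interactive terminal, one way to get the same ps listing into /tmp/processes.txt is a small logging wrapper. This is only a sketch; log_snapshot is a hypothetical helper name, not part of Nagios XI or of the commands posted above.

```shell
#!/bin/sh
# Hypothetical helper: append a timestamped run of any command to a file.

# log_snapshot OUTFILE COMMAND [ARGS...]
log_snapshot() {
    out=$1
    shift
    {
        # Header line so each sample can be matched to a time on the load graph.
        printf '=== %s ===\n' "$(date '+%Y-%m-%d %H:%M:%S')"
        "$@"
    } >>"$out" 2>&1
}
```

For example, run once a minute from a loop or cron: log_snapshot /tmp/processes.txt sh -c 'ps -eo pcpu,pid,user,args | sort -k1 -r -n | head -20'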
Re: localhost current load spikes to 200 on 1 minute average
Well, we may not be able to find out what was causing it, but if you find something, let us know.
Re: localhost current load spikes to 200 on 1 minute average
The issue has cropped up again. I am running the command above, writing to /tmp/top.txt, and will post the results the next time the load spikes.
Re: localhost current load spikes to 200 on 1 minute average
Here is the output during the event this afternoon.
Attached is a PNG of the localhost load last 4 hours.
https://pastebin.com/YTA4Tatq
Above is the pastebin output of top.txt during the event.
npolovenko (Support Tech, Posts: 3457, Joined: Mon May 15, 2017 5:00 pm)
Re: localhost current load spikes to 200 on 1 minute average
@acentek, here are the top resource consumers:
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 9338 mysql     20   0   10.4g   1.3g   6608 S 106.2  8.7   2549:28 mysqld
19930 nagios    20   0  221520  20864   3976 R 100.0  0.1   0:00.24 python
19883 nagios    20   0  221520  20856   3956 S  81.2  0.1   0:00.23 python
19854 nagios    20   0  221520  20848   3956 S  62.5  0.1   0:00.20 python
10487 root      20   0  348212 225164   2088 S  50.0  1.4   0:03.71 mrtg
A spike in MySQL load could mean that the server was running a scheduled backup and utilizing the database.
Let's check which python plugins are taking up the CPU. You can look up a process by its PID number.
For example, ps -ef | grep 19930.
This should give us more info about this process:
19930 nagios    20   0  221520  20864   3976 R 100.0  0.1   0:00.24 python
Also, please go over the mrtg configuration folder /var/lib/mrtg/ and delete any .cfg files that are no longer used. Each .cfg file carries the name of a host, so if you find that some of these hosts (switches and routers in particular) no longer exist, you may remove the corresponding .cfg files. That should lower the system load.
https://support.nagios.com/kb/article.php?id=510
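The python processes in that listing show only a fraction of a second of CPU time, so a given PID may already be gone by the time ps -ef | grep is run. A possible alternative, sketched here under that assumption, is to tally command lines across a sample and see which plugin shows up most often. count_commands is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: count identical lines on stdin, most frequent first.

# count_commands [LIMIT]
count_commands() {
    limit=${1:-10}
    # uniq -c needs sorted input; the second sort orders by the count column.
    sort | uniq -c | sort -k1,1nr | head -n "$limit"
}
```

For example: ps -u nagios -o args= | count_commands 10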
SteveBeauchemin (Posts: 524, Joined: Mon Oct 14, 2013 7:19 pm)
Re: localhost current load spikes to 200 on 1 minute average
I also run on ESX. I had really bad performance for a couple months until I noticed something.
The other VMs sharing the ESX host I run on were causing my problems. There were mail servers (many disk drives) and MSSQL systems (high CPU load) that were stealing all my resources. I felt it worst when the nightly backups ran, but also when applying changes.
To solve this we implemented ESX affinity rules stating that if Nagios was on a specific ESX host, no mail or SQL servers were allowed on that host.
The best indicator of this was that the Nagios Server Statistics dashlet would show I/O Wait in yellow or red.
This may or may not help you.
Just remember, in ESX you are sharing resources. So don't be afraid to look at other systems for resource consumption.
Good Luck chasing this.
Steve B
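Besides the dashlet, I/O wait can be read from inside the guest. The sketch below computes the share of CPU time spent in iowait and steal from the aggregate cpu line of /proc/stat. cpu_wait_steal is a hypothetical helper name, the figures are averages since boot (so for a spike you would sample twice and compare), and steal may simply read zero depending on what the hypervisor reports to the guest.

```shell
#!/bin/sh
# Hypothetical helper: percentage of CPU time in iowait and steal since boot.

# cpu_wait_steal [STAT_FILE]
cpu_wait_steal() {
    file=${1:-/proc/stat}
    # Fields on the "cpu" line: user nice system idle iowait irq softirq steal ...
    # Guest time is already counted in user, so only fields 2 through 9 are summed.
    awk '$1 == "cpu" {
        total = 0
        for (i = 2; i <= 9; i++) total += $i
        printf "iowait=%.1f%% steal=%.1f%%\n", 100 * $6 / total, 100 * $9 / total
    }' "$file"
}
```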
XI 5.7.3 / Core 4.4.6 / NagVis 1.9.8 / LiveStatus 1.5.0p11 / RRDCached 1.7.0 / Redis 3.2.8 /
SNMPTT / Gearman 0.33-7 / Mod_Gearman 3.0.7 / NLS 2.0.8 / NNA 2.3.1 /
NSClient 0.5.0 / NRPE Solaris 3.2.1 Linux 3.2.1 HPUX 3.2.1