Nagios 2014R2.3 on VM HIGH LOAD SPIKES

lmiltchev
Bugs find me
Posts: 13589
Joined: Mon May 23, 2011 12:15 pm

Re: Nagios 2014R2.3 on VM HIGH LOAD SPIKES

Post by lmiltchev »

Are these VMs running on a server that has other VMs that could have scheduled jobs or something monopolizing the resources on the machine?
carlos.atos, I believe swilkerson wanted to ask if you had something else (besides these Nagios VMs) running, perhaps software doing backups, etc.
Maybe, as a test, you could disable the services and hosts making use of MRTG and temporarily stop MRTG. If you still get load spikes, at least you know it's not caused by MRTG.
You can do this if you don't mind not monitoring these hosts/services for a while.
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: Nagios 2014R2.3 on VM HIGH LOAD SPIKES

Post by abrist »

We really need to know what processes are the main offenders. I am thinking you need an event handler that dumps top output to a file when the state of the load average check changes to CRITICAL. Something like:

Code:

#!/bin/bash
# Save a one-shot, batch-mode top snapshot (with full command lines) to a timestamped file.
DATE=$(date +%s)
top -n 1 -b -c > "/tmp/top_${DATE}.txt"
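
The script above writes a snapshot every time it is called. A minimal sketch of the "only when the check goes CRITICAL" part follows, assuming the event handler command is defined so that Nagios passes the `$SERVICESTATE$` macro as the first argument. The function names and the `OUT_DIR` variable are hypothetical, not part of Nagios.

```shell
#!/bin/bash
# Hypothetical event handler sketch: take a top snapshot only when the
# load check reports CRITICAL. Assumes $SERVICESTATE$ arrives as "$1".

OUT_DIR="${OUT_DIR:-/tmp}"   # assumed output directory, override if needed

# Succeeds only when the given service state is CRITICAL.
should_capture() {
    [ "$1" = "CRITICAL" ]
}

# Write one batch-mode top snapshot to a timestamped file and print its path.
capture_top() {
    local out
    out="${OUT_DIR}/top_$(date +%s).txt"
    top -n 1 -b -c > "$out"
    echo "$out"
}

# Do nothing for OK, WARNING, UNKNOWN, or a missing argument.
if should_capture "$1"; then
    capture_top
fi
```

Keeping the state test in its own function makes it easy to extend later, for example to also react to WARNING.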
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
carlos.atos
Posts: 29
Joined: Mon Nov 10, 2014 1:08 pm

Re: Nagios 2014R2.3 on VM HIGH LOAD SPIKES

Post by carlos.atos »

Hi guys,

I finally caught a load spike.
This is the top output:

Code:

top - 19:27:03 up 1 day, 10:00,  1 user,  load average: 7.94, 5.11, 2.80
Tasks: 219 total,  14 running, 205 sleeping,   0 stopped,   0 zombie
Cpu(s): 76.3%us, 23.3%sy,  0.0%ni,  0.1%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Mem:   3924212k total,  2759612k used,  1164600k free,   140016k buffers
Swap:  2064380k total,        0k used,  2064380k free,   329856k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                       
  690 apache    20   0  456m  38m 8512 R 28.8  1.0   6:53.23 httpd                                                                         
18304 nagios    20   0  293m 7400 5168 R 17.5  0.2   0:00.53 php                                                                           
  722 postgres  20   0  215m 6436 4040 S 16.2  0.2   0:40.89 postmaster                                                                    
18053 nagios    20   0  321m  24m 9.9m R 15.2  0.7   0:03.04 php                                                                           
  694 apache    20   0  453m  37m 8600 R 14.9  1.0   7:03.12 httpd                                                                         
18306 nagios    20   0  311m  20m 7500 R 14.9  0.5   0:00.45 php                                                                           
18305 nagios    20   0  312m  21m 7636 S 14.2  0.6   0:00.43 php                                                                           
28914 apache    20   0  454m  38m 8248 S 14.2  1.0   3:19.89 httpd                                                                         
18310 nagios    20   0  308m  17m 7472 R 13.6  0.5   0:00.41 php                                                                           
18309 nagios    20   0  306m  15m 7448 R 12.9  0.4   0:00.39 php                                                                           
18314 nagios    20   0  177m 4112 2936 R 11.9  0.1   0:00.36 php                                                                           
  916 apache    20   0  453m  37m 8544 R 10.9  1.0   6:55.11 httpd                                                                         
26971 postgres  20   0  215m 5932 3620 S 10.6  0.2   0:20.78 postmaster                                                                    
 1594 apache    20   0  453m  38m 8416 S 10.3  1.0   5:35.44 httpd                                                                         
  695 apache    20   0  454m  38m 8500 S  9.9  1.0   5:54.28 httpd                                                                         
 2937 apache    20   0  454m  39m 8520 S  9.9  1.0   6:18.88 httpd                                                                         
25379 apache    20   0  454m  38m 8268 S  9.9  1.0   3:16.90 httpd                                                                         
  692 apache    20   0  454m  38m 8468 S  9.6  1.0   6:55.55 httpd                                                                         
  693 apache    20   0  453m  37m 8480 S  9.6  1.0   5:46.33 httpd                                                                         
18315 nagios    20   0 64756 2784 2100 R  9.6  0.1   0:00.29 php                                                                           
  691 apache    20   0  454m  39m 8492 S  9.3  1.0   6:38.58 httpd                                                                         
 3865 apache    20   0  453m  37m 8448 S  9.3  1.0   6:13.77 httpd                                                                         
  696 apache    20   0  454m  38m 8400 S  8.9  1.0   6:55.90 httpd                                                                         
18316 nagios    20   0  300m 9256 6068 R  8.9  0.2   0:00.27 php                                                                           
10317 apache    20   0  465m  47m 8424 S  8.3  1.3   6:04.00 httpd                                                                         
12165 apache    20   0  454m  38m 8424 S  8.3  1.0   5:09.06 httpd                                                                         
12382 postgres  20   0  215m 6276 3880 S  6.9  0.2   0:37.16 postmaster                                                                    
  697 apache    20   0  454m  38m 8436 S  6.6  1.0   6:21.47 httpd                                                                         
20305 apache    20   0  454m  38m 8296 R  6.6  1.0   3:49.02 httpd                                                                         
26957 apache    20   0  454m  38m 8224 S  5.3  1.0   2:34.73 httpd                                                                         
18294 root      20   0  136m 1740  988 S  5.0  0.0   0:00.15 crond                                                                         
18296 root      20   0  136m 1740  988 S  5.0  0.0   0:00.15 crond                                                                         
18292 root      20   0  136m 1740  988 S  4.0  0.0   0:00.12 crond                                                                         
18295 root      20   0  136m 1740  988 S  4.0  0.0   0:00.12 crond                                                                         
 1456 root      20   0  243m 3472 1052 S  3.3  0.1   1:13.07 rsyslogd                                                                      
18300 nagios    20   0  103m 1264 1096 S  2.3  0.0   0:00.07 sh                                                                            
18044 nagios    20   0  312m  22m 8100 S  2.0  0.6   0:02.36 php                                                                           
18301 nagios    20   0  103m 1268 1096 S  2.0  0.0   0:00.06 sh                                                                            
18312 nagios    20   0  103m 1268 1096 S  2.0  0.0   0:00.06 sh                                                                            
18313 nagios    20   0  103m 1264 1096 S  2.0  0.0   0:00.06 sh                                                                            
18297 root      20   0  136m 1740  988 S  1.3  0.0   0:00.04 crond                                                                         
18308 nagios    20   0  103m 1268 1096 S  1.3  0.0   0:00.04 sh                                                                            
 1684 mysql     20   0 2177m  41m 6720 S  1.0  1.1   9:41.05 mysqld                                                                        
18293 root      20   0  136m 1740  988 S  1.0  0.0   0:00.03 crond                                                                         
18302 nagios    20   0  103m 1268 1096 S  1.0  0.0   0:00.03 sh                                                                            
18303 nagios    20   0  103m 1264 1096 S  1.0  0.0   0:00.03 sh                                                                            
 1664 postgres  20   0  215m 6212 3824 S  0.7  0.2   0:37.16 postmaster                                                                    
18298 root      20   0  136m 1740  988 S  0.7  0.0   0:00.02 crond                                                                         
18299 root      20   0  136m 1740  988 S  0.7  0.0   0:00.02 crond                                                                         
   22 root      20   0     0    0    0 S  0.3  0.0   0:26.39 events/3                                                                      
  713 postgres  20   0  215m 6224 3836 S  0.3  0.2   0:43.08 postmaster        
ps aux:

Code:

os   18654  1.0  0.0 106060  1268 ?        Ss   19:28   0:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/sysstat.php > /usr/loc
nagios   18655 23.0  0.5 319508 22056 ?        S    19:28   0:00 /usr/bin/php -q /usr/local/nagiosxi/cron/nom.php
nagios   18656 29.5  0.2 308936 11200 ?        R    19:28   0:00 /usr/bin/php -q /usr/local/nagiosxi/cron/reportengine.php
nagios   18657 23.0  0.5 319280 21724 ?        R    19:28   0:00 /usr/bin/php -q /usr/local/nagiosxi/cron/sysstat.php
nagios   18658 23.0  0.5 319644 22700 ?        S    19:28   0:00 /usr/bin/php -q /usr/local/nagiosxi/cron/cmdsubsys.php
nagios   18659  2.5  0.0 106060  1268 ?        Ss   19:28   0:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/cleaner.php > /usr/loc
nagios   18660  5.0  0.0 106060  1268 ?        Ss   19:28   0:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/feedproc.php > /usr/lo
nagios   18661 22.0  0.5 319652 22280 ?        S    19:28   0:00 /usr/bin/php -q /usr/local/nagiosxi/cron/perfdataproc.php
nagios   18662  1.5  0.0 106060  1264 ?        Ss   19:28   0:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/eventman.php > /usr/lo
nagios   18663 23.0  0.5 319552 22044 ?        R    19:28   0:00 /usr/bin/php -q /usr/local/nagiosxi/cron/cleaner.php
nagios   18664 21.5  0.1 285908  7140 ?        R    19:28   0:00 /usr/bin/php -q /usr/local/nagiosxi/cron/eventman.php
nagios   18665 25.0  0.4 313644 15872 ?        R    19:28   0:00 /usr/bin/php -q /usr/local/nagiosxi/cron/feedproc.php
postgres 18666 15.0  0.0 220108  3444 ?        Rs   19:28   0:00 postgres: nagiosxi nagiosxi [local] idle          
postgres 18667  4.0  0.1 220356  6068 ?        Ss   19:28   0:00 postgres: nagiosxi nagiosxi [local] idle          
root     18670  6.0  0.0 110232  1160 pts/0    R+   19:28   0:00 ps aux
postgres 18671  4.0  0.0 218892  1304 ?        R    19:28   0:00 /usr/bin/postmaster -p 5432 -D /var/lib/pgsql/data
apache   20305  3.9  1.0 465196 39704 ?        S    17:49   3:53 /usr/sbin/httpd
postgres 20307  0.4  0.1 220300  5976 ?        Ss   17:49   0:25 postgres: nagiosxi nagiosxi [local] idle          
apache   25379  4.0  1.0 465228 39776 ?        S    18:04   3:21 /usr/sbin/httpd
postgres 25404  0.4  0.1 220300  5988 ?        Ss   18:04   0:22 postgres: nagiosxi nagiosxi [local] idle          
apache   26957  3.3  1.0 465460 39844 ?        S    18:09   2:40 /usr/sbin/httpd
postgres 26971  0.4  0.1 220300  5932 ?        Ss   18:09   0:21 postgres: nagiosxi nagiosxi [local] idle          
apache   28914  4.6  1.0 464932 39376 ?        S    18:15   3:22 /usr/sbin/httpd
This spike lasted only about 10 minutes. I hope you can get more info from this.
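
One way to get more out of a saved snapshot like the top output above is to total the %CPU column per user and command, so the main offenders stand out. This is a rough sketch that assumes the default column layout shown in this thread (USER in column 2, %CPU in column 9, COMMAND starting at column 12); `summarize_top` is a hypothetical helper name.

```shell
#!/bin/bash
# Sum %CPU per "USER COMMAND" pair from batch-mode top output.
# Assumed columns: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
summarize_top() {
    awk '
        # Process rows start with a numeric PID and have at least 12 fields.
        $1 ~ /^[0-9]+$/ && NF >= 12 {
            cpu[$2 " " $12] += $9
        }
        END {
            for (key in cpu) printf "%.1f %s\n", cpu[key], key
        }
    ' "$@" | sort -rn
}

# When file names are given on the command line, summarize them.
if [ "$#" -gt 0 ]; then
    summarize_top "$@"
fi
```

With top's `-c` flag the COMMAND column holds the full command line, so only its first word is used as the key. The function also reads from standard input when no file is given.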

Regarding the other questions:
Are these VMs running on a server that has other VMs that could have scheduled jobs or something monopolizing the resources on the machine?
No, there are no other VMs running; I have shut them all down.
Maybe, as a test, you could disable the services and hosts making use of MRTG and temporarily stop MRTG. If you still get load spikes, at least you know it's not caused by MRTG.

I'll deactivate all the services except the localhost one (so I can check the load later).

I'll leave it until tomorrow to see what happens.
Cheers,
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: Nagios 2014R2.3 on VM HIGH LOAD SPIKES

Post by abrist »

Alright. Both the top and ps outputs looked smaller than they should be. Also, did the load spike over 10?
carlos.atos
Posts: 29
Joined: Mon Nov 10, 2014 1:08 pm

Re: Nagios 2014R2.3 on VM HIGH LOAD SPIKES

Post by carlos.atos »

Hello abrist,

The load during this spike stayed below 10.
My services are now deactivated and there is no unusual activity. I will keep watching it until Monday.

cheers
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: Nagios 2014R2.3 on VM HIGH LOAD SPIKES

Post by tmcdonald »

Hello carlos.atos

Any update on this?
Former Nagios employee
carlos.atos
Posts: 29
Joined: Mon Nov 10, 2014 1:08 pm

Re: Nagios 2014R2.3 on VM HIGH LOAD SPIKES

Post by carlos.atos »

Hello tmcdonald

The server has behaved well these days. During the weekend the services were deactivated.
I had a short spike up to 17 on Sunday without any active services (only the load check on localhost, to monitor it).
localhost-current_load Feb17jpg.jpg
Is this normal behaviour? Is it OK to have a maximum of 4 and averages of 1.2?
I have activated the services again and it has looked good so far.

Cheers,
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: Nagios 2014R2.3 on VM HIGH LOAD SPIKES

Post by scottwilkerson »

carlos.atos wrote: Is this normal behaviour? Is it OK to have a maximum of 4 and averages of 1.2?
Honestly, if all this server is doing is monitoring load on localhost, there is definitely something wrong, likely with the hardware. Most VMs I run have ~500 services and 4 vCPUs, and the load rarely goes above 1, and that is on a very heavily loaded ESX server.
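
To put a load figure in context, it can help to divide it by the number of CPUs the VM sees. A small sketch, assuming a Linux guest with `/proc/loadavg` and `nproc` available; `load_per_cpu` is a hypothetical helper name.

```shell
#!/bin/bash
# Divide a load average by a CPU count. Values near or above 1.00 per CPU
# suggest that runnable processes are queuing for CPU time.
load_per_cpu() {
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}

# On Linux, the 1-minute load is the first field of /proc/loadavg and
# nproc reports the number of CPUs available to the process.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    read -r one_min _ < /proc/loadavg
    echo "1-minute load per CPU: $(load_per_cpu "$one_min" "$(nproc)")"
fi
```

For example, `load_per_cpu 1.2 4` prints `0.30`, while the same load on a single vCPU would be `1.20`.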
Former Nagios employee
carlos.atos
Posts: 29
Joined: Mon Nov 10, 2014 1:08 pm

Re: Nagios 2014R2.3 on VM HIGH LOAD SPIKES

Post by carlos.atos »

Well Scott,

I have reactivated the services and have noticed no more spikes, but as you say, my Nagios load should be below 1; it is only monitoring 8 ping hosts and 2 SNMP routers.
This is my load history for the week:
localhost-current_load week14-19 FEB.jpg
We are thinking of moving this VM to another ESXi host, but we don't have one available at the moment.

We have finally activated the Nagios XI license on this server, so now I am a former client. :D
Is there another way to analyze my situation?

cheers
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: Nagios 2014R2.3 on VM HIGH LOAD SPIKES

Post by abrist »

carlos.atos wrote:so now I am a former client
But not current? If you had maintenance we could move to a remote session and troubleshoot it.