Re: Mod Gearman Performance Issues Nagios 2014R2.6
Posted: Tue Apr 07, 2015 7:43 pm
by rajasegar
Notice all these fork and fdopen errors in gearman_worker_log.
This is from the worker in the nagios server.
Any idea how to resolve this?
Code: Select all
[2015-04-08 07:27:03][20067][ERROR] fork error
[2015-04-08 07:27:03][20062][ERROR] fdopen error
[2015-04-08 07:27:03][20074][ERROR] fork error
[2015-04-08 07:27:03][20063][ERROR] fork error
[2015-04-08 07:27:03][18743][ERROR] fdopen error
[2015-04-08 07:27:03][20068][ERROR] fork error
[2015-04-08 07:27:03][30215][ERROR] fdopen error
[2015-04-08 07:27:03][18460][ERROR] fdopen error
[2015-04-08 07:27:03][31185][ERROR] fork error
[2015-04-08 07:27:03][31185][ERROR] fork error
[2015-04-08 07:27:03][31185][ERROR] fork error
[2015-04-08 07:27:03][31185][ERROR] fork error
[2015-04-08 07:27:03][31185][ERROR] fork error
[2015-04-08 07:27:03][31185][ERROR] fork error
[2015-04-08 07:27:03][19073][ERROR] fork error
[2015-04-08 07:27:04][20252][ERROR] fork error
[2015-04-08 07:27:04][20255][ERROR] fork error
[2015-04-08 07:27:04][20262][ERROR] fdopen error
[2015-04-08 07:27:04][20251][ERROR] fdopen error
[2015-04-08 07:27:04][20256][ERROR] fork error
[2015-04-08 07:27:04][20254][ERROR] fork error
[2015-04-08 07:27:04][18778][ERROR] fdopen error
[2015-04-08 07:30:55][11628][INFO ] timeout (240s) hit for servicecheck: MY1PIPP1 - SIBPRBSBS / S SRUNC
Re: Mod Gearman Performance Issues Nagios 2014R2.6
Posted: Wed Apr 08, 2015 4:32 pm
by abrist
This usually happens when the system is out of resources. Did you run out of memory/swap, or did load get ungodly high?
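A quick way to rule the usual suspects in or out is sketched below. It is only a rough check, assuming a Linux host with /proc and a worker running as the `nagios` user (an assumption taken from the shell prompt later in this thread; `CHECK_USER` is a made-up variable name). fork can fail when the per-user process limit is reached or memory is short, so the sketch compares the number of tasks owned by the user against the soft process limit and also prints the open-file limit.

```shell
#!/bin/sh
# Sketch only: compare a user's task count against the process limit.
# CHECK_USER is a hypothetical variable; "nagios" is assumed from the thread.
CHECK_USER="${CHECK_USER:-nagios}"

# Integer percentage of a limit in use, e.g. "25%".
# Prints "n/a" if the limit is not a positive number (e.g. "unlimited").
pct_used() {
    case "$2" in
        ''|*[!0-9]*) echo "n/a"; return 0 ;;
    esac
    if [ "$2" -eq 0 ]; then
        echo "n/a"
    else
        echo "$(( $1 * 100 / $2 ))%"
    fi
}

# Soft limit of the current shell, read from /proc (Linux only).
# To inspect a running worker instead, read /proc/<worker pid>/limits.
soft_limit() {
    awk -v name="$1" 'index($0, name) == 1 { print $(NF - 2) }' \
        /proc/self/limits 2>/dev/null || true
}

# Tasks (processes and threads) owned by a user.
# On Linux, threads count toward the per-user process limit.
count_user_tasks() {
    ps -L -u "$1" -o pid= 2>/dev/null | wc -l | tr -d ' '
}

nproc_limit=$(soft_limit "Max processes")
nofile_limit=$(soft_limit "Max open files")
tasks=$(count_user_tasks "$CHECK_USER")

echo "tasks owned by $CHECK_USER: $tasks"
echo "soft process limit: ${nproc_limit:-unknown}, in use: $(pct_used "$tasks" "$nproc_limit")"
echo "soft open-file limit per process: ${nofile_limit:-unknown}"

# Memory/swap headroom, if free(1) is installed.
if command -v free >/dev/null 2>&1; then
    free -m
fi
```

The limits shown are those of the shell running the script, which may differ from the limits the worker daemon was started with, so treat the numbers as a hint rather than a diagnosis.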
Re: Mod Gearman Performance Issues Nagios 2014R2.6
Posted: Wed Apr 08, 2015 6:05 pm
by rajasegar
abrist wrote:This usually happens when the system is out of resources. Did you run out of memory/swap, or did load get ungodly high?
No, the server has 20 vCPUs and 20 GB of RAM and runs on RAID-10 disks.
Troy did a remote session yesterday, and the root cause was messages not being processed fast enough by ndo2db or something like that.
ipcs -q showed a very high number of messages in the queue all the time.
The problem is still not resolved.
Yesterday we had to restart services every 15 minutes because the scheduling was going haywire.
I disabled all 3 external worker servers and now it is a bit more stable, but the latency is about 6 seconds.
Code: Select all
2015-04-09 07:05:19 - localhost:4730 - v1.1.8
Queue Name                  | Worker Available | Jobs Waiting | Jobs Running
------------------------------------------------------------------------------
check_results               |                2 |            0 |            0
eventhandler                |              129 |            0 |            0
host                        |              129 |            0 |            0
hostgroup_LOAD_BALANCER_MSB |              129 |            0 |           10
service                     |              129 |            0 |           10
worker_nagiosprodxi1        |                1 |            0 |            0
------------------------------------------------------------------------------
[nagios@nagiosprodxi1 scripts]$ ipcs -q
------ Message Queues --------
key msqid owner perms used-bytes messages
0x7f000002 0 nagios 600 100983808 98617
[nagios@nagiosprodxi1 scripts]$ ipcs -q
------ Message Queues --------
key msqid owner perms used-bytes messages
0x7f000002 0 nagios 600 99923968 97582
[nagios@nagiosprodxi1 scripts]$ ipcs -q
------ Message Queues --------
key msqid owner perms used-bytes messages
0x7f000002 0 nagios 600 101514240 99135
2015-04-09_07-04-21.png
Re: Mod Gearman Performance Issues Nagios 2014R2.6
Posted: Thu Apr 09, 2015 12:15 am
by rajasegar
I noticed that when the message queue is low, the scheduling is good.
When it hits the high 5 figures it goes downhill fast and never recovers.
Now that the message queue is low, the scheduling is ok.
2015-04-09_13-11-19.png
2015-04-09_13-11-13.png
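To watch for the point where the queue starts climbing, the ipcs -q output can be reduced to a single number. The sketch below sums the "messages" column (the 6th field in the output shown earlier) and compares it against a threshold. `THRESHOLD` and its default of 10000 are hypothetical; pick whatever level matches where scheduling starts to degrade on your system.

```shell
#!/bin/sh
# Sketch only: turn `ipcs -q` output into a queue depth and compare it to a threshold.
# THRESHOLD is a hypothetical variable and 10000 is an arbitrary default.
THRESHOLD="${THRESHOLD:-10000}"

# Sum the "messages" column (6th field) over all queues read from stdin.
# Data rows are recognised by a hex key in the first field; headers are skipped.
queue_depth() {
    awk '$1 ~ /^0x[0-9a-fA-F]+$/ { total += $6 } END { print total + 0 }'
}

# Read `ipcs -q` output on stdin and report against the threshold given as $1.
check_queue() {
    depth=$(queue_depth)
    if [ "$depth" -ge "$1" ]; then
        echo "WARNING: $depth messages queued (threshold $1)"
    else
        echo "OK: $depth messages queued"
    fi
}

# Sample run using one of the readings posted above.
check_queue "$THRESHOLD" <<'EOF'
------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0x7f000002 0          nagios     600        100983808    98617
EOF
# prints: WARNING: 98617 messages queued (threshold 10000)
```

On a live system the same function can be fed directly, e.g. `ipcs -q | check_queue 10000`.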
Re: Mod Gearman Performance Issues Nagios 2014R2.6
Posted: Thu Apr 09, 2015 2:28 am
by rajasegar
Looks like we might have a temporary solution to the problem.
At least we don't have to restart every hour.
Steps taken
1) Installed rrdcached (default installation):
wget http://assets.nagios.com/downloads/nagi ... dcached.sh
Note: The rrdtool file downloaded is a plain (not gzipped) tar file even though it is named rrdtool-1.4.4.tar.gz.
Change line 114 in xi-rrdcached.sh:
Original: tar -xzf rrdtool-1.4.4.tar.gz
Revised: tar -xf rrdtool-1.4.4.tar.gz
2) Disabled perfdata processing for services like port response, uptime, etc.
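The tar mismatch in step 1 can also be handled without editing the flag by hand. The sketch below is not what xi-rrdcached.sh does; it is one possible workaround that tests whether the file really is gzip data before choosing the tar flags. `extract_archive` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: extract an archive whose name may not match its real format.
# extract_archive is a hypothetical helper, not part of xi-rrdcached.sh.
# Usage: extract_archive <archive> [destination directory]
extract_archive() {
    archive=$1
    dest=${2:-.}
    # gzip -t succeeds only if the file is valid gzip data.
    if gzip -t "$archive" 2>/dev/null; then
        tar -xzf "$archive" -C "$dest"
    else
        # Not gzip: treat it as a plain tar file.
        tar -xf "$archive" -C "$dest"
    fi
}
```

For example, `extract_archive rrdtool-1.4.4.tar.gz` would work whether or not the download is actually compressed. GNU tar also detects compression on its own when reading, which is why the revised `tar -xf` line works.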
The message queue is hovering wildly around 0 - 3500 now.
When rrdcached flushes, the message count shoots up to 16000 and takes about 5 - 10 minutes to go down again.
If we Apply Configuration or restart Nagios, it shoots up to 150000 and takes ages to return to normal.
I don't know how this is going to scale when it is already choking with just 2000 hosts and 14000 services.
Re: Mod Gearman Performance Issues Nagios 2014R2.6
Posted: Thu Apr 09, 2015 4:51 am
by rajasegar
More updates.
Moved everything to a RAM disk as per the link below.
How to Utilize a RAM Disk With Nagios XI
http://library.nagios.com/library/produ ... n-nagiosxi
Now I/O wait is almost always at 0.0% - 0.0x% with occasional spikes.
The message queue is around 0 - 3000 once it has stabilised after a Nagios services restart.
2015-04-09_17-53-20.png
Will monitor for a few days before closing this case.
Re: Mod Gearman Performance Issues Nagios 2014R2.6
Posted: Thu Apr 09, 2015 12:31 pm
by cmerchant
Definitely ramdisk improves things, hope that the ipcs -q problem can be isolated and fixed. I will keep this thread open. Keep us updated.
Re: Mod Gearman Performance Issues Nagios 2014R2.6
Posted: Thu Apr 09, 2015 6:06 pm
by rajasegar
cmerchant wrote:Definitely ramdisk improves things, hope that the ipcs -q problem can be isolated and fixed. I will keep this thread open. Keep us updated.
12 hours later things still look ok. However, ipcs -q is showing an average of about 1x000 messages.
So the issue is still there, as you mentioned. The Nagios team is working on this and hopefully we can settle it fast.
Code: Select all
Every 5.0s: ipcs -q Fri Apr 10 07:05:17 2015
------ Message Queues --------
key msqid owner perms used-bytes messages
0xe5000002 262144 nagios 600 13554688 13237
Re: Mod Gearman Performance Issues Nagios 2014R2.6
Posted: Fri Apr 10, 2015 1:44 am
by rajasegar
Is it advisable to move the nagios.log and nagios.tmp.xxxxxxx to /var/nagiosramdisk/log?
Code: Select all
temp_file=/usr/local/nagios/var/nagios.tmp
log_file=/usr/local/nagios/var/nagios.log
Re: Mod Gearman Performance Issues Nagios 2014R2.6
Posted: Fri Apr 10, 2015 10:51 am
by abrist
rajasegar wrote:Is it advisable to move the nagios.log and nagios.tmp.xxxxxxx to /var/nagiosramdisk/log?
It is generally not a good idea, though if your ramdisk is persistent (in case of an unscheduled reboot) and large enough for the log, then it may be ok.
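The "large enough" half of that condition can be checked roughly before moving anything. The sketch below compares the current size of nagios.log with the free space on the ramdisk. The paths are the ones mentioned in this thread; `HEADROOM` (how many times the current log size should fit) is a hypothetical safety factor, and the script says nothing about whether the ramdisk survives a reboot.

```shell
#!/bin/sh
# Sketch only: does the ramdisk have room for the log, with some headroom?
# Paths are taken from the thread; HEADROOM is a hypothetical safety factor.
LOG_FILE="${LOG_FILE:-/usr/local/nagios/var/nagios.log}"
RAMDISK="${RAMDISK:-/var/nagiosramdisk}"
HEADROOM="${HEADROOM:-4}"

# Succeeds when free_kb ($2) is at least log_kb ($1) times factor ($3).
has_room() {
    [ "$2" -ge $(( $1 * $3 )) ]
}

if [ -f "$LOG_FILE" ] && [ -d "$RAMDISK" ]; then
    # Current log size and free ramdisk space, both in KB.
    log_kb=$(du -k "$LOG_FILE" | cut -f1)
    free_kb=$(df -Pk "$RAMDISK" | awk 'NR == 2 { print $4 }')
    if has_room "$log_kb" "$free_kb" "$HEADROOM"; then
        echo "ramdisk has room: log ${log_kb} KB, free ${free_kb} KB"
    else
        echo "ramdisk too small: log ${log_kb} KB, free ${free_kb} KB"
    fi
else
    echo "skipping: $LOG_FILE or $RAMDISK not found"
fi
```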