Mod Gearman Performance Issues Nagios 2014R2.6

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
rajasegar
Posts: 1018
Joined: Sun Mar 30, 2014 10:49 pm

Re: Mod Gearman Performance Issues Nagios 2014R2.6

Post by rajasegar »

Notice all these fork and fdopen errors in gearman_worker_log.

This is from the worker on the Nagios server.
Any idea how to resolve this?

Code: Select all

[2015-04-08 07:27:03][20067][ERROR] fork error
[2015-04-08 07:27:03][20062][ERROR] fdopen error
[2015-04-08 07:27:03][20074][ERROR] fork error
[2015-04-08 07:27:03][20063][ERROR] fork error
[2015-04-08 07:27:03][18743][ERROR] fdopen error
[2015-04-08 07:27:03][20068][ERROR] fork error
[2015-04-08 07:27:03][30215][ERROR] fdopen error
[2015-04-08 07:27:03][18460][ERROR] fdopen error
[2015-04-08 07:27:03][31185][ERROR] fork error
[2015-04-08 07:27:03][31185][ERROR] fork error
[2015-04-08 07:27:03][31185][ERROR] fork error
[2015-04-08 07:27:03][31185][ERROR] fork error
[2015-04-08 07:27:03][31185][ERROR] fork error
[2015-04-08 07:27:03][31185][ERROR] fork error
[2015-04-08 07:27:03][19073][ERROR] fork error
[2015-04-08 07:27:04][20252][ERROR] fork error
[2015-04-08 07:27:04][20255][ERROR] fork error
[2015-04-08 07:27:04][20262][ERROR] fdopen error
[2015-04-08 07:27:04][20251][ERROR] fdopen error
[2015-04-08 07:27:04][20256][ERROR] fork error
[2015-04-08 07:27:04][20254][ERROR] fork error
[2015-04-08 07:27:04][18778][ERROR] fdopen error
[2015-04-08 07:30:55][11628][INFO ] timeout (240s) hit for servicecheck: MY1PIPP1 - SIBPRBSBS / S                                                                       SRUNC

5 x Nagios 5.6.9 Enterprise Edition
RHEL 6 & 7
rrdcached & ramdisk optimisation
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: Mod Gearman Performance Issues Nagios 2014R2.6

Post by abrist »

This usually happens when the system is out of resources. Did you run out of memory/swap, or did load get ungodly high?
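
One rough way to test the out-of-resources theory is to compare the worker user's process count against its process limit, next to the usual memory and load checks. This is a minimal sketch, assuming the workers run as the `nagios` user; `fork_headroom` is a hypothetical helper name and the 90% threshold is arbitrary.

```shell
#!/bin/bash
# Hypothetical helper: compare a process count against a limit.
# Prints OK or WARN; "unlimited" (as printed by `ulimit -u`) is always OK.
fork_headroom() {
    local used="$1" limit="$2"
    if [ "$limit" = "unlimited" ]; then
        echo "OK: $used processes (no limit)"
        return 0
    fi
    local pct=$(( used * 100 / limit ))
    if [ "$pct" -ge 90 ]; then
        echo "WARN: $used/$limit processes (${pct}%)"
        return 1
    fi
    echo "OK: $used/$limit processes (${pct}%)"
}

# Example usage on the Nagios server, run as the nagios user (not executed here):
#   fork_headroom "$(ps -u nagios --no-headers | wc -l)" "$(ulimit -u)"
#   free -m     # memory and swap
#   uptime      # load averages
```

Threads also count toward the per-user limit, so the plain process count is only a lower bound; adding `-L` to `ps` gives a closer figure.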
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
rajasegar
Posts: 1018
Joined: Sun Mar 30, 2014 10:49 pm

Re: Mod Gearman Performance Issues Nagios 2014R2.6

Post by rajasegar »

abrist wrote:This usually happens when the system is out of resources. Did you run out of memory/swap, or did load get ungodly high?
No, the server has 20 vCPUs and 20 GB of RAM and runs on RAID-10 disks.
Troy did a remote session yesterday, and the root cause was messages not being processed fast enough by ndodb or something like that.
ipcs -q showed a very high number of messages in the queue all the time.
The problem is still not resolved.

Yesterday we had to restart services every 15 minutes because the scheduling was going haywire.

I disabled all 3 external worker servers and now it is a bit more stable, but the latency is about 6 seconds.

Code: Select all

2015-04-09 07:05:19  -  localhost:4730  -  v1.1.8

 Queue Name                  | Worker Available | Jobs Waiting | Jobs Running
------------------------------------------------------------------------------
 check_results               |               2  |           0  |           0
 eventhandler                |             129  |           0  |           0
 host                        |             129  |           0  |           0
 hostgroup_LOAD_BALANCER_MSB |             129  |           0  |          10
 service                     |             129  |           0  |          10
 worker_nagiosprodxi1        |               1  |           0  |           0
------------------------------------------------------------------------------

[nagios@nagiosprodxi1 scripts]$ ipcs -q

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0x7f000002 0          nagios     600        100983808    98617

[nagios@nagiosprodxi1 scripts]$ ipcs -q

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0x7f000002 0          nagios     600        99923968     97582

[nagios@nagiosprodxi1 scripts]$ ipcs -q

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0x7f000002 0          nagios     600        101514240    99135
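
In all three samples above, used-bytes divided by messages comes out to exactly 1024, so the message count alone is enough to track the backlog. A minimal sketch for pulling those numbers out of `ipcs -q`, assuming the column layout shown above; `queue_stats` is a hypothetical helper name.

```shell
# Hypothetical helper: read `ipcs -q` output on stdin and print, for each
# queue owned by the given user, the message count and bytes per message.
queue_stats() {
    awk -v owner="$1" '
        $1 ~ /^0x/ && $3 == owner {
            msgs = $6; bytes = $5
            per = (msgs > 0) ? bytes / msgs : 0
            printf "%d %d\n", msgs, per
        }'
}
# Example usage (not executed here):  ipcs -q | queue_stats nagios
```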


2015-04-09_07-04-21.png
rajasegar
Posts: 1018
Joined: Sun Mar 30, 2014 10:49 pm

Re: Mod Gearman Performance Issues Nagios 2014R2.6

Post by rajasegar »

I noticed that when the message queue is low, the scheduling is good.
When it hits the high 5 figures it goes downhill fast and never recovers.

Now the message queue is low, the scheduling is ok.
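
Since scheduling tracks the queue depth this closely, the depth itself could be watched as a service check. A minimal sketch in the Nagios plugin style; `check_queue_depth` is a hypothetical name, and the 10000/50000 thresholds are illustrative guesses based on the "high 5 figures" observation, not tuned values.

```shell
# Hypothetical Nagios-style check: classify a queue depth against thresholds.
# Exit codes follow the Nagios plugin convention: 0=OK, 1=WARNING, 2=CRITICAL.
check_queue_depth() {
    local depth="$1" warn="$2" crit="$3"
    if [ "$depth" -ge "$crit" ]; then
        echo "CRITICAL - $depth messages queued | messages=$depth;$warn;$crit"
        return 2
    elif [ "$depth" -ge "$warn" ]; then
        echo "WARNING - $depth messages queued | messages=$depth;$warn;$crit"
        return 1
    fi
    echo "OK - $depth messages queued | messages=$depth;$warn;$crit"
    return 0
}
# Example usage, assuming a single queue owned by nagios (not executed here):
#   check_queue_depth "$(ipcs -q | awk '$3 == "nagios" {print $6}')" 10000 50000
```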
2015-04-09_13-11-19.png
2015-04-09_13-11-13.png
rajasegar
Posts: 1018
Joined: Sun Mar 30, 2014 10:49 pm

Re: Mod Gearman Performance Issues Nagios 2014R2.6

Post by rajasegar »

Looks like we might have a temporary solution to the problem.
At least we don't have to restart every hour.

Steps taken
1) Installed rrdcached. Default installation
wget http://assets.nagios.com/downloads/nagi ... dcached.sh
Note: The rrdtool file downloaded is a plain tar file even though it is named rrdtool-1.4.4.tar.gz.
Change line 114 in xi-rrdcached.sh:
Original: tar -xzf rrdtool-1.4.4.tar.gz
Revised: tar -xf rrdtool-1.4.4.tar.gz

2) Disabled process perfdata for services like port response, uptime etc
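
For the tar issue in step 1, the archive's real format can be checked from its first two bytes instead of trusting the file name. A minimal sketch; `tar_extract_flags` is a hypothetical helper, not part of xi-rrdcached.sh.

```shell
# Hypothetical helper: print the tar flags that match the file's real format
# by checking for the gzip magic bytes (1f 8b) instead of trusting the name.
tar_extract_flags() {
    local magic
    magic=$(od -An -tx1 -N2 "$1" | tr -d ' \n')
    if [ "$magic" = "1f8b" ]; then
        echo "-xzf"
    else
        echo "-xf"
    fi
}
# Example usage (not executed here):
#   tar "$(tar_extract_flags rrdtool-1.4.4.tar.gz)" rrdtool-1.4.4.tar.gz
```

Recent GNU tar detects compression on its own when reading, so plain `tar -xf` should work whether or not the file is gzip-compressed.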

The message queue is hovering wildly between 0 and 3500 now.
When rrdcached flushes, the message count shoots up to 16000 and takes about 5 - 10 minutes to go down again.

If we Apply Configuration or restart Nagios, it shoots up to 150000 and takes ages to go back down to normal.

I don't know how this is going to scale when it is already choking with just 2000 hosts and 14000 services.
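
To put the 150000-message peak in perspective, it can be converted to bytes and compared with the kernel's per-queue byte limit (`kernel.msgmnb` on Linux); a queue close to that limit may stall its writers. A minimal sketch with hypothetical helper names. The 1024 bytes per message is inferred from the `ipcs -q` samples posted earlier in this thread (used-bytes / messages), not from any documented constant.

```shell
# Hypothetical helpers. BYTES_PER_MSG=1024 is inferred from the ipcs samples in
# this thread (used-bytes / messages), not from any documented constant.
BYTES_PER_MSG=1024

queue_bytes() {            # messages -> bytes held in the queue
    echo $(( $1 * BYTES_PER_MSG ))
}

queue_fill_pct() {         # used bytes, queue byte limit -> whole percent
    echo $(( $1 * 100 / $2 ))
}
# Example usage on Linux (not executed here):
#   queue_fill_pct "$(queue_bytes 150000)" "$(cat /proc/sys/kernel/msgmnb)"
```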
rajasegar
Posts: 1018
Joined: Sun Mar 30, 2014 10:49 pm

Re: Mod Gearman Performance Issues Nagios 2014R2.6

Post by rajasegar »

More updates.

Moved everything to a RAM disk as per the link below.

How to Utilize a RAM Disk With Nagios XI
http://library.nagios.com/library/produ ... n-nagiosxi

Now I/O wait is almost always at 0.0% - 0.0x%, with occasional spikes.
The message queue is around 0 - 3000 once it has stabilised after a Nagios services restart.
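
The I/O wait figure can be tracked without extra tools by sampling the first line of `/proc/stat` twice. A minimal sketch, assuming the standard Linux field order (user, nice, system, idle, iowait, irq, softirq, steal); `iowait_pct` is a hypothetical helper name.

```shell
# Hypothetical helper: given two "cpu" summary lines from /proc/stat taken some
# seconds apart, print the share of CPU time spent in iowait between them.
iowait_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        { total = 0; for (i = 2; i <= 9; i++) total += $i   # user..steal
          t[NR] = total; w[NR] = $6 }                        # $6 = iowait
        END { d = t[2] - t[1]
              if (d > 0) printf "%.1f\n", (w[2] - w[1]) * 100 / d
              else print "0.0" }'
}
# Example usage (not executed here):
#   a=$(head -n 1 /proc/stat); sleep 5; b=$(head -n 1 /proc/stat)
#   iowait_pct "$a" "$b"
```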
2015-04-09_17-53-20.png

Will monitor for a few days before closing this case.
cmerchant
Posts: 546
Joined: Wed Sep 24, 2014 11:19 am

Re: Mod Gearman Performance Issues Nagios 2014R2.6

Post by cmerchant »

Definitely ramdisk improves things, hope that the ipcs -q problem can be isolated and fixed. I will keep this thread open. Keep us updated.
rajasegar
Posts: 1018
Joined: Sun Mar 30, 2014 10:49 pm

Re: Mod Gearman Performance Issues Nagios 2014R2.6

Post by rajasegar »

cmerchant wrote:Definitely ramdisk improves things, hope that the ipcs -q problem can be isolated and fixed. I will keep this thread open. Keep us updated.
12 hours later things still look OK. However, ipcs -q is showing an average of about 1x000 messages.
So the issue is still there, as you mentioned. The Nagios team is working on this and hopefully we can settle it fast.

Code: Select all

Every 5.0s: ipcs -q                                                                 Fri Apr 10 07:05:17 2015


------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0xe5000002 262144     nagios     600        13554688     13237
rajasegar
Posts: 1018
Joined: Sun Mar 30, 2014 10:49 pm

Re: Mod Gearman Performance Issues Nagios 2014R2.6

Post by rajasegar »

Is it advisable to move the nagios.log and nagios.tmp.xxxxxxx to /var/nagiosramdisk/log?

Code: Select all

temp_file=/usr/local/nagios/var/nagios.tmp
log_file=/usr/local/nagios/var/nagios.log

abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: Mod Gearman Performance Issues Nagios 2014R2.6

Post by abrist »

rajasegar wrote:Is it advisable to move the nagios.log and nagios.tmp.xxxxxxx to /var/nagiosramdisk/log?
It is generally not a good idea, though if your ramdisk is persistent (in case of an unscheduled reboot) and large enough for the log, then it may be ok.
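
If the log does go on the RAM disk, one way to limit the loss from an unscheduled reboot is to copy it back to persistent storage at a regular interval. A minimal sketch, not a tested procedure; `persist_log` is a hypothetical helper and the backup path in the usage comment is illustrative.

```shell
# Hypothetical helper: copy a log from the RAM disk to persistent storage,
# keeping its timestamps, so a reboot loses at most one interval of entries.
persist_log() {
    local src="$1" dest_dir="$2"
    mkdir -p "$dest_dir" && cp -p "$src" "$dest_dir/"
}
# Example cron-style usage (paths are illustrative; not executed here):
#   persist_log /var/nagiosramdisk/log/nagios.log /usr/local/nagios/var/log-backup
```

The copy would also need to be restored to the RAM disk at boot, before Nagios starts, for the log to stay continuous.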