Nagios Support Forum

Posted: **Tue Jul 07, 2015 9:44 pm**

Hi Nagios!

We are using mod_gearman to handle about 37,000 service checks, and running into a problem where the service check latency is really high.

I have a gearmand server defined in /usr/local/etc/mod_gearman2/module.conf:

Code:

server=192.168.249.37:4730

It is configured to start with 10 threads (8GB RAM, 16 cores).
We tried offloading it to a separate server, but the performance was the same.

We have three worker servers (4GB RAM, 8 cores) pointed at that gearmand server.
The workers are configured with the following:

Code:

max-worker=1500
max-jobs=1000
spawn-rate=100

Each worker instance holds roughly 300 connections, but when I run only one worker, it goes up to 1000 connections.

With that, the gearmand server is maxing out at about 1000-1200 connections:

Code:

# netstat -anp | grep 4730 | wc -l
1029
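
Note that `netstat -anp | grep 4730 | wc -l` counts every line containing 4730, including the LISTEN socket and sockets in states such as TIME_WAIT. As a rough sketch, the helper below counts only ESTABLISHED connections to the gearmand port and groups them by remote host, which shows how many connections each worker actually holds. The name `count_by_peer` is made up for illustration, and the usual Linux `netstat -ant` column layout is assumed.

```shell
#!/bin/sh
# Count ESTABLISHED TCP connections to a local port, grouped by remote host.
# Reads `netstat -ant` style output on stdin; the local port is the first argument.
# count_by_peer is a hypothetical helper name, not part of mod_gearman or gearmand.
count_by_peer() {
    awk -v port="$1" '
        # Columns: Proto Recv-Q Send-Q Local-Address Foreign-Address State
        $1 ~ /^tcp/ && $6 == "ESTABLISHED" && $4 ~ (":" port "$") {
            peer = $5
            sub(/:[0-9]+$/, "", peer)   # strip the remote port, keep the host
            count[peer]++
        }
        END { for (p in count) print count[p], p }
    ' | sort -rn
}

# Usage on the gearmand host (4730 is the port from module.conf):
#   netstat -ant | count_by_peer 4730
```

Each output line is a connection count followed by the remote address, highest count first.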

The gearadmin --status command on the gearmand server stops responding once there are about 500 connections, so it's not too useful for us, but I can see from "top" that there's activity on the server:

Code:

top - 21:33:04 up  1:12,  2 users,  load average: 0.96, 0.63, 0.69
Tasks: 277 total,   2 running, 275 sleeping,   0 stopped,   0 zombie
Cpu(s):  2.5%us,  4.0%sy,  0.0%ni, 93.5%id,  0.0%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:   8058952k total,   219040k used,  7839912k free,     8684k buffers
Swap:  2064380k total,        0k used,  2064380k free,    36404k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                             
 3479 naemon    20   0  840m  17m  720 R 105.9  0.2   2:48.30 gearmand

Am I correct in assuming that the gearmand server is the bottleneck? The workers can scale up their connections as needed, but for some reason we can't get gearmand to keep up with the mod_gearman broker service on XI. (Or is the mod_gearman broker the bottleneck?)

Any ideas are very much appreciated. I tried uploading our profile, but it's too big, so I've attached the summary details in a .txt file.

Thanks!

- Ian

Posted: **Wed Jul 08, 2015 10:08 am**

I personally haven't seen an environment of that scale yet, so I can neither confirm nor deny that this is the number at which to expect job server saturation. Your logic is sound, though. I'm assuming that gearmand keeps a single core on your system pegged all the time?

As for where to go from there: you'll need a ramdisk if you don't already have one. After that, my guess would be to start submitting passive results from a second instance as an additional layer of distribution.

Posted: **Wed Jul 08, 2015 6:43 pm**

Thanks for the quick reply - you guys are awesome : )

It took 3 of us about 5 hours of conference calls to figure out the bottleneck. It was the open-file limit on the server running gearmand.
We tried repeatedly to raise it using /etc/security/limits.conf, but the default soft limit of 1024 was still being applied to the gearmand process.
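
A likely reason the limits.conf change had no effect: that file is applied by pam_limits when a login session is opened, so a daemon started from an init script at boot generally never picks it up. The limits actually in force for a running process can be read from `/proc/<pid>/limits`. A minimal sketch, assuming Linux; `nofile_limits` is a hypothetical helper name:

```shell
#!/bin/sh
# Print the soft and hard open-file limits found in /proc/<pid>/limits text.
# Reads the limits table on stdin and prints "<soft> <hard>".
# nofile_limits is a hypothetical helper name used only for this example.
nofile_limits() {
    # The row looks like: "Max open files  <soft>  <hard>  files"
    awk '/^Max open files/ { print $4, $5 }'
}

# Usage (assumes the daemon's process name is gearmand; -o picks the oldest match):
#   nofile_limits < "/proc/$(pgrep -o gearmand)/limits"
```

Checking this right after editing limits.conf and restarting the daemon shows whether the new values reached the process at all.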

Then we found a way to adjust the limits of the running process in real time:

Code:

# prlimit --pid (PID) --nofile=65535:65535
# prlimit --pid (PID) | grep NOFILE
NOFILE     max number of open files               65535     65535

Right after the limits were updated, the TCP connections from each of the three worker servers jumped from about 300 to over 1000, and the average latency for service checks dropped from 120 secs to 5 secs.
SOLVED!!
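
For anyone hitting the same wall: three workers at about 300 connections each is already close to a 1024 open-file soft limit, since every TCP connection uses one descriptor. A rough sketch of a check that warns before the limit is reached is below. `check_fd_headroom` and the 80% default threshold are assumptions for illustration, not something from mod_gearman.

```shell
#!/bin/sh
# Warn when a process is close to its open-file soft limit.
# $1 = descriptors currently open, $2 = soft limit, $3 = warning threshold in percent (default 80)
# Returns 1 and prints WARNING when usage is at or above the threshold, otherwise prints OK.
# check_fd_headroom is a hypothetical helper name used only for this example.
check_fd_headroom() {
    used=$1
    soft=$2
    warn=${3:-80}
    if [ "$soft" = "unlimited" ]; then
        echo "OK: $used open files, no soft limit"
        return 0
    fi
    pct=$(( used * 100 / soft ))   # integer percentage, rounded down
    if [ "$pct" -ge "$warn" ]; then
        echo "WARNING: $used of $soft open files in use (${pct}%)"
        return 1
    fi
    echo "OK: $used of $soft open files in use (${pct}%)"
}

# Usage on the gearmand host (run as root so /proc/<pid>/fd is readable):
#   pid=$(pgrep -o gearmand)
#   check_fd_headroom "$(ls /proc/$pid/fd | wc -l)" \
#       "$(awk '/^Max open files/ { print $4 }' /proc/$pid/limits)"
```

Keep in mind that a limit set with prlimit applies only to the running process, so it has to be set again after gearmand restarts.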

We're not stopping at 37,000 checks - this was just the first batch .. I'm sure you'll hear more from us as we make progress, thanks again -

- Ian

Posted: **Wed Jul 08, 2015 7:00 pm**

Thanks for the update, much appreciated.

Keep us posted on your progress; your feedback will be helpful to others with similar problems in the future.

Gearman Bottleneck
