Nagios XI Event queue stalling? (Mod Gearman)

lance
Posts: 38
Joined: Wed Feb 17, 2010 5:00 pm

Re: Nagios XI Event queue stalling? (Mod Gearman)

Post by lance »

Sorry for resurrecting an older post, but just thought we'd provide an update.

We've been working fairly constantly over the past 4-6 weeks with some positive results.

Basically we upgraded to the latest gearman suite. We pulled apart the provided install script to make sure we installed the right components in the right place. We sourced the files from:

http://mod-gearman.org/download/v1.4.14/rhel6/x86_64/

which were updated versions of the files:

gearmand-0.33-1.rhel6.x86_64.rpm
gearmand-devel-0.33-1.rhel6.x86_64.rpm
gearmand-server-0.33-1.rhel6.x86_64.rpm
mod_gearman-1.4.14-1.e.rhel6.x86_64.rpm

At the time these were the latest that we could find.

We also moved all our Nagios virtual machines onto identified storage LUNs that had low Disk IO wait. They were originally hosted on storage that was constantly being replicated to our DR site, which caused some higher IO wait issues for us with the mod_gearman setup.

We also made some configuration changes in the config files, based on some forum posts from ConSol Labs & the mod_gearman Google group:

Master Node
/etc/mod_gearman/mod_gearman_neb.conf
use_uniq_jobs=off

Worker Nodes
/etc/mod_gearman/mod_gearman_worker.conf
idle-timeout=20 (reduced from 30, from memory)
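
For anyone wanting to apply the same two settings, the edits could be scripted roughly as follows. This is only a sketch: set_conf_option is a hypothetical helper (not part of mod_gearman), it assumes GNU sed and plain key=value lines with simple values, and WORKER_CONF in the usage comment is a placeholder for the worker config path on each worker node. The affected services would still need a restart afterwards.

```shell
#!/bin/sh
# Hypothetical helper: set KEY=VALUE in a key=value style config file.
# Replaces an existing uncommented KEY= line, or appends one if none exists.
# Assumes simple values (no "|" or "&" characters) and GNU sed for -i.
set_conf_option() {
    file=$1
    key=$2
    value=$3
    if grep -q "^${key}=" "$file"; then
        sed -i "s|^${key}=.*|${key}=${value}|" "$file"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Example usage (commented out; paths and values as described in this post):
# set_conf_option /etc/mod_gearman/mod_gearman_neb.conf use_uniq_jobs off
# set_conf_option "$WORKER_CONF" idle-timeout 20
```

Taking a backup copy of each file before running something like this is a sensible precaution.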

We can't really pinpoint which specific change resolved the issue for us, or whether it was a combination of everything we did. The thing is, we haven't seen the issue since 12/3, and the deployment of one master & 3 worker nodes is now comfortably servicing about 1100 host checks & just over 11000 service checks.

Thanks

Lincoln
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: Nagios XI Event queue stalling? (Mod Gearman)

Post by slansing »

Wow, you and your team have been hard at work! We still have not seen this behavior with others, neither with the currently posted version of mod gearman in the XI document nor with the latest one that works with Core 4. How do your system vitals look: I/O, load, check latency, etc.?
lance
Posts: 38
Joined: Wed Feb 17, 2010 5:00 pm

Re: Nagios XI Event queue stalling? (Mod Gearman)

Post by lance »

Hi there,

Yeah, we're pretty happy with where we've got to.

We've got 4 VMs: 1 master & 3 workers, all with 4 vCPU & 4 GB RAM. The master is configured as a worker as well, as it wasn't really doing that much anyway! We've got some specific service queues set up that are serviced by specific workers, & that seems to up the load on those workers a bit; with the 4 CPUs they occasionally get above a 5 min load avg of 5, but not for that long. IO wait is generally less than 1-3% on the master & more or less the same for the workers. We still get a spike overnight at about midnight (backups running across the SAN), but usually only to a max of 5% or thereabouts, and not sustained for that long.
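
To put a figure like "5 min load avg of 5 on 4 vCPU" in context, the load can be divided by the CPU count. The snippet below is a rough sketch; load_per_cpu is a hypothetical helper name, and the commented usage line assumes a Linux host with /proc/loadavg and nproc available.

```shell
#!/bin/sh
# Hypothetical helper: print a load average divided by the number of CPUs.
# A result above 1.00 means more runnable work than CPUs, on average.
load_per_cpu() {
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}

# Example with the figures from this post (load 5 on 4 vCPU):
load_per_cpu 5 4

# On a live Linux host, the 5 minute value is the second field of /proc/loadavg:
# load_per_cpu "$(cut -d' ' -f2 /proc/loadavg)" "$(nproc)"
```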

Also, latency stats look pretty good:
[Attached image: Latency.jpg]
I've noticed too that the master seems to cope better if workers are lost. I've seen the Jobs Waiting count get up to 2-3k, but once the workers return, Gearman churns through the waiting jobs pretty quickly.
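
For reference, a waiting-jobs count can also be derived from the gearmand admin "status" output, where each line is tab-separated as function name, total jobs, running jobs, available workers, and the listing ends with a lone ".". The sketch below treats waiting as total minus running; waiting_jobs is a hypothetical helper name, and the gearadmin call in the comment assumes that tool is installed alongside gearmand.

```shell
#!/bin/sh
# Hypothetical helper: read gearmand "status" lines on stdin and print
# "<queue> <waiting>" per queue, where waiting = total jobs - running jobs.
# The terminating "." line has only one field and is skipped.
waiting_jobs() {
    awk -F'\t' 'NF == 4 { printf "%s %d\n", $1, $2 - $3 }'
}

# Example usage on the master (commented out; assumes gearadmin is available):
# gearadmin --status | waiting_jobs
```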

Appreciate your advice

regards

Lincoln
sreinhardt
-fno-stack-protector
Posts: 4366
Joined: Mon Nov 19, 2012 12:10 pm

Re: Nagios XI Event queue stalling? (Mod Gearman)

Post by sreinhardt »

Awesome job and thanks for the update! I think it might be time to update that gearman install script to avoid this for others.
mon-team
Posts: 171
Joined: Thu Jun 28, 2012 9:22 am

Re: Nagios XI Event queue stalling? (Mod Gearman)

Post by mon-team »

Hi everybody,

we experienced exactly the same issue on our production environment last Sunday.

All active service checks were orphaned. We were able to get Nagios working again by setting use_retained_scheduling_info=0 and restarting Nagios.
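
Before changing it, the current value of that option can be read back from the main config file. This is a minimal sketch: get_conf_option is a hypothetical helper, and the nagios.cfg path in the usage comment is an assumption based on the /usr/local/nagios layout used elsewhere in this thread.

```shell
#!/bin/sh
# Hypothetical helper: print the value of the last uncommented KEY=VALUE
# line for KEY in FILE (prints nothing if the key is not set).
get_conf_option() {
    sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*//p" "$1" | tail -n 1
}

# Example usage (commented out; path is an assumption):
# get_conf_option /usr/local/nagios/etc/nagios.cfg use_retained_scheduling_info
```

With the option set to 0, Nagios does not restore the saved next-check times from the retention file on restart, so checks are rescheduled from scratch.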

We have a central Nagios XI server (physical machine) and 3 workers (virtual machines, VMware)

Nagios XI 2012R2.9
mod_gearman v. 1.3.8

We actively monitor around 1200 Hosts /12500 Services

Mod-Gearman was installed following the procedure provided in the Nagios XI Administration Guide

http://assets.nagios.com/downloads/nagi ... ios_XI.pdf

We will go through the steps suggested by lance and see if they fix the issue.

Does anybody know if the gearman install script has already been rewritten to install the latest stable gearman version (v 1.4.14)?

It would speed up the gearman upgrade.
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: Nagios XI Event queue stalling? (Mod Gearman)

Post by slansing »

Hi mon-team,

I have not updated the document yet. I am hoping to test with XI 2014 today, as I want to make sure that we do this the right way; we wouldn't want people to end up with only half of the correct, working prerequisites. We also need to make sure this still works with XI 2012 versions (it should), as the majority of users will likely not upgrade to 2014 on version 1.0.

Has anyone here verified that you are getting current, and good, performance data through from your gearman checks in either the 2014 beta, or the 2014 release?

Addition: Apparently we have reports of older versions of gearman working with Core 4 (not the most recent one mentioned here) as well, I'll be looking into that.
BanditBBS
Posts: 2474
Joined: Tue May 31, 2011 12:57 pm
Location: Scio, OH

Re: Nagios XI Event queue stalling? (Mod Gearman)

Post by BanditBBS »

slansing wrote:Hi mon-team,

I have not updated the document yet, I am going to be testing with XI 2014 today hopefully, I want to make sure that we do this the right way. We wouldn't want people to end up with half the correct and working pre-reqs. We also need to make sure this still works with XI 2012 versions (it should) as the majority of users will likely not upgrade to 2014 on version 1.0.
Sam, from what I've seen, a given gearman version works either with Core 4.0 or greater OR with Core <4.0; no version will work with both! And as far as I know, it is only the one released version that works with Core 4.x. So you may have to write two versions of the document, one for XI 2012 and one for XI 2014: 1.4.14 only works with XI 2012, and you have to use the 1.4nagios4 version for XI 2014.

Please feel free to prove me wrong though, as some of the additions to the last couple versions would be welcomed in XI2014.
mon-team
Posts: 171
Joined: Thu Jun 28, 2012 9:22 am

Re: Nagios XI Event queue stalling? (Mod Gearman)

Post by mon-team »

Thanks for the update slansing.

I think we will proceed to upgrade mod gearman manually to v1.4.14.

We will try it first in our test environment

Anyway, please let us know as soon as an official upgrade procedure is ready for

Nagios XI 2012R2.9
mod_gearman v.1.4.14
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: Nagios XI Event queue stalling? (Mod Gearman)

Post by slansing »

Definitely, I would of course recommend you test this first with your XI test server, the last thing I'd want to see is your production system's monitoring capabilities go down and having to do damage control.
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: Nagios XI Event queue stalling? (Mod Gearman)

Post by slansing »

Dropping this in a new reply since some of you subscribed to this thread. You can use the attached script to get a working version of gearmand on your XI 2014 / Core 4 server, and it can also be used to get a working worker install on your remote workers. I also added an option to specify a key (during install) if you are using encryption on the master server, so you don't have to go in and add it manually.

You can follow the same old document, though I'll be updating that as well once support is added to check your server for Core 4, so we don't break other people's installs if they are still using Core 3 and want gearman.

NOTE: If you did not know already, you cannot use the perfdata processing function of mod gearman, as we are not shipping the version of PNP that it requires you to install remotely; we don't really use PNP heavily on the backend of XI anymore.

Currently, until we find a better way, you will also need to update your performance data commands in the CCM in order to get perf data pushed from the workers up to XI and displayed properly.

You will need to change the commands for process-host-perfdata-file-bulk and process-service-perfdata-file-bulk to:

Code: Select all

sed -i 's/\\n//g' /usr/local/nagios/var/host-perfdata && /bin/mv /usr/local/nagios/var/host-perfdata /usr/local/nagios/var/spool/xidpe/$TIMET$.perfdata.host
And:

Code: Select all

sed -i 's/\\n//g' /usr/local/nagios/var/service-perfdata && /bin/mv /usr/local/nagios/var/service-perfdata /usr/local/nagios/var/spool/xidpe/$TIMET$.perfdata.service
And apply configuration.
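
To illustrate what those two commands do, here is a small stand-alone sketch that runs the same sed and mv steps on a throwaway file. Everything in it is an assumption made for demonstration: the temporary directory stands in for /usr/local/nagios/var and the xidpe spool, the sample perfdata line is invented, and $(date +%s) stands in for the $TIMET$ macro that Nagios would substitute. It shows how the expression behaves when typed in a shell; how Nagios itself handles the backslashes in a command definition is not covered here.

```shell
#!/bin/sh
# Sketch only: temporary paths stand in for the real Nagios directories.
workdir=$(mktemp -d)
mkdir "$workdir/xidpe"
perf="$workdir/service-perfdata"

# Invented sample line containing a literal backslash followed by "n"
# (two characters), not a real newline.
printf '%s\n' 'load1=0.52;5;10\nload5=0.48;5;10' > "$perf"

# Same expression as in the commands above: in a shell, it removes every
# literal backslash-n pair from the file (GNU sed, in place).
sed -i 's/\\n//g' "$perf"

# Move the cleaned file into the spool under a timestamped name.
target="$workdir/xidpe/$(date +%s).perfdata.service"
mv "$perf" "$target"

cat "$target"
```

After the sed step the sample line no longer contains the backslash-n pair, and the original file is gone because it has been moved into the spool directory.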