Nagios XI Event queue stalling? (Mod Gearman)

Post by **lance** » Fri Feb 14, 2014 8:03 pm

Hi,

Apologies for the long post. We're having an issue with all checks failing (marked as orphaned) after a period of time on our new Nagios deployment.

Overview:
Fresh Nagios XI deployment with Mod Gearman: 1 master node and 3 worker nodes. All are VMs on ESX5i with 4 x CPU, 4 GB memory and loads of disk, running RHEL6.
We've gone through a process of migrating from an NSCA/NRDP deployment and have basically done a side-by-side deployment, copying all the relevant configs from our legacy Nagios instances.

It's been running for 2 months without incident while we migrated all the host/service checks from the legacy deployment. We've ramped up the number of host and service checks over the past week to around 800 hosts and close to 7000 services.

Since last Thursday (pretty much when we loaded it up) we've had 2 incidents where the Event Queue bunches up its checks (basically it says there are 7k scheduled immediately, see pic).

MonitoringEngineEventQueue.jpg

Watching the logs on the workers, the only lines that appear are the following:

mod_gearman_worker-log.jpg

All the checks go critical with "service/host check orphaned, is the mod-gearman worker on queue 'service' running?"

I/O on the master server seems OK, as does disk space.

It just seems like, for some reason, the workers are unable to retrieve the jobs from the master.
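One way to see whether the workers are actually registered against the queues is to ask gearmand directly. Below is a rough Python sketch, not a supported tool. It relies on gearmand's text admin command `status`, which returns one tab-separated line per queue (name, total jobs, running jobs, available workers) followed by a line containing a single dot. The host, the default port 4730, the helper names and the sample numbers are all assumptions for illustration.

```python
import socket


def parse_status(text):
    """Parse gearmand 'status' output into {queue: {total, running, workers}}."""
    queues = {}
    for line in text.splitlines():
        if line == ".":          # a lone dot terminates the listing
            break
        parts = line.split("\t")
        if len(parts) != 4:      # ignore anything that is not a status row
            continue
        name, total, running, workers = parts
        queues[name] = {
            "total": int(total),
            "running": int(running),
            "workers": int(workers),
        }
    return queues


def fetch_status(host="localhost", port=4730, timeout=5.0):
    """Send 'status' to gearmand and return the raw text reply.

    host and port are assumptions; adjust them to your master node.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"status\n")
        data = b""
        while not data.endswith(b".\n"):
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    return data.decode("ascii", errors="replace")


def stalled_queues(queues):
    """Queues that hold jobs but have no worker registered for them."""
    return [name for name, q in queues.items()
            if q["total"] > 0 and q["workers"] == 0]


# Made-up sample reply, only to show the expected shape of the data.
SAMPLE = "service\t7000\t0\t0\nhost\t12\t3\t30\n.\n"

if __name__ == "__main__":
    print(stalled_queues(parse_status(SAMPLE)))
```

Against a live master you would call `parse_status(fetch_status("your-master"))` instead of using the sample. A queue with jobs piling up and zero available workers would match the "is the mod-gearman worker on queue 'service' running?" message.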

When the issue has occurred, it takes a few attempts at restarting Nagios (stop nagios, gearmand, nagiosxi, ndo2db, mysqld, then start them in the reverse order). Initially after a restart the event queue indicates that events are being passed, but after a couple of minutes the events start bunching up and failing with the orphaned message.
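For reference, the stop/start sequence described above can be written down as a small Python sketch. It only builds and prints the commands; it does not run them. It assumes the names in the post are the init script names on RHEL6 and that `service <name> stop|start` is how they are controlled. `restart_commands` is a hypothetical helper.

```python
# Stop order as listed in the post; the start order is simply the reverse.
STOP_ORDER = ["nagios", "gearmand", "nagiosxi", "ndo2db", "mysqld"]


def restart_commands(stop_order):
    """Return the commands for a full stop, followed by a start in reverse order."""
    stops = [["service", name, "stop"] for name in stop_order]
    starts = [["service", name, "start"] for name in reversed(stop_order)]
    return stops + starts


if __name__ == "__main__":
    # Print the sequence for review instead of executing it.
    for cmd in restart_commands(STOP_ORDER):
        print(" ".join(cmd))
```

Keeping the order in one list means the start sequence cannot drift out of step with the stop sequence.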

I'm not sure why it settles down (maybe because I restart everything 3 times..) but after restarting the services a number of times, it seems to settle down & gain some stability.

To troubleshoot, I have made the following changes:

- Upgraded to 2012r2.9
- Followed the orphaned checks section from the support wiki
- nagios.cfg - use_retained_scheduling_info=0 (This seemed to distribute the checks more evenly)
- /etc/mod_gearman/mod_gearman_neb.conf - use_uniq_jobs=off

The issue seems to take at least 24-72 hours to surface, so at the moment I am waiting.

Here is a current pic of the Event Queue:

currentEventQueue.jpg

Previously the graph was quite a sharp sawtooth pattern.

Appreciate any advice.

Thanks

Lincoln

Post by **BanditBBS** » Fri Feb 14, 2014 8:31 pm

I had an issue this past weekend after a network outage. After restarting all the workers and the nagios process, most checks were still orphaned. I rebooted NagiosXI and then half came back up fine. After that, I ended up having to reschedule the next active check of each individual service before they would work. It just kept saying orphaned every time the check ran until I scheduled it manually, and then it started working.

Hope you figure something out so it can possibly help me next time....good luck!

Post by **lance** » Sat Feb 15, 2014 12:14 am

Yeah - I had the same symptom with needing to re-schedule active checks. But I found that after changing the nagios.cfg setting:

from
use_retained_scheduling_info=1
to
use_retained_scheduling_info=0

and restarting Nagios, the checks kicked themselves off. I haven't reverted this change yet.

Post by **abrist** » Mon Feb 17, 2014 12:14 pm

lance wrote: from
use_retained_scheduling_info=1
to
use_retained_scheduling_info=0

This may be the best "solution" as of right now. It should force all your checks into a pending state after a restart of the nagios service, so it is not optimal.

Post by **lance** » Tue Feb 18, 2014 12:13 am

OK,

so I have noticed that the event queue returns to what seems to be the "normal" pattern over time. The issue is that when the checks start failing and becoming orphaned, they come back as critical and we get a bucket load of notifications (including to our event console, so it looks like an environment meltdown...). This has happened only twice, and not since we made the config changes above.

Currently we seem to be running OK, but I'm a bit hesitant to cut over to the new instance just yet, as I'm still unsure whether the changes we made while troubleshooting actually rectified the issue.

So I am just wondering if this is a potential issue that you guys are aware of and/or working through. Just keen to move to the new distributed solution!!

Just to confirm though - I haven't seen the issue since we did the troubleshooting on the weekend, and the checks have been running for almost a day and a half without issue.

Thanks

Lincoln

Post by **abrist** » Tue Feb 18, 2014 10:21 am

lance wrote: Issue I get is that when the checks start failing & becoming orphaned, they come back as critical & we get a bucket load of notifications

If you are experiencing orphaned checks, you may want to check the system ulimits:
http://support.nagios.com/wiki/index.ph ... g_Orphaned
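As a quick sanity check of the limits in effect, something like the following Python sketch can be run as the user in question. It uses the standard `resource` module and reports only the limits of the process that runs it, so it is an assumption that these match what the nagios and worker daemons actually get. Which limits matter here (open files and max processes) is also an assumption on my part; the wiki page is the authority on the values to set.

```python
import resource


def current_limits():
    """Return {label: (soft, hard)} for limits of the current process."""
    wanted = {
        "open files (nofile)": resource.RLIMIT_NOFILE,
        "max user processes (nproc)": resource.RLIMIT_NPROC,
    }
    return {label: resource.getrlimit(res) for label, res in wanted.items()}


def show(value):
    """Render RLIM_INFINITY as 'unlimited', anything else as a number."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


if __name__ == "__main__":
    for label, (soft, hard) in current_limits().items():
        print(f"{label}: soft={show(soft)} hard={show(hard)}")
```

This is roughly what `ulimit -n` and `ulimit -u` report in a shell, just in one place.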

Post by **lance** » Tue Feb 18, 2014 2:48 pm

Hi,

Yeah - I've followed that, and as far as I can tell the settings have been applied. I will let you know if we see the issue occur again. We certainly haven't seen it since those changes (and the other stuff we did) were made.

Thanks
Lincoln

Post by **lmiltchev** » Tue Feb 18, 2014 2:56 pm

Sure, let us know if the issue comes up again.

Post by **lance** » Fri Feb 28, 2014 10:25 pm

Hi,

Unfortunately we still seem to be having issues with losing the worker nodes. I was looking at the versions installed, and it seems that the version of Mod Gearman installed by the script is 1.3.8. Is there any issue with upgrading Mod Gearman to the latest version hosted by modgearman.org (1.4.14) as a troubleshooting measure? It seems there are RPMs available for RHEL6 64-bit.

thanks

Lincoln

Post by **slansing** » Mon Mar 03, 2014 10:51 am

There should be no issue updating. I've not personally tested that version with XI, but it should work fine. I suggest setting this up on your test XI server.

Nagios Support Forum