Nagios XI Event queue stalling? (Mod Gearman)

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
lance
Posts: 38
Joined: Wed Feb 17, 2010 5:00 pm

Nagios XI Event queue stalling? (Mod Gearman)

Post by lance »

Hi,

Apologies for the long post. We're having an issue where all checks fail (marked as orphaned) after a period of time on our new Nagios deployment.


Overview:
Fresh Nagios XI deployment with Mod Gearman: 1 master node and 3 worker nodes. All are VMs on ESXi 5 with 4 CPUs, 4 GB of memory and loads of disk, running RHEL 6.
We've been migrating from an NSCA/NRDP deployment and have basically done a side-by-side deployment, copying all the relevant configs from our legacy Nagios instances.

It's been running for 2 months without incident while we migrated all the host/service checks from the legacy deployment. We've ramped up the number of host and service checks over the past week to around 800 hosts and close to 7000 services.

Since last Thursday (pretty much when we loaded it up) we've had 2 incidents where the Event Queue bunches up its checks (it basically says there are 7k checks scheduled immediately, see pic).
MonitoringEngineEventQueue.jpg
Watching the logs on the workers, the only lines that appear are the following:
mod_gearman_worker-log.jpg
All the checks go critical with "service/host check orphaned, is the mod-gearman worker on queue 'service' running?"

I/O on the master server seems OK, as does disk space.

It just seems like for some reason the workers are unable to retrieve the jobs from the master.
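
One way to test that theory is to ask gearmand how many jobs are waiting and how many workers are attached to each queue. The following is only a minimal Python sketch, assuming gearmand listens on the default port 4730 and answers the plain-text `status` admin command with tab-separated lines (function name, total jobs, running jobs, available workers) terminated by a lone `.`. The helper names, the host name and the sample numbers are made up for illustration and are not taken from this deployment.

```python
import socket


def parse_status(text):
    """Parse a gearmand 'status' reply into a list of per-queue dicts."""
    queues = []
    for line in text.splitlines():
        line = line.strip()
        if line == ".":  # a lone dot terminates the reply
            break
        parts = line.split("\t")
        if len(parts) != 4:  # skip anything that is not a status row
            continue
        name, total, running, workers = parts
        queues.append({
            "name": name,
            "total": int(total),
            "running": int(running),
            "workers": int(workers),
        })
    return queues


def stalled_queues(queues):
    """Names of queues that hold jobs but have no worker attached."""
    return [q["name"] for q in queues if q["total"] > 0 and q["workers"] == 0]


def fetch_status(host="localhost", port=4730, timeout=5.0):
    """Send 'status' to gearmand and return the raw reply as text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"status\n")
        data = b""
        while not data.endswith(b".\n"):
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    return data.decode("ascii", errors="replace")


# Made-up sample reply, used here instead of a live connection.
SAMPLE = "service\t7000\t0\t0\nhost\t12\t4\t30\ncheck_results\t0\t0\t1\n.\n"
print(stalled_queues(parse_status(SAMPLE)))

# Against a live server (hypothetical host name):
# print(stalled_queues(parse_status(fetch_status("master-host"))))
```

If a queue such as `service` shows thousands of waiting jobs and zero workers, that would point at the workers having lost their connection rather than at the scheduler.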

When the issue has occurred, it takes a few attempts at restarting Nagios to recover (stop the nagios, gearmand, nagiosxi, ndo2db and mysqld services, then start them in the reverse order). Initially after a restart the event queue indicates that events are being processed, but after a couple of minutes the events start bunching up and failing with the orphaned message.

I'm not sure why it settles down (maybe because I restart everything 3 times), but after restarting the services a number of times, it seems to settle down and gain some stability.

To troubleshoot, I have made the following changes:

- Upgraded to 2012r2.9
- Followed the orphaned checks section from the support wiki
- nagios.cfg - use_retained_scheduling_info=0 (this seemed to distribute the checks more evenly)
- /etc/mod_gearman/mod_gearman_neb.conf - use_uniq_jobs=off
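
To confirm that the two settings above are really what is on disk after an upgrade or restart, a small check can help. This is a minimal Python sketch, assuming both files use plain `key=value` lines with `#` comments. The function name and the sample text are illustrative. It returns the last assignment it finds, and whether the daemons themselves prefer the first or last duplicate is not something verified here.

```python
def read_setting(config_text, key):
    """Return the last value assigned to `key` in key=value text, or None."""
    value = None
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # skip blanks and comments
            continue
        name, sep, val = line.partition("=")
        if sep and name.strip() == key:
            value = val.strip()
    return value


# Illustrative snippets standing in for the real files.
SAMPLE_NAGIOS = """
# state retention
retain_state_information=1
use_retained_scheduling_info=0
"""
SAMPLE_NEB = "use_uniq_jobs=off\n"

print(read_setting(SAMPLE_NAGIOS, "use_retained_scheduling_info"))
print(read_setting(SAMPLE_NEB, "use_uniq_jobs"))
```

In practice the text would come from reading the main Nagios config file and `/etc/mod_gearman/mod_gearman_neb.conf`.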


The issue seems to take at least 24-72 hours to surface, so at the moment I am waiting.

Here is a current pic of the Event Queue:
currentEventQueue.jpg
Previously the graph was quite a sharp sawtooth-type pattern.

Appreciate any advice

Thanks

Lincoln
Last edited by lance on Tue Feb 18, 2014 2:44 pm, edited 1 time in total.
BanditBBS
Posts: 2474
Joined: Tue May 31, 2011 12:57 pm
Location: Scio, OH

Re: Nagios XI Event queue stalling? (Mod Gearman)

Post by BanditBBS »

I had an issue this past weekend after a network outage. After restarting all the workers and the nagios process, most checks were still orphaned. I rebooted Nagios XI and then half came back up fine. After that, I ended up having to reschedule the next active check of each individual service, and then they would work. It just kept saying orphaned every time the check ran until I scheduled it manually, and then it started working.

Hope you figure something out so it can possibly help me next time....good luck!
lance
Posts: 38
Joined: Wed Feb 17, 2010 5:00 pm

Re: Nagios XI Event queue stalling? (Mod Gearman)

Post by lance »

Yeah - I had the same symptom with needing to reschedule active checks. But I found that after changing the nagios.cfg setting

from
use_retained_scheduling_info=1
to
use_retained_scheduling_info=0

and restarting Nagios, the checks kicked themselves off. I haven't reverted this change yet.
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: Nagios XI Event queue stalling? (Mod Gearman)

Post by abrist »

lance wrote: from
use_retained_scheduling_info=1
to
use_retained_scheduling_info=0
This may be the best "solution" as of right now. It should force all your checks into a pending state after a restart of the nagios service, so it is not optimal.
Former Nagios employee
lance
Posts: 38
Joined: Wed Feb 17, 2010 5:00 pm

Re: Nagios XI Event queue stalling? (Mod Gearman)

Post by lance »

OK,

I have noticed that the event queue returns to what seems to be the "normal" pattern over time. The issue I get is that when the checks start failing and becoming orphaned, they come back as critical and we get a bucket load of notifications (including to our event console - it looks like an environment meltdown). This has happened only twice, and not since we made the config changes above.

Currently we seem to be running OK, but I'm a bit hesitant to cut over to the new instance as of yet as I'm still a bit unsure if the changes we did in troubleshooting actually rectified the issue.

So I am just wondering if this is a potential issue that you guys are aware of and/or working through. We're just keen to move to the new distributed solution!

Just to confirm though - I haven't seen the issue since we did the troubleshooting on the weekend, and the checks have been running for almost a day and a half without issue.

Thanks

Lincoln
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: Nagios XI Event queue stalling? (Mod Gearman)

Post by abrist »

lance wrote: Issue I get is that when the checks start failing & becoming orphaned, they come back as critical & we get a bucket load of notifications
If you are experiencing orphaned checks, you may want to check the system ulimits:
http://support.nagios.com/wiki/index.ph ... g_Orphaned
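
As a quick way to see which limits actually apply, here is a minimal Python sketch using the standard `resource` module. It assumes a Linux host and reports the limits of the process that runs it, so it would need to be run as the same user as the Nagios or worker processes to be meaningful. Which limits matter for orphaned checks is an assumption here (open files and max processes are common suspects), and the helper names are illustrative.

```python
import resource


def fmt(value):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def limit_report():
    """Return {label: (soft, hard)} for a few limits of the current process."""
    limits = {
        "open files": resource.RLIMIT_NOFILE,
        "max processes": resource.RLIMIT_NPROC,
    }
    return {label: resource.getrlimit(res) for label, res in limits.items()}


for label, (soft, hard) in limit_report().items():
    print(f"{label}: soft={fmt(soft)} hard={fmt(hard)}")
```

A low soft limit on open files or processes for the monitoring user would be worth raising before looking further.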
lance
Posts: 38
Joined: Wed Feb 17, 2010 5:00 pm

Re: Nagios XI Event queue stalling? (Mod Gearman)

Post by lance »

Hi,

Yeah - I've followed that and as far as I can tell the changes have been applied. Will let you know if we see the issue occur again. We certainly haven't seen it since those changes (and the other stuff we did) were made.

Thanks
Lincoln
Last edited by lance on Tue Feb 18, 2014 5:11 pm, edited 1 time in total.
lmiltchev
Former Nagios Staff
Posts: 13589
Joined: Mon May 23, 2011 12:15 pm

Re: Nagios XI Event queue stalling? (Mod Gearman)

Post by lmiltchev »

Sure, let us know if the issue comes up again.
lance
Posts: 38
Joined: Wed Feb 17, 2010 5:00 pm

Re: Nagios XI Event queue stalling? (Mod Gearman)

Post by lance »

Hi,

Unfortunately we still seem to be having issues with losing the worker nodes. I was looking at the versions installed, and it seems that the version of Mod Gearman installed by the script is 1.3.8. Is there any issue with upgrading Mod Gearman to the latest version hosted by modgearman.org (1.4.14) as a troubleshooting measure? It seems there are RPMs available for 64-bit RHEL 6.

thanks

Lincoln
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: Nagios XI Event queue stalling? (Mod Gearman)

Post by slansing »

There should be no issue updating. I've not personally tested that version with XI, but it should work fine. I suggest setting this up on your test XI server.