Hi,
Apologies for the long post. We're having an issue with all checks failing (marked as orphaned) after a period of time on our new Nagios deployment.
Overview:
Fresh Nagios XI deployment with Mod Gearman: 1 master node and 3 worker nodes. All are VMs on ESXi 5 (4 x CPU, 4 GB memory & loads of disk), running RHEL6.
We've gone through a process of migrating from an NSCA/NRDP deployment and have basically done a side-by-side deployment, copying all the relevant configs from our legacy Nagios instances.
It's been running for 2 months without incident while we migrated all the host/service checks from the legacy deployment. We've ramped up the number of host & service checks over the past week to around 800 hosts & close to 7000 services.
Since last Thursday (pretty much when we loaded it up) we've had 2 incidents where the Event Queue bunches up its checks (basically it says there are 7k scheduled immediately, see pic).
Watching the logs on the workers, the only lines that appear are the following:
All the checks go critical with "service/host check orphaned, is the mod-gearman worker on queue 'service' running?"
I/O on the master server seems OK, as does disk space.
It just seems like, for some reason, the workers are unable to retrieve the jobs from the master.
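One way to test the "workers can't retrieve jobs" theory is to ask gearmand directly how many jobs are waiting per queue and how many workers are registered on each. The sketch below uses gearmand's plain-text admin command "status", which returns tab-separated rows of queue name, total jobs, running jobs and available workers, ending with a lone dot. The host, the port (4730 is the usual gearmand default) and the sample numbers are assumptions for illustration, not values from this deployment.

```python
import socket


def parse_status(text):
    """Parse gearmand 'status' output into {queue: (total, running, workers)}."""
    queues = {}
    for line in text.splitlines():
        line = line.strip()
        if line == ".":  # a lone dot terminates the response
            break
        parts = line.split("\t")
        if len(parts) != 4:
            continue  # skip anything that is not a status row
        name, total, running, workers = parts
        queues[name] = (int(total), int(running), int(workers))
    return queues


def stalled_queues(queues):
    """Return queues that have jobs waiting but no workers registered."""
    return sorted(
        name
        for name, (total, running, workers) in queues.items()
        if total > running and workers == 0
    )


def fetch_status(host="127.0.0.1", port=4730, timeout=5.0):
    """Query a gearmand instance for its status (host/port are assumptions)."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"status\n")
        data = b""
        while not data.endswith(b".\n"):
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    return data.decode("ascii", errors="replace")


# Made-up sample output for illustration, not taken from the thread.
SAMPLE = "service\t7000\t0\t0\nhost\t12\t4\t30\ncheck_results\t0\t0\t1\n.\n"

if __name__ == "__main__":
    print(stalled_queues(parse_status(SAMPLE)))
```

On the master you would call parse_status(fetch_status()) instead of using SAMPLE. A "service" queue with thousands of waiting jobs and zero workers would point at the workers having dropped off rather than at Nagios itself.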
When the issue has occurred, it takes a few attempts at restarting Nagios (service stop for nagios, gearmand, nagiosxi, ndo2db, mysqld, then start in the reverse order). Initially after a restart the event queue indicates that events are being processed, but after a couple of minutes the events start bunching up and failing with the orphaned message.
I'm not sure why it settles down (maybe because I restart everything 3 times...), but after restarting the services a number of times it seems to settle down & gain some stability.
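For anyone repeating the restart, the stop-then-reverse-start order described above can be written down as a small sketch. It only builds and prints the commands. It assumes the RHEL6 "service <name> <action>" syntax and that the init script names match the ones listed in the post.

```python
# Stop order as listed in the post; the start order is simply its reverse.
STOP_ORDER = ["nagios", "gearmand", "nagiosxi", "ndo2db", "mysqld"]


def restart_plan(stop_order=STOP_ORDER):
    """Return the 'service' commands for a full stop, then a reverse-order start."""
    stops = [["service", name, "stop"] for name in stop_order]
    starts = [["service", name, "start"] for name in reversed(stop_order)]
    return stops + starts


if __name__ == "__main__":
    # Only print the plan; actually running it would need root on the XI server.
    for cmd in restart_plan():
        print(" ".join(cmd))
```

Keeping the order in one list means the database (mysqld) is the last thing stopped and the first thing started, which matches the sequence described in the post.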
To troubleshoot, I have made the following changes:
- Upgraded to 2012r2.9
- Followed the orphaned checks section from the support wiki
- nagios.conf - use_retained_scheduling_info=0 (This seemed to distribute the checks more evenly)
- /etc/mod_gearman/mod_gearman_neb.conf - use_uniq_jobs=off
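Since the issue takes days to resurface, it may be worth confirming that the two settings listed above are really what is on disk. Below is a minimal sketch that parses key=value config text and reports any setting that differs from what you expect. The sample contents are made up for illustration; on a real server you would read the actual files instead.

```python
def read_settings(text):
    """Parse simple key=value config text, ignoring blanks and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def mismatches(settings, expected):
    """Return {key: (expected, actual)} for every setting that differs."""
    return {
        key: (want, settings.get(key))
        for key, want in expected.items()
        if settings.get(key) != want
    }


# Illustrative file contents, not copied from a real server.
NAGIOS_SAMPLE = "# main config\nuse_retained_scheduling_info=1\n"
NEB_SAMPLE = "use_uniq_jobs=off\n"

if __name__ == "__main__":
    print(mismatches(read_settings(NAGIOS_SAMPLE), {"use_retained_scheduling_info": "0"}))
    print(mismatches(read_settings(NEB_SAMPLE), {"use_uniq_jobs": "off"}))
```

An empty result means the file already matches. A missing key shows up with None as the actual value.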
The issue seems to take at least 24-72 hrs to surface, so at the moment I am waiting.
Here is a current pic of the Event Queue:
Previously the graph was quite a sharp sawtooth type pattern
Appreciate any advice
Thanks
Lincoln
Nagios XI Event queue stalling? (Mod Gearman)
Last edited by lance on Tue Feb 18, 2014 2:44 pm, edited 1 time in total.
Re: Nagios XI Event queue stalling? (Mod Gearman)
I had an issue this past weekend after a network outage. After restarting all the workers and the nagios process, most checks were still orphaned. I rebooted Nagios XI and then half came back up fine. After that, I ended up having to reschedule the next active check of each individual service, and then they would work. It just kept saying orphaned every time a check ran until I scheduled it manually, and then it started working.
Hope you figure something out so it can possibly help me next time....good luck!
Re: Nagios XI Event queue stalling? (Mod Gearman)
Yeah - I had the same symptom with needing to re-schedule active checks. But I found that after changing the nagios.conf setting
from
use_retained_scheduling_info=1
to
use_retained_scheduling_info=0
& restarting Nagios, the checks kicked themselves off. I haven't reverted this change yet.
Re: Nagios XI Event queue stalling? (Mod Gearman)
lance wrote:
from
use_retained_scheduling_info=1
to
use_retained_scheduling_info=0
This may be the best "solution" as of right now. It should force all your checks to a pending state after a restart of the nagios service, so it is not optimal.
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
Re: Nagios XI Event queue stalling? (Mod Gearman)
OK,
so I have noticed that the event queue returns to what seems to be the "normal" pattern over time. The issue I get is that when the checks start failing & becoming orphaned, they come back as critical & we get a bucket load of notifications (including to our event console, so it looks like an environment meltdown...). This has happened only twice, and not since we made the config changes above.
Currently we seem to be running OK, but I'm a bit hesitant to cut over to the new instance just yet, as I'm still unsure whether the changes we made while troubleshooting actually rectified the issue.
So I'm just wondering if this is a potential issue that you guys are aware of &/or working through. Just keen to move to the new distributed solution!!
Just to confirm though - I haven't seen the issue since we did the troubleshooting on the weekend, and the checks have been running for almost a day and a half without issue.
Thanks
Lincoln
Re: Nagios XI Event queue stalling? (Mod Gearman)
lance wrote: Issue I get is that when the checks start failing & becoming orphaned, they come back as critical & we get a bucket load of notifications
If you are experiencing orphaned checks, you may want to check the system ulimits:
http://support.nagios.com/wiki/index.ph ... g_Orphaned
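As a quick sanity check on ulimits, the sketch below prints the open-file and process limits that apply to whichever user runs it, using Python's standard resource module (Unix only). It makes no assumption about what values the wiki article recommends. To see the limits the monitoring processes actually get, it would need to be run as the same user that runs Nagios and the workers.

```python
import resource  # Unix only


def current_limits():
    """Return (soft, hard) limits for open files and processes of this process."""
    return {
        "open_files": resource.getrlimit(resource.RLIMIT_NOFILE),
        "processes": resource.getrlimit(resource.RLIMIT_NPROC),
    }


def describe(value):
    """Render a limit value, showing 'unlimited' for RLIM_INFINITY."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


if __name__ == "__main__":
    for name, (soft, hard) in current_limits().items():
        print(f"{name}: soft={describe(soft)} hard={describe(hard)}")
```

The soft limit is the one a process actually hits. A low soft limit on processes or open files for the nagios user is one plausible way for check execution to start failing under load.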
Re: Nagios XI Event queue stalling? (Mod Gearman)
Hi,
Yeah - I've followed that and as far as I can tell the changes have been applied. Will let you know if we see the issue occur again. Certainly haven't seen it since those changes (& the other stuff we did) were made.
Thanks
Lincoln
Last edited by lance on Tue Feb 18, 2014 5:11 pm, edited 1 time in total.
Re: Nagios XI Event queue stalling? (Mod Gearman)
Sure, let us know if the issue comes up again.
Re: Nagios XI Event queue stalling? (Mod Gearman)
Hi,
Unfortunately we still seem to be getting issues with losing the worker nodes. I was looking at the versions installed & it seems that the version of Mod Gearman installed by the script is 1.3.8. Is there any issue with upgrading Mod Gearman to the latest version hosted by modgearman.org (1.4.14) as a troubleshooting measure? It seems there are RPMs available for RHEL6 64-bit.
thanks
Lincoln
slansing
Re: Nagios XI Event queue stalling? (Mod Gearman)
There should be no issue updating. I've not personally tested that version with XI, but it should work fine. I suggest trying this on your test XI server first.