Nagios XI Event queue stalling? (Mod Gearman)
Posted: Fri Feb 14, 2014 8:03 pm
Hi,
Apologies for the long post. We're having an issue where all checks fail (marked as orphaned) after a period of time on our new Nagios deployment.
Overview:
Fresh Nagios XI deployment with Mod Gearman: 1 master node and 3 worker nodes. All are VMs on ESXi 5 with 4 x CPU, 4 GB memory and plenty of disk, running RHEL 6.
We've been migrating from an NSCA/NRDP deployment and have basically done a side-by-side deployment, copying all the relevant configs from our legacy Nagios instances.
It's been running for 2 months without incident while we migrated all the host/service checks from the legacy deployment. We've ramped up the number of host & service checks over the past week to around 800 hosts & close to 7000 services.
Since last Thursday (pretty much when we loaded it up) we've had 2 incidents where the Event Queue bunches up its checks (it basically says there are 7k checks scheduled immediately, see pic).
Watching the logs on the workers, the only lines that appear are the following:
All the checks go critical with "service/host check orphaned, is the mod-gearman worker on queue 'service' running?"
I/O on the master server seems OK, as does disk space.
It just seems like, for some reason, the workers are unable to retrieve the jobs from the master.
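One way to check whether workers are actually attached to the queues is to ask gearmand for its queue status. The sketch below assumes gearmand's plain-text admin `status` command on the default port 4730, which returns tab-separated lines (queue name, total jobs, running jobs, available workers) ending with a line containing only a dot. The helper names (`parse_status`, `starved_queues`, `fetch_status`) and the host value are hypothetical, not part of Mod Gearman.

```python
import socket
from typing import Dict, List, NamedTuple


class QueueStat(NamedTuple):
    total: int    # jobs currently in the queue (waiting + running)
    running: int  # jobs being processed right now
    workers: int  # workers available for this queue


def parse_status(text: str) -> Dict[str, QueueStat]:
    """Parse the text returned by gearmand's admin `status` command."""
    stats: Dict[str, QueueStat] = {}
    for line in text.splitlines():
        line = line.strip()
        if line == ".":          # a lone dot terminates the response
            break
        parts = line.split("\t")
        if len(parts) != 4:      # skip blank or unexpected lines
            continue
        name, total, running, workers = parts
        stats[name] = QueueStat(int(total), int(running), int(workers))
    return stats


def starved_queues(stats: Dict[str, QueueStat]) -> List[str]:
    """Queues that hold jobs but have no worker to pick them up."""
    return sorted(n for n, s in stats.items() if s.total > 0 and s.workers == 0)


def fetch_status(host: str = "127.0.0.1", port: int = 4730,
                 timeout: float = 5.0) -> str:
    """Send `status` to gearmand and read until the terminating dot line."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"status\n")
        data = b""
        while not data.endswith(b".\n"):
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    return data.decode("ascii", "replace")
```

Running `starved_queues(parse_status(fetch_status()))` on the master while the event queue is backed up would show whether the 'service' queue has jobs waiting with zero workers, or whether workers are attached but not draining it.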
When the issue has occurred, it takes a few attempts at restarting Nagios to recover (service stop for nagios, gearmand, nagiosxi, ndo2db, mysqld, then start in the reverse order). Initially after a restart the event queue indicates that events are being passed, but after a couple of minutes the events start bunching up and failing with the orphaned message.
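The restart sequence described above can be sketched as a small script so the order is the same every time. This assumes the RHEL 6 `service <name> stop|start` syntax and the five service names listed in this post; `restart_plan` and `run_plan` are hypothetical helper names, and the default is a dry run that only prints the commands.

```python
import subprocess
from typing import List, Sequence

# Stop order as described in the post; start order is the reverse.
STOP_ORDER = ["nagios", "gearmand", "nagiosxi", "ndo2db", "mysqld"]


def restart_plan(stop_order: Sequence[str] = STOP_ORDER) -> List[List[str]]:
    """Build the stop commands, followed by start commands in reverse order."""
    stops = [["service", name, "stop"] for name in stop_order]
    starts = [["service", name, "start"] for name in reversed(stop_order)]
    return stops + starts


def run_plan(plan: List[List[str]], dry_run: bool = True) -> None:
    """Print each command, or execute it when dry_run is False (needs root)."""
    for cmd in plan:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # abort on the first failure


if __name__ == "__main__":
    run_plan(restart_plan())
```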
I'm not sure why it settles down (maybe because I restart everything 3 times...), but after restarting the services a number of times it seems to settle down & gain some stability.
To troubleshoot, I have made the following changes:
- Upgraded to 2012r2.9
- Followed the orphaned checks section from the support wiki
- nagios.cfg - use_retained_scheduling_info=0 (this seemed to distribute the checks more evenly)
- /etc/mod_gearman/mod_gearman_neb.conf - use_uniq_jobs=off
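To confirm the two config changes above are actually in place on the master, a small checker can read the `key=value` lines and report anything that differs. This is a minimal sketch: it assumes `#` comment lines, lets the last occurrence of a key win, and the function names (`parse_settings`, `mismatches`) are hypothetical. The file contents would be read from the main Nagios config file and /etc/mod_gearman/mod_gearman_neb.conf.

```python
from typing import Dict, Optional


def parse_settings(text: str) -> Dict[str, str]:
    """Collect key=value pairs, ignoring blank lines and # comments."""
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()  # assumption: last occurrence wins
    return settings


def mismatches(text: str, expected: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Return {key: actual value or None} for settings that differ from expected."""
    actual = parse_settings(text)
    return {k: actual.get(k) for k, v in expected.items() if actual.get(k) != v}


# Values taken from the changes listed in this post.
EXPECTED_NAGIOS = {"use_retained_scheduling_info": "0"}
EXPECTED_NEB = {"use_uniq_jobs": "off"}
```

An empty result from `mismatches(open(path).read(), EXPECTED_NEB)` means the setting is present with the intended value; a `None` value means the key was not found in the file at all.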
The issue seems to take 24-72 hours to surface, so at the moment I am waiting.
Here is a current pic of the Event Queue. Previously the graph was quite a sharp sawtooth-type pattern.
Appreciate any advice
Thanks
Lincoln