Nagios XI 2014 Migrating Off Mod Gearman

Post by **mrochelle** » Wed Jun 18, 2014 4:11 pm

I recently completed the migration of 28 servers to Nagios XI 2014R1.1. After seeing the performance improvements, I debated whether to move off Mod Gearman and finally decided to do so after some minor display problems in the Thruk browser, caused by the modifications Mod Gearman requires on the 2014R1.1 release. I actually believe the Nagios XI 2014 release runs better without Mod Gearman once it is properly tuned. Since it took several iterations of tuning to stabilize some of my larger server configurations, I thought I would share my experience in an effort to save others some time.

Our main Nagios server configuration used Mod Gearman with 3 remote workers.
Nagios XI Version : 2014R1.1
nagprod01.cellnet.com 2.6.32-279.11.1.el6.x86_64 x86_64
CentOS release 6.3 (Final)
nagios (pid 7793) is running...
NPCD running (pid 23335).
ndo2db (pid 2568) is running...
CPU Load 15: 3.09
Total Hosts: 1907
Total Services: 9263
8 Core CPU with 16GB memory

To remove Mod Gearman, the following modifications were implemented:
Modifications to /usr/local/nagios/etc/nagios.cfg
check_workers=6 - The default 4 workers would not keep up
Commented out the 2 "embedded_perl" entries
max_host_check_spread=60 - Required to provide a little more time to ramp up from a cold start.
max_service_check_spread=60 - Required to provide a little more time to ramp up from a cold start.
use_retained_program_state=1
use_retained_scheduling_info=1
Commented out the broker_module= line for Mod Gearman
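
The two "commented out" steps above could be scripted roughly as follows. This is only a sketch under assumptions: the function name is hypothetical, the broker line is assumed to contain the string "mod_gearman", and the two embedded Perl entries are assumed to be enable_embedded_perl and use_embedded_perl_implicitly. Check your own nagios.cfg before relying on it.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nagios XI): comment out the Mod Gearman
# broker_module line and the embedded Perl entries in a nagios.cfg-style file.
comment_out_gearman() {
    cfg="$1"
    # Keep a backup next to the file before changing anything.
    cp -p "$cfg" "$cfg.bak" || return 1
    # Rewrite the original in place from the backup, so the file keeps its
    # owner and permissions. Assumption: the broker line mentions "mod_gearman".
    sed -E \
        -e 's/^(broker_module=.*mod_gearman.*)$/#\1/' \
        -e 's/^((enable_embedded_perl|use_embedded_perl_implicitly)=.*)$/#\1/' \
        "$cfg.bak" > "$cfg"
}
```

It would be called as comment_out_gearman /usr/local/nagios/etc/nagios.cfg, and the .bak copy gives a quick way back if Nagios fails to restart. Other broker_module lines are left untouched.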

service restart nagios
service gearmand stop
service mod_gearman_worker stop
chkconfig gearmand off - Prevents the process from starting at boot
chkconfig mod_gearman_worker off - Prevents the process from starting at boot
__________________________________________

Modifications to /usr/local/nagios/etc/pnp/npcd.cfg
load_threshold = 80.0 (10 times the number of CPU cores)
Removing Mod Gearman increases the system load and NPCD will shut down if this threshold is exceeded.
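
The "10 times the number of CPU cores" rule of thumb can be worked out with a short shell function. This is a sketch, not an official formula. The function name is hypothetical, and it assumes getconf can report the online CPU count when no core count is passed in.

```shell
#!/bin/sh
# Hypothetical helper: print an npcd.cfg load_threshold line using the
# 10-times-cores rule of thumb from this post.
npcd_load_threshold() {
    # Use the core count given as $1, otherwise ask the system.
    cores="${1:-$(getconf _NPROCESSORS_ONLN)}"
    echo "load_threshold = $((cores * 10)).0"
}
```

For the 8-core server described above, npcd_load_threshold 8 prints the same value used in this post.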
__________________________________________

Modification to /usr/local/nagios/etc/pnp/process_perfdata.cfg
TIMEOUT = 30 - Prevents timeouts while collecting perfdata under increased system load
__________________________________________

Once all modifications are made, a system restart helps ensure a clean start and a stable run.
shutdown -r 0 - Restarts the server with a clean, non-Mod Gearman configuration
Current status of the server after removal of Mod Gearman:

[Attached screenshot: Nagios2014.PNG]

On smaller server configurations with 2 to 4 core CPUs and 2 to 4 GB memory, I needed to throttle down the default Nagios 2014 application to keep from overloading the server. Modify the above changes as follows for the smaller servers:
Modifications to /usr/local/nagios/etc/nagios.cfg
check_workers=1 – The default 4 workers would overload the server
max_host_check_spread=120 - Required to provide a little more time to ramp up from a cold start.
max_service_check_spread=120 - Required to provide a little more time to ramp up from a cold start.
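
The two tuning profiles in this post (large and small servers) can be summarized in a small function that prints the nagios.cfg lines for a given core count. This is only a sketch: the function name is hypothetical, and the cutoff at 4 cores is an assumption, since the post only covers 2 to 4 core and 8 core servers. Anything in between would need its own testing.

```shell
#!/bin/sh
# Hypothetical helper: print the nagios.cfg values reported in this post
# for a given number of CPU cores.
nagios_tuning_profile() {
    cores="$1"
    if [ "$cores" -le 4 ]; then
        # Small servers (2 to 4 cores): throttle down.
        workers=1
        spread=120
    else
        # Assumption: larger servers get the 8-core values from this post.
        workers=6
        spread=60
    fi
    printf 'check_workers=%s\n' "$workers"
    printf 'max_host_check_spread=%s\n' "$spread"
    printf 'max_service_check_spread=%s\n' "$spread"
}
```

The output is meant as a starting point to compare against the existing nagios.cfg, not something to paste in blindly.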

I also experimented with the nagios.cfg setting max_concurrent_checks, but the changes appeared to have no effect. Comments on this setting are welcome.

Post by **lmiltchev** » Wed Jun 18, 2014 4:40 pm

Thanks for the feedback! I am curious to see if other Nagios XI users (with large installs) will report similar results.

Post by **mikew** » Thu Jun 19, 2014 11:12 am

Just my experience with max_concurrent_checks:

In all of my testing, the change that created the most instability was probably messing with max_concurrent_checks. I found that the best solution was to leave it at "0" and let it do as much as it could.

On the flip side, the one thing I have done in larger installs that created the most stability was exactly what you did: eliminating embedded Perl. There seem to be too many differences in Perl.

I have also been experimenting with check_workers, raising it to 30 and 40 on larger systems and watching them, with no issues so far.

Thanks for the data on Mod_Gearman as many will be making this same decision.

Post by **slansing** » Thu Jun 19, 2014 12:46 pm

This is some pretty impressive information. Have you tried bumping your workers up a bit, as Mike mentioned? What effect did it have in your environment?

Post by **lmiltchev** » Thu Jun 19, 2014 12:59 pm

Thanks for the info, Mike! Just wanted to make a comment on eliminating embedded Perl... I don't think enabling/disabling embedded perl in the nagios.cfg would make any difference at all as we compile Core with embedded perl disabled...

./configure --with-command-group="$nagioscmdgroup" --disable-embedded-perl

I cannot remember when we started disabling it by default, but it caused many issues in the past, so I guess this was a logical way to go...

Post by **mikew** » Thu Jun 19, 2014 1:19 pm

Yes, I understand that about Perl in Nagios 2014, but I was making an overall comment for anyone using the older versions. I suggested that change to an organization with 300,000 checks, and it made an impressive difference in stability. They were using Nagios 3.

Post by **mrochelle** » Thu Jun 19, 2014 2:26 pm

In my experience, increasing the workers adds a small additional load on the Nagios system but could overload some of my monitored systems. I ran for a while with 12 workers at an average load of 3.0, which is not a problem for the Nagios server but could overload some of the systems it monitors. I dropped the workers down to 5, which moved the average load down to 1.5 and puts a manageable load on the downstream servers. This is probably unique to my environment, where I have a high number of monitors on a small spread of servers.

Post by **slansing** » Fri Jun 20, 2014 9:05 am

Yeah, the number of workers needed can vary quite widely based on the volume of checks, the rate of checks, and the hardware of the Nagios server. How are things going now with the Core workers, 3 days later? Still stable and checking/processing away?

Post by **mrochelle** » Fri Jun 20, 2014 10:12 am

Stable across all 28 servers.

Post by **lmiltchev** » Fri Jun 20, 2014 12:00 pm

@mrochelle
Sounds good.

I would encourage other users with large installs to provide us with feedback.

Nagios Support Forum
