Multiple Instances / Doubling in the messages log

Post by **nseltzer** » Thu Jan 24, 2013 6:21 pm

I kill -9'd all running forks of Nagios, blew away retention.dat (I moved it to my home dir), and restarted the box.

$ sudo mv /usr/local/nagios/var/retention.dat .

I still appear to be having issues with forks locking on me.

Code:

Nagios instances:
5
nagios    3072  4894  0 15:26 ?        00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios    4894     1  5 15:10 ?        00:03:44 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios    9896  4894  0 15:37 ?        00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios   25328  4894  0 16:06 ?        00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios   31157  4894  0 16:17 ?        00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg

Cliche alert!: Help me Nagios Support Team. You're my only hope.

Post by **scottwilkerson** » Fri Jan 25, 2013 9:03 am

I'm starting to wonder if this could be related to your mod_gearman setup.

If you run

Code:

gearman_top

can you see the checks being processed?

Post by **nseltzer** » Fri Jan 25, 2013 10:13 am

Yessir. I'm not discounting the possibility that something is breaking within mod_gearman, but the configs are almost (save for gearmand server settings) the same.

Code:

2013-01-25 08:12:30  -  localhost:4730   -  v0.25

 Queue Name                     | Worker Available | Jobs Waiting | Jobs Running
---------------------------------------------------------------------------------
 check_results                  |               2  |           0  |           0
 eventhandler                   |               5  |           0  |           0
 host                           |             386  |           0  |           2
 service                        |             386  |           0  |         115
 worker_papmoncp00.cabelas.corp |               1  |           0  |           0
 worker_papmoncp01.cabelas.corp |               1  |           0  |           0
 worker_papmoncp02.cabelas.corp |               1  |           0  |           0
 worker_papmoncp03.cabelas.corp |               1  |           0  |           0
 worker_papmoncp04.cabelas.corp |               1  |           0  |           0
 worker_papmoncp05.cabelas.corp |               1  |           0  |           0
 worker_papmoncp06.cabelas.corp |               1  |           0  |           0
 worker_papmoncp07.cabelas.corp |               1  |           0  |           0
 worker_sidhqmonm0_eventhandler |               1  |           0  |           0
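
For anyone who would rather check this table from a script than by eye, here is a minimal Python sketch that parses captured `gearman_top` text and lists queues with jobs waiting. The names `parse_gearman_top`, `backlogged`, and `SAMPLE` are made up for illustration and are not part of mod_gearman; the parser only assumes the `|`-separated layout shown above.

```python
def parse_gearman_top(text):
    """Parse rows of the form 'queue | workers | waiting | running'."""
    queues = {}
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        # Data rows have four columns, and the last three are integers.
        # The header row and the dashed separator fail this test.
        if len(parts) != 4 or not all(p.isdigit() for p in parts[1:]):
            continue
        name, workers, waiting, running = parts
        queues[name] = {
            "workers": int(workers),
            "waiting": int(waiting),
            "running": int(running),
        }
    return queues


def backlogged(queues):
    """Return the names of queues that have jobs waiting."""
    return sorted(name for name, q in queues.items() if q["waiting"] > 0)


# A few rows copied from the output above (assumed capture format).
SAMPLE = """\
 Queue Name                     | Worker Available | Jobs Waiting | Jobs Running
---------------------------------------------------------------------------------
 check_results                  |               2  |           0  |           0
 host                           |             386  |           0  |           2
 service                        |             386  |           0  |         115
"""

if __name__ == "__main__":
    queues = parse_gearman_top(SAMPLE)
    print(queues["service"])
    print(backlogged(queues))
```

An empty backlog list with non-zero "Jobs Running" matches what the table shows: checks are being picked up and nothing is queuing.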

Post by **mguthrie** » Fri Jan 25, 2013 10:24 am

Ah, thank you swilkerson, I think I found a clue. Can you turn off distributing event handlers in your gearman config? If you're using XI's notification handler, it won't be able to connect to the local database and submit any notifications.

broker_module=/usr/lib64/mod_gearman/mod_gearman.o config=/etc/mod_gearman/mod_gearman_neb.conf server=127.0.0.1:4730 keyfile=/usr/local/nagios/etc/gearman_key.txt eventhandler=yes services=yes hosts=yes

If that doesn't fix it, send us all of your mod_gearman-related configs.

Post by **gwakem** » Mon Feb 04, 2013 5:41 pm

We disabled the event handling portion of gearman entirely by removing it from both the mod_gearman_neb.conf and nagios.cfg, restarted (per below), and within ten minutes noticed the same issues with processes hanging.

/usr/local/nagios/etc/nagios.cfg:

Code:

broker_module=/usr/lib64/mod_gearman/mod_gearman.o config=/etc/mod_gearman/mod_gearman_neb.conf server=127.0.0.1:4730 keyfile=/usr/local/nagios/etc/gearman_key.txt eventhandler=no services=yes hosts=yes

/etc/mod_gearman/mod_gearman_neb.conf:

Code:

# defines if the module should distribute execution of
# eventhandlers.
eventhandler=no

I have attached the configs from a child and the master gearmand server.

Post by **mguthrie** » Tue Feb 05, 2013 11:22 am

Could this issue be caused by certain check plugins timing out? Are you having multiple *parent* processes spawn, or just forks of the Nagios process? Nagios forks itself to run checks, so for longer-running checks you'll see many child instances of it running.
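
To tell the two cases apart, a rough Python sketch like the one below can sort a `ps -ef` listing into parent daemons and forked children by looking at the PPID column. The function name `classify_nagios_processes` and the `PS_SAMPLE` constant are illustrative only; the sample rows are taken from the listing earlier in this thread, and the sketch assumes the standard `ps -ef` column order (UID, PID, PPID, ...).

```python
def classify_nagios_processes(ps_output):
    """Split `ps -ef` lines into parent daemon PIDs and forked child PIDs."""
    procs = {}
    for line in ps_output.splitlines():
        fields = line.split()
        # Assumed ps -ef columns: UID PID PPID C STIME TTY TIME CMD...
        if len(fields) < 8:
            continue
        if not (fields[1].isdigit() and fields[2].isdigit()):
            continue
        procs[int(fields[1])] = int(fields[2])
    # A fork is a process whose parent also appears in the listing.
    children = sorted(pid for pid, ppid in procs.items() if ppid in procs)
    parents = sorted(pid for pid in procs if pid not in children)
    return parents, children


PS_SAMPLE = """\
nagios    3072  4894  0 15:26 ?        00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios    4894     1  5 15:10 ?        00:03:44 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios    9896  4894  0 15:37 ?        00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios   25328  4894  0 16:06 ?        00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios   31157  4894  0 16:17 ?        00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
"""

if __name__ == "__main__":
    parents, children = classify_nagios_processes(PS_SAMPLE)
    print("parents:", parents)
    print("children:", children)
```

On that listing there is a single parent (PID 4894) and four children whose PPID is 4894, which points at lingering forks rather than multiple daemons.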

Post by **gwakem** » Tue Feb 05, 2013 11:33 am

Aha! Yes indeed, we do have a lot of WMI plugins timing out due to multiple remote side rules. I was in the process of attempting to clear those up, and this gives me additional ammo to do so. Thanks, I will see if getting that cleared out helps and let you know.
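
While the remote-side rules get sorted out, one possible stopgap is to wrap a slow plugin so it is killed after a fixed time and reports UNKNOWN instead of leaving a fork hanging. This is only a sketch: `run_check` is a hypothetical helper, the timeout values are arbitrary, and the demo command is a stand-in for a real plugin. The one thing taken from Nagios itself is the plugin convention that exit code 3 means UNKNOWN.

```python
import subprocess
import sys

UNKNOWN = 3  # Nagios plugin exit code for UNKNOWN


def run_check(cmd, timeout_s):
    """Run a plugin command and return (exit_code, output).

    If the command exceeds timeout_s seconds, subprocess.run kills it and
    we report UNKNOWN rather than waiting indefinitely.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return UNKNOWN, "UNKNOWN - check timed out after %ss" % timeout_s
    return result.returncode, result.stdout.strip()


if __name__ == "__main__":
    # Stand-in for a hanging plugin: sleeps longer than the timeout.
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    code, output = run_check(slow, timeout_s=0.5)
    print(output)
    sys.exit(code)
```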

Post by **scottwilkerson** » Tue Feb 05, 2013 12:05 pm

Hopefully this will get us down the right track. Let us know if this resolves the issue...

Post by **gwakem** » Wed Feb 06, 2013 10:19 am

This was a triumph. I'm making a note here: HUGE SUCCESS. It's hard to overstate my satisfaction.

That did it! Thanks for the help guys. We still have doubling in the /var/log/messages logfile, but that's not critical. I can open a separate post for that later. This can be closed. Thanks again!
