Nagios hangs up

whitest · Post by **whitest** » Tue Oct 06, 2015 3:00 am

Hello everyone! In some of my installations I use mod_gearman (ver. 0.25) on the central Nagios 4.0.8 server, only to receive check results from remote Nagios servers (all are 4.0.8). The remote servers submit check results through send_gearman.

From time to time gearmand stops receiving checks from all remote servers, and Nagios stops working until I restart the gearmand and nagios services on the central server. Is this an issue with Nagios, with gearmand, or with my configuration? Could anyone help me make it work reliably?

2305 hosts and 7421 services are monitored on the central Nagios. ~90% of them are received passively.

My configuration points:
1. Central server:
1.1 /usr/local/nagios/etc/nagios.cfg:

Code: Select all

broker_module=/usr/lib/mod_gearman/mod_gearman.o config=/etc/mod_gearman/mod_gearman_neb.conf

1.2 /etc/mod_gearman/mod_gearman_neb.conf

Code: Select all

debug=1
logfile=/var/log/mod_gearman/mod_gearman_neb.log
server=localhost:4730
#dupserver=<host>:<port>
eventhandler=no
services=no
hosts=no
#hostgroups=name1
#hostgroups=name2,name3
#servicegroups=name1,name2,name3
do_hostchecks=no
encryption=yes
key=my_pass
#keyfile=/path/to/secret.file
use_uniq_jobs=on
# NEB Module Config
localhostgroups=
localservicegroups=
#queue_custom_variable=WORKER
result_workers=1
perfdata=no
perfdata_mode=1
orphan_host_checks=yes
accept_clear_results=no

2. Any remote nagios-server (all are the same):
2.1 /usr/local/nagios/etc/nagios.cfg:

Code: Select all

ocsp_command=gmlonp-submit_service_send_gearman
ochp_command=gmlonp-submit_host_send_gearman

2.2 /usr/local/nagios/etc/objects/commands.cfg:

Code: Select all

define command{
        command_name    gmlonp-submit_host_send_gearman
        command_line    /usr/bin/send_gearman --server=10.93.1.51:4730 --encryption=yes --key=my_pass --host="$HOSTNAME$" --returncode=$HOSTSTATEID$ --message="$HOSTOUTPUT$|$HOSTPERFDATA$"
        }

define command{
        command_name    gmlonp-submit_service_send_gearman
        command_line    /usr/bin/send_gearman --server=10.93.1.51:4730 --encryption=yes --key=my_pass --host="$HOSTNAME$" --service="$SERVICEDESC$" --returncode=$SERVICESTATEID$ --message="$SERVICEOUTPUT$|$SERVICEPERFDATA$"
        }

Logs: I've enabled debug logging for gearmand on the central server. Here is the output at the moment the gearmand service hangs, from /var/log/mod_gearman/mod_gearman_neb.log:

Code: Select all

[root@rl-nms-01 ~]# tail -f /var/log/mod_gearman/mod_gearman_neb.log
[2015-10-06 07:30:37][10993][DEBUG] service job completed: vtbonp-sql-1c Disk E Space: 0
[2015-10-06 07:30:37][10993][DEBUG] service job completed: AZS160004 Memory Usage: 0
[2015-10-06 07:30:37][10993][DEBUG] host job completed: mnsonp-kis-ib: 0
[2015-10-06 07:30:37][10993][DEBUG] service job completed: AZS800047 Disk C Usage: 0
[2015-10-06 07:30:37][10993][DEBUG] host job completed: AZS160034: 0
[2015-10-06 07:30:37][10993][DEBUG] host job completed: POS650045: 0
[2015-10-06 07:30:37][10993][DEBUG] service job completed: AZS800015 PING: 0
[2015-10-06 07:30:37][10993][DEBUG] host job completed: DSL370082: 1
[2015-10-06 07:30:37][10993][DEBUG] service job completed: AZS370056 CPU Load: 0
[2015-10-06 07:30:37][10993][DEBUG] service job completed: NET370449 Memory Usage: 0

As you can see above, gearman just froze.

The output of netstat -na | grep :4730 is attached.
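A long netstat listing is easier to read as a summary of TCP states. The following is a minimal Python sketch, assuming Linux-style `netstat -na` rows (protocol, Recv-Q, Send-Q, local address, foreign address, state); the function name `count_states` is hypothetical and the attached output is not reproduced here:

```python
from collections import Counter
from typing import Iterable


def count_states(netstat_lines: Iterable[str], port: int = 4730) -> Counter:
    """Count TCP states for sockets whose local address ends in :<port>."""
    suffix = f":{port}"
    states: Counter = Counter()
    for line in netstat_lines:
        fields = line.split()
        # Assumed row layout: proto recv-q send-q local foreign state
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        if fields[3].endswith(suffix):
            states[fields[5]] += 1
    return states
```

Feeding it `sys.stdin` from `netstat -na` gives a per-state count. A large and growing number of CLOSE_WAIT sockets on the gearmand port would suggest that the daemon is not closing connections the remote side has already closed, though that is only one possible reading.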

Code: Select all

[root@rl-nms-01 ~]# /etc/init.d/nagios status
nagios (pid 25269) is running...
[root@rl-nms-01 ~]# /etc/init.d/gearmand status
gearmand (pid  10897) is running...

Then I run service nagios restart and service gearmand restart. After that everything starts working again.

jdalrymple · Post by **jdalrymple** » Tue Oct 06, 2015 4:37 pm

I've seen this in a customer installation where the problem was a particular worker not interacting properly with gearman. The fix was to uninstall and reinstall the gearman worker software.

This may or may not be the case for you, but it's pretty difficult to troubleshoot also. I recommend turning off workers one at a time to see if the environment stabilizes. In my customer's environment the system would fail pretty regularly, almost always within 12 hours. Are you in this same situation?

whitest · Post by **whitest** » Wed Oct 07, 2015 3:14 am

jdalrymple, thank you for your reply. Workers in /etc/mod_gearman/mod_gearman_neb.conf are disabled already. I don't use them.
I'll try to reinstall mod_gearman. I hope it will help.

I already asked (https://support.nagios.com/forum/viewto ... =7&t=33245) about mod_gearman being unstable and about the absence of a built-in solution in Nagios for submitting passive checks. I still believe this is a shortcoming of Nagios. The lack of buffering in mod_gearman while there is no connectivity between the central and remote Nagios servers is also frustrating.

jdalrymple · Post by **jdalrymple** » Wed Oct 07, 2015 10:48 am

whitest wrote:jdalrymple, thank you for your reply. Workers in /etc/mod_gearman/mod_gearman_neb.conf are disabled already. I don't use them.

Why use mod_gearman then?

whitest wrote:I already asked (https://support.nagios.com/forum/viewto ... =7&t=33245) about mod_gearman being unstable and about the absence of a built-in solution in Nagios for submitting passive checks. I still believe this is a shortcoming of Nagios. The lack of buffering in mod_gearman while there is no connectivity between the central and remote Nagios servers is also frustrating.

I fail to see how this could be Nagios' fault. Once Nagios issues the check, what happens to it along the path is out of the control of Nagios. The proper flow in a mod_gearman setup is as follows:

nagios --> mod_gearman --> gearmand --> worker --> gearmand --> mod_gearman --> nagios

The check can (and does) get queued at either pass through gearmand. Nagios isn't in the business of queuing anything; it has two jobs: issue the check, then interpret the result. What would you like to see us change?

whitest · Post by **whitest** » Wed Oct 07, 2015 11:43 am

jdalrymple wrote:
whitest wrote:jdalrymple, thank you for your reply. Workers in /etc/mod_gearman/mod_gearman_neb.conf are disabled already. I don't use them.
Why use mod_gearman then?

I use send_gearman on the remote Nagios servers as the transport for submitting check results to the central Nagios.
On the central Nagios I use mod_gearman only to receive results and pass them to Nagios core. It's just a broker for received check results.
I set it up as described here:
https://labs.consol.de/nagios/mod-gearm ... eplacement

jdalrymple wrote: I fail to see how this could be Nagios' fault. Once Nagios issues the check, what happens to it along the path is out of the control of Nagios. The proper flow in a mod_gearman setup is as follows:

nagios --> mod_gearman --> gearmand --> worker --> gearmand --> mod_gearman --> nagios

So I guess in my situation the chain is: remote-nagios --> send_gearman --> NETWORK --> gearmand --> central nagios.
As I described above, all workers except check_results are disabled:

Code: Select all

[root@rl-nms-01 ~]# gearman_top
2015-10-07 19:22:12  -  localhost:4730   -  v0.25

 Queue Name    | Worker Available | Jobs Waiting | Jobs Running
----------------------------------------------------------------
 check_results |               1  |           0  |           0
----------------------------------------------------------------
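The same queue figures can be read straight from gearmand's text admin protocol, which is useful for a watchdog that should notice when results stop flowing. The following is a minimal Python sketch, assuming the standard `status` admin command (one tab-separated line per queue with total jobs, running jobs, and available workers, ended by a line holding a single dot); the names `QueueStatus`, `parse_status`, and `fetch_status` are hypothetical:

```python
import socket
from typing import Dict, NamedTuple


class QueueStatus(NamedTuple):
    total: int    # jobs in the queue, waiting plus running
    running: int  # jobs currently being processed
    workers: int  # workers registered for this queue


def parse_status(raw: str) -> Dict[str, QueueStatus]:
    """Parse the reply to gearmand's `status` admin command."""
    queues: Dict[str, QueueStatus] = {}
    for line in raw.splitlines():
        if line == ".":  # a lone dot ends the listing
            break
        parts = line.split("\t")
        if len(parts) != 4:  # skip anything unexpected
            continue
        name, total, running, workers = parts
        queues[name] = QueueStatus(int(total), int(running), int(workers))
    return queues


def fetch_status(host: str = "localhost", port: int = 4730,
                 timeout: float = 5.0) -> str:
    """Send `status` to gearmand and read until the terminating dot line."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"status\n")
        data = b""
        while not (data == b".\n" or data.endswith(b"\n.\n")):
            chunk = sock.recv(4096)
            if not chunk:  # connection closed early
                break
            data += chunk
    return data.decode("utf-8", errors="replace")
```

Calling `parse_status(fetch_status())` from cron would give one data point per run. A socket timeout, or a `check_results` queue whose total keeps growing while `workers` is 0, would be a hint that the result worker inside the NEB module has stopped consuming.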

jdalrymple wrote:What would you like to see us change?

I want to see a stable and powerful solution in Nagios core for submitting and receiving check results. It needs to include buffering and bulk submission.
I tried a lot of plugins, but none of them are ideal or stable =((
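To make the buffering and bulk-submission request concrete, here is a minimal Python sketch of the behaviour described. It is not part of Nagios, mod_gearman, or any plugin mentioned in this thread; `ResultBuffer` and the `send_batch` callback are hypothetical names, and the transport behind the callback is left open:

```python
from collections import deque
from itertools import islice
from typing import Callable, Deque, List


class ResultBuffer:
    """Hold check results while the central server is unreachable."""

    def __init__(self, send_batch: Callable[[List[str]], bool],
                 batch_size: int = 100, max_items: int = 10000) -> None:
        self._send_batch = send_batch  # returns True when a batch was delivered
        self._batch_size = batch_size
        # With maxlen set, the oldest results are dropped once the buffer is full.
        self._queue: Deque[str] = deque(maxlen=max_items)

    def add(self, result: str) -> None:
        self._queue.append(result)

    def flush(self) -> int:
        """Send buffered results in batches; stop at the first failure."""
        sent = 0
        while self._queue:
            batch = list(islice(self._queue, self._batch_size))
            if not self._send_batch(batch):
                break  # keep the remaining results for the next attempt
            for _ in batch:
                self._queue.popleft()
            sent += len(batch)
        return sent

    def __len__(self) -> int:
        return len(self._queue)
```

Results are removed only after the callback reports success, so a network outage leaves them queued for the next `flush()` call. The buffer lives in memory, so a restart of the submitting process would still lose whatever is queued.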

jdalrymple · Post by **jdalrymple** » Wed Oct 07, 2015 11:51 am

whitest wrote:So I guess in my situation the chain is: remote-nagios --> send_gearman --> NETWORK --> gearmand --> central nagios.

I can clearly see at least one component missing from your chain: the mod_gearman module between gearmand and the central Nagios. That's all fairly irrelevant though, as in this situation I would definitely contend that the issue is likely between send_gearman and gearmand, which is still out of our control.

If this is actually a distributed Nagios setup, I don't understand why you're not sending directly to the NSCA daemon. It's a bit confusing that you would add all those unnecessary non-Nagios components to the mix, then blame your problems on Nagios.
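For reference, the NSCA route expects each passive result as one tab-delimited line on the standard input of send_nsca (tab being the default delimiter). The following is a minimal Python sketch of that line format; the helper name `nsca_line` is hypothetical, and replacing tabs and newlines in the plugin output with spaces is an assumption made here to keep each result on a single line:

```python
from typing import Optional


def nsca_line(host: str, returncode: int, output: str,
              service: Optional[str] = None) -> str:
    """Format one passive check result for send_nsca's standard input."""
    # Tabs and newlines would break the tab-delimited, line-oriented format.
    clean = output.replace("\t", " ").replace("\r", " ").replace("\n", " ")
    if service is None:
        fields = [host, str(returncode), clean]           # host check
    else:
        fields = [host, service, str(returncode), clean]  # service check
    return "\t".join(fields) + "\n"
```

Several such lines can be written to a single `send_nsca -H <central server> -c <config file>` process, which sends them in one connection.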

whitest wrote:I want to see a stable and powerful solution in Nagios core for submitting and receiving check results. It needs to include buffering and bulk submission.
I tried a lot of plugins, but none of them are ideal or stable =((

Going back to the other thread you linked, we have to say you're an isolated case. We have many successes where people extend the functionality and capability of Nagios with mod_gearman. The customer that I mentioned having issues did in fact isolate his problem to a single seemingly corrupt worker installation. After uninstalling and reinstalling his worker he is back to monitoring tens of thousands of services without fail.

Nagios Support Forum
