Re: nrpe "Connection refused by host"

Posted: Wed Mar 02, 2011 5:52 pm
by lyle
Tony, that's a great trick: turning on debug_logging on the server. I chose debug_level=16 (Host/service checks), and even then I had to quickly rename debug log files as they filled up and I waited for my checks to run.

Here's the output, grep'd for my remote host name:

Code:

root@asb-con-ngs-001:/usr/local/nagios/var #> grep asb-sac-jac-001 nagios.deb*
nagios.debug:[1299104890.159400] [016.0] [pid=11518] Scheduling a forced, active check of service 'SSH/LINUX' on host 'asb-sac-jac-001' @ Wed Mar  2 14:27:58 2011
nagios.debug:[1299104890.160424] [016.0] [pid=11518] Scheduling a forced, active check of service 'Load/LINUX' on host 'asb-sac-jac-001' @ Wed Mar  2 14:27:58 2011
nagios.debug:[1299104890.160779] [016.0] [pid=11518] Scheduling a forced, active check of service 'JBoss/PVTL' on host 'asb-sac-jac-001' @ Wed Mar  2 14:27:58 2011
nagios.debug:[1299104890.219353] [016.0] [pid=11518] Attempting to run scheduled check of service 'SSH/LINUX' on host 'asb-sac-jac-001': check options=1, latency=12.219000
nagios.debug:[1299104890.219388] [016.0] [pid=11518] Checking service 'SSH/LINUX' on host 'asb-sac-jac-001'...
nagios.debug:[1299104890.341347] [016.0] [pid=11518] Attempting to run scheduled check of service 'Load/LINUX' on host 'asb-sac-jac-001': check options=1, latency=12.341000
nagios.debug:[1299104890.342928] [016.0] [pid=11518] Checking service 'Load/LINUX' on host 'asb-sac-jac-001'...
nagios.debug:[1299104890.466791] [016.0] [pid=11518] Attempting to run scheduled check of service 'JBoss/PVTL' on host 'asb-sac-jac-001': check options=1, latency=12.466000
nagios.debug:[1299104890.467505] [016.0] [pid=11518] Checking service 'JBoss/PVTL' on host 'asb-sac-jac-001'...
nagios.debug:[1299104895.717799] [016.1] [pid=11518] Handling check result for service 'SSH/LINUX' on host 'asb-sac-jac-001'...
nagios.debug:[1299104895.717856] [016.0] [pid=11518] ** Handling check result for service 'SSH/LINUX' on host 'asb-sac-jac-001'...
nagios.debug:[1299104895.717876] [016.1] [pid=11518] HOST: asb-sac-jac-001, SERVICE: SSH/LINUX, CHECK TYPE: Active, OPTIONS: 1, SCHEDULED: Yes, RESCHEDULE: Yes, EXITED OK: Yes, RETURN CODE: 0, OUTPUT: SSH OK - OpenSSH_4.3 (protocol 2.0)\n
nagios.debug:[1299104895.718062] [016.0] [pid=11518] Scheduling a non-forced, active check of service 'SSH/LINUX' on host 'asb-sac-jac-001' @ Wed Mar  2 17:28:10 2011
nagios.debug:[1299104895.718903] [016.1] [pid=11518] Checking service 'SSH/LINUX' on host 'asb-sac-jac-001' for flapping...
nagios.debug:[1299104895.718970] [016.1] [pid=11518] Checking host 'asb-sac-jac-001' for flapping...
nagios.debug:[1299104895.722048] [016.1] [pid=11518] Handling check result for service 'Load/LINUX' on host 'asb-sac-jac-001'...
nagios.debug:[1299104895.722067] [016.0] [pid=11518] ** Handling check result for service 'Load/LINUX' on host 'asb-sac-jac-001'...
nagios.debug:[1299104895.722083] [016.1] [pid=11518] HOST: asb-sac-jac-001, SERVICE: Load/LINUX, CHECK TYPE: Active, OPTIONS: 1, SCHEDULED: Yes, RESCHEDULE: Yes, EXITED OK: Yes, RETURN CODE: 2, OUTPUT: Connection refused by host\n
nagios.debug:[1299104895.722227] [016.0] [pid=11518] ** On-demand check for host 'asb-sac-jac-001'...
nagios.debug:[1299104895.722244] [016.0] [pid=11518] ** Run sync check of host 'asb-sac-jac-001'...
nagios.debug:[1299104895.722298] [016.0] [pid=11518] ** Executing sync check of host 'asb-sac-jac-001'...
nagios.debug:[1299104896.843634] [016.1] [pid=11518] HOST: asb-sac-jac-001, ATTEMPT=1/10, CHECK TYPE=ACTIVE, STATE TYPE=HARD, OLD STATE=0, NEW STATE=0
nagios.debug:[1299104896.843690] [016.1] [pid=11518] Pre-handle_host_state() Host: asb-sac-jac-001, Attempt=1/10, Type=HARD, Final State=0
nagios.debug:[1299104896.843710] [016.1] [pid=11518] Post-handle_host_state() Host: asb-sac-jac-001, Attempt=1/10, Type=HARD, Final State=0
nagios.debug:[1299104896.843728] [016.1] [pid=11518] Checking host 'asb-sac-jac-001' for flapping...
nagios.debug:[1299104896.843825] [016.1] [pid=11518] Checking service 'Load/LINUX' on host 'asb-sac-jac-001' for flapping...
nagios.debug:[1299104896.843860] [016.1] [pid=11518] Checking host 'asb-sac-jac-001' for flapping...
nagios.debug:[1299104896.843975] [016.0] [pid=11518] Scheduling a non-forced, active check of service 'Load/LINUX' on host 'asb-sac-jac-001' @ Wed Mar  2 15:28:10 2011
nagios.debug:[1299104896.845086] [016.1] [pid=11518] Handling check result for service 'JBoss/PVTL' on host 'asb-sac-jac-001'...
nagios.debug:[1299104896.845180] [016.0] [pid=11518] ** Handling check result for service 'JBoss/PVTL' on host 'asb-sac-jac-001'...
nagios.debug:[1299104896.845200] [016.1] [pid=11518] HOST: asb-sac-jac-001, SERVICE: JBoss/PVTL, CHECK TYPE: Active, OPTIONS: 1, SCHEDULED: Yes, RESCHEDULE: Yes, EXITED OK: Yes, RETURN CODE: 2, OUTPUT: Connection refused by host\n
nagios.debug:[1299104896.845313] [016.0] [pid=11518] ** On-demand check for host 'asb-sac-jac-001'...
nagios.debug:[1299104896.845330] [016.0] [pid=11518] ** Run sync check of host 'asb-sac-jac-001'...
nagios.debug:[1299104896.845416] [016.1] [pid=11518] Checking service 'JBoss/PVTL' on host 'asb-sac-jac-001' for flapping...
nagios.debug:[1299104896.845453] [016.1] [pid=11518] Checking host 'asb-sac-jac-001' for flapping...
nagios.debug:[1299104896.845533] [016.0] [pid=11518] Scheduling a non-forced, active check of service 'JBoss/PVTL' on host 'asb-sac-jac-001' @ Wed Mar  2 14:33:10 2011
nagios.debug.1:[1299104844.425668] [016.1] [pid=11518] Checking host 'asb-sac-jac-001' for flapping...
nagios.debug.1:[1299104844.644729] [016.1] [pid=11518] Checking service 'JBoss/PVTL' on host 'asb-sac-jac-001' for flapping...
nagios.debug.1:[1299104844.644905] [016.1] [pid=11518] Checking service 'Load/LINUX' on host 'asb-sac-jac-001' for flapping...
nagios.debug.1:[1299104844.645081] [016.1] [pid=11518] Checking service 'SSH/LINUX' on host 'asb-sac-jac-001' for flapping...
Looking at timestamps 1299104895.722083 and 1299104896.845200 it looks to me like the JBoss and Load checks (the ones using nrpe) ran when scheduled, and got the result I've been seeing: "Connection refused by host".
Maybe the return code of 2 is a problem.
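For what it's worth, a return code of 2 is probably not a separate problem: under the standard Nagios plugin convention, exit status 0 is OK, 1 is WARNING, 2 is CRITICAL and 3 is UNKNOWN, so a 2 is simply how the refused connection gets reported. A minimal sketch for seeing that mapping when running a plugin by hand; the function names `plugin_state` and `run_and_report` are made up for illustration and are not part of Nagios:

```shell
# Map a plugin exit status to its conventional Nagios state name.
plugin_state() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        3) echo "UNKNOWN" ;;
        *) echo "UNKNOWN (non-standard exit status $1)" ;;
    esac
}

# Run any command, then report the state that its exit status maps to.
# The command's own output is passed through untouched.
run_and_report() {
    "$@"
    status=$?
    echo "exit status $status => $(plugin_state "$status")"
    return "$status"
}
```

Something like `run_and_report /usr/local/nagios/libexec/check_nrpe -H asb-sac-jac-001 -c check_load` would then print the plugin output followed by the state it maps to.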

Here's what running these 2 checks manually looks like:

Code:

nagios@h:w #> whoami
nagios
nagios@h:w #> /usr/local/nagios/libexec/check_nrpe -H asb-sac-jac-001 -c check_jboss_log -t 90
OK - 0 Pivotal errors found
nagios@h:w #> /usr/local/nagios/libexec/check_nrpe -H asb-sac-jac-001 -c check_load
OK - load average: 0.08, 0.12, 0.06|load1=0.080;15.000;30.000;0; load5=0.120;10.000;25.000;0; load15=0.060;5.000;20.000;0; 
Thanks for any ideas....Lyle

Re: nrpe "Connection refused by host"

Posted: Wed Mar 02, 2011 5:58 pm
by mguthrie
I'm noticing you have this as your service definition
check_command check_nrpe!check_jbosslog

But you're running this from the command-line
check_nrpe -H asb-sac-jac-001 -c check_jboss_log -t 90

Re: nrpe "Connection refused by host"

Posted: Wed Mar 02, 2011 6:07 pm
by lyle
That *was* a typo. I got my check definition name and script name mixed up. But that didn't fix things, and check_load is just the vanilla stuff from the install.

But I have to confess that I do have an NRPE version mismatch for this remote host.

check_nrpe on the server reports that it's 2.5.1, while the nrpe executable on the remote host reports that it's 2.12.

Not sure if this is the problem. Thanks...Lyle

Re: nrpe "Connection refused by host"

Posted: Wed Mar 02, 2011 6:23 pm
by tonyyarusso
There was actually a reason I used a debug_level of -1 instead of 16 - you'll note that mine gave the actual command, as it would be run manually, not just a message saying it was running. That's what I'm after here.

Re: nrpe "Connection refused by host"

Posted: Wed Mar 02, 2011 6:43 pm
by lyle
To the untrained eye, it doesn't look a lot different with debug_level=-1, but here it is:

Code:

root@asb-con-ngs-001:/usr/local/nagios/var #> grep asb-sac-jac-001 nagios.debug*
nagios.debug.old.2:[1299108786.075906] [128.1] [pid=11518] Command Arguments: asb-sac-jac-001;1299108771
nagios.debug.old.2:[1299108786.076043] [016.0] [pid=11518] Scheduling a forced, active check of service 'SSH/LINUX' on host 'asb-sac-jac-001' @ Wed Mar  2 15:32:51 2011
nagios.debug.old.2:[1299108786.077111] [016.0] [pid=11518] Scheduling a forced, active check of service 'Load/LINUX' on host 'asb-sac-jac-001' @ Wed Mar  2 15:32:51 2011
nagios.debug.old.2:[1299108786.077593] [016.0] [pid=11518] Scheduling a forced, active check of service 'JBoss/PVTL' on host 'asb-sac-jac-001' @ Wed Mar  2 15:32:51 2011
nagios.debug.old.2:[1299108798.048592] [008.0] [pid=11518] ** Service Check Event ==> Host: 'asb-sac-jac-001', Service: 'SSH/LINUX', Options: 1, Latency: 27.048000 sec
nagios.debug.old.2:[1299108798.048631] [016.0] [pid=11518] Attempting to run scheduled check of service 'SSH/LINUX' on host 'asb-sac-jac-001': check options=1, latency=27.048000
nagios.debug.old.2:[1299108798.048702] [016.0] [pid=11518] Checking service 'SSH/LINUX' on host 'asb-sac-jac-001'...
nagios.debug.old.2:[1299108798.164065] [008.0] [pid=11518] ** Service Check Event ==> Host: 'asb-sac-jac-001', Service: 'Load/LINUX', Options: 1, Latency: 27.164000 sec
nagios.debug.old.2:[1299108798.164112] [016.0] [pid=11518] Attempting to run scheduled check of service 'Load/LINUX' on host 'asb-sac-jac-001': check options=1, latency=27.164000
nagios.debug.old.2:[1299108798.165679] [016.0] [pid=11518] Checking service 'Load/LINUX' on host 'asb-sac-jac-001'...
nagios.debug.old.2:[1299108798.279435] [008.0] [pid=11518] ** Service Check Event ==> Host: 'asb-sac-jac-001', Service: 'JBoss/PVTL', Options: 1, Latency: 27.279000 sec
nagios.debug.old.2:[1299108798.279482] [016.0] [pid=11518] Attempting to run scheduled check of service 'JBoss/PVTL' on host 'asb-sac-jac-001': check options=1, latency=27.279000
nagios.debug.old.2:[1299108798.281271] [016.0] [pid=11518] Checking service 'JBoss/PVTL' on host 'asb-sac-jac-001'...
nagios.debug.old.2:[1299108802.187444] [016.1] [pid=11518] Handling check result for service 'SSH/LINUX' on host 'asb-sac-jac-001'...
nagios.debug.old.2:[1299108802.187488] [016.0] [pid=11518] ** Handling check result for service 'SSH/LINUX' on host 'asb-sac-jac-001'...
nagios.debug.old.2:[1299108802.187504] [016.1] [pid=11518] HOST: asb-sac-jac-001, SERVICE: SSH/LINUX, CHECK TYPE: Active, OPTIONS: 1, SCHEDULED: Yes, RESCHEDULE: Yes, EXITED OK: Yes, RETURN CODE: 0, OUTPUT: SSH OK - OpenSSH_4.3 (protocol 2.0)\n
nagios.debug.old.2:[1299108802.187809] [016.0] [pid=11518] Scheduling a non-forced, active check of service 'SSH/LINUX' on host 'asb-sac-jac-001' @ Wed Mar  2 18:33:18 2011
nagios.debug.old.2:[1299108802.188577] [016.1] [pid=11518] Checking service 'SSH/LINUX' on host 'asb-sac-jac-001' for flapping...
nagios.debug.old.2:[1299108802.188628] [016.1] [pid=11518] Checking host 'asb-sac-jac-001' for flapping...
nagios.debug.old.2:[1299108802.188817] [016.1] [pid=11518] Handling check result for service 'Load/LINUX' on host 'asb-sac-jac-001'...
nagios.debug.old.2:[1299108802.188850] [016.0] [pid=11518] ** Handling check result for service 'Load/LINUX' on host 'asb-sac-jac-001'...
nagios.debug.old.2:[1299108802.188865] [016.1] [pid=11518] HOST: asb-sac-jac-001, SERVICE: Load/LINUX, CHECK TYPE: Active, OPTIONS: 1, SCHEDULED: Yes, RESCHEDULE: Yes, EXITED OK: Yes, RETURN CODE: 2, OUTPUT: Connection refused by host\n
nagios.debug.old.2:[1299108802.188959] [016.0] [pid=11518] ** On-demand check for host 'asb-sac-jac-001'...
nagios.debug.old.2:[1299108802.188987] [016.0] [pid=11518] ** Run sync check of host 'asb-sac-jac-001'...
nagios.debug.old.2:[1299108802.189129] [016.0] [pid=11518] ** Executing sync check of host 'asb-sac-jac-001'...
nagios.debug.old.2:[1299108803.311564] [016.1] [pid=11518] HOST: asb-sac-jac-001, ATTEMPT=1/10, CHECK TYPE=ACTIVE, STATE TYPE=HARD, OLD STATE=0, NEW STATE=0
nagios.debug.old.2:[1299108803.311613] [016.1] [pid=11518] Pre-handle_host_state() Host: asb-sac-jac-001, Attempt=1/10, Type=HARD, Final State=0
nagios.debug.old.2:[1299108803.311661] [016.1] [pid=11518] Post-handle_host_state() Host: asb-sac-jac-001, Attempt=1/10, Type=HARD, Final State=0
nagios.debug.old.2:[1299108803.311694] [016.1] [pid=11518] Checking host 'asb-sac-jac-001' for flapping...
nagios.debug.old.2:[1299108803.311827] [016.1] [pid=11518] Checking service 'Load/LINUX' on host 'asb-sac-jac-001' for flapping...
nagios.debug.old.2:[1299108803.311877] [016.1] [pid=11518] Checking host 'asb-sac-jac-001' for flapping...
nagios.debug.old.2:[1299108803.311949] [032.0] [pid=11518] ** Service Notification Attempt ** Host: 'asb-sac-jac-001', Service: 'Load/LINUX', Type: 0, Options: 0, Current State: 2, Last Notification: Wed Dec 31 16:00:00 1969
nagios.debug.old.2:[1299108803.312198] [016.0] [pid=11518] Scheduling a non-forced, active check of service 'Load/LINUX' on host 'asb-sac-jac-001' @ Wed Mar  2 16:33:18 2011
nagios.debug.old.2:[1299108803.313165] [016.1] [pid=11518] Handling check result for service 'JBoss/PVTL' on host 'asb-sac-jac-001'...
nagios.debug.old.2:[1299108803.313201] [016.0] [pid=11518] ** Handling check result for service 'JBoss/PVTL' on host 'asb-sac-jac-001'...
nagios.debug.old.2:[1299108803.313217] [016.1] [pid=11518] HOST: asb-sac-jac-001, SERVICE: JBoss/PVTL, CHECK TYPE: Active, OPTIONS: 1, SCHEDULED: Yes, RESCHEDULE: Yes, EXITED OK: Yes, RETURN CODE: 2, OUTPUT: Connection refused by host\n
nagios.debug.old.2:[1299108803.313339] [016.0] [pid=11518] ** On-demand check for host 'asb-sac-jac-001'...
nagios.debug.old.2:[1299108803.313367] [016.0] [pid=11518] ** Run sync check of host 'asb-sac-jac-001'...
nagios.debug.old.2:[1299108803.313505] [016.1] [pid=11518] Checking service 'JBoss/PVTL' on host 'asb-sac-jac-001' for flapping...
nagios.debug.old.2:[1299108803.313551] [016.1] [pid=11518] Checking host 'asb-sac-jac-001' for flapping...
nagios.debug.old.2:[1299108803.313607] [032.0] [pid=11518] ** Service Notification Attempt ** Host: 'asb-sac-jac-001', Service: 'JBoss/PVTL', Type: 0, Options: 0, Current State: 2, Last Notification: Wed Dec 31 16:00:00 1969
nagios.debug.old.2:[1299108803.313757] [016.0] [pid=11518] Scheduling a non-forced, active check of service 'JBoss/PVTL' on host 'asb-sac-jac-001' @ Wed Mar  2 15:38:18 2011
nagios.debug.old.2:[1299108804.127324] [008.0] [pid=11518] ** Host Check Event ==> Host: 'asb-sac-jac-001', Options: 0, Latency: 16.127000 sec
nagios.debug.old.2:[1299108804.127382] [016.0] [pid=11518] Attempting to run scheduled check of host 'asb-sac-jac-001': check options=0, latency=16.127000
nagios.debug.old.2:[1299108804.127419] [016.0] [pid=11518] ** Running async check of host 'asb-sac-jac-001'...
nagios.debug.old.2:[1299108804.128303] [016.0] [pid=11518] Checking host 'asb-sac-jac-001'...
nagios.debug.old.2:[1299108807.133749] [016.1] [pid=11518] Handling check result for host 'asb-sac-jac-001'...
nagios.debug.old.2:[1299108807.133781] [016.1] [pid=11518] ** Handling async check result for host 'asb-sac-jac-001'...
nagios.debug.old.2:[1299108807.133911] [016.1] [pid=11518] HOST: asb-sac-jac-001, ATTEMPT=1/10, CHECK TYPE=ACTIVE, STATE TYPE=HARD, OLD STATE=0, NEW STATE=0
nagios.debug.old.2:[1299108807.133958] [016.1] [pid=11518] Pre-handle_host_state() Host: asb-sac-jac-001, Attempt=1/10, Type=HARD, Final State=0
nagios.debug.old.2:[1299108807.134002] [016.1] [pid=11518] Post-handle_host_state() Host: asb-sac-jac-001, Attempt=1/10, Type=HARD, Final State=0
nagios.debug.old.2:[1299108807.134032] [016.1] [pid=11518] Checking host 'asb-sac-jac-001' for flapping...
nagios.debug.old.2:[1299108807.134136] [016.0] [pid=11518] Scheduling a non-forced, active check of host 'asb-sac-jac-001' @ Wed Mar  2 15:38:27 2011
nagios.debug.old.2:[1299108807.134449] [016.1] [pid=11518] ** Async check result for host 'asb-sac-jac-001' handled: new state=0
The results show up at timestamps 1299108802.188865 and 1299108803.313217.

Thanks for the help on this one, Tony. I can tell you're checking back frequently.

But I thought I'd get the book thrown at me for the version mismatch. :roll:

....Lyle

Re: nrpe "Connection refused by host"

Posted: Wed Mar 02, 2011 6:55 pm
by tonyyarusso
Can you try actually grepping for libexec, not the host name? The particular line I'm looking for will have the IP address, not the name:

Code:

tail -f /usr/local/nagios/var/nagios.debug | grep libexec
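If the libexec lines are too numerous to read, one option is to pull out just the addresses being passed to `-H`. This is only a sketch: the exact layout of the command line in the debug log isn't shown in this thread, so the `-H <address>` pattern is an assumption, and `nrpe_targets` is a made-up helper name.

```shell
# Print each distinct value that follows "-H" on lines mentioning libexec.
# Reads debug-log text on stdin.
# Assumption: the logged command contains "-H <address>" separated by a space.
nrpe_targets() {
    grep 'libexec' \
        | grep -o -- '-H [^ ]*' \
        | awk '{ print $2 }' \
        | sort -u
}
```

It could be run as `nrpe_targets < /usr/local/nagios/var/nagios.debug`; any address in the output that you don't recognise is worth a closer look.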

Re: nrpe "Connection refused by host"

Posted: Wed Mar 02, 2011 7:22 pm
by lyle
:shock:

I hesitated to send you all 500 lines of debug info containing "libexec", so I started looking for the IP address of the remote host. Couldn't find it, so I looked for "check_jboss_log" and saw an unexpected address. Turns out it was the address of our F5 load balancers that the JBoss servers sit behind. The JBoss servers use the F5's as their default gateway.

I'm guessing that when I issue the check manually, the nslookup happens and I reach the JBoss server. But the Core Server must cache the IP address of the F5's as the return address of the JBoss servers.

Not sure how to solve this, but it's a relief to know I'm not going nuts. :)

...Lyle

Re: nrpe "Connection refused by host"

Posted: Thu Mar 03, 2011 10:32 am
by tonyyarusso
Oh, interesting. So I take it then that your host definitions have "address" set as the DNS name, not the IP address? I suppose one approach would be to just define them with the IP address statically, although it'd be more interesting to figure out how to make the lookup work as expected.

Re: nrpe "Connection refused by host"

Posted: Thu Mar 03, 2011 1:02 pm
by lyle
Well it turns out the resolution wasn't really *that* interesting, Tony.

I had mistakenly entered the addresses of the F5's into the JBoss server host definitions. Maybe I was told "you get to the JBoss servers via the F5's". The host checks were happy, even though they were checking the wrong remote host. But when I started using check_nrpe, nothing on the F5 end was listening to accept the connection. Of course, when I ran a check manually, the name lookup translated the host name to the correct address.
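For anyone who lands here with the same symptom, a quick sanity check is to compare the address in the host definition with what the name actually resolves to. A rough sketch, assuming a Linux box where `getent` is available; `compare_address` and `resolve_first` are hypothetical helper names, and the configured address has to be copied in by hand from the host definition:

```shell
# Compare the address from a host definition with the resolver's answer.
# $1 = host name, $2 = configured address, $3 = resolved address
compare_address() {
    if [ "$2" = "$3" ]; then
        echo "$1: configured address $2 matches DNS"
    else
        echo "$1: configured address $2 differs from DNS ($3)"
        return 1
    fi
}

# Resolve a name through the system resolver and print the first address.
# Assumption: getent is installed (it is part of glibc on most Linux systems).
resolve_first() {
    getent hosts "$1" | awk 'NR == 1 { print $1 }'
}
```

Usage would be along the lines of `compare_address asb-sac-jac-001 "<address from the host definition>" "$(resolve_first asb-sac-jac-001)"`. A mismatch is not always a mistake, but in a case like this one it points straight at the host definition.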

Thanks for all the help on this one.....Lyle

Re: nrpe "Connection refused by host"

Posted: Thu Mar 03, 2011 1:34 pm
by tonyyarusso
Ah, well I'm glad that's sorted!