return code 127 on test that seems invalid
Posted: Mon Nov 12, 2012 2:16 pm
by ucemike
I have recently tried to implement a port test using check_jabber which links to check_tcp. The config entry for the test is:
# 'check_jabber' command definition
define command{
        command_name    check-jabber
        command_line    /usr/lib64/nagios/plugins/check_jabber -H $ARG1$ -w $ARG2$ -c $ARG3$ -t $ARG4$
        }
The service entry is "check-jabber!jabber.host.name.here.net!10!20!60"
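For anyone following along, here is a rough sketch of how the "!"-separated fields in the service entry end up in the $ARGn$ macros of the command definition. This is a simplified illustration, not how Nagios itself is implemented; the function name expand_command is made up for the example, and real macro handling covers more cases (escaping, other macro types).

```python
import re

def expand_command(command_line, check_command):
    """Substitute $ARGn$ macros from a '!'-separated check_command string."""
    # The first field is the command name; the rest become ARG1, ARG2, ...
    _name, *args = check_command.split("!")

    def replace(match):
        index = int(match.group(1)) - 1
        # In this sketch, arguments that were not supplied expand to an empty string.
        return args[index] if 0 <= index < len(args) else ""

    return re.sub(r"\$ARG(\d+)\$", replace, command_line)

command_line = "/usr/lib64/nagios/plugins/check_jabber -H $ARG1$ -w $ARG2$ -c $ARG3$ -t $ARG4$"
expanded = expand_command(command_line, "check-jabber!jabber.host.name.here.net!10!20!60")
print(expanded)
```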
I set the logs to maximum debug, and the final command line that was run was:
/usr/lib64/nagios/plugins/check_jabber -H jabber.host.name.here.net -w 10 -c 20 -t 60
When I run that command manually (as nagios user) it works fine with the output:
JABBER OK - 0.098 second response time on port 5222|time=0.097569s;10.000000;20.000000;0.000000;60.000000
I also use check_imap, which links to check_tcp (as does check_jabber), and it works fine on our mail hosts. The permissions on check_imap and check_jabber are exactly the same. I mention it only because they both use check_tcp.
We are running on RHEL 6.2, using Nagios 3.4.1, nagios-plugins 1.4.16.
Does anyone have a clue what this could be?
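As background on where the number comes from: 127 is the exit status a POSIX shell returns when it cannot find the command it was asked to run, whereas plugins themselves are expected to return 0 to 3. The sketch below reproduces that status, assuming the check is launched through /bin/sh. The plugin path is a deliberately nonexistent placeholder. It only shows the origin of the status, not why it would appear for a command that runs fine by hand.

```python
import subprocess

def run_via_shell(command):
    """Run a command line through /bin/sh and return its exit status."""
    completed = subprocess.run(
        ["/bin/sh", "-c", command],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return completed.returncode

# Hypothetical path that does not exist, standing in for a plugin the shell cannot find.
missing_status = run_via_shell("/nonexistent/plugins/check_example -H localhost")
# A command the shell can run, for comparison.
ok_status = run_via_shell("exit 0")
print(missing_status, ok_status)
```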
Re: return code 127 on test that seems invalid
Posted: Thu Nov 15, 2012 5:05 pm
by agriffin
That's odd, is there anything in your Nagios logs about it (found in /usr/local/nagios/var or /var/log)?
Re: return code 127 on test that seems invalid
Posted: Thu Nov 15, 2012 5:11 pm
by ucemike
agriffin wrote:That's odd, is there anything in your Nagios logs about it (found in /usr/local/nagios/var or /var/log)?
Other than the error 127, no. I was hoping the expanded debug output would give me more detail, but all it did was make this more confusing, because everything looked OK (as far as the final command line goes).
Re: return code 127 on test that seems invalid
Posted: Thu Nov 15, 2012 5:58 pm
by agriffin
Are you running anything like SELinux that might be preventing the plugin from running in certain contexts?
Re: return code 127 on test that seems invalid
Posted: Thu Nov 15, 2012 9:50 pm
by ucemike
agriffin wrote:Are you running anything like SELinux that might be preventing the plugin from running in certain contexts?
No. I also verified that we were not hitting a thread or open-process cap.
Re: return code 127 on test that seems invalid
Posted: Fri Nov 30, 2012 11:01 am
by ucemike
I did some further testing and removed about half of the tests we run. When I did that, the error stopped. I enabled them again (we have 4500 service tests) and the problem came back. I then cut out half the hosts (we have 200) and the same thing happened: the test started working again. I re-enabled all hosts and it broke again.
I monitored the open files with all hosts/tests enabled and saw at most around 550 (total, not just for the nagios user). I also added a limits.conf entry for the nagios user, raising those values into the 10k range. It did not help.
This seems to be a problem with a "cap" of some sort, but I cannot narrow down where. I am hoping someone else has some tips.
edit: we had also updated to 3.4.3RC1 to resolve a downtime schedule bug.
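One thing that may be worth checking here: limits.conf is applied by pam_limits at session start, so a daemon started from an init script may not pick up the new values. On Linux, /proc/<pid>/limits shows the limits a running process actually has. Below is a minimal sketch for reading it. The helper names are made up for the example, and the sample numbers are illustrative only, not taken from this system.

```python
import re

def parse_limits(text):
    """Parse the contents of a /proc/<pid>/limits file into {name: (soft, hard)}."""
    limits = {}
    for line in text.splitlines()[1:]:  # skip the header row
        # Columns are separated by runs of two or more spaces.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) >= 3:
            limits[fields[0]] = (fields[1], fields[2])
    return limits

def limits_for_pid(pid):
    """Read the effective limits of a running process (Linux only)."""
    with open("/proc/{}/limits".format(pid)) as handle:
        return parse_limits(handle.read())

# Sample text in the /proc/<pid>/limits layout; the numbers are made up for illustration.
sample = """Limit                     Soft Limit           Hard Limit           Units
Max open files            1024                 4096                 files
Max processes             1024                 30000                processes
"""
parsed = parse_limits(sample)
print(parsed["Max open files"], parsed["Max processes"])
```

Calling limits_for_pid() with the PID of the running nagios daemon and comparing "Max processes" and "Max open files" against the limits.conf values would show whether the change took effect for the daemon itself.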
Re: return code 127 on test that seems invalid
Posted: Tue Dec 04, 2012 2:44 pm
by ucemike
Is there any way to get the exact output the OS gave Nagios when the command was executed? The debug logs, even at the maximum level, do not show it (they only give the full command line, which runs fine when executed manually as user nagios).
Does nagios store this information anywhere? I want to see exactly what the command and/or OS is telling nagios for it to give an error 127.
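One possible way to capture this, sketched below, is a small wrapper that runs the plugin, appends its stdout, stderr, and exit status to a log file, and hands the result back. This is not a Nagios feature. The function name run_and_log and the log location are assumptions for illustration, and mapping a failed launch to 127 is a choice made in this sketch to mirror the shell convention.

```python
import os
import subprocess
import sys
import tempfile

def run_and_log(argv, log_path):
    """Run a check, append its stdout, stderr, and exit status to log_path, and return them."""
    try:
        done = subprocess.run(argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                              universal_newlines=True)
        status, out, err = done.returncode, done.stdout, done.stderr
    except OSError as exc:
        # The program could not be started at all (missing file, fork/exec failure, ...).
        # Reporting 127 here is this sketch's own convention.
        status, out, err = 127, "", "exec failed: {}".format(exc)
    with open(log_path, "a") as log:
        log.write("argv={!r} status={} stdout={!r} stderr={!r}\n".format(argv, status, out, err))
    return status, out, err

# Demo with a stand-in command instead of a real plugin.
log_path = os.path.join(tempfile.mkdtemp(), "check_wrapper.log")
demo = [sys.executable, "-c",
        "import sys; print('DEMO WARNING'); sys.stderr.write('detail'); sys.exit(1)"]
status, out, err = run_and_log(demo, log_path)
print(status, out.strip(), err.strip())
```

If something like this were used as the command_line target, it would also need to print the plugin's stdout and exit with the same status so the check result is unchanged.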
Re: return code 127 on test that seems invalid
Posted: Tue Dec 04, 2012 2:53 pm
by slansing
Can you post exactly what the command is you run manually from the Nagios terminal to get the result in your first post?
Re: return code 127 on test that seems invalid
Posted: Tue Dec 04, 2012 3:17 pm
by ucemike
slansing wrote:Can you post exactly what the command is you run manually from the Nagios terminal to get the result in your first post?
I am not sure what you are asking for. The command listed in the first post is the command I run manually. It was copy/pasted from the nagios.debug output to verify that the command line/values I used were not causing the problem.
I tested the command as user nagios.
I set the logs to maximum debug, and the final command line that was run was:
/usr/lib64/nagios/plugins/check_jabber -H jabber.host.name.here.net -w 10 -c 20 -t 60
When I run that command manually (as nagios user) it works fine with the output:
JABBER OK - 0.098 second response time on port 5222|time=0.097569s;10.000000;20.000000;0.000000;60.0000
Re: return code 127 on test that seems invalid
Posted: Tue Dec 04, 2012 4:37 pm
by ucemike
I did some additional testing: once I have more than roughly 2770-2780 service tests, the 127 errors start. When I reduce the number below that, the check starts working again.