Random error 127 out of bounds

Posted: Mon Nov 18, 2013 4:44 am
by darxmurf
Hi all,

I can't find a solution to this issue.
I have errors like this in my logs:
[1383038181] SERVICE ALERT: IMU344;APACHE;CRITICAL;SOFT;1;(Return code of 127 is out of bounds - plugin may be missing)
[1383038181] SERVICE EVENT HANDLER: IMU344;APACHE;CRITICAL;SOFT;1;restart-service
[1383038295] SERVICE ALERT: IMU344;APACHE;OK;SOFT;2;HTTP OK: Status line output matched " 200 OK" - 151 bytes in 0.002 second response time
Everything works fine, but sometimes I get this "127 is out of bounds" error, followed by an OK on the next check.
As this error is usually linked to path or permission issues, I don't understand what's wrong.
All my permissions are OK, and since the check works fine 90% of the time, something else must be going on.
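
For context, 127 is the exit status a POSIX shell reports when it cannot find the command it was asked to run, and the Nagios plugin API only defines return codes 0 to 3. A minimal Python sketch of both facts follows; the helper names `classify_return_code` and `run_via_shell` and the bogus command name are hypothetical and exist only for illustration.

```python
import subprocess

# Return codes defined by the Nagios plugin API; anything else is "out of bounds".
NAGIOS_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def classify_return_code(code):
    """Map a plugin exit status to a state name, or flag it as out of bounds."""
    return NAGIOS_STATES.get(code, "OUT_OF_BOUNDS")


def run_via_shell(command_line):
    """Run a command line through sh and return its exit status."""
    completed = subprocess.run(
        ["sh", "-c", command_line],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return completed.returncode


if __name__ == "__main__":
    # Hypothetical command name that is assumed not to exist on the PATH.
    code = run_via_shell("no_such_plugin_for_illustration_xyz")
    print(code, classify_return_code(code))
```
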

If you have any information about this, thanks in advance!

Darx

Re: Random error 127 out of bounds

Posted: Mon Nov 18, 2013 11:04 am
by abrist
Can you post the check command and service config?

Re: Random error 127 out of bounds

Posted: Tue Nov 19, 2013 1:43 am
by darxmurf
Here is the check:
$USER1$/check_http -H $HOSTNAME$ -w 10 -t 15 -p 80 -s ALIVE -u /.test-nagios.php -e " 200 OK"
and the service definition:
define service {
service_description APACHE
use active
host_name IMU344
check_command check-http
}
Our installation is quite big, and we mostly use passive checks:
# Active Host / Service Checks: 161 / 6377
# Passive Host / Service Checks: 1122 / 6404
The *bad* point is that we set it up a few years ago with MONARCH, which turned out to be quite bad: it generates huge cfg files that are not optimized at all.
I updated our setup a few months ago from Nagios 2.12 to 3.5.1, but I kept Monarch in place.

Re: Random error 127 out of bounds

Posted: Tue Nov 19, 2013 11:09 am
by slansing
Puzzling. How long does it stay in a critical out-of-bounds state when this happens? Can you correlate it with a specific time each day or week?
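
One way to look for such a correlation is to bucket the 127 alerts by hour of day. The sketch below assumes the epoch-stamped log format shown in the first post; the function name `count_127_by_hour` is hypothetical, and hours are reported in UTC, not the server's local time.

```python
import re
from collections import Counter
from datetime import datetime, timezone

# Matches epoch-stamped alert lines such as:
# [1383038181] SERVICE ALERT: IMU344;APACHE;CRITICAL;SOFT;1;(Return code of 127 ...)
ALERT_RE = re.compile(r"^\[(\d+)\] SERVICE ALERT: .*Return code of 127")


def count_127_by_hour(lines):
    """Return a Counter mapping UTC hour of day to the number of 127 alerts."""
    hours = Counter()
    for line in lines:
        match = ALERT_RE.match(line)
        if match:
            stamp = datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)
            hours[stamp.hour] += 1
    return hours


if __name__ == "__main__":
    # Sample lines copied from the log excerpt in the first post.
    sample = [
        "[1383038181] SERVICE ALERT: IMU344;APACHE;CRITICAL;SOFT;1;"
        "(Return code of 127 is out of bounds - plugin may be missing)",
        "[1383038181] SERVICE EVENT HANDLER: IMU344;APACHE;CRITICAL;SOFT;1;restart-service",
    ]
    for hour, count in sorted(count_127_by_hour(sample).items()):
        print(f"{hour:02d}:00 UTC  {count}")
```

If the counts cluster around particular hours, that points toward something scheduled (a cron job, a backup, a config reload) rather than a truly random failure.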

Re: Random error 127 out of bounds

Posted: Wed Nov 20, 2013 3:11 am
by darxmurf
I think I found something.

All the services that return the 127 error have been configured with active AND passive checks (the guy who did this installation wanted a double check).
And they hit the same test script; for example, the APACHE service check targets a simple PHP echo script.
Is it a problem for Nagios to have both checks for the same service?
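
To list which services are affected, one could scan the generated config for service blocks that set both `active_checks_enabled` and `passive_checks_enabled` to 1. This is only a rough sketch: the function names are hypothetical, and it inspects values written explicitly in each `define service` block, so settings inherited from a template (such as `use active`) or left at their defaults are not resolved.

```python
import re

# Captures the body of each "define service { ... }" block.
BLOCK_RE = re.compile(r"define\s+service\s*\{(.*?)\}", re.DOTALL)


def parse_service_blocks(config_text):
    """Return one dict of directive -> value per service block."""
    services = []
    for body in BLOCK_RE.findall(config_text):
        directives = {}
        for raw in body.splitlines():
            line = raw.split(";", 1)[0].strip()  # drop inline ';' comments
            if not line or line.startswith("#"):
                continue
            parts = line.split(None, 1)
            directives[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
        services.append(directives)
    return services


def both_check_modes(services):
    """Return (host, service) pairs with both flags explicitly set to 1."""
    return [
        (s.get("host_name"), s.get("service_description"))
        for s in services
        if s.get("active_checks_enabled") == "1"
        and s.get("passive_checks_enabled") == "1"
    ]


if __name__ == "__main__":
    # Illustrative config text; the two flag lines are assumed, not from the thread.
    sample = """
    define service {
    service_description APACHE
    host_name IMU344
    check_command check-http
    active_checks_enabled 1
    passive_checks_enabled 1
    }
    """
    print(both_check_modes(parse_service_blocks(sample)))
```
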

Re: Random error 127 out of bounds

Posted: Wed Nov 20, 2013 10:15 am
by slansing
I guess I am a bit confused; you can't have a service be both active and passive. Do you mean two services, one active and one passive? That does not make much sense either.

Re: Random error 127 out of bounds

Posted: Thu Nov 21, 2013 8:38 am
by darxmurf
I agree...

So, in my conf (using Monarch), my APACHE check is in ACTIVE mode but it also accepts passive checks... which is quite useless, I think.
I will drop one of them... but the question is: which one to keep? Active or passive?

EDIT:
I disabled the passive script on all the machines.

The active check works fine, but a few minutes later this appears in the log:
[Thu Nov 21 14:55:32 2013] Warning: Return code of 127 for check of service 'APACHE' on host 'DMU117' was out of bounds. Make sure the plugin you're trying to run actually exists.
[Thu Nov 21 14:55:32 2013] SERVICE ALERT: DMU117;APACHE;CRITICAL;SOFT;1;(Return code of 127 is out of bounds - plugin may be missing)
[Thu Nov 21 14:57:40 2013] SERVICE ALERT: DMU117;APACHE;OK;SOFT;2;HTTP OK: Status line output matched " 200 OK" - 151 bytes in 0.001 second response time
And the critical state changes to "OK" after the next check.

Re: Random error 127 out of bounds

Posted: Thu Nov 21, 2013 12:14 pm
by abrist
What timeout do you have set for the check?
Can you reproduce the intermittent behavior by running the check from the cli multiple times?
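
A repeated command-line run can be scripted so that every exit status is tallied instead of watched by eye. The sketch below is one possible approach, not a Nagios tool: `run_repeatedly` is a hypothetical helper, the 20-second default timeout is an assumed value to adjust to your own setting, and recording a missing executable as 127 is a convention chosen here to mirror what a shell would report.

```python
import subprocess
from collections import Counter


def run_repeatedly(argv, runs, timeout_seconds=20):
    """Run argv `runs` times and tally the exit statuses that were seen."""
    tally = Counter()
    for _ in range(runs):
        try:
            completed = subprocess.run(
                argv,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout_seconds,
            )
            tally[completed.returncode] += 1
        except FileNotFoundError:
            # Executable missing: recorded as 127, as a shell would report it.
            tally[127] += 1
        except subprocess.TimeoutExpired:
            tally["timeout"] += 1
    return tally


if __name__ == "__main__":
    # Placeholder command; replace with the fully expanded check command line.
    print(run_repeatedly(["sh", "-c", "exit 0"], runs=5))
```

Running this as the same user that Nagios runs as matters, since path and permission problems often only show up under that account.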

Re: Random error 127 out of bounds

Posted: Fri Nov 22, 2013 4:18 am
by darxmurf
Here is the conf I have:
# SERVICE CHECK TIMEOUT
service_check_timeout=20

# HOST CHECK TIMEOUT
host_check_timeout=5

# EVENT HANDLER TIMEOUT
event_handler_timeout=50

# NOTIFICATION TIMEOUT
notification_timeout=30
I tried launching an Apache test every second for a while, but I can't get any error... it's still running, so I'll see.
...
HTTP OK: Status line output matched " 200 OK" - 151 bytes in 0.007 second response time |time=0.007248s;10.000000;;0.000000 size=151B;;;0
HTTP OK: Status line output matched " 200 OK" - 151 bytes in 0.002 second response time |time=0.001514s;10.000000;;0.000000 size=151B;;;0
HTTP OK: Status line output matched " 200 OK" - 151 bytes in 0.008 second response time |time=0.008346s;10.000000;;0.000000 size=151B;;;0
HTTP OK: Status line output matched " 200 OK" - 151 bytes in 0.001 second response time |time=0.001185s;10.000000;;0.000000 size=151B;;;0
HTTP OK: Status line output matched " 200 OK" - 151 bytes in 0.006 second response time |time=0.005511s;10.000000;;0.000000 size=151B;;;0
HTTP OK: Status line output matched " 200 OK" - 151 bytes in 0.001 second response time |time=0.001202s;10.000000;;0.000000 size=151B;;;0
...
The "Monarch" tool we are using created all the config files, but in a bad way, I think.
I have one block of text for each service on each machine... and as we monitor more than 1000 machines...
wc -l /opt/nagios/etc/services.cfg
110296 /opt/nagios/etc/services.cfg
do you think this can bring some random issues or whatever ?

Re: Random error 127 out of bounds

Posted: Fri Nov 22, 2013 11:37 am
by abrist
darxmurf wrote:do you think this can bring some random issues or whatever ?
I don't think so, as 1k hosts in a 100k-line file is only about 100 lines per host. That would be about 6-9 checks per host at 15 or so lines per service check. Maybe the active checks are timing out while the passive ones are working?
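
The estimate above is simple arithmetic: 110296 lines / 1000 hosts is roughly 110 lines per host. To check it against the real file instead of round numbers, a small sketch like the following could be used. The function name `summarize_services_cfg` is hypothetical, and it assumes each service block names exactly one host on its `host_name` line, as described for the Monarch-generated config.

```python
import re


def summarize_services_cfg(config_text):
    """Count lines, service blocks, and distinct hosts in a services.cfg text."""
    total_lines = len(config_text.splitlines())
    services = len(
        re.findall(r"^[ \t]*define[ \t]+service[ \t]*\{", config_text, re.MULTILINE)
    )
    hosts = set(re.findall(r"^[ \t]*host_name[ \t]+(\S+)", config_text, re.MULTILINE))
    host_count = len(hosts)
    return {
        "lines": total_lines,
        "services": services,
        "hosts": host_count,
        "lines_per_host": total_lines / host_count if host_count else 0.0,
        "services_per_host": services / host_count if host_count else 0.0,
    }


if __name__ == "__main__":
    # Two blocks modeled on the service definition posted earlier in the thread.
    sample = "\n".join([
        "define service {",
        "service_description APACHE",
        "use active",
        "host_name IMU344",
        "check_command check-http",
        "}",
        "define service {",
        "service_description APACHE",
        "use active",
        "host_name DMU117",
        "check_command check-http",
        "}",
    ])
    print(summarize_services_cfg(sample))
```
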