
service fails but succeed if retried manually

Posted: Tue Apr 04, 2017 6:55 am
by dann
Hi

I have a weird issue with Nagios XI, something like a false positive.
I wrote a small check, run through NRPE, that verifies the accessibility of a mountpoint.
I had to write a script because the mountpoints I'm checking are managed by autofs and not always mounted, and the Nagios mountpoint wizard fails to monitor them correctly (even with the write test); it always fails.
The only problem is that it works most of the time, but I always have around 50-70 checks that fail (and succeed if I retry manually or the next time Nagios checks them).
This creates A LOT of mail notifications, making notifications useless.

What am I doing wrong? The Nagios server is pretty strong (2 E5450 3GHz, 16GB RAM, RAID1 SAS enterprise disks) and I tried changing some of the values in nagios.cfg that I found on this forum, but nothing seems to change; the event queue seems stuck at 200 (the small dashlet graph).
I also verified that my NTP server is accurate.

Any help would be appreciated.

Thanks

Re: service fails but succeed if retried manually

Posted: Tue Apr 04, 2017 1:00 pm
by mcapra
Can you send us (via PM or post attachment) a system profile? From the Nagios XI GUI, you can gather a profile via Admin -> System Profile -> Download Profile.

Can you also tell us which host/service is producing these issues?

Re: service fails but succeed if retried manually

Posted: Wed Apr 05, 2017 1:14 am
by dann
Hi

The profile is attached.

All hosts produce this, randomly, and it only concerns the autofs mountpoints that I'm checking with a script.

Here is the script, called check_mountpoints:

Code:

#!/bin/bash

mount=${1}

fail=0

ls ${mount} >/dev/null &
childpid=$!
sleep 0.1
if [ -d "/proc/${childpid}" ]; then
	kill -9 $childpid > /dev/null 2>&1
	fail=1
fi


if [ $fail -eq 0 ]; then
	echo "OK - $mount accessible"
	exit 0
fi
if [ $fail -eq 1 ]; then
	echo "CRITICAL - $mount unreachable"
	exit 2
fi

Re: service fails but succeed if retried manually

Posted: Wed Apr 05, 2017 2:05 pm
by mcapra
Let's look at gridcluster31 as an example:

Code:

[1491371254] SERVICE ALERT: gridcluster31;/homes/swlab;CRITICAL;SOFT;1;CRITICAL - /homes/swlab unreachable
[1491371320] SERVICE ALERT: gridcluster31;/mobileye/shared;CRITICAL;SOFT;1;CRITICAL - /mobileye/shared unreachable
[1491371388] SERVICE ALERT: gridcluster31;/mobileye/mbkrepository;CRITICAL;SOFT;2;CRITICAL - /mobileye/mbkrepository unreachable
[1491371610] SERVICE ALERT: gridcluster31;/homes/swlab;CRITICAL;SOFT;1;CRITICAL - /homes/swlab unreachable

In this case, Nagios XI is simply returning what the plugin produces. If the plugin is reporting incorrectly, there's not much that can be done from the Nagios XI end of things. You would need to alter the plugin.
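
One way to test whether the plugin itself is misreporting: autofs mounts a filesystem on first access and unmounts it again after an idle period, so the first ls after that period may take noticeably longer than later ones. If that first access takes more than the 0.1 seconds the posted script allows, the script would report CRITICAL even though the mount is healthy. This is a guess, not something the logs prove. A rough sketch for measuring it on a client follows; `time_access` is a hypothetical helper, and `date +%s%N` assumes GNU date.

```shell
#!/bin/bash
# Diagnostic sketch only: time_access is a hypothetical helper,
# not part of Nagios, NRPE, or autofs.

time_access() {
    local path=$1
    local start end
    start=$(date +%s%N)                   # nanoseconds since epoch (GNU date)
    ls -- "$path" > /dev/null 2>&1        # the same access the plugin performs
    end=$(date +%s%N)
    echo $(( (end - start) / 1000000 ))   # elapsed time in milliseconds
}

# When called with a path, time two accesses in a row so that a slow
# first access (possible on-demand mount) can be compared with a warm one.
if [ "$#" -gt 0 ]; then
    echo "first access:  $(time_access "$1") ms"
    echo "second access: $(time_access "$1") ms"
fi
```

Running it against a mountpoint that has been idle long enough to be unmounted, for example /homes/swlab from the log above, would show whether the first access exceeds 100 ms.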

Re: service fails but succeed if retried manually

Posted: Wed Apr 05, 2017 4:04 pm
by danns
Yes, but that's where it doesn't really make sense: when I run the plugin from the client it succeeds, and if I manually force a recheck in Nagios it also succeeds. If Nagios retries by itself it usually succeeds too.

The plugin is quite simple: it runs ls on an autofs directory and returns 0 or 2.

Re: service fails but succeed if retried manually

Posted: Thu Apr 06, 2017 12:39 am
by dann
I solved my issue by changing my script like this:

Code:

#!/bin/bash

mount=${1}

# List the mountpoint and report based on whether ls succeeded.
if ls "${mount}" > /dev/null 2>&1; then
	echo "OK - $mount accessible"
	exit 0
else
	echo "CRITICAL - $mount unreachable"
	exit 2
fi

I still think that there's an issue here with Nagios, but I worked around it.

Also, please note that Nagios can't check autofs mountpoints, because it always finds that the mountpoint is not mounted, and even the write test (from NRPE) doesn't help.

Thanks, you can close this :)
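
A note on the revised script: it has no time limit of its own, so if a mount hangs, ls blocks until NRPE gives up on the command. A minimal sketch that keeps a bound but makes it far more generous than the original 0.1 seconds is below. It assumes GNU coreutils `timeout` is installed on the client; the `check_mount` name and the 10-second default are hypothetical and would need tuning per site.

```shell
#!/bin/bash
# Sketch only: check_mount and its 10-second default are illustrative
# assumptions, not something taken from Nagios or NRPE documentation.

check_mount() {
    local mount=$1
    local limit=${2:-10}   # seconds to wait before giving up (assumed value)

    # timeout(1) kills ls if it runs longer than the limit. The limit has
    # to cover the time autofs may need to mount the filesystem on first
    # access, not only the directory listing itself.
    if timeout -s KILL "$limit" ls -- "$mount" > /dev/null 2>&1; then
        echo "OK - $mount accessible"
        return 0
    fi
    echo "CRITICAL - $mount unreachable"
    return 2
}

# Behave like a plugin only when a mountpoint argument is supplied.
if [ "$#" -gt 0 ]; then
    check_mount "$@"
    exit $?
fi
```

Called as `check_mountpoints /homes/swlab` it uses the default limit; a second argument overrides it. The limit should stay below the NRPE command timeout so the plugin reports CRITICAL itself instead of being cut off.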