Service fails but succeeds if retried manually

Post by **dann** » Tue Apr 04, 2017 6:55 am

Hi

I have a weird issue with Nagios XI that looks like a false positive.
I wrote a small check, run through NRPE, that verifies the accessibility of a mountpoint.
I had to write a script because the mountpoints I'm checking are managed by autofs and not always mounted, and the Nagios mountpoint wizard fails to monitor them correctly: even with the write test, it always fails.
The only problem is that, although the check works most of the time, I always have around 50-70 checks that fail (and then succeed if I retry manually or the next time Nagios checks them).
This creates A LOT of mail notifications, which makes notifications useless.

What am I doing wrong? The Nagios server is pretty strong (2 E5450 3GHz, 16GB RAM, RAID1 SAS enterprise disks). I tried changing some of the values in nagios.cfg that I found on this forum, but nothing seems to change; the event queue looks stuck at 200 (the small dashlet graph).
I also verified that my NTP server is accurate.

Any help would be appreciated.

Thanks

Post by **mcapra** » Tue Apr 04, 2017 1:00 pm

Can you send us (via PM or post attachment) a system profile? From the Nagios XI GUI, you can gather a profile via Admin -> System Profile -> Download Profile.

Can you also tell us which host/service is producing these issues?

Post by **dann** » Wed Apr 05, 2017 1:14 am

Hi

The profile is attached.

All hosts produce this, randomly, and it only affects the autofs mountpoints that I'm checking with a script.

Here is the script, called check_mountpoints:

Code: Select all

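#!/bin/bash

# Mountpoint to test, passed as the first argument.
mount=${1}

fail=0

# Run the listing in the background and remember its PID.
ls "${mount}" >/dev/null &
childpid=$!
sleep 0.1
# If ls is still running after 0.1 s, kill it and flag the mountpoint as failed.
if [ -d "/proc/${childpid}" ]; then
	kill -9 "$childpid" > /dev/null 2>&1
	fail=1
fi


# Nagios plugin exit codes: 0 = OK, 2 = CRITICAL.
if [ $fail -eq 0 ]; then
	echo "OK - $mount accessible"
	exit 0
fi
if [ $fail -eq 1 ]; then
	echo "CRITICAL - $mount unreachable"
	exit 2
fi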

Post by **mcapra** » Wed Apr 05, 2017 2:05 pm

Let's look at gridcluster31 as an example:

Code: Select all

[1491371254] SERVICE ALERT: gridcluster31;/homes/swlab;CRITICAL;SOFT;1;CRITICAL - /homes/swlab unreachable
[1491371320] SERVICE ALERT: gridcluster31;/mobileye/shared;CRITICAL;SOFT;1;CRITICAL - /mobileye/shared unreachable
[1491371388] SERVICE ALERT: gridcluster31;/mobileye/mbkrepository;CRITICAL;SOFT;2;CRITICAL - /mobileye/mbkrepository unreachable
[1491371610] SERVICE ALERT: gridcluster31;/homes/swlab;CRITICAL;SOFT;1;CRITICAL - /homes/swlab unreachable

In this case, Nagios XI is simply returning what the plugin produces. If the plugin is incorrectly reporting, there's not much that can be done from the Nagios XI end of things. You would need to alter the plugin.
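For anyone wanting to see how these alerts spread across hosts and services, here is a rough sketch that tallies SERVICE ALERT lines in the format shown above. It assumes the fields are semicolon-separated in the order host;service;state;state type;attempt;output, as in the excerpt. `tally_alerts` is a made-up helper name, not part of Nagios.

```shell
#!/bin/bash

# Hypothetical helper: count SERVICE ALERT lines per host and service.
# Reads Nagios log lines on stdin and prints "count host;service",
# highest count first.
tally_alerts() {
    awk -F';' '
        /SERVICE ALERT:/ {
            # Strip the "[timestamp] SERVICE ALERT: " prefix so $1 is the host.
            sub(/^.*SERVICE ALERT: /, "", $1)
            count[$1 ";" $2]++
        }
        END { for (k in count) print count[k], k }
    ' | sort -k1,1nr -k2
}
```

It could be fed from the Nagios log, for example `tally_alerts < /usr/local/nagios/var/nagios.log` (that path is an assumption about the install). On the four lines quoted above, /homes/swlab would come first with a count of 2.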

Post by **danns** » Wed Apr 05, 2017 4:04 pm

Yes, but that's where it doesn't really make sense: when I run the plugin from the client it succeeds, and if I manually force a recheck in Nagios it also succeeds. If Nagios retries by itself, it usually succeeds too.

The plugin is quite simple: it runs ls on an autofs directory and returns 0 or 2.
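One way to look into this is to time the listing itself, since check_mountpoints treats any ls that is still running after 0.1 s as a failure. The sketch below prints how long ls takes in milliseconds. `time_ls_ms` is a hypothetical helper name, and it assumes GNU date (for `%N`). Comparing a run on a mountpoint that autofs has not mounted yet with a run right afterwards might show whether the first access exceeds that window.

```shell
#!/bin/bash

# Hypothetical helper: print the time, in milliseconds, that ls needs
# to list the given directory. Assumes GNU date, which supports %N.
time_ls_ms() {
    local start end
    start=$(date +%s%N)
    ls -- "$1" >/dev/null 2>&1
    end=$(date +%s%N)
    # Convert the nanosecond difference to whole milliseconds.
    echo $(( (end - start) / 1000000 ))
}
```

For example, `time_ls_ms /homes/swlab` run twice in a row on a client.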

Post by **dann** » Thu Apr 06, 2017 12:39 am

I solved my issue by changing my script like this:

Code: Select all

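#!/bin/bash

# Mountpoint to test, passed as the first argument.
mount=${1}

# List the directory in the foreground and use the exit status of ls directly.
if ls "${mount}" > /dev/null 2>&1; then
	echo "OK - $mount accessible"
	exit 0
else
	echo "CRITICAL - $mount unreachable"
	exit 2
fi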

I still think there's an issue here with Nagios, but I worked around it.

Also, please note that Nagios can't check autofs mountpoints: it always finds that the mountpoint is not mounted, and even the write test (from NRPE) doesn't help.

Thanks, you can close this.
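A possible explanation, offered as a guess rather than a diagnosis: the first script counts the mountpoint as failed whenever ls is still running after 0.1 s, and autofs only mounts a filesystem on first access, which can take longer than that. A recheck shortly afterwards would then find it already mounted. The second script has no time limit at all, so it waits as long as ls does. A middle ground is sketched below using `timeout` from GNU coreutils. The function name `check_mount` and the 5-second default are assumptions, not values from this thread.

```shell
#!/bin/bash

# Hypothetical variant of the check: list the mountpoint with a time limit.
# $1 = mountpoint, $2 = limit in seconds (5 is an assumed default).
check_mount() {
    local mount=$1
    local limit=${2:-5}
    local rc=0

    # timeout sends TERM after $limit seconds; -k 1 follows up with KILL
    # one second later if ls is still running.
    timeout -k 1 "$limit" ls -- "$mount" >/dev/null 2>&1 || rc=$?

    case $rc in
        0)
            echo "OK - $mount accessible"
            return 0 ;;
        124|137)
            # 124: timeout expired; 137: ls had to be killed with KILL.
            echo "CRITICAL - $mount did not respond within ${limit}s"
            return 2 ;;
        *)
            echo "CRITICAL - $mount unreachable"
            return 2 ;;
    esac
}

# Act as a plugin only when a mountpoint argument is supplied.
if [ "$#" -gt 0 ]; then
    check_mount "$@"
    exit $?
fi
```

The limit would need to stay below whatever timeouts NRPE and check_nrpe apply. An ls blocked in uninterruptible I/O may not react to signals, so this is not a guarantee against hangs.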

Nagios Support Forum
