service fails but succeed if retried manually

dann
Posts: 3
Joined: Thu Mar 23, 2017 4:23 am

service fails but succeed if retried manually

Post by dann »

Hi

I have a weird issue with Nagios XI that looks like a false positive.
I wrote a small check, run through NRPE, that verifies that a mountpoint is accessible.
I had to write a script because the mountpoints I'm checking are managed by autofs and are not always mounted, and the Nagios mountpoint wizard fails to monitor them correctly (even with the write test); it always fails.
The only problem is that, although it works most of the time, I always have around 50-70 checks that fail (and then succeed if I retry manually or the next time Nagios checks them).
This generates A LOT of mail notifications, which makes notifications useless.

What am I doing wrong? The Nagios server is pretty strong (2 E5450 3GHz, 16GB RAM, RAID1 SAS enterprise disks). I tried changing some of the values in nagios.cfg that I found on this forum, but nothing seems to change; the event queue seems stuck at 200 (the small dashlet graph).
I also verified that my NTP server is accurate.

Any help would be appreciated.

Thanks
mcapra
Posts: 3739
Joined: Thu May 05, 2016 3:54 pm

Re: service fails but succeed if retried manually

Post by mcapra »

Can you send us (via PM or post attachment) a system profile? From the Nagios XI GUI, you can gather a profile via Admin -> System Profile -> Download Profile.

Can you also tell us which host/service is producing these issues?
Former Nagios employee
https://www.mcapra.com/
dann
Posts: 3
Joined: Thu Mar 23, 2017 4:23 am

Re: service fails but succeed if retried manually

Post by dann »

Hi

The profile is attached.

All hosts produce this, randomly, and it only affects the autofs mountpoints that I'm checking with a script.

Here is the script, called check_mountpoints:

Code:

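#!/bin/bash

# Mount point to test, passed as the first argument
mount=${1}

fail=0

# List the directory in the background
ls "${mount}" >/dev/null &
childpid=$!
# Give ls 0.1 s to finish
sleep 0.1
# If ls is still running, kill it and treat the mount as unreachable
if [ -d "/proc/${childpid}" ]; then
	kill -9 "$childpid" > /dev/null 2>&1
	fail=1
fi

if [ $fail -eq 0 ]; then
	echo "OK - $mount accessible"
	exit 0
fi
if [ $fail -eq 1 ]; then
	echo "CRITICAL - $mount unreachable"
	exit 2
fi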
mcapra
Posts: 3739
Joined: Thu May 05, 2016 3:54 pm

Re: service fails but succeed if retried manually

Post by mcapra »

Let's look at gridcluster31 as an example:

Code:

[1491371254] SERVICE ALERT: gridcluster31;/homes/swlab;CRITICAL;SOFT;1;CRITICAL - /homes/swlab unreachable
[1491371320] SERVICE ALERT: gridcluster31;/mobileye/shared;CRITICAL;SOFT;1;CRITICAL - /mobileye/shared unreachable
[1491371388] SERVICE ALERT: gridcluster31;/mobileye/mbkrepository;CRITICAL;SOFT;2;CRITICAL - /mobileye/mbkrepository unreachable
[1491371610] SERVICE ALERT: gridcluster31;/homes/swlab;CRITICAL;SOFT;1;CRITICAL - /homes/swlab unreachable
In this case, Nagios XI is simply returning what the plugin produces. If the plugin is reporting incorrectly, there's not much that can be done from the Nagios XI end of things. You would need to alter the plugin.
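
To see how often each service produces alerts like these, and whether they are SOFT or HARD, the log lines can be tallied. The following is only a sketch: the function name `summarize_alerts` is made up for illustration, and the field layout is inferred from the excerpt above rather than taken from any Nagios documentation quoted in this thread.

```shell
#!/bin/bash
# Sketch: count SERVICE ALERT lines per host, service, and state type.
# Assumed field layout, inferred from the excerpt in this thread:
#   [epoch] SERVICE ALERT: host;service;state;type;attempt;output

summarize_alerts() {
    awk -F';' '/SERVICE ALERT:/ {
        # Strip the timestamp and label so that $1 holds only the host name
        sub(/^.*SERVICE ALERT: /, "", $1)
        count[$1 ";" $2 ";" $4]++
    }
    END {
        # One output line per combination: <count> <host;service;type>
        for (key in count) print count[key], key
    }' "$@" | LC_ALL=C sort -k2
}
```

It reads the files given as arguments, or standard input if none are given, for example `summarize_alerts /usr/local/nagios/var/nagios.log` (that path is an assumption about a typical install). Many SOFT entries with few HARD ones would suggest the failures clear on retry, which matches what is described in this thread.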
danns
Posts: 1
Joined: Thu Mar 16, 2017 9:28 am

Re: service fails but succeed if retried manually

Post by danns »

Yes, but that's where it doesn't really make sense: when I run the plugin from the client it succeeds, and if I manually force a recheck in Nagios it also succeeds. If Nagios retries by itself, it usually succeeds too.

The plugin is quite simple: it runs ls on an autofs directory and returns 0 or 2.
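
A single manual run is a small sample, so an intermittent failure can easily be missed. One way to make the failure rate visible on the client is to run the plugin many times and count the non-zero exits. This is a minimal sketch; `count_failures` is a hypothetical helper, and the plugin path in the example is an assumption.

```shell
#!/bin/bash
# Sketch: run a command repeatedly and print how many runs exited non-zero.
# Usage: count_failures <runs> <command> [args...]

count_failures() {
    local runs=$1 failures=0 i
    shift
    for ((i = 0; i < runs; i++)); do
        # Discard the plugin output; only the exit status matters here
        "$@" > /dev/null 2>&1 || failures=$((failures + 1))
    done
    echo "$failures"
}

# Example (the plugin path is hypothetical):
#   count_failures 200 /usr/local/nagios/libexec/check_mountpoints /homes/swlab
```

Back-to-back runs probably keep an autofs mount active, so if the failures only happen when the mount has to be set up again after being idle, this loop may not reproduce them; spacing the runs out would be needed in that case.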
dann
Posts: 3
Joined: Thu Mar 23, 2017 4:23 am

Re: service fails but succeed if retried manually

Post by dann »

I solved my issue by changing my script like this:

Code:

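#!/bin/bash

# Mount point to test, passed as the first argument
mount=${1}

# Run ls in the foreground and report based on its exit status
if ls "${mount}" > /dev/null 2>&1; then
	echo "OK - $mount accessible"
	exit 0
else
	echo "CRITICAL - $mount unreachable"
	exit 2
fi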

I still think there's an issue here with Nagios, but I worked around it.

Also, please note that Nagios can't check autofs mountpoints, because it always finds that the mountpoint is not mounted, and even the write test (from NRPE) doesn't help.

Thanks, you can close this :)
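
One possible explanation for the false positives, not confirmed anywhere in this thread, is that the first version of the script allowed ls only 0.1 s before killing it, which may be too short when autofs has to mount the directory on demand. The fixed version waits for ls with no limit, so a truly hung mount would block until the NRPE timeout. A middle ground could be sketched with the coreutils `timeout` command. The function name `check_mount` and the 5 second default are assumptions chosen for illustration.

```shell
#!/bin/bash
# Sketch: bound the directory listing with coreutils `timeout` instead of a
# fixed sleep. The 5 s default limit is an assumed value, not from the thread.
# Usage: check_mount <mountpoint> [limit_seconds]

check_mount() {
    local mount=$1 limit=${2:-5} rc

    if [ -z "$mount" ]; then
        echo "UNKNOWN - no mount point given"
        return 3
    fi

    # timeout exits with status 124 when the command ran past the limit
    timeout "$limit" ls "$mount" > /dev/null 2>&1
    rc=$?

    if [ "$rc" -eq 0 ]; then
        echo "OK - $mount accessible"
        return 0
    elif [ "$rc" -eq 124 ]; then
        echo "CRITICAL - $mount did not respond within ${limit}s"
        return 2
    else
        echo "CRITICAL - $mount unreachable"
        return 2
    fi
}

# To use this as a plugin, end the script with:
#   check_mount "$1"; exit $?
```

The limit should stay below the NRPE command timeout so that the plugin, rather than NRPE, decides the result.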