Nagios XI not returning correct State Status
Posted: Thu Jun 11, 2020 3:26 am
by louissiong
Hi Nagios Support,
Lately, we added a new shell script to our environment to monitor a log file for a specific pattern.
The script should return only two states, STATE_OK and STATE_CRITICAL, depending on whether the keyword "Exception" is found.
If an exception is present, a STATE_CRITICAL alert should be displayed in Nagios XI.
However, we have not been able to achieve this. Instead, Nagios XI shows a different state:
WARNING - check_by_ssh: Remote command '/app/abt2/pg-server/check_auto.sh 0 ' returned status 1
Is there anything we should correct in the script to make it work?
Please advise and let us know. Thanks.
Regards,
Louis
Re: Nagios XI not returning correct State Status
Posted: Thu Jun 11, 2020 5:10 pm
by ssax
I see you are calling it with check_by_ssh. Please send us the full command that you are using to call it.
Add -v to the check_by_ssh command and send us the full output.
Run the plugin manually on the remote machine as the nagios user (or whatever user it's supposed to run as). Does it work?
Check the permissions on the log file as well.
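For the manual run suggested above, it can help to print the exit status next to the plugin output, since the exit status is what check_by_ssh reports. Below is a minimal sketch, assuming bash is available on the remote machine. The helper names `state_name` and `run_check` are hypothetical and exist only for illustration; the state numbers follow the table in the script posted in this thread (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```shell
#!/bin/bash
# Hypothetical helper: map a plugin exit code to a Nagios state name.
state_name() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        3) echo "UNKNOWN" ;;
        *) echo "OUT-OF-RANGE($1)" ;;
    esac
}

# Hypothetical helper: run a command, then show its combined output
# and how its exit code maps to a Nagios state.
run_check() {
    local output rc
    output=$("$@" 2>&1)
    rc=$?
    echo "output: $output"
    echo "exit code: $rc ($(state_name "$rc"))"
    return "$rc"
}

# Example, using the path from this thread (run as the monitoring user):
# run_check /app/abt2/pg-server/check_auto.sh 0
```

An exit code of 1 with an error message in the output would suggest that a command inside the script failed, rather than that the script chose to report a warning.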
Re: Nagios XI not returning correct State Status
Posted: Thu Jun 11, 2020 10:07 pm
by louissiong
Hi Ssax,
Sure. I tried it with the -v parameter, and the output is below.
The permissions on the log file are fine, as I am able to obtain the correct
state status with simplified, non-nested if-else statements.
Warning
Command: /usr/bin/ssh
Argument 1: 192.168.184.62
Argument 2: /app/abt2/pg-server/check_auto.sh 0
WARNING - check_by_ssh: Remote command '/app/abt2/pg-server/check_auto.sh 0 ' returned status 1
The script contains a loop that reads through a count file. Once these lines are commented out,
the state is returned correctly. See below.
Please advise. Thanks.
#loop through the count file
while IFS= read -r line; do
    #echo "$line"
    count=$line
done < "$file"
#!/usr/bin/sh
# NAGIOS EXIT STATES
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3
location=${PWD}
service=${PWD##*/}
hostname=$(hostname)
file=$location/errorCount.txt
count=0
USAGE="Usage <PATTERN_CHECK>"
NOTE="This scripts checks for Exception pattern in pg-server.log"
# Print usage
if [[ $# -ne 1 ]]; then
    echo -e "Usage: $USAGE\n\n"
    echo -e "Note: $NOTE\n\n"
    exit $STATE_UNKNOWN
fi
#typeset dd=$(date +%d)
#typeset ym=$(date +%Y%m)
dd=`date '+%d'`
ym=`date '+%Y%m'`
filecheck1="/app/abt2/pg-server/logs/pg-server.log"
alertFile="$location""/$service""_"$hostname"_ALERT.txt"
#loop through the count file
while IFS= read -r line; do
    #echo "$line"
    count=$line
done < "$file"
#dt1=`perl -MPOSIX -le 'print strftime "%c", localtime(time()-60)'`
ecount=$1
#error1=`grep $dt1 $filecheck1 | grep -i "hsmException" | tail -n1 | wc -l | sed -e 's/ //g'`
#error1=`tail -1 $filecheck1 | grep -i "hsmException" | wc -l | sed -e 's/ //g'`
error1=`grep -i "Exception" $filecheck1 | wc -l | sed -e 's/ //g'`
#if [[ ! -e $filecheck1 ]]; then
# echo "Unable to find the tsp-server.log !" >> HSM_LOG.txt
# exit 0
#fi
if [[ ( "$error1" -gt "$ecount" ) || ("$error1" -lt "$ecount" ) ]]; then
    #Overwrite the counter
    echo "$error1" > $file
    if [[ "$error1" -gt "$ecount" ]]; then
        echo "Exception found on $filecheck1, Please check on $alertFile"
        echo "${ym}-${dd}" "Exception found on $filecheck1" >> $alertFile " .Error Count:""${error1}" >> "$alertFile"
        #echo "Exception found on" "${ym}0${dd}" >> ALERT.txt echo "${error1}" >> ALERT.txt
        #echo "HSM Exception found !"
        exit $STATE_CRITICAL
    elif [[ "$error1" -le "$ecount" ]]; then
        echo "No Exception found !"
        exit $STATE_OK
    fi
else
    echo "No NEW Exception Found!"
    exit $STATE_OK
    # exit 0
fi
Re: Nagios XI not returning correct State Status
Posted: Fri Jun 12, 2020 3:24 pm
by ssax
You have to have this at the top of the script:
Re: Nagios XI not returning correct State Status
Posted: Tue Jun 16, 2020 11:26 am
by louissiong
I have already added the line #!/usr/bin/sh at the top of the script, but it still doesn't work.
By the way, we investigated the code and found some probable issues.
Will the following work with Nagios XI? Thanks.
#loop through the count file
while IFS= read -r line; do
    #echo "$line"
    count=$line
done < "$file"
Regards,
Louis
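One possibility, not confirmed anywhere in this thread, is that "$file" cannot be opened when the script is started over SSH. The script sets location=${PWD}, so the path to errorCount.txt depends on the directory the script is launched from. Below is a minimal sketch under that assumption, written for bash. The helper `read_count` and the variable `script_dir` are hypothetical names. The sketch resolves the count file relative to the script's own directory and falls back to 0 when the file is not readable.

```shell
#!/bin/bash
# Hypothetical helper: return the last line of a count file, or 0 when
# the file is missing, unreadable, or empty, so the redirect cannot fail.
read_count() {
    local file=$1 line count=0
    if [[ -r "$file" ]]; then
        while IFS= read -r line; do
            count=$line
        done < "$file"
    fi
    echo "$count"
}

# Assumption: errorCount.txt lives next to the script, not in $PWD.
script_dir=$(cd "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)
file="$script_dir/errorCount.txt"
count=$(read_count "$file")
echo "previous count: $count"
```

Whether this addresses the status 1 result depends on what the verbose output and a manual run as the monitoring user actually show.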
Re: Nagios XI not returning correct State Status
Posted: Wed Jun 17, 2020 9:36 am
by ssax
This code works on my system (no changes):
#!/usr/bin/sh
# NAGIOS EXIT STATES
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3
location=${PWD}
service=${PWD##*/}
hostname=$(hostname)
file=$location/errorCount.txt
count=0
USAGE="Usage <PATTERN_CHECK>"
NOTE="This scripts checks for Exception pattern in pg-server.log"
# Print usage
if [[ $# -ne 1 ]]; then
    echo -e "Usage: $USAGE\n\n"
    echo -e "Note: $NOTE\n\n"
    exit $STATE_UNKNOWN
fi
#typeset dd=$(date +%d)
#typeset ym=$(date +%Y%m)
dd=`date '+%d'`
ym=`date '+%Y%m'`
filecheck1="/app/abt2/pg-server/logs/pg-server.log"
alertFile="$location""/$service""_"$hostname"_ALERT.txt"
#loop through the count file
while IFS= read -r line; do
    #echo "$line"
    count=$line
done < "$file"
#dt1=`perl -MPOSIX -le 'print strftime "%c", localtime(time()-60)'`
ecount=$1
#error1=`grep $dt1 $filecheck1 | grep -i "hsmException" | tail -n1 | wc -l | sed -e 's/ //g'`
#error1=`tail -1 $filecheck1 | grep -i "hsmException" | wc -l | sed -e 's/ //g'`
error1=`grep -i "Exception" $filecheck1 | wc -l | sed -e 's/ //g'`
#if [[ ! -e $filecheck1 ]]; then
# echo "Unable to find the tsp-server.log !" >> HSM_LOG.txt
# exit 0
#fi
if [[ ( "$error1" -gt "$ecount" ) || ("$error1" -lt "$ecount" ) ]]; then
    #Overwrite the counter
    echo "$error1" > $file
    if [[ "$error1" -gt "$ecount" ]]; then
        echo "Exception found on $filecheck1, Please check on $alertFile"
        echo "${ym}-${dd}" "Exception found on $filecheck1" >> $alertFile " .Error Count:""${error1}" >> "$alertFile"
        #echo "Exception found on" "${ym}0${dd}" >> ALERT.txt echo "${error1}" >> ALERT.txt
        #echo "HSM Exception found !"
        exit $STATE_CRITICAL
    elif [[ "$error1" -le "$ecount" ]]; then
        echo "No Exception found !"
        exit $STATE_OK
    fi
else
    echo "No NEW Exception Found!"
    exit $STATE_OK
    # exit 0
fi
Try adding a -v and a -t 120 to the command to see if it works for you:
/usr/local/nagios/libexec/check_by_ssh -H X.X.X.X -C '/app/abt2/pg-server/check_auto.sh 0' -l root -v -t 120
Send us the full output.