check_files.pl failing after XI update from 2012 to 2014

Posted: Mon Oct 27, 2014 8:12 am
by detronict
We use over 30 Nagios XI Servers to monitor various client environments.
Each of these XI servers uses rsync to push the Nagios backups to a dedicated directory on a remote server.


We use the check_files plugin to (daily) monitor the backup-files on this remote server.
We check for file-age and number of files with various thresholds.

This was working like a charm on Nagios XI 2012r2.9.

Last week we decided to update to 2014r1.5 to take advantage of the new version.

We first did a yum update of the Linux software and after that an upgrade to the new Nagios version.
There were no install errors, and everything seemed OK.

The next day we found that the file checks were not functional anymore!!

The check still works when executed from the command line, but fails when executed by the Nagios scheduler!
The error is:

OUTPUT: UNKNOWN ERROR - execution of ssh -o BatchMode=yes -o ConnectTimeout=30 192.168.254.103 "cd /data/braude-rhn01 && LANG=C ls -l" resulted in an error 65280 -
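
A note on the number 65280: if the plugin is reporting the raw wait status of the ssh child process (an assumption, the thread does not confirm it), the actual exit code sits in the high byte. A minimal shell sketch of the decoding, where decode_wait_status is a made-up helper name and not part of check_files.pl:

```shell
# Hypothetical helper: split a 16-bit wait status into its parts.
# High byte = exit code of the child, low 7 bits = terminating signal (0 if none).
decode_wait_status() {
    status=$1
    echo "exit_code=$(( status >> 8 )) signal=$(( status & 127 ))"
}

decode_wait_status 65280
```

Under that assumption the status decodes to exit code 255, which is the value ssh itself returns when it hits an error, as opposed to passing through the exit code of the remote command.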

The command as generated when testing in XI CCM:

/usr/local/nagios/libexec/check_files.pl -L "XI Backup braude-rhn01" -H 192.168.254.103 -D /data/braude-rhn01 -F \*.tar.gz -w 4:40 -c 1:45 --age 82800,828801 -f

Output when executed from CLI:

XI Backup braude-rhn01 OK - Most recent timestamp is 16 hours 6 minutes old, 31 *.tar.gz files found | '*.tar.gz'=31;4:40;1:45 age_oldest=2653571s;82800;828801 age_newest=57971s
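
For readers comparing the status text with the performance data: the age values are in seconds. A minimal shell sketch of the conversion, where fmt_age is a hypothetical helper written for illustration and not something the plugin provides:

```shell
# Hypothetical helper: convert an age in seconds to "H hours M minutes".
fmt_age() {
    secs=$1
    echo "$(( secs / 3600 )) hours $(( secs % 3600 / 60 )) minutes"
}

fmt_age 57971    # age_newest from the output above
fmt_age 82800    # first value passed to --age
```

With these inputs the sketch gives 16 hours 6 minutes for age_newest, matching the status text, and 23 hours 0 minutes for the first --age value.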

When run by the XI scheduler, the above-mentioned error occurs.

Who can/will help us???

Regards

Rik Lijkendijk
detron ICT

Re: check_files.pl failing after XI update from 2012 to 2014

Posted: Mon Oct 27, 2014 2:49 pm
by lmiltchev
What is the version of the plugin that you are using?

Code:

./check_files.pl -h
I found a post on the Nagios Exchange, which provides a "workaround":

http://exchange.nagios.org/directory/Pl ... ry/details

I commented out lines 742-745 of the plugin and this fixed the issue for me. I will ask our developers to take a look at this.

Re: check_files.pl failing after XI update from 2012 to 2014

Posted: Tue Oct 28, 2014 4:46 am
by detronict
I've tried commenting out lines 742-745.
This gets rid of the UNKNOWN error, but does not solve my problem.

Still successful when executed from CLI, but no joy when executed by Nagios.

CLI:
XI Backup braude-rhn01 OK - Most recent timestamp is 12 hours 38 minutes old, 31 *.tar.gz files found | '*.tar.gz'=31;4:40;1:45 age_oldest=2641091s;82800;828801 age_newest=45491s

Instead of an execution error I now get a false CRITICAL.
Nagios:
XI Backup braude-rhn01 CRITICAL - *.tar.gz is 0 (less than 1) | '*.tar.gz'=0;4:40;1:45 age_oldest=0s;82800;828801 age_newest=0s

It looks like the directory listing is either not read when run through Nagios, or incorrectly parsed. Unfortunately my Perl knowledge is severely limited!!

Does anyone know of another plugin that can be used to monitor file age and number of files?
I still prefer this one, but if there is no solution I will have to think about alternative ways to monitor my rsync jobs.

Re: check_files.pl failing after XI update from 2012 to 2014

Posted: Tue Oct 28, 2014 10:21 am
by sreinhardt
Which user are you when executing the CLI checks? I see we don't have the full prompt displayed, and one common issue with ssh checks is that the SSH key is set up under the root user instead of the nagios user. Try the following:

Code:

su nagios -s /bin/bash
/usr/local/nagios/libexec/check_files.pl -L "XI Backup braude-rhn01" -H 192.168.254.103 -D /data/braude-rhn01 -F \*.tar.gz -w 4:40 -c 1:45 --age 82800,828801 -f
exit

Re: check_files.pl failing after XI update from 2012 to 2014

Posted: Wed Oct 29, 2014 3:35 am
by detronict
The checks are performed from Nagios, so they should be executed as user nagios.

Execution from CLI as root:

[root@detvee-rhn00 ~]# /usr/local/nagios/libexec/check_files.pl -L "XI Backup braude-rhn01" -H 192.168.254.103 -D /data/braude-rhn01 -F \*.tar.gz -w 4:40 -c 1:45 --age 82800,828801 -f
XI Backup braude-rhn01 OK - Most recent timestamp is 10 hours 40 minutes old, 31 *.tar.gz files found | '*.tar.gz'=31;4:40;1:45 age_oldest=2634014s;82800;828801 age_newest=38414s
[root@detvee-rhn00 ~]#


Execution from CLI as user nagios:

[root@detvee-rhn00 ~]# su nagios -s /bin/bash
[nagios@detvee-rhn00 root]$ /usr/local/nagios/libexec/check_files.pl -L "XI Backup braude-rhn01" -H 192.168.254.103 -D /data/braude-rhn01 -F \*.tar.gz -w 4:40 -c 1:45 --age 82800,828801 -f
XI Backup braude-rhn01 OK - Most recent timestamp is 10 hours 41 minutes old, 31 *.tar.gz files found | '*.tar.gz'=31;4:40;1:45 age_oldest=2634079s;82800;828801 age_newest=38479s
[nagios@detvee-rhn00 root]$ exit
exit


I've also tried the ssh portion as user nagios (not all output lines are pasted in this example):

[root@detvee-rhn00 ~]# su nagios -s /bin/bash
[nagios@detvee-rhn00 root]$ ssh -o BatchMode=yes -o ConnectTimeout=30 192.168.254.103 "cd /data/braude-rhn01 && LANG=C ls -l"
total 1980488
-rw-r--r-- 1 root root 62798523 Sep 28 22:00 1411934402.tar.gz
...
-rw-r--r-- 1 root root 67059687 Oct 28 22:00 1414530002.tar.gz
[nagios@detvee-rhn00 root]$ exit
exit


So it really looks like the problem is with nagios executing the call or parsing the output!!

I just noticed that nagios has added a \ in the command. Probably as an escape character for the asterisk.
Could this be the problem?

Re: check_files.pl failing after XI update from 2012 to 2014

Posted: Wed Oct 29, 2014 1:53 pm
by lmiltchev
detronict wrote:
I just noticed that nagios has added a \ in the command. Probably as an escape character for the asterisk.
Could this be the problem?

Yes, it does seem like an escaping issue. When you go to:

Home->Service Detail-><your service>->Configure->Re-configure this service

do you see the backslash in front of the asterisk in the command?

Can you post the service and the command definitions?

Re: check_files.pl failing after XI update from 2012 to 2014

Posted: Wed Oct 29, 2014 1:58 pm
by sreinhardt
The only time that extra slash should be added is in the Test Command window, which, due to the special characters here, will likely not work. I was under the impression you were testing by letting Nagios run the check and giving us results from that, not using the Test Command button. There are two key differences: Test Command runs as apache, and it does far more escaping than Core itself does, as it executes from PHP rather than from Core. Both could definitely cause issues.

With that, I see that in one of your later posts, after patching the script, you appear to be giving direct Nagios output, and it seems to be getting much further. Let's try adding a -v flag to the check command. Once the command has been changed, force an immediate check. You will need to look at the service detail page under Advanced and find the long output. Because it will almost certainly be multi-line output, hopefully it is formatted so that it shows up in the long output. If you could send a screenshot of that, please.

Re: check_files.pl failing after XI update from 2012 to 2014

Posted: Thu Oct 30, 2014 2:54 am
by detronict
@lmiltchev

These are the command and service definitions.

define command {
    command_name    check_xi_backup_age
    command_line    $USER1$/check_files.pl -L "XI Backup $ARG1$" -H 192.168.254.103 -D /data/$ARG1$ -F $ARG2$ -w $ARG3$ -c $ARG4$ --age $ARG5$ -f
}


define service {
    host_name               detvee-monrsync-cluster
    service_description     XI Backups Brakel
    use                     xiwizard_nrpe_service
    check_command           check_xi_backup_age!braude-rhn01!*.tar.gz!4:40!1:45!82800,828801!!!
    max_check_attempts      5
    check_interval          30
    retry_interval          1
    check_period            24x7
    notification_interval   60
    notification_period     standbyuren
    contacts                Detron_NagiosAdmin_Mail,nagiosadmin
    contact_groups          admins
    icon_image              backup.png
    _xiwizard               linux-server
    register                1
}

Re: check_files.pl failing after XI update from 2012 to 2014

Posted: Thu Oct 30, 2014 7:02 am
by detronict
@sreinhardt

I've added the verbose option and got too much output.
I looked in the script and commented out one output line.

Now I can see that the ssh part is not the problem. The ssh call returns all the (remote) files that are present in the requested directory.
The strange thing is that all the lines seem to be processed, but at the end a totally different (local) file is evaluated!

This is the output:

check_files.pl plugin version 0.416
Alarm at 30
Command: ssh -o BatchMode=yes -o ConnectTimeout=30 192.168.254.103 "cd /data/braude-rhn01 && LANG=C ls -l
Executing ssh -o BatchMode=yes -o ConnectTimeout=30 192.168.254.103 "cd /data/braude-rhn01 && LANG=C ls -l" 2>&1
got line: total 586848
got line: -rw-r--r-- 1 root root 66339052 Oct 21 22:00 1413921602.tar.gz
got line: -rw-r--r-- 1 root root 66476412 Oct 22 22:00 1414008002.tar.gz
got line: -rw-r--r-- 1 root root 66679796 Oct 23 22:00 1414094402.tar.gz
got line: -rw-r--r-- 1 root root 66835642 Oct 24 22:00 1414180801.tar.gz
got line: -rw-r--r-- 1 root root 66753205 Oct 25 22:00 1414267202.tar.gz
got line: -rw-r--r-- 1 root root 66703785 Oct 26 22:00 1414357201.tar.gz
got line: -rw-r--r-- 1 root root 66862997 Oct 27 22:00 1414443602.tar.gz
got line: -rw-r--r-- 1 root root 67059687 Oct 28 22:00 1414530002.tar.gz
got line: -rw-r--r-- 1 root root 67202066 Oct 29 22:00 1414616402.tar.gz
Date 1414670060 Oldest_filetime: Newest_filetime:
Oldest file has age of seconds and newest seconds
Largest file has size of octet and smallest octet
XI Backup braude-rhn01 CRITICAL - wmic_1.3.13_static_64bit.tar.gz is 0 (less than 1)
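
One possible reading of this output, not confirmed anywhere in the thread: the pattern *.tar.gz reaches a shell unquoted, and the shell expands it against whatever directory the check happens to run in before check_files.pl ever sees it. A minimal sketch of that shell behaviour, using a temporary directory and a stand-in file; show_arg and demo_dir are made-up names for illustration:

```shell
# Hypothetical helper: print the first argument exactly as a called program receives it.
show_arg() {
    printf '%s\n' "$1"
}

demo_dir=$(mktemp -d)
touch "$demo_dir/wmic_1.3.13_static_64bit.tar.gz"   # stand-in for a local file

unquoted=$(cd "$demo_dir" && show_arg *.tar.gz)      # the shell expands the pattern first
quoted=$(cd "$demo_dir" && show_arg '*.tar.gz')      # the pattern is passed through literally

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
rm -rf "$demo_dir"
```

If that is what happens here, the plugin would be filtering the remote listing for the local file name instead of the pattern, which would explain a count of 0. Quoting the macro in the command definition (for example -F '$ARG2$') might be worth testing, on the assumption that Nagios substitutes the macro before the command line reaches a shell.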

Re: check_files.pl failing after XI update from 2012 to 2014

Posted: Thu Oct 30, 2014 2:40 pm
by lmiltchev

Code:

command_line $USER1$/check_files.pl -L "XI Backup $ARG1$" -H 192.168.254.103 -D /data/$ARG1$ -F $ARG2$ -w $ARG3$ -c $ARG4$ --age $ARG5$ -f
You have $ARG1$ listed twice in the command definition. Can you try removing one and see if this fixes the issue? You can try:

Code:

command_line $USER1$/check_files.pl -L "XI Backup" -H 192.168.254.103 -D /data/$ARG1$ -F $ARG2$ -w $ARG3$ -c $ARG4$ --age $ARG5$ -f