
Re: NRPE Socket timeout after 10 seconds

Posted: Fri Feb 17, 2017 1:22 pm
by rkennedy
I would verify all of your definitions, because it's pretty apparent that the -t added at the object level isn't being respected. Adding -t 60 will fix it, but it needs to be in the proper place.

Re: NRPE Socket timeout after 10 seconds

Posted: Fri Feb 17, 2017 7:52 pm
by kwhogster
I only have them in the commands file.

Where else should I look?

Re: NRPE Socket timeout after 10 seconds

Posted: Mon Feb 20, 2017 12:56 pm
by tgriep
You can add the -t timeout to the check_nrpe command definition in the commands.cfg file, so that all of the checks that use the check_nrpe command will have their timeout increased.
But, if you are still having problems, please post how the check_nrpe command is defined in the commands.cfg file as well as the service check and then we can go from there.

Re: NRPE Socket timeout after 10 seconds

Posted: Mon Feb 20, 2017 9:01 pm
by kwhogster
I do have the -t on the commands

Code:

define command{
        command_name    check_nrpe
        command_line    /usr/lib/nagios/plugins/check_nrpe -H $HOSTADDRESS$ -t 60 -c $ARG1$ $ARG2$ $ARG3$ $ARG4$
}

define command{
        command_name    check_nrpe_test
        command_line    /usr/lib/nagios/plugins/check_nrpe -H $HOSTADDRESS$ -t 60 -c $ARG1$ $ARG2$ $ARG3$ $ARG4$ > /tmp/yourlog.txt
}

define command{
        command_name    check_mem
        command_line    $USER1$/check_mem.sh
}

define command{
        command_name    check_users
        command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -t 60 -c $ARG1$ -a $ARG2$ $ARG3$
}
define command{
        command_name    check_windows_users
        command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -t 60 -c check_users -a 2 3 "$_HOSTALLOWEDUSERS$"
}
define command{
        command_name    check_ms_win_updates
        command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -t 60 -c check_ms_win_updates -a '-wd 15 -cd 30 -M PSWindowsUpdate'
}
define command{
        command_name    check_uptime
        command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -t 60 -c CheckUpTime -a MaxCrit=90d
}

define command{
        command_name    cpu_load
        command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -t 60 -c CheckCPU -a warn=80 crit=90 time=1m time=5m time=15m
}

define command{
        command_name    mem_check
        command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -t 60 -c CheckMEM -a warn=80 crit=90 time=1m time=5m time=15m
}

define command{
        command_name    service_check
        command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -t 60 -c CheckServiceState -a $ARG1$

}


Sample of my services

Code:

define service {
        host_name                       TGCS008
        service_description             Check Disk Usage J:
        check_command                   check_nrpe!CheckDriveSize! -a ShowAll=long MinWarn=20% MinCrit=10% Drive=J: perf-unit=G
        check_interval                  1
        servicegroups                   DriveSpace
        use                             generic-service
}
define service {
        host_name                       TGCS009
        service_description             Check Disk Usage
        check_command                   check_nrpe!CheckDriveSize! -a ShowAll=long MinWarn=20% MinCrit=10% Drive=C: perf-unit=G
        check_interval                  1
        servicegroups                   DriveSpace
        use                             generic-service
}

define service {
        host_name                       TGCS001
        service_description             Check OS Version
        check_command                   check_nrpe!CheckWMI! -a "Query=Select Version,Caption from win32_OperatingSystem" columnSyntax="%value%" columnSeparator=", " ignore-perf-data
        servicegroups                   OSVersion
        check_interval                  1
        use                             generic-service
}
define service {
        host_name                       TGCS002
        service_description             Check OS Version
        check_command                   check_nrpe!Check_OS_Version! -a "perf-config=*(ignored:true)"
        servicegroups                   OSVersion
        check_interval                  1
        use                             generic-service
}

Sometimes I see 60 seconds, but for the most part I see 10-second timeouts.

This happens on all my VMs, not on the physical hosts. The physical servers and computers are fine; it is just the VMs.

I am running VMware ESXi 6.0 hosts.

Thanks

Tom

Re: NRPE Socket timeout after 10 seconds

Posted: Mon Feb 20, 2017 9:18 pm
by kwhogster
Guys, an update.

I get this a lot too:



Check Application Event Logs UNKNOWN 02-20-2017 21:13:52 0d 0h 2m 14s 1/3 CHECK_NRPE: Received 0 bytes from daemon. Check the remote server logs for error messages.


Again, only on my VMs.

Thoughts?

Re: NRPE Socket timeout after 10 seconds

Posted: Tue Feb 21, 2017 1:04 pm
by tgriep
The timeout settings for your commands look correct, so they should time out at 60 seconds. I don't know where the 10-second value is coming from unless one of the commands was missed.
You may want to check the nagios.cfg file and confirm that the default service timeout is set higher than 10 seconds.
The directive to look for is service_check_timeout.
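
As a rough illustration, the entry in nagios.cfg would look something like the fragment below. The value of 60 is an assumption chosen to match the -t 60 used in the command definitions; use whatever fits your environment.

```
# nagios.cfg (main configuration file)
# Maximum number of seconds Nagios lets a service check run
# before it kills the check. 60 is an assumed value, not taken
# from the poster's configuration.
service_check_timeout=60
```

Nagios has to be restarted after this file is changed for the new value to take effect.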

Take a look at the following document for NRPE Troubleshooting issues.
https://assets.nagios.com/downloads/nag ... utions.pdf

The "Received 0 bytes from daemon" message usually means that the agent service was not running on the remote server.
For Windows that means the NSClient agent was not running, and for Linux, the NRPE agent was not running.
Take a look at the document for more details / causes.

Re: NRPE Socket timeout after 10 seconds

Posted: Tue Feb 21, 2017 9:13 pm
by kwhogster
Thanks for the doc, I will review it later.


One other thought: since this is only happening on my VMs, I checked further, and it seems to happen during the backup window.

I am using Veeam B&R to back up my VMs.

I was thinking of changing the monitoring times on these machines to exclude the backup window. Also, not all VMs have this issue.


I have tried this on two services, but they still alert me.

Any suggestions would be helpful.


Thanks

Tom

Re: NRPE Socket timeout after 10 seconds

Posted: Wed Feb 22, 2017 10:56 am
by tgriep
Setting up an exclude window would be a good solution for this.
To do this, create a time period like the example below. It assumes your backup runs between 1am and 3am, so adjust it to your needs.

Code:

define timeperiod {
        timeperiod_name                         backup_time
        alias                                   backup_time
        sunday                                  00:00-01:00,03:00-24:00
        monday                                  00:00-01:00,03:00-24:00
        tuesday                                 00:00-01:00,03:00-24:00
        wednesday                               00:00-01:00,03:00-24:00
        thursday                                00:00-01:00,03:00-24:00
        friday                                  00:00-01:00,03:00-24:00
        saturday                                00:00-01:00,03:00-24:00
        }
Then, in your service check, you would define the following:

Code:

check_period			backup_time
notification_period		backup_time
That would stop both the check and its notifications from running between 1am and 3am.
Alternatively, you could leave check_period set to 24x7 so the check still runs but does not send notifications during that time.
Either way should work for you.
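
To make the second option concrete, here is a sketch of one of the disk checks posted earlier in this thread with the two directives added. It assumes a time period named 24x7 already exists (the sample timeperiods.cfg shipped with Nagios defines one) and that backup_time is defined as shown above; the rest is copied from the TGCS008 service.

```
define service {
        host_name                       TGCS008
        service_description             Check Disk Usage J:
        check_command                   check_nrpe!CheckDriveSize! -a ShowAll=long MinWarn=20% MinCrit=10% Drive=J: perf-unit=G
        check_interval                  1
        check_period                    24x7            ; assumed period name: keep checking around the clock
        notification_period             backup_time     ; no notifications between 1am and 3am
        servicegroups                   DriveSpace
        use                             generic-service
}
```

With this variant the service can still show a non-OK state in the web interface while the backup runs; only the notifications are held back. Setting check_period to backup_time as well avoids that, at the cost of not checking at all during the window.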

Re: NRPE Socket timeout after 10 seconds

Posted: Wed Feb 22, 2017 8:51 pm
by kwhogster
Thanks

I will apply that to all my VMs and give it a few days.

Will report back with results


Tom

Re: NRPE Socket timeout after 10 seconds

Posted: Thu Feb 23, 2017 10:08 am
by tgriep
OK, let us know how it works out.