Timeout issue

cg28oh · Post by **cg28oh** » Fri Sep 12, 2014 8:20 pm

Any -t value greater than 4 produces the plugin timeout. I even catch glimpses of the timeout with -t 4.

./check_snmp -H 10.0.0.1 -C XXXX -o sysUpTime.0 -e 2 -t 30
CRITICAL - Plugin timed out while executing system call

abrist · Post by **abrist** » Mon Sep 15, 2014 3:45 pm

I just retested the timeout_state branch. It works fine for me, though an improperly specified community string will cause the error, as will an invalid SNMP protocol version:

[root@localhost nagios-plugins]# ./plugins/check_snmp -H <ip> -C <wrong community> -o ifInUnknownProtos.1 -e 2 -t 30
CRITICAL - Plugin timed out while executing system call
[root@localhost nagios-plugins]# ./plugins/check_snmp -H <ip> -C <proper community> -o ifInUnknownProtos.1 -e 2 -t 30  -P2c
CRITICAL - Plugin timed out while executing system call
[root@localhost nagios-plugins]# ./plugins/check_snmp -H <ip> -C <proper community> -o ifInUnknownProtos.1 -e 2 -t 30 -P1
SNMP OK - 0 | IF-MIB::ifInUnknownProtos.1=0c

Could you run your check again with the verbose flag (-vvv) and post the output?

./check_snmp -H 10.0.0.1 -C XXXX -o sysUpTime.0 -e 2 -t 30 -vvv

cg28oh · Post by **cg28oh** » Mon Sep 15, 2014 10:53 pm

I've verified that the community and protocol version are correct. The same command against faster-responding sites shows no error. Here are two commands, one with a 3-second timeout and one with a 5-second timeout. The plugin timeout message only appears with -t 5.

./check_snmp -H 10.0.0.1 -C XXXX -o sysUpTime.0 -e 3 -t 3 -vvv
/usr/bin/snmpget -Le -t 3 -r 3 -m ALL -v 1 [authpriv] 10.0.0.1:161 sysUpTime.0
External command error: Timeout: No Response from 10.0.0.1:161.

 ./check_snmp -H 10.0.0.1 -C XXXX -o sysUpTime.0 -e 3 -t 5 -vvv
/usr/bin/snmpget -Le -t 5 -r 3 -m ALL -v 1 [authpriv] 10.0.0.1:161 sysUpTime.0
CRITICAL - Plugin timed out while executing system call

If you run the command against an IP that isn't alive with -t 3, -t 10, or -t 30, it may produce the same result I see. What I'm trying to achieve is the same "No Response" message with -t 10 as with -t 3.

Command with a responding host

./check_snmp -H 10.0.0.2 -C XX -o sysUpTime.0 -e 3 -t 10 -P1 -vvv
/usr/bin/snmpget -Le -t 10 -r 3 -m ALL -v 1 [authpriv] 10.0.0.2:161 sysUpTime.0
DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (9949853) 1 day, 3:38:18.53
Processing oid 1 (line 1)
  oidname: DISMAN-EVENT-MIB::sysUpTimeInstance
  response: Timeticks: (9949853) 1 day, 3:38:18.53
SNMP OK - Timeticks: (9949853) 1 day, 3:38:18.53 |

abrist · Post by **abrist** » Tue Sep 16, 2014 5:01 pm

You are seeing two different timeout errors. One is the generic plugin timeout, and the other is the runcmd timeout. If your retries and timeout are really close, you may see this behavior. I will look into creating a bit more room for the external command to complete.
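To make the race between the two timers concrete, here is a minimal sketch in C. It assumes the external command makes one initial attempt plus `retries` retries and waits `timeout` seconds on each; that reading of `-t` and `-r` is an assumption, not something confirmed in this thread. The function names and the `alarm_seconds` parameter are hypothetical, and the plugin's real alarm formula is deliberately left as an input rather than guessed.

```c
#include <assert.h>

/* Which timer wins when a wrapper's alarm races an external command. */
enum timeout_winner {
    COMMAND_GIVES_UP_FIRST, /* the command's own "No Response" error is reported */
    PLUGIN_ALARM_FIRST,     /* the generic plugin timeout message is reported */
    TOO_CLOSE_TO_CALL       /* both expire in the same second; either may appear */
};

/* ASSUMPTION: one initial attempt plus `retries` retries, each waiting
 * `timeout` seconds, so the command gives up after this many seconds. */
int command_worst_case_seconds(int timeout, int retries)
{
    assert(timeout > 0 && retries >= 0);
    return timeout * (retries + 1);
}

/* `alarm_seconds` is whatever value the wrapper hands to alarm(). It is a
 * parameter because the plugin's actual formula is not shown in the thread. */
enum timeout_winner predict_winner(int timeout, int retries, int alarm_seconds)
{
    int worst_case = command_worst_case_seconds(timeout, retries);

    if (alarm_seconds < worst_case)
        return PLUGIN_ALARM_FIRST;
    if (alarm_seconds > worst_case)
        return COMMAND_GIVES_UP_FIRST;
    return TOO_CLOSE_TO_CALL;
}
```

Under that assumption, the `-t 5 -r 3` invocation shown earlier gives `command_worst_case_seconds(5, 3) == 20`, so any alarm shorter than 20 seconds would cut the command off before it could print its own error.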

cg28oh · Post by **cg28oh** » Thu Nov 06, 2014 12:26 pm

Okay, I've still been troubleshooting this (when time permits). I went back to plugins version 1.4.16 and Nagios v3.5.1. This plugin version does *NOT* produce the system call timeout message with high timeout values. Plugins version 1.5 does. So it looks like something broke between 1.4.16 and 1.5.

[root@nagi01 plugins]# pwd
/root/nagios-plugins-1.4.16/plugins
[root@nagi01 plugins]# ./check_snmp -e 2 -t 10 xx.xx.xx.xx -C X----X -o sysUpTime.0
External command error: Timeout: No Response from xx.xx.xx.xx:161.

[root@nagi01 plugins]# pwd
/root/nagios-plugins-1.5/plugins
[root@nagi01 plugins]# ./check_snmp -e 2 -t 10 xx.xx.xx.xx -C X----X -o sysUpTime.0
CRITICAL - Plugin timed out while executing system call

abrist · Post by **abrist** » Thu Nov 06, 2014 3:12 pm

Can you try adding an additional second to the alarm() in the plugin from the timeout_state branch?
Edit:

plugins/check_snmp.c

Change line #344 from:

alarm(timeout_interval + 1);

To:

alarm(timeout_interval + 2);

And then recompile and test.
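For anyone wondering why padding the alarm() argument matters, the following is a rough, self-contained sketch of the pattern and not the plugin's actual code. The names `alarm_fires_first` and `on_alarm` are hypothetical, and nanosleep() stands in for the time spent waiting on the external command. Whichever timer is shorter decides which outcome the caller sees.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static volatile sig_atomic_t alarm_fired = 0;

/* Async-signal-safe handler: only records that SIGALRM arrived. */
static void on_alarm(int signo)
{
    (void)signo;
    alarm_fired = 1;
}

/* Returns 1 if the alarm interrupted the simulated work, 0 if the work
 * finished first, and -1 if the handler could not be installed.
 * HYPOTHETICAL helper: work_seconds stands in for the external command. */
int alarm_fires_first(unsigned int alarm_seconds, unsigned int work_seconds)
{
    struct sigaction sa, old;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, &old) != 0)
        return -1;

    alarm_fired = 0;
    alarm(alarm_seconds);            /* the wrapper's own deadline */

    struct timespec work = { .tv_sec = (time_t)work_seconds, .tv_nsec = 0 };
    /* Keep sleeping through unrelated signals; stop once SIGALRM arrives. */
    while (nanosleep(&work, &work) == -1 && errno == EINTR && !alarm_fired)
        ;

    alarm(0);                        /* cancel a still-pending alarm */
    sigaction(SIGALRM, &old, NULL);  /* restore the previous handler */
    return alarm_fired ? 1 : 0;
}
```

Calling `alarm_fires_first(1, 3)` reports that the alarm won, while `alarm_fires_first(3, 1)` reports that the work completed. When the two values are nearly equal the result is a coin toss, which is why a little extra headroom on the alarm can change the message that gets printed.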

cg28oh · Post by **cg28oh** » Thu Nov 06, 2014 5:41 pm

In which version? 2.0.3?

Line #344 in 2.0.3 is:

alarm(0);

I can't seem to locate

alarm(timeout_interval + 1);

in the file.

EDIT: Okay I looked back on the message board and figured it out.

cmerchant · Post by **cmerchant** » Thu Nov 06, 2014 5:53 pm

Thanks for the update. We'll leave this thread open for now.

phobbs · Post by **phobbs** » Fri Dec 12, 2014 9:10 pm

Were you able to find a solution to this yet?
I ran across this problem today when a network interruption caused ~250 hosts to become unavailable and around 1300 SNMP checks to go critical at the same time. Alert messages spammed the mail server, and the MySQL database filled up the partition and crashed Nagios. It was basically a huge mess that took me all day to clean up. I'd like to make sure this kind of thing won't become a common occurrence.

abrist · Post by **abrist** » Mon Dec 15, 2014 12:35 pm

phobbs wrote: I ran across this problem today when a network interruption caused ~250 hosts to become unavailable and around 1300 SNMP checks to go critical at the same time.

Could you let us know how this relates to a difference in status output text? It sounds like you just had a nasty network outage. The issues here with check_snmp relate to the text output when a plugin times out, but the state should stay the same.

Nagios Support Forum