Nagios and random snmp errors Description/Type table

Post by **sasivarenan** » Wed Jul 22, 2015 2:28 pm

Team Greetings,

I'm running a Nagios instance on an AWS VPC network. I've opened all ports within the ACL and I can see the traffic, but I can't pin down the issue and we are firefighting.

The issue happens once a day, at random times.

Please find the configuration details below:

### Process - Storage

define command {
command_name check_snmp_storage
command_line $USER1$/check_snmp_storage.pl -H $HOSTADDRESS$ -C HelloIMin-2 $USER8$ -m $ARG1$ -w $ARG2$ -c $ARG3$ $ARG4$ $ARG5$
}

define service{
use generic-service,graphed-service
host_name STDB02
service_description BACKUP12
check_command check_snmp_storage!"^/backup"!85!90!
}

Error Logs

[1431658849] SERVICE ALERT: STDB02;DATA1;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658852] SERVICE ALERT: STDB02;BACKUP;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658852] SERVICE ALERT: STDB02;CTLRD1;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658852] SERVICE ALERT: STDB02;CTLRD2;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658852] SERVICE ALERT: STDB02;U01 Mount;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658853] SERVICE ALERT: STDB02;DATA2;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658856] SERVICE ALERT: STDB02;DATA3;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658856] SERVICE ALERT: STDB02;CTLRD3;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658882] SERVICE ALERT: STDB02;DATA4;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658882] SERVICE ALERT: STDB02;DATA6;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658882] SERVICE ALERT: STDB02;Root Partition;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658882] SERVICE ALERT: STDB02;DATA5;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658909] SERVICE ALERT: STDB02;DATA1;UNKNOWN;SOFT;2;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658911] SERVICE ALERT: STDB02;BACKUP;UNKNOWN;SOFT;2;ERROR: Description/Type table : No response from remote host "10.0.0.151".
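The bracketed numbers at the start of each log line are Unix epoch timestamps, which makes it hard to see at a glance when the alerts fire. A minimal sketch, assuming the log lines keep the `[epoch] message` shape shown above (the `humanize` helper is a hypothetical name, not part of Nagios):

```python
import re
from datetime import datetime, timezone

# Matches a leading "[<epoch seconds>] " prefix followed by the rest of the line.
LOG_LINE = re.compile(r"^\[(\d+)\] (.*)$")


def humanize(line):
    """Return the log line with its epoch prefix rewritten as a UTC timestamp."""
    match = LOG_LINE.match(line)
    if match is None:
        return line  # leave lines without an epoch prefix untouched
    stamp = datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)
    return f"[{stamp:%Y-%m-%d %H:%M:%S} UTC] {match.group(2)}"


if __name__ == "__main__":
    sample = "[1431658849] SERVICE ALERT: STDB02;DATA1;UNKNOWN;SOFT;1;ERROR: Description/Type table"
    print(humanize(sample))
```

Run against the first alert above, the prefix becomes `[2015-05-15 03:00:49 UTC]`, so this batch of alerts started just after 03:00 UTC.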

Could you please suggest how to take this forward?

I look forward to hearing from you.

Many Thanks,
Sasi

Post by **lmiltchev** » Wed Jul 22, 2015 3:04 pm

I believe you are missing a space here:

command_line $USER1$/check_snmp_storage.pl -H $HOSTADDRESS$ -C HelloIMin-2 $USER8$ -m $ARG1$ -w $ARG2$ -c $ARG3$ $ARG4$ $ARG5$

Try changing the command to:

command_line $USER1$/check_snmp_storage.pl -H $HOSTADDRESS$ -C HelloIMin -2 $USER8$ -m $ARG1$ -w $ARG2$ -c $ARG3$ $ARG4$ $ARG5$

I am not sure what "$USER8$" is...

Try testing your check from the CLI:

./check_snmp_storage.pl -H 10.0.0.151 -C HelloIMin -2 -m "^/backup" -w 85 -c 90
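When the failure is intermittent, it helps to record how long the manual check takes and whether it exits cleanly. The following is only a sketch: `timed_run` is a hypothetical helper, and the plugin path and arguments simply mirror the CLI test above.

```python
import subprocess
import time


def timed_run(argv, timeout_s):
    """Run argv and return (exit_code, seconds, output).

    exit_code is None if the command timed out or could not be started.
    """
    start = time.monotonic()
    try:
        done = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
        code, output = done.returncode, done.stdout.strip()
    except subprocess.TimeoutExpired:
        code, output = None, "timed out"
    except OSError as exc:
        # Raised, for example, when the plugin is not in the current directory.
        code, output = None, f"could not start: {exc}"
    return code, time.monotonic() - start, output


if __name__ == "__main__":
    # Same arguments as the CLI test; run this from the plugin directory.
    check = ["./check_snmp_storage.pl", "-H", "10.0.0.151", "-C", "HelloIMin",
             "-2", "-m", "^/backup", "-w", "85", "-c", "90"]
    print(timed_run(check, timeout_s=60))
```

Running it once during a quiet period and once around the failure window gives two durations to compare.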

Post by **sasivarenan** » Sat Jul 25, 2015 8:35 am

lmiltchev wrote: I believe you are missing a space here:
command_line $USER1$/check_snmp_storage.pl -H $HOSTADDRESS$ -C HelloIMin-2 $USER8$ -m $ARG1$ -w $ARG2$ -c $ARG3$ $ARG4$ $ARG5$
Try changing the command to:
command_line $USER1$/check_snmp_storage.pl -H $HOSTADDRESS$ -C HelloIMin -2 $USER8$ -m $ARG1$ -w $ARG2$ -c $ARG3$ $ARG4$ $ARG5$
I am not sure what "$USER8$" is...

Try testing your check from the CLI:
./check_snmp_storage.pl -H 10.0.0.151 -C HelloIMin -2 -m "^/backup" -w 85 -c 90

Thanks for your reply,

Sorry, I didn't notice the missing space while posting; my original cfg does have the space.

The main problem is that we get the error only at 3:00 UTC; it recovers after 5 minutes and everything returns to normal.

My logs are also missing after archiving.

Post by **tgriep** » Mon Jul 27, 2015 1:15 pm

So you are getting the errors only at 3am, once a day?

Are backups running on that server at that time, causing the check command to time out?

Try adding a 60 second timeout to your command. Edit your command and add

-t 60

to it and see if that resolves it for you.

Post by **Kriyeshh** » Thu Jul 30, 2015 5:00 pm

Yeah, tgriep, you are right.
I suspect the same: are backups running on that server at that time, causing the check command to time out?
Good catch!

If tgriep's view is correct, the likely causes are:
# swap/memory overload at that particular time
# process priority allocation
# limits on the number of open files

I recommend capturing the system load around the suspicious time you mentioned, which will help you debug further. (Use ps, top, or any other command you are familiar with, or set up a cron job [be aware of system storage size].)
Also capture the swap usage at that time.

Hold on!!!
It would be worthwhile to get these details from both servers: the Nagios server and the monitored server.
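One way to capture load and swap usage around the failure window is a small script that logs one line per run. This is a sketch that assumes a Linux host exposing `/proc/loadavg` and `/proc/meminfo`; the function names are hypothetical.

```python
from datetime import datetime, timezone


def parse_loadavg(text):
    """Return the 1-, 5- and 15-minute load averages from /proc/loadavg content."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def parse_swap_used_kb(meminfo):
    """Return used swap in kB (SwapTotal - SwapFree) from /proc/meminfo content."""
    fields = {}
    for line in meminfo.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])
    return fields["SwapTotal"] - fields["SwapFree"]


def snapshot():
    """Read both /proc files (Linux only) and format one log line."""
    with open("/proc/loadavg") as fh:
        load = parse_loadavg(fh.read())
    with open("/proc/meminfo") as fh:
        swap = parse_swap_used_kb(fh.read())
    now = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    return f"{now} UTC load={load[0]:.2f}/{load[1]:.2f}/{load[2]:.2f} swap_used_kb={swap}"


if __name__ == "__main__":
    try:
        print(snapshot())
    except OSError as exc:
        print(f"cannot read /proc (not a Linux host?): {exc}")
```

Scheduled from cron once a minute for a few minutes either side of 3:00 UTC on both servers, with output appended to a file, this keeps the log small enough to address the storage concern mentioned above.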

-Wishes
Kriyeshh

Post by **tmcdonald** » Fri Jul 31, 2015 1:48 pm

@sasivarenan, let us know how tgriep and Kriyeshh's comments work out for you!

Post by **sasivarenan** » Sun Aug 02, 2015 9:14 am

tgriep wrote: So you are getting the errors only at 3am, once a day?

Are backups running on that server at that time, causing the check command to time out?

Try adding a 60 second timeout to your command. Edit your command and add
-t 60
to it and see if that resolves it for you.

Yes, tgriep, some small backups are running at that time and the check command is also timing out, but there is no sign of errors on that server. We tried increasing the timeout to 60 seconds, but it didn't help. We are looking into it in detail; I will keep you posted once done.

Sure @tmcdonald....

Thanks guys.

Post by **tgriep** » Mon Aug 03, 2015 1:57 pm

In your nagios.cfg file, is the service check timeout set to 60, as below?

service_check_timeout=60

If not, you should change that setting and restart nagios to see if that helps.
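The plugin's own `-t` value only helps if it does not exceed `service_check_timeout`, since Nagios ends any service check that runs longer than that setting. A rough sketch for comparing the two, where `read_directive` and `plugin_timeout_fits` are hypothetical helpers and taking the last occurrence of a repeated directive is an assumption about precedence:

```python
def read_directive(cfg_text, name):
    """Return the value assigned to `name` in nagios.cfg-style text, or None.

    If the directive appears more than once, the last one is returned
    (an assumption, not confirmed Nagios behaviour).
    """
    value = None
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, val = line.partition("=")
        if sep and key.strip() == name:
            value = val.strip()
    return value


def plugin_timeout_fits(cfg_text, plugin_timeout_s):
    """True if the plugin -t value does not exceed service_check_timeout.

    Returns None when the directive is absent from the given text.
    """
    configured = read_directive(cfg_text, "service_check_timeout")
    if configured is None:
        return None
    return plugin_timeout_s <= int(configured)


if __name__ == "__main__":
    # Inline sample; in practice, pass the contents of your own nagios.cfg.
    sample_cfg = "# timeouts\nservice_check_timeout=60\n"
    print(plugin_timeout_fits(sample_cfg, 60))
```

Passing the real file contents instead of `sample_cfg` shows whether a given `-t` value can take effect before Nagios ends the check.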

Post by **sasivarenan** » Mon Aug 03, 2015 3:13 pm

Below is the current value for the service check timeout:

service_check_timeout=120

Post by **tgriep** » Mon Aug 03, 2015 4:07 pm

Try changing the timeout setting for the check_snmp_storage check to 120 and see if that helps.

-t 120

Nagios Support Forum
