Nagios and random snmp errors Description/Type table

Posted: Wed Jul 22, 2015 2:28 pm
by sasivarenan
Team Greetings,

I'm running a Nagios instance on an AWS VPC network. I've opened all the ports within the ACL and I can see the traffic, but I haven't been able to find the cause of the issue and we are firefighting.

The issue happens once a day, at random times.

Please find the configuration details below:

### Process - Storage

define command {
command_name check_snmp_storage
command_line $USER1$/check_snmp_storage.pl -H $HOSTADDRESS$ -C HelloIMin-2 $USER8$ -m $ARG1$ -w $ARG2$ -c $ARG3$ $ARG4$ $ARG5$
}

define service{
use generic-service,graphed-service
host_name STDB02
service_description BACKUP12
check_command check_snmp_storage!"^/backup"!85!90!
}

Error Logs

[1431658849] SERVICE ALERT: STDB02;DATA1;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658852] SERVICE ALERT: STDB02;BACKUP;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658852] SERVICE ALERT: STDB02;CTLRD1;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658852] SERVICE ALERT: STDB02;CTLRD2;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658852] SERVICE ALERT: STDB02;U01 Mount;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658853] SERVICE ALERT: STDB02;DATA2;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658856] SERVICE ALERT: STDB02;DATA3;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658856] SERVICE ALERT: STDB02;CTLRD3;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658882] SERVICE ALERT: STDB02;DATA4;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658882] SERVICE ALERT: STDB02;DATA6;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658882] SERVICE ALERT: STDB02;Root Partition;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658882] SERVICE ALERT: STDB02;DATA5;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658909] SERVICE ALERT: STDB02;DATA1;UNKNOWN;SOFT;2;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658911] SERVICE ALERT: STDB02;BACKUP;UNKNOWN;SOFT;2;ERROR: Description/Type table : No response from remote host "10.0.0.151".
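
The bracketed values in these log lines are Unix epoch seconds. As a minimal sketch (the function names `parse_alert` and `alerts_per_minute` are made up for illustration, and the regular expression assumes the exact alert format shown above), the timestamps can be converted to UTC and counted per minute to see whether the failures cluster at a particular time of day:

```python
import re
from collections import Counter
from datetime import datetime, timezone

# Assumed format: "[epoch] SERVICE ALERT: host;service;state;type;attempt;output"
ALERT_RE = re.compile(
    r"^\[(\d+)\] SERVICE ALERT: ([^;]+);([^;]+);([^;]+);([^;]+);(\d+);(.*)$"
)


def parse_alert(line):
    """Return (utc_datetime, host, service, state) for one alert line, or None."""
    match = ALERT_RE.match(line.strip())
    if match is None:
        return None
    epoch, host, service, state = match.group(1, 2, 3, 4)
    when = datetime.fromtimestamp(int(epoch), tz=timezone.utc)
    return when, host, service, state


def alerts_per_minute(lines):
    """Count alerts per UTC minute so that daily clusters stand out."""
    counts = Counter()
    for line in lines:
        parsed = parse_alert(line)
        if parsed is not None:
            counts[parsed[0].strftime("%Y-%m-%d %H:%M")] += 1
    return counts


sample = [
    '[1431658849] SERVICE ALERT: STDB02;DATA1;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".',
    '[1431658852] SERVICE ALERT: STDB02;BACKUP;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".',
    '[1431658909] SERVICE ALERT: STDB02;DATA1;UNKNOWN;SOFT;2;ERROR: Description/Type table : No response from remote host "10.0.0.151".',
]

for minute, count in sorted(alerts_per_minute(sample).items()):
    print(minute, count)
```

For the sample lines, the first timestamp (1431658849) decodes to 2015-05-15 03:00:49 UTC. Feeding a full day of nagios.log through the same function would show whether every burst lands in the same minute.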

Can you please suggest how to take this forward?

I look forward to hearing from you.

Many Thanks,
Sasi

Re: Nagios and random snmp errors Description/Type table

Posted: Wed Jul 22, 2015 3:04 pm
by lmiltchev
I believe you are missing a space here:

command_line $USER1$/check_snmp_storage.pl -H $HOSTADDRESS$ -C HelloIMin-2 $USER8$ -m $ARG1$ -w $ARG2$ -c $ARG3$ $ARG4$ $ARG5$
Try changing the command to:

command_line $USER1$/check_snmp_storage.pl -H $HOSTADDRESS$ -C HelloIMin -2 $USER8$ -m $ARG1$ -w $ARG2$ -c $ARG3$ $ARG4$ $ARG5$
I am not sure what "$USER8$" is...

Try testing your check from the CLI:

./check_snmp_storage.pl -H 10.0.0.151 -C HelloIMin -2 -m "^/backup" -w 85 -c 90

Re: Nagios and random snmp errors Description/Type table

Posted: Sat Jul 25, 2015 8:35 am
by sasivarenan
Thanks for your reply,

Sorry, I didn't notice the missing space while posting; my original cfg does have the space.

The main problem is that we get the error only at 3:00 UTC; it recovers after 5 minutes and everything returns to normal.

My logs are also missing after archiving.

Re: Nagios and random snmp errors Description/Type table

Posted: Mon Jul 27, 2015 1:15 pm
by tgriep
So you are getting the errors only at 3am, once a day?

Are backups being run on that server at that time, causing the check command to time out?

Try adding a 60 second timeout to your command. Edit your command and add

-t 60
to it and see if that resolves it for you.

Re: Nagios and random snmp errors Description/Type table

Posted: Thu Jul 30, 2015 5:00 pm
by Kriyeshh
Yeah tgriep, you are right.
I suspect the same: backups are probably running on that server at that time and the check command is timing out.
Good catch! :P

If tgriep's view is correct, the likely causes are:
# Swap/memory overload at that particular time
# Process priority allocation
# Limits on the number of open files

I recommend capturing the system load around the suspicious time you mentioned, which will help you debug further. Use ps, top, or any other command you are familiar with, or set up a cron job (keep an eye on disk space for the captured output).
Also capture the swap usage at that time.

One more thing!
It would be worthwhile to collect these details from both servers: the Nagios server and the monitored server.
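
As a rough sketch of that kind of capture (Linux only; the helper names `parse_meminfo` and `snapshot_line` are hypothetical, and reading `/proc/meminfo` is an assumption about the hosts involved), a small script like this could be run from cron every minute around the failure window on both servers, appending one line per run to a log file:

```python
import os
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def snapshot_line(loadavg, meminfo, now=None):
    """Format one log line: UTC timestamp, 1/5/15-minute load, swap used in kB."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(now))
    swap_used = meminfo.get("SwapTotal", 0) - meminfo.get("SwapFree", 0)
    load1, load5, load15 = loadavg
    return f"{stamp} load={load1:.2f},{load5:.2f},{load15:.2f} swap_used_kb={swap_used}"


if __name__ == "__main__":
    # Only attempt the live capture where /proc/meminfo exists (Linux).
    if os.path.exists("/proc/meminfo"):
        with open("/proc/meminfo") as handle:
            info = parse_meminfo(handle.read())
        print(snapshot_line(os.getloadavg(), info))
```

Because each run writes a single short line, the output stays small, which addresses the storage concern mentioned above. Comparing the lines from both machines at the failure time should show whether either one is under load or swapping.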

-Wishes
Kriyeshh

Re: Nagios and random snmp errors Description/Type table

Posted: Fri Jul 31, 2015 1:48 pm
by tmcdonald
@sasivarenan, let us know how tgriep and Kriyeshh's comments work out for you!

Re: Nagios and random snmp errors Description/Type table

Posted: Sun Aug 02, 2015 9:14 am
by sasivarenan
Yes, tgriep, some small backups are running at that time and the check command is timing out, but there is no sign of errors on that server. We tried increasing the timeout to 60 seconds, but it didn't help. We are investigating in detail; I will keep you posted once done.

Sure @tmcdonald....

Thanks guys.

Re: Nagios and random snmp errors Description/Type table

Posted: Mon Aug 03, 2015 1:57 pm
by tgriep
In your nagios.cfg file, is the service check timeout set to 60, like below?

service_check_timeout=60
If not, you should change that setting and restart nagios to see if that helps.

Re: Nagios and random snmp errors Description/Type table

Posted: Mon Aug 03, 2015 3:13 pm
by sasivarenan
Below is the current value for the service check timeout:

service_check_timeout=120

Re: Nagios and random snmp errors Description/Type table

Posted: Mon Aug 03, 2015 4:07 pm
by tgriep
Try changing the timeout setting for the check_snmp_storage check to 120 and see if that helps.

-t 120
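
The plugin's `-t` value and `service_check_timeout` in nagios.cfg interact: Nagios kills a service check that runs longer than `service_check_timeout`, so a plugin timeout at or above that value may never get the chance to report its own result. The sketch below is only an illustration of that comparison. The function names are hypothetical, the 5-second `margin` is an arbitrary assumption, and the fallback of 60 is assumed to match the Nagios default when the directive is absent:

```python
def read_service_check_timeout(cfg_text, default=60):
    """Return service_check_timeout from nagios.cfg text, or `default` if unset."""
    timeout = default
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep and key.strip() == "service_check_timeout":
            timeout = int(value.strip())
    return timeout


def plugin_timeout_fits(plugin_timeout, cfg_text, margin=5):
    """True if the plugin's -t value leaves `margin` seconds before Nagios kills the check."""
    return plugin_timeout + margin <= read_service_check_timeout(cfg_text)


cfg = "# sample nagios.cfg fragment\nservice_check_timeout=120\n"
print(plugin_timeout_fits(60, cfg))
print(plugin_timeout_fits(120, cfg))
```

With `service_check_timeout=120`, a plugin timeout of 60 leaves headroom while 120 leaves none under this sketch's margin. Whether that matters in practice depends on how the plugin handles its own timeout, so treat it as something to check rather than a diagnosis.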