Nagios and random snmp errors Description/Type table

sasivarenan
Posts: 14
Joined: Wed Mar 04, 2015 3:02 pm

Post by sasivarenan »

Greetings, team,

I'm running a Nagios instance inside an AWS VPC network. I've opened all ports in the ACL and I can see the traffic, but I haven't been able to pin down the issue and we are firefighting.

The issue happens once a day, at random times.

Please find the configuration details below:

### Process - Storage

define command {
command_name check_snmp_storage
command_line $USER1$/check_snmp_storage.pl -H $HOSTADDRESS$ -C HelloIMin-2 $USER8$ -m $ARG1$ -w $ARG2$ -c $ARG3$ $ARG4$ $ARG5$
}

define service{
use generic-service,graphed-service
host_name STDB02
service_description BACKUP12
check_command check_snmp_storage!"^/backup"!85!90!
}

Error Logs

[1431658849] SERVICE ALERT: STDB02;DATA1;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658852] SERVICE ALERT: STDB02;BACKUP;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658852] SERVICE ALERT: STDB02;CTLRD1;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658852] SERVICE ALERT: STDB02;CTLRD2;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658852] SERVICE ALERT: STDB02;U01 Mount;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658853] SERVICE ALERT: STDB02;DATA2;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658856] SERVICE ALERT: STDB02;DATA3;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658856] SERVICE ALERT: STDB02;CTLRD3;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658882] SERVICE ALERT: STDB02;DATA4;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658882] SERVICE ALERT: STDB02;DATA6;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658882] SERVICE ALERT: STDB02;Root Partition;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658882] SERVICE ALERT: STDB02;DATA5;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658909] SERVICE ALERT: STDB02;DATA1;UNKNOWN;SOFT;2;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658911] SERVICE ALERT: STDB02;BACKUP;UNKNOWN;SOFT;2;ERROR: Description/Type table : No response from remote host "10.0.0.151".
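
The bracketed field at the start of each log line is a Unix epoch timestamp, which makes the time of day hard to read at a glance. The following is a minimal sketch, not part of the original post, that converts it to UTC for two of the lines above; the names `ALERT_RE` and `parse_alert` are made up for illustration.

```python
import re
from datetime import datetime, timezone

# Matches the leading "[epoch]" field and the "host;service;state" fields
# of a Nagios SERVICE ALERT log line.
ALERT_RE = re.compile(r"^\[(\d+)\] SERVICE ALERT: ([^;]+);([^;]+);([^;]+);")

def parse_alert(line):
    """Return (utc_datetime, host, service, state), or None if no match."""
    m = ALERT_RE.match(line)
    if m is None:
        return None
    ts = datetime.fromtimestamp(int(m.group(1)), tz=timezone.utc)
    return ts, m.group(2), m.group(3), m.group(4)

# First and last lines of the log excerpt above, copied verbatim.
sample = [
    '[1431658849] SERVICE ALERT: STDB02;DATA1;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".',
    '[1431658911] SERVICE ALERT: STDB02;BACKUP;UNKNOWN;SOFT;2;ERROR: Description/Type table : No response from remote host "10.0.0.151".',
]

for line in sample:
    ts, host, service, state = parse_alert(line)
    print(ts.strftime("%Y-%m-%d %H:%M:%S UTC"), host, service, state)
# 2015-05-15 03:00:49 UTC STDB02 DATA1 UNKNOWN
# 2015-05-15 03:01:51 UTC STDB02 BACKUP UNKNOWN
```

Running the same conversion over several days of logs would show whether the failures cluster around one time of day.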

Could you please suggest how to take this forward?

I look forward to hearing from you.

Many Thanks,
Sasi
lmiltchev
Former Nagios Staff
Posts: 13587
Joined: Mon May 23, 2011 12:15 pm

Re: Nagios and random snmp errors Description/Type table

Post by lmiltchev »

I believe you are missing a space here:

command_line $USER1$/check_snmp_storage.pl -H $HOSTADDRESS$ -C HelloIMin-2 $USER8$ -m $ARG1$ -w $ARG2$ -c $ARG3$ $ARG4$ $ARG5$
Try changing the command to:

command_line $USER1$/check_snmp_storage.pl -H $HOSTADDRESS$ -C HelloIMin -2 $USER8$ -m $ARG1$ -w $ARG2$ -c $ARG3$ $ARG4$ $ARG5$
I am not sure what "$USER8$" is...

Try testing your check from the CLI:

./check_snmp_storage.pl -H 10.0.0.151 -C HelloIMin -2 -m "^/backup" -w 85 -c 90
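
Because the failure is intermittent, a single manual run may well succeed. As a rough sketch (not something posted in this thread), the check can be run several times in a row while recording the exit code and duration of each run. The helper names `run_check` and `repeat_check` are hypothetical, and the example command in the comment assumes `-2` selects SNMP v2c as in the suggested command_line.

```python
import subprocess
import sys
import time

def run_check(cmd, timeout=60):
    """Run one plugin invocation; return (exit_code, seconds, first_output_line)."""
    start = time.monotonic()
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
        code, out = proc.returncode, proc.stdout.strip()
    except subprocess.TimeoutExpired:
        # 3 is UNKNOWN in the Nagios plugin exit-code convention.
        code, out = 3, "timed out after %ss" % timeout
    elapsed = time.monotonic() - start
    first = out.splitlines()[0] if out else ""
    return code, elapsed, first

def repeat_check(cmd, runs=5, pause=10, timeout=60, out=sys.stdout):
    """Run the check several times and write one timing line per run."""
    for i in range(runs):
        code, elapsed, first = run_check(cmd, timeout)
        print("run %d: exit=%d time=%.2fs %s" % (i + 1, code, elapsed, first), file=out)
        if i < runs - 1:
            time.sleep(pause)

# Example call, using the values from this thread:
# repeat_check(["./check_snmp_storage.pl", "-H", "10.0.0.151", "-C", "HelloIMin",
#               "-2", "-m", "^/backup", "-w", "85", "-c", "90"])
```

Comparing the timings during a normal period with those from the failure window would show whether the SNMP agent is merely slow or not answering at all.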
sasivarenan
Posts: 14
Joined: Wed Mar 04, 2015 3:02 pm

Re: Nagios and random snmp errors Description/Type table

Post by sasivarenan »

Thanks for your reply.

Sorry, I didn't notice the missing space while posting; my original cfg does have the space.

The main problem is that we get the error only at 3:00 UTC; it recovers after 5 minutes and everything returns to normal.

My logs are also missing after archiving.
tgriep
Madmin
Posts: 9177
Joined: Thu Oct 30, 2014 9:02 am

Re: Nagios and random snmp errors Description/Type table

Post by tgriep »

So you get the errors only at 3 AM, once a day?

Are backups running on that server at that time, causing the check command to time out?

Try adding a 60-second timeout to your command. Edit your command and add

-t 60
to it and see if that resolves it for you.
Kriyeshh
Posts: 18
Joined: Wed May 13, 2015 5:15 pm
Location: India

Re: Nagios and random snmp errors Description/Type table

Post by Kriyeshh »

Yes, tgriep, you are right.
I suspect the same: are backups running on that server at that time, causing the check command to time out?
Good catch! :P

If tgriep's view is correct, the cause is likely one of the following:
# swap/memory overload at that particular time
# process priority allocation
# limits on the number of open files

I recommend capturing the system load around the suspicious time you mentioned, which will help you debug further. (Use ps, top, or any other command you are familiar with, or schedule a cron job, keeping an eye on available disk space.)
Also capture the swap usage at that time.

Hold on!
It will be worthwhile to collect these details from both servers: the Nagios server and the monitored server.
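
One possible way to capture load and swap usage is sketched below. It assumes a Linux host where `/proc/loadavg` and `/proc/meminfo` are available; the function names are hypothetical and nothing here comes from the thread itself.

```python
import time

def parse_loadavg(text):
    """Return the 1-, 5- and 15-minute load averages from /proc/loadavg text."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)

def parse_swap_used_kb(meminfo_text):
    """Return used swap in kB, computed as SwapTotal - SwapFree."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("SwapTotal", "SwapFree"):
            values[key] = int(rest.split()[0])
    return values["SwapTotal"] - values["SwapFree"]

def snapshot():
    """Build one log line with a UTC timestamp, load averages and swap usage."""
    with open("/proc/loadavg") as f:
        load = parse_loadavg(f.read())
    with open("/proc/meminfo") as f:
        swap = parse_swap_used_kb(f.read())
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime())
    return "%s load=%.2f/%.2f/%.2f swap_used_kb=%d" % (stamp, *load, swap)

# To record a sample, append snapshot() to a log file of your choice,
# for example from a cron job that runs once a minute around the failure window.
```

Scheduling this on both the Nagios server and the monitored host for a few minutes before and after the failure time keeps the log small while still covering the window of interest.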

-Wishes
Kriyeshh
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: Nagios and random snmp errors Description/Type table

Post by tmcdonald »

@sasivarenan, let us know how tgriep and Kriyeshh's comments work out for you!
sasivarenan
Posts: 14
Joined: Wed Mar 04, 2015 3:02 pm

Re: Nagios and random snmp errors Description/Type table

Post by sasivarenan »

Yes, tgriep, some small backups are running at that time and the check command is also timing out, but there is no sign of errors on that server. We tried increasing the timeout to 60 seconds, but it didn't help. We are investigating in detail; I will keep you posted once done.

Sure @tmcdonald....

Thanks guys.
tgriep
Madmin
Posts: 9177
Joined: Thu Oct 30, 2014 9:02 am

Re: Nagios and random snmp errors Description/Type table

Post by tgriep »

In your nagios.cfg file, is the service check timeout set to 60, as below?

service_check_timeout=60
If not, you should change that setting and restart nagios to see if that helps.
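
The reason this setting matters: Nagios terminates any service check that runs longer than service_check_timeout, so a plugin `-t` value above that limit never gets a chance to take effect. A small sketch for comparing the two values follows. The function names are hypothetical, and the fallback of 60 seconds when the directive is absent is an assumption about the Nagios Core default.

```python
import re

def read_service_check_timeout(cfg_text, default=60):
    """Return service_check_timeout from nagios.cfg text, or default if absent.

    The default of 60 is an assumption about Nagios Core's built-in value.
    """
    for line in cfg_text.splitlines():
        line = line.strip()
        if line.startswith("#"):
            continue  # skip comment lines
        m = re.match(r"service_check_timeout\s*=\s*(\d+)$", line)
        if m:
            return int(m.group(1))
    return default

def plugin_timeout_fits(plugin_timeout, cfg_text):
    """True if the plugin's -t value does not exceed Nagios's own limit."""
    return plugin_timeout <= read_service_check_timeout(cfg_text)

# Example with an in-memory config fragment:
cfg = "# timeouts\nservice_check_timeout=60\n"
print(plugin_timeout_fits(60, cfg))   # True
print(plugin_timeout_fits(120, cfg))  # False
```

In practice the text would be read from the nagios.cfg file on the Nagios server, whose path depends on the installation.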
sasivarenan
Posts: 14
Joined: Wed Mar 04, 2015 3:02 pm

Re: Nagios and random snmp errors Description/Type table

Post by sasivarenan »

Below is the current value for the service check timeout:

service_check_timeout=120
tgriep
Madmin
Posts: 9177
Joined: Thu Oct 30, 2014 9:02 am

Re: Nagios and random snmp errors Description/Type table

Post by tgriep »

Try changing the timeout setting for the check_snmp_storage check to 120 and see if that helps.

-t 120