Nagios and random snmp errors Description/Type table
-
- Posts: 14
- Joined: Wed Mar 04, 2015 3:02 pm
Team Greetings,
I'm running a Nagios instance on an AWS VPC network. I've opened all ports in the ACL and I can see the traffic, but I haven't been able to pin down the issue and we are firefighting.
The issue happens once a day, at what look like random times.
Please find the configuration details below:
### Process - Storage
define command {
command_name check_snmp_storage
command_line $USER1$/check_snmp_storage.pl -H $HOSTADDRESS$ -C HelloIMin-2 $USER8$ -m $ARG1$ -w $ARG2$ -c $ARG3$ $ARG4$ $ARG5$
}
define service{
use generic-service,graphed-service
host_name STDB02
service_description BACKUP12
check_command check_snmp_storage!"^/backup"!85!90!
}
Error Logs
[1431658849] SERVICE ALERT: STDB02;DATA1;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658852] SERVICE ALERT: STDB02;BACKUP;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658852] SERVICE ALERT: STDB02;CTLRD1;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658852] SERVICE ALERT: STDB02;CTLRD2;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658852] SERVICE ALERT: STDB02;U01 Mount;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658853] SERVICE ALERT: STDB02;DATA2;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658856] SERVICE ALERT: STDB02;DATA3;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658856] SERVICE ALERT: STDB02;CTLRD3;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658882] SERVICE ALERT: STDB02;DATA4;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658882] SERVICE ALERT: STDB02;DATA6;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658882] SERVICE ALERT: STDB02;Root Partition;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658882] SERVICE ALERT: STDB02;DATA5;UNKNOWN;SOFT;1;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658909] SERVICE ALERT: STDB02;DATA1;UNKNOWN;SOFT;2;ERROR: Description/Type table : No response from remote host "10.0.0.151".
[1431658911] SERVICE ALERT: STDB02;BACKUP;UNKNOWN;SOFT;2;ERROR: Description/Type table : No response from remote host "10.0.0.151".
Can you please suggest how to take this forward?
I look forward to hearing from you.
Many Thanks,
Sasi
Re: Nagios and random snmp errors Description/Type table
I believe you are missing a space here:
Code: Select all
command_line $USER1$/check_snmp_storage.pl -H $HOSTADDRESS$ -C HelloIMin-2 $USER8$ -m $ARG1$ -w $ARG2$ -c $ARG3$ $ARG4$ $ARG5$
Try changing the command to:
Code: Select all
command_line $USER1$/check_snmp_storage.pl -H $HOSTADDRESS$ -C HelloIMin -2 $USER8$ -m $ARG1$ -w $ARG2$ -c $ARG3$ $ARG4$ $ARG5$
I am not sure what "$USER8$" is...
Try testing your check from the CLI:
Code: Select all
./check_snmp_storage.pl -H 10.0.0.151 -C HelloIMin -2 -m "^/backup" -w 85 -c 90
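Since the failure is intermittent, a single CLI run may well succeed. Below is a minimal Python sketch that repeats a check command and records the exit code of each run, so a failing window can be spotted afterwards. The helper names `run_check` and `poll` are hypothetical, and the mapping of a timeout to exit code 3 follows the standard Nagios plugin return codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN); it is not something the plugin itself provides.

```python
import subprocess
import time


def run_check(argv, timeout_s):
    """Run one plugin invocation; return (exit_code, first line of output)."""
    try:
        done = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Treat a hung run as UNKNOWN (3), like Nagios would.
        return 3, "timed out after %ss" % timeout_s
    except OSError as exc:
        # Plugin missing or not executable.
        return 3, "cannot execute: %s" % exc
    lines = done.stdout.strip().splitlines()
    return done.returncode, (lines[0] if lines else "")


def poll(argv, runs, interval_s, timeout_s=60):
    """Run the check `runs` times; return a list of (utc_stamp, code, text)."""
    results = []
    for i in range(runs):
        code, text = run_check(argv, timeout_s)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime())
        results.append((stamp, code, text))
        if i < runs - 1:
            time.sleep(interval_s)
    return results
```

To use it, pass the CLI test command as an argument list (plugin path first, then each option as its own item) to `poll`, for example with `runs=30, interval_s=20` around the time the alerts usually appear, and look for entries whose code is not 0.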
Be sure to check out our Knowledgebase for helpful articles and solutions!
Re: Nagios and random snmp errors Description/Type table
Thanks for your reply.
Sorry, I didn't notice the missing space while posting; in my original cfg the space is there.
The main problem is that we get the error only at 3:00 UTC. It recovers after 5 minutes and everything returns to normal.
My logs are also missing after archiving.
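The timestamps in the SERVICE ALERT lines from the first post are Unix epoch seconds, so the 3:00 UTC pattern can be checked directly. A small Python sketch (the function name `to_utc` is only for illustration; the epochs are copied from the log excerpt):

```python
from datetime import datetime, timezone

# Timestamps copied from the SERVICE ALERT lines in the first post.
ALERT_EPOCHS = [1431658849, 1431658852, 1431658882, 1431658911]


def to_utc(epoch_seconds):
    """Convert a Nagios log epoch timestamp to a readable UTC string."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S UTC")


if __name__ == "__main__":
    for ts in ALERT_EPOCHS:
        print(ts, "->", to_utc(ts))
```

The first alert, 1431658849, converts to 2015-05-15 03:00:49 UTC, which matches the time window described above.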
Re: Nagios and random snmp errors Description/Type table
So you are getting the errors only at 3am, once a day?
Are backups being run on that server at that time, causing the check command to time out?
Try adding a 60 second timeout to your command. Edit your command and add
Code: Select all
-t 60
to it and see if that resolves it for you.
Re: Nagios and random snmp errors Description/Type table
Yeah tgriep, you are right.
I suspect the same: are backups being run on that server at that time, causing the check command to time out?
Good catch!
If tgriep's view is correct, the likely causes are:
# swap/memory overload at that particular time
# process priority allocation
# limits on the number of open files
I recommend capturing the system load at the suspicious time you mentioned, which will help you debug further (use ps, top or any other commands you are familiar with, or set up a cron job, keeping an eye on system storage size).
Also capture the swap usage at that time.
Hold on!
It would be worthwhile to get these details from both servers: the Nagios server and the monitored server.
-Wishes
Kriyeshh
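As a rough illustration of the capture suggested above, here is a minimal Python sketch that writes one line with the load averages and swap usage. It assumes a Linux host (it reads `/proc/meminfo` and calls `os.getloadavg()`); the function names are hypothetical.

```python
import os
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def swap_used_kb(meminfo):
    """Swap in use = SwapTotal - SwapFree (both reported in kB)."""
    return meminfo.get("SwapTotal", 0) - meminfo.get("SwapFree", 0)


def sample_line(meminfo_text, loadavg):
    """Build one log line: UTC time, 1/5/15 minute load, swap used."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime())
    load = " ".join("%.2f" % x for x in loadavg)
    used = swap_used_kb(parse_meminfo(meminfo_text))
    return "%s UTC load=%s swap_used_kb=%d" % (stamp, load, used)


if __name__ == "__main__":
    # Linux only: skip quietly where /proc/meminfo is not available.
    if os.path.exists("/proc/meminfo"):
        with open("/proc/meminfo") as fh:
            print(sample_line(fh.read(), os.getloadavg()))
```

Run from cron once a minute around the failure window on both the Nagios server and the monitored server, with output appended to a file, this gives a timeline to compare against the alert timestamps. Keep in mind that cron schedules use the system's local time zone, which may not be UTC.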
Re: Nagios and random snmp errors Description/Type table
@sasivarenan, let us know how tgriep and Kriyeshh's comments work out for you!
Former Nagios employee
Re: Nagios and random snmp errors Description/Type table
Yes tgriep, some small backups are running at that time and the check command is also timing out, but there is no sign of errors on that server. We tried increasing the timeout to 60 seconds, but it didn't help. We are looking into it in detail; I will keep you posted once done.
Sure @tmcdonald.
Thanks guys.
Re: Nagios and random snmp errors Description/Type table
In your nagios.cfg file, is the service check timeout set to 60, like below?
Code: Select all
service_check_timeout=60
If not, you should change that setting and restart Nagios to see if that helps.
Re: Nagios and random snmp errors Description/Type table
Below is the current value of the service check timeout:
service_check_timeout=120
Re: Nagios and random snmp errors Description/Type table
Try changing the timeout setting for the check_snmp_storage check to 120 and see if that helps.
Code: Select all
-t 120
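Two timeouts are in play in this thread: Nagios stops any service check that runs longer than `service_check_timeout`, and the plugin has its own `-t` value. The Python sketch below checks that the two agree. It assumes `-t` is given in seconds, as in the posts above, and the function names are hypothetical.

```python
import re


def service_check_timeout(cfg_text):
    """Return service_check_timeout from nagios.cfg text, or None if unset."""
    value = None
    for line in cfg_text.splitlines():
        line = line.strip()
        if line.startswith("#"):
            continue  # skip comment lines
        match = re.match(r"service_check_timeout\s*=\s*(\d+)$", line)
        if match:
            value = int(match.group(1))  # the last definition found is kept
    return value


def plugin_timeout(command_line):
    """Return the number passed with -t in a command_line, or None if absent."""
    match = re.search(r"(?:^|\s)-t\s+(\d+)(?=\s|$)", command_line)
    return int(match.group(1)) if match else None


def timeouts_consistent(cfg_text, command_line):
    """True when both values are set and the plugin's -t fits within
    service_check_timeout; False if either one is missing."""
    limit = service_check_timeout(cfg_text)
    own = plugin_timeout(command_line)
    if limit is None or own is None:
        return False
    return own <= limit
```

With `service_check_timeout=120` and `-t 120` the check passes; a plugin timeout larger than the Nagios limit would mean Nagios kills the check before the plugin gives up on its own.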