Hello,
We recently discovered that our Linux SNMP checks return errors throughout the day. The errors appear to be brief, as they clear almost immediately.
We are only running this type of check on a few servers, but we have noticed the errors on all of them. One server in particular is alerting 15 or more times per day per check.
On this particular server, we are running 1 CPU check, 1 MEM check, and 10 DISK checks.
For the CPU check, we are getting UNKNOWN - No answer from host. This is the check command:
$USER1$/check_snmp_load_wizard.pl -H $HOSTADDRESS$ -C <community string> --v2c -w 95 -c 98 -f
For the MEM check, we are getting UNKNOWN - ERROR: netsnmp : No response from remote host "<hostname>". This is the check command:
$USER1$/check_snmp_mem.pl -H $HOSTADDRESS$ -C <community string> -2 -w 90,70 -c 95,75 -f
For the DISK checks, we are getting CRITICAL - ERROR: Description/Type table : No response from remote host "<hostname>". This is the check command:
$USER1$/check_snmp_storage_wizard.pl -H $HOSTADDRESS$ -C <community string> --v2c -m "^/var$" -w 95 -c 98 -f
The particular server is running Red Hat Enterprise Linux v7.9.0 STANDARD
We are in the process of setting up a new QA environment and added the same checks there, and we are getting similar results.
I then made a couple of changes:
1) In check_snmp_load_wizard.pl specifically, I enabled my $TIMEOUT = 30; (previously using the default of 15;).
2) I added -t 60 to the checks.
The result was fewer alerts, and all alerts now come in as UNKNOWN - ERROR: General time-out (Alarm signal)
I have also tried combining the disk checks into a single check, thinking the issue is the frequency of the SNMP calls to the server. This has produced similar results: fewer alerts, with the same General time-out error. We also lose performance graphs.
Any assistance in resolving this is greatly appreciated. The issue is not the alerts, but the noise within the UI and in the state history, making it difficult for our application owners to be aware of any legitimate issues.
Thanks in advance.
Linux SNMP Checks Returning Intermittent Errors
Re: Linux SNMP Checks Returning Intermittent Errors
Hi @shoreypu,
It sounds like there could be intermittent connection issues for these servers, if the error you're getting in your checks is that the host isn't answering. Rather than adjusting your check interval, if the issue is very intermittent and you're mostly looking to clear up noise, I'd recommend adjusting your error thresholds on the checks. "Max Check Attempts" is the number of times your server will re-attempt the check before notifying you that there is an issue.
I would also advise other diagnostics for these servers, like running a ping on a short timeout and observing how often you get dropped responses, as this could be indicative of network issues unrelated to your monitoring setup.
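To illustrate why raising "Max Check Attempts" cuts down on noise, here is a rough Python sketch of the idea. It is a simplified model, not Nagios code: the function name and the sample result list are made up for illustration, and it only counts how often a run of consecutive failed checks gets long enough to become a hard problem state (the point at which notifications would normally go out).

```python
def count_hard_problems(results, max_check_attempts):
    """Count runs of consecutive failures that reach max_check_attempts.

    results is a list of booleans, True for an OK check result and
    False for a failed one (hypothetical data, oldest first).
    """
    hard_problems = 0
    consecutive_failures = 0
    for ok in results:
        if ok:
            # A good result resets the counter: the soft problem recovers.
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            # The run becomes a hard problem exactly once, when it
            # reaches the configured number of attempts.
            if consecutive_failures == max_check_attempts:
                hard_problems += 1
    return hard_problems


# Hypothetical history: two brief blips and one sustained outage.
history = [True, False, True, True, False, True, False, False, False, True]

for attempts in (1, 3):
    print(attempts, count_hard_problems(history, attempts))
```

With one attempt, every blip counts as a problem. With three attempts, only the sustained run of failures does, which is the effect you would be after here.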
Re: Linux SNMP Checks Returning Intermittent Errors
Thanks. I've been trying to replicate this. No issues with ping and no issues running snmp checks or snmpwalks from command line. I'm wondering if I'm having an intermittent routing issue, since our system has multiple interfaces. The errors seem brief and random. Please keep any thoughts coming.
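Since one-off tests from the command line come back clean, one option is to leave a probe running for a few hours and compare the timestamps of any failures with the alert times. Below is a minimal Python sketch of that idea. It assumes the net-snmp snmpget tool is installed on the monitoring server and that the agent answers sysUpTime.0; the function names, the 5 second timeout and the 10 second interval are my own choices, not anything taken from the plugins.

```python
import subprocess
import time

SYS_UPTIME_OID = "1.3.6.1.2.1.1.3.0"  # sysUpTime.0


def build_snmpget_command(host, community, timeout_s=5, retries=0):
    """Build a single-attempt SNMP v2c GET using the net-snmp CLI."""
    return ["snmpget", "-v", "2c", "-c", community,
            "-t", str(timeout_s), "-r", str(retries),
            host, SYS_UPTIME_OID]


def probe_once(command, run=subprocess.run):
    """Run one probe and return (succeeded, elapsed_seconds)."""
    started = time.monotonic()
    try:
        completed = run(command, capture_output=True, text=True)
        ok = completed.returncode == 0
    except OSError:
        # For example, snmpget is not installed or not on the PATH.
        ok = False
    return ok, time.monotonic() - started


def summarize(samples):
    """Summarize a list of (succeeded, elapsed_seconds) tuples."""
    failures = sum(1 for ok, _ in samples if not ok)
    slowest = max((elapsed for _, elapsed in samples), default=0.0)
    return {"total": len(samples), "failures": failures, "slowest_s": slowest}


def run_probe(host, community, count, interval_s=10.0):
    """Probe the host count times, printing a timestamped line per probe."""
    command = build_snmpget_command(host, community)
    samples = []
    for _ in range(count):
        ok, elapsed = probe_once(command)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(f"{stamp} {'ok' if ok else 'FAILED'} {elapsed:.2f}s")
        samples.append((ok, elapsed))
        time.sleep(interval_s)
    return summarize(samples)
```

Calling something like run_probe("<hostname>", "<community string>", count=360) would give about an hour of samples. Retries are set to 0 on purpose so that a single dropped packet shows up as a failure instead of being hidden by a retry. If you suspect the multiple interfaces, running the same probe from another machine at the same time may help narrow it down.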
Re: Linux SNMP Checks Returning Intermittent Errors
Increase Timeout and Retries:
You've already increased the timeout to 30 seconds and added -t 60 to the checks. However, some plugins may have a maximum timeout limit. For instance, certain Nagios plugins have a hardcoded timeout limit of 60 seconds. You can try increasing the timeout further if the plugin allows it.
Investigate any potential network issues, such as high latency or packet loss, that could be affecting SNMP communication.
Review Plugin Documentation:
Consult the documentation for the specific Nagios plugins you're using to ensure correct usage and configuration options.
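As a rough way to reason about how the timeouts stack up, here is a small Python sketch. It assumes the usual SNMP client behaviour where each request is tried once plus the configured number of retries, and each try waits the full per-request timeout. The number of sequential requests a given plugin makes, and its retry count, are assumptions you would need to confirm in the plugin source; the function names are just for illustration.

```python
def worst_case_wait(timeout_s, retries, sequential_requests=1):
    """Longest time the SNMP requests could take if every try times out."""
    attempts_per_request = retries + 1  # the first try plus each retry
    return timeout_s * attempts_per_request * sequential_requests


def fits_under_alarm(alarm_s, timeout_s, retries, sequential_requests=1):
    """True if the worst case finishes before the overall alarm fires."""
    return worst_case_wait(timeout_s, retries, sequential_requests) < alarm_s


# Hypothetical numbers: 5 s per try, 1 retry, 3 requests in a row.
print(worst_case_wait(5, 1, 3))       # seconds needed in the worst case
print(fits_under_alarm(30, 5, 1, 3))  # compared against a 30 s alarm
print(fits_under_alarm(60, 5, 1, 3))  # compared against a 60 s alarm
```

If the worst case is not comfortably below the plugin's overall alarm, a few slow responses are enough to produce the General time-out (Alarm signal) result instead of the original "No response" error.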