Intermittent alerts from Unix server

Post by **pbsindian** » Fri Oct 19, 2018 4:14 pm

Hi Team,
We have configured our Unix monitoring using SNMP. We are in the POC phase and are seeing frequent alerts from different servers at different times. We haven't found any pattern in the alerts.

For example, we received the following alerts, all at the same time:

For host :
tg-pxoct is DOWN CRITICAL - Plugin timed out while executing system call
And for each service :
tg-pxoct : Disk Usage is CRITICAL ERROR: Description/Type table : No response from remote host 10.XX.XX.XX.

The host was not down during that time, and we didn't find any issues with the host either. Given the intervals below, why did we receive alerts right away and then get recovery alerts within a few minutes? Why didn't it wait to complete the polling cycle before sending us alerts? Is it because the check timed out instead of returning a failure or success? How do we avoid these issues?

Check Interval : 5
Retry Interval : 1
Max check attempts : 19
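
For reference, here is a rough sketch of the delay these settings would normally imply before a problem notification. It assumes standard Nagios Core soft/hard state handling (notifications go out only once a problem reaches a HARD state) and intervals expressed in minutes; the function name is made up for illustration.

```python
def minutes_until_hard_state(retry_interval, max_check_attempts):
    """Minutes from the first failed check to the check that makes the state HARD.

    Assumption: the first failed check counts as attempt 1, and each
    remaining attempt is scheduled one retry interval after the previous one.
    """
    return (max_check_attempts - 1) * retry_interval


# Values taken from the settings listed above.
delay = minutes_until_hard_state(retry_interval=1, max_check_attempts=19)
print(f"Expected delay before a HARD state: about {delay} minutes")
```

If alerts arrive well before that delay, it may be worth checking whether the host and service definitions actually carry these values.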

We are using the same SNMP community string for Nagios and for our other monitoring tools. Will that be an issue?

Thanks,
Bhargava

Post by **tgriep** » Mon Oct 22, 2018 1:30 pm

One thing to try is to increase the timeout for your SNMP check.
Some plugins have to retrieve a lot of data when they poll a server, and if a plugin does not get all of the data in time, the check times out.
Most of the SNMP plugins have a 5 second timeout.
Try editing your command and increase the timeout to 59 seconds by adding the following to the command line.
-t 59

Another thing to look into: SNMP uses the UDP protocol, and if there is any network congestion, that data could be dropped.
Make sure your network devices are set to not drop that data.
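
To get a feel for how much dropped UDP traffic matters, here is a back-of-the-envelope sketch. It assumes each datagram is lost independently with the same probability, and that the plugin makes a fixed number of tries per check; both are simplifying assumptions, and whether a given plugin retries at all depends on the plugin.

```python
def check_failure_probability(loss_rate, tries):
    """Probability that every try of one SNMP check fails.

    Assumption: a try succeeds only if both the request and the response
    datagram arrive, and each datagram is lost independently.
    """
    single_try_fails = 1.0 - (1.0 - loss_rate) ** 2
    return single_try_fails ** tries


# Illustrative loss rate of 5%; not a measured value.
for tries in (1, 2, 3):
    print(tries, check_failure_probability(0.05, tries))
```

Even a modest loss rate produces occasional failed checks when a plugin makes only one try, which would look like intermittent "no response" errors.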

Let us know if this helps.

Post by **pbsindian** » Wed Oct 24, 2018 2:05 pm

Thank you. We have applied the timeout to the servers that were alerting. We will monitor them for the next couple of days.

We have been seeing different kinds of timeout errors, like those below, from various servers intermittently.

ERROR: General time-out (Alarm signal)
ERROR: Description/Type table : No response from remote host
No answer from host
service check timed out
no response from host

What is the best way to handle these timeouts?

Should we add -t 59 across all the 1000+ services we have onboarded so far?

Thanks,
Bhargava

Post by **tgriep** » Wed Oct 24, 2018 4:05 pm

If the timeout alerts are generated from SNMP checks, then I would increase the timeout value for the command that you are using for the checks.
That way, you would only have to edit a few commands in the Core Config Manager instead of editing the service checks individually to fix the timeout issue.
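
As a rough illustration of why the change belongs on the command, the sketch below sets the timeout option on a single command line that many services could share. The helper name and the example plugin name are hypothetical, and it assumes the plugin takes its timeout as `-t <seconds>`.

```python
import re


def set_plugin_timeout(command_line, seconds):
    """Return command_line with its -t option set to `seconds`, adding it if absent."""
    option = f"-t {seconds}"
    # Replace an existing standalone "-t <number>" if there is one.
    updated, count = re.subn(r"(?<!\S)-t\s+\d+(?!\S)", option, command_line)
    if count == 0:
        # No timeout option yet, so append one to the end of the command line.
        updated = f"{command_line.rstrip()} {option}"
    return updated


# Hypothetical command line; the plugin name is a placeholder, not a real plugin.
example = "$USER1$/check_snmp_example -H $HOSTADDRESS$ -C $ARG1$ -w $ARG2$ -c $ARG3$"
print(set_plugin_timeout(example, 59))
```

Every service that uses the command picks up the new timeout, so the 1000+ service definitions stay untouched. The plugin timeout should also stay below Nagios's own service check timeout, otherwise Nagios ends the check before the plugin's timeout can take effect.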

Nagios Support Forum
