SNMP trap system unreliable?

Posted: Thu Mar 03, 2016 10:57 am
by gormank
This isn't really a Nagios question...
I've recently been looking into re-enabling SNMP trap monitors, after disabling them and switching to active monitoring ~6 months ago. I only disabled the trap listening service in Nagios, not the underlying snmptrapd or snmptt services. I moved away from traps because of the unreliability I noticed.
Of 4 systems, 2 had snmptrapd services that weren't processing traps (no traps logged in /var/log/messages), and one had a stalled snmptt service (service snmptt restart showed FAILED on the stop, which I've found indicates snmptt is no longer working). This is on boxes that have been up for 6 days.

It seems I need to restart these services daily. Are others seeing the same behavior or am I special (again)? :)

Re: SNMP trap system unreliable?

Posted: Thu Mar 03, 2016 5:56 pm
by ssax
I would say if the traps are not in /var/log/messages (and the others are) then they are either not getting to the XI server, snmptrapd doesn't like something with the trap, or there is a misconfiguration/issue with snmptrapd.

Have you tried running a tcpdump to see if they are getting there?
- Change X.X.X.X to the sender's IP that is having issues.

Code:

tcpdump -nni eth0 src X.X.X.X

Please zip up and attach your /etc/snmp directory and your /usr/local/bin/snmptraphandling.py script so that I can validate the configurations and versions you have.

Thank you

Re: SNMP trap system unreliable?

Posted: Thu Mar 03, 2016 5:57 pm
by Box293
Generally, no, we don't see this issue on customers' systems.

I have seen major delays with snmptt and snmptrapd when the server uses an external DNS server like 8.8.8.8, which causes lots of traps to spool.

Here is a starting point to troubleshoot trap problems:
https://support.nagios.com/kb/category.php?id=55

This tutorial goes into detail as to how the traps actually work from when they enter the server through to the Nagios services.
https://support.nagios.com/kb/article.php?id=77

It's long, but it's worth setting up the scenario as a way of playing with SNMP as I find that's the best way to learn.

Re: SNMP trap system unreliable?

Posted: Thu Mar 03, 2016 6:34 pm
by gormank
A restart of snmptrapd made it start logging into the syslog, as did restarting snmptt on the server where it was stalled, so the issue is with the services locking up as I said.

Re: SNMP trap system unreliable?

Posted: Thu Mar 03, 2016 6:44 pm
by Box293
One of the most frustrating points with snmptrapd is that the logging is nearly non-existent, which makes it really hard to troubleshoot.

Are the traps you're sending SNMP version 1, 2, or 3?

Re: SNMP trap system unreliable?

Posted: Thu Mar 03, 2016 7:10 pm
by gormank
Traps are v1 and v2, no v3.

I normally get a lot of traps logged in syslog, so when I run grep snmp /var/log/messages | wc -l and get 0, I know something's wrong...
For some reason I get traps telling me uptime for hosts all the time. Most of the traps are from ILOs, with a few blade chassis, DLn80s, fiber switches and 3par. I get traps from hardware, not OSs...

I see one issue: my failover servers didn't have the updated nagios init script (from an earlier, similar post) that restarts snmptt when nagios is started. The failover servers weren't used for a while when we tried to use VMware FT, but it required using a single core, which put CPU usage through the roof.

The change to the init script was done because when nagios is restarted, possibly when snmptt is reconfigured via the trap translation under admin, snmptt stops working. This is funny since I'm pretty sure the only way you know it stopped working (other than no traps translated) is restarting the service fails (it says FAILED) on the stop part. This may be the wrong way to fix that issue, but it works. So we can remove snmptt from the conversation if there's no better solution.
Still, 2 of 4 nagios hosts had a stalled snmptrapd after 6 days of uptime.
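
The syslog check described above (grep snmp /var/log/messages | wc -l returning 0) could be wrapped into a small check script so a stalled snmptrapd is noticed without grepping by hand. This is only a rough sketch: the function names and the Nagios-style exit codes (0 OK, 2 CRITICAL, 3 UNKNOWN) are assumptions for illustration, not something taken from the thread.

```shell
#!/bin/sh
# Sketch: report a problem when the syslog contains no "snmp" lines.
# Function names and exit-code mapping are illustrative assumptions.

count_snmp_lines() {
    # $1: log file to scan (for example /var/log/messages).
    # grep -c prints the number of matching lines; it exits 1 when there
    # are none, so "|| true" keeps callers using "set -e" from aborting.
    grep -c snmp "$1" || true
}

check_snmp_log() {
    logfile=${1:-/var/log/messages}
    if [ ! -r "$logfile" ]; then
        echo "UNKNOWN - cannot read $logfile"
        return 3
    fi
    count=$(count_snmp_lines "$logfile")
    if [ "$count" -eq 0 ]; then
        echo "CRITICAL - no snmp lines in $logfile"
        return 2
    fi
    echo "OK - $count snmp lines in $logfile"
    return 0
}
```

Calling check_snmp_log with no arguments scans /var/log/messages. A freshly rotated log would also report 0 lines, so a check like this probably needs to tolerate the period right after rotation.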

Re: SNMP trap system unreliable?

Posted: Thu Mar 03, 2016 10:42 pm
by Box293
What output do you get from:

Code:

md5sum /usr/local/bin/snmptraphandling.py

I get:

Code:

0639919b86c9e659ed04b1c63052bbbc  /usr/local/bin/snmptraphandling.py

There was a bug in this script a while ago: it created /usr/local/nagios/var/rw/nagios.cmd when nagios was stopped, which caused issues. The md5sum above is from the most up-to-date script, which fixes this problem.
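
If the same comparison has to be repeated on several servers, it could be scripted along these lines. This is a minimal sketch: verify_md5 is a hypothetical helper name, and it assumes md5sum and cut are available on the box.

```shell
#!/bin/sh
# Sketch: compare a file's MD5 digest against an expected value.
# verify_md5 is a hypothetical helper, not part of Nagios XI.

verify_md5() {
    file=$1
    expected=$2
    if [ ! -r "$file" ]; then
        echo "UNKNOWN - cannot read $file"
        return 3
    fi
    # md5sum prints "<digest>  <path>"; keep only the digest.
    actual=$(md5sum "$file" | cut -d' ' -f1)
    if [ "$actual" = "$expected" ]; then
        echo "OK - $file matches $expected"
        return 0
    fi
    echo "MISMATCH - $file is $actual, expected $expected"
    return 1
}
```

For example: verify_md5 /usr/local/bin/snmptraphandling.py 0639919b86c9e659ed04b1c63052bbbc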

Re: SNMP trap system unreliable?

Posted: Fri Mar 04, 2016 3:16 pm
by gormank
Are you suggesting that snmptraphandling.py can lock up snmptt or snmptrapd? It hardly seems likely...

# md5sum /usr/local/bin/snmptraphandling.py
0639919b86c9e659ed04b1c63052bbbc /usr/local/bin/snmptraphandling.py
# ll /usr/local/bin/snmptraphandling.py
-rwxr-xr-x 1 root root 2448 May 5 2015 /usr/local/bin/snmptraphandling.py

Re: SNMP trap system unreliable?

Posted: Sun Mar 06, 2016 5:33 pm
by Box293
gormank wrote:Are you suggesting that snmptraphandling.py can lock up snmptt or snmptrapd? It hardly seems likely...

# md5sum /usr/local/bin/snmptraphandling.py
0639919b86c9e659ed04b1c63052bbbc /usr/local/bin/snmptraphandling.py
# ll /usr/local/bin/snmptraphandling.py
-rwxr-xr-x 1 root root 2448 May 5 2015 /usr/local/bin/snmptraphandling.py

Just trying to rule things out, I've seen stranger things happen.

gormank wrote:The change to the init script was done because when nagios is restarted, possibly when snmptt is reconfigured via the trap translation under admin, snmptt stops working. This is funny since I'm pretty sure the only way you know it stopped working (other than no traps translated) is restarting the service fails (it says FAILED) on the stop part. This may be the wrong way to fix that issue, but it works. So we can remove snmptt from the conversation if there's no better solution.
Still, 2 of 4 nagios hosts had a stalled snmptrapd after 6 days of uptime.

When snmptt stops working, are traps being spooled in /var/spool/snmptt/?

One way to detect the problem is to create a localhost service that watches the number of files in /var/spool/snmptt/. Use the "Folder Watch" wizard to create such a service.

I would be interested in seeing whether a specific trap is holding things up. The traps spooled in this directory are named after the time they were created, so if that is the case you could grab a copy of the files for us to inspect.

After restarting snmptt, do all the files in this directory get processed, or are there some that stay there permanently?
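
For anyone who would rather script the spool check than use the wizard, something like the following could work. It is a rough sketch and not what the "Folder Watch" wizard generates: the function names and the default warning/critical thresholds (50 and 200 files) are made-up values for illustration, and it assumes a find that supports -maxdepth.

```shell
#!/bin/sh
# Sketch: alert when too many files pile up in the snmptt spool directory.
# Function names and default thresholds are illustrative assumptions.

count_spool_files() {
    # $1: spool directory. Counts regular files directly inside it,
    # ignoring subdirectories.
    find "$1" -maxdepth 1 -type f | wc -l | tr -d ' '
}

check_snmptt_spool() {
    dir=${1:-/var/spool/snmptt}
    warn=${2:-50}     # hypothetical warning threshold
    crit=${3:-200}    # hypothetical critical threshold
    if [ ! -d "$dir" ]; then
        echo "UNKNOWN - $dir is not a directory"
        return 3
    fi
    n=$(count_spool_files "$dir")
    if [ "$n" -ge "$crit" ]; then
        echo "CRITICAL - $n files spooled in $dir"
        return 2
    elif [ "$n" -ge "$warn" ]; then
        echo "WARNING - $n files spooled in $dir"
        return 1
    fi
    echo "OK - $n files spooled in $dir"
    return 0
}
```

A steadily growing count would suggest snmptrapd is still handing traps off while snmptt has stopped processing them.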

Re: SNMP trap system unreliable?

Posted: Mon Mar 07, 2016 10:51 am
by gormank
The snmptt issue is resolved with the fix I mentioned. I also mentioned it's no longer an issue. As long as snmptrapd was working, files would spool while snmptt was stalled. When snmptt was restarted, they would be processed.