SNMP Trap Reliability
Posted: Fri Sep 21, 2012 2:49 pm
Has anyone else had issues with SNMP trap reliability, or does anyone have suggestions for making trap reception and translation more reliable in this situation? It's been spotty for me: the snmptrapd service has gotten into a bad state multiple times, which prevents any traps from coming in, so the notifications that others and I rely on never fire when there are problems. Traps are still being sent from the remote end and everything mostly appears normal, but traps suddenly stop showing up in the snmptt logs on the NagiosXI server.
From my perspective the problem seems to be with snmptrapd, but I'm not sure whether the snmptt service or the Python script that submits the passive check to Nagios can affect snmptrapd's state. Both the snmptrapd and snmptt services are still running when this happens. Restarting snmptt doesn't resolve the issue, but restarting snmptrapd does. I didn't think to check whether traffic was still coming in on UDP port 162 before I restarted it.
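For anyone who wants to capture that missing data point before restarting, here is a rough Python sketch of one way to look at the trap socket. It does not sniff traffic; it reads the kernel's UDP socket table and reports the receive queue for port 162. The idea (an assumption on my part, not something I have confirmed on this box) is that if traps are still arriving but snmptrapd has stopped reading them, rx_queue should stay above zero or keep growing. The function names are made up for illustration, and the drops column may not exist on older kernels.

```python
SNMP_TRAP_PORT = 162  # standard SNMP trap port


def udp_socket_stats(proc_net_udp_text, port=SNMP_TRAP_PORT):
    """Return rx_queue (bytes waiting) and drops for UDP sockets bound to port.

    Expects text in the layout of /proc/net/udp on Linux.
    """
    stats = []
    for line in proc_net_udp_text.splitlines():
        fields = line.split()
        # Data rows start with a slot number such as "45:"; skip the header.
        if len(fields) < 5 or not fields[0].endswith(":"):
            continue
        # local_address looks like "00000000:00A2" (hex address : hex port).
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        if local_port != port:
            continue
        # Fifth column is "tx_queue:rx_queue", both in hex.
        rx_queue = int(fields[4].split(":")[1], 16)
        # Newer kernels append a decimal "drops" column as the 13th field.
        drops = int(fields[12]) if len(fields) >= 13 else None
        stats.append({"rx_queue": rx_queue, "drops": drops})
    return stats


def read_stats(path="/proc/net/udp"):
    """Read the live socket table; use /proc/net/udp6 for IPv6 listeners."""
    with open(path) as handle:
        return udp_socket_stats(handle.read())
```

Calling `read_stats()` a few times, several seconds apart, while the problem is occurring should show whether the queue is backing up. An empty list would mean nothing is bound to port 162 in that table at all.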
The only thing that catches my eye is that snmptraphandling.py hangs around in the process table when this happens. I set up a monitor to watch for this process, but I'm not certain it's always going to be there. Does anyone have any thoughts on what might be causing this?
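In case it helps anyone build a similar monitor, this is a minimal sketch of the idea in Python, not the exact check I'm running. It flags any snmptraphandling.py process that has been alive longer than a threshold, on the assumption that the handler normally exits within seconds. The 300-second threshold and the function names are placeholders I picked for illustration, and `subprocess.check_output` needs Python 2.7 or later.

```python
import subprocess

PROCESS_NAME = "snmptraphandling.py"  # handler script seen lingering
MAX_AGE_SECONDS = 300  # assumed threshold; tune for your environment


def parse_etime(etime):
    """Convert ps elapsed time, formatted [[dd-]hh:]mm:ss, to seconds."""
    days = 0
    if "-" in etime:
        day_part, etime = etime.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in etime.split(":")]
    while len(parts) < 3:  # pad missing hours with zero
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def find_stale(ps_output, name=PROCESS_NAME, max_age=MAX_AGE_SECONDS):
    """Return (pid, age_seconds) for matching processes older than max_age."""
    stale = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)  # pid, etime, full command line
        if len(fields) < 3:
            continue
        pid, etime, args = fields
        if name in args:
            age = parse_etime(etime)
            if age > max_age:
                stale.append((int(pid), age))
    return stale


def check():
    """Run ps and return a Nagios-style (exit_code, message) pair."""
    # Separate -o options with empty headers suppress the header line.
    cmd = ["ps", "-e", "-o", "pid=", "-o", "etime=", "-o", "args="]
    output = subprocess.check_output(cmd).decode()
    stale = find_stale(output)
    if stale:
        return 2, "CRITICAL: stale %s processes: %s" % (PROCESS_NAME, stale)
    return 0, "OK: no stale %s processes" % PROCESS_NAME
```

A wrapper would print the message from `check()` and exit with the returned code (0 for OK, 2 for CRITICAL). As noted above, this only helps if the stuck handler really is present every time snmptrapd wedges.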
On a separate but semi-related note that might help others: the default install of snmptrapd came with a bad stop function in /etc/init.d/snmptrapd. If I ran stop or restart (which does a stop followed by a start) against snmptrapd, it didn't work, and this output was sent to standard error:
Stopping snmptrapd: pidof: invalid options on command line!
pidof: invalid options on command line!
The stop function calls killproc (a function imported from /etc/init.d/functions) with "-On" as an argument, which caused the error. Since killproc ends up executing a kill, I don't think an "O" (the letter, not zero) or a literal "n" is a valid option. Maybe the O was meant to be a "0" (zero), which can be passed to kill, but that only performs error checking and doesn't actually send a signal. To fix it in my situation I removed the "-On". The snmptrapd daemon itself does accept the "-On" option, so my guess is that this was just a mistake and the flags ended up in the wrong spot.
RHEL 6.2 + NagiosXI 2011R3.3
Thanks!
Bryant