SNMP Trap Reliability

bvinisky
Posts: 34
Joined: Wed Jul 06, 2011 11:28 am

Post by bvinisky »

Has anyone had issues with SNMP trap reliability, or does anyone have suggestions for improving the reliability of SNMP trap reception and translation in this situation? It's been spotty for me: the snmptrapd service has gotten into a bad state multiple times, which prevents any traps from coming in, so the notifications that others and I rely on never go out when there are problems. Traps are still being sent from the remote end and things mostly appear normal, but traps suddenly stop showing up in the snmptt logs on the NagiosXI server.

From my perspective the problem seems to be with snmptrapd, but I'm uncertain whether the snmptt service or the Python script that submits the passive check to Nagios can affect snmptrapd's state. Both the snmptrapd and snmptt services are still running. Restarting snmptt doesn't resolve the issue, but restarting snmptrapd does. I didn't think to check whether traffic was still coming in on port 162 before I restarted to fix it.

The only thing that catches my eye is that snmptraphandling.py hangs around in the process table when this happens. I set up a monitor to watch for this process, but I'm not certain it's always going to be there. Does anyone have any thoughts on what might be causing this?
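For anyone who wants to watch for the same symptom, here is a minimal sketch of a check that flags snmptraphandling.py processes that have been alive longer than expected. It is only an illustration: the 60-second threshold and the function names are assumptions, not anything that ships with Nagios XI. It parses the `etime` column of `ps` (format `[[dd-]hh:]mm:ss`).

```python
import subprocess


def parse_etime(etime):
    """Convert a ps ELAPSED value ([[dd-]hh:]mm:ss) to seconds."""
    days = 0
    if "-" in etime:
        day_part, etime = etime.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in etime.split(":")]
    while len(parts) < 3:  # pad missing hours
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def find_stale(ps_output, needle="snmptraphandling.py", max_age_s=60):
    """Return (pid, age_seconds) for matching processes older than max_age_s.

    max_age_s=60 is an arbitrary, assumed threshold; tune it to your setup.
    """
    stale = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)  # pid, etime, full command line
        if len(fields) < 3 or needle not in fields[2]:
            continue
        age = parse_etime(fields[1])
        if age > max_age_s:
            stale.append((int(fields[0]), age))
    return stale


if __name__ == "__main__":
    # "pid=", "etime=", "args=" suppress the header line.
    out = subprocess.check_output(
        ["ps", "-e", "-o", "pid=", "-o", "etime=", "-o", "args="]
    ).decode()
    for pid, age in find_stale(out):
        print("stale handler pid %d, running for %d s" % (pid, age))
```

A check like this could be wired into a Nagios plugin or cron job, but it only detects the hang; it does not explain why the handler is stuck.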


On a separate but semi-related note that might help others: the default install of snmptrapd came with a bad stop function in /etc/init.d/snmptrapd. If I ran stop or restart (which calls stop and then start) against snmptrapd, it didn't work, and this output was sent to standard error:

Stopping snmptrapd: pidof: invalid options on command line!

pidof: invalid options on command line!

The stop function calls killproc (a function imported from /etc/init.d/functions) with "-On" as an argument, which caused the issue. Since killproc is executing a kill, I don't think an "O" (the letter, not zero) or a literal "n" is a valid option. Maybe the O was meant to be a "0" (zero), which can be passed to kill, but that only performs error checking and doesn't actually send a signal. To fix it in my situation I removed the "-On". The snmptrapd process itself does accept the "-On" option, so my guess is that this was just a mistake and the flags ended up in the wrong spot.

RHEL 6.2 + NagiosXI 2011R3.3


Thanks!

Bryant
bvinisky
Posts: 34
Joined: Wed Jul 06, 2011 11:28 am

Re: SNMP Trap Reliability

Post by bvinisky »

I found that snmptt is running in standalone mode, and snmptrapd waits on snmptt to finish its work before processing additional traps. Snmptt in turn waits for the program it execs to return before it continues, so it very well could be that snmptraphandling.py hangs and clogs up the works.


From http://snmptt.sourceforge.net/docs/snmp ... ile-format...

Standalone or daemon mode:

The SNMPTRAPD program blocks when executing traphandle commands. This means that if the program called never quits, SNMPTRAPD will wait forever. If a trap is received while the traphandler is running, it is buffered and will be processed when the traphandler finishes. I do not know how large this buffer is.

The program called by SNMPTT (EXEC) blocks SNMPTT. If you call a program that does not return, SNMPTT will be left waiting. In standalone mode, this would cause snmptrapd to wait forever also.
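Given the blocking chain described above, one possible mitigation is to make sure whatever snmptt execs always returns within a bounded time. The following is a rough sketch of a timeout wrapper, not part of snmptt or Nagios XI. The function name and the 10-second default are assumptions, and it relies on `subprocess.run`, which needs Python 3.5 or later (newer than the system Python on RHEL 6).

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return its exit code, or None if it was killed on timeout.

    subprocess.run kills the child and waits for it when the timeout
    expires, so the caller is never blocked longer than about timeout_s.
    """
    try:
        completed = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None
    return completed.returncode


if __name__ == "__main__":
    # Hypothetical usage: wrapper.py <handler> [args...]
    # The 10-second limit is an assumed value, not a recommendation.
    code = run_with_timeout(sys.argv[1:], timeout_s=10)
    if code is None:
        sys.stderr.write("handler timed out and was killed\n")
        sys.exit(1)
    sys.exit(code)
```

With something like this sitting between snmptt and the handler, a hung handler would cost one trap's notification rather than stalling snmptrapd indefinitely. It treats the symptom only; the reason the handler hangs would still need to be tracked down.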
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: SNMP Trap Reliability

Post by scottwilkerson »

I am going to lock this thread. Please continue the discussion at
http://support.nagios.com/forum/viewtop ... =16&t=7464