Re: SNMP Service Check Timeout

Posted: Wed Feb 01, 2017 5:14 pm
by ssax
Could it be a DNS issue? Please try changing it to an IP address instead of the DNS name and see if you're still able to replicate:

Code:

./check_snmp_win.pl -H 192.168.10.10 -C xxxx -n 'BES Client'

Also, please send the output of these commands (run as root):

Code:

ps aux | grep nagios.cfg
ipcs -q
tail -n100 /var/log/messages

Thank you

Re: SNMP Service Check Timeout

Posted: Thu Feb 02, 2017 11:32 am
by christiandunn1
Here is the output of the commands (I tested with the Linux check, since we have the same issue with both scripts):

[root@prdmon1 libexec]# ./check_snmp_process_wizard.pl -H 199.214.10.76 -C xxxx --v2c -n 'BESClient' -w '0' -c '0'
1 process matching BESClient (> 0)
[root@prdmon1 libexec]# ./check_snmp_process_wizard.pl -H 199.214.10.76 -C xxxx --v2c -n 'BESClient' -w '0' -c '0'
ERROR: Alarm signal (Nagios time-out)

[root@prdmon1 libexec]# ps aux | grep nagios.cfg
nagios 22223 0.2 0.0 47244 13392 ? Ss Feb01 2:48 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 22268 0.0 0.0 42496 3836 ? S Feb01 0:02 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
root 30366 0.0 0.0 112652 984 pts/0 S+ 09:28 0:00 grep --color=auto nagios.cfg
[root@prdmon1 libexec]# ipcs -q

------ Message Queues --------
key msqid owner perms used-bytes messages
0x95010002 0 nagios 600 0 0
0x00010002 163841 nagios 600 0 0

[root@prdmon1 libexec]# tail -n100 /var/log/messages
Feb 2 09:26:01 prdmon1 systemd: Starting Session 127082 of user nagios.
Feb 2 09:26:01 prdmon1 systemd: Started Session 127081 of user nagios.
Feb 2 09:26:01 prdmon1 systemd: Starting Session 127081 of user nagios.
Feb 2 09:26:01 prdmon1 systemd: Started Session 127080 of user nagios.
Feb 2 09:26:01 prdmon1 systemd: Starting Session 127080 of user nagios.
Feb 2 09:26:01 prdmon1 systemd: Started Session 127084 of user nagios.
Feb 2 09:26:01 prdmon1 systemd: Starting Session 127084 of user nagios.
Feb 2 09:26:01 prdmon1 systemd: Started Session 127085 of user nagios.
Feb 2 09:26:01 prdmon1 systemd: Starting Session 127085 of user nagios.
Feb 2 09:26:01 prdmon1 systemd: Started Session 127083 of user nagios.
Feb 2 09:26:01 prdmon1 systemd: Starting Session 127083 of user nagios.
Feb 2 09:26:01 prdmon1 systemd: Started Session 127086 of user nagios.
Feb 2 09:26:01 prdmon1 systemd: Starting Session 127086 of user nagios.
Feb 2 09:26:01 prdmon1 systemd: Started Session 127087 of user nagios.
Feb 2 09:26:01 prdmon1 systemd: Starting Session 127087 of user nagios.
Feb 2 09:26:10 prdmon1 nagios: SERVICE ALERT: PRDUCM1B;BigFix;UNKNOWN;SOFT;1;ERROR: Alarm signal (Nagios time-out)
Feb 2 09:26:13 prdmon1 ndo2db: Trimming timedevents.
Feb 2 09:26:13 prdmon1 ndo2db: Trimming systemcommands.
Feb 2 09:26:13 prdmon1 ndo2db: Trimming servicechecks.
Feb 2 09:26:13 prdmon1 ndo2db: Trimming hostchecks.
Feb 2 09:26:13 prdmon1 ndo2db: Trimming eventhandlers.
Feb 2 09:26:20 prdmon1 systemd-logind: New session 127088 of user admin.ctd.
Feb 2 09:26:20 prdmon1 systemd: Started Session 127088 of user admin.ctd.
Feb 2 09:26:20 prdmon1 systemd: Starting Session 127088 of user admin.ctd.
Feb 2 09:26:20 prdmon1 dbus[750]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
Feb 2 09:26:20 prdmon1 dbus-daemon: dbus[750]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
Feb 2 09:26:20 prdmon1 dbus[750]: [system] Successfully activated service 'org.freedesktop.problems'
Feb 2 09:26:20 prdmon1 dbus-daemon: dbus[750]: [system] Successfully activated service 'org.freedesktop.problems'
Feb 2 09:26:20 prdmon1 nagios: SERVICE ALERT: AGEIDB01;McAfee AV;OK;SOFT;2;1 process matching nailsd (> 0)
Feb 2 09:26:25 prdmon1 nagios: SERVICE ALERT: AGLIDBWH01;McAfee AV;OK;SOFT;2;1 process matching nailsd (> 0)
Feb 2 09:26:26 prdmon1 su: (to root) admin.ctd on pts/0
Feb 2 09:26:34 prdmon1 nagios: SERVICE ALERT: AGWEBPC02;BigFix;OK;SOFT;2;1 process matching BESClient (> 0)
Feb 2 09:26:35 prdmon1 nagios: SERVICE ALERT: PRDMAV1;McAfee AV;UNKNOWN;SOFT;1;ERROR: Alarm signal (Nagios time-out)
Feb 2 09:26:41 prdmon1 nagios: SERVICE ALERT: AGLXWEB01;McAfee AV;UNKNOWN;SOFT;1;ERROR: Alarm signal (Nagios time-out)
Feb 2 09:26:47 prdmon1 nagios: SERVICE ALERT: PRDXDOM1B;BigFix;OK;SOFT;2;1 process matching BESClient (> 0)
Feb 2 09:26:48 prdmon1 nagios: SERVICE ALERT: AGLWAS02;BigFix;OK;SOFT;2;1 process matching BESClient (> 0)
Feb 2 09:27:01 prdmon1 systemd: Started Session 127090 of user nagios.
Feb 2 09:27:01 prdmon1 systemd: Starting Session 127090 of user nagios.
Feb 2 09:27:01 prdmon1 systemd: Started Session 127091 of user nagios.
Feb 2 09:27:01 prdmon1 systemd: Starting Session 127091 of user nagios.
Feb 2 09:27:01 prdmon1 systemd: Started Session 127094 of user nagios.
Feb 2 09:27:01 prdmon1 systemd: Starting Session 127094 of user nagios.
Feb 2 09:27:01 prdmon1 systemd: Started Session 127093 of user nagios.
Feb 2 09:27:01 prdmon1 systemd: Starting Session 127093 of user nagios.
Feb 2 09:27:01 prdmon1 systemd: Started Session 127092 of user nagios.
Feb 2 09:27:01 prdmon1 systemd: Starting Session 127092 of user nagios.
Feb 2 09:27:01 prdmon1 systemd: Started Session 127095 of user nagios.
Feb 2 09:27:01 prdmon1 systemd: Starting Session 127095 of user nagios.
Feb 2 09:27:01 prdmon1 systemd: Started Session 127097 of user nagios.
Feb 2 09:27:01 prdmon1 systemd: Starting Session 127097 of user nagios.
Feb 2 09:27:01 prdmon1 systemd: Started Session 127089 of user nagios.
Feb 2 09:27:01 prdmon1 systemd: Starting Session 127089 of user nagios.
Feb 2 09:27:01 prdmon1 systemd: Started Session 127096 of user nagios.
Feb 2 09:27:01 prdmon1 systemd: Starting Session 127096 of user nagios.
Feb 2 09:27:03 prdmon1 nagios: SERVICE ALERT: UATAPPS05;BigFix;UNKNOWN;SOFT;1;ERROR: Alarm signal (Nagios time-out)
Feb 2 09:27:03 prdmon1 nagios: SERVICE ALERT: PRDUCM1B;BigFix;OK;SOFT;2;1 process matching BESClient (> 0)
Feb 2 09:27:04 prdmon1 python: Unable to login to ESX
Feb 2 09:27:04 prdmon1 python: Virt backend 'env/cmdline' fails with error: Server raised fault: 'Cannot complete login due to an incorrect user name or password.'
Feb 2 09:27:14 prdmon1 ndo2db: Trimming timedevents.
Feb 2 09:27:14 prdmon1 ndo2db: Trimming systemcommands.
Feb 2 09:27:14 prdmon1 ndo2db: Trimming servicechecks.
Feb 2 09:27:14 prdmon1 ndo2db: Trimming hostchecks.
Feb 2 09:27:14 prdmon1 ndo2db: Trimming eventhandlers.
Feb 2 09:27:25 prdmon1 tac_plus[29010]: connect from 127.0.0.1 [127.0.0.1]
Feb 2 09:27:27 prdmon1 nagios: SERVICE ALERT: PRDMAV1;McAfee AV;OK;SOFT;2;4 services active (matching "McAfee Agent Service,McAfee Agent Backwards Compatibility Service,McAfee Agent Common Services,McAfee Service Controller") : OK
Feb 2 09:27:34 prdmon1 nagios: SERVICE ALERT: AGLXWEB01;McAfee AV;OK;SOFT;2;1 process matching nailsd (> 0)
Feb 2 09:27:38 prdmon1 nagios: SERVICE ALERT: AGAPPS05;BigFix;UNKNOWN;SOFT;1;ERROR: Alarm signal (Nagios time-out)
Feb 2 09:27:56 prdmon1 nagios: SERVICE ALERT: UATAPPS05;BigFix;OK;SOFT;2;1 process matching BESClient (> 0)
Feb 2 09:28:01 prdmon1 systemd: Started Session 127101 of user nagios.
Feb 2 09:28:01 prdmon1 systemd: Starting Session 127101 of user nagios.
Feb 2 09:28:01 prdmon1 systemd: Started Session 127100 of user nagios.
Feb 2 09:28:01 prdmon1 systemd: Starting Session 127100 of user nagios.
Feb 2 09:28:01 prdmon1 systemd: Started Session 127104 of user nagios.
Feb 2 09:28:01 prdmon1 systemd: Starting Session 127104 of user nagios.
Feb 2 09:28:01 prdmon1 systemd: Started Session 127099 of user nagios.
Feb 2 09:28:01 prdmon1 systemd: Starting Session 127099 of user nagios.
Feb 2 09:28:01 prdmon1 systemd: Started Session 127102 of user nagios.
Feb 2 09:28:01 prdmon1 systemd: Starting Session 127102 of user nagios.
Feb 2 09:28:01 prdmon1 systemd: Started Session 127103 of user nagios.
Feb 2 09:28:01 prdmon1 systemd: Starting Session 127103 of user nagios.
Feb 2 09:28:01 prdmon1 systemd: Created slice user-992.slice.
Feb 2 09:28:01 prdmon1 systemd: Starting user-992.slice.
Feb 2 09:28:01 prdmon1 systemd: Started Session 127098 of user pcp.
Feb 2 09:28:01 prdmon1 systemd: Starting Session 127098 of user pcp.
Feb 2 09:28:01 prdmon1 systemd: Started Session 127105 of user nagios.
Feb 2 09:28:01 prdmon1 systemd: Starting Session 127105 of user nagios.
Feb 2 09:28:01 prdmon1 systemd: Started Session 127106 of user nagios.
Feb 2 09:28:01 prdmon1 systemd: Starting Session 127106 of user nagios.
Feb 2 09:28:01 prdmon1 systemd: Started Session 127107 of user nagios.
Feb 2 09:28:01 prdmon1 systemd: Starting Session 127107 of user nagios.
Feb 2 09:28:01 prdmon1 systemd: Removed slice user-992.slice.
Feb 2 09:28:01 prdmon1 systemd: Stopping user-992.slice.
Feb 2 09:28:15 prdmon1 ndo2db: Trimming timedevents.
Feb 2 09:28:15 prdmon1 ndo2db: Trimming systemcommands.
Feb 2 09:28:15 prdmon1 ndo2db: Trimming servicechecks.
Feb 2 09:28:15 prdmon1 ndo2db: Trimming hostchecks.
Feb 2 09:28:15 prdmon1 ndo2db: Trimming eventhandlers.
Feb 2 09:28:31 prdmon1 nagios: SERVICE ALERT: AGAPPS05;BigFix;OK;SOFT;2;1 process matching BESClient (> 0)
Feb 2 09:28:44 prdmon1 nagios: SERVICE ALERT: PRDUCM1B;McAfee AV;UNKNOWN;SOFT;1;ERROR: Alarm signal (Nagios time-out)
Feb 2 09:28:45 prdmon1 nagios: SERVICE ALERT: UATADS1A;BigFix;UNKNOWN;SOFT;1;ERROR: Alarm signal (Nagios time-out)
[root@prdmon1 libexec]#


At this point, short of the Nagios server sending a malformed packet, I would agree it seems to be a network issue: midway through the UDP conversation, I see the Nagios server send a packet that is never received at the remote host. The conversation dies at that point regardless of the timeout set.

Re: SNMP Service Check Timeout

Posted: Thu Feb 02, 2017 11:49 am
by ssax
Ah ok, good catch on the UDP loss!

You do have multiple kernel message queues, which can cause strange issues (though they should not have interfered with command-line testing). You should still fix that:

Please run these commands to fix the message queues:

Code:

service nagios stop
ps aux | grep nagios.cfg | grep -v grep | awk '{print $2'} | xargs kill -9
service ndo2db stop
service mysqld restart
for i in `ipcs -q | grep nagios |awk '{print $2}'`; do ipcrm -q $i; done
service ndo2db start
service nagios start
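Before removing anything, it may be worth previewing which queues the `ipcrm` loop above would touch. The following is only a sketch: the helper name `nagios_queue_ids` is made up for illustration, and it assumes the column layout shown in the `ipcs -q` output earlier in this thread (msqid in column 2, owner in column 3). It matches the owner column exactly rather than grepping the whole line.

```shell
# Print the msqid (column 2) of every queue whose owner (column 3) is "nagios".
# Reads `ipcs -q` output on stdin, so it can be tried on saved output first.
# NOTE: nagios_queue_ids is a hypothetical helper, not part of Nagios XI.
nagios_queue_ids() {
    awk '$3 == "nagios" { print $2 }'
}

# Dry run: list the queue ids without removing them.
# ipcs -q | nagios_queue_ids
```

Once Nagios has been restarted, piping `ipcs -q` through the same helper and counting the lines is a quick way to confirm that only one queue is left.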

Try running a continuous ping from the XI server for a minute or so and see whether any packets are lost:

Code:

ping 199.214.10.76

Thank you

Re: SNMP Service Check Timeout

Posted: Thu Feb 02, 2017 3:32 pm
by christiandunn1
I got a syntax error running "ps aux | grep nagios.cfg | grep -v grep | awk '{print $2'} | xargs kill -9"

I (perhaps) foolishly ran the rest of the commands and here is the new ipcs output.

[root@prdmon1 ~]# ipcs -q
------ Message Queues --------
key msqid owner perms used-bytes messages
0x7c010002 32768 nagios 600 0 0


I also ran a continuous ping to the remote host for 2 minutes and didn't see any lost packets, even while concurrently running failing SNMP checks.

At this point unless you have any ideas I guess I will have to pinpoint where the packet is being dropped and for what reason.
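One way to narrow that down is to take the same capture at each point along the path (the Nagios server, any intermediate device that can capture, and the remote host) and see where the request is last observed. A rough sketch, assuming SNMP is on the default UDP port 161; the helper name `snmp_capture_filter`, the interface name `eth0`, and the output path are placeholders, not values from this thread.

```shell
# Build a tcpdump/BPF filter matching SNMP traffic (UDP port 161) to or from one host.
# NOTE: snmp_capture_filter is a hypothetical helper used only for illustration.
snmp_capture_filter() {
    printf 'udp port 161 and host %s\n' "$1"
}

# Example (needs root; "eth0" and the pcap path are placeholders):
# tcpdump -ni eth0 -w /tmp/snmp-check.pcap "$(snmp_capture_filter 199.214.10.76)"
```

Using the identical filter at every capture point keeps the traces directly comparable, so the hop where the request stops appearing stands out.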

Re: SNMP Service Check Timeout

Posted: Thu Feb 02, 2017 3:40 pm
by rkennedy
Just throwing this out there: you could use iperf or netcat to test UDP specifically, which may be helpful. I've seen setups where TCP/UDP filtering is in place while ICMP is left unfiltered, so a clean ping doesn't rule that out.

Netcat would also allow you to probe the SNMP port specifically, which could help with the digging.

http://superuser.com/questions/589732/h ... dp-packets - top two answers.
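Along the same lines, here is a minimal sketch for putting a number on the UDP loss. It swaps netcat for Net-SNMP's `snmpget` (assumed to be installed on the XI server), since a bare UDP probe can't confirm that a reply came back. The helper name `count_failures` is made up for illustration; the OID `1.3.6.1.2.1.1.3.0` is the standard sysUpTime instance.

```shell
# Run a probe command N times and print how many attempts failed.
# Usage: count_failures N command [args...]
# NOTE: count_failures is a hypothetical helper, not a Nagios or Net-SNMP tool.
count_failures() {
    attempts=$1
    shift
    failed=0
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # Any non-zero exit status (including a timeout) counts as one failure.
        "$@" >/dev/null 2>&1 || failed=$((failed + 1))
        i=$((i + 1))
    done
    echo "$failed"
}

# Example (not run here): 50 single-shot GETs of sysUpTime with a 2-second
# timeout and no retries, so each unanswered request counts as one failure.
# count_failures 50 snmpget -v2c -c xxxx -t 2 -r 0 199.214.10.76 1.3.6.1.2.1.1.3.0
```

Disabling retries with `-r 0` matters here: with the default retry behaviour, a single dropped datagram can be masked by a successful resend.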

Re: SNMP Service Check Timeout

Posted: Wed Mar 01, 2017 1:56 pm
by tmcdonald
Just checking in since we have not heard from you in a while. Did @rkennedy's post clear things up?