[root@prdmon1 libexec]# tail -n100 /var/log/messages
Feb 2 09:26:01 prdmon1 systemd: Starting Session 127082 of user nagios.
Feb 2 09:26:01 prdmon1 systemd: Started Session 127081 of user nagios.
Feb 2 09:26:01 prdmon1 systemd: Starting Session 127081 of user nagios.
Feb 2 09:26:01 prdmon1 systemd: Started Session 127080 of user nagios.
Feb 2 09:26:01 prdmon1 systemd: Starting Session 127080 of user nagios.
Feb 2 09:26:01 prdmon1 systemd: Started Session 127084 of user nagios.
Feb 2 09:26:01 prdmon1 systemd: Starting Session 127084 of user nagios.
Feb 2 09:26:01 prdmon1 systemd: Started Session 127085 of user nagios.
Feb 2 09:26:01 prdmon1 systemd: Starting Session 127085 of user nagios.
Feb 2 09:26:01 prdmon1 systemd: Started Session 127083 of user nagios.
Feb 2 09:26:01 prdmon1 systemd: Starting Session 127083 of user nagios.
Feb 2 09:26:01 prdmon1 systemd: Started Session 127086 of user nagios.
Feb 2 09:26:01 prdmon1 systemd: Starting Session 127086 of user nagios.
Feb 2 09:26:01 prdmon1 systemd: Started Session 127087 of user nagios.
Feb 2 09:26:01 prdmon1 systemd: Starting Session 127087 of user nagios.
Feb 2 09:26:10 prdmon1 nagios: SERVICE ALERT: PRDUCM1B;BigFix;UNKNOWN;SOFT;1;ERROR: Alarm signal (Nagios time-out)
Feb 2 09:26:13 prdmon1 ndo2db: Trimming timedevents.
Feb 2 09:26:13 prdmon1 ndo2db: Trimming systemcommands.
Feb 2 09:26:13 prdmon1 ndo2db: Trimming servicechecks.
Feb 2 09:26:13 prdmon1 ndo2db: Trimming hostchecks.
Feb 2 09:26:13 prdmon1 ndo2db: Trimming eventhandlers.
Feb 2 09:26:20 prdmon1 systemd-logind: New session 127088 of user admin.ctd.
Feb 2 09:26:20 prdmon1 systemd: Started Session 127088 of user admin.ctd.
Feb 2 09:26:20 prdmon1 systemd: Starting Session 127088 of user admin.ctd.
Feb 2 09:26:20 prdmon1 dbus[750]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
Feb 2 09:26:20 prdmon1 dbus-daemon: dbus[750]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
Feb 2 09:26:20 prdmon1 dbus[750]: [system] Successfully activated service 'org.freedesktop.problems'
Feb 2 09:26:20 prdmon1 dbus-daemon: dbus[750]: [system] Successfully activated service 'org.freedesktop.problems'
Feb 2 09:26:20 prdmon1 nagios: SERVICE ALERT: AGEIDB01;McAfee AV;OK;SOFT;2;1 process matching nailsd (> 0)
Feb 2 09:26:25 prdmon1 nagios: SERVICE ALERT: AGLIDBWH01;McAfee AV;OK;SOFT;2;1 process matching nailsd (> 0)
Feb 2 09:26:26 prdmon1 su: (to root) admin.ctd on pts/0
Feb 2 09:26:34 prdmon1 nagios: SERVICE ALERT: AGWEBPC02;BigFix;OK;SOFT;2;1 process matching BESClient (> 0)
Feb 2 09:26:35 prdmon1 nagios: SERVICE ALERT: PRDMAV1;McAfee AV;UNKNOWN;SOFT;1;ERROR: Alarm signal (Nagios time-out)
Feb 2 09:26:41 prdmon1 nagios: SERVICE ALERT: AGLXWEB01;McAfee AV;UNKNOWN;SOFT;1;ERROR: Alarm signal (Nagios time-out)
Feb 2 09:26:47 prdmon1 nagios: SERVICE ALERT: PRDXDOM1B;BigFix;OK;SOFT;2;1 process matching BESClient (> 0)
Feb 2 09:26:48 prdmon1 nagios: SERVICE ALERT: AGLWAS02;BigFix;OK;SOFT;2;1 process matching BESClient (> 0)
Feb 2 09:27:01 prdmon1 systemd: Started Session 127090 of user nagios.
Feb 2 09:27:01 prdmon1 systemd: Starting Session 127090 of user nagios.
Feb 2 09:27:01 prdmon1 systemd: Started Session 127091 of user nagios.
Feb 2 09:27:01 prdmon1 systemd: Starting Session 127091 of user nagios.
Feb 2 09:27:01 prdmon1 systemd: Started Session 127094 of user nagios.
Feb 2 09:27:01 prdmon1 systemd: Starting Session 127094 of user nagios.
Feb 2 09:27:01 prdmon1 systemd: Started Session 127093 of user nagios.
Feb 2 09:27:01 prdmon1 systemd: Starting Session 127093 of user nagios.
Feb 2 09:27:01 prdmon1 systemd: Started Session 127092 of user nagios.
Feb 2 09:27:01 prdmon1 systemd: Starting Session 127092 of user nagios.
Feb 2 09:27:01 prdmon1 systemd: Started Session 127095 of user nagios.
Feb 2 09:27:01 prdmon1 systemd: Starting Session 127095 of user nagios.
Feb 2 09:27:01 prdmon1 systemd: Started Session 127097 of user nagios.
Feb 2 09:27:01 prdmon1 systemd: Starting Session 127097 of user nagios.
Feb 2 09:27:01 prdmon1 systemd: Started Session 127089 of user nagios.
Feb 2 09:27:01 prdmon1 systemd: Starting Session 127089 of user nagios.
Feb 2 09:27:01 prdmon1 systemd: Started Session 127096 of user nagios.
Feb 2 09:27:01 prdmon1 systemd: Starting Session 127096 of user nagios.
Feb 2 09:27:03 prdmon1 nagios: SERVICE ALERT: UATAPPS05;BigFix;UNKNOWN;SOFT;1;ERROR: Alarm signal (Nagios time-out)
Feb 2 09:27:03 prdmon1 nagios: SERVICE ALERT: PRDUCM1B;BigFix;OK;SOFT;2;1 process matching BESClient (> 0)
Feb 2 09:27:04 prdmon1 python: Unable to login to ESX
Feb 2 09:27:04 prdmon1 python: Virt backend 'env/cmdline' fails with error: Server raised fault: 'Cannot complete login due to an incorrect user name or password.'
Feb 2 09:27:14 prdmon1 ndo2db: Trimming timedevents.
Feb 2 09:27:14 prdmon1 ndo2db: Trimming systemcommands.
Feb 2 09:27:14 prdmon1 ndo2db: Trimming servicechecks.
Feb 2 09:27:14 prdmon1 ndo2db: Trimming hostchecks.
Feb 2 09:27:14 prdmon1 ndo2db: Trimming eventhandlers.
Feb 2 09:27:25 prdmon1 tac_plus[29010]: connect from 127.0.0.1 [127.0.0.1]
Feb 2 09:27:27 prdmon1 nagios: SERVICE ALERT: PRDMAV1;McAfee AV;OK;SOFT;2;4 services active (matching "McAfee Agent Service,McAfee Agent Backwards Compatibility Service,McAfee Agent Common Services,McAfee Service Controller") : OK
Feb 2 09:27:34 prdmon1 nagios: SERVICE ALERT: AGLXWEB01;McAfee AV;OK;SOFT;2;1 process matching nailsd (> 0)
Feb 2 09:27:38 prdmon1 nagios: SERVICE ALERT: AGAPPS05;BigFix;UNKNOWN;SOFT;1;ERROR: Alarm signal (Nagios time-out)
Feb 2 09:27:56 prdmon1 nagios: SERVICE ALERT: UATAPPS05;BigFix;OK;SOFT;2;1 process matching BESClient (> 0)
Feb 2 09:28:01 prdmon1 systemd: Started Session 127101 of user nagios.
Feb 2 09:28:01 prdmon1 systemd: Starting Session 127101 of user nagios.
Feb 2 09:28:01 prdmon1 systemd: Started Session 127100 of user nagios.
Feb 2 09:28:01 prdmon1 systemd: Starting Session 127100 of user nagios.
Feb 2 09:28:01 prdmon1 systemd: Started Session 127104 of user nagios.
Feb 2 09:28:01 prdmon1 systemd: Starting Session 127104 of user nagios.
Feb 2 09:28:01 prdmon1 systemd: Started Session 127099 of user nagios.
Feb 2 09:28:01 prdmon1 systemd: Starting Session 127099 of user nagios.
Feb 2 09:28:01 prdmon1 systemd: Started Session 127102 of user nagios.
Feb 2 09:28:01 prdmon1 systemd: Starting Session 127102 of user nagios.
Feb 2 09:28:01 prdmon1 systemd: Started Session 127103 of user nagios.
Feb 2 09:28:01 prdmon1 systemd: Starting Session 127103 of user nagios.
Feb 2 09:28:01 prdmon1 systemd: Created slice user-992.slice.
Feb 2 09:28:01 prdmon1 systemd: Starting user-992.slice.
Feb 2 09:28:01 prdmon1 systemd: Started Session 127098 of user pcp.
Feb 2 09:28:01 prdmon1 systemd: Starting Session 127098 of user pcp.
Feb 2 09:28:01 prdmon1 systemd: Started Session 127105 of user nagios.
Feb 2 09:28:01 prdmon1 systemd: Starting Session 127105 of user nagios.
Feb 2 09:28:01 prdmon1 systemd: Started Session 127106 of user nagios.
Feb 2 09:28:01 prdmon1 systemd: Starting Session 127106 of user nagios.
Feb 2 09:28:01 prdmon1 systemd: Started Session 127107 of user nagios.
Feb 2 09:28:01 prdmon1 systemd: Starting Session 127107 of user nagios.
Feb 2 09:28:01 prdmon1 systemd: Removed slice user-992.slice.
Feb 2 09:28:01 prdmon1 systemd: Stopping user-992.slice.
Feb 2 09:28:15 prdmon1 ndo2db: Trimming timedevents.
Feb 2 09:28:15 prdmon1 ndo2db: Trimming systemcommands.
Feb 2 09:28:15 prdmon1 ndo2db: Trimming servicechecks.
Feb 2 09:28:15 prdmon1 ndo2db: Trimming hostchecks.
Feb 2 09:28:15 prdmon1 ndo2db: Trimming eventhandlers.
Feb 2 09:28:31 prdmon1 nagios: SERVICE ALERT: AGAPPS05;BigFix;OK;SOFT;2;1 process matching BESClient (> 0)
Feb 2 09:28:44 prdmon1 nagios: SERVICE ALERT: PRDUCM1B;McAfee AV;UNKNOWN;SOFT;1;ERROR: Alarm signal (Nagios time-out)
Feb 2 09:28:45 prdmon1 nagios: SERVICE ALERT: UATADS1A;BigFix;UNKNOWN;SOFT;1;ERROR: Alarm signal (Nagios time-out)
[root@prdmon1 libexec]#
At this point, short of the Nagios server sending a malformed packet, I would agree this looks like a network issue: midway through the UDP conversation, the Nagios server sends a packet that is never received at the remote host. The conversation dies at that point regardless of the timeout set.
You do have multiple kernel message queues. That can cause strange issues (though it should not have interfered with command-line testing), so you should still fix it.
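If it helps to see whether the time-outs in the /var/log/messages output are spread across many hosts or concentrated on a few, a quick tally can be useful. This is only a minimal sketch: `tally_timeouts` is a name I made up, and it assumes the `SERVICE ALERT: host;service;...` line format shown in the log above.

```shell
# Hypothetical helper: count "Nagios time-out" alerts per host;service.
# Reads syslog lines on stdin, prints "<count> <host>;<service>", highest first.
tally_timeouts() {
  grep 'SERVICE ALERT:' \
    | grep 'Nagios time-out' \
    | sed 's/^.*SERVICE ALERT: //' \
    | awk -F';' '{ count[$1 ";" $2]++ } END { for (k in count) print count[k], k }' \
    | sort -rn
}
```

You would run it as `tally_timeouts < /var/log/messages`. If the time-outs turn up on many unrelated hosts, that fits with the network (or the XI server itself) being the common factor rather than any single agent.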
Please run these commands to fix the message queues:
service nagios stop
ps aux | grep nagios.cfg | grep -v grep | awk '{print $2}' | xargs kill -9
service ndo2db stop
service mysqld restart
for i in $(ipcs -q | grep nagios | awk '{print $2}'); do ipcrm -q "$i"; done
service ndo2db start
service nagios start
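Once the services are back up, it is worth confirming that the stale queues are really gone. Here is a minimal sketch, assuming the usual Linux `ipcs -q` layout where the owner is the third column (the same layout the cleanup loop above relies on); `count_nagios_queues` is a hypothetical helper name.

```shell
# Hypothetical helper: count message queues owned by the nagios user.
# Reads `ipcs -q` output on stdin and prints the number of matching rows.
count_nagios_queues() {
  awk '$3 == "nagios" { n++ } END { print n + 0 }'
}
```

Run it as `ipcs -q | count_nagios_queues`. My assumption is that a healthy setup shows a single queue for nagios after the restart; if the count climbs again, ndo2db is probably being started more than once.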
Try running a continuous ping from the XI server to the remote host and see if any packets are dropped. Let it run for a minute or so.
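A rough way to do this without watching the screen is to let ping send a fixed number of packets and pull the loss figure out of its summary line. This is just a sketch: `ping_loss` and `check_ping` are names I made up, the host is whatever remote host is timing out, and the parsing assumes the Linux iputils summary format (`... N% packet loss ...`).

```shell
# Hypothetical helper: extract the packet-loss percentage from ping's
# summary line (read on stdin), e.g. "5" for "5% packet loss".
ping_loss() {
  sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Hypothetical wrapper: send one packet per second ($2 packets, default 60,
# so roughly a minute) to host $1 and print only the loss percentage.
check_ping() {
  ping -c "${2:-60}" "$1" | ping_loss
}
```

For example, `check_ping <remote host>` prints `0` when nothing was dropped. Keep in mind that a clean ping only tells you ICMP is getting through.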
Just throwing this out there: you could use iperf or netcat to test UDP, which may be helpful. I've seen setups where TCP/UDP filtering is in place while ICMP is not filtered at all, so a clean ping does not rule out a UDP problem.
Netcat would also let you probe the SNMP port specifically, which could help with the digging.
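As a starting point, something like the sketch below could be used for the netcat probe. Assumptions on my part: SNMP is on the standard UDP port 161, and your netcat accepts `-u` (UDP), `-z` (no data, just probe), `-v` (verbose) and `-w` (timeout). Flags differ between netcat variants, so check `man nc` on the XI server first. `udp_probe` and `DRY_RUN` are hypothetical names.

```shell
# Hypothetical helper: probe a UDP port with netcat.
# $1 = host, $2 = port (defaults to 161, the standard SNMP port).
# With DRY_RUN=1 the command is printed instead of executed.
udp_probe() {
  host=$1
  port=${2:-161}
  set -- nc -vzu -w 3 "$host" "$port"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}
```

For example, `DRY_RUN=1 udp_probe <remote host>` shows the exact command before you run it for real. Since UDP is connectionless, netcat can usually only report a failure when an ICMP "port unreachable" comes back, so treat a "succeeded" result as a weak signal and compare it with a packet capture on the remote side.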