Nagios hanging during start-up/reboot

Posted: Fri May 24, 2013 9:33 am
by lce411
I have been having some issues with Nagios recently. Initially, I would notice an unusual number of Critical errors in the web GUI. Restarting the service (sometimes a full reboot) cleared them up, but I have not been able to find a cause. This morning, basically every check had gone Critical (over 200). A server reboot cleared things up and now everything is Green, although a few checks are in a flapping state. Does anyone have any suggestions on what could be causing this instability? It's running on RHEL 5.9, as a VM in VMware. Should more resources be allocated to it? It currently has 4GB of memory, 1 CPU, and is provisioned with 20GB of HDD space.

**Update - Since my first posting, I now have a growing number of Critical errors, which is abnormal for my environment (1 or 2 is more common). It seems like something is getting hung or not communicating with the Nagios server. They are all socket timeout errors, so maybe it's NRPE? The only thing that I can think of that has changed is that I removed a java RPM that I thought was unnecessary. Does NRPE need java to work?

Re: Nagios hanging during start-up/reboot

Posted: Fri May 24, 2013 10:56 am
by abrist
lce411 wrote: Does NRPE need java to work?
Nope.
What version of Nagios are you running?
Have you changed your Nagios server's IP recently?
How many hosts and services are you monitoring?
When it starts to hang, get a handful of logs:

Code:

tail -25 /usr/local/nagios/var/nagios.log
ps -aef
df -h
df -i
tail -25 /var/log/messages
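
Since the hang may make it awkward to run these one at a time, one option is to wrap them in a small helper that writes everything to a single file. This is only a sketch: the function name collect_diagnostics and the /tmp output path are made up for illustration, and the log paths are the ones given above.

```shell
#!/bin/sh
# Run each command passed as an argument and write its labelled output
# (stdout and stderr) to one report file.
collect_diagnostics() {
  out=$1
  shift
  : > "$out"                                   # start with an empty report
  for cmd in "$@"; do
    printf '===== %s =====\n' "$cmd" >> "$out"
    sh -c "$cmd" >> "$out" 2>&1                # keep error messages too
    printf '\n' >> "$out"
  done
}

# Example call (hypothetical output path):
# collect_diagnostics "/tmp/nagios-diag-$(date +%Y%m%d-%H%M%S).txt" \
#   'tail -25 /usr/local/nagios/var/nagios.log' 'ps -aef' 'df -h' 'df -i' \
#   'tail -25 /var/log/messages'
```

The resulting file can then be attached to a post instead of pasting each output separately.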

Re: Nagios hanging during start-up/reboot

Posted: Fri May 24, 2013 11:56 am
by lce411
Version is Nagios Core 3.2.1
There have been no changes to the nagios server configuration
We are monitoring 17 hosts and about 11 or 12 services per host

Results of the tail command:

Code:

[1369413779] Warning: The check of service 'Users Logged In' on host 'cde-rhela.deva' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the service...
[1369413836] SERVICE ALERT: cde-rhela.deva;Users Logged In;OK;SOFT;2;USERS OK - 0 users currently logged in
[1369413894] Warning: The check of service 'SSH' on host 'cde-sftp.chscde-ei.ustranscom.mil' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the service...
[1369413894] Warning: The check of service 'Log Partition' on host 'cde-syslog1.chscde-ei.ustranscom.mil' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the service...
[1369413922] SERVICE ALERT: cde-syslog1.chscde-ei.ustranscom.mil;Log Partition;CRITICAL;SOFT;1;(Service Check Timed Out)
[1369413922] SERVICE ALERT: cde-sftp.chscde-ei.ustranscom.mil;SSH;CRITICAL;SOFT;1;(Service Check Timed Out)
[1369413951] Warning: The check of service 'SSH' on host 'cde-dns1.chscde-ei.ustranscom.mil' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the service...
[1369413951] Warning: The check of service 'Boot Partition' on host 'cde-dns2.chscde-ei.ustranscom.mil' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the service...
[1369413975] SERVICE ALERT: cde-dns2.chscde-ei.ustranscom.mil;Boot Partition;CRITICAL;SOFT;1;(Service Check Timed Out)
[1369414004] SERVICE ALERT: cde-sftp.chscde-ei.ustranscom.mil;Users Logged In;CRITICAL;SOFT;2;CHECK_NRPE: Socket timeout after 10 seconds.
[1369414034] Warning: The check of service 'Users Logged In' on host 'cde-squid.chscde-ei.ustranscom.mil' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the service...
[1369414034] Warning: The check of service 'Check NTPd Proc' on host 'cde-syslog2.chscde-ei.ustranscom.mil' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the service...
[1369414034] SERVICE ALERT: cde-squid.chscde-ei.ustranscom.mil;Users Logged In;CRITICAL;SOFT;1;CHECK_NRPE: Socket timeout after 10 seconds.
[1369414065] SERVICE ALERT: cde-syslog2.chscde-ei.ustranscom.mil;Check NTPd Proc;CRITICAL;SOFT;2;(Service Check Timed Out)
[1369414100] Warning: The check of host 'cde-vc.chscde.ustranscom.mil' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the host...
[1369414100] Warning: The check of service 'SSH' on host 'cde-jenkinsa.deva' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the service...
[1369414100] Warning: The check of service 'Usr Partition' on host 'cde-ldap.chscde-ei.ustranscom.mil' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the service...
[1369414100] Warning: The check of service 'Swap Usage' on host 'cde-ntp.chscde-ei.ustranscom.mil' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the service...
[1369414100] SERVICE ALERT: cde-ntp.chscde-ei.ustranscom.mil;Swap Usage;CRITICAL;SOFT;1;(Service Check Timed Out)
[1369414100] SERVICE NOTIFICATION: jmcdonald;cde-ldap.chscde-ei.ustranscom.mil;Usr Partition;CRITICAL;notify-service-by-email;(Service Check Timed Out)
[1369414131] Warning: Contact 'jmcdonald' service notification command '/usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: PROBLEM\n\nService: Usr Partition\nHost: cde-ldap.chscde-ei.ustranscom.mil\nAddress: cde-ldap.chscde-ei.ustranscom.mil\nState: CRITICAL\n\nDate/Time: Fri May 24 12:48:20 EDT 2013\n\nAdditional Info:\n\n(Service Check Timed Out)" | /bin/mail -s "** PROBLEM Service Alert: cde-ldap.chscde-ei.ustranscom.mil/Usr Partition is CRITICAL **" [email protected] -- -r [email protected]' timed out after 30 seconds
[1369414131] SERVICE NOTIFICATION: jzimmer;cde-ldap.chscde-ei.ustranscom.mil;Usr Partition;CRITICAL;notify-service-by-email;(Service Check Timed Out)
[1369414162] Warning: Contact 'jzimmer' service notification command '/usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: PROBLEM\n\nService: Usr Partition\nHost: cde-ldap.chscde-ei.ustranscom.mil\nAddress: cde-ldap.chscde-ei.ustranscom.mil\nState: CRITICAL\n\nDate/Time: Fri May 24 12:48:51 EDT 2013\n\nAdditional Info:\n\n(Service Check Timed Out)" | /bin/mail -s "** PROBLEM Service Alert: cde-ldap.chscde-ei.ustranscom.mil/Usr Partition is CRITICAL **" [email protected] -- -r [email protected]' timed out after 30 seconds
[1369414162] SERVICE NOTIFICATION: mcaldwell;cde-ldap.chscde-ei.ustranscom.mil;Usr Partition;CRITICAL;notify-service-by-email;(Service Check Timed Out)
[1369414193] Warning: Contact 'mcaldwell' service notification command '/usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: PROBLEM\n\nService: Usr Partition\nHost: cde-ldap.chscde-ei.ustranscom.mil\nAddress: cde-ldap.chscde-ei.ustranscom.mil\nState: CRITICAL\n\nDate/Time: Fri May 24 12:49:22 EDT 2013\n\nAdditional Info:\n\n(Service Check Timed Out)" | /bin/mail -s "** PROBLEM Service Alert: cde-ldap.chscde-ei.ustranscom.mil/Usr Partition is CRITICAL **" [email protected] -- -r [email protected]' timed out after 30 seconds
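
To see whether the timeouts in an excerpt like the one above cluster on particular hosts, the SERVICE ALERT lines can be tallied per host. This is a rough sketch that assumes the semicolon-separated log format shown above; the function name count_timeouts_by_host is made up for illustration.

```shell
#!/bin/sh
# Count SERVICE ALERT lines that report a timeout, grouped by host name.
# Reads nagios.log lines from the named files, or from stdin if none given.
count_timeouts_by_host() {
  awk -F';' '
    /SERVICE ALERT:/ && /(Service Check Timed Out|Socket timeout)/ {
      # Field 1 looks like "[epoch] SERVICE ALERT: hostname"; keep the host.
      sub(/^.*SERVICE ALERT: /, "", $1)
      count[$1]++
    }
    END { for (h in count) print count[h], h }
  ' "$@" | sort -rn
}

# Example: count_timeouts_by_host /usr/local/nagios/var/nagios.log
```

If the counts are spread evenly across every host, that would point more toward the Nagios server or the network path than toward any single client, though the tally alone cannot confirm either.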
ps -aef:

Code:

[root@cde-nagios ~]# ps -aef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 09:52 ?        00:00:00 init [3]
root         2     1  0 09:52 ?        00:00:00 [migration/0]
root         3     1  0 09:52 ?        00:00:00 [ksoftirqd/0]
root         4     1  0 09:52 ?        00:00:07 [events/0]
root         5     1  0 09:52 ?        00:00:00 [khelper]
root        46     1  0 09:52 ?        00:00:00 [kthread]
root        50    46  0 09:52 ?        00:00:00 [kblockd/0]
root        51    46  0 09:52 ?        00:00:00 [cqueue/0]
root        54    46  0 09:52 ?        00:00:00 [khubd]
root        56    46  0 09:52 ?        00:00:00 [kseriod]
root       203    46  0 09:52 ?        00:00:00 [khungtaskd]
root       204    46  0 09:52 ?        00:00:00 [pdflush]
root       205    46  0 09:52 ?        00:00:02 [pdflush]
root       206    46  0 09:52 ?        00:00:00 [kswapd0]
root       207    46  0 09:52 ?        00:00:00 [aio/0]
root       411    46  0 09:52 ?        00:00:00 [kpsmoused]
root       441    46  0 09:52 ?        00:00:00 [mpt_poll_0]
root       442    46  0 09:52 ?        00:00:00 [mpt/0]
root       443    46  0 09:52 ?        00:00:00 [scsi_eh_0]
root       446    46  0 09:52 ?        00:00:00 [ata/0]
root       447    46  0 09:52 ?        00:00:00 [ata_aux]
root       454    46  0 09:52 ?        00:00:00 [kstriped]
root       463    46  0 09:52 ?        00:00:03 [kjournald]
root       488    46  0 09:52 ?        00:00:00 [kauditd]
root       521     1  0 09:52 ?        00:00:00 /sbin/udevd -d
root      1518    46  0 09:52 ?        00:00:00 [kmpathd/0]
root      1519    46  0 09:52 ?        00:00:00 [kmpath_handlerd]
root      1541    46  0 09:52 ?        00:00:07 [kjournald]
root      1543    46  0 09:52 ?        00:00:00 [kjournald]
root      1605     1  0 09:52 ?        00:00:00 /bin/bash /etc/rc.d/rc 3
root      1814     1  0 09:52 ?        00:00:08 /usr/sbin/vmtoolsd
root      1861    46  0 09:52 ?        00:00:00 [iscsi_eh]
root      1895    46  0 09:52 ?        00:00:00 [cnic_wq]
root      1898    46  0 09:52 ?        00:00:00 [bnx2i_thread/0]
root      1911    46  0 09:52 ?        00:00:00 [ib_addr]
root      1918    46  0 09:52 ?        00:00:00 [ib_mcast]
root      1919    46  0 09:52 ?        00:00:00 [ib_inform]
root      1920    46  0 09:52 ?        00:00:00 [local_sa]
root      1923    46  0 09:52 ?        00:00:00 [iw_cm_wq]
root      1927    46  0 09:52 ?        00:00:00 [ib_cm/0]
root      1929    46  0 09:52 ?        00:00:00 [rdma_cm]
root      1945     1  0 09:52 ?        00:00:00 iscsiuio
root      1950     1  0 09:52 ?        00:00:00 iscsid
root      1951     1  0 09:52 ?        00:00:00 iscsid
root      2208     1  0 09:52 ?        00:00:09 auditd
root      2210  2208  0 09:52 ?        00:00:04 /sbin/audispd
rpc       2259     1  0 09:52 ?        00:00:00 portmap
root      2293    46  0 09:52 ?        00:00:00 [rpciod/0]
rpcuser   2299     1  0 09:52 ?        00:00:00 rpc.statd
root      2326     1  0 09:52 ?        00:00:00 rpc.idmapd
dbus      2356     1  0 09:52 ?        00:00:00 dbus-daemon --system
68        2394     1  0 09:52 ?        00:00:02 hald
root      2395  2394  0 09:52 ?        00:00:00 hald-runner
68        2403  2395  0 09:53 ?        00:00:00 hald-addon-keyboard: listening on /dev/input/event0
root      2414  2395  0 09:53 ?        00:00:04 hald-addon-storage: polling /dev/hdc
root      2455     1  0 09:53 ?        00:00:00 supervising syslog-ng
root      2456  2455  0 09:53 ?        00:00:29 /opt/syslog-ng/sbin/syslog-ng --no-caps
root      2476     1  0 09:53 ?        00:00:00 /usr/sbin/sshd
ntp       2495     1  0 09:53 ?        00:00:00 ntpd -u ntp:ntp -p /var/run/ntpd.pid -g
nagios    2508     1  0 09:53 ?        00:00:00 nrpe -c /etc/nagios/nrpe.cfg -d
root      2530     1  0 09:53 ?        00:00:01 sendmail: rejecting connections on daemon MTA: load average
smmsp     2538     1  0 09:53 ?        00:00:00 sendmail: Queue runner@01:00:00 for /var/spool/clientmqueue
root      2552     1  0 09:53 ?        00:00:00 gpm -m /dev/input/mice -t exps2
root      2566     1  0 09:53 ?        00:00:00 /usr/sbin/httpd
root      2579     1  0 09:53 ?        00:00:00 crond
xfs       2600     1  0 09:53 ?        00:00:00 xfs -droppriv -daemon
apache    2604  2566  0 09:53 ?        00:00:04 /usr/sbin/httpd
root      2607     1  0 09:53 ?        00:00:00 /opt/McAfee/cma/bin/cma
root      2616  2607  0 09:53 ?        00:00:00 /opt/McAfee/cma/bin/cma
root      2617  2616  0 09:53 ?        00:00:01 /opt/McAfee/cma/bin/cma
root      2618  2616  0 09:53 ?        00:00:02 /opt/McAfee/cma/bin/cma
root      2619  2616  0 09:53 ?        00:00:06 /opt/McAfee/cma/bin/cma
root      2620  2616  0 09:53 ?        00:00:00 /opt/McAfee/cma/bin/cma
root      2621  2616  0 09:53 ?        00:00:01 /opt/McAfee/cma/bin/cma
root      2622  2616  0 09:53 ?        00:00:00 /opt/McAfee/cma/bin/cma
root      2623  2616  0 09:53 ?        00:00:00 /opt/McAfee/cma/bin/cma
root      2624  2616  0 09:53 ?        00:00:00 /opt/McAfee/cma/bin/cma
root      2625  2616  0 09:53 ?        00:00:00 /opt/McAfee/cma/bin/cma
root      2626  2616  0 09:53 ?        00:00:00 /opt/McAfee/cma/bin/cma
root      2627  2616  0 09:53 ?        00:00:10 /opt/McAfee/cma/bin/cma
root      2628  2616  0 09:53 ?        00:00:00 /opt/McAfee/cma/bin/cma
root      2629  2616  0 09:53 ?        00:00:00 /opt/McAfee/cma/bin/cma
root      2630  2616  0 09:53 ?        00:00:00 /opt/McAfee/cma/bin/cma
root      2631  2616  0 09:53 ?        00:00:02 /opt/McAfee/cma/bin/cma
root      2679  2616  0 09:53 ?        00:00:00 /opt/McAfee/cma/bin/cma
root      2686     1  0 09:53 ?        00:00:01 /opt/NAI/LinuxShield/libexec/nailsd -c /var/opt/NAI/LinuxSh
root      2687  2686  0 09:53 ?        00:00:00 /opt/NAI/LinuxShield/libexec/nailslogd -p 6 -l 3 -s 4 -c /v
root      2688  2687  0 09:53 ?        00:00:07 /opt/NAI/LinuxShield/libexec/nailslogd -p 6 -l 3 -s 4 -c /v
root      2701     1  0 09:53 ?        00:00:02 /opt/NAI/LinuxShield/libexec/mon -p /var/opt/NAI/LinuxShiel
root      2714     1  0 09:53 ?        00:00:00 /opt/NAI/LinuxShield/apache/bin/nailswebd -d /opt/NAI/Linux
nails     2716  2714  0 09:53 ?        00:00:00 /opt/NAI/LinuxShield/apache/bin/nailswebd -d /opt/NAI/Linux
avahi     2728     1  0 09:53 ?        00:00:00 avahi-daemon: running [cde-nagios.local]
avahi     2729  2728  0 09:53 ?        00:00:00 avahi-daemon: chroot helper
root      2735  1605  0 09:53 ?        00:00:00 /bin/sh /etc/rc3.d/S99nagios start
nagios    2770  2735  0 09:53 ?        00:00:24 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagio
apache    5177  2566  0 10:00 ?        00:00:04 /usr/sbin/httpd
root      5733  2686  1 10:05 ?        00:02:03 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root      6170  2686  1 10:09 ?        00:01:46 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root      6474  2686  0 10:13 ?        00:01:27 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root      6781  2686  0 10:17 ?        00:01:17 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root      7098  2686  0 10:21 ?        00:01:11 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root      7377  2686  0 10:25 ?        00:01:05 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root      7751  2686  0 10:29 ?        00:01:00 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
apache    7788  2566  0 10:29 ?        00:00:02 /usr/sbin/httpd
apache    7859  2566  0 10:30 ?        00:00:03 /usr/sbin/httpd
root      8135  2686  0 10:33 ?        00:00:56 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root      8660  2686  0 10:37 ?        00:00:52 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
apache    8703  2566  0 10:37 ?        00:00:03 /usr/sbin/httpd
apache    8704  2566  0 10:37 ?        00:00:02 /usr/sbin/httpd
apache    8862  2566  0 10:39 ?        00:00:02 /usr/sbin/httpd
apache    8926  2566  0 10:40 ?        00:00:03 /usr/sbin/httpd
apache    8974  2566  0 10:40 ?        00:00:02 /usr/sbin/httpd
root      9022  2686  0 10:41 ?        00:00:49 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root      9284  2686  0 10:45 ?        00:00:45 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root      9572  2686  0 10:49 ?        00:00:42 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root      9944  2686  0 10:53 ?        00:00:39 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root     10250  2686  0 10:57 ?        00:00:37 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root     10463  2686  0 11:01 ?        00:00:34 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root     10689  2686  0 11:05 ?        00:00:32 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root     10818  2686  0 11:09 ?        00:00:30 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root     10967  2686  0 11:13 ?        00:00:28 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root     11086  2686  0 11:17 ?        00:00:26 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root     11207  2686  0 11:21 ?        00:00:24 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root     11320  2686  0 11:25 ?        00:00:22 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root     11406  2686  0 11:29 ?        00:00:21 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root     11543  2686  0 11:33 ?        00:00:19 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root     11584  2686  0 11:37 ?        00:00:18 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root     11647  2686  0 11:41 ?        00:00:17 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root     11697  2686  0 11:45 ?        00:00:15 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root     11797  2686  0 11:49 ?        00:00:14 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root     11879  2686  0 11:53 ?        00:00:13 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root     11919  2686  0 11:57 ?        00:00:12 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root     11968  2686  0 12:01 ?        00:00:11 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root     12002  2686  0 12:05 ?        00:00:10 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root     12032  2686  0 12:09 ?        00:00:09 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root     12076  2686  0 12:13 ?        00:00:08 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root     12107  2686  0 12:17 ?        00:00:07 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root     12137  2686  0 12:21 ?        00:00:06 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root     12193  2686  0 12:25 ?        00:00:05 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root     12230  2686  0 12:29 ?        00:00:04 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root     12235  2476  0 12:29 ?        00:00:00 sshd: jmcdonald [priv]
**For clarification, the LinuxShield/scanner entries are from an HBSS install that was recently set up in our environment for testing purposes.
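
With a listing this long, it can help to collapse it into a count per command name. The helper below is a sketch only (count_by_command is a made-up name); it expects one command name per line on stdin, which `ps -eo comm=` produces.

```shell
#!/bin/sh
# Read one command name per line and print "count name", most frequent first.
count_by_command() {
  sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}

# Example: ps -eo comm= | count_by_command | head
```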


df -h

Code:

Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1             3.8G  2.6G  1.1G  73% /
/dev/sda3              15G  8.9G  5.0G  65% /var
/dev/sda2             487M   29M  433M   7% /home
tmpfs                 2.0G     0  2.0G   0% /dev/shm
df -i

Code:

Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/sda1            1025024   77415  947609    8% /
/dev/sda3            3961056   24215 3936841    1% /var
/dev/sda2             128520      81  128439    1% /home
tmpfs                 505472       1  505471    1% /dev/shm

Re: Nagios hanging during start-up/reboot

Posted: Fri May 24, 2013 12:18 pm
by slansing
It looks like the issue was that your checks were timing out or not returning. Did this server experience any network-related issues? Are all of the hosts in one data center that could have had a lapse in connectivity?

Are you able to telnet to or ping the Windows/Linux NRPE-monitored servers on port 5666? Can you take a screenshot of your Performance Info page from within the Core interface and upload it here?
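
If telnet is not installed, a TCP connection test can be sketched in bash alone. This is an illustration under assumptions: it relies on bash's /dev/tcp redirection, the function name port_open and the host names in the example are hypothetical, and the `timeout` utility may not exist on older systems, so the sketch falls back to an uncapped attempt.

```shell
#!/usr/bin/env bash
# Succeeds if a TCP connection to host:port can be opened.
# Uses bash's built-in /dev/tcp redirection, so no telnet or nc is needed.
port_open() {
  local host=$1 port=${2:-5666}   # 5666 is the NRPE port mentioned above
  if command -v timeout >/dev/null 2>&1; then
    # Cap the attempt at 5 seconds when the timeout utility is available.
    timeout 5 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
  else
    # Without timeout, a filtered port may block until the kernel gives up.
    (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null
  fi
}

# Example loop over hypothetical host names:
# for h in host-a.example host-b.example; do
#   if port_open "$h"; then echo "$h: 5666 reachable"; else echo "$h: 5666 NOT reachable"; fi
# done
```

A successful connection only shows that something is listening on the port; it does not prove that the NRPE daemon will answer a check from the Nagios server.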

Re: Nagios hanging during start-up/reboot

Posted: Fri May 24, 2013 12:53 pm
by lce411
There were no network-related issues that I know about. All of our hosts reside in the same instance of VMware, which has all its hardware in the same server room. We are in an office a mile down the road and VPN into the system. Screenshots of the Performance Info page are attached.

Re: Nagios hanging during start-up/reboot

Posted: Fri May 24, 2013 1:28 pm
by slansing
Can you share the service configuration of one of your services that is timing out? Also, the host configuration for its host. Also, please share the output of running the following on the Nagios server's command line:

Code:

sestatus

Re: Nagios hanging during start-up/reboot

Posted: Fri May 24, 2013 1:37 pm
by lce411
command[check_syslog]=/usr/lib64/nagios/plugins/check_procs -w 1: -c :2 -s RSZDT -C syslog-ng

define host{
    use             linux-server
    host_name       cde-dns2.chscde-ei.ustranscom.mil
    contact_groups  admins
}

SELinux is disabled

Is that what you were looking for?

Re: Nagios hanging during start-up/reboot

Posted: Fri May 24, 2013 1:48 pm
by slansing
Do you have a service defined for this host? The command you shared is for a local check. To check the hosts you are getting timeouts for, you will need to install an NRPE agent on them and then check them from the Nagios server through a service (that is a very brief description of what you would need to do to check a remote host). If you do not have the object definitions set up correctly, you will get the timeouts you have been seeing.

http://nagios.sourceforge.net/docs/3_0/ ... tions.html
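
As a rough illustration of the object definitions being described, a remote NRPE check usually involves three pieces: a command in the monitored host's nrpe.cfg, a check_nrpe command definition on the Nagios server, and a service that ties a host to it. The sketch below reuses the check_syslog command shared earlier in the thread. The host name example-host is hypothetical, and the -t 30 timeout is only an illustrative value, not a recommendation from this thread.

```
# On the monitored host, in nrpe.cfg (as shared earlier in the thread):
command[check_syslog]=/usr/lib64/nagios/plugins/check_procs -w 1: -c :2 -s RSZDT -C syslog-ng

# On the Nagios server: the command that calls the remote NRPE daemon.
# -t sets the socket timeout in seconds; 30 is an illustrative value.
define command{
    command_name    check_nrpe
    command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -t 30 -c $ARG1$
    }

# On the Nagios server: a service that runs check_syslog on one host.
define service{
    use                     generic-service
    host_name               example-host    ; hypothetical host name
    service_description     Check Syslog-NG Proc
    check_command           check_nrpe!check_syslog
    }
```

The name after the `!` in check_command is passed as $ARG1$ and has to match a command[...] entry in the remote nrpe.cfg.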

Re: Nagios hanging during start-up/reboot

Posted: Tue May 28, 2013 7:36 am
by lce411
define service{
    use                  generic-service ; Name of service template to use
    hostgroup_name       CDE-LS-Linux-Hosts, CDE-VIEW-Linux-Hosts, CDE-INF-Linux-Hosts, CDE-SS-Linux-Hosts, CDE-Auth-Linux-Hosts, CDE-Deva-Linux-Hosts
    service_description  Check Syslog-NG Proc
    check_command        check_nrpe!check_syslog
}

We have NRPE installed on all of our clients and our monitoring system has been working fine for a couple of years now. I just want to point that out, so you don't think this is a new setup where we are still working out the kinks. Recently the Nagios server has been sluggish and it has only gotten worse. If the above service definition is still not what you are looking for, let me know and I will get you some more info, so you can continue to help me try to figure this out. I came in this morning and was unable to SSH into the Nagios server or get the web GUI to load. I bounced the server and everything was fine soon after it came back up; however, I already have 59 "time out" errors in the web GUI.

Re: Nagios hanging during start-up/reboot

Posted: Tue May 28, 2013 2:13 pm
by sreinhardt
Let's get some basic system specs and load settings, ideally while it's running sluggishly. This may also be easier to attach as a text file than to type in.

Code:

ps ax | wc -l

ulimit -a

cat /proc/loadavg

cat /proc/sys/kernel/threads-max

grep -i rlimit /usr/local/apache/conf/httpd.conf

uptime

free -m

df -h

df -i

tail /var/log/messages
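
Because the numbers are most useful while the server is actually sluggish, one approach is to leave a small watcher running that records a snapshot only when the load is high. The sketch below is an assumption-laden example: load_exceeds is a made-up helper name, the threshold of 4 and the /tmp/sluggish.log path are arbitrary, and it reads the 1-minute load average from /proc/loadavg.

```shell
#!/bin/sh
# Return success (0) when the 1-minute load average in the given file
# (default /proc/loadavg) is greater than the threshold.
load_exceeds() {
  threshold=$1
  file=${2:-/proc/loadavg}
  awk -v t="$threshold" '
    NR == 1 { hit = ($1 > t) }   # first field is the 1-minute load average
    END     { exit !hit }        # exit status 0 only if the load was higher
  ' "$file"
}

# Example watcher (threshold and log path are hypothetical):
# while sleep 60; do
#   if load_exceeds 4; then
#     { date; cat /proc/loadavg; ps ax | wc -l; free -m; } >> /tmp/sluggish.log
#   fi
# done
```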