Nagios hanging during start-up/reboot
Posted: Fri May 24, 2013 9:33 am
by lce411
I have been having some issues with Nagios recently. Initially, I would notice an unusual number of Critical errors in the web GUI. A restart of the service (sometimes a full reboot) cleared up the issue, but I have not been able to find a cause. This morning, basically every check had gone Critical (over 200). A server reboot cleared things up and now everything is green, although a few checks are in a flapping state. Does anyone have any suggestions on what could be causing this instability? It's running on RHEL 5.9, as a VM in VMware. Should more resources be allocated to it? It currently has 4GB of memory, 1 CPU and is provisioned with 20GB of HDD space.
**Update - Since my first posting, I now have a growing number of Critical errors, which is abnormal for my environment (1 or 2 is more common). It seems like something is getting hung or not communicating with the Nagios server. They are all socket timeout errors, so maybe it's NRPE? The only thing that I can think of that has changed is that I removed a java RPM that I thought was unnecessary. Does NRPE need java to work?
Re: Nagios hanging during start-up/reboot
Posted: Fri May 24, 2013 10:56 am
by abrist
lce411 wrote: Does NRPE need java to work?
Nope.
What version of Nagios are you running?
Have you changed your Nagios server's IP recently?
How many hosts and services are you monitoring?
When it starts to hang, get a handful of logs:
Code:
tail -25 /usr/local/nagios/var/nagios.log
ps -aef
df -h
df -i
tail -25 /var/log/messages
Re: Nagios hanging during start-up/reboot
Posted: Fri May 24, 2013 11:56 am
by lce411
abrist wrote:lce411 wrote: Does NRPE need java to work?
Nope.
What version of Nagios are you running?
Have you changed your Nagios server's IP recently?
How many hosts and services are you monitoring?
When it starts to hang, get a handful of logs:
Code:
tail -25 /usr/local/nagios/var/nagios.log
ps -aef
df -h
df -i
tail -25 /var/log/messages
Version is Nagios Core 3.2.1
There have been no changes to the nagios server configuration
We are monitoring 17 hosts and about 11 or 12 services per host
Results of the tail command:
Code:
[1369413779] Warning: The check of service 'Users Logged In' on host 'cde-rhela.deva' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service...
[1369413836] SERVICE ALERT: cde-rhela.deva;Users Logged In;OK;SOFT;2;USERS OK - 0 users currently logged in
[1369413894] Warning: The check of service 'SSH' on host 'cde-sftp.chscde-ei.ustranscom.mil' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service...
[1369413894] Warning: The check of service 'Log Partition' on host 'cde-syslog1.chscde-ei.ustranscom.mil' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service...
[1369413922] SERVICE ALERT: cde-syslog1.chscde-ei.ustranscom.mil;Log Partition;CRITICAL;SOFT;1;(Service Check Timed Out)
[1369413922] SERVICE ALERT: cde-sftp.chscde-ei.ustranscom.mil;SSH;CRITICAL;SOFT;1;(Service Check Timed Out)
[1369413951] Warning: The check of service 'SSH' on host 'cde-dns1.chscde-ei.ustranscom.mil' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service...
[1369413951] Warning: The check of service 'Boot Partition' on host 'cde-dns2.chscde-ei.ustranscom.mil' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service...
[1369413975] SERVICE ALERT: cde-dns2.chscde-ei.ustranscom.mil;Boot Partition;CRITICAL;SOFT;1;(Service Check Timed Out)
[1369414004] SERVICE ALERT: cde-sftp.chscde-ei.ustranscom.mil;Users Logged In;CRITICAL;SOFT;2;CHECK_NRPE: Socket timeout after 10 seconds.
[1369414034] Warning: The check of service 'Users Logged In' on host 'cde-squid.chscde-ei.ustranscom.mil' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service...
[1369414034] Warning: The check of service 'Check NTPd Proc' on host 'cde-syslog2.chscde-ei.ustranscom.mil' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service...
[1369414034] SERVICE ALERT: cde-squid.chscde-ei.ustranscom.mil;Users Logged In;CRITICAL;SOFT;1;CHECK_NRPE: Socket timeout after 10 seconds.
[1369414065] SERVICE ALERT: cde-syslog2.chscde-ei.ustranscom.mil;Check NTPd Proc;CRITICAL;SOFT;2;(Service Check Timed Out)
[1369414100] Warning: The check of host 'cde-vc.chscde.ustranscom.mil' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the host...
[1369414100] Warning: The check of service 'SSH' on host 'cde-jenkinsa.deva' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service...
[1369414100] Warning: The check of service 'Usr Partition' on host 'cde-ldap.chscde-ei.ustranscom.mil' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service...
[1369414100] Warning: The check of service 'Swap Usage' on host 'cde-ntp.chscde-ei.ustranscom.mil' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service...
[1369414100] SERVICE ALERT: cde-ntp.chscde-ei.ustranscom.mil;Swap Usage;CRITICAL;SOFT;1;(Service Check Timed Out)
[1369414100] SERVICE NOTIFICATION: jmcdonald;cde-ldap.chscde-ei.ustranscom.mil;Usr Partition;CRITICAL;notify-service-by-email;(Service Check Timed Out)
[1369414131] Warning: Contact 'jmcdonald' service notification command '/usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: PROBLEM\n\nService: Usr Partition\nHost: cde-ldap.chscde-ei.ustranscom.mil\nAddress: cde-ldap.chscde-ei.ustranscom.mil\nState: CRITICAL\n\nDate/Time: Fri May 24 12:48:20 EDT 2013\n\nAdditional Info:\n\n(Service Check Timed Out)" | /bin/mail -s "** PROBLEM Service Alert: cde-ldap.chscde-ei.ustranscom.mil/Usr Partition is CRITICAL **" [email protected] -- -r [email protected]' timed out after 30 seconds
[1369414131] SERVICE NOTIFICATION: jzimmer;cde-ldap.chscde-ei.ustranscom.mil;Usr Partition;CRITICAL;notify-service-by-email;(Service Check Timed Out)
[1369414162] Warning: Contact 'jzimmer' service notification command '/usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: PROBLEM\n\nService: Usr Partition\nHost: cde-ldap.chscde-ei.ustranscom.mil\nAddress: cde-ldap.chscde-ei.ustranscom.mil\nState: CRITICAL\n\nDate/Time: Fri May 24 12:48:51 EDT 2013\n\nAdditional Info:\n\n(Service Check Timed Out)" | /bin/mail -s "** PROBLEM Service Alert: cde-ldap.chscde-ei.ustranscom.mil/Usr Partition is CRITICAL **" [email protected] -- -r [email protected]' timed out after 30 seconds
[1369414162] SERVICE NOTIFICATION: mcaldwell;cde-ldap.chscde-ei.ustranscom.mil;Usr Partition;CRITICAL;notify-service-by-email;(Service Check Timed Out)
[1369414193] Warning: Contact 'mcaldwell' service notification command '/usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: PROBLEM\n\nService: Usr Partition\nHost: cde-ldap.chscde-ei.ustranscom.mil\nAddress: cde-ldap.chscde-ei.ustranscom.mil\nState: CRITICAL\n\nDate/Time: Fri May 24 12:49:22 EDT 2013\n\nAdditional Info:\n\n(Service Check Timed Out)" | /bin/mail -s "** PROBLEM Service Alert: cde-ldap.chscde-ei.ustranscom.mil/Usr Partition is CRITICAL **" [email protected] -- -r [email protected]' timed out after 30 seconds
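The excerpt above mixes three kinds of entries: orphaned-check warnings, service checks that Nagios itself timed out, and NRPE socket timeouts. A rough way to tally them is sketched below. The function name `summarize_nagios_log` is made up for illustration, and the patterns are taken from the message text shown in this log rather than from any Nagios documentation, so treat the counts as a quick indicator only.

```shell
# Tally the three failure patterns seen in the log excerpt.
# Reads Nagios log lines on stdin and prints one summary line.
summarize_nagios_log() {
  awk '
    /looks like it was orphaned/                           { orphaned++ }
    /SERVICE ALERT:/ && /Service Check Timed Out/          { timed_out++ }
    /SERVICE ALERT:/ && /CHECK_NRPE: Socket timeout/       { nrpe++ }
    END {
      # Unset awk counters print as 0 with %d.
      printf "orphaned=%d check_timeouts=%d nrpe_socket_timeouts=%d\n",
             orphaned, timed_out, nrpe
    }
  '
}

# Example (log path as used earlier in this thread):
#   tail -n 500 /usr/local/nagios/var/nagios.log | summarize_nagios_log
```

Only `SERVICE ALERT` lines are counted for the two timeout patterns, so the notification entries that repeat "(Service Check Timed Out)" are not counted twice.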
ps -aef:
Code:
[root@cde-nagios ~]# ps -aef
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 09:52 ? 00:00:00 init [3]
root 2 1 0 09:52 ? 00:00:00 [migration/0]
root 3 1 0 09:52 ? 00:00:00 [ksoftirqd/0]
root 4 1 0 09:52 ? 00:00:07 [events/0]
root 5 1 0 09:52 ? 00:00:00 [khelper]
root 46 1 0 09:52 ? 00:00:00 [kthread]
root 50 46 0 09:52 ? 00:00:00 [kblockd/0]
root 51 46 0 09:52 ? 00:00:00 [cqueue/0]
root 54 46 0 09:52 ? 00:00:00 [khubd]
root 56 46 0 09:52 ? 00:00:00 [kseriod]
root 203 46 0 09:52 ? 00:00:00 [khungtaskd]
root 204 46 0 09:52 ? 00:00:00 [pdflush]
root 205 46 0 09:52 ? 00:00:02 [pdflush]
root 206 46 0 09:52 ? 00:00:00 [kswapd0]
root 207 46 0 09:52 ? 00:00:00 [aio/0]
root 411 46 0 09:52 ? 00:00:00 [kpsmoused]
root 441 46 0 09:52 ? 00:00:00 [mpt_poll_0]
root 442 46 0 09:52 ? 00:00:00 [mpt/0]
root 443 46 0 09:52 ? 00:00:00 [scsi_eh_0]
root 446 46 0 09:52 ? 00:00:00 [ata/0]
root 447 46 0 09:52 ? 00:00:00 [ata_aux]
root 454 46 0 09:52 ? 00:00:00 [kstriped]
root 463 46 0 09:52 ? 00:00:03 [kjournald]
root 488 46 0 09:52 ? 00:00:00 [kauditd]
root 521 1 0 09:52 ? 00:00:00 /sbin/udevd -d
root 1518 46 0 09:52 ? 00:00:00 [kmpathd/0]
root 1519 46 0 09:52 ? 00:00:00 [kmpath_handlerd]
root 1541 46 0 09:52 ? 00:00:07 [kjournald]
root 1543 46 0 09:52 ? 00:00:00 [kjournald]
root 1605 1 0 09:52 ? 00:00:00 /bin/bash /etc/rc.d/rc 3
root 1814 1 0 09:52 ? 00:00:08 /usr/sbin/vmtoolsd
root 1861 46 0 09:52 ? 00:00:00 [iscsi_eh]
root 1895 46 0 09:52 ? 00:00:00 [cnic_wq]
root 1898 46 0 09:52 ? 00:00:00 [bnx2i_thread/0]
root 1911 46 0 09:52 ? 00:00:00 [ib_addr]
root 1918 46 0 09:52 ? 00:00:00 [ib_mcast]
root 1919 46 0 09:52 ? 00:00:00 [ib_inform]
root 1920 46 0 09:52 ? 00:00:00 [local_sa]
root 1923 46 0 09:52 ? 00:00:00 [iw_cm_wq]
root 1927 46 0 09:52 ? 00:00:00 [ib_cm/0]
root 1929 46 0 09:52 ? 00:00:00 [rdma_cm]
root 1945 1 0 09:52 ? 00:00:00 iscsiuio
root 1950 1 0 09:52 ? 00:00:00 iscsid
root 1951 1 0 09:52 ? 00:00:00 iscsid
root 2208 1 0 09:52 ? 00:00:09 auditd
root 2210 2208 0 09:52 ? 00:00:04 /sbin/audispd
rpc 2259 1 0 09:52 ? 00:00:00 portmap
root 2293 46 0 09:52 ? 00:00:00 [rpciod/0]
rpcuser 2299 1 0 09:52 ? 00:00:00 rpc.statd
root 2326 1 0 09:52 ? 00:00:00 rpc.idmapd
dbus 2356 1 0 09:52 ? 00:00:00 dbus-daemon --system
68 2394 1 0 09:52 ? 00:00:02 hald
root 2395 2394 0 09:52 ? 00:00:00 hald-runner
68 2403 2395 0 09:53 ? 00:00:00 hald-addon-keyboard: listening on /dev/input/event0
root 2414 2395 0 09:53 ? 00:00:04 hald-addon-storage: polling /dev/hdc
root 2455 1 0 09:53 ? 00:00:00 supervising syslog-ng
root 2456 2455 0 09:53 ? 00:00:29 /opt/syslog-ng/sbin/syslog-ng --no-caps
root 2476 1 0 09:53 ? 00:00:00 /usr/sbin/sshd
ntp 2495 1 0 09:53 ? 00:00:00 ntpd -u ntp:ntp -p /var/run/ntpd.pid -g
nagios 2508 1 0 09:53 ? 00:00:00 nrpe -c /etc/nagios/nrpe.cfg -d
root 2530 1 0 09:53 ? 00:00:01 sendmail: rejecting connections on daemon MTA: load average
smmsp 2538 1 0 09:53 ? 00:00:00 sendmail: Queue runner@01:00:00 for /var/spool/clientmqueue
root 2552 1 0 09:53 ? 00:00:00 gpm -m /dev/input/mice -t exps2
root 2566 1 0 09:53 ? 00:00:00 /usr/sbin/httpd
root 2579 1 0 09:53 ? 00:00:00 crond
xfs 2600 1 0 09:53 ? 00:00:00 xfs -droppriv -daemon
apache 2604 2566 0 09:53 ? 00:00:04 /usr/sbin/httpd
root 2607 1 0 09:53 ? 00:00:00 /opt/McAfee/cma/bin/cma
root 2616 2607 0 09:53 ? 00:00:00 /opt/McAfee/cma/bin/cma
root 2617 2616 0 09:53 ? 00:00:01 /opt/McAfee/cma/bin/cma
root 2618 2616 0 09:53 ? 00:00:02 /opt/McAfee/cma/bin/cma
root 2619 2616 0 09:53 ? 00:00:06 /opt/McAfee/cma/bin/cma
root 2620 2616 0 09:53 ? 00:00:00 /opt/McAfee/cma/bin/cma
root 2621 2616 0 09:53 ? 00:00:01 /opt/McAfee/cma/bin/cma
root 2622 2616 0 09:53 ? 00:00:00 /opt/McAfee/cma/bin/cma
root 2623 2616 0 09:53 ? 00:00:00 /opt/McAfee/cma/bin/cma
root 2624 2616 0 09:53 ? 00:00:00 /opt/McAfee/cma/bin/cma
root 2625 2616 0 09:53 ? 00:00:00 /opt/McAfee/cma/bin/cma
root 2626 2616 0 09:53 ? 00:00:00 /opt/McAfee/cma/bin/cma
root 2627 2616 0 09:53 ? 00:00:10 /opt/McAfee/cma/bin/cma
root 2628 2616 0 09:53 ? 00:00:00 /opt/McAfee/cma/bin/cma
root 2629 2616 0 09:53 ? 00:00:00 /opt/McAfee/cma/bin/cma
root 2630 2616 0 09:53 ? 00:00:00 /opt/McAfee/cma/bin/cma
root 2631 2616 0 09:53 ? 00:00:02 /opt/McAfee/cma/bin/cma
root 2679 2616 0 09:53 ? 00:00:00 /opt/McAfee/cma/bin/cma
root 2686 1 0 09:53 ? 00:00:01 /opt/NAI/LinuxShield/libexec/nailsd -c /var/opt/NAI/LinuxSh
root 2687 2686 0 09:53 ? 00:00:00 /opt/NAI/LinuxShield/libexec/nailslogd -p 6 -l 3 -s 4 -c /v
root 2688 2687 0 09:53 ? 00:00:07 /opt/NAI/LinuxShield/libexec/nailslogd -p 6 -l 3 -s 4 -c /v
root 2701 1 0 09:53 ? 00:00:02 /opt/NAI/LinuxShield/libexec/mon -p /var/opt/NAI/LinuxShiel
root 2714 1 0 09:53 ? 00:00:00 /opt/NAI/LinuxShield/apache/bin/nailswebd -d /opt/NAI/Linux
nails 2716 2714 0 09:53 ? 00:00:00 /opt/NAI/LinuxShield/apache/bin/nailswebd -d /opt/NAI/Linux
avahi 2728 1 0 09:53 ? 00:00:00 avahi-daemon: running [cde-nagios.local]
avahi 2729 2728 0 09:53 ? 00:00:00 avahi-daemon: chroot helper
root 2735 1605 0 09:53 ? 00:00:00 /bin/sh /etc/rc3.d/S99nagios start
nagios 2770 2735 0 09:53 ? 00:00:24 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagio
apache 5177 2566 0 10:00 ? 00:00:04 /usr/sbin/httpd
root 5733 2686 1 10:05 ? 00:02:03 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root 6170 2686 1 10:09 ? 00:01:46 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root 6474 2686 0 10:13 ? 00:01:27 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root 6781 2686 0 10:17 ? 00:01:17 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root 7098 2686 0 10:21 ? 00:01:11 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root 7377 2686 0 10:25 ? 00:01:05 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root 7751 2686 0 10:29 ? 00:01:00 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
apache 7788 2566 0 10:29 ? 00:00:02 /usr/sbin/httpd
apache 7859 2566 0 10:30 ? 00:00:03 /usr/sbin/httpd
root 8135 2686 0 10:33 ? 00:00:56 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root 8660 2686 0 10:37 ? 00:00:52 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
apache 8703 2566 0 10:37 ? 00:00:03 /usr/sbin/httpd
apache 8704 2566 0 10:37 ? 00:00:02 /usr/sbin/httpd
apache 8862 2566 0 10:39 ? 00:00:02 /usr/sbin/httpd
apache 8926 2566 0 10:40 ? 00:00:03 /usr/sbin/httpd
apache 8974 2566 0 10:40 ? 00:00:02 /usr/sbin/httpd
root 9022 2686 0 10:41 ? 00:00:49 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root 9284 2686 0 10:45 ? 00:00:45 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root 9572 2686 0 10:49 ? 00:00:42 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root 9944 2686 0 10:53 ? 00:00:39 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root 10250 2686 0 10:57 ? 00:00:37 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root 10463 2686 0 11:01 ? 00:00:34 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root 10689 2686 0 11:05 ? 00:00:32 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root 10818 2686 0 11:09 ? 00:00:30 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root 10967 2686 0 11:13 ? 00:00:28 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root 11086 2686 0 11:17 ? 00:00:26 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root 11207 2686 0 11:21 ? 00:00:24 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root 11320 2686 0 11:25 ? 00:00:22 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root 11406 2686 0 11:29 ? 00:00:21 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root 11543 2686 0 11:33 ? 00:00:19 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root 11584 2686 0 11:37 ? 00:00:18 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root 11647 2686 0 11:41 ? 00:00:17 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root 11697 2686 0 11:45 ? 00:00:15 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root 11797 2686 0 11:49 ? 00:00:14 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root 11879 2686 0 11:53 ? 00:00:13 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root 11919 2686 0 11:57 ? 00:00:12 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root 11968 2686 0 12:01 ? 00:00:11 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root 12002 2686 0 12:05 ? 00:00:10 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root 12032 2686 0 12:09 ? 00:00:09 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root 12076 2686 0 12:13 ? 00:00:08 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root 12107 2686 0 12:17 ? 00:00:07 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root 12137 2686 0 12:21 ? 00:00:06 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root 12193 2686 0 12:25 ? 00:00:05 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root 12230 2686 0 12:29 ? 00:00:04 /opt/NAI/LinuxShield/libexec/scanner -e /opt/NAI/LinuxShiel
root 12235 2476 0 12:29 ? 00:00:00 sshd: jmcdonald [priv]
**For clarification, the LinuxShield/scanner entries are from HBSS, which was recently installed in our environment for testing purposes.
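With a listing this long, a per-command count can make repeated processes easier to spot. The following is a minimal sketch, assuming a procps-style `ps` that accepts `-eo comm=`; the helper name `top_commands` is hypothetical.

```shell
# Reads one command name per line on stdin and prints "count name",
# most frequent first. Only the first word of each name is kept.
top_commands() {
  sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}

# Example: show the ten most common process names on this machine.
#   ps -eo comm= | top_commands | head -n 10
```

Note that `comm` is the short executable name, so long names may appear truncated.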
df -h
Code:
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 3.8G 2.6G 1.1G 73% /
/dev/sda3 15G 8.9G 5.0G 65% /var
/dev/sda2 487M 29M 433M 7% /home
tmpfs 2.0G 0 2.0G 0% /dev/shm
df -i
Code:
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/sda1 1025024 77415 947609 8% /
/dev/sda3 3961056 24215 3936841 1% /var
/dev/sda2 128520 81 128439 1% /home
tmpfs 505472 1 505471 1% /dev/shm
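For repeated checks, the `df` output can be filtered so that only filesystems above a chosen usage level are printed. This is a sketch under assumptions: the helper name `full_filesystems` and the threshold are illustrative, and it expects the single-line-per-filesystem layout that `df -P` produces.

```shell
# $1 = usage threshold in percent.
# Reads `df -P` (or `df -iP`) output on stdin and prints
# "mountpoint use%" for every filesystem at or above the threshold.
full_filesystems() {
  awk -v limit="$1" '
    NR > 1 {                      # skip the header line
      use = $5
      sub(/%/, "", use)           # "73%" -> "73"
      if (use + 0 >= limit) print $6, $5
    }
  '
}

# Examples:
#   df -P  | full_filesystems 90    # disk space
#   df -iP | full_filesystems 90    # inodes
```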
Re: Nagios hanging during start-up/reboot
Posted: Fri May 24, 2013 12:18 pm
by slansing
It looks like the issue was that your checks were timing out, or not returning. Did this server experience any network related issues? Are all of the hosts in one data center that could have had a lapse in connectivity?
Are you able to ping the Windows/Linux NRPE-monitored servers and telnet to them on port 5666? Can you take a screenshot of your Performance Info page from within the Core interface and upload it here?
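If telnet is not installed, the port test can be approximated from the shell. The sketch below assumes `bash` (for its `/dev/tcp` redirection) and the coreutils `timeout` command are available, which may not be the case on older releases; the function name `port_open` and the example host are placeholders.

```shell
# port_open HOST PORT [SECONDS]
# Returns 0 if a TCP connection to HOST:PORT succeeds within SECONDS
# (default 5), non-zero otherwise. Relies on bash's /dev/tcp feature.
port_open() {
  timeout "${3:-5}" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Example: test the NRPE port on a monitored host (placeholder name).
#   if port_open nrpe-client.example.com 5666; then
#     echo "port 5666 reachable"
#   else
#     echo "port 5666 NOT reachable"
#   fi
#
# An end-to-end NRPE test from the Nagios server would then be
# (plugin path is an assumption based on a source install):
#   /usr/local/nagios/libexec/check_nrpe -H nrpe-client.example.com
```

A successful TCP connection only shows that something is listening; running `check_nrpe` against the host is what confirms the agent answers.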
Re: Nagios hanging during start-up/reboot
Posted: Fri May 24, 2013 12:53 pm
by lce411
slansing wrote:It looks like the issue was that your checks were timing out, or not returning. Did this server experience any network related issues? Are all of the hosts in one data center that could have had a lapse in connectivity?
Are you able to ping the Windows/Linux NRPE-monitored servers and telnet to them on port 5666? Can you take a screenshot of your Performance Info page from within the Core interface and upload it here?
There were no network-related issues that I know about. All of our hosts reside in the same instance of VMware, which has all its hardware in the same server room. We are in an office a mile down the road and VPN into the system. Screenshots of the Performance Info page are attached.
Re: Nagios hanging during start-up/reboot
Posted: Fri May 24, 2013 1:28 pm
by slansing
Can you share the service configuration of one of the services that is timing out? Also, the host configuration for its host. Also please share the output of running the following on the Nagios server's command line:
Re: Nagios hanging during start-up/reboot
Posted: Fri May 24, 2013 1:37 pm
by lce411
slansing wrote:Can you share the service configuration of one of the services that is timing out? Also, the host configuration for its host. Also please share the output of running the following on the Nagios server's command line:
command[check_syslog]=/usr/lib64/nagios/plugins/check_procs -w 1: -c :2 -s RSZDT -C syslog-ng
define host{
        use             linux-server
        host_name       cde-dns2.chscde-ei.ustranscom.mil
        contact_groups  admins
        }
SELinux is disabled
Is that what you were looking for?
Re: Nagios hanging during start-up/reboot
Posted: Fri May 24, 2013 1:48 pm
by slansing
Do you have a service defined for this host? The command you shared is for a local check. To check the hosts you are getting timeouts for, you will need to install an NRPE agent on them and then check them from the Nagios server through a service (that is a very brief description of what you would need to do to check a remote host). If you do not have the object definitions set up correctly, you will get the timeouts you have been seeing.
http://nagios.sourceforge.net/docs/3_0/ ... tions.html
Re: Nagios hanging during start-up/reboot
Posted: Tue May 28, 2013 7:36 am
by lce411
slansing wrote:Do you have a service defined for this host? The command you shared is for a local check. To check the hosts you are getting timeouts for, you will need to install an NRPE agent on them and then check them from the Nagios server through a service (that is a very brief description of what you would need to do to check a remote host). If you do not have the object definitions set up correctly, you will get the timeouts you have been seeing.
http://nagios.sourceforge.net/docs/3_0/ ... tions.html
define service{
        use                     generic-service         ; Name of service template to use
        hostgroup_name          CDE-LS-Linux-Hosts, CDE-VIEW-Linux-Hosts, CDE-INF-Linux-Hosts, CDE-SS-Linux-Hosts, CDE-Auth-Linux-Hosts, CDE-Deva-Linux-Hosts
        service_description     Check Syslog-NG Proc
        check_command           check_nrpe!check_syslog
        }
We have NRPE installed on all of our clients, and our monitoring system has been working fine for a couple of years now. I just want to point that out so you don't think this is a new setup where we are still working out the kinks. Recently the Nagios server has been sluggish and it has only gotten worse. If the above service definition is still not what you are looking for, let me know and I will get you more info so you can continue to help me figure this out. I came in this morning and was unable to SSH into the Nagios server or get the web GUI to load. I bounced the server and everything was fine soon after it came back up; however, I already have 59 "time out" errors in the web GUI.
Re: Nagios hanging during start-up/reboot
Posted: Tue May 28, 2013 2:13 pm
by sreinhardt
Let's get some basic system specs and load settings, ideally while it's running sluggishly. This may also be easier to attach as a text file than to type in.
Code:
ps ax | wc -l
ulimit -a
cat /proc/loadavg
cat /proc/sys/kernel/threads-max
grep -i rlimit /usr/local/apache/conf/httpd.conf
uptime
free -m
df -h
df -i
tail /var/log/messages
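Since the output is easier to attach than to paste, the commands above can be wrapped so that everything lands in one file. This is a sketch only: the function names `run_section` and `collect_diag` and the output path are made up, and the Apache config path is simply the one given in the list above, which may differ on a given install.

```shell
# Print a header, then the command's output. A failing or missing
# command is noted in the output instead of aborting the collection.
run_section() {
  printf '== %s ==\n' "$*"
  "$@" 2>&1 || printf '(command failed or not available: %s)\n' "$*"
}

# collect_diag OUTFILE: run the diagnostics listed above into OUTFILE.
collect_diag() {
  {
    run_section date
    run_section sh -c 'ps ax | wc -l'
    run_section sh -c 'ulimit -a'
    run_section cat /proc/loadavg
    run_section cat /proc/sys/kernel/threads-max
    run_section grep -i rlimit /usr/local/apache/conf/httpd.conf
    run_section uptime
    run_section free -m
    run_section df -h
    run_section df -i
    run_section tail /var/log/messages
  } > "$1"
}

# Example (reading /var/log/messages usually needs root):
#   collect_diag "/tmp/nagios-diag-$(date +%Y%m%d-%H%M%S).txt"
```

Running it while the server is sluggish, and again after a restart, gives two files that can be compared side by side.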