Performance degraded after Reoccurring Downtime setup
Posted: Tue Dec 15, 2015 10:57 am
by mhixson2
Environment:
Nagios XI 2014R2.7
CentOS 6.6 VM
4xCPU, 12GB Mem
679 hosts (all Windows)
5500 services
NSClient++ 0.4.3.143-x64
NRPE checks with a fair amount of external scripts
I configured reoccurring downtime for all of my hosts late last week and ever since then performance has been very slow via the web UI. It was never slow before that.
I've been watching server performance:
- CPU utilization is very low: typically under 10%, with occasional spikes to 50 or 60%, but the spikes seem to correspond with background processes, not Apache demand.
- Memory use is at 10.5 of 12GB and holding.
- I/O wait is currently 0.05%, but this spikes hard when I apply a new config and takes a minute or two to recover.
- Load (pulled from the test Nagios server, which is watching prod) is currently load1=0.43, load5=0.61, load15=0.9, and it doesn't really ever spike much above 5%.
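Load averages are run-queue lengths rather than percentages, so they are usually easier to judge when divided by the number of CPUs. Below is a minimal Python sketch of that normalization, using the figures quoted above and the 4 CPUs listed in the environment. The function name `load_per_cpu` is hypothetical, and treating values below 1.0 per CPU as "not saturated" is a common rule of thumb, not something Nagios XI defines.

```python
def load_per_cpu(load1, load5, load15, cpus):
    """Divide each load average by the CPU count (illustrative helper)."""
    if cpus <= 0:
        raise ValueError("cpus must be positive")
    return tuple(value / cpus for value in (load1, load5, load15))


# Figures from the post: load1=0.43, load5=0.61, load15=0.9 on a 4-CPU VM.
per_cpu = load_per_cpu(0.43, 0.61, 0.9, cpus=4)

# Rule of thumb (an assumption): below 1.0 per CPU means the run queue
# is not backing up.
saturated = any(value >= 1.0 for value in per_cpu)
print(per_cpu, "saturated:", saturated)
```

On a live system, `os.getloadavg()` and `os.cpu_count()` from the standard library could supply the inputs instead of hard-coded values.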
Is the performance issue related to having reoccurring downtime scheduled for all hosts? I have also noticed that, as I navigate, the messages on each service and host that note the upcoming downtime take some time to load.
Let me know if you need any other info. I need to get this thing back to its snappy self!
Thanks
Re: Performance degraded after Reoccurring Downtime setup
Posted: Tue Dec 15, 2015 5:45 pm
by tgriep
Can you run the following as root on the server and post the output here?
Re: Performance degraded after Reoccurring Downtime setup
Posted: Tue Dec 15, 2015 7:53 pm
by Box293
In addition to what @tgriep has requested, can you please provide this information:
Admin > System Config > Manage System Config
What is in your "Program URL" and "External URL" fields?
Does your Program URL resolve to an internal IP address?
Re: Performance degraded after Reoccurring Downtime setup
Posted: Wed Dec 16, 2015 10:20 am
by mhixson2
Thanks guys. I think you're both onto something.
The Program URL hadn't been reverted from some SSL testing I had been doing previously but later backed out of. It was https://10.220.102.42/nagiosxi/, but I have reverted it back to http://10.220.102.42/nagiosxi/. With this reverted, however, the problem still persists.
Examples of the issue:
- Click on unhandled problems on the home screen (11 warnings in this situation)
  - The borders and menus on the new page load immediately
  - The service list takes about 6 seconds to display, during which a pinwheel spins on the page
- After applying config changes, active service checks, active host checks, and notifications do not come up immediately
  - This is monitored by watching the six system status indicators in the top right of the page
  - So, the first three will be green right after applying the config, but the last three take 30+ seconds to return to an OK/green state
Here is the list of running processes. It reminded me that we do have another installation on this box: Splunk. I forgot about that. It reads perfdata files and some of the ndoutils MySQL tables and sends the data off to our Splunk server for graphing/metrics. The perfdata forwarding is pretty much realtime, while the MySQL reads are done every 60 minutes. Let me know if you think this could be causing a conflict. It should be noted that the performance issue was not present until I set up the scheduled downtime (increased MySQL activity?) and that the Splunk forwarding had been set up for months before that with no noticeable problems.
Code:
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 Dec13 ? 00:00:02 /sbin/init
root 2 0 0 Dec13 ? 00:00:00 [kthreadd]
root 3 2 0 Dec13 ? 00:00:52 [migration/0]
root 4 2 0 Dec13 ? 00:00:24 [ksoftirqd/0]
root 5 2 0 Dec13 ? 00:00:00 [stopper/0]
root 6 2 0 Dec13 ? 00:01:32 [watchdog/0]
root 7 2 0 Dec13 ? 00:00:44 [migration/1]
root 8 2 0 Dec13 ? 00:00:00 [stopper/1]
root 9 2 0 Dec13 ? 00:00:33 [ksoftirqd/1]
root 10 2 0 Dec13 ? 00:01:25 [watchdog/1]
root 11 2 0 Dec13 ? 00:01:11 [migration/2]
root 12 2 0 Dec13 ? 00:00:00 [stopper/2]
root 13 2 0 Dec13 ? 00:00:30 [ksoftirqd/2]
root 14 2 0 Dec13 ? 00:01:10 [watchdog/2]
root 15 2 0 Dec13 ? 00:00:58 [migration/3]
root 16 2 0 Dec13 ? 00:00:00 [stopper/3]
root 17 2 0 Dec13 ? 00:00:31 [ksoftirqd/3]
root 18 2 0 Dec13 ? 00:01:12 [watchdog/3]
root 19 2 0 Dec13 ? 00:02:27 [events/0]
root 20 2 0 Dec13 ? 00:02:04 [events/1]
root 21 2 0 Dec13 ? 00:01:54 [events/2]
root 22 2 0 Dec13 ? 00:02:19 [events/3]
root 23 2 0 Dec13 ? 00:00:00 [cgroup]
root 24 2 0 Dec13 ? 00:00:00 [khelper]
root 25 2 0 Dec13 ? 00:00:00 [netns]
root 26 2 0 Dec13 ? 00:00:00 [async/mgr]
root 27 2 0 Dec13 ? 00:00:00 [pm]
root 28 2 0 Dec13 ? 00:00:07 [sync_supers]
root 29 2 0 Dec13 ? 00:00:03 [bdi-default]
root 30 2 0 Dec13 ? 00:00:00 [kintegrityd/0]
root 31 2 0 Dec13 ? 00:00:00 [kintegrityd/1]
root 32 2 0 Dec13 ? 00:00:00 [kintegrityd/2]
root 33 2 0 Dec13 ? 00:00:00 [kintegrityd/3]
root 34 2 0 Dec13 ? 00:03:58 [kblockd/0]
root 35 2 0 Dec13 ? 00:03:35 [kblockd/1]
root 36 2 0 Dec13 ? 00:03:32 [kblockd/2]
root 37 2 0 Dec13 ? 00:03:43 [kblockd/3]
root 38 2 0 Dec13 ? 00:00:00 [kacpid]
root 39 2 0 Dec13 ? 00:00:00 [kacpi_notify]
root 40 2 0 Dec13 ? 00:00:00 [kacpi_hotplug]
root 41 2 0 Dec13 ? 00:00:00 [ata_aux]
root 42 2 0 Dec13 ? 00:00:00 [ata_sff/0]
root 43 2 0 Dec13 ? 00:00:00 [ata_sff/1]
root 44 2 0 Dec13 ? 00:00:00 [ata_sff/2]
root 45 2 0 Dec13 ? 00:00:00 [ata_sff/3]
root 46 2 0 Dec13 ? 00:00:00 [ksuspend_usbd]
root 47 2 0 Dec13 ? 00:00:00 [khubd]
root 48 2 0 Dec13 ? 00:00:00 [kseriod]
root 49 2 0 Dec13 ? 00:00:00 [md/0]
root 50 2 0 Dec13 ? 00:00:00 [md/1]
root 51 2 0 Dec13 ? 00:00:00 [md/2]
root 52 2 0 Dec13 ? 00:00:00 [md/3]
root 53 2 0 Dec13 ? 00:00:00 [md_misc/0]
root 54 2 0 Dec13 ? 00:00:00 [md_misc/1]
root 55 2 0 Dec13 ? 00:00:00 [md_misc/2]
root 56 2 0 Dec13 ? 00:00:00 [md_misc/3]
root 57 2 0 Dec13 ? 00:00:00 [linkwatch]
root 59 2 0 Dec13 ? 00:00:00 [khungtaskd]
root 60 2 0 Dec13 ? 00:00:03 [kswapd0]
root 61 2 0 Dec13 ? 00:00:00 [ksmd]
root 62 2 0 Dec13 ? 00:01:22 [khugepaged]
root 63 2 0 Dec13 ? 00:00:00 [aio/0]
root 64 2 0 Dec13 ? 00:00:00 [aio/1]
root 65 2 0 Dec13 ? 00:00:00 [aio/2]
root 66 2 0 Dec13 ? 00:00:00 [aio/3]
root 67 2 0 Dec13 ? 00:00:00 [crypto/0]
root 68 2 0 Dec13 ? 00:00:00 [crypto/1]
root 69 2 0 Dec13 ? 00:00:00 [crypto/2]
root 70 2 0 Dec13 ? 00:00:00 [crypto/3]
root 78 2 0 Dec13 ? 00:00:00 [kthrotld/0]
root 79 2 0 Dec13 ? 00:00:00 [kthrotld/1]
root 80 2 0 Dec13 ? 00:00:00 [kthrotld/2]
root 81 2 0 Dec13 ? 00:00:00 [kthrotld/3]
root 82 2 0 Dec13 ? 00:00:00 [pciehpd]
root 84 2 0 Dec13 ? 00:00:00 [kpsmoused]
root 85 2 0 Dec13 ? 00:00:00 [usbhid_resumer]
root 86 2 0 Dec13 ? 00:00:00 [deferwq]
root 117 2 0 Dec13 ? 00:00:00 [kdmremove]
root 118 2 0 Dec13 ? 00:00:00 [kstriped]
root 254 2 0 Dec13 ? 00:00:00 [scsi_eh_0]
root 255 2 0 Dec13 ? 00:00:00 [scsi_eh_1]
root 277 2 0 Dec13 ? 00:00:45 [mpt_poll_0]
root 278 2 0 Dec13 ? 00:00:00 [mpt/0]
root 279 2 0 Dec13 ? 00:00:00 [scsi_eh_2]
root 343 2 0 Dec13 ? 00:00:00 [kdmflush]
root 344 2 0 Dec13 ? 00:00:00 [kdmflush]
root 362 2 0 Dec13 ? 00:03:20 [jbd2/dm-0-8]
root 363 2 0 Dec13 ? 00:00:00 [ext4-dio-unwrit]
root 436 1 0 Dec13 ? 00:00:00 /sbin/udevd -d
root 567 2 0 Dec13 ? 00:00:15 [vmmemctl]
root 722 2 0 Dec13 ? 00:00:00 [kdmflush]
root 725 2 0 Dec13 ? 00:00:00 [kdmflush]
root 771 2 0 Dec13 ? 00:00:00 [jbd2/sda1-8]
root 772 2 0 Dec13 ? 00:00:00 [ext4-dio-unwrit]
root 773 2 0 Dec13 ? 00:02:28 [jbd2/dm-3-8]
root 774 2 0 Dec13 ? 00:00:00 [ext4-dio-unwrit]
root 775 2 0 Dec13 ? 00:03:09 [jbd2/dm-2-8]
root 776 2 0 Dec13 ? 00:00:00 [ext4-dio-unwrit]
root 810 2 0 Dec13 ? 00:00:05 [kauditd]
postfix 836 1741 0 08:13 ? 00:00:00 pickup -l -t fifo -u
root 1001 2 0 Dec13 ? 00:04:42 [flush-253:0]
root 1002 2 1 Dec13 ? 01:26:24 [flush-253:2]
root 1003 2 0 Dec13 ? 00:03:18 [flush-253:3]
root 1240 1 0 Dec13 ? 00:00:27 auditd
root 1260 1 0 Dec13 ? 00:00:12 /sbin/rsyslogd -i /var/run/syslogd.pid -c 5
dbus 1368 1 0 Dec13 ? 00:00:00 dbus-daemon --system
root 1378 1 0 Dec13 ? 00:00:26 winbindd
root 1420 1 0 Dec13 ? 00:00:00 /usr/sbin/sshd
root 1429 1 0 Dec13 ? 00:00:00 xinetd -stayalive -pidfile /var/run/xinetd.pid
ntp 1438 1 0 Dec13 ? 00:00:00 ntpd -u ntp:ntp -p /var/run/ntpd.pid -g
root 1441 436 0 Dec13 ? 00:00:00 /sbin/udevd -d
root 1442 436 0 Dec13 ? 00:00:00 /sbin/udevd -d
root 1474 1 0 Dec13 ? 00:00:00 /bin/sh /usr/bin/mysqld_safe --datadir=/var/lib/mysql --socket=/var/lib/mysql/mysql.sock --pid-file=/var/run/mysqld/mysqld.pid --basedir=/usr --user=mysql
root 1556 1 0 Dec13 ? 00:04:45 /usr/sbin/vmtoolsd
root 1576 1 0 Dec13 ? 00:00:00 /usr/lib/vmware-vgauth/VGAuthService -s
mysql 1577 1474 4 Dec13 ? 03:36:29 /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql --user=mysql --log-error=/var/log/mysqld.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock
root 1642 1 0 Dec13 ? 00:01:42 ./ManagementAgentHost
postgres 1646 1 0 Dec13 ? 00:00:37 /usr/bin/postmaster -p 5432 -D /var/lib/pgsql/data
root 1741 1 0 Dec13 ? 00:00:00 /usr/libexec/postfix/master
postfix 1751 1741 0 Dec13 ? 00:00:00 qmgr -l -t fifo -u
root 1755 1 0 Dec13 ? 00:00:15 /usr/sbin/httpd
root 1765 1 0 Dec13 ? 00:00:09 crond
postgres 1772 1646 0 Dec13 ? 00:00:15 postgres: logger process
postgres 1777 1646 0 Dec13 ? 00:01:33 postgres: writer process
postgres 1778 1646 0 Dec13 ? 00:01:06 postgres: wal writer process
postgres 1779 1646 0 Dec13 ? 00:00:24 postgres: autovacuum launcher process
postgres 1780 1646 0 Dec13 ? 00:01:40 postgres: stats collector process
root 1789 1378 0 Dec13 ? 00:00:02 winbindd
root 2940 1 0 Dec13 ? 00:00:28 nmbd -D
root 2954 1 0 Dec13 ? 00:00:00 smbd -D
root 2962 1378 0 Dec13 ? 00:00:01 winbindd
nagios 2966 1 0 Dec13 ? 00:00:23 /usr/local/nagios/bin/npcd -d -f /usr/local/nagios/etc/pnp/npcd.cfg
root 2970 1378 0 Dec13 ? 00:00:01 winbindd
root 2976 1378 0 Dec13 ? 00:00:00 winbindd
root 2983 2954 0 Dec13 ? 00:00:00 smbd -D
ajaxterm 3047 1 0 Dec13 ? 00:01:33 python /usr/share/ajaxterm/ajaxterm.py --daemon --port=8022 --uid=ajaxterm
root 3428 1 0 Dec13 ? 00:00:05 /opt/simpana/Base/cvlaunchd
root 3429 1 0 Dec13 ? 00:00:47 /opt/simpana/Base/cvd
root 3430 1 0 Dec13 ? 00:01:40 /opt/simpana/Base/EvMgrC
root 3432 1 0 Dec13 ? 00:00:13 /opt/simpana/Base/cvfwd
nagios 3865 1 0 Dec13 ? 00:00:00 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg
root 3902 1 0 Dec13 tty1 00:00:00 /sbin/mingetty /dev/tty1
root 3904 1 0 Dec13 tty2 00:00:00 /sbin/mingetty /dev/tty2
root 3907 1 0 Dec13 tty3 00:00:00 /sbin/mingetty /dev/tty3
root 3910 1 0 Dec13 tty4 00:00:00 /sbin/mingetty /dev/tty4
root 3915 1 0 Dec13 tty5 00:00:00 /sbin/mingetty /dev/tty5
root 3918 1 0 Dec13 tty6 00:00:00 /sbin/mingetty /dev/tty6
apache 4248 1755 3 05:27 ? 00:06:48 /usr/sbin/httpd
postgres 4302 1646 0 05:27 ? 00:00:10 postgres: nagiosxi nagiosxi [local] idle
apache 9772 1755 3 06:34 ? 00:04:31 /usr/sbin/httpd
postgres 9791 1646 0 06:34 ? 00:00:06 postgres: nagiosxi nagiosxi [local] idle
nagios 14247 1 2 Dec14 ? 00:55:48 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 14249 14247 0 Dec14 ? 00:02:20 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 14250 14247 0 Dec14 ? 00:02:16 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 14251 14247 0 Dec14 ? 00:02:18 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 14252 14247 0 Dec14 ? 00:02:18 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 14253 14247 0 Dec14 ? 00:02:17 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 14254 14247 0 Dec14 ? 00:02:20 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 14257 3865 0 Dec14 ? 00:02:25 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg
nagios 14258 14257 0 Dec14 ? 00:16:03 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg
nagios 14312 14247 0 Dec14 ? 00:00:11 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
apache 15243 1755 3 05:35 ? 00:06:45 /usr/sbin/httpd
postgres 15384 1646 0 05:35 ? 00:00:09 postgres: nagiosxi nagiosxi [local] idle
apache 16654 1755 3 06:17 ? 00:05:06 /usr/sbin/httpd
postgres 16672 1646 0 06:17 ? 00:00:07 postgres: nagiosxi nagiosxi [local] idle
apache 18881 1755 3 05:58 ? 00:05:39 /usr/sbin/httpd
postgres 18892 1646 0 05:58 ? 00:00:08 postgres: nagiosxi nagiosxi [local] idle
apache 21978 1755 3 06:00 ? 00:05:36 /usr/sbin/httpd
postgres 21987 1646 0 06:00 ? 00:00:08 postgres: nagiosxi nagiosxi [local] idle
apache 22203 1755 3 05:39 ? 00:06:24 /usr/sbin/httpd
postgres 22210 1646 0 05:39 ? 00:00:09 postgres: nagiosxi nagiosxi [local] idle
apache 23443 1755 3 07:25 ? 00:02:43 /usr/sbin/httpd
postgres 23461 1646 0 07:25 ? 00:00:03 postgres: nagiosxi nagiosxi [local] idle
apache 24092 1755 3 07:04 ? 00:03:24 /usr/sbin/httpd
postgres 24119 1646 0 07:04 ? 00:00:05 postgres: nagiosxi nagiosxi [local] idle
root 24238 1 5 Dec14 ? 02:46:54 splunkd -p 8089 restart
root 24240 24238 0 Dec14 ? 00:00:55 [splunkd pid=24238] splunkd -p 8089 restart [process-runner]
root 24314 24240 0 Dec14 ? 00:06:12 /opt/splunk/bin/python -O /opt/splunk/lib/python2.7/site-packages/splunk/appserver/mrsparkle/root.py --proxied=127.0.0.1,8065,8000
root 24320 24240 0 Dec14 ? 00:03:04 /opt/splunk/bin/splunkd instrument-resource-usage -p 8089
root 24430 24240 0 Dec14 ? 00:00:00 python /opt/splunk/etc/apps/splunk_app_db_connect/bin/rpcstart.py
root 24443 24430 0 Dec14 ? 00:17:22 /opt/jdk1.8.0_60/bin/java -Xmx1024m -XX:+UseConcMarkSweepGC -classpath /opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/jackson-all-1.9.11.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/slf4j-log4j12-1.7.7.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/jetty-server-9.2.4.v20141103.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/jetty-util-9.2.4.v20141103.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/mysql-connector-java-5.1.36.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/antlr4-annotations-4.4.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/log4j-1.2.17.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/servlet-api-3.1.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/jetty-servlet-9.2.4.v20141103.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/commons-logging-1.1.2.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/antlr4-runtime-4.4.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/jetty-io-9.2.4.v20141103.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/avro-ipc-1.7.6.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/dbx2.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/slf4j-api-1.7.7.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/avro-compiler-1.7.6.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/jetty-security-9.2.4.v20141103.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/jetty-http-9.2.4.v20141103.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/avro-1.7.6.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/ext/websocket-common-9.2.4.v20141103.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/ext/websocket-client-9.2.4.v20141103.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/ext/javax.websocket-api-1.0.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/ext/commons-io-2.4.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/ext/websocket-servlet-9.2.4.v20141103.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/ext/websocket-api-9.2.4.v20141103.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/ext/javax-websocket-server-impl-9.2.4.v20141103.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/ext/jcs-2.0.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/ext/slf4j-ext-1.7.7.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/ext/websocket-server-9.2.4.v20141103.jar -DSPLUNK_HOME=/opt/splunk com.splunk.dbx2.rpc.RPCServer 127.0.0.1:9998
apache 24682 1755 3 07:25 ? 00:02:40 /usr/sbin/httpd
postgres 24699 1646 0 07:25 ? 00:00:03 postgres: nagiosxi nagiosxi [local] idle
apache 25745 1755 3 07:05 ? 00:03:24 /usr/sbin/httpd
postgres 25792 1646 0 07:05 ? 00:00:04 postgres: nagiosxi nagiosxi [local] idle
apache 26268 1755 2 07:47 ? 00:01:58 /usr/sbin/httpd
postgres 26327 1646 0 07:47 ? 00:00:02 postgres: nagiosxi nagiosxi [local] idle
apache 26578 1755 3 07:06 ? 00:03:22 /usr/sbin/httpd
postgres 26638 1646 0 07:06 ? 00:00:04 postgres: nagiosxi nagiosxi [local] idle
apache 27657 1755 3 07:06 ? 00:03:17 /usr/sbin/httpd
postgres 27665 1646 0 07:06 ? 00:00:04 postgres: nagiosxi nagiosxi [local] idle
apache 28091 1755 3 07:07 ? 00:03:23 /usr/sbin/httpd
postgres 28151 1646 0 07:07 ? 00:00:04 postgres: nagiosxi nagiosxi [local] idle
root 30045 1420 0 Dec14 ? 00:00:00 sshd: mhixson [priv]
mhixson 30244 30045 0 Dec14 ? 00:00:01 sshd: mhixson@pts/0
mhixson 30246 30244 0 Dec14 pts/0 00:00:00 -bash
apache 31229 1755 3 05:45 ? 00:06:23 /usr/sbin/httpd
root 31251 30246 0 Dec14 pts/0 00:00:00 su
postgres 31285 1646 0 05:45 ? 00:00:09 postgres: nagiosxi nagiosxi [local] idle
root 31305 31251 0 Dec14 pts/0 00:00:00 bash
root 31442 1765 0 08:54 ? 00:00:00 CROND
root 31443 1765 0 08:54 ? 00:00:00 CROND
root 31444 1765 0 08:54 ? 00:00:00 CROND
root 31445 1765 0 08:54 ? 00:00:00 CROND
root 31446 1765 0 08:54 ? 00:00:00 CROND
nagios 31448 31445 0 08:54 ? 00:00:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/cmdsubsys.php > /usr/local/nagiosxi/var/cmdsubsys.log 2>&1
nagios 31450 31444 0 08:54 ? 00:00:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/eventman.php > /usr/local/nagiosxi/var/eventman.log 2>&1
nagios 31453 31448 0 08:54 ? 00:00:00 /usr/bin/php -q /usr/local/nagiosxi/cron/cmdsubsys.php
nagios 31454 31450 0 08:54 ? 00:00:00 /usr/bin/php -q /usr/local/nagiosxi/cron/eventman.php
nagios 31456 31442 0 08:54 ? 00:00:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/perfdataproc.php > /usr/local/nagiosxi/var/perfdataproc.log 2>&1
nagios 31458 31443 0 08:54 ? 00:00:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/feedproc.php > /usr/local/nagiosxi/var/feedproc.log 2>&1
nagios 31460 31446 0 08:54 ? 00:00:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/sysstat.php > /usr/local/nagiosxi/var/sysstat.log 2>&1
nagios 31461 31458 0 08:54 ? 00:00:00 /usr/bin/php -q /usr/local/nagiosxi/cron/feedproc.php
nagios 31462 31460 0 08:54 ? 00:00:00 /usr/bin/php -q /usr/local/nagiosxi/cron/sysstat.php
nagios 31464 31456 0 08:54 ? 00:00:00 /usr/bin/php -q /usr/local/nagiosxi/cron/perfdataproc.php
postgres 31465 1646 0 08:54 ? 00:00:00 postgres: nagiosxi nagiosxi [local] idle
postgres 31471 1646 0 08:54 ? 00:00:00 postgres: nagiosxi nagiosxi [local] idle
postgres 31478 1646 0 08:54 ? 00:00:00 postgres: nagiosxi nagiosxi [local] idle
postgres 31483 1646 0 08:54 ? 00:00:00 postgres: nagiosxi nagiosxi [local] idle
postgres 31489 1646 0 08:54 ? 00:00:00 postgres: nagiosxi nagiosxi [local] idle
nagios 32262 14254 0 08:54 ? 00:00:00 /usr/local/nagios/libexec/check_nrpe -H 10.220.3.96 -t 30 -c reboot_server -a Sunday 04:00 off
nagios 32264 14252 0 08:54 ? 00:00:00 /usr/local/nagios/libexec/check_nrpe -H 10.220.3.55 -t 45 -c check_drivesize -a filter=type = 'fixed' and drive regexp '.*[C-Z].*' warn=free lt 20% crit=free lt 10%
nagios 32265 14250 0 08:54 ? 00:00:00 /usr/local/nagios/libexec/check_nrpe -H 10.220.68.16 -t 45 -c check_drivesize -a filter=type = 'fixed' and drive regexp '.*[C-Z].*' warn=free lt 20% crit=free lt 10%
nagios 32270 14252 0 08:54 ? 00:00:00 /usr/local/nagios/libexec/check_nrpe -H 10.220.148.16 -t 45 -c check_drivesize -a filter=type = 'fixed' and drive regexp '.*[C-Z].*' warn=free lt 15% crit=free lt 10%
nagios 32272 14249 0 08:54 ? 00:00:00 /usr/local/nagios/libexec/check_nrpe -H 10.220.2.166 -t 45 -c check_drivesize -a filter=type = 'fixed' and drive regexp '.*[C-Z].*' warn=free lt 20% crit=free lt 10%
nagios 32280 14254 0 08:54 ? 00:00:00 /usr/local/nagios/libexec/check_nrpe -H 10.220.1.49 -t 45 -c check_drivesize -a filter=type = 'fixed' and drive regexp '.*[C-Z].*' warn=free lt 20% crit=free lt 10%
nagios 32281 14253 0 08:54 ? 00:00:00 /usr/local/nagios/libexec/check_icmp -H 10.222.1.127 -w 3000.0 80 -c 5000.0 100 -p 5
root 32282 31305 0 08:54 pts/0 00:00:00 ps -ef
apache 32706 1755 3 05:25 ? 00:07:07 /usr/sbin/httpd
postgres 32725 1646 0 05:25 ? 00:00:10 postgres: nagiosxi nagiosxi [local] idle
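A long `ps -ef` listing like the one above is easier to compare when the accumulated CPU time (the TIME column) is totalled per account. The following is a rough Python sketch of that idea, not a Nagios or Splunk tool. The helper names `cpu_seconds` and `cpu_by_user` are hypothetical, and the three sample rows are copied from the listing with the mysqld arguments shortened.

```python
from collections import defaultdict


def cpu_seconds(time_field):
    """Convert a ps TIME value ([DD-]HH:MM:SS) to seconds."""
    days = 0
    if "-" in time_field:
        day_part, time_field = time_field.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in time_field.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def cpu_by_user(ps_lines):
    """Sum the TIME column per UID for `ps -ef` style rows."""
    totals = defaultdict(int)
    for line in ps_lines:
        # Columns: UID PID PPID C STIME TTY TIME CMD (CMD may contain spaces).
        fields = line.split(None, 7)
        if len(fields) < 8 or fields[0] == "UID":
            continue  # header row or a wrapped continuation line
        try:
            totals[fields[0]] += cpu_seconds(fields[6])
        except ValueError:
            continue  # not a process row after all
    return dict(totals)


sample = [
    "UID PID PPID C STIME TTY TIME CMD",
    "mysql 1577 1474 4 Dec13 ? 03:36:29 /usr/libexec/mysqld --basedir=/usr",
    "root 24238 1 5 Dec14 ? 02:46:54 splunkd -p 8089 restart",
    "nagios 14247 1 2 Dec14 ? 00:55:48 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg",
]
totals = cpu_by_user(sample)
for user, seconds in sorted(totals.items(), key=lambda item: -item[1]):
    print(user, seconds)
```

TIME is cumulative since each process started, so processes started on different days are not directly comparable without accounting for their start times.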
Re: Performance degraded after Reoccurring Downtime setup
Posted: Wed Dec 16, 2015 2:49 pm
by mhixson2
I stopped Splunk on the Nagios server and confirmed that the 6 second load time and the 30+ second check/notification recovery both still exist.
Let me know if you have any ideas.
Thanks
Re: Performance degraded after Reoccurring Downtime setup
Posted: Wed Dec 16, 2015 2:52 pm
by hsmith
You're running Splunk on the same server that you are running Nagios XI? We only officially support clean minimal installation systems. We can't speak for what changes Splunk may have made to the system that could be causing performance issues. Please correct me if I am not understanding your setup correctly.
Re: Performance degraded after Reoccurring Downtime setup
Posted: Wed Dec 16, 2015 3:45 pm
by mhixson2
Understood. And you are correct - same server. However, the start of these issues did not correspond with the Splunk integration. They've been happily running side by side for many months.
I'm afraid I may have some corruption, or something else going on, in one of the databases. I didn't think it was related until just now, but the root partition of this server ran out of space several weeks ago, which caused the OS to crash, and upon recovery I found the Nagios MySQL databases non-functional. I called in and you guys walked me through running the repair_databases shell script, which fixed it. I'm wondering if there's something left over from that... I'm not even sure where to look or how to tell. I'll look into where those logs are.
I have deleted the reoccurring downtime, and Nagios' performance is back to normal. So, does that config get stored in one of the MySQL databases? Have you heard of this issue before when reoccurring downtime is set up for ~680 hosts?
Any help is appreciated.
Thanks
Re: Performance degraded after Reoccurring Downtime setup
Posted: Wed Dec 16, 2015 4:12 pm
by mhixson2
To your point - I am going to investigate getting this Splunk installation off of the Nagios server. I didn't like it from the beginning, and the integration said that's the way it must be done, but I don't believe it. Splunk should be able to connect and get what it needs without anything running on the server. I'll see where that takes me. In the meantime, as I said, any help is appreciated.
Thanks.
Re: Performance degraded after Reoccurring Downtime setup
Posted: Thu Dec 17, 2015 2:03 pm
by ssax
Generally, when we see disk space fill up we also see crashed tables in /var/log/mysqld.log, so that could be the problem. Take a look in there and see if you see any crashed tables. That's usually the culprit for sudden-onset slowness.
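As a rough illustration of what to look for in that log, here is a minimal Python sketch that pulls table names out of crash-style error lines. The two patterns are assumptions based on common MySQL/MyISAM error wording ("Incorrect key file for table" and "is marked as crashed"), not an exhaustive list, and the function name `find_crashed_tables` is hypothetical. The sample lines follow the MySQL 5.1 error-log format.

```python
import re

# Assumed wording of MyISAM corruption errors; extend as needed.
CRASH_PATTERNS = [
    re.compile(r"Incorrect key file for table '([^']+)'"),
    re.compile(r"Table '([^']+)' is marked as crashed"),
]


def find_crashed_tables(lines):
    """Return a sorted list of table paths named in crash-style errors."""
    found = set()
    for line in lines:
        for pattern in CRASH_PATTERNS:
            match = pattern.search(line)
            if match:
                found.add(match.group(1).strip())
    return sorted(found)


sample = [
    "151213 0:25:13 [ERROR] /usr/libexec/mysqld: Incorrect key file for table "
    "'./nagios/nagios_logentries.MYI'; try to repair it",
    "151213 7:00:03 [ERROR] Invalid (old?) table or database name 'lost+found'",
]
print(find_crashed_tables(sample))

# Against the real file (path as given in the post), something like:
#   with open("/var/log/mysqld.log") as handle:
#       print(find_crashed_tables(handle))
```

An empty result only means that these two patterns did not match; it does not prove the tables are healthy.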
Re: Performance degraded after Reoccurring Downtime setup
Posted: Thu Dec 17, 2015 2:41 pm
by mhixson2
Thanks!
I ended up looking through that log this morning and found the events below. The events through 12/13 correspond with the server running out of space, and me cleaning it up and running the repair databases shell script. All was well after that on the server and in this log. The only thing that doesn't correspond with that outage is the 'lost+found' events. Those were happening up to this morning. I went ahead and ran the repair databases script again and it completed successfully. No events have been recorded in the log since the completion of that script.
Code:
151212 20:10:15 [Warning] Disk is full writing '/tmp/ST3pL5Kr' (Errcode: 28). Waiting for someone to free space... (Expect up to 60 secs delay for server to continue after freeing disk space)
151212 20:10:15 [Warning] Retry in 60 secs. Message reprinted in 600 secs
151212 20:20:15 [Warning] Disk is full writing '/tmp/ST3pL5Kr' (Errcode: 28). Waiting for someone to free space...
151213 00:10:03 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
151213 0:10:06 InnoDB: Initializing buffer pool, size = 8.0M
151213 0:10:06 InnoDB: Completed initialization of buffer pool
151213 0:10:10 InnoDB: Started; log sequence number 0 44233
151213 0:10:11 [Note] Event Scheduler: Loaded 0 events
151213 0:10:11 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.1.73' socket: '/var/lib/mysql/mysql.sock' port: 3306 Source distribution
151213 0:21:56 [ERROR] /usr/libexec/mysqld: Incorrect key file for table './nagios/nagios_logentries.MYI'; try to repair it
151213 0:21:56 [ERROR] /usr/libexec/mysqld: Incorrect key file for table './nagios/nagios_logentries.MYI'; try to repair it
151213 0:25:13 [ERROR] /usr/libexec/mysqld: Incorrect key file for table './nagios/nagios_logentries.MYI'; try to repair it
151213 7:00:03 [ERROR] Invalid (old?) table or database name 'lost+found'
151214 7:00:01 [ERROR] Invalid (old?) table or database name 'lost+found'
151215 7:00:01 [ERROR] Invalid (old?) table or database name 'lost+found'
151216 7:00:02 [ERROR] Invalid (old?) table or database name 'lost+found'
151216 10:08:18 [ERROR] Invalid (old?) table or database name 'lost+found'
151217 7:00:01 [ERROR] Invalid (old?) table or database name 'lost+found'
So, the log appears to be happy with the way MySQL is running, and Nagios certainly isn't showing any errors. However, the slowness is still present. Again this morning I deleted all scheduled downtime (in the Mass Acknowledge tool) and again it performed perfectly after that. I still have my scheduled downtime configured (I never deleted those entries, just the downtime on the hosts). Over time, the entries re-populate and Nagios begins to slow down.
What else is involved with scheduling reoccurring downtime? Does it write that to a config or table?
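One way to watch how many downtime entries build up as they re-populate is to count the downtime blocks in the Nagios Core status file. The Python sketch below assumes the status file is at /usr/local/nagios/var/status.dat and that downtime entries appear as `hostdowntime {` and `servicedowntime {` blocks, as in Nagios Core 4; both should be verified on the actual system. It does not show where XI stores the recurring schedule itself. The function name and the sample host and service names are hypothetical.

```python
import re
from collections import Counter

# Assumed block headers for downtime entries in status.dat.
BLOCK_START = re.compile(r"^\s*(hostdowntime|servicedowntime)\s*\{")


def count_downtime_blocks(lines):
    """Count host and service downtime blocks in status.dat-style text."""
    counts = Counter()
    for line in lines:
        match = BLOCK_START.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical excerpt in the assumed status.dat layout.
sample = [
    "hostdowntime {",
    "\thost_name=win-host-01",
    "\t}",
    "servicedowntime {",
    "\thost_name=win-host-01",
    "\tservice_description=CPU Usage",
    "\t}",
    "servicedowntime {",
    "\thost_name=win-host-01",
    "\tservice_description=Drive Space",
    "\t}",
]
counts = count_downtime_blocks(sample)
print(dict(counts))

# Against the real file (assumed path):
#   with open("/usr/local/nagios/var/status.dat") as handle:
#       print(dict(count_downtime_blocks(handle)))
```

Running this periodically and comparing the counts with when the UI slows down would show whether the slowdown tracks the number of pending downtime entries.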
Thanks