Performance degraded after Reoccurring Downtime setup

Post by **mhixson2** » Tue Dec 15, 2015 10:57 am

Environment:

Nagios XI 2014R2.7
CentOS 6.6 VM
4xCPU, 12GB Mem

679 hosts (all Windows)
5500 services

NSClient++ 0.4.3.143-x64
NRPE checks with a fair amount of external scripts

I configured reoccurring downtime for all of my hosts late last week and ever since then performance has been very slow via the web UI. It was never slow before that.
I've been watching server performance:

CPU utilization is very low - typically under 10% with occasional spikes to 50 or 60%, but the spikes seem to correspond with background processes, not Apache demand.
Memory use is at 10.5 of 12GB and holding.
I/O wait is currently 0.05% but this spikes hard when I apply a new config and takes a minute or two to recover.
Load (pulled from test Nagios server which is watching prod) is currently: load1=0.43, load5=0.61, load15=0.9 and doesn't really ever spike much above 5%.

Is the performance issue related to having reoccurring downtime scheduled for all hosts? I have also noticed that, as I navigate, the messages on each host and service noting the upcoming downtime take some time to load.

Let me know if you need any other info. I need to get this thing back to its snappy self!

Thanks

Post by **tgriep** » Tue Dec 15, 2015 5:45 pm

Can you run the following as root on the server and post the output here?

Code:

ps -ef

Post by **Box293** » Tue Dec 15, 2015 7:53 pm

In addition to what @tgriep has requested, can you please provide this information:

Admin > System Config > Manage System Config

What is in your "Program URL" and "External URL" fields?

Does your Program URL resolve to an internal IP address?

Post by **mhixson2** » Wed Dec 16, 2015 10:20 am

Thanks guys. I think you're both onto something.

The Program URL hadn't been reverted after some SSL testing I did previously and then backed out of. It was https://10.220.102.42/nagiosxi/, and I have now reverted it to http://10.220.102.42/nagiosxi/. Even with this reverted, however, the problem persists.

Examples of the issue:

Click on unhandled problems on the home screen (11 warnings in this situation)
The borders and menus on the new page load immediately
The service list takes about 6 seconds to display, during which a pinwheel spins on the page

After applying config changes, active service checks, active host checks, and notifications do not come up immediately
This is monitored by watching the six system status indicators in the top right of the page
So, the first three will be green right after applying the config, but the last three take 30+ seconds to return to an OK/green state

Here is the list of running processes. It reminded me that we do have another installation on this box - Splunk. I forgot about that. It reads perfdata files and some of the ndoutils MySQL tables and sends the data off to our Splunk server for graphing/metrics. The perfdata forwarding is pretty much realtime, while the MySQL reads are done every 60 minutes. Let me know if you think this could be causing a conflict. It should be noted that the performance issue was not present until I set up the scheduled downtime (increased MySQL activity?) and that the Splunk forwarding had been set up for months before that with no noticeable problems.

Code:

UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 Dec13 ?        00:00:02 /sbin/init
root         2     0  0 Dec13 ?        00:00:00 [kthreadd]
root         3     2  0 Dec13 ?        00:00:52 [migration/0]
root         4     2  0 Dec13 ?        00:00:24 [ksoftirqd/0]
root         5     2  0 Dec13 ?        00:00:00 [stopper/0]
root         6     2  0 Dec13 ?        00:01:32 [watchdog/0]
root         7     2  0 Dec13 ?        00:00:44 [migration/1]
root         8     2  0 Dec13 ?        00:00:00 [stopper/1]
root         9     2  0 Dec13 ?        00:00:33 [ksoftirqd/1]
root        10     2  0 Dec13 ?        00:01:25 [watchdog/1]
root        11     2  0 Dec13 ?        00:01:11 [migration/2]
root        12     2  0 Dec13 ?        00:00:00 [stopper/2]
root        13     2  0 Dec13 ?        00:00:30 [ksoftirqd/2]
root        14     2  0 Dec13 ?        00:01:10 [watchdog/2]
root        15     2  0 Dec13 ?        00:00:58 [migration/3]
root        16     2  0 Dec13 ?        00:00:00 [stopper/3]
root        17     2  0 Dec13 ?        00:00:31 [ksoftirqd/3]
root        18     2  0 Dec13 ?        00:01:12 [watchdog/3]
root        19     2  0 Dec13 ?        00:02:27 [events/0]
root        20     2  0 Dec13 ?        00:02:04 [events/1]
root        21     2  0 Dec13 ?        00:01:54 [events/2]
root        22     2  0 Dec13 ?        00:02:19 [events/3]
root        23     2  0 Dec13 ?        00:00:00 [cgroup]
root        24     2  0 Dec13 ?        00:00:00 [khelper]
root        25     2  0 Dec13 ?        00:00:00 [netns]
root        26     2  0 Dec13 ?        00:00:00 [async/mgr]
root        27     2  0 Dec13 ?        00:00:00 [pm]
root        28     2  0 Dec13 ?        00:00:07 [sync_supers]
root        29     2  0 Dec13 ?        00:00:03 [bdi-default]
root        30     2  0 Dec13 ?        00:00:00 [kintegrityd/0]
root        31     2  0 Dec13 ?        00:00:00 [kintegrityd/1]
root        32     2  0 Dec13 ?        00:00:00 [kintegrityd/2]
root        33     2  0 Dec13 ?        00:00:00 [kintegrityd/3]
root        34     2  0 Dec13 ?        00:03:58 [kblockd/0]
root        35     2  0 Dec13 ?        00:03:35 [kblockd/1]
root        36     2  0 Dec13 ?        00:03:32 [kblockd/2]
root        37     2  0 Dec13 ?        00:03:43 [kblockd/3]
root        38     2  0 Dec13 ?        00:00:00 [kacpid]
root        39     2  0 Dec13 ?        00:00:00 [kacpi_notify]
root        40     2  0 Dec13 ?        00:00:00 [kacpi_hotplug]
root        41     2  0 Dec13 ?        00:00:00 [ata_aux]
root        42     2  0 Dec13 ?        00:00:00 [ata_sff/0]
root        43     2  0 Dec13 ?        00:00:00 [ata_sff/1]
root        44     2  0 Dec13 ?        00:00:00 [ata_sff/2]
root        45     2  0 Dec13 ?        00:00:00 [ata_sff/3]
root        46     2  0 Dec13 ?        00:00:00 [ksuspend_usbd]
root        47     2  0 Dec13 ?        00:00:00 [khubd]
root        48     2  0 Dec13 ?        00:00:00 [kseriod]
root        49     2  0 Dec13 ?        00:00:00 [md/0]
root        50     2  0 Dec13 ?        00:00:00 [md/1]
root        51     2  0 Dec13 ?        00:00:00 [md/2]
root        52     2  0 Dec13 ?        00:00:00 [md/3]
root        53     2  0 Dec13 ?        00:00:00 [md_misc/0]
root        54     2  0 Dec13 ?        00:00:00 [md_misc/1]
root        55     2  0 Dec13 ?        00:00:00 [md_misc/2]
root        56     2  0 Dec13 ?        00:00:00 [md_misc/3]
root        57     2  0 Dec13 ?        00:00:00 [linkwatch]
root        59     2  0 Dec13 ?        00:00:00 [khungtaskd]
root        60     2  0 Dec13 ?        00:00:03 [kswapd0]
root        61     2  0 Dec13 ?        00:00:00 [ksmd]
root        62     2  0 Dec13 ?        00:01:22 [khugepaged]
root        63     2  0 Dec13 ?        00:00:00 [aio/0]
root        64     2  0 Dec13 ?        00:00:00 [aio/1]
root        65     2  0 Dec13 ?        00:00:00 [aio/2]
root        66     2  0 Dec13 ?        00:00:00 [aio/3]
root        67     2  0 Dec13 ?        00:00:00 [crypto/0]
root        68     2  0 Dec13 ?        00:00:00 [crypto/1]
root        69     2  0 Dec13 ?        00:00:00 [crypto/2]
root        70     2  0 Dec13 ?        00:00:00 [crypto/3]
root        78     2  0 Dec13 ?        00:00:00 [kthrotld/0]
root        79     2  0 Dec13 ?        00:00:00 [kthrotld/1]
root        80     2  0 Dec13 ?        00:00:00 [kthrotld/2]
root        81     2  0 Dec13 ?        00:00:00 [kthrotld/3]
root        82     2  0 Dec13 ?        00:00:00 [pciehpd]
root        84     2  0 Dec13 ?        00:00:00 [kpsmoused]
root        85     2  0 Dec13 ?        00:00:00 [usbhid_resumer]
root        86     2  0 Dec13 ?        00:00:00 [deferwq]
root       117     2  0 Dec13 ?        00:00:00 [kdmremove]
root       118     2  0 Dec13 ?        00:00:00 [kstriped]
root       254     2  0 Dec13 ?        00:00:00 [scsi_eh_0]
root       255     2  0 Dec13 ?        00:00:00 [scsi_eh_1]
root       277     2  0 Dec13 ?        00:00:45 [mpt_poll_0]
root       278     2  0 Dec13 ?        00:00:00 [mpt/0]
root       279     2  0 Dec13 ?        00:00:00 [scsi_eh_2]
root       343     2  0 Dec13 ?        00:00:00 [kdmflush]
root       344     2  0 Dec13 ?        00:00:00 [kdmflush]
root       362     2  0 Dec13 ?        00:03:20 [jbd2/dm-0-8]
root       363     2  0 Dec13 ?        00:00:00 [ext4-dio-unwrit]
root       436     1  0 Dec13 ?        00:00:00 /sbin/udevd -d
root       567     2  0 Dec13 ?        00:00:15 [vmmemctl]
root       722     2  0 Dec13 ?        00:00:00 [kdmflush]
root       725     2  0 Dec13 ?        00:00:00 [kdmflush]
root       771     2  0 Dec13 ?        00:00:00 [jbd2/sda1-8]
root       772     2  0 Dec13 ?        00:00:00 [ext4-dio-unwrit]
root       773     2  0 Dec13 ?        00:02:28 [jbd2/dm-3-8]
root       774     2  0 Dec13 ?        00:00:00 [ext4-dio-unwrit]
root       775     2  0 Dec13 ?        00:03:09 [jbd2/dm-2-8]
root       776     2  0 Dec13 ?        00:00:00 [ext4-dio-unwrit]
root       810     2  0 Dec13 ?        00:00:05 [kauditd]
postfix    836  1741  0 08:13 ?        00:00:00 pickup -l -t fifo -u
root      1001     2  0 Dec13 ?        00:04:42 [flush-253:0]
root      1002     2  1 Dec13 ?        01:26:24 [flush-253:2]
root      1003     2  0 Dec13 ?        00:03:18 [flush-253:3]
root      1240     1  0 Dec13 ?        00:00:27 auditd
root      1260     1  0 Dec13 ?        00:00:12 /sbin/rsyslogd -i /var/run/syslogd.pid -c 5
dbus      1368     1  0 Dec13 ?        00:00:00 dbus-daemon --system
root      1378     1  0 Dec13 ?        00:00:26 winbindd
root      1420     1  0 Dec13 ?        00:00:00 /usr/sbin/sshd
root      1429     1  0 Dec13 ?        00:00:00 xinetd -stayalive -pidfile /var/run/xinetd.pid
ntp       1438     1  0 Dec13 ?        00:00:00 ntpd -u ntp:ntp -p /var/run/ntpd.pid -g
root      1441   436  0 Dec13 ?        00:00:00 /sbin/udevd -d
root      1442   436  0 Dec13 ?        00:00:00 /sbin/udevd -d
root      1474     1  0 Dec13 ?        00:00:00 /bin/sh /usr/bin/mysqld_safe --datadir=/var/lib/mysql --socket=/var/lib/mysql/mysql.sock --pid-file=/var/run/mysqld/mysqld.pid --basedir=/usr --user=mysql
root      1556     1  0 Dec13 ?        00:04:45 /usr/sbin/vmtoolsd
root      1576     1  0 Dec13 ?        00:00:00 /usr/lib/vmware-vgauth/VGAuthService -s
mysql     1577  1474  4 Dec13 ?        03:36:29 /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql --user=mysql --log-error=/var/log/mysqld.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock
root      1642     1  0 Dec13 ?        00:01:42 ./ManagementAgentHost
postgres  1646     1  0 Dec13 ?        00:00:37 /usr/bin/postmaster -p 5432 -D /var/lib/pgsql/data
root      1741     1  0 Dec13 ?        00:00:00 /usr/libexec/postfix/master
postfix   1751  1741  0 Dec13 ?        00:00:00 qmgr -l -t fifo -u
root      1755     1  0 Dec13 ?        00:00:15 /usr/sbin/httpd
root      1765     1  0 Dec13 ?        00:00:09 crond
postgres  1772  1646  0 Dec13 ?        00:00:15 postgres: logger process                          
postgres  1777  1646  0 Dec13 ?        00:01:33 postgres: writer process                          
postgres  1778  1646  0 Dec13 ?        00:01:06 postgres: wal writer process                      
postgres  1779  1646  0 Dec13 ?        00:00:24 postgres: autovacuum launcher process             
postgres  1780  1646  0 Dec13 ?        00:01:40 postgres: stats collector process                 
root      1789  1378  0 Dec13 ?        00:00:02 winbindd
root      2940     1  0 Dec13 ?        00:00:28 nmbd -D
root      2954     1  0 Dec13 ?        00:00:00 smbd -D
root      2962  1378  0 Dec13 ?        00:00:01 winbindd
nagios    2966     1  0 Dec13 ?        00:00:23 /usr/local/nagios/bin/npcd -d -f /usr/local/nagios/etc/pnp/npcd.cfg
root      2970  1378  0 Dec13 ?        00:00:01 winbindd
root      2976  1378  0 Dec13 ?        00:00:00 winbindd
root      2983  2954  0 Dec13 ?        00:00:00 smbd -D
ajaxterm  3047     1  0 Dec13 ?        00:01:33 python /usr/share/ajaxterm/ajaxterm.py --daemon --port=8022 --uid=ajaxterm
root      3428     1  0 Dec13 ?        00:00:05 /opt/simpana/Base/cvlaunchd
root      3429     1  0 Dec13 ?        00:00:47 /opt/simpana/Base/cvd
root      3430     1  0 Dec13 ?        00:01:40 /opt/simpana/Base/EvMgrC
root      3432     1  0 Dec13 ?        00:00:13 /opt/simpana/Base/cvfwd
nagios    3865     1  0 Dec13 ?        00:00:00 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg
root      3902     1  0 Dec13 tty1     00:00:00 /sbin/mingetty /dev/tty1
root      3904     1  0 Dec13 tty2     00:00:00 /sbin/mingetty /dev/tty2
root      3907     1  0 Dec13 tty3     00:00:00 /sbin/mingetty /dev/tty3
root      3910     1  0 Dec13 tty4     00:00:00 /sbin/mingetty /dev/tty4
root      3915     1  0 Dec13 tty5     00:00:00 /sbin/mingetty /dev/tty5
root      3918     1  0 Dec13 tty6     00:00:00 /sbin/mingetty /dev/tty6
apache    4248  1755  3 05:27 ?        00:06:48 /usr/sbin/httpd
postgres  4302  1646  0 05:27 ?        00:00:10 postgres: nagiosxi nagiosxi [local] idle          
apache    9772  1755  3 06:34 ?        00:04:31 /usr/sbin/httpd
postgres  9791  1646  0 06:34 ?        00:00:06 postgres: nagiosxi nagiosxi [local] idle          
nagios   14247     1  2 Dec14 ?        00:55:48 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios   14249 14247  0 Dec14 ?        00:02:20 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   14250 14247  0 Dec14 ?        00:02:16 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   14251 14247  0 Dec14 ?        00:02:18 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   14252 14247  0 Dec14 ?        00:02:18 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   14253 14247  0 Dec14 ?        00:02:17 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   14254 14247  0 Dec14 ?        00:02:20 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   14257  3865  0 Dec14 ?        00:02:25 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg
nagios   14258 14257  0 Dec14 ?        00:16:03 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg
nagios   14312 14247  0 Dec14 ?        00:00:11 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
apache   15243  1755  3 05:35 ?        00:06:45 /usr/sbin/httpd
postgres 15384  1646  0 05:35 ?        00:00:09 postgres: nagiosxi nagiosxi [local] idle          
apache   16654  1755  3 06:17 ?        00:05:06 /usr/sbin/httpd
postgres 16672  1646  0 06:17 ?        00:00:07 postgres: nagiosxi nagiosxi [local] idle          
apache   18881  1755  3 05:58 ?        00:05:39 /usr/sbin/httpd
postgres 18892  1646  0 05:58 ?        00:00:08 postgres: nagiosxi nagiosxi [local] idle          
apache   21978  1755  3 06:00 ?        00:05:36 /usr/sbin/httpd
postgres 21987  1646  0 06:00 ?        00:00:08 postgres: nagiosxi nagiosxi [local] idle          
apache   22203  1755  3 05:39 ?        00:06:24 /usr/sbin/httpd
postgres 22210  1646  0 05:39 ?        00:00:09 postgres: nagiosxi nagiosxi [local] idle          
apache   23443  1755  3 07:25 ?        00:02:43 /usr/sbin/httpd
postgres 23461  1646  0 07:25 ?        00:00:03 postgres: nagiosxi nagiosxi [local] idle          
apache   24092  1755  3 07:04 ?        00:03:24 /usr/sbin/httpd
postgres 24119  1646  0 07:04 ?        00:00:05 postgres: nagiosxi nagiosxi [local] idle          
root     24238     1  5 Dec14 ?        02:46:54 splunkd -p 8089 restart
root     24240 24238  0 Dec14 ?        00:00:55 [splunkd pid=24238] splunkd -p 8089 restart [process-runner]
root     24314 24240  0 Dec14 ?        00:06:12 /opt/splunk/bin/python -O /opt/splunk/lib/python2.7/site-packages/splunk/appserver/mrsparkle/root.py --proxied=127.0.0.1,8065,8000
root     24320 24240  0 Dec14 ?        00:03:04 /opt/splunk/bin/splunkd instrument-resource-usage -p 8089
root     24430 24240  0 Dec14 ?        00:00:00 python /opt/splunk/etc/apps/splunk_app_db_connect/bin/rpcstart.py
root     24443 24430  0 Dec14 ?        00:17:22 /opt/jdk1.8.0_60/bin/java -Xmx1024m -XX:+UseConcMarkSweepGC -classpath /opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/jackson-all-1.9.11.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/slf4j-log4j12-1.7.7.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/jetty-server-9.2.4.v20141103.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/jetty-util-9.2.4.v20141103.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/mysql-connector-java-5.1.36.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/antlr4-annotations-4.4.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/log4j-1.2.17.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/servlet-api-3.1.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/jetty-servlet-9.2.4.v20141103.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/commons-logging-1.1.2.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/antlr4-runtime-4.4.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/jetty-io-9.2.4.v20141103.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/avro-ipc-1.7.6.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/dbx2.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/slf4j-api-1.7.7.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/avro-compiler-1.7.6.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/jetty-security-9.2.4.v20141103.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/jetty-http-9.2.4.v20141103.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/avro-1.7.6.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/ext/websocket-common-9.2.4.v20141103.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/ext/websocket-client-9.2.4.v20141103.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/ext/javax.websocket-api-1.0.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/ext/commons-io-2.4.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/ext/websocket-servlet-9.2.4.v20141103.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/ext/websocket-api-9.2.4.v20141103.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/ext/javax-websocket-server-impl-9.2.4.v20141103.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/ext/jcs-2.0.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/ext/slf4j-ext-1.7.7.jar:/opt/splunk/etc/apps/splunk_app_db_connect/bin/lib/ext/websocket-server-9.2.4.v20141103.jar -DSPLUNK_HOME=/opt/splunk com.splunk.dbx2.rpc.RPCServer 127.0.0.1:9998
apache   24682  1755  3 07:25 ?        00:02:40 /usr/sbin/httpd
postgres 24699  1646  0 07:25 ?        00:00:03 postgres: nagiosxi nagiosxi [local] idle          
apache   25745  1755  3 07:05 ?        00:03:24 /usr/sbin/httpd
postgres 25792  1646  0 07:05 ?        00:00:04 postgres: nagiosxi nagiosxi [local] idle          
apache   26268  1755  2 07:47 ?        00:01:58 /usr/sbin/httpd
postgres 26327  1646  0 07:47 ?        00:00:02 postgres: nagiosxi nagiosxi [local] idle          
apache   26578  1755  3 07:06 ?        00:03:22 /usr/sbin/httpd
postgres 26638  1646  0 07:06 ?        00:00:04 postgres: nagiosxi nagiosxi [local] idle          
apache   27657  1755  3 07:06 ?        00:03:17 /usr/sbin/httpd
postgres 27665  1646  0 07:06 ?        00:00:04 postgres: nagiosxi nagiosxi [local] idle          
apache   28091  1755  3 07:07 ?        00:03:23 /usr/sbin/httpd
postgres 28151  1646  0 07:07 ?        00:00:04 postgres: nagiosxi nagiosxi [local] idle          
root     30045  1420  0 Dec14 ?        00:00:00 sshd: mhixson [priv]
mhixson  30244 30045  0 Dec14 ?        00:00:01 sshd: mhixson@pts/0
mhixson  30246 30244  0 Dec14 pts/0    00:00:00 -bash
apache   31229  1755  3 05:45 ?        00:06:23 /usr/sbin/httpd
root     31251 30246  0 Dec14 pts/0    00:00:00 su
postgres 31285  1646  0 05:45 ?        00:00:09 postgres: nagiosxi nagiosxi [local] idle          
root     31305 31251  0 Dec14 pts/0    00:00:00 bash
root     31442  1765  0 08:54 ?        00:00:00 CROND
root     31443  1765  0 08:54 ?        00:00:00 CROND
root     31444  1765  0 08:54 ?        00:00:00 CROND
root     31445  1765  0 08:54 ?        00:00:00 CROND
root     31446  1765  0 08:54 ?        00:00:00 CROND
nagios   31448 31445  0 08:54 ?        00:00:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/cmdsubsys.php > /usr/local/nagiosxi/var/cmdsubsys.log 2>&1
nagios   31450 31444  0 08:54 ?        00:00:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/eventman.php > /usr/local/nagiosxi/var/eventman.log 2>&1
nagios   31453 31448  0 08:54 ?        00:00:00 /usr/bin/php -q /usr/local/nagiosxi/cron/cmdsubsys.php
nagios   31454 31450  0 08:54 ?        00:00:00 /usr/bin/php -q /usr/local/nagiosxi/cron/eventman.php
nagios   31456 31442  0 08:54 ?        00:00:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/perfdataproc.php > /usr/local/nagiosxi/var/perfdataproc.log 2>&1
nagios   31458 31443  0 08:54 ?        00:00:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/feedproc.php > /usr/local/nagiosxi/var/feedproc.log 2>&1
nagios   31460 31446  0 08:54 ?        00:00:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/sysstat.php > /usr/local/nagiosxi/var/sysstat.log 2>&1
nagios   31461 31458  0 08:54 ?        00:00:00 /usr/bin/php -q /usr/local/nagiosxi/cron/feedproc.php
nagios   31462 31460  0 08:54 ?        00:00:00 /usr/bin/php -q /usr/local/nagiosxi/cron/sysstat.php
nagios   31464 31456  0 08:54 ?        00:00:00 /usr/bin/php -q /usr/local/nagiosxi/cron/perfdataproc.php
postgres 31465  1646  0 08:54 ?        00:00:00 postgres: nagiosxi nagiosxi [local] idle          
postgres 31471  1646  0 08:54 ?        00:00:00 postgres: nagiosxi nagiosxi [local] idle          
postgres 31478  1646  0 08:54 ?        00:00:00 postgres: nagiosxi nagiosxi [local] idle          
postgres 31483  1646  0 08:54 ?        00:00:00 postgres: nagiosxi nagiosxi [local] idle          
postgres 31489  1646  0 08:54 ?        00:00:00 postgres: nagiosxi nagiosxi [local] idle          
nagios   32262 14254  0 08:54 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -H 10.220.3.96 -t 30 -c reboot_server -a Sunday 04:00 off
nagios   32264 14252  0 08:54 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -H 10.220.3.55 -t 45 -c check_drivesize -a filter=type = 'fixed' and drive regexp '.*[C-Z].*' warn=free lt 20% crit=free lt 10%
nagios   32265 14250  0 08:54 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -H 10.220.68.16 -t 45 -c check_drivesize -a filter=type = 'fixed' and drive regexp '.*[C-Z].*' warn=free lt 20% crit=free lt 10%
nagios   32270 14252  0 08:54 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -H 10.220.148.16 -t 45 -c check_drivesize -a filter=type = 'fixed' and drive regexp '.*[C-Z].*' warn=free lt 15% crit=free lt 10%
nagios   32272 14249  0 08:54 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -H 10.220.2.166 -t 45 -c check_drivesize -a filter=type = 'fixed' and drive regexp '.*[C-Z].*' warn=free lt 20% crit=free lt 10%
nagios   32280 14254  0 08:54 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -H 10.220.1.49 -t 45 -c check_drivesize -a filter=type = 'fixed' and drive regexp '.*[C-Z].*' warn=free lt 20% crit=free lt 10%
nagios   32281 14253  0 08:54 ?        00:00:00 /usr/local/nagios/libexec/check_icmp -H 10.222.1.127 -w 3000.0 80  -c 5000.0 100  -p 5
root     32282 31305  0 08:54 pts/0    00:00:00 ps -ef
apache   32706  1755  3 05:25 ?        00:07:07 /usr/sbin/httpd
postgres 32725  1646  0 05:25 ?        00:00:10 postgres: nagiosxi nagiosxi [local] idle
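
A listing this long is easier to scan when sorted by cumulative CPU time. The sketch below does that for `ps -ef` output read from standard input; `top_cpu_time` is a hypothetical helper name, and it assumes the TIME column is the 7th field in plain HH:MM:SS form, as it is in the listing above. In this listing the top entries are mysqld (03:36:29), splunkd (02:46:54) and flush-253:2 (01:26:24).

```shell
#!/bin/sh
# Hypothetical helper: show the N processes with the most cumulative CPU time.
# Reads `ps -ef` output on stdin; N defaults to 5.
# Assumes TIME is field 7 and formatted HH:MM:SS. Values with a day prefix
# (for example 1-02:03:04) would not sort correctly with this simple approach.
top_cpu_time() {
    awk 'NR > 1 { print $7, $0 }' |   # skip the header, prefix each line with TIME
        LC_ALL=C sort -r |            # largest TIME first
        head -n "${1:-5}" |           # keep the top N
        cut -d' ' -f2-                # drop the prefix, leaving the original line
}
```

Usage would be something like `ps -ef | top_cpu_time 10`.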

Post by **mhixson2** » Wed Dec 16, 2015 2:49 pm

I stopped Splunk on the Nagios server and confirmed that the 6-second load time and the 30+ second check/notification recovery both still exist.
Let me know if you have any ideas.

Thanks

Post by **hsmith** » Wed Dec 16, 2015 2:52 pm

You're running Splunk on the same server that you are running Nagios XI? We only officially support clean minimal installation systems. We can't speak for what changes Splunk may have made to the system that could be causing performance issues. Please correct me if I am not understanding your setup correctly.

Post by **mhixson2** » Wed Dec 16, 2015 3:45 pm

Understood. And you are correct - same server. However, the start of these issues did not correspond with the Splunk integration. They've been happily running side by side for many months.

I'm afraid I may have some corruption or something else going on in one of the databases. I didn't think it was related until just now, but the root partition of this server ran out of space several weeks ago, which caused the OS to crash, and upon recovery I found the Nagios MySQL databases non-functional. I called in and you guys walked me through running the repair_databases shell script, which fixed it. I'm wondering if there's something left over from that... I'm not even sure where to look or how to tell. I'll look into where those logs are.

I have deleted the reoccurring downtime, and Nagios' performance is back to normal. So, does that config get stored in one of the MySQL databases? Have you heard of this issue before when reoccurring downtime is set up for ~680 hosts?

Any help is appreciated.

Thanks

Post by **mhixson2** » Wed Dec 16, 2015 4:12 pm

To your point - I am going to investigate getting this Splunk installation off of the Nagios server. I didn't like it from the beginning, and the integration said that's the way it must be done, but I don't believe it. Splunk should be able to connect and get what it needs without anything running on the server. I'll see where that takes me. In the meantime, as I said, any help is appreciated.

Thanks.

Post by **ssax** » Thu Dec 17, 2015 2:03 pm

Generally, when a disk fills up, we also see crashed tables reported in /var/log/mysqld.log, and that could be the problem here. Take a look in there and see if any crashed tables are listed; that's usually the culprit behind sudden-onset slowness.
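
If it helps, here is a minimal sketch of that check. `crashed_tables` is a hypothetical helper name (it is not part of Nagios XI or MySQL), and the two message patterns are ones MyISAM commonly logs for damaged tables, so treat the result as a starting point rather than an exhaustive list.

```shell
#!/bin/sh
# Hypothetical helper: print each distinct table that mysqld reported as damaged.
# Defaults to the log path mentioned in this thread; pass another path as $1.
crashed_tables() {
    log="${1:-/var/log/mysqld.log}"
    # Keep only lines that report a damaged table, then pull out the quoted
    # table path and de-duplicate it.
    grep -E "Incorrect key file for table|is marked as crashed" "$log" 2>/dev/null |
        sed -n "s/.*[Tt]able '\([^']*\)'.*/\1/p" |
        sort -u
}
```

Running `crashed_tables` as root prints one table path per line, or nothing if no such messages were logged. It only matches messages that sit on a single line in the log file.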

Post by **mhixson2** » Thu Dec 17, 2015 2:41 pm

Thanks!

I ended up looking through that log this morning and found the events below. The events through 12/13 correspond with the server running out of space, and me cleaning it up and running the repair databases shell script. All was well after that on the server and in this log. The only thing that doesn't correspond with that outage is the 'lost+found' events. Those were happening up to this morning. I went ahead and ran the repair databases script again and it completed successfully. No events have been recorded in the log since the completion of that script.

Code:

151212 20:10:15 [Warning] Disk is full writing '/tmp/ST3pL5Kr' (Errcode: 28). Waiting for someone to free space... (Expect up to 60 secs delay for server to continue after freeing disk space)
151212 20:10:15 [Warning] Retry in 60 secs. Message reprinted in 600 secs
151212 20:20:15 [Warning] Disk is full writing '/tmp/ST3pL5Kr' (Errcode: 28). Waiting for someone to free space...
151213 00:10:03 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
151213  0:10:06  InnoDB: Initializing buffer pool, size = 8.0M
151213  0:10:06  InnoDB: Completed initialization of buffer pool
151213  0:10:10  InnoDB: Started; log sequence number 0 44233
151213  0:10:11 [Note] Event Scheduler: Loaded 0 events
151213  0:10:11 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.1.73'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  Source distribution
151213  0:21:56 [ERROR] /usr/libexec/mysqld: Incorrect key file for table './nagios/nagios_logentries.MYI'; try to repair it
151213  0:21:56 [ERROR] /usr/libexec/mysqld: Incorrect key file for table './nagios/nagios_logentries.MYI'; try to repair it
151213  0:25:13 [ERROR] /usr/libexec/mysqld: Incorrect key file for table './nagios/nagios_logentries.MYI'; try to repair it
151213  7:00:03 [ERROR] Invalid (old?) table or database name 'lost+found'
151214  7:00:01 [ERROR] Invalid (old?) table or database name 'lost+found'
151215  7:00:01 [ERROR] Invalid (old?) table or database name 'lost+found'
151216  7:00:02 [ERROR] Invalid (old?) table or database name 'lost+found'
151216 10:08:18 [ERROR] Invalid (old?) table or database name 'lost+found'
151217  7:00:01 [ERROR] Invalid (old?) table or database name 'lost+found'

So, the log appears to be happy with the way MySQL is running, and Nagios certainly isn't showing any errors. However, the slowness is still present. Again this morning I deleted all scheduled downtime (in the Mass Acknowledge tool), and again it performed perfectly after that. I still have my scheduled downtime configured (I never deleted those entries, just the downtime on the hosts). Over time, they re-populate and Nagios begins to slow down.
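
One way to watch the entries pile up again is to count the downtime and comment blocks that Nagios Core writes to its status file. This is only a sketch: `count_status_blocks` is a hypothetical helper name, and the default path /usr/local/nagios/var/status.dat is an assumption about this install (check `status_file` in nagios.cfg).

```shell
#!/bin/sh
# Hypothetical helper: count downtime and comment blocks in a Nagios status file.
# The default path is an assumption; pass the real status file as $1 if it differs.
count_status_blocks() {
    file="${1:-/usr/local/nagios/var/status.dat}"
    for kind in hostdowntime servicedowntime hostcomment servicecomment; do
        # Each entry starts with a line such as "hostdowntime {".
        n=$(grep -c "^${kind} {" "$file" 2>/dev/null)
        printf '%s %s\n' "$kind" "${n:-0}"
    done
}
```

Running it before and after clearing the downtime, and again a few hours later, would show whether the counts grow well beyond what ~680 hosts should produce.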

What else is involved with scheduling reoccurring downtime? Does it write that to a config or table?

Thanks

Nagios Support Forum