NSClient++ 0.4.3.143-x64
NRPE checks with a fair amount of external scripts
I configured reoccurring downtime for all of my hosts late last week and ever since then performance has been very slow via the web UI. It was never slow before that.
I've been watching server performance:
CPU utilization is very low - typically under 10% with occasional spikes to 50 or 60%, but the spikes seem to correspond with background processes, not Apache demand.
Memory use is at 10.5 of 12GB and holding.
I/O wait is currently 0.05% but this spikes hard when I apply a new config and takes a minute or two to recover.
Load (pulled from the test Nagios server, which is watching prod) is currently: load1=0.43, load5=0.61, load15=0.9 and it rarely spikes much above 5%.
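For anyone comparing these load numbers between servers, it can help to normalize them by core count. The snippet below is only a minimal sketch: the helper name `load_per_core` is made up for illustration, and the core count used in the usage note is an assumption, since the number of cores on this server was not mentioned.

```python
import os


def load_per_core(load_averages, cpu_count):
    """Divide each load average by the number of CPU cores.

    A value near or above 1.0 per core suggests the CPUs are saturated;
    values well below 1.0 suggest the bottleneck is elsewhere.
    """
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return tuple(value / cpu_count for value in load_averages)


if __name__ == "__main__":
    # On the Nagios server itself, read the live values (Unix only).
    cores = os.cpu_count() or 1
    print(load_per_core(os.getloadavg(), cores))
```

Calling `load_per_core((0.43, 0.61, 0.9), 2)` with the figures above and a hypothetical two-core machine gives values well under 1.0 per core, which would be consistent with the CPU not being the bottleneck.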
Is the performance issue related to having reoccurring downtime scheduled for all hosts? I have also noticed that, as I navigate, the messages on each service and host noting the upcoming downtime take some time to load.
Let me know if you need any other info. I need to get this thing back to its snappy self!
The Program URL had not been reverted after some SSL testing that I had started and then backed out of. It was https://10.220.102.42/nagiosxi/, and I have now set it back to http://10.220.102.42/nagiosxi/. Even with this reverted, the problem persists.
Examples of the issue:
Click on unhandled problems on the home screen (11 warnings in this situation)
The borders and menus on the new page load immediately
The service list takes about 6 seconds to display, during which a pinwheel spins on the page
After applying config changes, active service checks, active host checks, and notifications do not come up immediately
This is monitored by watching the six system status indicators in the top right of the page
So, the first three will be green right after applying the config, but the last three take 30+ seconds to return to an OK/green state
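To put a number on delays like the ones described above, rather than counting seconds by eye, a request can be timed from a script. This is a rough sketch only: `time_request` is a hypothetical helper, the URL that actually serves the service list is not known from this thread, and a real Nagios XI page would also need an authenticated session, which is not handled here.

```python
import time
import urllib.request


def time_request(url, fetch=None, timeout=30):
    """Return (elapsed_seconds, body_length) for one request to url.

    fetch is an optional callable taking the URL and returning bytes;
    it defaults to a plain unauthenticated urllib GET.
    """
    if fetch is None:
        def fetch(target):
            with urllib.request.urlopen(target, timeout=timeout) as response:
                return response.read()

    start = time.perf_counter()
    body = fetch(url)
    elapsed = time.perf_counter() - start
    return elapsed, len(body)
```

Running it a few times before and after removing the downtime entries would give comparable timings for the same page.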
Here is the list of running processes. It reminded me that we do have another installation on this box - Splunk. I forgot about that. It reads perfdata files and some of the ndoutils MySQL tables and sends the data off to our Splunk server for graphing/metrics. The perfdata forwarding is pretty much real time, while the MySQL reads are done every 60 minutes. Let me know if you think this could be causing a conflict. It should be noted that the performance issue was not present until I set up the scheduled downtime (increased MySQL activity?) and that the Splunk forwarding had been set up for months before that with no noticeable problems.
I stopped Splunk on the Nagios server and confirmed that the 6-second load time and the 30-second check/notification recovery both still exist.
Let me know if you have any ideas.
You're running Splunk on the same server that you are running Nagios XI? We only officially support clean minimal installation systems. We can't speak for what changes Splunk may have made to the system that could be causing performance issues. Please correct me if I am not understanding your setup correctly.
Understood. And you are correct - same server. However, the start of these issues did not correspond with the Splunk integration. They've been happily running side by side for many months.
I'm afraid I may have some corruption or something else going on in one of the databases. I didn't think it was related until just now, but the root partition of this server ran out of space several weeks ago, which caused the OS to crash, and upon recovery, I found the Nagios MySQL databases non-functional. I called in and you guys walked me through running the repair_databases shell script, which fixed it. I'm wondering if there's something left over from that... I'm not even sure where to look or how to tell. I'll look into where those logs are.
I have deleted the reoccurring downtime, and Nagios' performance is back to normal. So, does that config get stored in one of the MySQL databases? Have you heard of this issue before when reoccurring downtime is set up for ~680 hosts?
To your point - I am going to investigate getting this Splunk installation off of the Nagios server. I didn't like it from the beginning, and the integration said that's the way it must be done, but I don't believe it. Splunk should be able to connect and get what it needs without anything running on the server. I'll see where that takes me. In the meantime, as I said, any help is appreciated.
Generally, when we see disk space fill up, we also see crashed tables reported in /var/log/mysqld.log. That could be the problem, so take a look in there and see if any crashed tables are listed. That's usually the culprit behind sudden-onset slowness.
I ended up looking through that log this morning and found the events below. The events through 12/13 correspond with the server running out of space, and me cleaning it up and running the repair databases shell script. All was well after that on the server and in this log. The only thing that doesn't correspond with that outage is the 'lost+found' events. Those were happening up to this morning. I went ahead and ran the repair databases script again and it completed successfully. No events have been recorded in the log since the completion of that script.
151212 20:10:15 [Warning] Disk is full writing '/tmp/ST3pL5Kr' (Errcode: 28). Waiting for someone to free space... (Expect up to 60 secs delay for server to continue after freeing disk space)
151212 20:10:15 [Warning] Retry in 60 secs. Message reprinted in 600 secs
151212 20:20:15 [Warning] Disk is full writing '/tmp/ST3pL5Kr' (Errcode: 28). Waiting for someone to free space...
151213 00:10:03 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
151213 0:10:06 InnoDB: Initializing buffer pool, size = 8.0M
151213 0:10:06 InnoDB: Completed initialization of buffer pool
151213 0:10:10 InnoDB: Started; log sequence number 0 44233
151213 0:10:11 [Note] Event Scheduler: Loaded 0 events
151213 0:10:11 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.1.73' socket: '/var/lib/mysql/mysql.sock' port: 3306 Source distribution
151213 0:21:56 [ERROR] /usr/libexec/mysqld: Incorrect key file for table './nagios/nagios_logentries.MYI'; try to repair it
151213 0:21:56 [ERROR] /usr/libexec/mysqld: Incorrect key file for table './nagios/nagios_logentries.MYI'; try to repair it
151213 0:25:13 [ERROR] /usr/libexec/mysqld: Incorrect key file for table './nagios/nagios_logentries.MYI'; try to repair it
151213 7:00:03 [ERROR] Invalid (old?) table or database name 'lost+found'
151214 7:00:01 [ERROR] Invalid (old?) table or database name 'lost+found'
151215 7:00:01 [ERROR] Invalid (old?) table or database name 'lost+found'
151216 7:00:02 [ERROR] Invalid (old?) table or database name 'lost+found'
151216 10:08:18 [ERROR] Invalid (old?) table or database name 'lost+found'
151217 7:00:01 [ERROR] Invalid (old?) table or database name 'lost+found'
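For a longer log, tallying the [ERROR] entries can make it easier to see which messages repeat and which were one-off events. The following is a minimal sketch, assuming each entry sits on a single line in the `YYMMDD H:MM:SS [LEVEL] message` layout seen in the excerpt above; the function name `summarize_errors` is made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
# 151213  7:00:03 [ERROR] Invalid (old?) table or database name 'lost+found'
LINE_RE = re.compile(r"^(\d{6})\s+(\d{1,2}:\d{2}:\d{2})\s+\[(\w+)\]\s+(.*)$")


def summarize_errors(lines):
    """Count how often each distinct [ERROR] message appears."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group(3) == "ERROR":
            counts[match.group(4)] += 1
    return counts


if __name__ == "__main__":
    # Path taken from the support reply earlier in this thread.
    with open("/var/log/mysqld.log", errors="replace") as log_file:
        for message, count in summarize_errors(log_file).most_common():
            print(count, message)
```

Lines without a bracketed level, such as the mysqld_safe and InnoDB startup messages, are simply skipped.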
So, the log appears to be happy with the way MySQL is running, and Nagios certainly isn't showing any errors. However, the slowness is still present. Again this morning I deleted all scheduled downtime (in the Mass Acknowledge tool), and again it performed perfectly after that. I still have my scheduled downtime configured (I never deleted those entries - just the downtime on the hosts). Over time, they re-populate and Nagios begins to slow down.
What else is involved with scheduling reoccurring downtime? Does it write that to a config or table?