For about two hours this morning, we've been seeing the message queue fill up and BPI checks time out.
No apparent changes were made that would explain the change in Nagios' behaviour.
Initial troubleshooting showed the MySQL database requesting a repair, which was run, but even after a service restart and a reboot, Nagios is still behaving erratically.
System info:
Nagios XI VI (specifically, Linux cawlkl21.xxx 2.6.32-754.6.3.el6.x86_64 #1 SMP Tue Oct 9 17:27:49 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux)
System profile attached
I took a quick look at https://support.nagios.com/kb/article/n ... d-139.html, but I don't know why we would suddenly have a queue processing issue when we haven't added a ton of checks. It seems like something else isn't functioning properly and is keeping checks from getting through the queue, but I'm not sure where to look.
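For anyone diagnosing the same symptom later: one quick way to see whether the kernel message queue NDO uses is actually backing up is to inspect it directly. This is a hedged sketch, not from the KB article above; the "nagios" owner match and the EL6 default limit are assumptions to verify on your own system.

```shell
# Sketch: show how full each nagios-owned System V message queue is.
# Assumes the util-linux ipcs column order: key msqid owner perms used-bytes messages
msgmnb=$(cat /proc/sys/kernel/msgmnb)   # per-queue byte limit (16384 by default on EL6)
ipcs -q | awk -v max="$msgmnb" \
    '$3 == "nagios" { printf "msqid %s: %.0f%% full (%s msgs)\n", $2, 100 * $5 / max, $6 }'
```

If the percentage climbs steadily even though check volume hasn't changed, the consumer (ndo2db) is falling behind rather than the producer flooding.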
TIA for your help
I've run the check and attached the output txt. We performed an upgrade back on August 23rd to 5.3.5. Our CPU usage is also spiking, averaging 95-97%. I've included a screenshot from the admin console and one of top in the CLI.
@perric, please run the following commands in order. This will reset all major Nagios processes.
service nagios stop
service ndo2db stop
service mysqld stop
service crond stop
service httpd stop
killall -9 nagios
killall -9 ndo2db
rm -f /usr/local/nagios/var/rw/nagios.cmd
rm -f /usr/local/nagios/var/nagios.lock
rm -f /usr/local/nagios/var/ndo.sock
rm -f /usr/local/nagios/var/ndo2db.lock
rm -f /usr/local/nagiosxi/var/reconfigure_nagios.lock
for i in `ipcs -q | grep nagios |awk '{print $2}'`; do ipcrm -q $i; done
service mysqld start
service ndo2db start
service nagios start
service httpd start
service crond start
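After running the sequence above, it may be worth confirming that the stale queues really went away and the daemons came back up. A small sketch, using the same assumed EL6 paths as the commands above:

```shell
# Sketch: verify the reset took effect before re-checking the BPI timeouts.
ipcs -q | grep nagios || echo "no stale nagios message queues"
service nagios status                       # should report a fresh PID
service ndo2db status
tail -n 5 /usr/local/nagios/var/nagios.log  # watch for startup/NDO errors
```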
Are the BPI checks still timing out? What other issues are you experiencing right now besides the CPU load?
OK, I'll look into running that. FYI, we had already rebooted once after the DB repair. As for BPI, the checks seem to be working now; they were previously timing out after ~70 seconds.
We are still seeing the queue length increase; at the current rate, it looks like it'll be full about 30 minutes after we ran the clean/restart commands.
Nagios appears to be functioning properly, but I'm sure there must be some impact, and /var/log/messages is flooding with "unable to write to queue" errors.
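One stopgap that sometimes buys time while the root cause is chased down (a workaround, not a fix, and the values here are illustrative rather than a Nagios recommendation) is raising the kernel queue limits so the queue takes longer to fill:

```shell
# Illustrative values only; persist in /etc/sysctl.conf if they help.
sysctl -w kernel.msgmnb=131072   # bytes per queue (EL6 default is 16384)
sysctl -w kernel.msgmax=65536    # max size of a single message
```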
Regarding 5.5.5: we can upgrade, but we need to go through our test environments first, and we'll need to fix this problem before looking at the upgrade.
We've run everything as listed, but it didn't help. Are there any other logs or extracts we can run to get more info? Nagios has become unresponsive to commands again (scheduled downtime, forced service checks, etc.). Please advise, as this has halted monitoring for us.
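To your question about other extracts: bucketing the queue errors in syslog by minute can show exactly when the backlog starts and whether it correlates with anything scheduled. A sketch, where the grep pattern is a guess at the exact wording in your /var/log/messages:

```shell
# Sketch: count queue-write errors per minute to spot when the backlog began.
grep -i 'unable to write to queue' /var/log/messages |
    awk '{ print $1, $2, substr($3, 1, 5) }' |   # month day HH:MM
    uniq -c | tail -n 20
```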
XI 5.5.5 includes some performance enhancements and a fix for a BPI component issue that caused high load, problems with NDO, and the kernel message queue filling up.
5.5.5 - 10/11/2018
==================
- Fixed adding new user creating a message that says current user should update their API key if they haven't yet -JO
- Fixed login link on rapid response URL when a ticket does not exist or has expired -JO
- Fixed status check for NDO in BPI component API tool so that it properly sleeps after each call -JO
- Fixed audit log max age value undefined default to 180 instead of 30 and added to performance settings -JO
- Fixed an issue where notification settings would sometimes display incorrectly [TPS#13613] -SAW
- Fixed an issue where hosts/services with forward-slashes ("/") in their names would not reconfigure correctly [TPS#13607] -SAW
- Fixed various PHP notices in error log -JO
- Fixed issue with SLA report links not going to external (or program url if external is empty) when PDF is generated [TPS#13619] -JO
- Fixed logging scheduled reporting pdf generation to wkhtmltox.log -JO
- Fixed issue with reports/pages missing data in PDFs [TPS#13628] -JO
- Fixed user permissions on non-active objects causing large/slow SQL queries on some systems -JO
The upgrade should fix your issue.
I've disabled all BPI Nagios checks for now; as I mentioned above, I cannot just upgrade our production system to 5.5.5 without testing.
I'll monitor it and see if the issue recurs.
Edit: even with the Nagios BPI checks disabled, the queue is still filling. What's the fastest way to completely disable BPI without losing all of the work in setting it up? Move the BPI config to a temp location?
Edit 2: we're on 5.5.3, not 5.3.5 as I stated earlier; which version was this issue introduced?
Last edited by perric on Fri Oct 26, 2018 3:05 pm, edited 2 times in total.