For about two hours this morning, we've been seeing the message queue fill up and BPI checks time out.
No apparent changes were made that would explain the change in Nagios' behaviour.
Initial troubleshooting showed the MySQL database requesting a repair, which was run, but even after a service restart and a reboot, Nagios is still behaving erratically.
System info:
Nagios XI VI (specifically, Linux cawlkl21.xxx 2.6.32-754.6.3.el6.x86_64 #1 SMP Tue Oct 9 17:27:49 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux)
System profile attached
I took a quick look at https://support.nagios.com/kb/article/n ... d-139.html, but I don't know why we would suddenly have a queue processing issue when we haven't added a ton of checks. It seems like something else isn't functioning properly and is keeping checks from getting through the queue, but I'm not sure where to look.
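For anyone diagnosing the same symptom later: one quick way to see whether the kernel message queue NDO uses is actually backing up is to inspect it directly. This is a hedged sketch, not from the KB article above; the "nagios" owner match and the EL6 default limit are assumptions to verify on your own system.

```shell
# Sketch: show how full each nagios-owned System V message queue is.
# Assumes the util-linux ipcs column order: key msqid owner perms used-bytes messages
msgmnb=$(cat /proc/sys/kernel/msgmnb)   # per-queue byte limit (16384 by default on EL6)
ipcs -q | awk -v max="$msgmnb" \
    '$3 == "nagios" { printf "msqid %s: %.0f%% full (%s msgs)\n", $2, 100 * $5 / max, $6 }'
```

If the percentage climbs steadily even though check volume hasn't changed, the consumer (ndo2db) is falling behind rather than the producer flooding.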
TIA for your help
I've run the check and attached the output txt. We performed an upgrade back on August 23rd to 5.3.5. Our CPU usage is also spiking, averaging 95-97%. I've included a screenshot from the admin console and one of top in the CLI.
@perric, please run the following commands in order. This will reset all major Nagios processes.
service nagios stop
service ndo2db stop
service mysqld stop
service crond stop
service httpd stop
killall -9 nagios
killall -9 ndo2db
rm -f /usr/local/nagios/var/rw/nagios.cmd
rm -f /usr/local/nagios/var/nagios.lock
rm -f /usr/local/nagios/var/ndo.sock
rm -f /usr/local/nagios/var/ndo2db.lock
rm -f /usr/local/nagiosxi/var/reconfigure_nagios.lock
for i in `ipcs -q | grep nagios |awk '{print $2}'`; do ipcrm -q $i; done
service mysqld start
service ndo2db start
service nagios start
service httpd start
service crond start
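After running the sequence above, it may be worth confirming that the stale queues really went away and the daemons came back up. A small sketch, using the same assumed EL6 paths as the commands above:

```shell
# Sketch: verify the reset took effect before re-checking the BPI timeouts.
ipcs -q | grep nagios || echo "no stale nagios message queues"
service nagios status                       # should report a fresh PID
service ndo2db status
tail -n 5 /usr/local/nagios/var/nagios.log  # watch for startup/NDO errors
```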
Are the BPI checks still timing out? What other issues are you experiencing right now besides the CPU load?
OK, I'll look into running that. FYI, we had already rebooted once after the DB repair. As for BPI, the checks seem to be working now; they were previously timing out after ~70 seconds.
We are still seeing the queue length increase; at the current rate, it looks like it'll be full about 30 minutes after we ran the clean/restart commands.
Nagios appears to be functioning properly, but I'm sure there must be some impact, and /var/log/messages is flooding with "unable to write to queue" errors.
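One stopgap that sometimes buys time while the root cause is chased down (a workaround, not a fix, and the values here are illustrative rather than a Nagios recommendation) is raising the kernel queue limits so the queue takes longer to fill:

```shell
# Illustrative values only; persist in /etc/sysctl.conf if they help.
sysctl -w kernel.msgmnb=131072   # bytes per queue (EL6 default is 16384)
sysctl -w kernel.msgmax=65536    # max size of a single message
```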
Regarding 5.5.5: we can upgrade, but we need to go through our test environments first, and we'll need to fix this problem before looking at the upgrade.
We've run everything as listed, but it didn't help. Are there any other logs or extracts we can run to get more info? Nagios has become unresponsive to commands again (scheduled downtime, forced service checks, etc.). Please advise, as this has halted monitoring for us.
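To your question about other extracts: bucketing the queue errors in syslog by minute can show exactly when the backlog starts and whether it correlates with anything scheduled. A sketch, where the grep pattern is a guess at the exact wording in your /var/log/messages:

```shell
# Sketch: count queue-write errors per minute to spot when the backlog began.
grep -i 'unable to write to queue' /var/log/messages |
    awk '{ print $1, $2, substr($3, 1, 5) }' |   # month day HH:MM
    uniq -c | tail -n 20
```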
XI 5.5.5 includes some performance enhancements and a fix for a BPI component issue that caused high load, problems with NDO, and the kernel message queue filling up.
5.5.5 - 10/11/2018
==================
- Fixed adding new user creating a message that says current user should update their API key if they haven't yet -JO
- Fixed login link on rapid response URL when a ticket does not exist or has expired -JO
- Fixed status check for NDO in BPI component API tool so that it properly sleeps after each call -JO
- Fixed audit log max age value undefined default to 180 instead of 30 and added to performance settings -JO
- Fixed an issue where notification settings would sometimes display incorrectly [TPS#13613] -SAW
- Fixed an issue where hosts/services with forward-slashes ("/") in their names would not reconfigure correctly [TPS#13607] -SAW
- Fixed various PHP notices in error log -JO
- Fixed issue with SLA report links not going to external (or program url if external is empty) when PDF is generated [TPS#13619] -JO
- Fixed logging scheduled reporting pdf generation to wkhtmltox.log -JO
- Fixed issue with reports/pages missing data in PDFs [TPS#13628] -JO
- Fixed user permissions on non-active objects causing large/slow SQL queries on some systems -JO
The upgrade should fix your issue.
I've disabled all BPI Nagios checks for now; as I mentioned above, I cannot just upgrade our production system to 5.5.5 without testing.
I'll monitor it and see if the issue recurs.
Edit: even with the Nagios BPI checks disabled, the queue is still filling. What's the fastest way to completely disable BPI without losing all of the work in setting it up? Move the BPI config to a temp location?
Edit 2: we're on 5.5.3, not 5.3.5 as I stated earlier; which version was this issue introduced?
Last edited by perric on Fri Oct 26, 2018 3:05 pm, edited 2 times in total.