Message queues can't keep up

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
tmvision
Posts: 32
Joined: Fri Dec 01, 2017 8:15 am


Hi

We are having problems with our XI installation again (see our previous topic for details, https://support.nagios.com/forum/viewto ... 16&t=46570): our message queue can't keep up.
This results in our "Last check" times stalling, as previously described. We have increased the message queue limits as described in https://support.nagios.com/kb/article.php?id=139. What is the cause of this problem?
We are handling 700 passive checks. Is this too much for a single installation?
dwhitfield
Former Nagios Staff
Posts: 4583
Joined: Wed Sep 21, 2016 10:29 am
Location: NoLo, Minneapolis, MN

Definitely not too many, if that 700 is accurate.

What's the output of sysctl -p? We could probably tune some things not in that article.

Can you PM me your profile? You can download it by going to Admin > System Config > System Profile and clicking the ***Download Profile*** button towards the top. If for whatever reason you *cannot* download the profile, please put the output of View System Info (5.3.4+, Show Profile if older) in the thread (that will at least get us some info). The profile gives us access to many of the logs we would otherwise ask for individually. If security is a concern, you can unzip the profile, take out what you like, and then zip it up again. We may end up needing something you remove, but we can ask for that specifically.

You can also generate a profile manually using the script at /usr/local/nagiosxi/html/includes/components/profile/getprofile.sh

That should generate a profile in /usr/local/nagiosxi/var/components/ which you can get off the server with an application such as FileZilla.

After you PM the profile, please update this thread. Updating this thread is the only way for it to show back up on our dashboard.

If you get an error that PROFILE BUILD FAILED, please see https://support.nagios.com/kb/article.p ... ategory=44
tmvision

I've sent the profile in a PM.

We've enabled slow-query logging for MariaDB, as we suspected this might be caused by the database hanging.
It has recorded a few entries.
The "worst" entry logged is

SELECT /*!40001 SQL_NO_CACHE */ * FROM `nagios_externalcommands`

which takes up to ~6 minutes to process. However, this seems unlikely to be the cause, as our XI interface often freezes for 20 minutes or more.

> sudo sysctl -p
kernel.msgmnb = 262144000
kernel.msgmax = 262144000
kernel.shmmax = 4294967295
kernel.shmall = 268435456
kernel.msgmni = 512000
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am


@dwhitfield is out today and tomorrow. PM me the profile when you have the chance and I will be able to review it.

UPDATE: Profile received and shared with techs
Last edited by dwhitfield on Fri Jan 26, 2018 10:13 am, edited 1 time in total.
Reason: pm received
tmvision

I've re-sent the profile to both of you.
dwhitfield

Those numbers look like you took them from our document. That document is pretty conservative, though: we double the defaults, but I've seen customers use 10x the defaults. You will want to try increasing them further.
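
To make the arithmetic concrete, here is a minimal Python sketch that takes the `sysctl -p` output posted earlier in this thread and scales the two byte-size limits. The function name `scale_msg_limits`, the choice to touch only kernel.msgmnb and kernel.msgmax, and the `ceiling` cap (assuming the kernel stores these as signed 32-bit integers) are illustrative assumptions, not something taken from the KB article. Check the limits on your own kernel before applying anything.

```python
def scale_msg_limits(sysctl_text, factor, ceiling=2**31 - 1):
    """Return new 'key = value' lines for kernel.msgmnb and kernel.msgmax.

    ceiling is an ASSUMPTION (signed 32-bit maximum); verify it for your kernel.
    """
    scaled = []
    for line in sysctl_text.splitlines():
        key, sep, value = line.partition("=")
        key = key.strip()
        # Only the two byte-size limits are scaled; other keys are left alone.
        if sep and key in ("kernel.msgmnb", "kernel.msgmax"):
            new_value = min(int(value.strip()) * factor, ceiling)
            scaled.append(f"{key} = {new_value}")
    return scaled


# Values as posted earlier in this thread.
current = """\
kernel.msgmnb = 262144000
kernel.msgmax = 262144000
kernel.shmmax = 4294967295
kernel.shmall = 268435456
kernel.msgmni = 512000
"""

for line in scale_msg_limits(current, factor=2):
    print(line)
```

The printed lines could then be placed in /etc/sysctl.conf and loaded with sysctl -p, in the same way as the values from the article.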
tmvision

Increasing the kernel parameters will allow for a larger queue, but I'm not sure that will resolve our immediate concern, namely that messages are delayed for up to 30 minutes before showing in the web interface.
Of course we don't want to risk dropping messages because of full queues, but we aren't seeing signs of this happening. We only see the message queue growing for a while, until the messages start being processed again.

Which service(s) process the queue? How can we tweak them to keep the queue empty (or near empty)?
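
For reference, the queue growth described above can be watched with `ipcs -q`, which lists each kernel message queue with its used bytes and message count. Below is a minimal Python sketch that parses that output. It assumes the usual six-column util-linux layout; the helper names and the sample row (key, owner, numbers) are made up for illustration and are not taken from our system.

```python
import subprocess


def parse_ipcs_queues(text):
    """Parse `ipcs -q` output into a list of dicts, one per message queue."""
    queues = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have six columns and start with a hex key such as 0x4e010001;
        # the banner and the column header line do not.
        if len(fields) == 6 and fields[0].startswith("0x"):
            queues.append({
                "msqid": int(fields[1]),
                "owner": fields[2],
                "used_bytes": int(fields[4]),
                "messages": int(fields[5]),
            })
    return queues


def current_queues():
    """Run `ipcs -q` on this host and parse what it prints."""
    out = subprocess.run(["ipcs", "-q"], capture_output=True, text=True, check=True)
    return parse_ipcs_queues(out.stdout)


# Hypothetical sample in the assumed layout (values are invented).
sample = """\
------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0x4e010001 0          nagios     600        131072       512
"""

for q in parse_ipcs_queues(sample):
    print(f"msqid={q['msqid']} owner={q['owner']} messages={q['messages']} bytes={q['used_bytes']}")
```

Calling `current_queues()` every minute or so and logging the `messages` column would show when the backlog starts and when it drains.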
tmvision

I took a look at dbmaint.log, which might give some hints (it should be attached to this post).
To me it looks like it gets stuck for a while during "OPTIMIZE TABLE nagios_logentries".

The lines "LOCKFILE '/usr/local/nagiosxi/var/dbmaint.lock' EXISTS - EXITING!" correspond roughly to when the interface stops processing new messages. Processing only starts again a while later (around the time "Repair Complete: FAILED TO REMOVE LOCK FILE" is logged). I guess that's when it is done optimizing the table?

Can we safely increase the optimize interval, or could this introduce other issues? Maybe we could have it optimize once every night, when fewer people are affected by the delays.
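
A rough way to quantify these stalls from the log is to count how many runs were skipped because of the lock file before each "Repair Complete" line. The Python sketch below does that using only the two messages quoted above. The rest of the log layout is an assumption, and `skipped_runs_per_stall` is a hypothetical helper, not part of XI. Multiplying each count by the interval at which dbmaint is scheduled (not stated in this thread) gives an estimate of the stall length.

```python
SKIP_MARKER = "EXISTS - EXITING!"   # logged when a run finds the lock file
DONE_MARKER = "Repair Complete"     # logged when the long-running instance finishes


def skipped_runs_per_stall(log_lines):
    """Return the number of lock-skipped runs preceding each completion line."""
    stalls = []
    skipped = 0
    for line in log_lines:
        if SKIP_MARKER in line:
            skipped += 1
        elif DONE_MARKER in line:
            if skipped:
                stalls.append(skipped)
            skipped = 0
    if skipped:
        # Skips at the end of the log belong to a stall that has not finished yet.
        stalls.append(skipped)
    return stalls


skip = "LOCKFILE '/usr/local/nagiosxi/var/dbmaint.lock' EXISTS - EXITING!"
done = "Repair Complete: FAILED TO REMOVE LOCK FILE"

print(skipped_runs_per_stall([done, skip, skip, skip, done, skip]))
```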
dwhitfield

tmvision wrote: Maybe we could have it optimize once every night, when fewer people are affected by the delays.
The best optimization schedule differs a little from environment to environment. However, this is a suggestion we have proposed in the past in similar situations. You should give it a shot and see if it works for you.
tmvision

We have now configured it to optimize the NDOUtils database only once every night. We will let it run for a few days and see if this solves our problem. So far things are looking good.
I'm still a little worried about the way dbmaint.php ends up removing the lock after 30 minutes, though, which leads to two instances of the script running at once. We will just have to hope that this doesn't cause any issues.
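
To illustrate the concern: a lock that is cleared after a fixed time cannot tell a crashed holder from a slow one. One alternative is an advisory lock that the kernel releases when the holding process exits, so no timeout is needed. The Python sketch below shows that idea with `fcntl.flock`. It is not how dbmaint.php works (that script's code is not shown in this thread); `exclusive_lock` and the demo path are hypothetical.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def exclusive_lock(path):
    """Yield True if we hold the lock, False if another holder already has it."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        try:
            # Non-blocking: fail immediately instead of waiting for the holder.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            yield False
            return
        try:
            yield True
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        # Closing the descriptor also drops the lock, as does process exit.
        os.close(fd)


# Demo in a temporary directory: the second attempt is refused while the first is held.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.lock")
with exclusive_lock(demo_path) as first:
    with exclusive_lock(demo_path) as second:
        print(first, second)
```

With this pattern a second instance exits early for as long as the first is still running, however long the optimize takes, and a crashed instance never leaves a stale lock behind.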

Thanks for all your help