IPCS Queue spikes every 4 hours

ghugon · Post by **ghugon** » Tue Sep 15, 2020 8:51 am

Hi,

Every 4 hours (12am, 4am, 8am, 12pm, 4pm, 8pm) we are seeing huge message spikes in the IPCS queue (limit set to 500k messages).
The queue slowly ramps up to 500k messages, stays there for a while, then slowly drains back to 0. The whole episode usually lasts a bit more than 2 hours.

IPCS.PNG

It started on the 1st of September at 12pm. The configuration is applied every day at 11am and 4pm, and as you can see that completes in 15 minutes or so, which is what we are used to.

IPCS-Apply.PNG

The issue is that once the queue is full, everything becomes slow: adding a downtime takes ages, checks are late, and so on. Furthermore, some messages are getting deleted, so we are missing some alerts (but that is to be expected because the queue is full).
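To pin down exactly when the queue starts to fill, it can help to log its depth over time. The following is a minimal sketch, not part of Nagios XI: it assumes the util-linux `ipcs -q` column order (key, msqid, owner, perms, used-bytes, messages) and that the queues are owned by the `nagios` user. The function name `queue_depth` and the log path in the comment are hypothetical.

```shell
# Sum the "messages" column for all queues owned by one user (default: nagios).
# Reads `ipcs -q` output on stdin. Assumed column order:
#   key  msqid  owner  perms  used-bytes  messages
queue_depth() {
    awk -v owner="${1:-nagios}" '$3 == owner { total += $6 } END { print total + 0 }'
}

# Example use (hypothetical log path), sampling once a minute:
# while sleep 60; do
#     echo "$(date '+%F %T') $(ipcs -q | queue_depth)"
# done >> /tmp/ipcs_depth.log
```

A timestamped log like this can then be lined up against cron logs or database logs from the same period.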

We are running Nagios XI 5.6.13 on RHEL 6.10 with mod_gearman (2 job servers and 6 workers).

I'm not even sure the issue lies with Nagios, but I'm throwing this out there because I have absolutely no clue where it's coming from. Maybe you have heard of a similar issue.

Also, is Nagios XI 5.7.x compatible with RHEL 6.10?

Best regards,
Gaspard

benjaminsmith · Post by **benjaminsmith** » Wed Sep 16, 2020 9:06 am

Hi Gaspard,

Since the message queues are spiking, let's restart the software stack and clear out the message queues:

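# Stop cron and the Nagios stack so nothing keeps writing to the queues
service crond stop
service npcd stop
service nagios stop
service ndo2db stop
# Kill any processes still running as the nagios user
pkill -9 -u nagios
# Remove every message queue that belongs to nagios
for i in $(ipcs -q | grep nagios | awk '{print $2}'); do ipcrm -q "$i"; done
# Clear stale lock files
rm -f /usr/local/nagiosxi/var/dbmaint.lock
rm -f /usr/local/nagiosxi/var/event_handler.lock
rm -f /usr/local/nagiosxi/scripts/reconfigure_nagios.lock
# Restart the database, then bring the stack back up in reverse order
service mysqld restart
service ndo2db start
service nagios start
service npcd start
service crond start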

Then let the system run for about 10-15 minutes and download a fresh system profile so we can review the logs for any errors.

To send us your system profile:
Log in to the Nagios XI GUI using a web browser.
Click the "Admin" > "System Profile" menu.
Click the "Download Profile" button.
Save the profile.zip file and share it in a private message or upload it to the post/ticket, then reply to this post to bring it up in the queue.

Regarding compatibility: yes, 5.7.x is compatible with RHEL 6. However, since you have a complex installation with customizations, I would highly recommend upgrading a test server first before making any changes to the production system. Your Nagios XI license allows for 3 activations.

More information on licensing can be found here:
https://support.nagios.com/kb/article.php?id=145

Also, CentOS 6 is scheduled for end of life this November. I would recommend planning a migration to 7 or 8 sometime this year.

Regards,
Benjamin

ghugon · Post by **ghugon** » Thu Sep 17, 2020 10:51 am

I PM'ed you the system profile.

We did of course update our test servers beforehand, but I just wanted to make sure we wouldn't have unforeseen issues.

inversecow · Post by **inversecow** » Thu Sep 17, 2020 11:00 am

[quote="ghugon"]As requested, here is the system profile as instructed:

profile.zip[/quote]

FYI, you may care to delete that last post, and PM the `profile.zip` to the Nagios staff member instead (as requested).

Following this thread, as we see bursts of a similar nature in our environment too, and I'm curious what comes of this analysis (without hijacking the thread).

ghugon · Post by **ghugon** » Thu Sep 17, 2020 11:25 am

Thanks @inversecow, I don't know what I was thinking as we have some sensitive stuff in there.
Well, I'm glad to hear we're not the only ones having this weird issue.

benjaminsmith · Post by **benjaminsmith** » Fri Sep 18, 2020 12:24 pm

Hi,

Since the database is offloaded, the log file was not in the profile. Can you retrieve the database log from the remote server and send it over in a PM?

/var/log/mariadb/mariadb.log
/var/log/mysqld.log

The fact that the IPCS queues are spiking every 4 hours certainly suggests there is a scheduled process running that is impacting the system. The system load in the top command output looks pretty good considering the overall check load on this server (~27k hosts and services). You may have network congestion that is slowing down the writes to the database. It would also be helpful to see a top command output during the spikes, if possible.
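One way to start looking for such a scheduled process is to scan the local crontabs for entries whose hour field fires every 4 hours. This is only a sketch under assumptions: the function name `every_4h_jobs` is hypothetical, it recognizes just three common spellings of a 4-hour schedule, and it cannot see jobs scheduled on other hosts or inside applications.

```shell
# Print crontab entries whose hour field (field 2) fires every 4 hours.
# Reads crontab-format lines on stdin.
every_4h_jobs() {
    awk '
        /^[[:space:]]*#/ { next }   # skip comment lines
        NF < 6           { next }   # skip blanks and variable assignments
        $2 == "*/4" || $2 == "0-23/4" || $2 == "0,4,8,12,16,20" { print }
    '
}

# Example use (run as root so every user crontab is readable):
# for u in $(cut -d: -f1 /etc/passwd); do
#     crontab -u "$u" -l 2>/dev/null | every_4h_jobs | sed "s/^/$u: /"
# done
# cat /etc/crontab /etc/cron.d/* 2>/dev/null | every_4h_jobs
```

The hour field is the second column in both user crontabs and the system files under /etc, so the same filter works for both even though the system format carries an extra user column.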

Regards,
Benjamin

ghugon · Post by **ghugon** » Mon Sep 21, 2020 9:31 am

Hi,

I pm'ed you the mysqld.log.
Yes, we are actively looking for what could be causing this. We disabled everything antivirus-wise and asked around to see if something changed on the infrastructure or network side, but no luck so far.
We temporarily set up a cron job that runs every four hours and kills httpd, as it was really getting out of hand. This lets the IPCS queue drain really quickly, and it doesn't climb again for another 4 hours.
It's dirty but it does the job for now.
We moved the Nagios server from one ESXi host to another but it didn't help. We are also in the process of moving the database to a freshly created VM with a more up-to-date version of MySQL.

Best regards,
Gaspard

benjaminsmith · Post by **benjaminsmith** » Mon Sep 21, 2020 5:07 pm

Hi Gaspard,

Well, glad to hear you have a workaround, even if it's not the ideal solution. Can you send over the database log once more? For some reason, the last PM I have from you is the profile on the 17th.

Another strategy would be to move the Nagios database back to the local server. The process is essentially the reverse of what's in the guide below.

https://assets.nagios.com/downloads/nag ... Server.pdf

Benjamin

ghugon · Post by **ghugon** » Tue Sep 22, 2020 3:56 am

Hi,

I somehow sent the mysqld.log to myself ...
It should be in your inbox now.

Best regards,
Gaspard

benjaminsmith · Post by **benjaminsmith** » Tue Sep 22, 2020 4:55 pm

Hi Gaspard,

Here's what I've found in the database log.

InnoDB: Error: log file ./ib_logfile0 is of different size 0 5242880 bytes
InnoDB: than specified in the .cnf file 0 33554432 bytes!
200917 13:21:15 [ERROR] Plugin 'InnoDB' init function returned error.
200917 13:21:15 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed

It looks like the default InnoDB log file size was changed incorrectly; this is causing an error and possibly loss of functionality. I'm not sure if this is related to the IPCS queue spike, but I would recommend fixing this and restarting the database. Let me know if you notice any performance improvement.

See: How to Resize MySQL Innodb Log Files Without Errors
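For reference, the two sizes in the error are 5242880 bytes (5 MiB, the size of the existing ib_logfile0) and 33554432 bytes (32 MiB, the value in the .cnf file). The sketch below shows one way to compare a configured size against a log file on disk. The function names and the example path are hypothetical; /var/lib/mysql is only an assumed data directory.

```shell
# Convert a MySQL-style size such as 32M or 5242880 to bytes.
size_to_bytes() {
    local value number unit
    value=$1
    number=${value%[KkMmGg]}    # strip one trailing unit letter, if any
    unit=${value#"$number"}     # whatever was stripped
    case $unit in
        [Kk]) echo $(( number * 1024 )) ;;
        [Mm]) echo $(( number * 1024 * 1024 )) ;;
        [Gg]) echo $(( number * 1024 * 1024 * 1024 )) ;;
        *)    echo "$number" ;;
    esac
}

# Compare a configured size against the actual size of a log file.
# Returns 0 on a match and 1 on a mismatch.
check_log_size() {
    local configured actual
    configured=$(size_to_bytes "$1")
    actual=$(( $(wc -c < "$2") ))
    if [ "$actual" -eq "$configured" ]; then
        echo "OK: $2 is $actual bytes, as configured"
    else
        echo "MISMATCH: $2 is $actual bytes, config expects $configured bytes"
        return 1
    fi
}

# Example use (assumed data directory):
# check_log_size 32M /var/lib/mysql/ib_logfile0
```

On older MySQL versions, a mismatch is usually resolved either by setting innodb_log_file_size back to the size of the existing files, or by shutting the server down cleanly, moving the old ib_logfile* files aside, and starting it again so they are recreated at the configured size. Check the linked article and your MySQL version before doing this on production.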
