IPCS Queue spikes every 4 hours
Hi,
Every 4 hours (12am, 4am, 8am, 12pm, 4pm, 8pm) we are seeing huge message spikes in the IPCS queue (limit set to 500k messages).
The queue slowly ramps up to 500k messages, stays there for a while, then slowly drains back to 0. The whole cycle usually lasts a bit more than 2 hours. It started on the 1st of September at 12pm.
The configuration is applied every day at 11am and 4pm, and as you can see it's done in 15 minutes or so, which is what we are used to.
The issue is that while the queue is full, everything becomes slow: adding a downtime takes ages, checks are late, and so on. Furthermore, some messages are getting deleted, so we are missing some alerts (but that is to be expected when the queue is full).
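To pin down exactly when the ramp-up begins, it may help to log the queue depth once a minute and line the timestamps up against cron jobs and other scheduled tasks. A minimal sketch, assuming the usual Linux `ipcs -q` column layout (key, msqid, owner, perms, used-bytes, messages) and that the queues are owned by the `nagios` user; the function name and log path are made up for illustration:

```shell
#!/bin/sh
# Sum the "messages" column (6th field) of `ipcs -q` output for one owner.
# Reads ipcs output on stdin; header lines are skipped because their
# third field does not match the owner name.
sum_queue_messages() {
    awk -v owner="$1" '$3 == owner { total += $6 } END { print total + 0 }'
}

# Example use (hypothetical log path), e.g. inside a script run every minute:
#   echo "$(date '+%F %T') $(ipcs -q | sum_queue_messages nagios)" >> /tmp/ipcs_depth.log
```

Plotting that log next to the list of scheduled jobs should show whether the climb starts exactly on the hour or drifts, which narrows down the kind of trigger to look for.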
We are running Nagios XI 5.6.13 on RHEL 6.10 with mod_gearman (2 job servers and 6 workers).
I'm not even sure the issue lies with Nagios, but I'm throwing this out there because I have absolutely no clue where it's coming from. Maybe you have heard of a similar issue.
Also, is Nagios XI 5.7.x compatible with RHEL 6.10?
Best regards,
Gaspard
-
benjaminsmith
- Posts: 5324
- Joined: Wed Aug 22, 2018 4:39 pm
- Location: saint paul
Re: IPCS Queue spikes every 4 hours
Hi Gaspard,
Since the message queues are spiking, let's restart the software stack and clear out the message queues using the commands below.
Then let the system run for about 10-15 minutes and download a fresh system profile so we can review the logs for any errors.
Restart sequence:
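# Stop cron and the monitoring services
service crond stop
service npcd stop
service nagios stop
service ndo2db stop
# Kill any remaining processes owned by the nagios user
pkill -9 -u nagios
# Remove all IPC message queues that belong to nagios
for i in $(ipcs -q | grep nagios | awk '{print $2}'); do ipcrm -q $i; done
# Remove stale lock files
rm -rf /usr/local/nagiosxi/var/dbmaint.lock
rm -rf /usr/local/nagiosxi/var/event_handler.lock
rm -rf /usr/local/nagiosxi/scripts/reconfigure_nagios.lock
# Restart the database, then bring the services back up in reverse order
service mysqld restart
service ndo2db start
service nagios start
service npcd start
service crond start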
To send us your system profile:
1. Log in to the Nagios XI GUI using a web browser.
2. Click the "Admin" > "System Profile" menu.
3. Click the "Download Profile" button.
4. Save the profile.zip file and share it in a private message or upload it to the post/ticket, then reply to this post to bring it back up in the queue.
Regarding 5.7.x compatibility with RHEL 6: yes, it is compatible. However, since you have a complex installation with customizations, I would highly recommend upgrading a test server first before making any changes to the production system. Your Nagios XI license allows for 3 activations.
More information on licensing can be found here:
https://support.nagios.com/kb/article.php?id=145
Also, CentOS 6 is scheduled for end of life this November. I would recommend planning a migration to version 7 or 8 sometime this year.
Regards,
Benjamin
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
Be sure to check out our Knowledgebase for helpful articles and solutions!
Re: IPCS Queue spikes every 4 hours
I PM'ed you the system profile.
We did of course update our test servers beforehand, but I just wanted to make sure we wouldn't run into unforeseen issues.
Last edited by ghugon on Thu Sep 17, 2020 11:21 am, edited 2 times in total.
- inversecow
- Posts: 44
- Joined: Wed Sep 25, 2019 4:17 pm
Re: IPCS Queue spikes every 4 hours
ghugon wrote: As requested, here is the system profile as instructed:
FYI, you may want to delete that last post and PM the `profile.zip` to the Nagios staff member instead (as requested).
Following this thread, as we see bursts of a similar nature in our environment too, and I am curious what comes of this analysis (without hijacking the thread).
“And who better understands the Unix-nature?” Master Foo asked.
“Is it he who writes the ten thousand lines, or he who, perceiving the emptiness of the task, gains merit by not coding?”
Master Foo - The ten thousand Lines
Unix Koans of Master Foo
Re: IPCS Queue spikes every 4 hours
Thanks @inversecow, I don't know what I was thinking as we have some sensitive stuff in there.
Well, I'm glad to hear we're not the only ones having this weird issue.
-
benjaminsmith
- Posts: 5324
- Joined: Wed Aug 22, 2018 4:39 pm
- Location: saint paul
Re: IPCS Queue spikes every 4 hours
Hi,
Since the database is offloaded, the log file was not in the profile. Can you retrieve the database log from the remote server and send it over in a PM?
The fact that the IPCS queues are spiking every 4 hours certainly suggests a scheduled process is running and impacting the system. The system load in the top command output looks pretty good considering the overall check load on this server (~27k hosts and services). You may have network congestion that is slowing down the writes to the database. It would also be helpful to see the top command output during the spikes, if possible.
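One way to get that `top` output without someone sitting at a terminal is to let a scheduled script take a batch-mode snapshot every few minutes, but only inside the spike windows. A rough sketch, assuming the spikes start on hours divisible by 4 and last roughly two hours as described earlier in the thread; the function name and output directory are hypothetical:

```shell
#!/bin/sh
# Return success if the given hour (0-23, optionally zero-padded) falls in
# a spike window: the hour divisible by 4 plus the two hours after it.
in_spike_window() {
    h=${1#0}        # strip one leading zero so "08" is not read as octal
    h=${h:-0}       # "0" becomes empty after stripping; restore it
    [ $(( h % 4 )) -le 2 ]
}

# Example use inside a script that cron runs every 5 minutes
# (hypothetical output directory):
#   in_spike_window "$(date +%H)" && \
#       top -b -n 1 > "/tmp/top_snapshots/top_$(date +%Y%m%d_%H%M).txt"
```

Comparing a snapshot from inside a window with one from a quiet hour should make any process that only shows up during the spikes stand out.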
The database log should be at one of these paths:
/var/log/mariadb/mariadb.log
/var/log/mysqld.log
Regards,
Benjamin
Re: IPCS Queue spikes every 4 hours
Hi,
I PM'ed you the mysqld.log.
Yes, we are actively looking for what could be causing this. We disabled everything antivirus-related and asked around to see whether something changed on the infrastructure or network side, but no luck so far.
Because it was really getting out of hand, we temporarily set up a cron job that runs every four hours and kills httpd. This lets the IPCS queue drain really quickly, and it does not climb again for another 4 hours.
It's dirty but it does the job for now.
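A slightly less blunt variant of this workaround would be to restart httpd only when the queue has actually backed up. The sketch below is not the cron job described above, just an illustration: the threshold, the function name, and the use of `service httpd restart` are assumptions.

```shell
#!/bin/sh
# Decide whether the queue depth justifies a restart.
# $1 = current number of queued messages, $2 = threshold.
should_restart() {
    [ "$1" -ge "$2" ]
}

# Example use (assumed threshold of 400000, i.e. 80% of the 500k limit).
# The depth is the sum of the "messages" column of `ipcs -q` for nagios:
#   depth=$(ipcs -q | awk '$3 == "nagios" { t += $6 } END { print t + 0 }')
#   should_restart "$depth" 400000 && service httpd restart
```

Run every few minutes instead of every four hours, a check like this would also react if the spike schedule ever shifts.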
We moved the Nagios server from one ESXi host to another, but it didn't help. We are also in the process of moving the database to a freshly created VM with a more up-to-date version of MySQL.
Best regards,
Gaspard
-
benjaminsmith
- Posts: 5324
- Joined: Wed Aug 22, 2018 4:39 pm
- Location: saint paul
Re: IPCS Queue spikes every 4 hours
Hi Gaspard,
Well, glad to hear you have a workaround, even if it's not the ideal solution. Can you send over the database log once more? For some reason, the last PM I have from you is the profile on the 17th.
Another strategy would be to move the nagios database local again. The process is essentially the reverse of what's in the guide below.
https://assets.nagios.com/downloads/nag ... Server.pdf
Benjamin
Re: IPCS Queue spikes every 4 hours
Hi,
I somehow sent the mysqld.log to myself ...
It should be in your inbox now.
Best regards,
Gaspard
-
benjaminsmith
- Posts: 5324
- Joined: Wed Aug 22, 2018 4:39 pm
- Location: saint paul
Re: IPCS Queue spikes every 4 hours
Hi Gaspard,
Here's what I've found in the database log.
It looks like innodb_log_file_size was changed in the .cnf file without recreating the existing log files, so the InnoDB plugin fails to initialize and functionality may be lost. I'm not sure if this is related to the IPCS queue spike, but I would recommend fixing it, restarting the database, and letting me know if you notice any performance improvement.
InnoDB: Error: log file ./ib_logfile0 is of different size 0 5242880 bytes
InnoDB: than specified in the .cnf file 0 33554432 bytes!
200917 13:21:15 [ERROR] Plugin 'InnoDB' init function returned error.
200917 13:21:15 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed
See: How to Resize MySQL Innodb Log Files Without Errors
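For reference, the two numbers in the error are 5 MiB (5242880 bytes, the ib_logfile0 on disk) and 32 MiB (33554432 bytes, the value in the .cnf file). A small sketch for checking this kind of mismatch before restarting the database; the function names are hypothetical, and the data directory in the usage comment is only the common default, not something confirmed for this server:

```shell
#!/bin/sh
# Convert a byte count to whole MiB.
to_mib() {
    echo $(( $1 / 1048576 ))
}

# Compare a log file's on-disk size with the size the config expects.
# $1 = path to the log file, $2 = expected size in bytes.
# Prints "match" or "mismatch: file=<n>M config=<m>M".
compare_log_size() {
    file=$1
    expected_bytes=$2
    actual_bytes=$(wc -c < "$file" | tr -d ' ')
    if [ "$actual_bytes" -eq "$expected_bytes" ]; then
        echo "match"
    else
        echo "mismatch: file=$(to_mib "$actual_bytes")M config=$(to_mib "$expected_bytes")M"
    fi
}

# Example use (assumed default data directory):
#   compare_log_size /var/lib/mysql/ib_logfile0 33554432
```

On MySQL versions that do not resize the redo logs automatically, the usual fix is to shut mysqld down cleanly, move the old ib_logfile* files aside, and start it again so they are recreated at the configured size. Reverting innodb_log_file_size to the 5M the existing files already have should also clear the error.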