Nagios XI 5.5.4 - Load Issues (ipcs queue)

Post by **WillemDH** » Thu Oct 11, 2018 9:37 am

Hello,

We successfully updated our Nagios XI server from 5.4.13 to 5.5.4 yesterday. The only 'update' issue we had was that check_mssql_database.py stopped working for our MSSQL database checks. We restored the original script from a backup, after which it worked again. We are still investigating what changed to make it stop working.

Thanks for the long awaited NagVis integration.

What I noticed in the new NagVis:

- I seem unable to change some settings on migrated maps. After recreating the map everything seems to work as expected.
- Some settings in the right-click context menu don't work
> Schedule downtime => The requested URL /nagiosxi/includes/components/xicore/cmd.cgi was not found on this server.
> Re-Schedule next Check => The requested URL /nagiosxi/includes/components/xicore/cmd.cgi was not found on this server.
> Acknowledge => The requested feature is not available for this backend. The MKLivestatus backend supports this feature.

Please let me know what is supposed to work and what not.

Grtz

Willem

Post by **lmiltchev** » Thu Oct 11, 2018 10:30 am

- I seem unable to change some settings on migrated maps. After recreating the map everything seems to work as expected.

What settings are you talking about? Can you describe in detail how you migrated the NagVis maps, and what settings you tried to change? We will try to recreate the issue in-house.

- Some settings in the right-click context menu don't work
> Schedule downtime => The requested URL /nagiosxi/includes/components/xicore/cmd.cgi was not found on this server.
> Re-Schedule next Check => The requested URL /nagiosxi/includes/components/xicore/cmd.cgi was not found on this server.
> Acknowledge => The requested feature is not available for this backend. The MKLivestatus backend supports this feature.

I was able to recreate all three issues, and filed an internal bug report (task_id=13666) for updating/fixing the NagVis component's URLs.

Post by **WillemDH** » Fri Oct 12, 2018 9:18 am

Ludmill,

Unfortunately I seem to be getting load issues after all. The server worked fine for 2 days, during which we did several apply configurations. 1 hour ago, however, I added 1 service to a host, and after applying, things didn't come back up as usual. The ipcs queue stayed above 240k and didn't seem to drop.

After 1 hour, the ipcs queue still hadn't gone any lower, so I decided to shut down the server and add 6 extra CPUs. After rebooting, the ipcs queue stayed very high. At this moment it's around 150k or so...

I already implemented all kinds of performance optimizations (reaper /php /ramdisk). Everything seemed to work well after the update, what could cause this sudden change in behaviour...?

CPU Load is also very low (+-4) for 16 CPU's. The only thing that's off is the message queue. I kept an eye on that the last few days, and it always went back to 0 about 2 minutes after an apply.
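For anyone wanting to watch the queue the same way, here is a minimal Python sketch that sums the `messages` column of `ipcs -q`. It assumes the usual Linux (util-linux) six-column layout `key msqid owner perms used-bytes messages`; the function names and the sample figures in the comments are illustrative, not taken from this server.

```python
import subprocess


def total_queued_messages(ipcs_output):
    """Sum the 'messages' column over all queues listed by `ipcs -q`."""
    total = 0
    for line in ipcs_output.splitlines():
        fields = line.split()
        # Assumed data row layout: key msqid owner perms used-bytes messages.
        # Header and banner lines do not start with a hex key, so they are skipped.
        if len(fields) == 6 and fields[0].startswith("0x"):
            total += int(fields[5])
    return total


def read_queue_depth():
    """Run `ipcs -q` and return the total number of queued messages."""
    result = subprocess.run(
        ["ipcs", "-q"], capture_output=True, text=True, check=True
    )
    return total_queued_messages(result.stdout)
```

Calling `read_queue_depth()` from a small loop or cron job once a minute and logging the value gives the kind of time series quoted further down in this thread.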

I see 5.5.5 has been released, is there any chance 1 of the fixed issues could cause the behaviour I'm describing above? My issue seems to be similar to https://support.nagios.com/forum/viewto ... cs#p263354

EDIT1: Tried another apply and same behaviour....

I will patch Monday to 5.5.5 hoping that fixes my issues.

EDIT2: Tried disabling BPI setting "Sync all hostgroups and servicegroups on apply config." => Same issue

EDIT3: Checking /var/log/messages I seem to find quite a few of these:

Oct 12 16:56:54 srvnagios ndo2db: Error: mysql_query() failed for 'INSERT INTO nagios_downtimehistory SET instance_id='1', downtime_type='2', object_id='60899', entry_time=FROM_UNIXTIME(1532597869), author_name='Claeys Stephen', comment_data='Citrix XenApp Template server voor Golden Image', internal_downtime_id='715114', triggered_by_id='0', is_fixed='1', duration='30988828800', scheduled_start_time=FROM_UNIXTIME(1514761200), scheduled_end_time=FROM_UNIXTIME(32503590000) ON DUPLICATE KEY UPDATE instance_id='1', downtime_type='2', object_id='60899', entry_time=FROM_UNIXTIME(1532597869), author_name='Cleys Stepen', comment_data='Citrix XenApp Template server voor Golden Image', internal_downtime_id='715114', triggered_by_id='0', is_fixed='1', duration='30988828800', scheduled_start_time=FROM_UNIXTIME(1514761200), scheduled_end_time=FROM_UNIXTIME(32503590000)'
Oct 12 16:56:54 srvnagios ndo2db: Error: mysql_query() failed for 'INSERT INTO nagios_scheduleddowntime SET instance_id='1', downtime_type='2', object_id='60899', entry_time=FROM_UNIXTIME(1532597869), author_name='Cleys Stepen', comment_data='Citrix XenApp Template server voor Golden Image', internal_downtime_id='715114', triggered_by_id='0', is_fixed='1', duration='30988828800', scheduled_start_time=FROM_UNIXTIME(1514761200), scheduled_end_time=FROM_UNIXTIME(32503590000) ON DUPLICATE KEY UPDATE instance_id='1', downtime_type='2', object_id='60899', entry_time=FROM_UNIXTIME(1532597869), author_name='Claeys Stephen', comment_data='Citrix XenApp Template server voor Golden Image', internal_downtime_id='715114', triggered_by_id='0', is_fixed='1', duration='30988828800', scheduled_start_time=FROM_UNIXTIME(1514761200), scheduled_end_time=FROM_UNIXTIME(32503590000)'
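One detail that stands out in these entries is `scheduled_end_time=FROM_UNIXTIME(32503590000)`, a timestamp around the year 3000. As far as I know, `FROM_UNIXTIME()` in many MySQL/MariaDB versions returns NULL for values beyond the signed 32-bit range (2038), which could make such an INSERT fail. That is an assumption, not a confirmed cause of the errors above. A small, hypothetical Python helper to flag such values in ndo2db log lines:

```python
import re

# Largest Unix timestamp representable as a signed 32-bit integer (January 2038).
INT32_MAX = 2**31 - 1


def oversized_timestamps(log_line):
    """Return the distinct FROM_UNIXTIME() arguments in a log line that exceed INT32_MAX."""
    values = [int(v) for v in re.findall(r"FROM_UNIXTIME\((\d+)\)", log_line)]
    return sorted({v for v in values if v > INT32_MAX})
```

Running this over `/var/log/messages` would show whether every failing query involves a downtime scheduled far into the future, which is worth ruling in or out before looking elsewhere.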

Grtz

Willem

Post by **lmiltchev** » Fri Oct 12, 2018 10:44 am

Upgrading to Nagios XI 5.5.5 should resolve the issues with load and the BPI component.

- Fixed user permissions on non-active objects causing large/slow SQL queries on some systems -JO
- Fixed status check for NDO in BPI component API tool so that it properly sleeps after each call -JO

https://www.nagios.com/downloads/nagios-xi/change-log/

Note: If you are still having issues after the upgrade, open a ticket via our support center - https://support.nagios.com/tickets/, and send us your latest profile.

Post by **WillemDH** » Mon Oct 15, 2018 1:56 am

Ludmill,

Seems like the number of processes went up sometime Friday morning. Not sure if it is related, but I wanted to mention it anyway, as I can't tell which processes account for this sudden rise.

So we updated to 5.5.4 last Wednesday around 11:00. The performance issues started on Friday.

Seems that since Friday morning, the number of processes went up from +- 350 to +- 475. What could cause this sudden spike in processes? I'm not immediately seeing in top what processes this could be..

EDIT 1: Updated to 5.5.5 around 09:30. Did an apply config afterwards. Things don't seem to be going much better. Checking the message queue:

09:58:15 => Apply => 0 > 175000
09:59:15 => 140000
10:00:05 => 105000
10:01:15 => 75000
10:02:15 => 50000
10:03:15 => 25000
10:03:45 => 0

So it seems to take up to 6 minutes before the message queue calms down now... I will need to get this down to an acceptable level somehow. 2 minutes was just doable..
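To put a number on the drain, here is a rough Python sketch that takes the samples listed above and computes the elapsed time and average drain rate. It treats the 09:58:15 reading of 175000 as the starting point, and the function name is only illustrative.

```python
from datetime import datetime

# (time of day, queued messages) as listed in the post above
samples = [
    ("09:58:15", 175000),
    ("09:59:15", 140000),
    ("10:00:05", 105000),
    ("10:01:15", 75000),
    ("10:02:15", 50000),
    ("10:03:15", 25000),
    ("10:03:45", 0),
]


def drain_stats(samples):
    """Return (elapsed seconds, average messages drained per second)."""
    times = [datetime.strptime(t, "%H:%M:%S") for t, _ in samples]
    elapsed = (times[-1] - times[0]).total_seconds()
    drained = samples[0][1] - samples[-1][1]
    return elapsed, drained / elapsed


elapsed, rate = drain_stats(samples)
print(f"{elapsed:.0f} s to drain, about {rate:.0f} messages/s on average")
```

With these figures the queue empties in 330 seconds, roughly 530 messages per second on average. At that rate a 2-minute drain would only be possible if the peak stayed below about 64k messages, so either the peak has to come down or the drain rate has to go up.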

What is the main reason this queue gets so high? Is it the number of Nagios objects (hosts / services) or the number of checks /s?

pm'ed you a system profile.

Willem

Post by **lmiltchev** » Mon Oct 15, 2018 1:05 pm

You opened a new support ticket in our system, so we will continue communicating via emails. I am locking this topic.
