Upgrade 5.3.4 to 5.4.2 - Disaster.. Debugged.. Averted

Post by **SteveBeauchemin** » Wed Feb 22, 2017 2:08 pm

I finally completed my Production migration from
Red Hat 6 Nagios XI 5.2.7 to
Red Hat 7 Nagios XI 5.4.2

I did this by upgrading Nagios XI on Red Hat 6 to version 5.3.4
Then I set up a Red Hat 7 and installed Nagios XI 5.3.4
Then I used a modified version of the backup_xi.sh and restore_xi.sh. I have an offloaded DB and was keeping the same host.

Migration from Red Hat 6 to 7 then became less complicated. Same Nagios to Same Nagios - just the OS was different.
Did that. Everything looked good. Mission 1 accomplished.

Once that was completed, (many steps not listed here) I did the big upgrade to Nagios XI 5.4.2
And I say big, because the Nagios core changes from version 4.1.1 to 4.2.4
and the NDO tools change from version 2.0.0 to 2.1.2

Before the upgrade everything was running perfectly. The performance addons were humming along quite happily.
Mod_gearman, rrdcache, ramdisk, livestatus, DB offload...

Prior to the upgrade, I had to edit the upgrade script and tell it to NOT touch sudoers.
We have a centrally controlled sudoers file, and every time a Nagios script touches it, it gets broken.
The edit worked fine and sudoers became a non issue.

After the upgrade everything looked good at first glance. I knew I had to change the lock versus pid file name stuff.
So I updated the init.d ndo2db file and the red Database Backend icon turned green instantly.

Still, everything looked good. Then it didn't...

I noticed the OS getting sluggish and non-responsive. On the terminal where I was running the top command I saw the load average climb from around 3 to 97, meaning I would need 97 CPUs installed to keep up with the work waiting in the queue. My system normally runs between 2 and 4. I have 8 CPUs.

So that was a bad thing.

Then I noticed the gearman_top screen had a service backlog in the queue of 20,000 jobs sitting. Waiting to send work to my workers.

Another VERY bad thing.

No, not having any fun yet...

Also, I noticed that the ipcs -q had multiple line items where there should only be one line item.
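For anyone who wants to check the same thing, here is a minimal sketch that counts message queues in `ipcs -q` output. The helper name `count_queues` is made up for this example, and it assumes the usual Linux layout where each queue line starts with a hex key such as `0x...`.

```shell
#!/bin/sh
# Hypothetical helper: count message queue entries in `ipcs -q` output.
# Assumes each queue line begins with a hex key ("0x..."); header and
# blank lines are ignored.
count_queues() {
    awk '$1 ~ /^0x/ { n++ } END { print n + 0 }'
}

# Usage (reads from stdin):
#   ipcs -q | count_queues
```

On a setup like the one described here, anything above 1 would be worth a closer look.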

It just was getting worse...

I had thoughts in my head of reverting to the old version.

So, while trying to not have a heart attack, I looked at settings in Nagios and other things. Eventually I looked into the system log files to see if anything could give me a clue as to what the *** happened. I am sending a mental Thanks to whoever realized we need to see logs, to whoever invented that idea. They need a raise. Looking in the /var/log/messages file I finally found a BIG clue. There were messages there that I had not seen in a long time because I had dealt with that issue already. But... I saw this...

ndo2db: Warning: Retrying message send. This can occur because you have too few messages allowed or too few total bytes allowed in message queues. You are currently using 128000 of 2097152 messages and 131072000 of 131072000 bytes in the queue. See README for kernel tuning options.
ndo2db: Message sent to queue.
ndo2db: Warning: queue send error, retrying...
ndo2db: Message sent to queue.
ndo2db: Warning: queue send error, retrying...

And many more send errors... Did I say many? I meant Many with a capital M.

Finally the OS got tired of the messed up situation and kicked off the oom-killer.
It started to remove stuff that was out of control. Out Of Memory

kernel: php invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
kernel: Out of memory: Kill process...

I have some very big, very scary log entries.

I had changed kernel settings multiple times in the past, slowly growing the setting until my installation ran without issue.
So I was shocked to see that the setting I had previously put in place was one digit short.
I set my kernel parameters like this
kernel.msgmax = 1073741824
kernel.msgmnb = 1073741824
But now they were roughly one eighth the size.
kernel.msgmax = 131072000
kernel.msgmnb = 131072000

And as a result, the data that should go to SQL was getting dropped. The Queue was full. The CPU was racing. I had hundreds of php processes when I ran ps -ef. And everything turned to crap. The OS Memory cop kept killing stuff.
The system would quasi recover for about 20 minutes, then again go stupid. Then recover, then stupid. And again... rinse and repeat...

So I ran my 'stop all Nagios related stuff' script. This took a long time as it had to get in the queue that was already full.
I made my kernel changes again and ran sysctl -p to invoke them. Then I ran my 'start all nagios processes' script.
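A quick way to confirm the values after an upgrade is to compare the live setting against the floor you expect. This is only a sketch: `check_floor` is a hypothetical helper name, and the floor of 1073741824 is simply the value from this post, not a general recommendation.

```shell
#!/bin/sh
# Hypothetical helper: report whether a kernel parameter is below an
# expected floor. If the third argument is omitted, the live value is
# read with `sysctl -n`.
check_floor() {
    name=$1
    wanted=$2
    current=${3:-$(sysctl -n "$name")}
    if [ "$current" -lt "$wanted" ]; then
        echo "LOW $name: $current < $wanted"
        return 1
    fi
    echo "OK $name: $current"
}

# Usage with the values from this post:
#   check_floor kernel.msgmax 1073741824
#   check_floor kernel.msgmnb 1073741824
```

Running it right after any upgrade, before starting the Nagios processes again, would catch a reduced value before the queue fills up.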

I can live with the result. But you cut 10 years off my life. Mission 2 accomplished.

I never expected that the 2 kernel parameters being small would be so destructive.

Please do not change them any more. And NEVER make them smaller than what we, the users, set them to.

I am running the new version now.

Steve B

Post by **dwhitfield** » Wed Feb 22, 2017 2:51 pm

Thanks for the detailed explanation. I posted internal bug report 11143 for this. Is there anything else you'd like to add/say or should we lock this up?

Post by **SteveBeauchemin** » Wed Feb 22, 2017 3:03 pm

Lock it - I just like to tell what happened, how it got fixed. Someone else may benefit.

Thanks

Steve B

Post by **avandemore** » Wed Feb 22, 2017 3:09 pm

While I agree we should never adjust these downwards, for future reference this will immediately raise the limits:

sysctl kernel.msgmax=1073741824 kernel.msgmnb=1073741824

Post by **bheden** » Wed Feb 22, 2017 3:46 pm

Steve,

My personal apologies. I'll get this task completed ASAP. Unfortunately, we have no way of knowing if you (the user) *changed* it or not - but we do know what is too low of a value for most cases. I think an obvious check would be "if the new value is going to be lower than the old value, don't change it."
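The "don't lower it" check could look something like the sketch below. This is not the actual upgrade script; `keep_larger` is a hypothetical name, and 131072000 stands in for whatever default an installer would want to apply.

```shell
#!/bin/sh
# Hypothetical helper: print the larger of the existing value and the
# proposed value, so an existing setting is never lowered.
keep_larger() {
    old=$1
    new=$2
    if [ "$new" -lt "$old" ]; then
        echo "$old"
    else
        echo "$new"
    fi
}

# Usage (needs root to apply):
#   current=$(sysctl -n kernel.msgmnb)
#   sysctl -w "kernel.msgmnb=$(keep_larger "$current" 131072000)"
```

This sidesteps the question of whether the user changed the value: a larger existing value is kept regardless of who set it.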

I think it may be time for us to start looking at a "power users" install script that only does the basics and then informs the end user of potential upgrades they could perform on their own - but a script where we touch the bare minimum (filesystem -> new html, etc. along with db -> new columns/remove columns, etc.). What are your thoughts on something like that?
