Red Hat 6 Nagios XI 5.2.7 to
Red Hat 7 Nagios XI 5.4.2
I did this by upgrading Nagios XI on Red Hat 6 to version 5.3.4
Then I set up a Red Hat 7 and installed Nagios XI 5.3.4
Then I used a modified version of the backup_xi.sh and restore_xi.sh. I have an offloaded DB and was keeping the same host.
Migration from Red Hat 6 to 7 then became less complicated. Same Nagios to Same Nagios - just the OS was different.
Did that. Everything looked good. Mission 1 accomplished.
Once that was completed (many steps not listed here), I did the big upgrade to Nagios XI 5.4.2
And I say big, because the Nagios core changes from version 4.1.1 to 4.2.4
and the NDO tools change from version 2.0.0 to 2.1.2
Before the upgrade everything was running perfectly. The performance addons were humming along quite happily.
Mod_gearman, rrdcache, ramdisk, livestatus, DB offload...
Prior to the upgrade, I had to edit the upgrade script and tell it to NOT touch sudoers.
We have a centrally controlled sudoers file and every time a Nagios script touches it, it gets broken.
The edit worked fine and sudoers became a non issue.
After the upgrade everything looked good at first glance. I knew I had to change the lock versus pid file name stuff.
So I updated the init.d ndo2db file and the red Database Backend icon turned green instantly.
Still, everything looked good. Then it didn't...
I noticed the OS getting sluggish and non-responsive. On the terminal where I was running the top command I saw the load average jump from around 3 to 97. In other words, I would need about 97 CPUs to keep up with the work waiting in the run queue. My system normally runs at a load of 2 to 4. I have 8 CPUs.
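For anyone who wants to put a number on that, here is a minimal sketch of the load-versus-CPU arithmetic. The function names are made up for illustration, and the live reading assumes a Unix-like host where Python's os.getloadavg() is available.

```python
import os


def load_per_cpu(load_average: float, cpu_count: int) -> float:
    """Return the load average divided by the number of CPUs.

    A result above 1.0 means more work is waiting than the CPUs can run.
    """
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return load_average / cpu_count


def current_load_per_cpu() -> float:
    """Read the 1-minute load average of this host and normalize it (Unix only)."""
    one_minute, _five, _fifteen = os.getloadavg()
    return load_per_cpu(one_minute, os.cpu_count() or 1)
```

With the numbers from this post, load_per_cpu(97, 8) comes out a little above 12, against a normal 0.25 to 0.5 on the same box.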
So that was a bad thing.
Then I noticed the gearman_top screen had a service backlog of 20,000 jobs sitting in the queue, waiting to be sent to my workers.
Another VERY bad thing.
No, not having any fun yet...
Also, I noticed that ipcs -q showed multiple line items where there should only be one.
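That ipcs -q check can be scripted. The sketch below is only an illustration: it assumes the usual Linux (util-linux) column layout of key, msqid, owner, perms, used-bytes, messages, and the function names are hypothetical.

```python
import subprocess
from typing import Dict, List


def parse_ipcs_queues(ipcs_output: str) -> List[Dict[str, object]]:
    """Parse the text printed by `ipcs -q` into one dict per message queue."""
    queues = []
    for line in ipcs_output.splitlines():
        fields = line.split()
        # Data rows start with a hex key such as 0x...; skip headers and blanks.
        if len(fields) >= 6 and fields[0].startswith("0x"):
            queues.append({
                "key": fields[0],
                "msqid": int(fields[1]),
                "owner": fields[2],
                "perms": fields[3],
                "used_bytes": int(fields[4]),
                "messages": int(fields[5]),
            })
    return queues


def read_queues() -> List[Dict[str, object]]:
    """Run `ipcs -q` on this host and parse its output."""
    result = subprocess.run(["ipcs", "-q"], capture_output=True, text=True, check=True)
    return parse_ipcs_queues(result.stdout)
```

If len(read_queues()) is greater than one on a box where only ndo2db should own a queue, that matches the symptom described here.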
It just was getting worse...
I had thoughts in my head of reverting to the old version.
So, while trying to not have a heart attack, I looked at settings in Nagios and other things. Eventually I looked into the system log files to see if anything could give me a clue as to what the *** happened. I am sending a mental Thanks to whoever realized we need to see logs, to whoever invented that idea. They need a raise. Looking in the /var/log/messages file I finally found a BIG clue. There were messages there that I had not seen in a long time because I had dealt with that issue already. But... I saw this...
ndo2db: Warning: Retrying message send. This can occur because you have too few messages allowed or too few total bytes allowed in message queues. You are currently using 128000 of 2097152 messages and 131072000 of 131072000 bytes in the queue. See README for kernel tuning options.
ndo2db: Message sent to queue.
ndo2db: Warning: queue send error, retrying...
ndo2db: Message sent to queue.
ndo2db: Warning: queue send error, retrying...
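The numbers in that first warning tell the whole story: the byte count has hit its ceiling while the message count has not. A rough sketch for pulling those numbers out of a log line follows. The regular expression is written against the wording of the message quoted above, and the function names are hypothetical.

```python
import re
from typing import Dict, Optional

# Matches "... using 128000 of 2097152 messages and 131072000 of 131072000 bytes ..."
QUEUE_USAGE = re.compile(
    r"using (?P<msgs_used>\d+) of (?P<msgs_max>\d+) messages "
    r"and (?P<bytes_used>\d+) of (?P<bytes_max>\d+) bytes"
)


def parse_queue_usage(log_line: str) -> Optional[Dict[str, int]]:
    """Extract queue usage from an ndo2db warning line, or None if absent."""
    match = QUEUE_USAGE.search(log_line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


def queue_is_full(usage: Dict[str, int]) -> bool:
    """True when either the message limit or the byte limit is exhausted."""
    return (usage["msgs_used"] >= usage["msgs_max"]
            or usage["bytes_used"] >= usage["bytes_max"])
```

Run over /var/log/messages, something like this could flag the problem before the oom-killer does.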
Finally the OS got tired of the messed up situation and kicked off the oom-killer.
It started to remove stuff that was out of control. Out Of Memory
kernel: php invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
kernel: Out of memory: Kill process...
I had changed kernel settings multiple times in the past, slowly growing the setting until my installation ran without issue.
So I was shocked to see that the setting I had previously put in place was one digit short.
I had set my kernel parameters like this:
kernel.msgmax = 1073741824
kernel.msgmnb = 1073741824
But now they were roughly one eighth the size.
kernel.msgmax = 131072000
kernel.msgmnb = 131072000
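A check like the following, run right after any upgrade, would have caught the change immediately. It is a sketch only: it compares "key = value" text (as found in /etc/sysctl.conf or printed by sysctl -a) against the values from this post, and the function names are hypothetical.

```python
from typing import Dict, List

# The values this post says were configured before the upgrade.
EXPECTED = {"kernel.msgmax": 1073741824, "kernel.msgmnb": 1073741824}


def parse_sysctl(text: str) -> Dict[str, int]:
    """Parse 'key = value' lines, keeping only integer-valued settings."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not an assignment.
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        if value.isdigit():
            settings[key.strip()] = int(value)
    return settings


def shrunken_settings(actual: Dict[str, int], expected: Dict[str, int]) -> List[str]:
    """Describe every expected setting that is missing or smaller than expected."""
    problems = []
    for key, wanted in expected.items():
        found = actual.get(key)
        if found is None:
            problems.append(f"{key} is not set (expected at least {wanted})")
        elif found < wanted:
            problems.append(f"{key} = {found}, {wanted / found:.1f}x smaller than {wanted}")
    return problems
```

Feeding it the two post-upgrade lines above reports both parameters as about 8.2x smaller than what was set.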
And as a result, the data that should go to SQL was getting dropped. The Queue was full. The CPU was racing. I had hundreds of php processes when I ran ps -ef. And everything turned to crap. The OS Memory cop kept killing stuff.
The system would quasi recover for about 20 minutes, then again go stupid. Then recover, then stupid. And again... rinse and repeat...
So I ran my 'stop all Nagios related stuff' script. This took a long time as it had to get in the queue that was already full.
I made my kernel changes again and ran sysctl -p to apply them. Then I ran my 'start all nagios processes' script.
I can live with the result. But you cut 10 years off my life. Mission 2 accomplished.
I never expected that the 2 kernel parameters being small would be so destructive.
Please do not change them any more. And NEVER make them smaller than what we, the users, set them to.
I am running the new version now.
Steve B