Need advice from large installation admins!

Post by **BanditBBS** » Wed Mar 29, 2017 8:05 pm

Hey all,

I've been having a crazy amount of issues over the last month or so with ndo2db falling behind and just never catching up. I only have 1500 hosts and 35000 services, nothing that should be causing me this amount of pain. I have both ndo2db and MySQL offloaded. Both my Nagios XI host (16 cores, 64GB RAM; load spiking to 20-100 over the past month, a load of 10 before that) and my DB host (8 cores, 32GB RAM, never a load over 1.5) are virtualized in ESX 5.0 (yes, that old!). I know in the past I moved off of virtual because it just caused issues. Does anyone else have history with that?

I am considering installing RHEL 6.x on a blade tomorrow, bare metal style, no hypervisor! 24 CPU threads and 64GB RAM, and actually keeping the DB on the same fresh install box. I will also throw Gearman on it and start using that as well. I had Nagios trying to help me troubleshoot, but I gave up on that (not due to them being bad at all) because I wasn't making any progress. I just can't keep losing SQL entries and can't keep having to babysit the server all my waking hours. ndo is just killing me!

Anyone want to tell me, "NO, Don't do what you wanna do" and give me a good reason not to? Otherwise I think I'll start prepping the new server tomorrow!

Highly appreciate all and any feedback!

Post by **WillemDH** » Thu Mar 30, 2017 2:25 am

James,

First of all, my XI installation is much smaller than yours, so I'm not sure if anything I'm saying really makes sense.
Some random thoughts:

- You say you have offloaded ndo2db. What procedure did you use to do this? What is the load on the server running ndo2db?
- How long does it take to apply configuration?
- In my installation, the bottleneck seems to be disk I/O. What storage type are you using? What other VMs are running on the same datastore? Please run

Code: Select all

iostat -xtz 1 > /tmp/io.log

during an apply configuration and post the results here. You could try migrating all other VMs temporarily to a different datastore to see if performance improves.
- Are you monitoring network devices with your XI server? I've seen mrtg causing very high load in the past; that's why we offloaded mrtg checks to a separate Gearman node.
- What is the average interval of the checks you are running? Is it an option to 'temporarily' increase the interval on all checks (for example from 1m to 5m)?
- Running XI on bare metal will definitely have advantages, as you can be more sure that other VMs on the same datastore are not impacting your performance. The disadvantage is that you cannot easily scale CPUs. You also cannot make use of VMware's built-in HA features.
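To make the iostat capture requested above easier to read, here is a rough sketch that pulls the peak `%util` per device out of the log. The function name `peak_util` is made up for this example, and it assumes the log has the usual `Device` header line that `iostat -x` prints before each table. Column positions differ between sysstat versions, so the `%util` column is looked up from the header instead of being hard-coded.

```shell
#!/bin/sh
# Sketch only: report the highest %util seen per device in an
# `iostat -xtz 1` log. peak_util is a hypothetical helper name.
peak_util() {
    # $1: path to the iostat log, e.g. /tmp/io.log
    awk '
        /^Device/ {                      # header line of a device table
            col = 0
            for (i = 1; i <= NF; i++) if ($i == "%util") col = i
            in_table = 1
            next
        }
        NF == 0 { in_table = 0; next }   # a blank line ends the table
        in_table && col > 0 {
            v = $col + 0
            if (!($1 in peak) || v > peak[$1]) peak[$1] = v
        }
        END { for (d in peak) printf "%s %.2f\n", d, peak[d] }
    ' "$1" | sort
}

# Example call:
#   peak_util /tmp/io.log
```

A device that sits near 100 during an apply configuration would support the disk I/O theory; low values would point elsewhere.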

Grtz

Willem
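On the check interval question above, a quick back-of-the-envelope sketch shows why the interval matters. It assumes, purely for illustration, that all 35000 services run at the same interval, which is unlikely to be true on a real install. `checks_per_second` is a made-up helper name.

```shell
#!/bin/sh
# Sketch only: average service checks per second for a given number of
# services and a single uniform check interval (an assumption).
checks_per_second() {
    # $1: number of services, $2: check interval in seconds
    awk -v n="$1" -v s="$2" 'BEGIN { printf "%.2f\n", n / s }'
}

# Example calls:
#   checks_per_second 35000 60    # every service checked each minute
#   checks_per_second 35000 300   # every service checked each 5 minutes
```

Going from 1m to 5m cuts the average check rate by a factor of five.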

Post by **SteveBeauchemin** » Thu Mar 30, 2017 9:32 am

I had a similar experience after an upgrade recently. The upgrade made assumptions about my kernel parameters, which I had changed quite a bit.

Read my post about it https://support.nagios.com/forum/viewto ... 16&t=42604

Look there as the behavior you describe pretty much sounds like what happened to me.

Keep me posted as there are other things I did to make my setup work. I do have 6 gearman workers running and my system will not be okay without them. That is one way I deal with the large installation.

This is a small piece of text from my post regarding the change the upgrade script made to my setup:

Code: Select all

I set my kernel parameters like this
kernel.msgmax = 1073741824
kernel.msgmnb = 1073741824
But now they were one tenth the size.
kernel.msgmax = 131072000
kernel.msgmnb = 131072000

Keep us posted, and good luck... we're with you.

Steve B

Post by **BanditBBS** » Thu Mar 30, 2017 9:36 am

@SteveBeauchemin
Yeah, I made those changes. My msg queue just builds up to 1 million messages and maybe hours and hours later it catches up and the GUI is back working, but while it's anything over 20,000 the GUI is too far behind to be of any use.

@WillemDH
I used a process that was given to me by Andy back in the day to offload ndo; it is on the same host as my DB.
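For anyone who wants to watch the queue depth described above without babysitting it by hand, here is a minimal sketch built around `ipcs -q`. It assumes the usual Linux `ipcs -q` layout, where the message count is the sixth column, and it sums every System V message queue on the box, not just the one ndo2db uses. The function names are made up, and the 20,000 threshold is the figure from this post.

```shell
#!/bin/sh
# Sketch only: total queued messages from `ipcs -q` style output on stdin.
# Assumes data rows start with a hex key (0x...) and that the message
# count is in column 6, as in the usual Linux ipcs layout.
queue_depth() {
    awk '$1 ~ /^0x/ { total += $6 } END { printf "%d\n", total }'
}

# Print OK or BEHIND depending on whether the depth exceeds a threshold.
queue_status() {
    # $1: threshold in messages
    depth=$(queue_depth)
    if [ "$depth" -gt "$1" ]; then
        echo "BEHIND $depth"
    else
        echo "OK $depth"
    fi
}

# Example call:
#   ipcs -q | queue_status 20000
```

Run from cron or a simple loop, this gives a timeline of when the queue starts to build, which can be lined up against apply configurations, backups, or other activity on the host.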

Post by **SteveBeauchemin** » Thu Mar 30, 2017 9:55 am

I just had another thought - this also happens to my site every so often and I hate it. Here are a couple short horror stories to make you feel better.

I run in ESX and have to share the architecture with other systems. Sometimes the other systems steal all my I/O and that causes my system load average to spike. When I run top, with 8 CPUs, I should never have a load average above 8. I have seen it as high as 200+. It turned out that the shared ESX architecture placed my system with a bunch of mail servers that used their disk drives heavily and stole all my disk I/O. I had them vMotion my virtual to a less used ESX host and that problem went away. This took me weeks to figure out, and many sleepless nights, simply because there was nothing wrong with my Nagios work. It was caused by external influences that I was unaware of.
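The rule of thumb above (load average should stay at or below the CPU count) can be checked with a few lines of shell. This is just a sketch, and `load_per_cpu` is a made-up helper name.

```shell
#!/bin/sh
# Sketch only: load average divided by the number of CPUs.
# A result well above 1.00 means more work is queued than the CPUs can run.
load_per_cpu() {
    # $1: load average, $2: number of CPUs
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}

# Example call on a live Linux system:
#   load_per_cpu "$(cut -d' ' -f1 /proc/loadavg)" "$(getconf _NPROCESSORS_ONLN)"
```

On Linux the load average also counts processes stuck waiting on disk, so a high ratio combined with mostly idle CPUs in top is a hint that storage, not Nagios, is the bottleneck.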

They have since changed the back end SAN to use SSD instead of drives with spindles and that helped immensely.

Also - beware the system backup process. My backup team was running an 'Image' backup of ESX virtuals. The method involved making a snapshot, running the backup, and removing the snapshot. That was no good for my Nagios install as the drive data is always churning, the perfdata in particular. I had to make them change to a file by file backup process and things stopped going haywire every night.

Snapshot deletion always triggers a "Disk Consolidation required" scenario. In order for the Disk Consolidation to succeed, I had to stop all the Nagios processes. If I did not stop the processes, the consolidation would run for more than 2 hours and fail. With the Nagios processes stopped, and depending on how old the snapshot was, the disk consolidation took from 10 minutes to more than an hour, but was okay after. Because the snapshot was there, my Disk I/O would go high, and cause CPU load to climb slowly over a couple days. You get used to it being in one place, and when you notice it is using 4 more CPU than normal, and does not drop, something is wrong.

I'll keep thinking, but these are all bad memories and I had tried to forget them. I know there are more memories coming...

FYI - My setup is 8 CPUs and 16GB RAM for 4987 hosts running 44814 services - today... The 6 mod_gearman workers make this possible.

Steve B

Post by **BanditBBS** » Thu Mar 30, 2017 10:03 am

Yeah, my DB server is on an ESX host with my dev XI install, vCenter and the vCenter DB. My XI box is on an ESX host with two NLS servers.

What ya think, IO related, lol?

Post by **dwhitfield** » Thu Mar 30, 2017 12:27 pm

@BanditBBS, have you considered two XI boxes?

Let's take the sales angle out of it...have you considered a Core box and an XI box?

I don't know the details of the setup, but I know we have one customer that runs something like 250 Core installs! I'm not sure how many XI installs go with those.

Post by **BanditBBS** » Thu Mar 30, 2017 3:07 pm

@dwhitfield, feel free to keep sales in the conversation. I'm not afraid to spend money if the need arises.

That being said, what could I do with multiple Core installs under XI? Have them all roll up into it? Then I'd not have the fantastic CCM to config everything.

Or am I missing something?

Post by **dwhitfield** » Thu Mar 30, 2017 3:58 pm

You are more-or-less correct about the CCM. You could certainly use puppet, chef, etc. to manage the Core installs if you wanted to go that route.

One thing you can certainly do with many Core installs is have them all report to Fusion. If you are just talking about a couple of installs, then maybe Fusion doesn't make sense, but if you are looking at Fusion+Core+XI instead of many XI installs, then it certainly makes sense to get Fusion from the financial point of view. Of course, that still leaves you with Core, which it sounds like you don't want. The next version of Fusion should be out very soon: https://www.nagios.com/products/nagios-fusion/ . Whether XI or Core, Fusion is really going to shine once we are talking about scores of installs. That's not to say it isn't useful for smaller numbers, and of course it still depends on how many hosts/services are being managed on each device. It's kinda like people transport. Sometimes you need a wider road, but sometimes you need a train.

Specifically, in https://assets.nagios.com/downloads/gen ... utions.pdf I'm talking about what we call "Federated Monitoring" (other than Fusion). Unfortunately, that document doesn't tell you a ton about how to make it go. You could easily set that up with SNMP Traps. Our tutorial uses the example of two XI machines (although traps will work from HP OneView or whatever): https://support.nagios.com/kb/article.php?id=77

Nagios Support Forum
