Nagios XI IO Issues / Cluster

cfgriffith · Post by **cfgriffith** » Thu Jan 21, 2016 3:28 pm

Hi,

We have been having issues with checks occasionally timing out (100+ second timeouts) every couple of days, causing a lot of false alerts. Even after removing a lot of checks that are not needed, this still happens occasionally.

I would like to set up a front-end server for the web interface and several "poller" servers that run the actual checks. Is this a thing? Is there a guide for this somewhere? Any help would be appreciated. Thanks.

rkennedy · Post by **rkennedy** » Thu Jan 21, 2016 3:40 pm

This sounds like it is indeed related to performance. I have a few questions about your environment:
How many hosts / services are you currently checking?
How many CPUs do you have allocated to the machine?
What is the result of top|head -5?

cfgriffith · Post by **cfgriffith** » Thu Jan 21, 2016 6:20 pm

Currently using Nagios XI 2014 R2.5

This is after toning back a lot:

Active Service checks:
1-min 505
5-min 2,461
15-min 2,553

Host checks:
1-min 58
5-min 349
15-min 369
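As a rough way to read the figures above, the sketch below converts them into average checks per second. It assumes each figure is the number of checks executed within the trailing 1, 5, or 15 minute window; the post does not say this, so treat that interpretation (and the helper name `checks_per_second`) as an assumption rather than how Nagios XI defines them.

```python
# Counts copied from the post. Reading them as "checks executed in the
# trailing N-minute window" is an assumption, not something the post states.
service_checks = {1: 505, 5: 2461, 15: 2553}
host_checks = {1: 58, 5: 349, 15: 369}


def checks_per_second(counts_by_minutes):
    """Map each window length (minutes) to its average checks per second."""
    return {minutes: count / (minutes * 60)
            for minutes, count in counts_by_minutes.items()}


for label, counts in (("service", service_checks), ("host", host_checks)):
    for minutes, rate in checks_per_second(counts).items():
        print(f"{label} checks, {minutes}-min window: {rate:.2f}/s")
```

Under that assumption, the 15-minute rate being lower than the 5-minute rate would simply mean most checks run on a 5-minute interval.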

A lot of the checks are using MRTG for graphing (bandwidth checks).

I would turn off the 1-min checks if I could find them in the configuration (haven't been able to).

What is weird is that when it does happen, pretty much all the checks have a big hiccup and then things calm down slowly afterwards. After reducing the number of checks this has only happened once, but I want to possibly double or triple the number of checks I currently have.

The server is a VM in an ESXi environment

8 CPUs with 2656 MHz used

12GB of memory with about 1GB active

top - 17:19:23 up 7 days, 7:22, 2 users, load average: 1.33, 1.33, 1.47
Tasks: 251 total, 2 running, 249 sleeping, 0 stopped, 0 zombie
Cpu(s): 12.7%us, 1.8%sy, 0.0%ni, 85.0%id, 0.3%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 12298976k total, 3522968k used, 8776008k free, 324240k buffers
Swap: 1048568k total, 0k used, 1048568k free, 1998320k cached
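For anyone comparing snapshots like the one above, here is a minimal Python sketch that pulls the load averages and I/O wait out of the `top` header and relates the load to the CPU count. The function names are made up for illustration, and the regular expressions assume the older header layout shown in this post (`12.7%us, ... 0.3%wa`); newer `top` versions format the CPU line differently.

```python
import re

# Two header lines copied from the top output in the post.
TOP_HEADER = """\
top - 17:19:23 up 7 days, 7:22, 2 users, load average: 1.33, 1.33, 1.47
Cpu(s): 12.7%us, 1.8%sy, 0.0%ni, 85.0%id, 0.3%wa, 0.0%hi, 0.1%si, 0.0%st
"""


def parse_top_header(text):
    """Extract the three load averages and the iowait percentage."""
    load = re.search(r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)", text)
    iowait = re.search(r"([\d.]+)%wa", text)
    if load is None or iowait is None:
        raise ValueError("unrecognized top header format")
    return {
        "load": tuple(float(value) for value in load.groups()),
        "iowait_pct": float(iowait.group(1)),
    }


def load_per_cpu(load_average, cpu_count):
    """Load average divided by the number of CPUs."""
    return load_average / cpu_count


stats = parse_top_header(TOP_HEADER)
print("load averages:", stats["load"])
print("iowait %:", stats["iowait_pct"])
print("1-min load per CPU (8 CPUs):", load_per_cpu(stats["load"][0], 8))
```

A single sample taken while things are calm says little about an intermittent stall, so numbers like these are more useful when captured during one of the hiccups.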

I rebooted it recently to alleviate the issues after having removed some checks. RHEL 5.

It may also be worth noting that I am inheriting this deployment from a previous admin, so there may be something weird causing these big hiccups. I have checked cron and nothing seems out of place. It does run MRTG about every 5-10 minutes, though, I believe to initialize configurations.

tmcdonald · Post by **tmcdonald** » Fri Jan 22, 2016 3:04 pm

Just a couple random thoughts:

The MRTG cron job runs every 5 minutes. It does all the heavy lifting for bandwidth checks; Nagios basically just reads the information from the MRTG result files.
Regarding your frontend/poller setup, I think you are describing mod_gearman: https://assets.nagios.com/downloads/nag ... ios_XI.pdf - Let us know if this sort of thing is what you are looking for.
Can you think of any events that correlate with the timeouts? Backups, security scans, anything like that?

cfgriffith · Post by **cfgriffith** » Mon Jan 25, 2016 10:33 am

Not really. It seems to have stopped after I deleted a bunch of checks. I am pretty sure it is / was just an IO issue. Is this an issue people see with VMs versus hardware deployments? Gearman looks like it may be what I am looking for, but what is a recommended deployment for it, and does it scale?

I.e., do most people normally set up two polling servers and one primary server, or more than that? How many processors and how much RAM do you use on said 'worker' servers? Also, down the road, could I add additional polling servers? Any information about other deployments of gearman would be most helpful. Thanks again.

Do the "worker" installations require a full Nagios XI 2014 install as well, or just the install mentioned above?

As for the RHEL 6 requirement, I am totally fine with that.

tmcdonald · Post by **tmcdonald** » Mon Jan 25, 2016 2:49 pm

mod_gearman is used in environments of all sizes. Some people have a single worker server, some people have dozens. Some are dual-core machines, others are more specced than the Nagios server itself. It really depends on what resources you have/need. Regarding scaling, it's pretty painless to add or remove workers as needed. All a worker needs is the gearman software and whatever plugins it might be running - no need to do a full XI or even a Core install.

Getting into specifics is tip-toeing into consulting territory, but the general advice above is what I usually give to people looking into gearman setups.

cfgriffith · Post by **cfgriffith** » Mon Jan 25, 2016 6:14 pm

Sounds good enough. I will take that approach. As far as the standard XI setup (one server) goes, at how many checks does I/O usually start to become a problem with 4 CPUs and about 8-16 GB of RAM? (Final question, just to determine whether it is a problem with Nagios or a problem with the server itself, which is a really old VM.)

I will probably be building a brand new deployment.

tmcdonald · Post by **tmcdonald** » Tue Jan 26, 2016 10:32 am

A single XI server can handle up to about 10,000 checks before it needs some optimizations. At that point, mod_gearman, implementing a RAM disk, offloading the MySQL database, and using rrdcached will all help improve the performance of your server. Beyond that, at about the 20,000 mark, I recommend splitting the checks across two servers and giving each a part of the load.

Bear in mind there are a *lot* of variables in play here (frequency and type of checks, how many hosts/services are down, whether you have event handlers, etc.) but this general advice has held true for me for quite a while.
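The rule of thumb above can be written down as a small Python sketch. The two thresholds are the approximate figures from this post; the function name and the exact cutoffs are illustrative assumptions, and the variables mentioned above (check frequency, check type, event handlers, and so on) will move the real boundaries.

```python
# Approximate thresholds taken from the post; both are "about" figures.
OPTIMIZE_AT = 10_000   # around here a single server starts needing tuning
SPLIT_AT = 20_000      # around here splitting across two servers is advised


def sizing_advice(total_checks):
    """Return the post's general advice for a given total number of checks."""
    if total_checks < OPTIMIZE_AT:
        return "single server, no special optimizations"
    if total_checks < SPLIT_AT:
        return ("single server with optimizations: mod_gearman, RAM disk, "
                "offloaded MySQL, rrdcached")
    return "split the checks across two servers"


for count in (3_000, 12_000, 25_000):
    print(count, "->", sizing_advice(count))
```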

cfgriffith · Post by **cfgriffith** » Tue Jan 26, 2016 10:48 am

Sorry to keep bugging you. I know this is bordering on a consultation, so this will be my last question. What exactly do you mean by a RAM disk?

hsmith · Post by **hsmith** » Tue Jan 26, 2016 10:53 am

We have a good document explaining RAM disks here: https://assets.nagios.com/downloads/nag ... giosXI.pdf
