Hi,
Every couple of days, some of our checks time out (100+ second timeouts), which causes a lot of false alerts. This still happens occasionally, even after removing a lot of checks that were not needed.
I would like to set up a front-end server for the web front end and several "poller" services that run the actual checks. Is this a thing? Is there a guide for this somewhere? Any help would be appreciated. Thanks.
Nagios XI IO Issues / Cluster
Re: Nagios XI IO Issues / Cluster
This sounds like it is indeed related to performance. I have a few questions about your environment:
How many hosts / services are you currently checking?
How many CPUs do you have allocated to the machine?
What is the result of top|head -5?
Former Nagios Employee
-
cfgriffith
- Posts: 83
- Joined: Tue Jan 15, 2013 4:22 pm
Re: Nagios XI IO Issues / Cluster
Currently using Nagios XI 2014 R2.5
This is after toning back a lot:
Active Service checks:
1-min 505
5-min 2,461
15-min 2,553
Host checks:
1-min 58
5-min 349
15-min 369
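As a rough back-of-the-envelope reading of the figures above, the sketch below converts each window into an average number of checks per second. It assumes that the 1-min, 5-min, and 15-min figures count active checks executed within each trailing window; that interpretation is an assumption, not something stated in this thread, and the names `SERVICE_CHECKS`, `HOST_CHECKS`, and `checks_per_second` are made up for illustration.

```python
# Assumption: each figure is the number of active checks executed
# within the trailing window of that many minutes.
SERVICE_CHECKS = {1: 505, 5: 2461, 15: 2553}
HOST_CHECKS = {1: 58, 5: 349, 15: 369}


def checks_per_second(counts):
    """Map window length in minutes to average checks per second."""
    return {minutes: count / (minutes * 60) for minutes, count in counts.items()}


if __name__ == "__main__":
    for label, counts in (("service", SERVICE_CHECKS), ("host", HOST_CHECKS)):
        for minutes, rate in checks_per_second(counts).items():
            print(f"{label} checks, {minutes}-min window: {rate:.2f}/s")
```

Under that assumption, the 1-min and 5-min service figures both work out to a little over 8 checks per second.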
A lot of the checks are using MRTG for graphing (bandwidth checks).
I would turn off the 1-min checks if I could find them in the configuration (I haven't been able to).
What is weird is that when it does happen, pretty much all the checks have a big hiccup, and then things slowly calm down afterwards. Since reducing the number of checks this has only happened once, but I want to possibly double or triple the number of checks I currently have.
The server is a VM in an ESXi environment
8 CPUs with 2656 MHz used
12 GB of memory with about 1 GB active
top - 17:19:23 up 7 days, 7:22, 2 users, load average: 1.33, 1.33, 1.47
Tasks: 251 total, 2 running, 249 sleeping, 0 stopped, 0 zombie
Cpu(s): 12.7%us, 1.8%sy, 0.0%ni, 85.0%id, 0.3%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 12298976k total, 3522968k used, 8776008k free, 324240k buffers
Swap: 1048568k total, 0k used, 1048568k free, 1998320k cached
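To read the snapshot above at a glance, here is a minimal sketch that pulls the load averages and CPU percentages out of the `top` header. The two header lines are copied from the output above and the CPU count of 8 comes from the post; the function name `parse_top_header` and the constants are hypothetical helpers, not part of any Nagios tooling.

```python
import re

# Header lines copied from the top output above.
TOP_HEADER = """\
top - 17:19:23 up 7 days, 7:22, 2 users, load average: 1.33, 1.33, 1.47
Cpu(s): 12.7%us, 1.8%sy, 0.0%ni, 85.0%id, 0.3%wa, 0.0%hi, 0.1%si, 0.0%st
"""

CPU_COUNT = 8  # from the post: 8 CPUs allocated to the VM


def parse_top_header(text):
    """Return the load averages and the per-state CPU percentages."""
    load = re.search(r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)", text)
    if load is None:
        raise ValueError("no load average found in top header")
    # Pairs such as "0.3%wa" become {"wa": 0.3}.
    cpu = {state: float(value) for value, state in re.findall(r"([\d.]+)%(\w+)", text)}
    return {"load": tuple(float(x) for x in load.groups()), "cpu": cpu}


if __name__ == "__main__":
    info = parse_top_header(TOP_HEADER)
    print(f"1-min load per CPU: {info['load'][0] / CPU_COUNT:.2f}")
    print(f"iowait: {info['cpu']['wa']}%  idle: {info['cpu']['id']}%")
```

For this particular snapshot the 1-minute load is about 0.17 per CPU and iowait is 0.3%, so the machine does not look I/O bound at the moment it was taken. A sample captured during one of the hiccups would likely be more telling.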
I rebooted it recently to alleviate the issues, after having removed some checks. The OS is RHEL 5.
It may also be worth noting that I am inheriting this deployment from a previous admin, so there may be something weird causing these big hiccups. I have checked cron and nothing seems out of place. It does run MRTG about every 5-10 minutes, though, I believe to initialize configurations.
Re: Nagios XI IO Issues / Cluster
Just a couple random thoughts:
- The MRTG cron job runs every 5 minutes. It does all the heavy lifting for bandwidth checks; Nagios basically just reads the information from the MRTG result files.
- Regarding your frontend/poller setup, I think you are describing mod_gearman: https://assets.nagios.com/downloads/nag ... ios_XI.pdf - Let us know if this sort of thing is what you are looking for.
- Can you think of any events that correlate with the timeouts? Backups, security scans, anything like that?
Former Nagios employee
-
cfgriffith
- Posts: 83
- Joined: Tue Jan 15, 2013 4:22 pm
Re: Nagios XI IO Issues / Cluster
Not really. It seems to have stopped after I deleted a bunch of checks. I am pretty sure it is / was just an I/O issue. Is this an issue people see with VMs as opposed to hardware deployments? Gearman looks like it may be what I am looking for, but what is a recommended deployment for it, and does it scale?
I.e., do most people normally set up two polling servers and one primary server, or more than that? How many processors and how much RAM do you use on said 'worker' servers? Also, could I add additional polling servers down the road? Any information about other Gearman deployments would be most helpful. Thanks again.
Do the "worker" installations require a full Nagios XI 2014 install as well, or just the install mentioned above?
As for the RHEL 6 requirement, I am totally fine with that.
Re: Nagios XI IO Issues / Cluster
mod_gearman is used in environments of all sizes. Some people have a single worker server, some people have dozens. Some workers are dual-core machines, others are more heavily specced than the Nagios server itself. It really depends on what resources you have/need. Regarding scaling, it's pretty painless to add or remove workers as needed. All a worker needs is the gearman software and whatever plugins it might be running - no need to do a full XI or even a Core install.
Getting into specifics is tip-toeing into consulting territory, but the general advice above is what I usually give to people looking into gearman setups.
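The general shape of such a setup (a central server queues check jobs, and any number of workers pull jobs and report results back) can be illustrated with a small sketch. This is not the Gearman protocol or the mod_gearman API: threads in a single process stand in for worker machines, the function `run_checks` and the worker names are hypothetical, and each "check" is a placeholder that always reports OK.

```python
import queue
import threading


def run_checks(jobs, worker_count):
    """Distribute (host, check) jobs across worker threads and collect results."""
    pending = queue.Queue()
    for job in jobs:
        pending.put(job)

    results = []
    lock = threading.Lock()

    def worker(name):
        while True:
            try:
                host, check = pending.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker exits
            # Placeholder for executing a real plugin; always reports OK here.
            outcome = (host, check, "OK", name)
            with lock:
                results.append(outcome)

    threads = [
        threading.Thread(target=worker, args=(f"worker-{i}",))
        for i in range(worker_count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    jobs = [(f"host{i}", "check_ping") for i in range(10)]
    for host, check, status, name in sorted(run_checks(jobs, worker_count=3)):
        print(f"{host} {check}: {status} (handled by {name})")
```

Changing `worker_count` is the only thing needed to scale this toy model up or down, which mirrors the point above that workers can be added or removed without touching the central server.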
Former Nagios employee
-
cfgriffith
- Posts: 83
- Joined: Tue Jan 15, 2013 4:22 pm
Re: Nagios XI IO Issues / Cluster
Sounds good enough. I will take that approach. As far as the standard XI setup (one server) goes, at how many checks does I/O usually start to become a problem with 4 CPUs and about 8-16 GB of RAM? (This is a final question, just to determine whether it is a problem with Nagios or a problem with the server itself, which is a really old VM.)
I will probably be building a brand new deployment.
Re: Nagios XI IO Issues / Cluster
A single XI server can handle up to about 10,000 checks before it needs some optimizations. At that point, mod_gearman, implementing a RAM disk, offloading the MySQL database, and using rrdcached will all help improve the performance of your server. Beyond that, at about the 20,000 mark, I recommend splitting the checks across two servers and giving each a part of the load.
Bear in mind there are a *lot* of variables in play here (frequency and type of checks, how many hosts/services are down, whether you have event handlers, etc.) but this general advice has held true for me for quite a while.
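The rule of thumb above can be written down as a small lookup. This is only a sketch of the general advice in this post: the function name `scaling_advice` is hypothetical, the cut-offs are the approximate 10,000 and 20,000 figures quoted above, and it ignores all of the variables just mentioned. The example also assumes that the 15-min figures posted earlier (2,553 service checks and 369 host checks) approximate the total number of active checks.

```python
def scaling_advice(total_checks):
    """Return the rough guidance from the post above for a given check count."""
    if total_checks <= 10_000:
        return "single XI server, no special optimizations expected"
    if total_checks < 20_000:
        return (
            "single XI server with optimizations such as mod_gearman, "
            "a RAM disk, an offloaded MySQL database, or rrdcached"
        )
    return "split the checks across two servers"


if __name__ == "__main__":
    # Assumption: the 15-min figures approximate the total active checks.
    current = 2553 + 369
    for label, count in (("current", current), ("tripled", 3 * current)):
        print(f"{label}: {count} checks -> {scaling_advice(count)}")
```

Under that assumption the current total is 2,922 checks, and even tripling it gives 8,766, which stays below the 10,000 mark.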
Former Nagios employee
-
cfgriffith
- Posts: 83
- Joined: Tue Jan 15, 2013 4:22 pm
Re: Nagios XI IO Issues / Cluster
Sorry to keep bugging you. I know this is bordering on a consultation, so this will be my last question. What exactly do you mean by a RAM disk?
Re: Nagios XI IO Issues / Cluster
We have a good document explaining RAMDisks here: https://assets.nagios.com/downloads/nag ... giosXI.pdf
Former Nagios Employee.