I've upgraded from 3.4.0 to 4.0.0. My installation monitors 2300 hosts and 4500 services. Nagiostats shows that the load averages on the system have dropped dramatically since the upgrade and that the check count rate has evened out from its previous peaky nature, but the number of checks per second has dropped as well. Is this expected behaviour? Am I misinterpreting the nagiostats output?
[Attached graphs: check rate; load average]
Gavin
Check rate under 4.0.0
Re: Check rate under 4.0.0
This probably has to do with the addition of Core Workers in Nagios 4.0.0. See the extended release notes: http://labs.nagios.com/2013/09/20/nagio ... available/
Re: Check rate under 4.0.0
These two metrics are symptoms of some of the most important changes in Core 4. The Nagios World Conference is next week, and over the following months videos of all the presentations will be available online. Keep an eye on the labs page, as Eric Stanley, Ethan Galstad and Andreas Ericsson are all covering different aspects of Core 4.
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
Re: Check rate under 4.0.0
magna.vis wrote: This probably has to do with the addition of Core Workers in Nagios 4.0.0. See the extended release notes: http://labs.nagios.com/2013/09/20/nagio ... available/
The drop in the load average metric I understand and expected with the changes in the core workers model. The drop in the check rate reported by nagiostats is not so clear though.
Re: Check rate under 4.0.0
From your graph it still looks like Nagios is running roughly the same number of checks. Do you have something else you are referencing for this info? The current, average, and maximum plots all include peaks, so if it did peak before the update, as it looks like it did, those numbers will all slowly get smaller as problems are resolved and checks return to their normal rates. There can of course be other explanations, but this is probably the main one.
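To illustrate the point about peaks lingering in windowed figures, here is a minimal sketch. The per-minute counts are made up for the example and are not taken from the graphs in this thread, and `trailing_window_sums` is a hypothetical helper, not part of Nagios. It only shows how a short burst keeps a trailing 15-minute total elevated until the burst ages out of the window.

```python
from collections import deque


def trailing_window_sums(per_minute_counts, window):
    """Return, for each minute, the sum of the most recent `window` samples."""
    recent = deque(maxlen=window)  # oldest sample drops off automatically
    sums = []
    for count in per_minute_counts:
        recent.append(count)
        sums.append(sum(recent))
    return sums


# Hypothetical data: a steady 100 checks/minute with a 3-minute burst of 400.
counts = [100] * 5 + [400] * 3 + [100] * 20
sums_15 = trailing_window_sums(counts, 15)

# The burst ends at minute 7, but the 15-minute total stays inflated
# until the burst samples leave the window, then settles back down.
for minute in (14, 19, 20, 21, 22):
    print(minute, sums_15[minute])
```

In this made-up series the 15-minute total only returns to its steady value once all three burst samples have left the window, which is the kind of slow decline described above.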
Re: Check rate under 4.0.0
The peaky section of the graphs prior to 1240 on the x axis was the system running under 3.4.0; the drop to zero is me changing over to 4.0.0, and the section after that is the system running under 4.0.0.
There are no outstanding problem sites on the system for the duration of the plot so that's not a factor.
This is a 24-core Linux system with 32GB of RAM and 4 x 1GB NICs (trunked). I have 2 of these systems running the same checks in parallel for redundancy; the other box is still running 3.4.0 as a control during the changeover to 4.0.0.
What has shown up with 4.0.0 after letting it run for a day is that the 15-minute check count, which gives me a more "smoothed" metric, has now stabilised at a higher rate than under 3.4.0, and the system LA metrics have stayed consistently lower than under 3.4.0, so the performance under 4.0.0 is a marked improvement.
The proof of the pudding will be when we next have a major outage. I've found that under 3.x.x the system has taken a long time to detect all the problem nodes returning to a good state; hopefully 4.0.0 will help me with this issue.
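For anyone comparing the 1-, 5- and 15-minute check counts against a checks-per-second figure, a rough conversion can be sketched as below. This assumes each count is the total number of checks run over the trailing window; the numbers are hypothetical and are not the values from either of the boxes described here.

```python
def checks_per_second(count, window_minutes):
    """Convert a total check count over a trailing window into checks/second."""
    return count / (window_minutes * 60)


# Hypothetical window totals (minutes -> checks run in that window).
windows = {1: 1400, 5: 6900, 15: 20700}

rates = {minutes: round(checks_per_second(count, minutes), 2)
         for minutes, count in windows.items()}

for minutes, rate in rates.items():
    print(f"{minutes:>2} min window: {rate} checks/s")
```

Normalising the windows this way makes it easier to see whether a lower 1-minute figure is a real drop in throughput or just the absence of a peak that the longer windows average out.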
Re: Check rate under 4.0.0
GJN65 wrote: The proof of the pudding will be when we next have a major outage, I've found that under 3.x.x the system has taken a long time to detect all the problem nodes back into good state, hopefully 4.0.0 will help me with this issue.
Well, for your sake, I hope you don't get a chance to test it. Although, we would be highly interested in your results. Keep us informed. Those performance numbers are exciting.
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.