Check rate under 4.0.0

Locked
GJN65
Posts: 3
Joined: Tue Sep 24, 2013 10:21 pm

Check rate under 4.0.0

Post by GJN65 »

I've upgraded from 3.4.0 to 4.0.0. My installation is monitoring 2300 hosts and 4500 services. Nagiostats shows that the load averages on the system have dropped dramatically since the upgrade and the check count rate has evened out from its previous peaky nature, but the number of checks per second has dropped as well. Is this expected behaviour? Am I misinterpreting the nagiostats output?
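As a rough sanity check on what a "normal" rate would be for an installation of this size, here is a minimal sketch in Python. The 300-second check interval for both hosts and services is purely an assumption for illustration (the real intervals are not stated in this thread), and `expected_check_rate` is a hypothetical helper, not part of Nagios.

```python
def expected_check_rate(hosts, services, host_interval_s, service_interval_s):
    """Steady-state active checks per second, assuming every host and
    service is actively checked exactly once per interval."""
    return hosts / host_interval_s + services / service_interval_s


# 2300 hosts and 4500 services are the figures from the post above.
# The 300 s intervals are ASSUMED, not taken from the actual config.
rate = expected_check_rate(2300, 4500, 300, 300)
print(f"{rate:.1f} checks/sec")
```

If the long-run average reported by nagiostats sits near a figure like this both before and after the upgrade, the scheduler is doing the same amount of work and only the shape of the curve has changed.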

Check rate: [attachment: nagios_cehck_rate.png]
Load average: [attachment: nagios_load_average.png]
Gavin
magna.vis
Posts: 8
Joined: Tue Sep 24, 2013 9:37 pm

Re: Check rate under 4.0.0

Post by magna.vis »

This probably has to do with the addition of Core Workers in Nagios 4.0.0. See the extended release notes: http://labs.nagios.com/2013/09/20/nagio ... available/
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: Check rate under 4.0.0

Post by abrist »

These two metrics are symptoms of some of the most important changes in Core 4. The Nagios World Conference is next week, and over the following months, videos of all the presentations will be available online. Keep your eyes on the labs page, as Eric Stanley, Ethan Galstad and Andreas Ericsson are all covering different aspects of Core 4.
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
GJN65
Posts: 3
Joined: Tue Sep 24, 2013 10:21 pm

Re: Check rate under 4.0.0

Post by GJN65 »

magna.vis wrote:This probably has to do with the addition of Core Workers in Nagios 4.0.0. See the extended release notes: http://labs.nagios.com/2013/09/20/nagio ... available/
I understand the drop in the load average metric, and expected it given the changes in the core workers model. The drop in the check rate reported by nagiostats is not so clear, though.
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: Check rate under 4.0.0

Post by slansing »

From your graph it still looks like Nagios is checking roughly the same amount. Do you have something else you are referencing for this info? The current, average, and maximum plots all include peaks, so if it did peak before the update, as it looks like it did, those numbers will all slowly get smaller as problems are resolved and checks return to their normal rates. There can of course be other explanations, but this is probably the main one.
GJN65
Posts: 3
Joined: Tue Sep 24, 2013 10:21 pm

Re: Check rate under 4.0.0

Post by GJN65 »

The peaky section of the graphs prior to 1240 on the x axis was the system running under 3.4.0. The drop to zero is me changing over to 4.0.0, and the section after that is the system running under 4.0.0.

There are no outstanding problem sites on the system for the duration of the plot so that's not a factor.

This is a 24 core Linux system with 32GB of RAM and 4 x 1GB NICs (trunked). I have 2 of these systems running the same checks in parallel for redundancy; the other box is still running 3.4.0 as a control during the changeover to 4.0.0.

What has shown up with 4.0.0 after letting it run for a day is that the 15 minute check count, which gives me a more "smoothed" metric, has now stabilised at a higher rate than under 3.4.0, and the system load average metrics have stayed consistently lower than under 3.4.0. So the performance under 4.0.0 is a marked improvement.
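To compare the 1, 5 and 15 minute figures on the same scale, each count can be divided by the length of its window. The sketch below assumes the counts are totals of checks run in the last N minutes; the numbers in `counts` are made up for illustration and are not from my graphs, and `per_second_rates` is just a hypothetical helper.

```python
def per_second_rates(window_counts):
    """Convert {window_minutes: checks_in_window} into
    {window_minutes: checks_per_second}."""
    return {minutes: count / (minutes * 60)
            for minutes, count in window_counts.items()}


# HYPOTHETICAL counts for the last 1, 5 and 15 minutes.
counts = {1: 1500, 5: 6900, 15: 20400}

for minutes, rate in per_second_rates(counts).items():
    print(f"last {minutes:>2} min: {rate:.2f} checks/sec")
```

The longer the window, the less a short burst moves the result, which is why the 15 minute figure is the one to watch when comparing the two versions.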

The proof of the pudding will be when we next have a major outage. I've found that under 3.x.x the system has taken a long time to detect that all the problem nodes have returned to a good state; hopefully 4.0.0 will help me with this issue.
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: Check rate under 4.0.0

Post by abrist »

GJN65 wrote: The proof of the pudding will be when we next have a major outage, I've found that under 3.x.x the system has taken a long time to detect all the problem nodes back into good state, hopefully 4.0.0 will help me with this issue.
Well, for your sake, I hope you don't get a chance to test it, although we would be highly interested in your results. Keep us informed. Those performance numbers are exciting.