
Check rate under 4.0.0

Posted: Tue Sep 24, 2013 10:32 pm
by GJN65
I've upgraded from 3.4.0 to 4.0.0. My installation monitors 2300 hosts and 4500 services. Nagiostats shows that the load averages on the system have dropped dramatically since the upgrade, and the check count rate has evened out from its previous peaky nature, but the number of checks per second has dropped as well. Is this expected behaviour? Am I misinterpreting the nagiostats output?

Check rate:
nagios_cehck_rate.png
Load average:
nagios_load_average.png
Gavin

Re: Check rate under 4.0.0

Posted: Wed Sep 25, 2013 9:36 am
by magna.vis
This probably has to do with the addition of Core Workers in Nagios 4.0.0. See the extended release notes: http://labs.nagios.com/2013/09/20/nagio ... available/

Re: Check rate under 4.0.0

Posted: Wed Sep 25, 2013 11:00 am
by abrist
These two metrics are symptoms of some of the most important changes in Core 4. The Nagios World Conference is next week, and over the following months, videos of all the presentations will be available online. Keep your eyes on the labs page, as Eric Stanley, Ethan Galstad and Andreas Ericsson are all covering different aspects of Core 4.

Re: Check rate under 4.0.0

Posted: Wed Sep 25, 2013 8:03 pm
by GJN65
magna.vis wrote: This probably has to do with the addition of Core Workers in Nagios 4.0.0. See the extended release notes: http://labs.nagios.com/2013/09/20/nagio ... available/
I understand the drop in the load average metric, and expected it given the changes in the core workers model. The drop in the check rate reported by nagiostats is less clear, though.

Re: Check rate under 4.0.0

Posted: Thu Sep 26, 2013 9:53 am
by slansing
From your graph it still looks like Nagios is running roughly the same number of checks. Do you have something else you are referencing for this info? The current, average, and maximum plots all include peaks, so if it did peak before the update, as it looks like it did, those numbers will all slowly get smaller as problems are resolved and checks return to their normal rates. There can of course be other explanations, but this is probably the main one.

Re: Check rate under 4.0.0

Posted: Thu Sep 26, 2013 8:47 pm
by GJN65
The peaky section of the graphs prior to 1240 on the x axis was the system running under 3.4.0; the drop to zero is me changing over to 4.0.0, and the section after that is the system running under 4.0.0.

There are no outstanding problem sites on the system for the duration of the plot so that's not a factor.

This is a 24 core Linux system with 32GB of RAM and 4 x 1GB NICs (trunked). I have 2 of these systems running the same checks in parallel for redundancy; the other box is still running 3.4.0 as a control during the changeover to 4.0.0.

What has shown up after letting 4.0.0 run for a day is that the 15 minute check count, which gives me a more "smoothed" metric, has now stabilised at a higher rate than under 3.4.0, and the system LA metrics have stayed consistently lower than under 3.4.0. So the performance under 4.0.0 is a marked improvement.
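
For anyone comparing the 1, 5 and 15 minute check counts, here is a minimal Python sketch of turning a windowed count into a checks-per-second figure so the windows can be compared directly. It assumes the counts are totals of checks seen in each trailing window; the function name and the sample numbers are hypothetical and not taken from this system.

```python
def checks_per_second(check_count: int, window_minutes: int) -> float:
    """Convert a count of checks seen in a trailing window into a per-second rate."""
    if window_minutes <= 0:
        raise ValueError("window_minutes must be positive")
    return check_count / (window_minutes * 60)


# Hypothetical counts for the 1, 5 and 15 minute windows (illustrative only).
sample_counts = {1: 1500, 5: 6900, 15: 20700}

# Normalise every window to checks per second.
rates = {
    minutes: checks_per_second(count, minutes)
    for minutes, count in sample_counts.items()
}

for minutes, rate in sorted(rates.items()):
    print(f"{minutes:>2} min window: {rate:.1f} checks/s")
```

With these made-up numbers the 1 minute window catches a short burst, while the 5 and 15 minute windows settle on the same lower rate, which is why the longer window reads as the smoother metric.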

The proof of the pudding will be when we next have a major outage. I've found that under 3.x.x the system has taken a long time to detect that all the problem nodes are back in a good state; hopefully 4.0.0 will help me with this issue.

Re: Check rate under 4.0.0

Posted: Fri Sep 27, 2013 10:57 am
by abrist
GJN65 wrote: The proof of the pudding will be when we next have a major outage. I've found that under 3.x.x the system has taken a long time to detect that all the problem nodes are back in a good state; hopefully 4.0.0 will help me with this issue.
Well, for your sake, I hope you don't get a chance to test it. That said, we would be highly interested in your results, so keep us informed. Those performance numbers are exciting.