
Alert's history not working after changing concurrent checks

Posted: Wed Jul 16, 2014 3:45 pm
by uselessid
Hello.

Our Nagios (3.5.1) installation has about 10k services, and almost all of them are active service checks.

We were experiencing high load on the OS and (we don't know if it is related) gaps in random graphs.

So we followed the instructions for improving Nagios performance (http://nagios.sourceforge.net/docs/3_0/tuning.html), which led us to change some parameters, including max_concurrent_checks, until we reached a balance between OS load and check latency. But lowering max_concurrent_checks from 0 (unlimited) to 250 polluted the log file, causing the alert history to simply stop (at least most of the time). It keeps logging "Max concurrent service checks (250) has been reached", and this seems to break the alert history as the log grows.

Our check latency was around 170 seconds; after the changes it dropped to between 5 and 15 seconds.

I know that Nagios keeps postponing the checks, but it was the best way we found to make active and on-demand checks more responsive.
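For reading alerts out of the log by hand while these messages keep flooding it, one option is to filter a copy of the log before inspecting it. The following is a minimal sketch, assuming only that the noisy lines contain the text quoted in the post. The function name `drop_concurrency_noise` and the sample log lines are hypothetical, and the sketch does not change what Nagios writes or what its web interface reads.

```python
from typing import Iterable, Iterator

# Substring taken from the message quoted in the post; nothing else about
# the line format is assumed.
NOISE = "Max concurrent service checks"


def drop_concurrency_noise(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the lines that are not concurrency-limit warnings."""
    for line in lines:
        if NOISE not in line:
            yield line


# Hypothetical sample lines, for illustration only.
sample = [
    "[1405523100] Max concurrent service checks (250) has been reached",
    "[1405523101] SERVICE ALERT: web01;HTTP;CRITICAL;HARD;3;timeout",
    "[1405523102] Max concurrent service checks (250) has been reached",
]
kept = list(drop_concurrency_noise(sample))
print(len(kept))
```

The same generator can be fed an open file object instead of a list, since it only iterates over lines.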

Re: Alert's history not working after changing concurrent ch

Posted: Wed Jul 16, 2014 8:06 pm
by Box293
Are you using MRTG?

Re: Alert's history not working after changing concurrent ch

Posted: Thu Jul 17, 2014 6:32 am
by uselessid
For general graphing I'm using Nagios Grapher.

For nagiostats specifically I'm using MRTG, yes.

Re: Alert's history not working after changing concurrent ch

Posted: Thu Jul 17, 2014 5:05 pm
by abrist
uselessid wrote: For general graphing I'm using Nagios Grapher.

For nagiostats specifically I'm using MRTG, yes.
The reason for the lower load is that a ton of your checks are most likely not running anywhere near their scheduled time. Was your issue mostly due to load or I/O wait?

Re: Alert's history not working after changing concurrent ch

Posted: Thu Jul 17, 2014 9:19 pm
by Box293
uselessid wrote: For nagiostats specifically I'm using MRTG, yes.
I asked this because I've seen situations where stale MRTG configuration can lead to a higher load which can cause gaps in graphs.

If you have decommissioned objects in your MRTG config file, delays while MRTG tries to reach these non-contactable devices can make MRTG run longer than expected. With about 10k services, it is possible that decommissioned devices are still in your MRTG configs.

I hope this is of some help.

Re: Alert's history not working after changing concurrent ch

Posted: Fri Jul 18, 2014 7:14 am
by uselessid
abrist wrote:
uselessid wrote: For general graphing I'm using Nagios Grapher.

For nagiostats specifically I'm using MRTG, yes.
The reason for the lower load is that a ton of your checks are most likely not running anywhere near their scheduled time. Was your issue mostly due to load or I/O wait?

Mostly related to load.

I know that Nagios postpones checks when I use a low max_concurrent_checks value, but overall the check latency (active checks) got considerably lower and on-demand checks became almost instantaneous.

For our scenario, not using unlimited max_concurrent_checks seems to be the best setting.

I only wish there were a way to suppress those log messages (concurrent checks reached).

Re: Alert's history not working after changing concurrent ch

Posted: Fri Jul 18, 2014 12:49 pm
by lmiltchev
I only wish there were a way to suppress those log messages (concurrent checks reached).
I don't believe there is a way to suppress these messages. I would recommend focusing on finding out what is causing the high load on the system.

Re: Alert's history not working after changing concurrent ch

Posted: Wed Jul 23, 2014 6:36 am
by uselessid
The load is possibly due to the large number of services monitored by active checks, about 10k, so there are a lot of concurrent checks.

Besides that, the server also runs the check scripts itself, and because we monitor remote locations over MPLS/Internet, where latency and packet loss may occur, some scripts hang until they time out. Add it all up and you get the high load.

I don't think there's much we can do. Lowering max_concurrent_checks is our best choice!
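Scripts that hang on a lossy link hold a check slot until something stops them. Nagios has its own service_check_timeout setting for this; as a complementary illustration, the sketch below shows a wrapper that gives up after a fixed time and reports UNKNOWN (exit code 3 in the plugin convention). The function name `run_with_timeout` and the sleeping child process standing in for a hung check are assumptions for the example, not anything from the thread.

```python
import subprocess
import sys

UNKNOWN = 3  # plugin exit code conventionally used for UNKNOWN


def run_with_timeout(cmd, timeout_s):
    """Run a check command; return (exit_code, output), giving up after timeout_s."""
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so no process lingers.
        return UNKNOWN, f"UNKNOWN - check timed out after {timeout_s}s"
    return proc.returncode, proc.stdout.strip()


# Stand-in for a hanging check script: a child process that sleeps 30 s.
code, output = run_with_timeout(
    [sys.executable, "-c", "import time; time.sleep(30)"], timeout_s=1
)
print(code, output)
```

A shorter timeout frees the slot sooner but risks reporting UNKNOWN for checks that are merely slow, so the value would need tuning per link.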

Re: Alert's history not working after changing concurrent ch

Posted: Wed Jul 23, 2014 4:26 pm
by tmcdonald
Is there any way you could offload some of the checks to be passive? That would help with both the load and the concurrent checks.

Also, maybe try tweaking the check_interval for some non-critical checks to 10, 15, or higher.
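Passive results reach Nagios as PROCESS_SERVICE_CHECK_RESULT external commands written to the file named by the command_file setting. The sketch below only formats such a line; the host name, service description, and the helper name `passive_result_line` are hypothetical, and the command file path is left out because it depends on the installation.

```python
import time


def passive_result_line(host, service, return_code, output, now=None):
    """Format a PROCESS_SERVICE_CHECK_RESULT external command line."""
    timestamp = int(time.time() if now is None else now)
    return (
        f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{return_code};{output}\n"
    )


# Hypothetical host and service names; the timestamp is fixed for repeatability.
line = passive_result_line(
    "remote-site-01", "Link latency", 0, "OK - 42 ms", now=1406132760
)
print(line, end="")
```

A remote site could run the check locally and have a line like this delivered to the central server, so the central server no longer waits on the WAN for that result.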

Re: Alert's history not working after changing concurrent ch

Posted: Fri Jul 25, 2014 9:38 am
by uselessid
I've changed the check_interval for the most costly services, but it didn't help much.

The load wouldn't be a problem as long as it didn't affect check latency or graphing (I don't know exactly whether the two are related).
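A rough estimate may explain why changing check_interval for a handful of services moves the overall check rate so little. The sketch below uses the 10k services mentioned in the thread; the 5-minute default interval and the 200 services moved to a 15-minute interval are assumptions made up for illustration, not figures from the posts.

```python
def checks_per_second(groups):
    """groups: iterable of (service_count, check_interval_minutes) pairs."""
    return sum(count / (interval_min * 60) for count, interval_min in groups)


# Assumed: every service checked every 5 minutes.
before = checks_per_second([(10_000, 5)])
# Assumed: 200 costly services moved to a 15-minute interval.
after = checks_per_second([(9_800, 5), (200, 15)])
print(f"{before:.1f} -> {after:.1f} checks/s")
```

Under these assumptions the scheduled rate barely changes, although the load saved can be larger than the rate suggests if the moved checks are the long-running ones.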