Passive check processing lag over 15 minutes
Posted: Tue Oct 02, 2012 2:17 pm
Over the last 30 days I've noticed, on multiple occasions, Nagios falling way behind in processing passive checks, and I'd love some insight into how to address it!
Background:
I've set up nagios in a distributed configuration, where the master acts as the web frontend and alert system, and a variable number of workers pump data through SQS to a process that (eventually) writes to the /var/lib/nagios3/rw/nagios.cmd pipe. (3.2.0-4ubuntu2.2)
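For reference, the last hop of that pipeline is just the standard external-command write. A minimal sketch of what the SQS consumer ultimately does (host/service names and plugin output here are made up for illustration):

```shell
#!/bin/sh
# Write one passive service check result into Nagios's external command pipe.
# Format: [timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;return_code;plugin_output
submit_passive_result() {
    # $1=host  $2=service  $3=return code (0-3)  $4=plugin output
    cmd_pipe=${CMD_PIPE:-/var/lib/nagios3/rw/nagios.cmd}
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "$(date +%s)" "$1" "$2" "$3" "$4" > "$cmd_pipe"
}

# Example usage:
# submit_passive_result "i-a9bc1dce.us-east-1" "load" 0 "OK - load average 0.42"
```

The real consumer batches results off SQS before writing, but each line that hits the pipe looks like the printf above.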
Most of the time this works just fine: it processes around 3500 checks/min with a lag of between 10 and 20 seconds from reality, which is more than acceptable for my needs. Over the last few weeks, though, I've noticed that a weird backlog occasionally develops:
/var/log/nagios3/nagios.log will be up to date with the various "PASSIVE SERVICE CHECK: i-a9bc1dce.us-east-1..." entries, but /var/lib/nagios3/{host,service}-perfdata.out will lag behind, updating in less-than-realtime.
After writing a cronjob to diff `date +%s` against the timestamps in those logs, I've now witnessed, on more than one occasion, the perfdata.out files being behind by upwards of 45 minutes while nagios.log was perfectly caught up. The web UI also indicated that the last check was received 45 minutes ago, even though the logs themselves disagreed.
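For anyone curious, the cron check is roughly the following (this sketch approximates "staleness" via the file's mtime rather than parsing the in-file timestamps, since the perfdata line format depends on your *_perfdata_file_template; the 15-minute threshold is the one mentioned below):

```shell
#!/bin/sh
# Report how stale a perfdata file is, in seconds, by comparing its
# last-modified time to "now". Paths match the Debian/Ubuntu nagios3 layout.
perfdata_lag() {
    file=$1
    now=$(date +%s)
    mtime=$(stat -c %Y "$file")   # GNU stat: mtime as epoch seconds
    echo $((now - mtime))
}

# Example cron usage: complain if service perfdata is >15 minutes stale.
# lag=$(perfdata_lag /var/lib/nagios3/service-perfdata.out)
# [ "$lag" -gt 900 ] && echo "service-perfdata.out lagging by ${lag}s"
```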
Analyzing what I can from the log data: most of the time this happens, it catches up before crossing the 15-minute threshold (at which point I've now added a step to restart nagios), and even in the 45-minute case it was still "working", just more slowly than the checks were coming in.
I have raised the buffer size for checks to about 16k, though typically "Max" has stayed around 4k.
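To be specific about which buffer I mean (assuming I've identified the right knob): this is external_command_buffer_slots in nagios.cfg, which is also the buffer whose "Max" usage shows up on the web UI's Performance Info page.

```
# /etc/nagios3/nagios.cfg
# Slots in the circular buffer that holds incoming external commands
# (passive check results) until the main loop processes them.
external_command_buffer_slots=16384
```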
So, a three-part question:
- Can anyone fathom why this would happen on a machine that's dedicated solely to nagios?
- Is there a way to force nagios to "flush" its processing buffer?
- And failing those two: in these sorts of states, is there a way to force it to just drop its buffer on the floor? I'd rather it "get through it as fast as possible" so that I can get current alerts, rather than spend time dealing with state from 45 minutes ago that's no longer timely or relevant.
Thanks,
J