Re: [Nagios-devel] A different way?
Posted: Fri Sep 25, 2009 5:13 pm
It's very similar.
I call it semi-passive (although someone mentioned passive-aggressive might be a better name for it).
You still have an active Nagios instance, and it still checks that checks executed on time (similar to active); it just no longer does the actual execution (similar to passive). Instead of processing the "meaning" of a check's result, it would just process the outcome as directed by the rules for the host/service being monitored.
Let me give a for instance...
Under my current setup I dispatch a check to a DNX worker node; the check executes and the result is handed wholesale back to Nagios.
Nagios parses the result, tries to divine whether the service is up, down, flapping, etc., and then takes appropriate action.
Here's a breakdown of where time is spent.
Nagios event loop: approx 0.07 seconds handing the service check to DNX
DNX: average of 3 seconds round trip
Nagios: up to 10 seconds to process the result, depending on how many dependencies are involved, and as much as 30 seconds if a host check is required
Now obviously this is because all of my service checks are active and not passive, and I have 3,000 hosts and 30,000 service checks.
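As a rough back-of-the-envelope check on those figures, the sketch below adds up the worst-case stage timings quoted above and reports what fraction of the total is result processing. The stage labels and the helper function are made up for illustration; only the numbers come from the breakdown.

```python
# Worst-case per-check timings in seconds, taken from the breakdown above.
# The dictionary keys are illustrative labels, not Nagios or DNX identifiers.
STAGES = {
    "dispatch": 0.07,        # Nagios event loop handing the check to DNX
    "dnx_round_trip": 3.0,   # average DNX round trip
    "processing": 10.0,      # result processing without a host check
}
PROCESSING_WITH_HOST_CHECK = 30.0


def processing_share(stages, processing_seconds):
    """Return the fraction of the total time spent processing the result."""
    other = sum(v for k, v in stages.items() if k != "processing")
    return processing_seconds / (other + processing_seconds)


if __name__ == "__main__":
    plain = processing_share(STAGES, STAGES["processing"])
    with_host = processing_share(STAGES, PROCESSING_WITH_HOST_CHECK)
    print(f"no host check:   {plain:.0%}")
    print(f"with host check: {with_host:.0%}")
```

Even in the case without a host check, result processing is the largest single stage, which is the motivation for moving that work off the central Nagios instance.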
Under the proposed design it would look more like this.
Nagios initializes and pushes all schedule pieces to all hosts.
Next, Nagios enters a passive mode where it listens for results, and an audit mode where it watches the schedule looking for results that haven't come in yet.
On the flip side, the execution daemon is running on each host and executing the checks, determining what is meant by each check ("service up/down/flapping" etc.), and passing that meaning back to Nagios, which subsequently takes the appropriate action.
All the while the auditor is watching for checks that were scheduled but haven't come in yet, and contacting hosts to find out what's up.
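The audit side could be sketched roughly as follows. This is only an illustration of the idea, not anything that exists in Nagios or DNX: the `ScheduledCheck` record, the `overdue_checks` helper, and the `grace` allowance are all assumed names and parameters.

```python
import time
from dataclasses import dataclass


@dataclass
class ScheduledCheck:
    host: str
    service: str
    interval: float      # seconds between expected results
    last_result: float   # Unix time at which the last result arrived


def overdue_checks(schedule, now, grace=30.0):
    """Return checks whose next result is late by more than `grace` seconds."""
    return [c for c in schedule if now > c.last_result + c.interval + grace]


if __name__ == "__main__":
    now = time.time()
    schedule = [
        ScheduledCheck("web01", "HTTP", 300, now - 120),   # still on time
        ScheduledCheck("db01", "MySQL", 300, now - 900),   # overdue
    ]
    for check in overdue_checks(schedule, now):
        # In the proposed design this is where the auditor would
        # contact the host to find out what went wrong.
        print(f"overdue: {check.host}/{check.service}")
```

The auditor only compares timestamps against the schedule, so its cost per check stays small no matter how expensive the check itself is.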
So really, in some ways this is an expansion of the current passive model for checks, but in other ways it is a whole new model (compared to what we do now, anyway).
Those are my thoughts on the matter. What do you think?
Sincerely,
Steve
________________________________________
From: hemebond [[email protected]]
Sent: Friday, September 25, 2009 2:19 AM
To: Nagios Developers List
Subject: Re: [Nagios-devel] A different way?
Isn't this the same as using passive checks? It sounds like what I've set up. I wrote a simple agent (script) that has its own schedule and runs the checks, sending the result back to a Nagios server.
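The agent itself isn't shown in the thread, so the following is only a guess at its general shape: run a plugin, then format the outcome as a `PROCESS_SERVICE_CHECK_RESULT` external command, the form Nagios accepts for passive service results. The function names, host, and service are hypothetical, and how the line is delivered to the server is left out because the thread doesn't say.

```python
import subprocess
import sys
import time


def run_check(command):
    """Run a plugin command and return (exit_code, first line of its output)."""
    proc = subprocess.run(command, capture_output=True, text=True)
    lines = proc.stdout.strip().splitlines()
    return proc.returncode, (lines[0] if lines else "")


def passive_result_line(host, service, return_code, output, now=None):
    """Format a result as a PROCESS_SERVICE_CHECK_RESULT external command."""
    timestamp = int(time.time() if now is None else now)
    return (f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{return_code};{output}")


if __name__ == "__main__":
    # A stand-in for a real plugin: prints one status line and exits 0.
    code, output = run_check(
        [sys.executable, "-c", "print('OK - stand-in plugin')"]
    )
    print(passive_result_line("web01", "HTTP", code, output))
```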
2009/9/25 Steven D. Morrey:
Hello everyone,
I've decided to take a break for a bit from multi-threading Nagios to focus on DNX, since that is my day job after all.
While working on all of this I had a few thoughts that might make some good ideas if Nagios is ever redesigned again, say for a 4.x branch.
As you know, under Nagios, all checks are dispatched by Nagios to be executed on the local machine at set intervals.
Under a distributed Nagios setup, you have multiple Nagios instances running on various machines, executing checks and passing the results back to a passive master controller.
Under DNX, we distribute the load to "worker nodes", which execute the checks and hand the results back to an active master controller that then processes the results.
Profiling shows that (under DNX at least) two-thirds of our time is spent in the reaper processing results, so wouldn't it make more sense to flip the process around?
The checks are already executing on the local machine, so how about a daemon on each machine? The daemon would keep the schedule and execute service checks locally, process the results, and return the results and the required actions (based on a local policy) to Nagios, which would then do the actual work of handling notifications and so forth.
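One way to picture the "local policy" step is a small lookup from the plugin's exit code to a state and a requested action. The exit codes 0 to 3 follow the usual Nagios plugin convention; the action names and the `evaluate` function are invented here purely for illustration.

```python
from enum import IntEnum


class State(IntEnum):
    """Standard Nagios plugin exit codes."""
    OK = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3


# Hypothetical local policy: the action the daemon asks Nagios to take.
POLICY = {
    State.OK: "none",
    State.WARNING: "notify",
    State.CRITICAL: "notify_and_escalate",
    State.UNKNOWN: "notify",
}


def evaluate(exit_code, output):
    """Turn a raw plugin result into a state plus the required action."""
    try:
        state = State(exit_code)
    except ValueError:
        # Exit codes outside 0-3 are treated as UNKNOWN in this sketch.
        state = State.UNKNOWN
    return {"state": state.name, "action": POLICY[state], "output": output}


if __name__ == "__main__":
    print(evaluate(2, "CRITICAL - connection refused"))
```

With something like this on each host, the central instance receives a decision rather than raw output, which is the part of the work the profiling suggests is expensive.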
This way Nagios could be an auditor: if it doesn't receive a result on time as expected, it could query the daemon to see what's gone wrong; if that fails then it could initiate
...[email truncated]...
This post was automatically imported from historical nagios-devel mailing list archives
Original poster: emebond [[email protected]