[Nagios-devel] A different way?

Posted: Thu Sep 24, 2009 10:05 pm
by Guest
Hello everyone,

I've decided to take a break for a bit from multi-threading Nagios to focus on DNX, since that is my day job after all :)
While working on all of this I had a few thoughts that might make some good ideas if Nagios is ever redesigned again, say for a 4.x branch.

As you know, under Nagios, all checks are dispatched by Nagios to be executed on the local machine at set intervals.
Under a distributed Nagios setup, you have multiple Nagios instances running on various machines, executing checks and passing the results back to a passive master controller.

Under DNX, we distribute the load to "worker nodes", which then execute the checks and hand the results back to an active master controller that then processes the result, etc.

Profiling shows that (under DNX at least) two thirds of our time is spent in the reaper processing results, so wouldn't it make more sense to flip the process around?

The checks are already executing on the local machine, so how about a daemon on each machine? The daemon would keep the schedule and execute service checks locally, process the results, and return them along with the required actions (based on a local policy) to Nagios, which would then do the actual work of handling notifications and so forth.
This way Nagios could act as an auditor: if it doesn't receive a result on time, it could query the daemon to see what's gone wrong, and if that fails, it could initiate a host check, etc.
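
To make the "auditor" idea a bit more concrete, here is a minimal Python sketch of the escalation order described above: wait for the result, then query the daemon, then fall back to a host check. Everything in it is hypothetical (the `Expected` record, the `grace` slack, the `query_daemon` and `host_check` callbacks, and the returned labels); none of it is existing Nagios or DNX code.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Expected:
    host: str
    last_result_at: float  # when the daemon last reported (epoch seconds)
    interval: float        # how often a result is expected
    grace: float = 30.0    # hypothetical slack before the master reacts


def audit(entry: Expected, now: float,
          query_daemon: Callable[[str], bool],
          host_check: Callable[[str], bool]) -> str:
    """Decide what the master concludes about one expected result."""
    deadline = entry.last_result_at + entry.interval + entry.grace
    if now <= deadline:
        return "on_time"                # nothing to do yet
    if query_daemon(entry.host):
        return "daemon_up_result_late"  # daemon answers, result is just late
    if host_check(entry.host):
        return "daemon_down_host_up"    # host is alive, daemon is not
    return "host_down"
```

The point of the sketch is that the master only does work when a result is overdue; in the normal case it just compares a timestamp.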

From a design standpoint this is a bit more work than the current setup, but it seems to me that this could allow for much greater flexibility and scalability in the long run.

Anyway, I hope this sparks a little debate. I don't want to "come in and shake things up" or go around changing everything and stepping on toes all the while; it's just that putting the responsibility for actually executing the check, and doing so on time, onto the computer it needs to execute on just makes more sense to me.
It's not really dramatically different from what we do now. It just adds a scheduler/timer to the existing execution framework, plus something to push the original schedule and any changes, such as scheduled downtime, to the appropriate machines. Everything else goes into a semi-passive mode, effectively turning each machine to be checked into its own "worker node".
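
As a rough illustration of the scheduler/timer piece, the following Python sketch keeps a pushed schedule in a min-heap and reports which checks are due. The class and method names (`LocalScheduler`, `push_schedule`, `push_downtime`, `due`) are made up for this example, and it assumes every interval is positive. It also assumes that checks keep running during scheduled downtime and are only flagged, so the master can decide what to do about notifications.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class LocalScheduler:
    _queue: list = field(default_factory=list)      # min-heap of (next_run, name)
    _intervals: dict = field(default_factory=dict)  # name -> seconds between runs
    _downtime: list = field(default_factory=list)   # (start, end) windows

    def push_schedule(self, schedule: dict, now: float) -> None:
        """Replace the local schedule with one pushed from the master."""
        self._intervals = dict(schedule)
        self._queue = [(now + ivl, name) for name, ivl in schedule.items()]
        heapq.heapify(self._queue)

    def push_downtime(self, start: float, end: float) -> None:
        """Record a scheduled downtime window pushed from the master."""
        self._downtime.append((start, end))

    def in_downtime(self, t: float) -> bool:
        return any(start <= t < end for start, end in self._downtime)

    def due(self, now: float) -> list:
        """Pop every check due at `now`, reschedule it, and return
        (name, in_downtime) pairs for the checks to execute."""
        flagged = self.in_downtime(now)
        to_run = []
        while self._queue and self._queue[0][0] <= now:
            _, name = heapq.heappop(self._queue)
            heapq.heappush(self._queue, (now + self._intervals[name], name))
            to_run.append((name, flagged))
        return to_run
```

A daemon loop would call `due()` periodically, run the returned checks, apply its local policy, and send the results on to the master.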

Thoughts?

Sincerely,
Steve




This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]