[Nagios-devel] Multi-Threaded Nagios, The story so far...


Hi Everyone,

There is a law of the Universe that says that the more complex something is, the more complex its problems tend to be, owing to the interactions between its disparate subsystems.
I imagine this is similar in form to chaos theory: a butterfly flapping its wings in Africa can trigger a hurricane in Florida, and so on.
In this case, a change I made to speed things up worked a little too well, setting off a whirlwind of its own.

As you know, a few weeks ago, after a lot of profiling, I decided to move the high-priority event queue into its own thread.
This worked great for several days, but eventually I noticed that the latency would again climb to unacceptable levels and fewer and fewer service checks were executing.

While looking at the log I realized that the results buffer was constantly overflowing, so I increased it, and increased it, and increased it some more, eventually realizing the problem wasn't that the buffer wasn't big enough, but that it simply wasn't emptying fast enough.
It dawned on me that part of the problem was that the service reaper is serialized and only has 10 seconds to complete, but on average it takes almost 3 times longer to reap a check than to execute it.
So I modified the reaper, removing the time limit, but then the reaper would obviously be the only high-priority event to ever run, since it could never keep up with the executor (dnx).
My final modification to the reaper was to have each reaper event launch into its own thread.
This creates what is in effect a semi-self-managed pool of threads: if it takes more than X seconds for the first reaper to finish, a second reaper launches, X seconds later a third, then a fourth, and so on.

This design works fantastically, except that after the first pass through the system there would be a double-free condition, and an hour or so later no more checks would be executing.
The event list was empty, but the application didn't exit (in events.c, if event_list_low == NULL the program should shut down).

My initial suspicion was that when the double free or corruption occurred, it did so while holding a mutex, thereby preventing rescheduling from occurring on any events received.
While I never did find out which mutex was causing the problem, I was eventually able to verify this theory: the application crashed at some point but left 5 processes alive, and an strace -ff -p showed that not only was each process stuck waiting on a mutex, they were all waiting on the same mutex.

The crash always occurs in the free_memory function, so I tried to control access to that function via a mutex; however, the double-free condition continued to occur anyway, so I took a different tack.
Noticing that the double free or corruption issue was occurring on a regular basis in a custom function I created to allow Nagios 2.7 to run host checks concurrently with service checks instead of blocking on them, I went ahead and commented out the problem code and reverted the host-check behavior back to normal. It now seems to be functionally OK, even though we are still sporadically getting double-free conditions.

I'm going to look into removing the free_memory event altogether and instead firing it periodically as a high-priority event, since it looks to me like it's basically just a garbage-collection step anyway.
I'm hoping someone can let me know whether I'm on the right track.

My numbers aren't quite as good now; it's a 25% drop from the best numbers I was able to get by parallelizing host checks, but it's still working 200-300% faster than before I began the work to multi-thread it.
I'm confident that at some point in the future I'll be able to parallelize host checks again, but since 3.x already has that, I'm not wasting any more resources on it; we have a roadmap in place to upgrade in the near future anyway.

Assuming it runs stably, I'll get another patch out in a few more days for testing.

Sincerely,
Steve




This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]