[Nagios-devel] Multi-Threaded Nagios, The story so far...


Hi Everyone,

There is a law of the Universe that says that the more complex something is, the more complex its problems tend to be, owing to the interactions between its disparate subsystems.
I imagine this is similar in form to chaos theory: a butterfly flapping its wings in Africa can trigger a hurricane in Florida, and so on.
In this case, a change I made to speed things up worked a little too well, setting off a whirlwind of its own.

As you know, a few weeks ago, after a lot of profiling, I decided to move the high-priority event queue into its own thread.
This worked great for several days, but eventually I noticed that the latency would again climb to unacceptable levels and fewer and fewer service checks were executing.

While looking at the log I realized that the results buffer was constantly overflowing, so I increased it, and increased it, and increased it some more, eventually realizing the problem wasn't that the buffer wasn't big enough, but that it simply wasn't emptying fast enough.
It dawned on me that part of the problem was that the service reaper is serialized and only has 10 seconds to complete, but on average it takes almost 3 times longer to reap a check than to execute it.
So I modified the reaper, removing the time limit, but then the reaper would obviously be the only high-priority event to ever run, since it could never keep up with the executor (dnx).
My final modification to the reaper was to have each reaper event launch into its own thread.
This creates what is in effect a semi-self-managed pool of threads: if it takes more than X seconds for the first reaper to finish, a second reaper launches, X seconds later a third, then a fourth, and so on.

This design works fantastically, except that after the first pass through the system there would be a double-free condition, and an hour or so later no more checks would be executing.
The event list was empty, but the application didn't exit (in events.c, if event_list_low == NULL the program should shut down).

My initial suspicion was that when the double free or corruption occurred, it did so while holding a mutex, thereby preventing rescheduling from occurring on any events received.
While I never did find out which mutex was causing the problem, I was eventually able to verify this theory: the application crashed at some point but left 5 processes alive, and an strace -ff -p showed that not only was each process stuck waiting on a mutex, they were all waiting on the same mutex.

The crash always occurs in the free_memory function, so I tried to control access to that function via a mutex; however, the double-free condition continued to occur anyway, so I took a different tack.
Noticing that the double free or corruption issue was occurring on a regular basis in a custom function I created to allow Nagios 2.7 to run host checks concurrently with service checks instead of blocking on them, I went ahead and commented out the problem code and reverted the host-check behavior back to normal. It now seems to be functionally OK, even though we are still sporadically getting double-free conditions.

I'm going to look into removing the free_memory event altogether and instead firing it periodically as a high-priority event, since it looks to me like it's basically just a garbage-collection step anyway.
I'm hoping someone can let me know whether I'm on the right track.

My numbers aren't quite as good now; it's a 25% drop from the best numbers I was able to get by parallelizing host checks, but it's still working 200-300% faster than before I began the work to multi-thread it.
I'm confident that at some point in the future I'll be able to parallelize host checks again, but since 3.x already has that, I'm not wasting any more resources on it; we have a roadmap in place to upgrade in the near future anyway.

Assuming it runs stably, I'll get another patch out in a few more days for testing.

Sincerely,
Steve




This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]