[Nagios-devel] Multi-Threaded Nagios results buffer overflowing
Posted: Wed Sep 09, 2009 4:40 pm
Hi Everyone,
After creating the multi-thread patch for Nagios, I noticed that after a few hours performance would quickly begin to degrade: fewer and fewer service checks were executing, and yet latency stayed the same.
When I started looking for the problem, I noticed that the service result buffer would fill quickly and then be constantly overflowing.
Used/High/Total Check Result Buffers: 4096 / 4096 / 4096
So I doubled the size of the buffer, and that helped, but eventually it would overflow again.
Eventually it struck me to tie the size of the buffer to the total number of service checks, so I set it to 30,000. That worked very well, but after a day or so it would just overflow again.
Used/High/Total Check Result Buffers: 30000 / 30000 / 30000
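For anyone repeating this, the buffer size above is the check_result_buffer_slots directive in nagios.cfg (its default of 4096 is where the first set of numbers came from); the change is the single line:

check_result_buffer_slots=30000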
Then I had an epiphany. The problem isn't that the buffer is too small; in fact, that's really only a symptom.
The actual problem is that the service reaper is too slow.
Profiling shows that the system spends two-thirds of its time just running the reaper and only one-third actually executing checks.
When I moved the high priority events into their own thread, I stopped the reaper from blocking the system, but the reaper still needed that time to actually empty the results buffer.
So I removed the timeout in the reaper that bails out after so many seconds have passed, to give it as much time as it needed. That helped, but it still never came close to catching up.
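Concretely, that change is just deleting the time-based bailout. A rough sketch of the loop, not a verbatim diff (variable and function names are approximate to my tree):

/* reap_check_results(): drain queued results */
time(&reaper_start_time);
while ((queued_check_result = read_check_result()) != NULL) {

        /* ... process one service check result ... */

        /* this is the bailout I removed, so the reaper now
           drains the buffer completely instead of giving up
           after max_check_reaper_time seconds */
        time(&current_time);
        if ((int)(current_time - reaper_start_time) > max_check_reaper_time)
                break;
}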
My final solution was to create a thread in handle_timed_event just for the service reaper.
The reaper runs infrequently enough that I don't think thread creation overhead will be a significant issue. What this does is allow more threads, and therefore more resources, to be devoted to the service reaper when they're needed; when the results buffer empties, the threads can exit, freeing resources for other tasks.
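In case it helps anyone reading along, the change looks roughly like this (a pthreads sketch with names approximate to my tree, not a drop-in patch):

#include <pthread.h>

/* run one full reaper pass, then let the thread exit */
static void *reaper_thread(void *arg) {
        reap_check_results();
        return NULL;
}

/* in handle_timed_event(), where the reaper event used to run inline: */
pthread_t tid;
pthread_attr_t attr;
pthread_attr_init(&attr);
/* detached, so each thread frees its own resources on exit
   and nothing ever has to join it */
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
if (pthread_create(&tid, &attr, reaper_thread, NULL) != 0)
        reap_check_results(); /* fall back to reaping inline */
pthread_attr_destroy(&attr);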
The proof is in the pudding.
Used/High/Total Check Result Buffers: 290 / 706 / 30000
As you can see, I'm no longer riding the high-water mark.
It produces an interesting pattern when running under gdb:
[New Thread -1542456416 (LWP 18647)]
[New Thread -1550845024 (LWP 18675)]
[New Thread -1559233632 (LWP 18736)]
[New Thread -1567622240 (LWP 18869)]
[New Thread -1576010848 (LWP 19020)]
[New Thread -1584399456 (LWP 19129)]
[New Thread -1592788064 (LWP 19224)]
[New Thread -1601176672 (LWP 19319)]
[New Thread -1609565280 (LWP 19434)]
[Thread -1601176672 (LWP 19319) exited]
[Thread -1609565280 (LWP 19434) exited]
[Thread -1592788064 (LWP 19224) exited]
[Thread -1542456416 (LWP 18647) exited]
[Thread -1584399456 (LWP 19129) exited]
[Thread -1567622240 (LWP 18869) exited]
[Thread -1550845024 (LWP 18675) exited]
[Thread -1576010848 (LWP 19020) exited]
As you can see, the threads appear to be interleaving; for instance, even though 18647 is the first one to launch, it's the fourth one to exit.
Also, the number of threads launched is never consistent. I've seen every number from 1 to as many as 20, but I'm sure there is no upper bound, so death by threads is entirely possible here, though unlikely.
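If that ever did become a problem, one obvious guard (hypothetical, not in my patch) would be a counter that caps the number of concurrent reaper threads, spawning a new one only while the count is below the cap:

#define MAX_REAPER_THREADS 4

static pthread_mutex_t reaper_lock = PTHREAD_MUTEX_INITIALIZER;
static int reaper_threads = 0;

/* returns 1 if the caller may spawn another reaper thread */
static int try_reserve_reaper(void) {
        int ok = 0;
        pthread_mutex_lock(&reaper_lock);
        if (reaper_threads < MAX_REAPER_THREADS) {
                reaper_threads++;
                ok = 1;
        }
        pthread_mutex_unlock(&reaper_lock);
        return ok;
}

/* called by reaper_thread() just before it returns */
static void release_reaper(void) {
        pthread_mutex_lock(&reaper_lock);
        reaper_threads--;
        pthread_mutex_unlock(&reaper_lock);
}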
I'd appreciate any thoughts you may have on the matter, and maybe some encouragement, advice, and stern words of warning if anyone has been down this path before.
If I'm treading uncharted waters in undiscovered lands, I'd like to know that as well.
Sincerely,
Steve
This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]