Re: [Nagios-devel] Nagios 2.6 still not draining command pipe fast

Post by **Guest** » Wed Feb 28, 2007 12:13 pm

Ethan Galstad writes:

>John P. Rouillard wrote:
>> Ethan Galstad writes:
>>
>>> John P. Rouillard wrote:
>>>> Hi all:
>>>>
>>>> I am trying to get my external correlation engine working with nagios
>>>> 2.x, and I just can't get nagios to drain the command pipe fast enough.
>>>> I see an approx. 5% failure rate on writing to the command pipe with an
>>>> EAGAIN error.
>>>>
>>>> I have increased:
>>>>
>>>> nagios.h:#define COMMAND_BUFFER_SLOTS 20480
>>>> nagios.h:#define SERVICE_BUFFER_SLOTS 20480
>>>>
>>>> from the original 1024. In the increase of the settings from 10240 to
>>>> 20480, I may see a slight decrease (maybe .5%), but I think I just want to
>>>> see it. I don't think it's statistically viable.
>>> John - Does this problem still occur with Nagios 2.7 or the latest 2.x
>>> CVS code? A separate command file worker thread should be reading
>>> entries from the external command file as fast as it can read them (as
>>> long as there are free buffer slots).
>>>
>>> If there aren't any external commands, the thread waits 0.5 seconds
>>> before checking for new commands in the file. If you have occasional
>>> bursts of check results, this could be too long to wait. You could try
>>> experimenting with decreasing the 0.5 second delay. Around line 4948 of
>>> base/utils.c, you'll find...
>>>
>>> /* wait a bit */
>>> tv.tv_sec=0;
>>> tv.tv_usec=500000;
>>> select(0,NULL,NULL,NULL,&tv);
>>>
>>> You could try decreasing the value of tv.tv_usec to 100000 (0.1 seconds)
>>> and see if that helps at all.
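For anyone who wants to try the shorter delay in isolation, here is a
minimal sketch of the same select()-based sleep with the interval passed
in as a parameter. The helper name idle_wait_usec is hypothetical and is
not part of the Nagios source; only the select() idiom comes from the
snippet quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/*
 * Hypothetical helper (not a Nagios function): sleep for `usec`
 * microseconds by calling select() with no file descriptors, the same
 * idiom as the quoted "wait a bit" code.
 *
 * Returns 0 when the timeout expires, or -1 if select() fails
 * (for example with EINTR when a signal arrives).
 */
static int idle_wait_usec(long usec)
{
    struct timeval tv;

    assert(usec >= 0);

    /* Split into seconds and microseconds so values >= 1s also work. */
    tv.tv_sec = usec / 1000000L;
    tv.tv_usec = usec % 1000000L;

    return select(0, NULL, NULL, NULL, &tv);
}
```

Calling idle_wait_usec(500000) matches the current 0.5 second wait, and
idle_wait_usec(100000) matches the suggested 0.1 second wait.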

I installed Nagios 2.7 last Thursday. The EAGAIN failure rate has dropped
from 5% to somewhere in the neighborhood of 0.7%, but that may not be the
stable point: it is still growing, and it was 0.5% a couple of days ago.
I haven't tried changing the sleep times mentioned above because of a
dramatic increase in average latency.

I am now seeing average latency in the 20 second range, rather than the
1 second I was seeing with my nagios 2.6 install. Oddly, the GUI is
showing:

Check Latency (min / max / avg): 0.00 sec / 109.37 sec / 34.685 sec

which doesn't agree with what nagiostats reports. The max latency is
understandable, as we have been having some network drops, but even in
a freshly started nagios with no network issues the latency is in the
same range after a couple of hours. A 5 day old nagios process was
reporting the following from nagiostats:

Nagios Stats 2.7
Copyright (c) 2003-2007 Ethan Galstad (www.nagios.org)
Last Modified: 01-19-2007
License: GPL

CURRENT STATUS DATA
----------------------------------------------------
Status File: /var/log/nagios/status.dat
Status File Age: 0d 0h 0m 1s
Status File Version: 2.7

Program Running Time: 5d 21h 28m 58s
Nagios PID: 29914
Used/High/Total Command Buffers: 0 / 45 / 4096
Used/High/Total Check Result Buffers: 96 / 441 / 4096

Total Services: 1876
Services Checked: 1696
Services Scheduled: 1627
Active Service Checks: 1692
Passive Service Checks: 184
Total Service State Change: 0.000 / 73.420 / 2.913 %
Active Service Latency: 0.000 / 90.954 / 19.948 sec
Active Service Execution Time: 0.000 / 55.244 / 4.032 sec
Active Service State Change: 0.000 / 73.420 / 3.188 %
Active Services Last 1/5/15/60 min: 870 / 1353 / 1414 / 1450
Passive Service State Change: 0.000 / 16.780 / 0.381 %
Passive Services Last 1/5/15/60 min: 123 / 175 / 176 / 177
Services Ok/Warn/Unk/Crit: 1400 / 24 / 274 / 178
Services Flapping: 0
Services In Downtime: 0

Total Hosts: 118
Hosts Checked: 118
Hosts Scheduled: 0
Active Host Checks: 118
Passive Host Checks: 0
Tota

...[email truncated]...

This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]