Re: [Nagios-devel] Continuing issues with retention file causing
Posted: Thu Mar 09, 2006 12:23 pm
Here I go multitasking, file attached. I've also attached a day's worth
of 'premature script header' errors from the apache logs, WRT that
error. Here's an example of a view in extinfo.cgi that was working one
minute, and then after a "refresh" it errors out:
Loading this URL:
https://monitor02/nagios/cgi-bin/extinf ... thstar1258
Results in this error (momentarily):
[Thu Mar 09 12:08:59 2006] [error] [client 10.73.16.108] Premature
end of script headers: extinfo.cgi, referer:
https://monitor02/nagios/cgi-bin/status ... stprops=42
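For anyone wanting to see which CGIs are hit most, here is a minimal sketch that tallies these errors per script. It assumes each Apache error-log entry sits on a single line in the format quoted above; the function name `tally_premature_headers` is hypothetical and not part of Nagios or Apache.

```python
import re
from collections import Counter

# Matches the Apache error-log message quoted above, e.g.
# "... Premature end of script headers: extinfo.cgi, referer: ..."
# The script name is taken as everything up to the next comma or whitespace.
PATTERN = re.compile(r"Premature end of script headers: ([^\s,]+)")


def tally_premature_headers(lines):
    """Count 'Premature end of script headers' errors per CGI script.

    `lines` is any iterable of log lines (an open file works).
    Lines that do not contain the message are ignored.
    """
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open error log to the function returns a `Counter`, so `.most_common()` lists the worst-affected CGIs first. The log path is left to the caller, since it varies by installation.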
I still haven't been able to get any indication of the cause (or even
the existence) of the scheduling/event stalling issues. Nothing ever
appears "incorrect" in nagios' logs or schedule, only the lack of events
occurring. One more item I noticed after I removed the retention.dat
file yesterday: In addition to event handlers for one service not being
executed, there was one user whose acknowledgements did not trigger
"acknowledgement" emails even though they should have, while my acks
did send an email. After the file removal, that problem went away also.
In practice it can take several weeks to a month+ of running before I
notice the issue cropping up again. In that time I add/remove hundreds
(thousands) of hosts/services, and reload and stop/start nagios dozens
of times...
Are there any potential fixes for these behaviours in CVS? I haven't seen
them addressed at all in -devel, though there have been a few reports of
similar issues.
(Nagios 2.0, x86_64,
7385 services.
754 hosts.
6454 service dependencies.
47 commands.
)
/eli
Eli Stair wrote:
>
> I'm continuing to have problems when the retention.dat file gets into a
> state where the nagios process stops functioning properly. The problems
> I've had in the past were increasing numbers of hosts or entire
> hostgroups no longer executing their service checks, and now (today)
> the event handler for one particular service has stopped being executed
> (while all others continue to work).
>
> In this and all previous cases, stopping nagios and moving the retention
> file out of the way resolves the issue. Reloading or a hard stop/start
> of nagios doesn't have any effect. There has never appeared to be
> anything "wrong" with the retention file.
>
> The only issues with my installation are this one, the
> all-too-frequent "premature end of script headers" in all the CGIs, and
> "Warning: Size of service_message struct (528 bytes) is >
> POSIX-guaranteed atomic write size (512 bytes). " due to compiling
> for x86_64. That being said, these are serious enough: there are dozens
> of daily "premature script header/Internal Server Error" failures
> wreaking havoc with production, and the instances of event failures are
> extremely critical. The script header problem came into being
> immediately upon upgrading from 2.0b6 to 2.0rc2+, and the
> scheduling/retention problem has been present to varying degrees in
> every 2.0b+ I've tried.
>
> I would be happy to find these are configuration/optimization issues on
> my end that I can resolve, but my suspicion is they are bugs. I will do anything I
> can to help provide a debug testbed for identifying and tracking them
> down. Attached is my main nagios config (objects are not included), and
> I can provide any other data (object configs, logs, retention.dat, etc)
> privately if needed (security concerns).
>
> Please let me know what I can do to help address this and find a
> resolution.
>
> Regards,
>
> /eli
>
>
...[email truncated]...
This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]