Re: [Nagios-devel] Possible patch to cure CGI's not finding data

Cary Petterborg wrote:
> Our status.dat file is about 37MB. We occasionally will find that
> valid services are not showing up from a status.cgi or extinfo.cgi
> page. This results in people getting confused or they know the
> problem and refresh the page to get the REAL data they need. Since
> the status.dat file is written to a temp file which is moved into
> place once the file is closed, it should not have partial contents.
> But, in our case at least, we were seeing results from the CGIs as
> if the file were only partially written. The problem with the current
> implementation is that it is possible that the file gets closed, but
> the contents are not completely flushed to disk when it is moved in to
> replace the old file. In testing this phenomenon I took a service
> from the end of the status.dat file and looked at a CGI page as
> quickly as I could for many iterations. I found that about every 30th
> time (my average) the page acted as if the service didn't exist.
>
> That seems to be quite a high number of instances for the page to
> fail, so I added an fflush() before the fclose() and an fsync() right
> after the fclose(). This virtually guarantees that the file is
> completely written before the temp file is moved in to replace the
> outdated file. After making the change I was never able to get a
> failed page in more than 200 iterations of viewing the same page.
>
> The other files that could be a problem (and for completeness' sake)
> are retention.dat, comments.dat and downtime.dat. So I applied the
> same change, in principle, to each of these.
>
> I'm attaching a patch file that was done against our 2.7 version. I
> looked in the 3.0 code and it was not substantially different. The
> line numbers are different, though the context is the same, but the
> patch doesn't work on 3.0. I'm quite sure that a similar fix will
> work properly for 3.0.
>
> If anyone else is having this problem, you might want to try this
> patch and see if it fixes your problems as well. It is probably a
> good candidate for a bug fix if it is found to be a valuable
> modification. I don't know if smaller installations of Nagios are
> having any issues like this or not, but I suspect it is possible
> since actually flushing to the disk is handled by the OS on its own
> timetable unless forced with fsync().
>
> If you try this modification, please let me know of any issues you
> have.
>
> Cary Petterborg

Good patch - I'll get this applied to Nagios 3.x HEAD.

- Ethan Galstad
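
For anyone reading this thread later, here is a minimal sketch of the write-to-temp, flush, sync, rename pattern Cary describes. It is not the attached patch and not Nagios source: the function name write_file_durably() and its parameters are hypothetical, chosen only for illustration. One deliberate difference from the description above: the sketch calls fsync() before fclose(), since fsync() operates on an open file descriptor and the descriptor is gone once fclose() returns.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper (not Nagios code): write `data` to `tmp_path`,
 * force it to disk, then rename it over `final_path`.
 * Returns 0 on success, -1 on any error. */
int write_file_durably(const char *tmp_path, const char *final_path,
                       const char *data)
{
    assert(tmp_path != NULL && final_path != NULL && data != NULL);

    FILE *fp = fopen(tmp_path, "w");
    if (fp == NULL)
        return -1;

    if (fputs(data, fp) == EOF)
        goto fail;

    /* Push stdio's user-space buffer down to the kernel. */
    if (fflush(fp) != 0)
        goto fail;

    /* Ask the kernel to write the file's data to the storage device.
     * fsync() needs an open descriptor, so it runs before fclose(). */
    if (fsync(fileno(fp)) != 0)
        goto fail;

    if (fclose(fp) != 0) {
        remove(tmp_path);
        return -1;
    }

    /* Only now replace the old file; readers see either the old
     * contents or the complete new contents. */
    if (rename(tmp_path, final_path) != 0) {
        remove(tmp_path);
        return -1;
    }
    return 0;

fail:
    fclose(fp);
    remove(tmp_path);
    return -1;
}
```

A call such as write_file_durably("status.tmp", "status.dat", contents) would follow the same sequence for each of the files mentioned above; the file names here are placeholders.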

This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]