I'd like to find out if this patch is working for other installations.
Has anyone tried this patch yet?
Is anyone else even having issues with status.cgi or extinfo.cgi not finding hosts/services that are supposed to be there, so that the page has to be reloaded to see them?
I know the code for all versions of Nagios suffers from this problem (at least through 3.0; I haven't looked at 3.1). If anyone would like me to create a patch file that will work for version 3.0 (or possibly other versions), please let me know. I'll do my best to give you a good patch file.
Thanks!
________________________________________
From: Cary Petterborg [[email protected]]
Sent: Wednesday, July 15, 2009 5:37 PM
To: Nagios Developers List
Subject: [Nagios-devel] Possible patch to cure CGI's not finding data for objects in status.dat
Our status.dat file is about 37MB. We occasionally find that valid services do not show up on a status.cgi or extinfo.cgi page. People either get confused or, if they know about the problem, refresh the page to get the real data they need. Since the status.dat file is written to a temp file which is moved into place once the file is closed, it should not have partial contents. But, in our case at least, we were seeing results from the CGIs as if the file were only partially written. The problem with the current implementation is that the file can be closed without its contents being completely flushed to disk by the time it is moved in to replace the old file. To test this, I took a service from the end of the status.dat file and loaded a CGI page for it as quickly as I could for many iterations. I found that about every 30th time (my average) the page acted as if the service didn't exist.
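For illustration, a similar check can be run directly against the file instead of through a CGI page. This is only a rough sketch: the function names are hypothetical and not part of Nagios, and the search is a plain substring match per line rather than the parsing the CGIs actually do.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not from the Nagios source.
 * Returns 1 if any line of the file contains `needle`, 0 if none does,
 * and -1 if the file cannot be opened.
 * Assumption: `needle` does not straddle a 4095-byte buffer boundary.
 */
int file_contains(const char *path, const char *needle)
{
    char line[4096];
    int found = 0;
    FILE *fp = fopen(path, "r");

    if (fp == NULL)
        return -1;

    while (fgets(line, sizeof line, fp) != NULL) {
        if (strstr(line, needle) != NULL) {
            found = 1;
            break;
        }
    }

    fclose(fp);
    return found;
}

/*
 * Re-read the file `iterations` times and count how often `needle`
 * was not found (or the file could not be opened).
 */
int count_misses(const char *path, const char *needle, int iterations)
{
    int misses = 0;

    for (int i = 0; i < iterations; i++) {
        if (file_contains(path, needle) != 1)
            misses++;
    }
    return misses;
}
```

Calling count_misses() with the path to status.dat and a service description taken from the end of the file would give a miss count comparable to the manual page reloads described above, assuming the reader and the status writer run at the same time.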
That seems to be quite a high failure rate for the page, so I added an fflush() before the fclose() and an fsync() right after the fclose(). This virtually guarantees that the file is completely written before the temp file is moved in to replace the outdated file. After making the change I was never able to get a failed page in more than 200 iterations of viewing the same page.
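The general idea (flush, sync, then move into place) can be sketched roughly as follows. This is not the attached patch: the function name and paths are hypothetical, and since fsync() takes a file descriptor, this sketch calls it on fileno(fp) before fclose(), which differs from the ordering described above.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for fileno() under strict C modes */
#endif

#include <stdio.h>
#include <unistd.h>   /* fsync(), unlink() */

/*
 * Hypothetical helper, not taken from the Nagios source: write `data`
 * to `temp_path`, force it to disk, then rename it over `final_path`.
 * Returns 0 on success, -1 on failure (the temp file is removed).
 */
int write_file_durably(const char *temp_path, const char *final_path,
                       const char *data)
{
    FILE *fp = fopen(temp_path, "w");

    if (fp == NULL)
        return -1;

    /* Write the contents, push stdio's user-space buffer to the kernel,
     * then ask the kernel to write the file's data to the device. */
    if (fputs(data, fp) == EOF ||
        fflush(fp) != 0 ||
        fsync(fileno(fp)) != 0) {
        fclose(fp);
        unlink(temp_path);
        return -1;
    }

    if (fclose(fp) != 0) {
        unlink(temp_path);
        return -1;
    }

    /* On POSIX systems rename() replaces the target atomically when
     * both paths are on the same filesystem. */
    if (rename(temp_path, final_path) != 0) {
        unlink(temp_path);
        return -1;
    }

    return 0;
}
```

A caller would pass something like a temp file name next to status.dat plus the final status.dat path; both need to live on the same filesystem for the rename to be a simple replacement.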
The other files that could have the same problem are retention.dat, comments.dat and downtime.dat, so for completeness' sake I applied the same change, in principle, to each of these.
I'm attaching a patch file that was made against our 2.7 version. I looked at the 3.0 code and it is not substantially different: the context is the same, but the line numbers differ, and the patch doesn't work on 3.0. I'm quite sure that a similar fix will work properly for 3.0.
If anyone else is having this problem, you might want to try this patch and see if it fixes your problems as well. It is probably a good candidate for a bug fix if it is found to be a valuable modification. I don't know whether smaller installations of Nagios are having any issues like this, but I suspect it is possible, since actually flushing to the disk is handled by the OS on its own timetable unless forced with fsync().
If you try this modification, please let me know of any issues you have.
Cary Petterborg
This post was automatically imported from historical nagios-devel mailing list archives
Original poster: Cary Petterborg [[email protected]]