Re: [Nagios-devel] Possible patch to cure CGI's not finding data for objects in status.dat
Posted: Thu Aug 06, 2009 6:18 pm
Really? This makes no sense at all.
All pending stdio output should be flushed by fclose(). If it isn't, your stdio is broken.
All pending disk writes will read back as if committed when read on the same host, without needing a very expensive fsync(). If they don't, then your kernel / filesystem is broken.
Please do _not_ add this code change. If there's a real bug in Nagios, this doesn't fix it, just hides it. And if the bug is in the OS, working around it isn't the right answer (unless you want to add checks for brokenness to autoconf).
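To make the claim above concrete, here is a minimal sketch of a check that could be run on a suspect system. It writes a temp file, closes it with fclose(), renames it into place, and reads it back on the same host with no fsync() anywhere. The function names, file names and the "service N" line format are made up for illustration and are not taken from the Nagios source.

```c
#include <stdio.h>

/* Hypothetical helper: write `count` numbered lines to tmp_path, close the
 * stream, then rename the temp file over final_path. No fsync() is issued. */
int write_then_rename(const char *tmp_path, const char *final_path, int count)
{
    FILE *fp = fopen(tmp_path, "w");
    if (fp == NULL)
        return -1;

    for (int i = 0; i < count; i++) {
        if (fprintf(fp, "service %d\n", i) < 0) {
            fclose(fp);
            remove(tmp_path);
            return -1;
        }
    }

    /* fclose() flushes the stdio buffer to the kernel. A non-zero return
     * means some of the data may not have been written. */
    if (fclose(fp) != 0) {
        remove(tmp_path);
        return -1;
    }

    /* On POSIX systems rename() replaces final_path atomically. */
    if (rename(tmp_path, final_path) != 0) {
        remove(tmp_path);
        return -1;
    }
    return 0;
}

/* Hypothetical helper: count newline-terminated lines, the way a reader
 * on the same host would see the file. Returns -1 if it cannot be opened. */
int count_lines(const char *path)
{
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;

    int lines = 0;
    int c;
    while ((c = fgetc(fp)) != EOF) {
        if (c == '\n')
            lines++;
    }
    fclose(fp);
    return lines;
}
```

If write_then_rename() returns 0 and count_lines() on the final path then reports fewer lines than were written, that would point at the stdio implementation or the kernel / filesystem rather than at a missing fsync().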
Cary, can you please provide details of the system on which you are experiencing the problem?
-----Original Message-----
From: Ethan Galstad [mailto:[email protected]]
Sent: Friday, July 31, 2009 8:07 AM
To: Nagios Developers List
Subject: Re: [Nagios-devel] Possible patch to cure CGI's not finding data for objects in status.dat
Cary Petterborg wrote:
> Our status.dat file is about 37MB. We occasionally will find that
> valid services are not showing up from a status.cgi or extinfo.cgi
> page. This results in people getting confused or they know the
> problem and refresh the page to get the REAL data they need. Since
> the status.dat file is written to a temp file which is moved into
> place once the file is closed, it should not have partial contents.
> But, in our case at least, we were seeing results from the CGI's as
> if the file were only partially written. The problem with the current
> implementation is that it is possible that the file gets closed, but
> the contents are not completely flushed to disk when it is moved in to
> replace the old file. In testing this phenomenon I took a service
> from the end of the status.dat file and looked at a CGI page as
> quickly as I could for many iterations. I found that about every 30th
> time (my average) the page acted as if the service didn't exist.
>
> That seems to be quite a high number of instances for the page to
> fail, so I added an fflush() before the fclose() and an fsync() right
> after the fclose(). This virtually guarantees that the file is
> completely written before the temp file is moved in to replace the
> outdated file. After making the change I was never able to get a
> failed page in more than 200 iterations of viewing the same page.
>
> The other files that could be a problem (and for completeness' sake)
> are retention.dat, comments.dat and downtime.dat. So I applied the
> same principle change to each of these.
>
> I'm attaching a patch file that was done against our 2.7 version. I
> looked in the 3.0 code and it was not substantially different. The
> line numbers are different, though the context is the same, but the
> patch doesn't work on 3.0. I'm quite sure that a similar fix will
> work properly for 3.0.
>
> If anyone else is having this problem, you might want to try this
> patch and see if it fixes your problems as well. It is probably a
> good candidate for a bug fix if it is found to be a valuable
> modification. I don't know if smaller installations of Nagios are
> having any issues like this or not, but I suspect it is possible
> since actually flushing to the disk is handled by the OS on its own
> timetable unless forced with fsync().
>
> If you try this modification, please let me know of any issues you
> have.
>
> Cary Petterborg
Good patch - I'll get this applied to Nagios 3.x HEAD.
- Ethan Galstad
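
For reference, the sequence described in the quoted message can be sketched roughly as follows. This is not the attached patch, which is not reproduced in this thread; the function name and parameters are hypothetical. One assumption is worth calling out: fsync() operates on an open file descriptor, so this sketch calls it after fflush() and before fclose(), rather than after the stream has been closed.

```c
#define _POSIX_C_SOURCE 200809L  /* for fileno() and fsync() */

#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write `data` to tmp_path, force it to disk, then
 * rename the temp file over final_path. Returns 0 on success, -1 on error. */
int save_with_fsync(const char *tmp_path, const char *final_path,
                    const char *data)
{
    size_t len = strlen(data);
    FILE *fp = fopen(tmp_path, "w");
    if (fp == NULL)
        return -1;

    if (fwrite(data, 1, len, fp) != len)
        goto fail;

    /* Push the stdio buffer into the kernel. */
    if (fflush(fp) != 0)
        goto fail;

    /* Ask the kernel to commit the file to stable storage. This needs the
     * descriptor, so it has to happen while the stream is still open. */
    if (fsync(fileno(fp)) != 0)
        goto fail;

    if (fclose(fp) != 0) {
        fp = NULL;  /* the stream is gone even when fclose() fails */
        goto fail;
    }
    fp = NULL;

    /* Move the finished file into place. */
    if (rename(tmp_path, final_path) != 0)
        goto fail;

    return 0;

fail:
    if (fp != NULL)
        fclose(fp);
    remove(tmp_path);
    return -1;
}
```

The fsync() call is the expensive step: it blocks until the device reports the data as written, which matters for a file of the size mentioned above that is rewritten frequently.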
This post was automatically imported from historical nagios-devel mailing list archives
Original poster: Ethan Galstad [mailto:[email protected]]