Re: [Nagios-devel] Possible patch to cure CGI's not finding data for objects in status.dat


In response to your request for details of our system: We are running SuSE 9 writing to a ReiserFS filesystem (with a separate web server reading status.dat, etc. from an NFS mount off the main Nagios server). Our status.dat file is 37MB, and objects.cache is 32MB. If you need more details than this, please let me know what you need.

In my tests (which are *not* extensive), the web server would occasionally get incomplete results from the status.dat file. The likelihood of this increased if the service information was at the end of the status.dat file, though it was not exclusive to that position. After implementing this change, 300 reloads of the extinfo.cgi page for the service found at the end of the status.dat file completed every time. That seemed to be enough anecdotal evidence to post it as a possible patch. I'm glad to see that it was looked at and is being scrutinized.

I may be wrong in what follows, but I did my homework before trying to implement the fix on our system, and I'm going by what I found. The fsync() call is the more important function call in the fix. fclose() almost always guarantees fflush(), but it doesn't guarantee that the data will be written to the disk immediately, especially if the program doesn't exit. fflush() asks the OS to flush the output to the disk, but the OS does this at its own level, meaning it may wait momentarily to do so. fsync() does incur a very slight performance hit, but it is not like sync() (which a user program should not call). fsync() has much less of an impact than sync(). Since *we* are reading the file across NFS, that may be the reason we are seeing the absence of file data. Since the data is written to a temporary file, then renamed to replace the previous version, there isn't much chance for the complete file not to be available.
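
For readers following along, the sequence described above (write to a temporary file, fflush(), fsync(), fclose(), then rename over the old file) can be sketched roughly as follows. This is only an illustration assuming a POSIX system; the function name write_file_durably and the file names are hypothetical, and this is neither the actual Nagios code nor the posted patch.

```c
#define _POSIX_C_SOURCE 200809L  /* for fileno() and fsync() */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper (not from Nagios): write `data` to `tmp_path`,
 * ask the OS to commit it to storage, then rename it over `final_path`.
 * Returns 0 on success, -1 on failure. */
int write_file_durably(const char *tmp_path, const char *final_path,
                       const char *data)
{
    FILE *fp = fopen(tmp_path, "w");
    if (fp == NULL)
        return -1;

    size_t len = strlen(data);
    if (fwrite(data, 1, len, fp) != len)
        goto fail;

    /* Push the stdio buffer down to the kernel. */
    if (fflush(fp) != 0)
        goto fail;

    /* Ask the kernel to write the file's data out to the device. */
    if (fsync(fileno(fp)) != 0)
        goto fail;

    if (fclose(fp) != 0) {
        remove(tmp_path);
        return -1;
    }

    /* Replace the previous version in one step. */
    if (rename(tmp_path, final_path) != 0) {
        remove(tmp_path);
        return -1;
    }
    return 0;

fail:
    fclose(fp);
    remove(tmp_path);
    return -1;
}
```

A caller might use it as write_file_durably("status.tmp", "status.dat", contents), where both paths are placeholders. Whether the extra fsync() is what actually cures the behavior seen over NFS is the open question in this thread; the sketch only shows where the call sits in the sequence.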

Can you provide another explanation of why the status.cgi and extinfo.cgi programs fail to find the data for a host or service one second, but succeed a few seconds later, if it is not that the status.dat file, etc. do not contain the information? We would seriously like to fix this problem.

Thanks!

Cary

________________________________________
From: Gaspar, Carson [[email protected]]
Sent: Thursday, August 06, 2009 1:18 PM
To: 'Nagios Developers List'
Subject: Re: [Nagios-devel] Possible patch to cure CGI's not finding data for objects in status.dat

Really? This makes no sense at all.

All pending stdio output should be flushed by fclose(). If it isn't, your stdio is broken.

All pending disk writes will read back as if committed when read on the same host, without needing a very expensive fsync(). If they don't, then your kernel / filesystem is broken.

Please do _not_ add this code change. If there's a real bug in Nagios, this doesn't fix it, just hides it. And if the bug is in the OS, working around it isn't the right answer (unless you want to add checks for brokenness to autoconf).

Cary, can you please provide details of the system on which you are experiencing the problem?

-----Original Message-----
From: Ethan Galstad [mailto:[email protected]]
Sent: Friday, July 31, 2009 8:07 AM
To: Nagios Developers List
Subject: Re: [Nagios-devel] Possible patch to cure CGI's not finding data for objects in status.dat

Cary Petterborg wrote:
> Our status.dat file is about 37MB. We occasionally will find that
> valid services are not showing up from a status.cgi or extinfo.cgi
> page. This results in people getting confused or they know the
> problem and refresh the page to get the REAL data they need. Since
> the status.dat file is written to a temp file which is moved into
> place once the file is closed, it should not have partial contents.
> But, in our case at least, we were seeing results from the CGI's as
> if the file were only partially written. The problem with the current
> implementation is that it is possible that the file gets closed, but
> the contents are not completely flushed to disk when it is moved i

...[email truncated]...


This post was automatically imported from historical nagios-devel mailing list archives
Original poster: Ethan Galstad [mailto:[email protected]]