Re: [Nagios-devel] Bug report: nagios shutdown removing lock file

Ton Voon wrote:
> On 19 Jun 2006, at 21:46, Ethan Galstad wrote:
>> Ton Voon wrote:
>>> Ethan,
>>>
[snip]
>
> I think the lockfile removal is the source of the "multiple Nagios
> processes running". The example daemon-init script uses the lockfile
> as the status of the process. If you were to do a restart, the stop
> would complete because the signal was sent, but Nagios would
> actually still be in the process of shutting down. Meanwhile a start would
> run, so another Nagios process is kicked off. Then, as both Nagios
> processes are trying to access the same files, mayhem can ensue :)
>
> We've got our own startup script and we've changed the stop routine to
> wait until nagios has actually stopped before moving out of the stop
> function. Much more stable, but there's a long delay if Nagios is in
> the middle of a host check.
>
>> The file gets
>> deleted immediately upon receiving a SIGHUP/etc. to prevent it from
>> staying around if Nagios has problems shutting down.
>
> I see why, but I think it is probably better to leave the lock file
> around if there was a problem shutting down, and handle the existence
> of the lock file on startup.
>
> Ton

From looking at the code, it looks like I intended to clean this up at
some point, but never did. main() in nagios.c deletes the lock file as
one of the last things it does before exiting, but the file was still
being prematurely removed in sighandler() in utils.c. I just
commented out the calls in sighandler(), so this should be fixed.
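
A minimal sketch of that arrangement, where the signal handler only
records the shutdown request and the lock file is removed as the last
step before exit. All names here (shutdown_requested, on_signal,
install_handlers, create_lock_file, remove_lock_file) are hypothetical
and are not taken from the Nagios source:

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical flag; set by the handler, polled by the main loop. */
static volatile sig_atomic_t shutdown_requested = 0;

/* The handler only records the request. It does NOT remove the lock
 * file, so the file keeps reflecting that the process is still alive. */
static void on_signal(int sig)
{
    (void)sig;
    shutdown_requested = 1;
}

/* Install the handler for SIGTERM and SIGINT. Returns 0 on success. */
static int install_handlers(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGTERM, &sa, NULL) != 0)
        return -1;
    if (sigaction(SIGINT, &sa, NULL) != 0)
        return -1;
    return 0;
}

/* Write our PID to the lock file. Returns 0 on success. */
static int create_lock_file(const char *path)
{
    FILE *fp = fopen(path, "w");
    if (fp == NULL)
        return -1;
    fprintf(fp, "%ld\n", (long)getpid());
    return fclose(fp) == 0 ? 0 : -1;
}

/* Meant to be called as one of the last steps before the process
 * exits, after all other cleanup has finished. Returns 0 on success. */
static int remove_lock_file(const char *path)
{
    return unlink(path);
}
```

With this layout, an init script that treats the lock file as the
process status will not see it disappear until shutdown has really
finished.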

Also, I added some checks in base/checks.c to bail out of the host
check logic at reasonable points if a SIGHUP/SIGINT is encountered. A
stop/restart may still not be immediate, because the signal doesn't
interrupt a host check command that is already executing, but it should
prevent Nagios from re-checking a host (or propagating checks to
parents/children) once a signal has been received.
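
Roughly, the idea is to test a "stop requested" flag before starting
each new check attempt. The following is only a sketch under that
assumption and is not the code in base/checks.c; stop_requested,
check_fn, run_host_check and demo_failing_check are hypothetical names,
and the flag is set directly here where a real signal handler would set
it:

```c
#include <signal.h>

/* Hypothetical flag; a signal handler would set this to 1. */
static volatile sig_atomic_t stop_requested = 0;

/* A single check attempt: returns 0 if the host answered. */
typedef int (*check_fn)(void *ctx);

/* Retry a host check up to max_attempts times, but bail out between
 * attempts once a stop has been requested. An attempt that is already
 * running is not interrupted. Returns the number of attempts made and
 * stores 1 in *host_up if any attempt succeeded, otherwise 0. */
static int run_host_check(check_fn check, void *ctx,
                          int max_attempts, int *host_up)
{
    int attempts = 0;

    *host_up = 0;
    while (attempts < max_attempts) {
        if (stop_requested)
            break;              /* do not start another re-check */
        attempts++;
        if (check(ctx) == 0) {
            *host_up = 1;
            break;
        }
    }
    return attempts;
}

/* Demo check for illustration: always reports the host as down and
 * simulates a signal arriving during the second attempt. ctx points
 * to an int call counter. */
static int demo_failing_check(void *ctx)
{
    int *calls = ctx;

    (*calls)++;
    if (*calls == 2)
        stop_requested = 1;
    return -1;
}
```

Calling run_host_check(demo_failing_check, &calls, 5, &up) with calls
starting at 0 makes two attempts and then stops, instead of using all
five retries.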

I'll be posting the patches to CVS shortly, so if anyone has a chance to
test this, please let me know how it works. Thanks again Ton for the
heads up on this!


Ethan Galstad,
Nagios Developer
---
Email: [email protected]
Website: http://www.nagios.org





This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]