Re: [Nagios-devel] Bug report: nagios shutdown removing lock file
Posted: Tue Jun 20, 2006 6:56 am
On 19 Jun 2006, at 21:46, Ethan Galstad wrote:
> Ton Voon wrote:
>> Ethan,
>>
>> I think I've seen a problem with the nagios shutdown routine. If
>> nagios is doing a host check and an INT signal is sent, it seems to
>> take a long time before the nagios daemon dies. It looks like the
>> child nagios process is trying to complete all the retries for a host
>> check before going back into the main loop.
>>
>> Also, it appears that the lockfile is being removed before the main
>> process dies. Below is the output for a 'while true; do ps -p 728; ls
>> -l /usr/local/nagios/var/nagios.lock; sleep 1; done' during a kill
>> 728.
>>
>> [snipped]
>> PID TT STAT TIME COMMAND
>> 728 ?? Ss 0:01.95 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
>> -rw-r--r-- 1 nagios nagios 4 Jun 13 17:20 /usr/local/nagios/var/nagios.lock
>> PID TT STAT TIME COMMAND
>> 728 ?? Ss 0:01.95 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
>> -rw-r--r-- 1 nagios nagios 4 Jun 13 17:20 /usr/local/nagios/var/nagios.lock
>> PID TT STAT TIME COMMAND
>> 728 ?? Ss 0:01.95 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
>> ls: /usr/local/nagios/var/nagios.lock: No such file or directory
>> PID TT STAT TIME COMMAND
>> 728 ?? Ss 0:01.95 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
>> ls: /usr/local/nagios/var/nagios.lock: No such file or directory
>>
>> This shows the lockfile gets removed before the main daemon dies.
>> (This is from a kill 728, not using any init scripts.) Eventually the
>> daemon dies.
>>
>> I've tested this on Nagios 2.2 on MacOSX 10.4, Nagios 2.0 on Debian
>> and Nagios 2.4 on Debian.
>>
>> Sorry, not had time to delve into the source code.
>
> Yep, this is a bug. It's been present for several years now, so I
> suppose we could get around to fixing it.
> Is the lockfile removal causing noticeable problems with anything?
I think the lockfile removal is the source of the "multiple Nagios
processes running". The example daemon-init script uses the lockfile
as the status of the process. If you were to do a restart, the stop
step would complete because the signal was sent, but Nagios would
actually still be in the process of shutting down. Meanwhile a start would
run, so another Nagios process is kicked off. Then, as both Nagios
processes are trying to access the same files, mayhem can ensue.
We've got our own startup script and we've changed the stop routine to
wait until nagios has actually stopped before moving out of the stop
function. Much more stable, but there's a long delay if Nagios is in
the middle of a host check.
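A minimal sketch of that kind of stop routine, for illustration only (this is
not the script mentioned above). The function name stop_and_wait and the
60-second default timeout are assumptions made for the example. It relies on
"kill -0", which sends no signal and only reports whether the process still
exists.

```shell
#!/bin/sh
# Hypothetical helper: send TERM to a PID and return only once the process
# has really exited, or once a timeout (in seconds) has passed.
# Usage: stop_and_wait PID [TIMEOUT]
stop_and_wait() {
    pid=$1
    timeout=${2:-60}    # assumed default, pick whatever suits the site

    # kill -0 delivers no signal; it only tests that the process exists.
    kill -0 "$pid" 2>/dev/null || return 0    # already gone

    kill -TERM "$pid" 2>/dev/null

    waited=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "process $pid still running after ${timeout}s" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

The PID has to be read before the signal is sent, because the lock file may
already be gone by the time the loop starts. The timeout keeps a restart from
hanging forever when a long host check is in progress.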
> The file gets
> deleted immediately upon receiving a SIGHUP/etc. to prevent it from
> staying around if Nagios has problems shutting down.
I see why, but I think it is probably better to leave the lock file
around if there was a problem shutting down, and handle the existence
of the lock file on startup.
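Roughly what I mean by handling it on startup, as a sketch under a few
assumptions: the lock file contains the daemon's PID (the 4-byte file in the
listing above suggests it does), the check runs as root or as the nagios user
so that "kill -0" is permitted, and the function name lock_status is invented
for the example.

```shell
#!/bin/sh
# Hypothetical startup check: classify an existing lock file.
# Prints one of: absent, running, stale.
# Usage: lock_status /path/to/lockfile
lock_status() {
    lockfile=$1

    [ -f "$lockfile" ] || { echo absent; return 0; }

    pid=$(cat "$lockfile" 2>/dev/null)

    # Content that is not a plain number cannot name a live process.
    case $pid in
        ''|*[!0-9]*) echo stale; return 0 ;;
    esac

    # kill -0 delivers no signal; it only tests that the process exists.
    # Assumption: we are allowed to signal that PID (root or same user).
    if kill -0 "$pid" 2>/dev/null; then
        echo running
    else
        echo stale
    fi
}
```

A start routine could refuse to start on "running", remove the file and carry
on for "stale", and start normally for "absent". It does not guard against the
PID having been reused by an unrelated process; checking the command name as
well would be needed for that.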
Ton
http://www.altinity.com
T: +44 (0)870 787 9243
F: +44 (0)845 280 1725
Skype: tonvoon
This post was automatically imported from historical nagios-devel mailing list archives