Lost Persistent Comments\Acknowledged Service Issues

yavin
Posts: 2
Joined: Wed Dec 05, 2012 2:30 pm


Post by yavin

I have started to have a strange problem with our Nagios 3.3 installation. When restarting Nagios (via a service stop/start or restart), I occasionally lose all of my persistent acknowledgement history and comments. It does not happen every time, but I would guess that in the last 15 restarts I have lost them about 3 times. Not only do I lose the historical acknowledgement data, which is painful enough, but all of the known service issues (of which there are LOTS) then have to be re-acknowledged to clean up the system so that real issues can be identified. I can see comments being written to Nagios's status.dat file, but I am having a hard time finding documentation on how they are preserved across restarts, so that I can pinpoint where the process is breaking down. Any suggestions at all would be appreciated, thanks!
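For context, the Nagios 3.x documentation describes comments and acknowledgements being carried across restarts by the state retention file (the `state_retention_file` directive, used when `retain_state_information=1` in nagios.cfg), which Nagios writes before it shuts down and reads back on startup; status.dat is the periodically rewritten status file read by the CGIs. As a stopgap against losing acknowledgements, here is a minimal Python sketch, not an official Nagios tool, that scans a status.dat or retention.dat snapshot for `servicecomment` blocks with `entry_type=4` (acknowledgement comments) and prints matching `ACKNOWLEDGE_SVC_PROBLEM` external commands. The script name, the `sticky=2;notify=0;persistent=1` choices and the fallback author are assumptions for illustration.

```python
import sys
import time
from typing import Dict, Iterable, List, Optional

ACK_ENTRY_TYPE = "4"  # entry_type value Nagios 3.x uses for acknowledgement comments


def parse_blocks(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse 'name {' ... '}' blocks from a status.dat/retention.dat style file."""
    blocks: List[Dict[str, str]] = []
    current: Optional[Dict[str, str]] = None
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if current is None and line.endswith("{"):
            current = {"__type__": line[:-1].strip()}  # e.g. "servicecomment"
        elif current is not None and line == "}":
            blocks.append(current)
            current = None
        elif current is not None and "=" in line:
            key, value = line.split("=", 1)  # comment text may itself contain '='
            current[key] = value
    return blocks


def ack_commands(blocks: Iterable[Dict[str, str]], now: Optional[int] = None) -> List[str]:
    """Build one ACKNOWLEDGE_SVC_PROBLEM external command per acknowledged service."""
    ts = int(time.time()) if now is None else now
    seen = set()
    commands: List[str] = []
    for b in blocks:
        if b.get("__type__") != "servicecomment" or b.get("entry_type") != ACK_ENTRY_TYPE:
            continue
        key = (b.get("host_name", ""), b.get("service_description", ""))
        if key in seen:  # keep only the first acknowledgement per service
            continue
        seen.add(key)
        author = b.get("author", "unknown")  # hypothetical fallback author
        # Field order: host;service;sticky;notify;persistent;author;comment
        commands.append(
            f"[{ts}] ACKNOWLEDGE_SVC_PROBLEM;{key[0]};{key[1]};2;0;1;"
            f"{author};{b.get('comment_data', '')}"
        )
    return commands


if __name__ == "__main__":
    # Usage (hypothetical script name): python save_acks.py status.dat > acks.txt
    with open(sys.argv[1]) as fh:
        for cmd in ack_commands(parse_blocks(fh)):
            print(cmd)
```

Running it against a copy of the file before each restart keeps a snapshot; if the acknowledgements disappear, the saved lines can be written to the command file named by `command_file` in nagios.cfg instead of re-acknowledging every service by hand.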
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: Lost Persistent Comments\Acknowledged Service Issues

Post by slansing

Just to be clear, this is happening when you run:


service nagios stop

service nagios start
and you are NOT shutting down the whole server? This can happen when the entire system is shut down unexpectedly.

What are you doing beforehand that requires these Nagios restarts?
yavin

Re: Lost Persistent Comments\Acknowledged Service Issues

Post by yavin

I have experienced the issue both when running 'service nagios stop' and 'service nagios start' individually and when running 'service nagios restart' (which appears to just automate the stop and start). Our Nagios configuration files are built from a remote Cacti database, and a restart has historically been used to push the changes from Cacti into Nagios. While looking more closely at the start and stop scripts tonight, I found two misconfigured variables in our /etc/rc.d/init.d/nagios script (by the way, I recently inherited responsibility for this system, so I am not sure how or when this came to be):
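Because every Cacti-driven configuration push currently goes through a full stop/start, one option is to verify the regenerated configuration first and then reload rather than restart. Below is a minimal Python sketch of that flow; the binary and config paths are hypothetical placeholders, and it assumes the init script offers a `reload` action (the stock Nagios 3 init script does, but check yours) alongside the `nagios -v` pre-flight check.

```python
import subprocess
import sys
from typing import Callable, Sequence

# Hypothetical paths: adjust to match this installation.
NAGIOS_BIN = "/usr/sbin/nagios"
NAGIOS_CFG = "/etc/nagios/nagios.cfg"


def run(cmd: Sequence[str]) -> int:
    """Run a command and return its exit status."""
    return subprocess.call(list(cmd))


def push_config(runner: Callable[[Sequence[str]], int] = run) -> bool:
    """Verify the regenerated config and reload Nagios only if the check passes."""
    # 'nagios -v' performs a pre-flight check and exits non-zero on errors.
    if runner([NAGIOS_BIN, "-v", NAGIOS_CFG]) != 0:
        print("config check failed; running Nagios left untouched", file=sys.stderr)
        return False
    # Assumes the init script implements a 'reload' action.
    return runner(["service", "nagios", "reload"]) == 0


if __name__ == "__main__":
    sys.exit(0 if push_config() else 1)
```

The command runner is a parameter so the flow can be exercised with a stub before it is pointed at a live system.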

NagiosStatusFile=/var/log/nagios/status.dat
NagiosCommandFile=/var/log/nagios/rw/nagios.cmd

These two entries were pointing to invalid targets, and I have just corrected them to match the locations listed in the nagios.cfg file. All the other entries appear to be correct. From reading the Nagios documentation and the startup scripts, it appears that both of these files should be deleted each time Nagios is stopped. So my question is: if they were not being deleted, what effect would that have on a subsequent start? Could their continued existence intermittently confuse Nagios when it came back into service? I can test whether this change resolves the problem, but I would rather avoid as much trial and error as possible, since the re-acknowledgement process is so time consuming. :-)
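To catch this kind of drift after the next configuration regeneration or package update, here is a minimal Python sketch that compares the two init-script variables from the post against the `status_file` and `command_file` directives in nagios.cfg, and also prints the retention settings. It assumes plain `VAR=value` assignments in the init script (it does not expand shell variables such as `$prefix`), and the nagios.cfg path in the usage comment is a placeholder.

```python
import re
import sys
from typing import Dict, List, Optional, Tuple

# Init-script variable -> nagios.cfg directive it is expected to match.
PAIRS = {"NagiosStatusFile": "status_file", "NagiosCommandFile": "command_file"}


def read_cfg(text: str) -> Dict[str, str]:
    """Parse key=value lines from nagios.cfg, skipping comments and blanks."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, value = line.split("=", 1)
            values.setdefault(key.strip(), value.strip())  # keep first occurrence
    return values


def read_init_vars(text: str) -> Dict[str, str]:
    """Collect simple VAR=value shell assignments (no variable expansion)."""
    found: Dict[str, str] = {}
    for m in re.finditer(r"^[ \t]*(\w+)=(\S+)", text, re.MULTILINE):
        found[m.group(1)] = m.group(2).strip("\"'")
    return found


def mismatches(init_text: str, cfg_text: str) -> List[Tuple[str, Optional[str], Optional[str]]]:
    """Return (variable, init value, nagios.cfg value) for every disagreement."""
    init_vars, cfg = read_init_vars(init_text), read_cfg(cfg_text)
    return [
        (var, init_vars.get(var), cfg.get(key))
        for var, key in PAIRS.items()
        if init_vars.get(var) != cfg.get(key)
    ]


if __name__ == "__main__":
    # Usage (placeholder paths): check_paths.py /etc/rc.d/init.d/nagios /etc/nagios/nagios.cfg
    with open(sys.argv[1]) as init_fh, open(sys.argv[2]) as cfg_fh:
        init_text, cfg_text = init_fh.read(), cfg_fh.read()
    cfg = read_cfg(cfg_text)
    for key in ("retain_state_information", "state_retention_file"):
        print(f"{key} = {cfg.get(key)}")
    problems = mismatches(init_text, cfg_text)
    for var, init_val, cfg_val in problems:
        print(f"MISMATCH {var}: init script {init_val!r} vs nagios.cfg {cfg_val!r}")
    sys.exit(1 if problems else 0)
```

A non-zero exit status makes it easy to run this as a guard in whatever job rebuilds the configuration from Cacti.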