Best practice for monitoring process crash/restarts?

Posted: Wed Mar 07, 2012 4:46 pm
by cscholz
We have some processes that we'd like to be alerted about if they crash. Normally we'd just monitor the service with Nagios XI, but this process restarts itself after it crashes and logs the crash in /var/log/messages.

Are there any best practices for monitoring when this process has crashed? You can't simply monitor the service, since there's a good chance that a crash and restart will both happen between two NRPE checks. We can parse the log files and check for the crash, but then when do you take it off the board as all clear?

Just wondering if anyone else is monitoring for these sorts of crash/restart events, and how you handle them.

Re: Best practice for monitoring process crash/restarts?

Posted: Thu Mar 08, 2012 1:26 pm
by scottwilkerson
This is going to come down to preference: when do you want it marked all clear?

What I mean by that is, is it all clear if the process restarts correctly?

Re: Best practice for monitoring process crash/restarts?

Posted: Mon Mar 12, 2012 8:54 am
by cscholz
scottwilkerson wrote:This is going to come down to preference, when do you want it marked all clear?

What I mean by that is, is it all clear if the process restarts correctly?

If it restarted cleanly and has been running that way for, say, 10 minutes without another restart, I would consider that all clear. The most important thing is the email alert to the team so we know to pull the logs for the crash.

I am tempted to use logwatch for this, but I don't know if that can run every minute without adding to system load on production systems.

Re: Best practice for monitoring process crash/restarts?

Posted: Mon Mar 12, 2012 12:02 pm
by scottwilkerson
I found some 3rd party plugins that look like they could do the trick.

The first is at http://www.unixautomation.com/unix-log- ... alysis.htm. It looks like the developer wants $9.95 for it, but according to the documentation it does exactly what you want: you can set the amount of time and the pattern. I haven't used it, only read the documentation.

The second option is open source and more comprehensive, and I believe you may be able to use it to fulfill your needs:
http://labs.consol.de/lang/en/nagios/check_logfiles/
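
For anyone who wants to see the "pattern plus time window" idea in isolation, here is a minimal Python sketch. It is not taken from either plugin above. It goes CRITICAL while a matching crash line is newer than the window and returns to OK once the window has passed quietly, which matches the 10-minute all clear described earlier in the thread. The function name check_recent_crash, the "segfault" pattern, and the sample log lines are all made up for illustration, and it assumes traditional syslog timestamps such as "Mar 12 08:54:01" at the start of each line.

```python
import re
from datetime import datetime, timedelta

OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes


def check_recent_crash(lines, pattern, now, window_minutes=10):
    """Return (exit_code, message) for a crash pattern seen inside the window.

    Hypothetical helper: assumes each line starts with a traditional
    syslog timestamp ("Mar 12 08:54:01"), which carries no year.
    """
    regex = re.compile(pattern)
    cutoff = now - timedelta(minutes=window_minutes)
    latest = None
    for line in lines:
        if not regex.search(line):
            continue
        try:
            # Borrow the year from "now" because syslog omits it.
            stamp = datetime.strptime(f"{now.year} {line[:15]}", "%Y %b %d %H:%M:%S")
            if stamp > now:
                # A "future" stamp most likely belongs to last year (New Year rollover).
                stamp = stamp.replace(year=now.year - 1)
        except ValueError:
            continue  # skip lines whose prefix is not a usable timestamp
        if stamp >= cutoff and (latest is None or stamp > latest):
            latest = stamp
    if latest is not None:
        return CRITICAL, f"CRITICAL - crash logged at {latest:%H:%M:%S}"
    return OK, f"OK - no crash logged in the last {window_minutes} minutes"


# Made-up sample lines; the process name and message are placeholders.
sample = [
    "Mar 12 08:30:00 host myproc[100]: started",
    "Mar 12 08:54:01 host myproc[123]: segfault at 0",
]
code, message = check_recent_crash(sample, r"segfault", datetime(2012, 3, 12, 9, 0, 0))
print(code, message)
```

A wrapper run through NRPE would read the lines from /var/log/messages, print the message, and exit with the returned code, so the normal notification settings send the email. Note that this sketch rescans every line it is given on each run; a dedicated plugin such as check_logfiles is presumably a better fit for large logs on production systems.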