We have some processes that we'd like to be alerted about when they crash. Normally we'd just monitor the service with NagiosXI, but this process restarts itself after a crash and logs the crash in /var/log/messages.
Are there any best practices for detecting that this process has crashed? You can't simply monitor the service, because a crash and restart will most likely happen between NRPE checks. We can parse the log files and check for the crash, but then when do you take it off the board as all clear?
Just wondering if anyone else is monitoring for these sorts of crash/restart events, and how you handle them.
Best practice for monitoring process crash/restarts?
-
scottwilkerson
- DevOps Engineer
- Posts: 19396
- Joined: Tue Nov 15, 2011 3:11 pm
- Location: Nagios Enterprises
- Contact:
Re: Best practice for monitoring process crash/restarts?
This is going to come down to preference, when do you want it marked all clear?
What I mean by that is, is it all clear if the process restarts correctly?
Re: Best practice for monitoring process crash/restarts?
scottwilkerson wrote: This is going to come down to preference, when do you want it marked all clear?
What I mean by that is, is it all clear if the process restarts correctly?
If it restarted cleanly and has been running that way for, say, 10 minutes without another restart, I would consider that all clear. The most important thing is the email alert to the team so we know to pull the logs for the crash.
I am tempted to use logwatch for this, but I don't know if that can run every minute without adding to system load on production systems.
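To make the "all clear after 10 minutes" idea concrete, here is a rough Python sketch of a time-windowed log check. It is only an illustration, not one of the plugins mentioned in this thread: the function names, the `myproc.*crash` pattern in the note below, and the assumption that /var/log/messages uses the traditional `Mon DD HH:MM:SS` syslog prefix with English month names are all hypothetical. The check goes CRITICAL while a matching crash line is newer than the window and returns to OK by itself once the newest crash is older than that.

```python
import re
from datetime import datetime, timedelta

OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes


def parse_syslog_time(line, now):
    """Parse a leading 'Mon DD HH:MM:SS' stamp (assumed format, no year)."""
    # Syslog omits the year, so try the current year first, then the
    # previous one (covers log lines written just before New Year).
    for year in (now.year, now.year - 1):
        try:
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue
        if stamp <= now + timedelta(minutes=5):  # tolerate small clock skew
            return stamp
    return None


def check_crash(lines, pattern, now, window_minutes=10):
    """Return (status, message): CRITICAL if a crash line is inside the window."""
    cutoff = now - timedelta(minutes=window_minutes)
    regex = re.compile(pattern)
    latest = None
    for line in lines:
        if not regex.search(line):
            continue
        stamp = parse_syslog_time(line, now)
        if stamp is not None and stamp >= cutoff:
            latest = stamp if latest is None else max(latest, stamp)
    if latest is None:
        return OK, f"OK - no crash logged in the last {window_minutes} minutes"
    return CRITICAL, f"CRITICAL - crash logged at {latest:%H:%M:%S}"
```

A wrapper run through NRPE would open /var/log/messages, call something like `check_crash(log, r"myproc.*crash", datetime.now())`, print the message and exit with the returned status. The check interval has to be shorter than the window, otherwise a crash could still age out between two checks. This version rereads the whole file on every run, so on a busy production log you would probably want to remember the last read offset instead.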
scottwilkerson
Re: Best practice for monitoring process crash/restarts?
I found some 3rd party plugins that look like they could do the trick.
The first is at http://www.unixautomation.com/unix-log- ... alysis.htm. Although it looks like the developer wants $9.95 for it, it appears to do exactly what you want: you can set the amount of time and the pattern. I haven't used it, just read the documentation.
The second option is open source and more comprehensive, and I believe you may be able to use it to fulfill your needs:
http://labs.consol.de/lang/en/nagios/check_logfiles/