View details during the period a service was above threshold

vrn_shan
Posts: 4
Joined: Tue Apr 10, 2012 10:49 am


Post by vrn_shan »

Hi,

I am trying out Nagios Core 3.2.3 and I have a question.
I read somewhere that:

Instead of monitoring values, Nagios only uses four states to describe status: OK, WARNING, CRITICAL,
and UNKNOWN.

Now this looks like a problem to me.

Say that for a host, I have configured a service to check the CPU load average and send me a mail if the load average stays above 2 for three minutes. I left the office at 9:00 PM. In the morning I saw a mail alert regarding a high load average. I went to the Nagios web interface and found that the load average was above the threshold all the way from 10:00 PM to 4:00 AM. Now I would be interested in knowing how the load average behaved through the whole night. I would like to see the system's load average values at intervals of, say, every 5 minutes during this entire period. It would be great if I could get to know that
- load avg was around 15 from 10:00 PM to 11:00 PM
- then load avg was around 20 from 11:00 PM to 1:00 AM
- then load avg was around 50 from 1:00 AM to 1:30 AM
- then load avg dropped and it was around 10 from 1:30 AM to 2:00 AM
- load avg was around 5 from 2:00 AM to 4:00 AM.

Having this information would help a lot in an investigation, but I don't see how I can get it from Nagios.
jsmurphy
Posts: 989
Joined: Wed Aug 18, 2010 9:46 pm

Re: View details during the period a service was above thres

Post by jsmurphy »

There is an option you can turn on called "state stalking", which has Nagios log changes in a check's output even while the state itself stays the same. That should do what you want. More information: http://nagios.sourceforge.net/docs/3_0/stalking.html
vrn_shan
Posts: 4
Joined: Tue Apr 10, 2012 10:49 am

Re: View details during the period a service was above thres

Post by vrn_shan »

Hi,

Thanks for your quick reply.

This makes me a bit curious. The page you just mentioned says:
State "stalking" is a feature which is probably not going to used by most users.
and
As a general rule, I would suggest that you not enable stalking for hosts and services without thinking things through.

But I feel like this is a must-have for most of the services I will be monitoring. For example, in the situation I described above, just knowing that the load average was above the threshold the whole night doesn't tell me how severe the problem was. I would have questions in my mind like
* Was the load average above 50 the whole night? Was the load average above 100 the whole night?
* How is CPU load related to other processes starting at night? Say I have a process starting every night at 11:00 PM. Does starting that process considerably worsen the situation?

and many more questions...

Am I asking for something special which others don't require, and hence I need to enable a special feature called "stalking"? Shouldn't what you call "stalking" be available by default for each service? Isn't it something everyone would require while debugging? Let me know if I am thinking along the wrong lines. I would love to know how others monitor services.

Secondly, I would like to view a graph showing the load average values every 5 minutes during this period. I know that many open source plugins are available for drawing graphs. Now my question is: since the "Service Check Output:" logged during stalking is in the form of a sentence, is it possible to draw this graph (even using some plugin)?
jsmurphy
Posts: 989
Joined: Wed Aug 18, 2010 9:46 pm

Re: View details during the period a service was above thres

Post by jsmurphy »

I was actually kind of hoping you would ask those questions :)

It's very open ended. I think you will find that most of us out there only care that a problem started and that a problem stopped; we can then use application logs, syslogs or OS logs to determine the root cause of the failure. I am speaking very generally here. When it comes to things like memory and CPU load, you may want to use state stalking if you are not already graphing that data with Nagios and have no other platform currently trending it (such as your hypervisor if the host is virtualised, or the hardware management platforms that some vendors provide).

There are plenty of different graphing solutions out there for Nagios, and they rely on the performance data of the check, not the actual status information of the check, so whether or not you use state stalking should have no impact on the graphing. I'm not really the go-to guy for advice on how to set up the graphing side of things, but this guy does some pretty good tutorials, so I would check this out: http://xavier.dusart.free.fr/nagios/en/nagiosgraph.html
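
To make the performance data point a bit more concrete, here is a minimal Python sketch of how a graphing tool might pull numbers out of a check result. It assumes the common plugin convention where the human-readable sentence comes first, followed by a `|` and then `label=value[UOM];warn;crit;min;max` entries. The `parse_perfdata` function, the `Metric` type and the sample output line are made up for illustration; real graphing add-ons have their own parsers.

```python
import re
from typing import Dict, NamedTuple, Optional


class Metric(NamedTuple):
    value: float          # the measured number, e.g. 15.2
    uom: str              # unit of measurement, may be empty
    warn: Optional[str]   # warning threshold as written by the plugin
    crit: Optional[str]   # critical threshold as written by the plugin


# label (optionally single-quoted) = number, optional unit, then ;-separated fields
_PERF_RE = re.compile(
    r"(?:'([^']+)'|([^\s=']+))=(-?\d+(?:\.\d+)?)([^;\s]*)((?:;[^;\s]*)*)"
)


def parse_perfdata(plugin_output: str) -> Dict[str, Metric]:
    """Return the metrics found after the '|' in one line of plugin output."""
    _, sep, perf = plugin_output.partition("|")
    if not sep:
        return {}  # the check produced only a status sentence, no perfdata
    metrics: Dict[str, Metric] = {}
    for m in _PERF_RE.finditer(perf):
        label = m.group(1) or m.group(2)
        fields = m.group(5).split(";")[1:]    # warn, crit, min, max (if present)
        fields += [""] * (2 - len(fields))    # pad so warn and crit always exist
        metrics[label] = Metric(
            float(m.group(3)), m.group(4), fields[0] or None, fields[1] or None
        )
    return metrics


# Illustrative sample line, not taken from a real host.
sample = (
    "CRITICAL - load average: 15.20, 14.80, 12.10"
    "|load1=15.200;2.000;5.000;0; load5=14.800;2.000;5.000;0;"
)
for label, metric in parse_perfdata(sample).items():
    print(label, metric.value, metric.warn, metric.crit)
```

The sentence before the `|` is what you see in the web interface and what stalking logs; the part after it is what a grapher would store every check interval, which is why the two features are independent.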

It's important to note that Nagios perceives itself first and foremost as an alerting tool, not a forensics tool for analyzing the root cause of your problems, which is why the documentation says state stalking probably won't be used by most. I personally only use the stalking feature for SNMP traps and Event Log traps, where missing that data could mean missing a piece of the puzzle.