Outage / Availability Report
Re: Outage / Availability Report
Fair enough.
I'm not finding an easy to find the relation though between the OK and the first non-OK state. I might be able to use state_time_usec, but I cannot correlate that value to the state_time column. Can you elaborate on that column?
I'm not finding an easy to find the relation though between the OK and the first non-OK state. I might be able to use state_time_usec, but I cannot correlate that value to the state_time column. Can you elaborate on that column?
Re: Outage / Availability Report
You have to compute the time between state changes yourself. The description of the fields are:
http://nagios.sourceforge.net/docs/ndou ... _Model.pdf
See the following doc for more information:state_time: The date/time that the state changed.
state_time_usec: The microsecond portion of the time the state change applies to.
http://nagios.sourceforge.net/docs/ndou ... _Model.pdf
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
Re: Outage / Availability Report
I have some of the query written, but MySQL has very limited analytic functions so it's a mess.
Here's what I've written so far. I find the following:
1. Start Time - The first time after an OK state that an object is non-OK.
2. End Time - The time at which an object goes OK.
I then rank those times per object and attempt to line the Start Time and End Time up. It works sometimes, but not always.
Issue A:
If the report is ran when an object is non-OK, then I have more start times than end times, so I need to shift the ranking and place the current date/time at the top rank.
Issue B:
Given the information that is logged in the table, I don't think this is going to be a viable query to be run against the transactional database.
For example, say HostA has ServiceA configured. If the following occurs, it becomes an unwieldy query to run:
1. 5/1/2014 - Service goes down and is logged in Nagios
2. 5/1/2014 (later that day) - an admin disables ServiceA in Nagios as the server needs reconfigured
3. 5/2/2014 - Admin corrects the server configuration and enables ServiceA in Nagios
4. 5/2/2014 - Service goes in OK state
ServiceA will appear as though it's down for over one day.
So, it becomes a task of gathering the time periods for every object in the database for when it was enabled. Then use that to find availability.
Here's what I've written so far. I find the following:
1. Start Time - The first time after an OK state that an object is non-OK.
2. End Time - The time at which an object goes OK.
I then rank those times per object and attempt to line the Start Time and End Time up. It works sometimes, but not always.
Issue A:
If the report is ran when an object is non-OK, then I have more start times than end times, so I need to shift the ranking and place the current date/time at the top rank.
Issue B:
Given the information that is logged in the table, I don't think this is going to be a viable query to be run against the transactional database.
For example, say HostA has ServiceA configured. If the following occurs, it becomes an unwieldy query to run:
1. 5/1/2014 - Service goes down and is logged in Nagios
2. 5/1/2014 (later that day) - an admin disables ServiceA in Nagios as the server needs reconfigured
3. 5/2/2014 - Admin corrects the server configuration and enables ServiceA in Nagios
4. 5/2/2014 - Service goes in OK state
ServiceA will appear as though it's down for over one day.
So, it becomes a task of gathering the time periods for every object in the database for when it was enabled. Then use that to find availability.
Re: Outage / Availability Report
I understand your pain. Computing proper availability is not easy. You could do what core does and just parse the logs, but I doubt that will be any easier. You may have to overlay downtime starts and ends to check for when a "down" service should actually be considered down.
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
Re: Outage / Availability Report
Unfortunately, I think the database needs a redesign. I have found no way to correlate acknowledgements to a state either. There appear to be no foreign keys on the tables either. 
Re: Outage / Availability Report
One of the major issues with comments is that they are not persistent in the db. Acks are stored in the nagios_acknowledgements table, but you must do an inner join on the nagios_objects table to identify the relevant host/service.
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
Re: Outage / Availability Report
Right I can do that, but I cannot associate an acknowledgement with a particular entry in the nagios_statehistory table.abrist wrote:One of the major issues with comments is that they are not persistent in the db. Acks are stored in the nagios_acknowledgements table, but you must do an inner join on the nagios_objects table to identify the relevant host/service.
At the end of the day, the information isn't logged correctly in the tables to get what I need:
Code: Select all
1. No correlation between an acknowledgement and state history
2. No log of when an object is disabled for checks
3. No relation of the state histories for an object
Re: Outage / Availability Report
I don't have all the answers here, but here is what I can comment on:
1) You could assume that the reported state of an object at the time of an ack on said object applied to that object's state at that time. This is not perfect, but it should allow you to derive which acks went with which states.
2) This is true, though you could either assume a previous state or ignore time ranges without states in nagios_statehistory.
3) Relation to what?
My sql-fu is failing me today, but you could run a combination of the following queries to get close. The first will output the ack table with a join on object_id (where name1=<hostname>):
This should output a table similar to:
And then run a second query on the entry_time for each ack to output the nearest (in the past) state change from nagios_statehistory(where name1=<hostname>, name2=<service description>, and state_time=<the ack entry_time> - I used the last entry from the query results above):
This query should output something similar to:
Notice how the state is the same.
The iteration through the ack's entry_time(s) may require a stored procedure, a script, or a query written by someone with more sql-fu than myself.
The queries above may need to be altered a bit for services, but the principle is the same.
1) You could assume that the reported state of an object at the time of an ack on said object applied to that object's state at that time. This is not perfect, but it should allow you to derive which acks went with which states.
2) This is true, though you could either assume a previous state or ignore time ranges without states in nagios_statehistory.
3) Relation to what?
My sql-fu is failing me today, but you could run a combination of the following queries to get close. The first will output the ack table with a join on object_id (where name1=<hostname>):
Code: Select all
echo "select entry_time,state,author_name,comment_data,name1,name2 from nagios.nagios_acknowledgements inner join nagios.nagios_objects on nagios.nagios_acknowledgements.object_id=nagios.nagios_objects.object_id where name1='gent';" | mysql -t -pnagiosxiCode: Select all
+---------------------+-------+----------------------+------------------------------+-------+------------------+
| entry_time | state | author_name | comment_data | name1 | name2 |
+---------------------+-------+----------------------+------------------------------+-------+------------------+
| 2012-11-26 05:07:21 | 1 | Nagios Administrator | Problem is acknowledged | gent | Memory Usage |
| 2012-11-26 05:07:21 | 1 | Nagios Administrator | Problem is acknowledged | gent | / Disk Usage |
| 2013-02-18 13:48:04 | 2 | Andyb | Problem is acknowledged | gent | CPU Stats |
| 2013-02-18 13:48:04 | 2 | Andyb | Problem is acknowledged | gent | Swap Usage |
| 2013-02-18 13:48:04 | 2 | Andyb | Problem is acknowledged | gent | Memory Usage |
| 2013-02-18 13:48:04 | 2 | Andyb | Problem is acknowledged | gent | Users |
| 2013-02-18 13:48:04 | 2 | Andyb | Problem is acknowledged | gent | / Disk Usage |
| 2013-02-18 13:48:04 | 2 | Andyb | Problem is acknowledged | gent | Total Processes |
| 2013-02-18 13:48:04 | 2 | Andyb | Problem is acknowledged | gent | Load |
| 2013-05-10 14:33:50 | 1 | Nagios Administrator | Problem is acknowledged | gent | NULL |
| 2013-05-10 14:33:50 | 2 | Nagios Administrator | Problem is acknowledged | gent | Ping |
| 2013-05-10 14:33:50 | 2 | Nagios Administrator | Problem is acknowledged | gent | Total Processes |
| 2013-05-10 14:33:50 | 2 | Nagios Administrator | Problem is acknowledged | gent | CPU Stats |
| 2013-05-10 14:33:50 | 2 | Nagios Administrator | Problem is acknowledged | gent | Swap Usage |
| 2013-05-10 14:33:50 | 2 | Nagios Administrator | Problem is acknowledged | gent | Memory Usage |
| 2013-05-10 14:33:50 | 2 | Nagios Administrator | Problem is acknowledged | gent | Users |
| 2013-05-10 14:33:50 | 2 | Nagios Administrator | Problem is acknowledged | gent | / Disk Usage |
| 2013-05-10 14:33:50 | 2 | Nagios Administrator | Problem is acknowledged | gent | Load |
| 2013-05-10 14:33:50 | 3 | Nagios Administrator | Problem is acknowledged | gent | super ping check |
| 2013-05-10 14:34:17 | 2 | Nagios Administrator | Problem is acknowledged | gent | Swap Usage |
| 2013-06-21 11:58:33 | 3 | andy | Some Acknowledgement Comment | gent | Load |
| 2013-06-21 11:59:17 | 3 | andy | Some Acknowledgement Comment | gent | Load |
+---------------------+-------+----------------------+------------------------------+-------+------------------+Code: Select all
echo "select name1,state_time,output,state from nagios.nagios_statehistory inner join nagios.nagios_objects on nagios.nagios_statehistory.object_id=nagios.nagios_objects.object_id where name1='gent' and name2='Load' and state_time <= '2013-06-21 11:59:17' order by state_time desc LIMIT 1;" | mysql -t -pnagiosxiCode: Select all
+-------+-------+---------------------+---------------------------------------------------------------------------------------------+-------+
| name1 | name2 | state_time | output | state |
+-------+-------+---------------------+---------------------------------------------------------------------------------------------+-------+
| gent | Load | 2013-06-04 23:18:17 | CHECK_NRPE: Received 0 bytes from daemon. Check the remote server logs for error messages. | 3 |
+-------+-------+---------------------+---------------------------------------------------------------------------------------------+-------+
The iteration through the ack's entry_time(s) may require a stored procedure, a script, or a query written by someone with more sql-fu than myself.
The queries above may need to be altered a bit for services, but the principle is the same.
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
Re: Outage / Availability Report
That's the real issue here. The correlation of the tables, with the exception of object_id's, is blurry at best. Joins must be exact.abrist wrote:I don't have all the answers here, but here is what I can comment on:
1) You could assume that the reported state of an object at the time of an ack on said object applied to that object's state at that time. This is not perfect, but it should allow you to derive which acks went with which states.
That's not making sense to me. I have a host that went into a problem state, then was disabled. When re-enabled, it went to a problem state again then went OK. I have no way of knowing that it was disabled by examining the database.abrist wrote:2) This is true, though you could either assume a previous state or ignore time ranges without states in nagios_statehistory.
State histories to other state histories (like a child and parent). See the post at the beginning of this thread. It explains it fairly good detail. In fact, none of the tables have any foreign keys that I can find.abrist wrote:3) Relation to what?
Unfortunately, relating object_id's is about the only thing I can find that is consistent in the tables.abrist wrote:My sql-fu is failing me today, but you could run a combination of the following queries to get close. The first will output the ack table with a join on object_id (where name1=<hostname>):Code: Select all
echo "select entry_time,state,author_name,comment_data,name1,name2 from nagios.nagios_acknowledgements inner join nagios.nagios_objects on nagios.nagios_acknowledgements.object_id=nagios.nagios_objects.object_id where name1='gent';" | mysql -t -pnagiosxi
Issue here is if I have more than one acknowledgement for an object, it will return all objects. Again, this is blurry for the joins and making a lot of assumptions. Neither of those are good for joins.abrist wrote:And then run a second query on the entry_time for each ack to output the nearest (in the past) state change from nagios_statehistory(where name1=<hostname>, name2=<service description>, and state_time=<the ack entry_time> - I used the last entry from the query results above):Code: Select all
echo "select name1,state_time,output,state from nagios.nagios_statehistory inner join nagios.nagios_objects on nagios.nagios_statehistory.object_id=nagios.nagios_objects.object_id where name1='gent' and name2='Load' and state_time <= '2013-06-21 11:59:17' order by state_time desc LIMIT 1;" | mysql -t -pnagiosxi
Honestly, there's just too much fuzzy logic in the queries. I'm able to get my query to work for some, but not all services, because the data just isn't there.abrist wrote: Notice how the state is the same.
The iteration through the ack's entry_time(s) may require a stored procedure, a script, or a query written by someone with more sql-fu than myself.
The queries above may need to be altered a bit for services, but the principle is the same.