Outage / Availability Report

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
belvdr
Posts: 81
Joined: Tue Oct 08, 2013 9:17 pm

Re: Outage / Availability Report

Post by belvdr »

Fair enough.

I'm not finding an easy to find the relation though between the OK and the first non-OK state. I might be able to use state_time_usec, but I cannot correlate that value to the state_time column. Can you elaborate on that column?
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: Outage / Availability Report

Post by abrist »

You have to compute the time between state changes yourself. The description of the fields are:
state_time: The date/time that the state changed.
state_time_usec: The microsecond portion of the time the state change applies to.
See the following doc for more information:
http://nagios.sourceforge.net/docs/ndou ... _Model.pdf
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
belvdr
Posts: 81
Joined: Tue Oct 08, 2013 9:17 pm

Re: Outage / Availability Report

Post by belvdr »

I have some of the query written, but MySQL has very limited analytic functions so it's a mess.

Here's what I've written so far. I find the following:

1. Start Time - The first time after an OK state that an object is non-OK.
2. End Time - The time at which an object goes OK.

I then rank those times per object and attempt to line the Start Time and End Time up. It works sometimes, but not always.

Issue A:

If the report is ran when an object is non-OK, then I have more start times than end times, so I need to shift the ranking and place the current date/time at the top rank.

Issue B:
Given the information that is logged in the table, I don't think this is going to be a viable query to be run against the transactional database.

For example, say HostA has ServiceA configured. If the following occurs, it becomes an unwieldy query to run:

1. 5/1/2014 - Service goes down and is logged in Nagios
2. 5/1/2014 (later that day) - an admin disables ServiceA in Nagios as the server needs reconfigured
3. 5/2/2014 - Admin corrects the server configuration and enables ServiceA in Nagios
4. 5/2/2014 - Service goes in OK state

ServiceA will appear as though it's down for over one day.

So, it becomes a task of gathering the time periods for every object in the database for when it was enabled. Then use that to find availability.
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: Outage / Availability Report

Post by abrist »

I understand your pain. Computing proper availability is not easy. You could do what core does and just parse the logs, but I doubt that will be any easier. You may have to overlay downtime starts and ends to check for when a "down" service should actually be considered down.
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
belvdr
Posts: 81
Joined: Tue Oct 08, 2013 9:17 pm

Re: Outage / Availability Report

Post by belvdr »

Unfortunately, I think the database needs a redesign. I have found no way to correlate acknowledgements to a state either. There appear to be no foreign keys on the tables either. :o
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: Outage / Availability Report

Post by abrist »

One of the major issues with comments is that they are not persistent in the db. Acks are stored in the nagios_acknowledgements table, but you must do an inner join on the nagios_objects table to identify the relevant host/service.
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
belvdr
Posts: 81
Joined: Tue Oct 08, 2013 9:17 pm

Re: Outage / Availability Report

Post by belvdr »

abrist wrote:One of the major issues with comments is that they are not persistent in the db. Acks are stored in the nagios_acknowledgements table, but you must do an inner join on the nagios_objects table to identify the relevant host/service.
Right I can do that, but I cannot associate an acknowledgement with a particular entry in the nagios_statehistory table.

At the end of the day, the information isn't logged correctly in the tables to get what I need:

Code: Select all

1.  No correlation between an acknowledgement and state history
2.  No log of when an object is disabled for checks
3.  No relation of the state histories for an object
I don't believe there's any way to get this type of report from the database. Also, foreign keys are non-existent, so data can be inserted and/or removed without a parent. I have no idea if that's occurring, but know it's technically possible.
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: Outage / Availability Report

Post by abrist »

I don't have all the answers here, but here is what I can comment on:
1) You could assume that the reported state of an object at the time of an ack on said object applied to that object's state at that time. This is not perfect, but it should allow you to derive which acks went with which states.
2) This is true, though you could either assume a previous state or ignore time ranges without states in nagios_statehistory.
3) Relation to what?

My sql-fu is failing me today, but you could run a combination of the following queries to get close. The first will output the ack table with a join on object_id (where name1=<hostname>):

Code: Select all

echo "select entry_time,state,author_name,comment_data,name1,name2 from nagios.nagios_acknowledgements inner join nagios.nagios_objects on nagios.nagios_acknowledgements.object_id=nagios.nagios_objects.object_id where name1='gent';" | mysql -t -pnagiosxi
This should output a table similar to:

Code: Select all

+---------------------+-------+----------------------+------------------------------+-------+------------------+
| entry_time          | state | author_name          | comment_data                 | name1 | name2            |
+---------------------+-------+----------------------+------------------------------+-------+------------------+
| 2012-11-26 05:07:21 |     1 | Nagios Administrator | Problem is acknowledged      | gent  | Memory Usage     |
| 2012-11-26 05:07:21 |     1 | Nagios Administrator | Problem is acknowledged      | gent  | / Disk Usage     |
| 2013-02-18 13:48:04 |     2 | Andyb                | Problem is acknowledged      | gent  | CPU Stats        |
| 2013-02-18 13:48:04 |     2 | Andyb                | Problem is acknowledged      | gent  | Swap Usage       |
| 2013-02-18 13:48:04 |     2 | Andyb                | Problem is acknowledged      | gent  | Memory Usage     |
| 2013-02-18 13:48:04 |     2 | Andyb                | Problem is acknowledged      | gent  | Users            |
| 2013-02-18 13:48:04 |     2 | Andyb                | Problem is acknowledged      | gent  | / Disk Usage     |
| 2013-02-18 13:48:04 |     2 | Andyb                | Problem is acknowledged      | gent  | Total Processes  |
| 2013-02-18 13:48:04 |     2 | Andyb                | Problem is acknowledged      | gent  | Load             |
| 2013-05-10 14:33:50 |     1 | Nagios Administrator | Problem is acknowledged      | gent  | NULL             |
| 2013-05-10 14:33:50 |     2 | Nagios Administrator | Problem is acknowledged      | gent  | Ping             |
| 2013-05-10 14:33:50 |     2 | Nagios Administrator | Problem is acknowledged      | gent  | Total Processes  |
| 2013-05-10 14:33:50 |     2 | Nagios Administrator | Problem is acknowledged      | gent  | CPU Stats        |
| 2013-05-10 14:33:50 |     2 | Nagios Administrator | Problem is acknowledged      | gent  | Swap Usage       |
| 2013-05-10 14:33:50 |     2 | Nagios Administrator | Problem is acknowledged      | gent  | Memory Usage     |
| 2013-05-10 14:33:50 |     2 | Nagios Administrator | Problem is acknowledged      | gent  | Users            |
| 2013-05-10 14:33:50 |     2 | Nagios Administrator | Problem is acknowledged      | gent  | / Disk Usage     |
| 2013-05-10 14:33:50 |     2 | Nagios Administrator | Problem is acknowledged      | gent  | Load             |
| 2013-05-10 14:33:50 |     3 | Nagios Administrator | Problem is acknowledged      | gent  | super ping check |
| 2013-05-10 14:34:17 |     2 | Nagios Administrator | Problem is acknowledged      | gent  | Swap Usage       |
| 2013-06-21 11:58:33 |     3 | andy                 | Some Acknowledgement Comment | gent  | Load             |
| 2013-06-21 11:59:17 |     3 | andy                 | Some Acknowledgement Comment | gent  | Load             |
+---------------------+-------+----------------------+------------------------------+-------+------------------+
And then run a second query on the entry_time for each ack to output the nearest (in the past) state change from nagios_statehistory(where name1=<hostname>, name2=<service description>, and state_time=<the ack entry_time> - I used the last entry from the query results above):

Code: Select all

echo "select name1,state_time,output,state from nagios.nagios_statehistory inner join nagios.nagios_objects on nagios.nagios_statehistory.object_id=nagios.nagios_objects.object_id where name1='gent' and name2='Load' and state_time <= '2013-06-21 11:59:17' order by state_time desc LIMIT 1;" | mysql -t -pnagiosxi
This query should output something similar to:

Code: Select all

+-------+-------+---------------------+---------------------------------------------------------------------------------------------+-------+
| name1 | name2 | state_time          | output                                                                                      | state |
+-------+-------+---------------------+---------------------------------------------------------------------------------------------+-------+
| gent  | Load  | 2013-06-04 23:18:17 | CHECK_NRPE: Received 0 bytes from daemon.  Check the remote server logs for error messages. |     3 |
+-------+-------+---------------------+---------------------------------------------------------------------------------------------+-------+

Notice how the state is the same.
The iteration through the ack's entry_time(s) may require a stored procedure, a script, or a query written by someone with more sql-fu than myself.
The queries above may need to be altered a bit for services, but the principle is the same.
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
belvdr
Posts: 81
Joined: Tue Oct 08, 2013 9:17 pm

Re: Outage / Availability Report

Post by belvdr »

abrist wrote:I don't have all the answers here, but here is what I can comment on:
1) You could assume that the reported state of an object at the time of an ack on said object applied to that object's state at that time. This is not perfect, but it should allow you to derive which acks went with which states.
That's the real issue here. The correlation of the tables, with the exception of object_id's, is blurry at best. Joins must be exact.
abrist wrote:2) This is true, though you could either assume a previous state or ignore time ranges without states in nagios_statehistory.
That's not making sense to me. I have a host that went into a problem state, then was disabled. When re-enabled, it went to a problem state again then went OK. I have no way of knowing that it was disabled by examining the database.
abrist wrote:3) Relation to what?
State histories to other state histories (like a child and parent). See the post at the beginning of this thread. It explains it fairly good detail. In fact, none of the tables have any foreign keys that I can find.
abrist wrote:My sql-fu is failing me today, but you could run a combination of the following queries to get close. The first will output the ack table with a join on object_id (where name1=<hostname>):

Code: Select all

echo "select entry_time,state,author_name,comment_data,name1,name2 from nagios.nagios_acknowledgements inner join nagios.nagios_objects on nagios.nagios_acknowledgements.object_id=nagios.nagios_objects.object_id where name1='gent';" | mysql -t -pnagiosxi
Unfortunately, relating object_id's is about the only thing I can find that is consistent in the tables.
abrist wrote:And then run a second query on the entry_time for each ack to output the nearest (in the past) state change from nagios_statehistory(where name1=<hostname>, name2=<service description>, and state_time=<the ack entry_time> - I used the last entry from the query results above):

Code: Select all

echo "select name1,state_time,output,state from nagios.nagios_statehistory inner join nagios.nagios_objects on nagios.nagios_statehistory.object_id=nagios.nagios_objects.object_id where name1='gent' and name2='Load' and state_time <= '2013-06-21 11:59:17' order by state_time desc LIMIT 1;" | mysql -t -pnagiosxi
Issue here is if I have more than one acknowledgement for an object, it will return all objects. Again, this is blurry for the joins and making a lot of assumptions. Neither of those are good for joins.
abrist wrote: Notice how the state is the same.
The iteration through the ack's entry_time(s) may require a stored procedure, a script, or a query written by someone with more sql-fu than myself.
The queries above may need to be altered a bit for services, but the principle is the same.
Honestly, there's just too much fuzzy logic in the queries. I'm able to get my query to work for some, but not all services, because the data just isn't there. :?
Locked