Page 1 of 1

Translating state history to more meaningful report

Posted: Fri Jun 20, 2014 3:00 pm
by ajorgens
I'm a Nagios and Linux novice: I have next to zero experience with bash scripts, etc. My boss has tasked me with generating reports going back three months of ISP activity in our various facilities. We monitor the PFSense at each facility with Nagios and get reports on each PFSense' ISP1 and ISP2. The "State History" is the only report I've found that displays the relevant data but it isn't in a format that's readable by the executives we're reporting the information to. What I get from the report is like so:

Code: Select all

|2014-06-19 19:55:07| |100-Facility-PRSense| |ISP2_Interface_Ping| |OK| |HARD| etc.
|2014-06-19 19:50:17| |100-Facility-PFSense| |ISP2_Interface_Ping| |CRITICAL| |HARD
and so on

100-Facility-PFSense is the host and ISP2_Interface_Ping is the service (each of the PFSense routers will have an ISP1 and ISP2 interface ping).

I need to translate that data into:

Code: Select all

|     Date    |  |Facility|  |ISP| |Time Down| |Time Up| |Total| |  Impact  |
|6/19/2014|   |    100    |  | 2 | | 7:50PM  | |7:55PM | |5min | |No impact |
So what this says is that on the 19th, the secondary ISP for facility 100 went down for 5 minutes from 7:50-7:55pm and had no impact because the primary ISP didn't go down.
Each facility has an ISP1 and ISP2 with 1 being the primary and 2 being the secondary at much lower bandwidth. If ISP2 goes down, there is no impact because the primary is still up. If ISP1 goes down, the impact is "Slow" because the secondary ISP has much lower bandwidth. If both go down, the impact is "Down" because the facility has lost all internet connectivity.

Currently, I have to go through the state history and find each instance of "CRITICAL", note the date, time, facility, etc. and scroll up until I find an "OK" for that same facility and ISP. We have enough facilities with enough incidents that doing a weekly report takes two hours and I'm being asked to do a report that goes back three months.

What I'm asking is, how can I go about automating this report and exporting it to an excel sheet or CSV? I'm sure that scripts can be written to grab the data from Nagios directly and I'm sure command line or powershell or visual basic scripts can be used to parse the data from a CSV exported from NagiosXI into an Excel sheet. There may also be a way do it directly in Excel but my Excel expertise is even thinner than my linux. There may even be a nagios plugin for this functionality but I haven't been able to find one. I don't care which of these methods I end up using; I just need to figure out a way to speed this process up so I can get this reporting done in a reasonable time rather than devoting the next two weeks to it.

Re: Translating state history to more meaningful report

Posted: Mon Jun 23, 2014 3:48 pm
by sreinhardt
Well, I'm not going to write anything that would convert it for you, but I would be happy to provide an outline of what I might do to format the data in a way that you want.

First thing I would do, is separate the single log into individual logs per device or per location, whichever you choose.

Then I would take the log files one at a time, and parse them a single line at a time, using the | as separators much like a csv.
  • Capture date from the first section, and store it in a variable
    We should already know what facility this is, based on the first separation of the log files.
    Capture the Service name from the third section, or store the isp name.
    Capture the state it moved to, and if needed soft\hard status.
    If the captured state was becoming critical, you want to capture the date and time from the next line for that service as well, so that you know recovery times.
Now you can begin generating your output.
  • Start by making a single variable to store the entire line of output
    push the date of your event as captured above, into your variable.
    push the facility name, into the variable.
    push the isp captured from above, into the variable.
    push the warning or critical hard state as the time down.
    push the ok or recovery state as time up
    calculate and push the difference for total duration.
    determine your impact... Not sure how you would go about that.
print your line out, as it should contain all the information you need. Loop to the beginning of parsing your source log file and start on the next event.

I realize this isn't the easiest thing to do or follow, unfortunately there really aren't any better options that I know of. On that note, have you tried the core report features and verified they won't do what you need?