Availability Reporting not accurate

rrosenbaum
Posts: 3
Joined: Fri May 19, 2017 8:43 am

Availability Reporting not accurate

Post by rrosenbaum »

Hello,

In trying to create SLA reports, I've run into issues that, judging by outside posts from 2010/2012, I thought would have been resolved by now. The reports are simply not accurate, and if even one isn't accurate, how can we trust any of them?

I don't think there is anything unusual about my installation. It has been running on CentOS 7 for years, and alerts etc. have been working great. I've got 20 or so host groups and 125 or so hosts (storage, switches, etc.). I'm not a guru, but it all works (except for the reporting).

Today I upgraded from Core 4.1x to 4.3.2.

Although I'm running this test on a single host for this discussion, the data also appears incorrect for host groups that contain this and other hosts, and likewise for some services. I picked a single host that I know was down for some short periods of time, so its availability should not be 100%. It was not in scheduled downtime, it was just down, which for purposes of this discussion is a good thing. Here are my findings from running a report for that single host:

1. Availability for a single host, all defaults, Last 7 days. Report Time Period: None, yes, yes, yes, no; First Assumed Host State: Host Up; First Assumed Service State: Service OK
  • Looks fine, except why would "Host Log Entries:" include items from 2015 through 2017? That's weird, because they are not in the time period specified by the report. (No, I did not click "View full log entries".)
2. Same thing, except I changed the time period to "This Year"
  • OK, now I get something other than 100%, which is what we want and expect. Up is 99.508% (137d 16h 51m 54s), Down is 0.492% (0d 16h 19m 42s). The Host Log Entries from outside the report period are still there, but they include the January and early April downtime. Take note.
3. Changed the time period to "Custom", 5-1-2016 to 4-30-2017
  • Here's the problem: Up 100%, Down 0%, nothing in Host Log Entries
My questions are:

If we know the host was down in January and early April of this year (see #2), and that time range falls inside the custom report's time range, why does the custom report show 100% uptime, and why doesn't it show the host log entries like the earlier reports did?

Is there a different mechanism for reports that is used for a custom date range?

Is there a way to produce the report query manually?

I briefly read something about a mysql interface..... is that a possibility?

I don't like being asked to provide data for us to publish when the data I can easily provide is not accurate. Sorry, but I didn't need this right now.

Thanks for any input,

Rich
tgriep
Madmin
Posts: 9177
Joined: Thu Oct 30, 2014 9:02 am

Re: Availability Reporting not accurate

Post by tgriep »

I am trying to recreate what you are seeing, but without the exact settings and logs I may not be able to.

1. The Host Log Entries are controlled by the Backtracked Archives setting. If you run the report again with it set to zero, do the Host Log Entries look correct?
That setting tells the Availability report how many archived log files to go back through. If the system thinks the log files were archived yearly, that could explain entries from 2015 showing up.
To check this, look in the /usr/local/nagios/var/archives folder.
Run this command as root and post the output.

Code:

ls -l /usr/local/nagios/var/archives
2. Could be the same as above.

3. Can you provide a screen capture of that one so I can see the issue?
Be sure to check out our Knowledgebase for helpful articles and solutions!
rrosenbaum
Posts: 3
Joined: Fri May 19, 2017 8:43 am

Re: Availability Reporting not accurate

Post by rrosenbaum »

Thanks for the reply. Apparently there are only 3 log files in that directory.

I tried running the reports again after choosing "Backtracked archives: 0". Same results. It's been such a long time since I originally set Nagios up that I just went and looked at the log configuration, which I don't believe I've ever touched. Log rotation is set to daily.

Code:

ls -l /usr/local/nagios/var/archives
-rw-r--r-- 1 nagios nagios 91187067 May 19 23:59 nagios-05-20-2017-00.log
-rw-r--r-- 1 nagios nagios    57615 May 20 23:59 nagios-05-21-2017-00.log
-rw-r--r-- 1 nagios nagios    57285 May 21 23:59 nagios-05-22-2017-00.log
Thank you for looking at this.
Attachments
Report for April 30 2016 through May 1 2017 - should include data from "This year" - not correct..
Report for "This year" - looks correct
tgriep
Madmin
Posts: 9177
Joined: Thu Oct 30, 2014 9:02 am

Re: Availability Reporting not accurate

Post by tgriep »

If you look at the size of the archive log file called nagios-05-20-2017-00.log, it is much larger than the others. Without looking at it, I would guess something was set wrong in the log rotation before the upgrade and the Nagios system logged all of the history into one file.
After the upgrade, it looks like the setting was changed to daily rotation, as the logs are separate files now.

When you ran the report using the custom time period, it looked for archived log files in that time frame by the time stamp of the files. Since you do not have any files in that time frame, it is showing you that the host is up, because the First Assumed Host State setting was set to UP.
I think you can fix this by splitting that large log file into daily log files, following the example from this site.
https://stackoverflow.com/questions/266 ... s-log-file
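
To confirm that one archive really holds the whole history, it can help to print the first and last timestamps it contains. This is only a minimal sketch: it assumes each log line starts with an epoch timestamp in square brackets, as Nagios Core log lines normally do, and that GNU date is available. log_span is a made-up helper name, not a Nagios tool.

```shell
# log_span FILE
# Print the first and last entry timestamps found in a Nagios log file.
# Assumes lines look like "[1441929600] HOST ALERT: ..." and GNU date.
log_span() (
    file=$1
    # First line whose bracketed field is purely numeric.
    first=$(awk -F '[][]' '$2 ~ /^[0-9]+$/ { print $2; exit }' "$file")
    # Last such line in the file.
    last=$(awk -F '[][]' '$2 ~ /^[0-9]+$/ { t = $2 } END { print t }' "$file")
    echo "first: $(date -d "@$first" '+%F %T')"
    echo "last:  $(date -d "@$last" '+%F %T')"
)
```

For example, `log_span /usr/local/nagios/var/archives/nagios-05-20-2017-00.log` should show whether that file spans years rather than a single day.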
rrosenbaum
Posts: 3
Joined: Fri May 19, 2017 8:43 am

Re: Availability Reporting not accurate

Post by rrosenbaum »

Thanks for the input. You were correct; however, the solution I finally ended up using was a bit different from what you suggested.

My logs went back to mid 2015. I started experimenting, and since the log rotation was set to daily, I got this to work:

Code:

# Copy one day's entries (epoch timestamps in [brackets]) into their own daily log file.
d1=2015-09-11
d2=2015-09-12
awk -v t1="$(date -d "$d1" +%s)" -v t2="$(date -d "$d2" +%s)" -F '[][]' '$2 >= t1 && $2 < t2' nagios-05-20-2017-00.log.orig > "nagios-$(date -d "$d1" +%m-%d-%Y)-00.log"
I had to make some changes from the post you referenced: the field separator was different, and the naming convention for the logs had to be changed. I did about 5 of them that way. I was going to script it all out with a while loop that increments the date, but then I started wondering why this had happened in the first place. It seems that in Core 3 the rotation default is none, and it might have been that in Core 4 at first, but at some point it changed to daily, which is what seemed to screw things up.
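
For anyone who does want the scripted version, the while loop mentioned above could look roughly like this. It is a sketch, not what was actually run here: split_daily and its arguments are made-up names, and GNU date is assumed. It also names each file for the day after its entries, because that is how the listing earlier in the thread looks (nagios-05-20-2017-00.log was last written on May 19 at 23:59). The one-off command above names the file for the start day instead, so use whichever matches your own archives.

```shell
# split_daily SRC START END [OUTDIR]
# Split SRC into one file per day for dates in [START, END), given as YYYY-MM-DD.
# Assumption: an archive is named for the midnight at which it was rotated,
# i.e. the day AFTER the entries it holds.
split_daily() (
    src=$1 day=$2 end=$3 outdir=${4:-.}
    while [ "$(date -d "$day" +%s)" -lt "$(date -d "$end" +%s)" ]; do
        next=$(date -d "$day + 1 day" +%F)
        out="$outdir/nagios-$(date -d "$next" +%m-%d-%Y)-00.log"
        # Keep only lines whose bracketed epoch falls inside this day.
        awk -v t1="$(date -d "$day" +%s)" -v t2="$(date -d "$next" +%s)" \
            -F '[][]' '$2 >= t1 && $2 < t2' "$src" > "$out"
        # Do not leave empty files behind for days without entries.
        [ -s "$out" ] || rm -f "$out"
        day=$next
    done
)
```

Called as `split_daily nagios-05-20-2017-00.log.orig 2015-09-11 2017-05-19 /tmp/split`, it writes into a scratch directory so the results can be checked before anything is copied into the archives folder. It rereads the source once per day, which is slow on a 90 MB file but keeps the logic the same as the single-day command.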

I shut Nagios down, then copied the contents of the daily logs from the last 5 days or so, in order, onto the end of the nagios.log file. I restarted Nagios, and ta-da, reporting works fine.

Also, I changed log rotation in the Nagios config file from daily back to none.
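
For reference, the rotation setting lives in the main config file as log_rotation_method, where n is none, h hourly, d daily, w weekly and m monthly. A quick way to see the current logging settings is sketched below. The default path is an assumption based on a standard source install under /usr/local/nagios, and show_log_settings is just a made-up helper name.

```shell
# show_log_settings [CFG]
# Print the log-related directives from the main Nagios config file.
# The default path is an assumption (source install); pass your own if it differs.
show_log_settings() (
    cfg=${1:-/usr/local/nagios/etc/nagios.cfg}
    grep -E '^[[:space:]]*(log_file|log_rotation_method|log_archive_path)[[:space:]]*=' "$cfg"
)
```

After the change described above, the output should include `log_rotation_method=n`.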

Thanks
tgriep
Madmin
Posts: 9177
Joined: Thu Oct 30, 2014 9:02 am

Re: Availability Reporting not accurate

Post by tgriep »

Putting all of the logs back into one file is a quick fix for this issue, great idea.
The only downside is that, over time, the log file will grow to the point where it could affect how efficiently the Nagios daemon runs, so keep an eye on it and make sure it doesn't get out of control.
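
One simple way to keep an eye on it is a small size check that could be run by hand or from cron. This is a sketch only: check_log_size is a made-up name, the default path assumes a source install under /usr/local/nagios, and the 500 MB default limit is an arbitrary placeholder rather than a recommended value.

```shell
# check_log_size [FILE] [LIMIT_MB]
# Warn (and return non-zero) when the log file reaches LIMIT_MB megabytes.
# Both defaults are assumptions; adjust them for your system.
check_log_size() (
    file=${1:-/usr/local/nagios/var/nagios.log}
    limit_mb=${2:-500}
    size_mb=$(( $(wc -c < "$file") / 1024 / 1024 ))
    if [ "$size_mb" -ge "$limit_mb" ]; then
        echo "WARNING: $file is ${size_mb} MB (limit ${limit_mb} MB)"
        return 1
    fi
    echo "OK: $file is ${size_mb} MB"
)
```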