avail.cgi, 100% CPU when running report.

yo_marc · Post by **yo_marc** » Mon Jun 03, 2019 2:19 pm

Hi Nagios Support,

I've got an XI server here with about 900 hosts and 4,880 services. I'm having trouble running various reports, specifically when looking back more than a week or so. The Availability and Executive Summary reports are two examples. If I run a report on the host and service data for a single host over the last month, both take a long time to complete (roughly 5 minutes). This also seems to delay host and service check result processing, which in turn becomes congested; since I have alerting set up for host/service check latency, I get alerts for that.

Looking deeper, the troublesome reports seem to call 'avail.cgi', which sits at 100% CPU for a few minutes at a time. If I try to go back or open other Nagios pages while it runs, I see my browser waiting for an available socket.

Doing a strace of avail.cgi, I see a LOT of messages such as this:

brk(NULL)                               = 0x7ec9f000
brk(0x7ecc0000)                         = 0x7ecc0000
brk(NULL)                               = 0x7ecc0000
brk(0x7ece1000)                         = 0x7ece1000
brk(NULL)                               = 0x7ece1000
brk(0x7ed02000)                         = 0x7ed02000
brk(NULL)                               = 0x7ed02000
brk(0x7ed23000)                         = 0x7ed23000
brk(NULL)                               = 0x7ed23000
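
The repeated `brk` pairs show the process growing its heap in small, fixed steps, which is what a program looks like when it keeps allocating memory as it reads data. As a rough illustration, here is a minimal Python sketch that summarizes such a trace. `summarize_brk` is a hypothetical helper name, and the sketch assumes the trace text has the same shape as the excerpt above.

```python
import re

# Matches strace lines such as "brk(NULL) = 0x7ec9f000" or
# "brk(0x7ecc0000) = 0x7ecc0000" and captures the returned break address.
BRK_LINE = re.compile(r"^brk\((?:NULL|0x[0-9a-f]+)\)\s*=\s*(0x[0-9a-f]+)")


def summarize_brk(trace_text):
    """Return (growth_steps, total_bytes_grown, distinct_step_sizes)."""
    breaks = []
    for line in trace_text.splitlines():
        match = BRK_LINE.match(line.strip())
        if match:
            breaks.append(int(match.group(1), 16))
    # A step is any change in the program break between consecutive calls.
    steps = [b - a for a, b in zip(breaks, breaks[1:]) if b != a]
    return len(steps), sum(steps), sorted(set(steps))


sample = """\
brk(NULL)                               = 0x7ec9f000
brk(0x7ecc0000)                         = 0x7ecc0000
brk(NULL)                               = 0x7ecc0000
brk(0x7ece1000)                         = 0x7ece1000
brk(NULL)                               = 0x7ece1000
brk(0x7ed02000)                         = 0x7ed02000
brk(NULL)                               = 0x7ed02000
brk(0x7ed23000)                         = 0x7ed23000
brk(NULL)                               = 0x7ed23000
"""

count, total, sizes = summarize_brk(sample)
print(count, total, sizes)  # 4 steps, 540672 bytes, each step 135168 bytes (132 KiB)
```

Run against a full trace, the total gives a rough idea of how much memory the report asks for over its lifetime.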

The server is running XI 5.5.11. I've had this server running for about 3 years now. I spun up one of our daily XI backups of this server on a different system, upgraded that one to XI 5.6.2, and the problem persists there.

Can you help me figure out what the problem here is? Is there perhaps any cleanup that I can or should perform?

I've removed all perfdata files that have not been updated in the last 90 days. I do have a lot of 'disabled' hosts and services sitting in the CCM.

Thanks,
-marc

yo_marc · Post by **yo_marc** » Mon Jun 03, 2019 2:33 pm

Another data point or two:

I have two other XI servers which do not have this issue. One is only running about 37 Hosts and 127 Services - The other is 79/306 respectively.

First server is at version 5.6.2, the second is at 5.5.11.

They've both been running for 2-3 years.

npolovenko · Post by **npolovenko** » Mon Jun 03, 2019 2:52 pm

Hello, @yo_marc. How many host and service checks are running on the server that is not working correctly? The avail.cgi component is not very efficient when processing data for a large number of hosts and services. We're planning to rewrite it in a future release.
https://github.com/NagiosEnterprises/na ... issues/280

But as of right now, I'd highly recommend deleting old archived nagios log files from:

/usr/local/nagios/var/archives

If you're only going to be running reports for the last 30 days, I'd remove all archived log files older than 30 days. A lot of them should have built up over the last 3 years.
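
One cautious way to approach the cleanup is to list the candidates first and delete nothing automatically. The Python sketch below does only that. `find_old_archives` is a hypothetical helper, it judges age by file modification time, and it assumes the rotated logs have names that start with `nagios-` and end with `.log`; check both assumptions against your own directory before relying on it.

```python
import os
import time


def find_old_archives(archive_dir, keep_days=30, now=None):
    """List archived log files last modified more than keep_days ago.

    Nothing is deleted here; the caller decides what to do with the result.
    """
    now = time.time() if now is None else now
    cutoff = now - keep_days * 86400
    old = []
    for entry in os.scandir(archive_dir):
        # Assumption: rotated logs are regular files named nagios-*.log.
        looks_like_log = entry.name.startswith("nagios-") and entry.name.endswith(".log")
        if entry.is_file() and looks_like_log and entry.stat().st_mtime < cutoff:
            old.append(entry.path)
    return sorted(old)


if __name__ == "__main__":
    archive_dir = "/usr/local/nagios/var/archives"  # path from the post above
    if os.path.isdir(archive_dir):
        for path in find_old_archives(archive_dir, keep_days=30):
            print(path)  # review the list, then move or delete the files by hand
```

Moving the listed files to a backup location rather than deleting them keeps the option of restoring them if an older report is needed later.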

yo_marc · Post by **yo_marc** » Mon Jun 03, 2019 3:27 pm

Thanks! I'll be taking a good look at that archives directory.

The server with the slow reports is running nearly 900 Hosts and 4,880 Services.

npolovenko · Post by **npolovenko** » Mon Jun 03, 2019 3:40 pm

@yo_marc, Sounds good! Keep us updated.

yo_marc · Post by **yo_marc** » Tue Jun 04, 2019 10:57 am

I trimmed down the contents of the /usr/local/nagios/var/archives directory from 1186 items to 186. Six months' worth of data is what remains... Unfortunately, the time it takes to run the various types of availability reports didn't improve.

Management is looking to review the availability of a specific host to see if uptime and service availability is trending in the right direction... the slow reports are definitely a hindrance... We're shipping monitoring logs off to an ELK cluster, so I can extract availability data from there pretty quickly-- however, that data is raw and it would be nice to be able to use the nice succinct Nagios reports instead.

If I narrow the focus of the reports to just 1 month, I can get what I need, but it's not particularly fast: between 3 and 6 minutes. However, if I want the availability report for just 1 host for the last quarter, avail.cgi times out (fails to generate a report) after 20 minutes 30 seconds:

brk(NULL)                               = 0x1ebc86000
brk(0x1ebca7000)                        = 0x1ebca7000
????( <unfinished ...>
+++ killed by SIGKILL +++

Is Nagios XI 6.x.x the release in which we can expect a refactoring of the reporting tools?

npolovenko · Post by **npolovenko** » Tue Jun 04, 2019 4:39 pm

@yo_marc, Yes, the refactoring is on the roadmap for XI 6.
https://www.nagios.com/roadmaps/

Reports are highly dependent on disk I/O, so faster disks should speed up the reports in turn. Switching to an SSD could perhaps substantially increase I/O throughput and improve report generation speed.

Aside from that, generating reports for only 1 month at a time would be my main recommendation for right now.
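
To cover a longer period such as a quarter, the range can be split into calendar-month windows and one report run per window. The Python sketch below shows only the date arithmetic. `month_windows` is a hypothetical helper, and how each window is passed to the report is left out on purpose, since that depends on the report interface.

```python
from datetime import date


def month_windows(start, end):
    """Split the half-open range [start, end) into per-calendar-month pieces."""
    windows = []
    cursor = start
    while cursor < end:
        # First day of the month that follows the cursor's month.
        if cursor.month == 12:
            next_month = date(cursor.year + 1, 1, 1)
        else:
            next_month = date(cursor.year, cursor.month + 1, 1)
        stop = min(next_month, end)
        windows.append((cursor, stop))
        cursor = stop
    return windows


# Example: the quarter from March through May 2019.
for first, last in month_windows(date(2019, 3, 1), date(2019, 6, 1)):
    print(first.isoformat(), "to", last.isoformat())
# 2019-03-01 to 2019-04-01
# 2019-04-01 to 2019-05-01
# 2019-05-01 to 2019-06-01
```

Each window ends where the next one begins, so the monthly results can be combined without counting any day twice.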

Nagios Support Forum
