NagiosXI availability report takes a long time

mejokj · Post by **mejokj** » Thu Sep 26, 2019 9:21 am

Hello,

We are trying to export NagiosXI availability reports for the last 6 months, and it takes a long time. The browser always crashes when running the report.

We have made the following changes in php.ini:
max_input_vars = 50000
memory_limit = 4048M
max_execution_time = 3600
max_input_time = 3600

The server has 32 GB of RAM and 32 processor cores.
Please help us to resolve the issue.
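
Before tuning further, it may be worth confirming that the values listed above are the ones actually present in the php.ini that PHP loads. A minimal sketch, assuming the file lives at /etc/php.ini (an assumption; `php --ini` reports the file the CLI loads, which can differ from the one the web server uses). The helper name `show_php_limits` is illustrative only.

```shell
# Print the active (uncommented) lines for the four limits discussed above.
# $1 is the path to a php.ini file; show_php_limits is a hypothetical helper name.
show_php_limits() {
    grep -E '^[[:space:]]*(max_input_vars|memory_limit|max_execution_time|max_input_time)[[:space:]]*=' "$1"
}
```

Usage would be something like `show_php_limits /etc/php.ini`. Lines commented out with `;` are not printed, so a missing line means PHP is falling back to its default for that setting.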

benjaminsmith · Post by **benjaminsmith** » Thu Sep 26, 2019 10:57 am

Hello @mejokj,

Generally speaking, the availability reports will take some time, as this data is stored in archived text files on the Nagios server. The process is I/O intensive, and increasing the PHP settings does help, but the main limitation is disk read/write activity. I would recommend running this report for smaller time periods to reduce load and the overall time required to create each report.
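
As a rough illustration of "smaller time periods", a six-month window can be split into one range per month and each range run as its own report. This is only a sketch: it assumes GNU date (for the `-d` option), and `month_ranges` is a made-up helper name, not something shipped with Nagios XI.

```shell
# Print one "start end" line per month; end is the first day of the next month.
# $1 = first month as YYYY-MM, $2 = number of months. Assumes GNU date.
month_ranges() {
    mr_first="$1-01"
    mr_count="$2"
    mr_i=0
    while [ "$mr_i" -lt "$mr_count" ]; do
        mr_start=$(date -u -d "$mr_first +$mr_i month" +%F)
        mr_end=$(date -u -d "$mr_first +$((mr_i + 1)) month" +%F)
        echo "$mr_start $mr_end"
        mr_i=$((mr_i + 1))
    done
}
```

For example, `month_ranges 2019-04 6` prints six lines, each of which can be used as the custom start and end dates for one report run.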

Please send a copy of your system profile for us to review so we can make sure there are no other issues. Also, please review our guide for increasing performance in Nagios XI.

To send us your system profile:
1. Log in to the Nagios XI GUI using a web browser.
2. Click the "Admin" > "System Profile" menu.
3. Click the "Download Profile" button.
4. Save the profile.zip file, share it in a private message, and then reply to this post to bring it up in the queue.

mejokj · Post by **mejokj** » Fri Sep 27, 2019 3:12 am

Hello,

Is there any way we could store those files on a ramdisk to improve the performance? Also, let us know exactly which files are used to generate the reports.

One thing I noticed is that the report does not take much time when run from Nagios Core: http://domain/core/.
In Nagios XI it takes a long time: http://domain/nagiosxi.

So it does not seem to be an I/O issue, because Core is not taking much time.

Kindly help me to troubleshoot the issue.

benjaminsmith · Post by **benjaminsmith** » Fri Sep 27, 2019 10:56 am

Hello @mejokj,

Most of the processing is done by a CGI program called avail.cgi (built from avail.c). The main difference between the two interfaces is the formatting, and there is some additional PHP processing to produce the XI version of the report compared to Nagios Core.

190826 16:30:51 [ERROR] mysqld: Table './nagios/nagios_logentries' is marked as crashed and should be repaired
190826 16:30:51 [Warning] Checking table: './nagios/nagios_logentries'
190826 16:30:58 [ERROR] mysqld: Table './nagiosxi/xi_eventqueue' is marked as crashed and should be repaired
190826 16:30:58 [ERROR] mysqld: Table './nagiosxi/xi_eventqueue' is marked as crashed and should be repaired
190826 16:30:58 [ERROR] mysqld: Table './nagiosxi/xi_eventqueue' is marked as crashed and should be repaired
190826 16:30:58 [ERROR] mysqld: Table './nagiosxi/xi_eventqueue' is marked as crashed and should be repaired
190826 16:30:58 [ERROR] mysqld: Table './nagiosxi/xi_eventqueue' is marked as crashed and should be repaired
190826 16:30:58 [ERROR] mysqld: Table './nagiosxi/xi_eventqueue' is marked as crashed and should be repaired
190826 16:30:58 [Warning] Checking table: './nagiosxi/xi_eventqueue'
190826 16:30:59 [ERROR] mysqld: Table './nagios/nagios_systemcommands' is marked as crashed and should be repaired
190826 16:30:59 [Warning] Checking table: './nagios/nagios_systemcommands'
190826 16:30:59 [ERROR] mysqld: Table './nagios/nagios_eventhandlers' is marked as crashed and should be repaired

You have several crashed database tables (see the log excerpt above), so let's get those repaired and then check the size of your tables, as very large tables can affect performance.

To repair the database tables, log in as root and run the following command:

/usr/local/nagiosxi/scripts/repair_databases.sh

Next, post the output of the following command to check the size of the database tables.

echo "SELECT table_name AS 'Table', round(((data_length + index_length) / 1024 / 1024), 2) 'Size in MB' FROM information_schema.TABLES WHERE table_schema IN ('nagios', 'nagiosql');" | mysql -h 127.0.0.1 -uroot -pnagiosxi --table

mejokj · Post by **mejokj** » Sat Sep 28, 2019 7:04 am

Hello,

I have repaired the databases and rerun the availability report, but it still takes a long time. It has now been running for more than 2 hours and has not completed.

Please find the command output.

+++++++++++++++++++++++++++++
[root@localhost ~]# echo "SELECT table_name AS 'Table', round(((data_length + index_length) / 1024 / 1024), 2) 'Size in MB' FROM information_schema.TABLES WHERE table_schema IN ('nagios', 'nagiosql');" | mysql -h 127.0.0.1 -uroot -pnagiosxi --table
+--------------------------------------------+------------+
| Table | Size in MB |
+--------------------------------------------+------------+
| nagios_acknowledgements | 0.00 |
| nagios_commands | 0.02 |
| nagios_commenthistory | 91.04 |
| nagios_comments | 0.02 |
| nagios_configfiles | 0.00 |
| nagios_configfilevariables | 0.01 |
| nagios_conninfo | 0.21 |
| nagios_contact_addresses | 0.00 |
| nagios_contact_notificationcommands | 0.01 |
| nagios_contactgroup_members | 0.00 |
| nagios_contactgroups | 0.00 |
| nagios_contactnotificationmethods | 102.77 |
| nagios_contactnotifications | 108.65 |
| nagios_contacts | 0.00 |
| nagios_contactstatus | 0.00 |
| nagios_customvariables | 1.37 |
| nagios_customvariablestatus | 1.37 |
| nagios_dbversion | 0.00 |
| nagios_downtimehistory | 0.00 |
| nagios_eventhandlers | 0.00 |
| nagios_externalcommands | 0.00 |
| nagios_flappinghistory | 50.83 |
| nagios_host_contactgroups | 0.00 |
| nagios_host_contacts | 0.42 |
| nagios_host_parenthosts | 0.00 |
| nagios_hostchecks | 0.00 |
| nagios_hostdependencies | 0.00 |
| nagios_hostescalation_contactgroups | 0.00 |
| nagios_hostescalation_contacts | 0.00 |
| nagios_hostescalations | 0.00 |
| nagios_hostgroup_members | 0.44 |
| nagios_hostgroups | 0.01 |
| nagios_hosts | 1.84 |
| nagios_hoststatus | 3.51 |
| nagios_instances | 0.00 |
| nagios_logentries | 1058.66 |
| nagios_notifications | 135.29 |
| nagios_objects | 1.98 |
| nagios_processevents | 0.24 |
| nagios_programstatus | 0.00 |
| nagios_runtimevariables | 0.00 |
| nagios_scheduleddowntime | 0.00 |
| nagios_service_contactgroups | 0.00 |
| nagios_service_contacts | 0.31 |
| nagios_service_parentservices | 0.00 |
| nagios_servicechecks | 0.00 |
| nagios_servicedependencies | 0.00 |
| nagios_serviceescalation_contactgroups | 0.00 |
| nagios_serviceescalation_contacts | 0.00 |
| nagios_serviceescalations | 0.00 |
| nagios_servicegroup_members | 0.29 |
| nagios_servicegroups | 0.00 |
| nagios_services | 1.19 |
| nagios_servicestatus | 3.54 |
| nagios_statehistory | 935.52 |
| nagios_systemcommands | 0.03 |
| nagios_timedeventqueue | 0.00 |
| nagios_timedevents | 0.00 |
| nagios_timeperiod_timeranges | 0.01 |
| nagios_timeperiods | 0.00 |
| tbl_command | 0.03 |
| tbl_contact | 0.01 |
| tbl_contactgroup | 0.01 |
| tbl_contacttemplate | 0.01 |
| tbl_domain | 0.01 |
| tbl_host | 1.55 |
| tbl_hostdependency | 0.00 |
| tbl_hostescalation | 0.00 |
| tbl_hostextinfo | 0.00 |
| tbl_hostgroup | 0.01 |
| tbl_hosttemplate | 0.01 |
| tbl_info | 0.13 |
| tbl_lnkContactToCommandHost | 0.00 |
| tbl_lnkContactToCommandService | 0.00 |
| tbl_lnkContactToContactgroup | 0.00 |
| tbl_lnkContactToContacttemplate | 0.00 |
| tbl_lnkContactToVariabledefinition | 0.00 |
| tbl_lnkContactgroupToContact | 0.00 |
| tbl_lnkContactgroupToContactgroup | 0.00 |
| tbl_lnkContacttemplateToCommandHost | 0.00 |
| tbl_lnkContacttemplateToCommandService | 0.00 |
| tbl_lnkContacttemplateToContactgroup | 0.00 |
| tbl_lnkContacttemplateToContacttemplate | 0.00 |
| tbl_lnkContacttemplateToVariabledefinition | 0.00 |
| tbl_lnkHostToContact | 0.28 |
| tbl_lnkHostToContactgroup | 0.00 |
| tbl_lnkHostToHost | 0.00 |
| tbl_lnkHostToHostgroup | 0.26 |
| tbl_lnkHostToHosttemplate | 0.23 |
| tbl_lnkHostToVariabledefinition | 0.18 |
| tbl_lnkHostdependencyToHost_DH | 0.00 |
| tbl_lnkHostdependencyToHost_H | 0.00 |
| tbl_lnkHostdependencyToHostgroup_DH | 0.00 |
| tbl_lnkHostdependencyToHostgroup_H | 0.00 |
| tbl_lnkHostescalationToContact | 0.00 |
| tbl_lnkHostescalationToContactgroup | 0.00 |
| tbl_lnkHostescalationToHost | 0.00 |
| tbl_lnkHostescalationToHostgroup | 0.00 |
| tbl_lnkHostgroupToHost | 0.00 |
| tbl_lnkHostgroupToHostgroup | 0.00 |
| tbl_lnkHosttemplateToContact | 0.00 |
| tbl_lnkHosttemplateToContactgroup | 0.00 |
| tbl_lnkHosttemplateToHost | 0.00 |
| tbl_lnkHosttemplateToHostgroup | 0.00 |
| tbl_lnkHosttemplateToHosttemplate | 0.00 |
| tbl_lnkHosttemplateToVariabledefinition | 0.00 |
| tbl_lnkServiceToContact | 0.03 |
| tbl_lnkServiceToContactgroup | 0.00 |
| tbl_lnkServiceToHost | 0.02 |
| tbl_lnkServiceToHostgroup | 0.00 |
| tbl_lnkServiceToServicegroup | 0.00 |
| tbl_lnkServiceToServicetemplate | 0.04 |
| tbl_lnkServiceToVariabledefinition | 0.03 |
| tbl_lnkServicedependencyToHost_DH | 0.00 |
| tbl_lnkServicedependencyToHost_H | 0.00 |
| tbl_lnkServicedependencyToHostgroup_DH | 0.00 |
| tbl_lnkServicedependencyToHostgroup_H | 0.00 |
| tbl_lnkServicedependencyToService_DS | 0.00 |
| tbl_lnkServicedependencyToService_S | 0.00 |
| tbl_lnkServiceescalationToContact | 0.00 |
| tbl_lnkServiceescalationToContactgroup | 0.00 |
| tbl_lnkServiceescalationToHost | 0.00 |
| tbl_lnkServiceescalationToHostgroup | 0.00 |
| tbl_lnkServiceescalationToService | 0.00 |
| tbl_lnkServicegroupToService | 0.00 |
| tbl_lnkServicegroupToServicegroup | 0.00 |
| tbl_lnkServicetemplateToContact | 0.00 |
| tbl_lnkServicetemplateToContactgroup | 0.00 |
| tbl_lnkServicetemplateToHost | 0.00 |
| tbl_lnkServicetemplateToHostgroup | 0.00 |
| tbl_lnkServicetemplateToServicegroup | 0.00 |
| tbl_lnkServicetemplateToServicetemplate | 0.00 |
| tbl_lnkServicetemplateToVariabledefinition | 0.00 |
| tbl_lnkTimeperiodToTimeperiod | 0.00 |
| tbl_logbook | 0.00 |
| tbl_mainmenu | 0.00 |
| tbl_service | 0.16 |
| tbl_servicedependency | 0.00 |
| tbl_serviceescalation | 0.00 |
| tbl_serviceextinfo | 0.00 |
| tbl_servicegroup | 0.01 |
| tbl_servicetemplate | 0.02 |
| tbl_session | 0.00 |
| tbl_session_locks | 0.00 |
| tbl_settings | 0.00 |
| tbl_submenu | 0.00 |
| tbl_timedefinition | 0.01 |
| tbl_timeperiod | 0.01 |
| tbl_user | 0.01 |
| tbl_variabledefinition | 0.48 |
+--------------------------------------------+------------+
+++++++++++++++++++++++++++++
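
To pick the large tables out of a listing like the one above without reading every row, the `--table` output can be piped through a small filter. A minimal sketch, assuming the two-column layout shown above; `large_tables` is an illustrative helper name and the threshold should be a positive number of MB.

```shell
# Keep only rows of a mysql --table listing whose size is at least $1 MB,
# printed as "size name" and sorted largest first.
# large_tables is a hypothetical helper, not part of Nagios XI.
large_tables() {
    awk -F'|' -v min="$1" '
        NF >= 3 && $3 + 0 >= min {
            gsub(/ /, "", $2)   # trim padding around the table name
            gsub(/ /, "", $3)   # trim padding around the size
            print $3, $2
        }' | sort -rn
}
```

Appending `| large_tables 50` to the size query would reduce the output above to the seven tables of 50 MB or more, led by nagios_logentries and nagios_statehistory.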

ssax · Post by **ssax** » Mon Sep 30, 2019 12:44 pm

What is the output of these commands?

ls -lh /usr/local/nagios/var/
ls -lh /usr/local/nagios/var/archives
top -n5
ls -lh /usr/local/nagiosxi/var
echo "SELECT relname as Table, pg_size_pretty(pg_total_relation_size(relid)) As Size, pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid)) as ExternalSize FROM pg_catalog.pg_statio_user_tables ORDER BY pg_total_relation_size(relid) DESC;" | psql nagiosxi nagiosxi
sar -A -f /var/log/sa/sa29

ssax · Post by **ssax** » Mon Sep 30, 2019 12:51 pm

In addition to my previous message (please read it and send me the output), pay close attention to this:

Because of the size of some of your database tables I recommend you make these changes:

Please go to Admin > Performance Settings > Databases tab:
- Update all 3 Optimize Intervals to 300
- Click the Update Settings button

Making that change should prevent an issue where one DB optimize hasn't finished when the next one starts, which can cause DB tables to crash. Increasing this interval should alleviate that.

FAQ: Can I truncate the tables first before proceeding with database repair (if I have crashed tables)?

You can truncate before repairing the DB, it's up to you. If you want to back it up first, you'll need to repair it. If you don't care, or already have a backup, truncate it first as it will speed up the DB repair process.

NOTE: You may need to adjust the -h 127.0.0.1, -uroot, and -pnagiosxi options in the commands if your DB is hosted on a different server and/or you've changed the root mysql password.

If you don't care about the data, or already have a backup, you can just truncate the tables which will essentially drop and recreate the table with zero data in it (removing all historical data for the respective reports):

nagios_logentries - Impacts Event Log report length

mysql -uroot -pnagiosxi -h 127.0.0.1 -B nagios -e 'TRUNCATE TABLE nagios_logentries;'

nagios_statehistory - Impacts the State History report length

mysql -uroot -pnagiosxi -h 127.0.0.1 -B nagios -e 'TRUNCATE TABLE nagios_statehistory;'

nagios_notifications - Impacts the Notifications report length

mysql -uroot -pnagiosxi -h 127.0.0.1 -B nagios -e 'TRUNCATE TABLE nagios_notifications;'

These should technically work to clean the DB tables up manually, provided the tables aren't crashed. If they ARE crashed, you will need to repair the database FIRST in order to run these queries:

nagios_logentries - Impacts Event Log report length

mysql -uroot -pnagiosxi -h 127.0.0.1 -B nagios -e 'DELETE FROM nagios_logentries WHERE logentry_time <= (NOW() - INTERVAL 6 MONTH);'

nagios_statehistory - Impacts the State History report length

mysql -uroot -pnagiosxi -h 127.0.0.1 -B nagios -e 'DELETE FROM nagios_statehistory WHERE state_time <= (NOW() - INTERVAL 6 MONTH);'

nagios_notifications - Impacts the Notifications report length

mysql -uroot -pnagiosxi -h 127.0.0.1 -B nagios -e 'DELETE FROM nagios_notifications WHERE start_time <= (NOW() - INTERVAL 6 MONTH);'
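
Before running any of the DELETE statements above, it can help to see how many rows each one would remove. The sketch below only builds the matching COUNT query; the helper name `count_old_rows_sql` is made up for illustration, and the table and column names are the ones used in the DELETE commands above.

```shell
# Print a COUNT query matching the rows the corresponding DELETE would remove.
# $1 = table, $2 = timestamp column, $3 = age in months.
# count_old_rows_sql is a hypothetical helper name.
count_old_rows_sql() {
    printf 'SELECT COUNT(*) FROM %s WHERE %s <= (NOW() - INTERVAL %s MONTH);\n' "$1" "$2" "$3"
}
```

For example, `count_old_rows_sql nagios_logentries logentry_time 6 | mysql -uroot -pnagiosxi -h 127.0.0.1 -B nagios` would report the row count without changing anything. The same connection caveats from the NOTE above apply.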

Then you should go to Admin > Performance Settings > Databases tab and adjust ALL of the retention intervals to meet your business data policy standards to keep them cleaned up as these settings are for adjusting the retention on those DB tables.

I would lower them to the smallest possible level and utilize the XI backup/restore process and the Admin > Scheduled Backups process to offload the backups to another server. Since these XI backups contain database backups you can spin them up to grab the data and report on them if needed.

See here for more information:

https://assets.nagios.com/downloads/nag ... os-XI.pdf

And here:

https://assets.nagios.com/downloads/nag ... abase.pdf

Including this info as well:

Generally at 10K total combined host/service checks (you're at 14K) we recommend that you set up a RAMDisk (you've already done that), and at around 20K we recommend you start looking at adding an additional XI server, because a single server can only process so much. This may come sooner or later than 20K depending on what type of checks you are running, how many resources they use, your hardware speed, and what you're doing to mitigate the impact.

You can read more about setting up a RAMDisk here:

https://assets.nagios.com/downloads/nag ... giosXI.pdf

You should run this check profiler script to identify your long-running checks. They consume resources the whole time they are running, so reducing them helps a lot:

https://exchange.nagios.org/directory/P ... me/details

The next step would be for you to look at offloading the checks using mod gearman to reduce the impact on the XI server (you've also already done this). This would be my recommendation for adding more services and alleviating the system issues. There's just so much going on with around 20K checks that you will need to do what you can to mitigate the impact, such as using mod gearman. Please see here for more information:

https://assets.nagios.com/downloads/nag ... ios_XI.pdf
https://support.nagios.com/kb/article.php?id=484

NOTE: Make sure that you follow the "Remote Worker Considerations" and the "Host groups and Service groups" sections from the second link above, and then follow the "Disable Worker" section from the first link once you've set up your exclude groups.

Please read through this doc as well. Did you enable Jumbo Frames on your network for better offloaded DB throughput? Given the sizes of the tables, that's likely most of the issue.

https://assets.nagios.com/downloads/nag ... ios-XI.pdf

You can only do so much on a single server. Do what you can to mitigate the impact, but you should start looking at adding another XI server soon if you continue to experience load/performance issues after the mitigation.

Let me know if you have any questions or if I can clarify anything.

mejokj · Post by **mejokj** » Wed Oct 02, 2019 3:12 am

I would like to clarify a few things here:

1. The XI is running on a physical server that has 32 cores and 32 GB RAM.

2. The load average on the server is 1.0 pretty much all the time and I guess this means we do not have to do any more performance tuning?

3. Yes, we do have a lot of checks - around 15k as you say but they do simple tasks like reading from files, and there's almost no load on the server at any given point of time. The interface is very responsive and all functions work smoothly.

4. I agree there were a few crashed tables and that could be because of a power outage a few weeks ago.

5. The only issue we are facing is while generating the availability reports - it takes way too long. As mentioned in this thread: https://support.nagios.com/forum/viewto ... =6&t=53484.

What is surprising is that the legacy reports (which use the same CGI?) get generated in 5 - 10 minutes. You wouldn't need an hour to convert one to PDF and add a pie chart, would you? That's assuming the XI report has nothing extra, but all we need is a legacy report in PDF with a pie chart.

In the end we generated the legacy report, copied the content to open office writer, took the total averages and generated pie chart from https://www.meta-chart.com/pie#/data, converted it all to PDF and sent it to the customer. The whole process took around 15 minutes.

Please let me know if you can give us a timeline for a fix. If not, we will have to automate the above process somehow, as this is bothering a lot of our customers (some of them with servers under high load as well).
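
If the manual process does end up being automated, the "total averages" step is the easy part to script. A minimal sketch, assuming the per-host figures have already been saved as `name,percent` lines (a hypothetical format chosen for illustration, not the actual legacy report layout); `average_percent` is an illustrative helper name.

```shell
# Average the second column of "name,percent" lines read from stdin.
# The input format is an assumption; adapt the field numbers to the real data.
average_percent() {
    awk -F',' 'NF >= 2 { sum += $2; n++ } END { if (n) printf "%.2f\n", sum / n }'
}
```

Something like `average_percent < availability.csv` (file name hypothetical) would then give the single figure used for the pie chart.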

Thanks!

ssax · Post by **ssax** » Wed Oct 02, 2019 9:16 am

That's a pre-written response that I send out to customers who need it (it covers the next steps you need to be looking at as you grow). A RAMDisk is your next step to alleviate I/O wait (which is generally the killer with the Availability report); the rest is just recommended practice, since you're starting to enter that territory.

If the legacy reports are fast, I'm wondering whether it's some XI-specific issue, but I need the output I asked for in order to try to diagnose it from afar.

Load mitigation is something you should always be thinking about in this area, simply because of the way it works. You're asking a server to run tens of thousands of processes on a schedule (plugins we don't control, and we don't limit your use) with an unknown resource impact. We can't control that or plan for it; we can only provide general recommendations.

Here's a very good KB article:

https://support.nagios.com/kb/article/nagios-xi-hardware-requirements-baseline-testing-523.html

mejokj · Post by **mejokj** » Sun Oct 13, 2019 5:24 am

Hello,

We have already set up the ramdisk, but we still have the same issue.
