
Re: NagiosXI UI 'Last Check' Lagging Behind

Posted: Fri May 03, 2019 11:42 am
by benjaminsmith
Hi @azreenariff,

Getting back to the initial issue, can you provide any more details about how the system is lagging? Is this consistent across all hosts and services, or only a few? Is the lag intermittent (does the server ever catch up)?

While you have made performance upgrades, I believe the issue here is that the Nagios XI server is having trouble processing results as you have a large number of services.

We would like to check the size of the database tables. Can you post the output of the following command?
NOTE: You may need to adjust the -h 127.0.0.1, the -uroot, and the -pnagiosxi in the first command if your DB is offloaded to another server and/or you've changed the root MySQL password.

Code:

echo "SELECT table_name AS 'Table', round(((data_length + index_length) / 1024 / 1024), 2) 'Size in MB' FROM information_schema.TABLES WHERE table_schema IN ('nagios', 'nagiosql', 'nagiosxi');" | mysql -h 127.0.0.1 -uroot -pnagiosxi --tab
Also, post the output of the following command to verify the tables:

Code:

echo "SELECT table_name AS 'Table', round(((data_length + index_length) / 1024 / 1024), 2) 'Size in MB' FROM information_schema.TABLES WHERE table_schema IN ('nagios', 'nagiosql', 'nagiosxi');" | mysql -h 127.0.0.1 -uroot -pnagiosxi --table | grep NULL
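Once you have that table output, it can be easier to read if the largest tables are listed first. The following is only a minimal sketch, assuming the output is in the `--table` layout shown by the commands above; `rank_tables` is a hypothetical helper name and not part of Nagios XI.

```shell
# rank_tables N: read `mysql --table` size output on stdin and print the N
# largest tables as "size_mb<TAB>table_name", largest first (default N=5).
# rank_tables is a hypothetical helper, not something shipped with Nagios XI.
rank_tables() {
    awk -F'|' '
        # Data rows look like "| name | 12.34 |"; border lines have no "|".
        NF >= 4 {
            name = $2; size = $3
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            gsub(/^[ \t]+|[ \t]+$/, "", size)
            # Keep numeric sizes only; this skips the header row and NULL sizes.
            if (size ~ /^[0-9]+(\.[0-9]+)?$/) printf "%s\t%s\n", size, name
        }
    ' | sort -rn | head -n "${1:-5}"
}
```

To use it, pipe the output of the size query into the function, for example by appending `| rank_tables 10` to the first command.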
What are your kernel message queue settings? You can check them with:

Code:

sysctl kernel.msg{max,mni,mnb}
If you haven't done so already, follow the guide below to increase the settings to allow for more messages to be processed.

NDOUtils - Message Queue Exceeded
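
The three values are easier to compare once their units are spelled out: on Linux, msgmax and msgmnb are byte counts (the largest single message and the largest total held in one queue), while msgmni is the number of queues allowed system-wide. The sketch below only restates the `sysctl` output in those terms; `msg_limits_report` is a hypothetical helper name, and it does not say what the values should be.

```shell
# msg_limits_report: read `sysctl kernel.msg{max,mni,mnb}` output on stdin and
# restate each limit with its unit. msgmax and msgmnb are byte counts;
# msgmni is a count of queues. (Hypothetical helper, for illustration only.)
msg_limits_report() {
    awk -F' *= *' '
        $1 == "kernel.msgmax" { printf "msgmax: %.0f MiB max size of one message\n", $2 / 1048576 }
        $1 == "kernel.msgmnb" { printf "msgmnb: %.0f MiB max bytes held in one queue\n", $2 / 1048576 }
        $1 == "kernel.msgmni" { printf "msgmni: %.0f queues system-wide\n", $2 }
    '
}
```

For example: `sysctl kernel.msg{max,mni,mnb} | msg_limits_report`.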

Re: NagiosXI UI 'Last Check' Lagging Behind

Posted: Mon May 06, 2019 2:39 am
by azreenariff
Hi Benjamin,

The issue is actually with the NagiosXI interface display, and yes, it is consistent across all hosts and services. Please refer to the attached sample image, where I have the NagiosXI and Nagios Core UIs open on the same screen: the 'Last Check' time displayed in the NagiosXI UI is about 30-60 minutes behind, whereas the Nagios Core UI shows the correct, up-to-date 'Last Check' time. There is no issue processing results, since the Nagios Core UI shows them correctly; only the NagiosXI UI is behind. What we need to know is whether there is a way to get the NagiosXI UI to display statuses as up to date as the Nagios Core UI does.
Nagios-Compare-2.PNG
For the size of the database tables, below are the outputs as requested:

echo "SELECT table_name AS 'Table', round(((data_length + index_length) / 1024 / 1024), 2) 'Size in MB' FROM information_schema.TABLES WHERE table_schema IN ('nagios', 'nagiosql', 'nagiosxi');" | mysql -h 127.0.0.1 -uroot -pnagiosxi --tab
+--------------------------------------------+------------+
| Table | Size in MB |
+--------------------------------------------+------------+
| alc | 0.00 |
| bdc | 0.00 |
| hla | 0.00 |
| hlib | 0.00 |
| hlisb | 0.00 |
| limit_t1 | 0.00 |
| limit_t2 | 0.00 |
| limit_t3 | 0.00 |
| limit_total | 0.00 |
| nagios_acknowledgements | 0.50 |
| nagios_commands | 0.03 |
| nagios_commenthistory | 19.14 |
| nagios_comments | 0.01 |
| nagios_configfiles | 0.00 |
| nagios_configfilevariables | 0.01 |
| nagios_conninfo | 0.36 |
| nagios_contact_addresses | 0.00 |
| nagios_contact_notificationcommands | 0.05 |
| nagios_contactgroup_members | 0.01 |
| nagios_contactgroups | 0.00 |
| nagios_contactnotificationmethods | 26.91 |
| nagios_contactnotifications | 26.31 |
| nagios_contacts | 0.01 |
| nagios_contactstatus | 0.01 |
| nagios_customvariables | 2.63 |
| nagios_customvariablestatus | 2.28 |
| nagios_dbversion | 0.00 |
| nagios_downtimehistory | 0.14 |
| nagios_eventhandlers | 0.01 |
| nagios_externalcommands | 58.99 |
| nagios_flappinghistory | 2.40 |
| nagios_host_contactgroups | 0.08 |
| nagios_host_contacts | 0.09 |
| nagios_host_parenthosts | 0.00 |
| nagios_hostchecks | 0.00 |
| nagios_hostdependencies | 0.00 |
| nagios_hostescalation_contactgroups | 0.00 |
| nagios_hostescalation_contacts | 0.00 |
| nagios_hostescalations | 0.00 |
| nagios_hostgroup_members | 0.19 |
| nagios_hostgroups | 0.01 |
| nagios_hosts | 0.37 |
| nagios_hoststatus | 0.78 |
| nagios_instances | 0.00 |
| nagios_logentries | 10757.31 |
| nagios_notifications | 6351.91 |
| nagios_objects | 10.58 |
| nagios_processevents | 0.56 |
| nagios_programstatus | 0.00 |
| nagios_runtimevariables | 0.00 |
| nagios_scheduleddowntime | 0.00 |
| nagios_service_contactgroups | 2.28 |
| nagios_service_contacts | 1.47 |
| nagios_service_parentservices | 0.00 |
| nagios_servicechecks | 0.00 |
| nagios_servicedependencies | 0.00 |
| nagios_serviceescalation_contactgroups | 0.00 |
| nagios_serviceescalation_contacts | 0.00 |
| nagios_serviceescalations | 0.00 |
| nagios_servicegroup_members | 0.00 |
| nagios_servicegroups | 0.00 |
| nagios_services | 9.11 |
| nagios_servicestatus | 19.85 |
| nagios_statehistory | 1580.26 |
| nagios_systemcommands | 0.01 |
| nagios_timedeventqueue | 0.00 |
| nagios_timedevents | 0.00 |
| nagios_timeperiod_timeranges | 0.01 |
| nagios_timeperiods | 0.00 |
| profile | 0.00 |
| region | 0.02 |
| tc | 0.00 |
| tbl_command | 0.05 |
| tbl_contact | 0.02 |
| tbl_contactgroup | 0.01 |
| tbl_contacttemplate | 0.01 |
| tbl_domain | 0.01 |
| tbl_host | 0.33 |
| tbl_hostdependency | 0.00 |
| tbl_hostescalation | 0.00 |
| tbl_hostextinfo | 0.00 |
| tbl_hostgroup | 0.02 |
| tbl_hosttemplate | 0.01 |
| tbl_info | 0.13 |
| tbl_lnkcontactgrouptocontact | 0.00 |
| tbl_lnkcontactgrouptocontactgroup | 0.00 |
| tbl_lnkcontacttemplatetocommandhost | 0.00 |
| tbl_lnkcontacttemplatetocommandservice | 0.00 |
| tbl_lnkcontacttemplatetocontactgroup | 0.00 |
| tbl_lnkcontacttemplatetocontacttemplate | 0.00 |
| tbl_lnkcontacttemplatetovariabledefinition | 0.00 |
| tbl_lnkcontacttocommandhost | 0.00 |
| tbl_lnkcontacttocommandservice | 0.00 |
| tbl_lnkcontacttocontactgroup | 0.00 |
| tbl_lnkcontacttocontacttemplate | 0.01 |
| tbl_lnkcontacttovariabledefinition | 0.00 |
| tbl_lnkhostdependencytohost_dh | 0.00 |
| tbl_lnkhostdependencytohost_h | 0.00 |
| tbl_lnkhostdependencytohostgroup_dh | 0.00 |
| tbl_lnkhostdependencytohostgroup_h | 0.00 |
| tbl_lnkhostescalationtocontact | 0.00 |
| tbl_lnkhostescalationtocontactgroup | 0.00 |
| tbl_lnkhostescalationtohost | 0.00 |
| tbl_lnkhostescalationtohostgroup | 0.00 |
| tbl_lnkhostgrouptohost | 0.05 |
| tbl_lnkhostgrouptohostgroup | 0.01 |
| tbl_lnkhosttemplatetocontact | 0.00 |
| tbl_lnkhosttemplatetocontactgroup | 0.00 |
| tbl_lnkhosttemplatetohost | 0.00 |
| tbl_lnkhosttemplatetohostgroup | 0.00 |
| tbl_lnkhosttemplatetohosttemplate | 0.00 |
| tbl_lnkhosttemplatetovariabledefinition | 0.00 |
| tbl_lnkhosttocontact | 0.05 |
| tbl_lnkhosttocontactgroup | 0.05 |
| tbl_lnkhosttohost | 0.00 |
| tbl_lnkhosttohostgroup | 0.03 |
| tbl_lnkhosttohosttemplate | 0.05 |
| tbl_lnkhosttovariabledefinition | 0.01 |
| tbl_lnkservicedependencytohost_dh | 0.00 |
| tbl_lnkservicedependencytohost_h | 0.00 |
| tbl_lnkservicedependencytohostgroup_dh | 0.00 |
| tbl_lnkservicedependencytohostgroup_h | 0.00 |
| tbl_lnkservicedependencytoservice_ds | 0.00 |
| tbl_lnkservicedependencytoservice_s | 0.00 |
| tbl_lnkserviceescalationtocontact | 0.00 |
| tbl_lnkserviceescalationtocontactgroup | 0.00 |
| tbl_lnkserviceescalationtohost | 0.00 |
| tbl_lnkserviceescalationtohostgroup | 0.00 |
| tbl_lnkserviceescalationtoservice | 0.00 |
| tbl_lnkservicegrouptoservice | 0.01 |
| tbl_lnkservicegrouptoservicegroup | 0.00 |
| tbl_lnkservicetemplatetocontact | 0.00 |
| tbl_lnkservicetemplatetocontactgroup | 0.00 |
| tbl_lnkservicetemplatetohost | 0.00 |
| tbl_lnkservicetemplatetohostgroup | 0.00 |
| tbl_lnkservicetemplatetoservicegroup | 0.00 |
| tbl_lnkservicetemplatetoservicetemplate | 0.01 |
| tbl_lnkservicetemplatetovariabledefinition | 0.00 |
| tbl_lnkservicetocontact | 2.04 |
| tbl_lnkservicetocontactgroup | 0.14 |
| tbl_lnkservicetohost | 2.61 |
| tbl_lnkservicetohostgroup | 0.00 |
| tbl_lnkservicetoservicegroup | 0.00 |
| tbl_lnkservicetoservicetemplate | 2.60 |
| tbl_lnkservicetovariabledefinition | 2.03 |
| tbl_lnktimeperiodtotimeperiod | 0.00 |
| tbl_logbook | 0.00 |
| tbl_mainmenu | 0.00 |
| tbl_service | 15.99 |
| tbl_servicedependency | 0.00 |
| tbl_serviceescalation | 0.00 |
| tbl_serviceextinfo | 0.00 |
| tbl_servicegroup | 0.01 |
| tbl_servicetemplate | 0.02 |
| tbl_session | 0.00 |
| tbl_session_locks | 0.00 |
| tbl_settings | 0.00 |
| tbl_submenu | 0.00 |
| tbl_timedefinition | 0.01 |
| tbl_timeperiod | 0.01 |
| tbl_user | 0.01 |
| tbl_variabledefinition | 3.69 |
| xi_auditlog | 2.98 |
| xi_commands | 0.00 |
| xi_eventqueue | 0.01 |
| xi_events | 0.32 |
| xi_incidents | 0.00 |
| xi_meta | 12.29 |
| xi_options | 0.02 |
| xi_sysstat | 0.01 |
| xi_usermeta | 0.46 |
| xi_users | 0.01 |
+--------------------------------------------+------------+


echo "SELECT table_name AS 'Table', round(((data_length + index_length) / 1024 / 1024), 2) 'Size in MB' FROM information_schema.TABLES WHERE table_schema IN ('nagios', 'nagiosql', 'nagiosxi');" | mysql -h 127.0.0.1 -uroot -pnagiosxi --table | grep NULL
- No output for this

Message queue settings:

# sysctl kernel.msg{max,mni,mnb}
kernel.msgmax = 2097152000
kernel.msgmni = 256000000
kernel.msgmnb = 2097152000


Thanks.

Re: NagiosXI UI 'Last Check' Lagging Behind

Posted: Mon May 06, 2019 10:31 am
by benjaminsmith
Hi @azreenariff,

Thank you for your last reply and for uploading the screenshots. I've discussed your system internally with the support team. What's happening here is that the server is lagging due to the large number of hosts and services. Generally, at 10K total combined host/service checks we recommend that you set up a RAMDisk (you've already done this). At around 20K, we recommend you start looking at adding an additional XI server, because a single server can only process so much. This point may come sooner or later than 20K depending on what type of checks you are running, how many resources they use, your hardware speed, and what you're doing to mitigate the impact.

Recommendations:

1. Execution Time Plugin. Run this check profiler script to find out which of your checks are long running. Long-running checks consume resources the whole time they are running, so reducing them helps a lot:

https://exchange.nagios.org/directory/P ... me/details

2. Mod-Gearman. The next step would be to offload the checks using Mod-Gearman to reduce the impact on the XI server (you've already done this as well). This is what I would recommend if you want to add more services and alleviate the system issues. There is so much going on with around 20K checks that you will need to do whatever you can to mitigate the impact, such as using Mod-Gearman. Please see here for more information:

https://assets.nagios.com/downloads/nag ... ios_XI.pdf
https://support.nagios.com/kb/article.php?id=484

NOTE: Make sure that you follow the "Remote Worker Considerations" and the "Host groups and Service groups" sections from the second link above, and then follow the "Disable Worker" section from the first link once you've set up your exclude groups.

Please read through this doc as well. With the number of checks you are running, though, I would leave the DB local at this point in time, because the large total number of checks requires a lot of throughput to the DB (enabling jumbo_frames is recommended):

https://assets.nagios.com/downloads/nag ... ios-XI.pdf

3. Adjust Database Settings. Go to Admin > Performance Settings > Databases and adjust your retention settings to the smallest values you can. You're trying to cram far more into a single system than we recommend, so you'll need to make some sacrifices somewhere to mitigate that.

4. Truncate Tables. Your nagios_logentries table is huge. If you want a performance boost, truncate your large tables:
- This will likely speed up kernel message queue (KMQ) processing, so try this first

| nagios_logentries | 10757.31 |
| nagios_notifications | 6351.91 |

Follow this guide here:
- Specifically, follow this section "In certain instances, it may be necessary to truncate (empty) one or more tables" on page 5 of the PDF

https://assets.nagios.com/downloads/nag ... tabase.pdf

5. Move your DB back to local. This SHOULD make kernel message queue processing quick enough (that's the lag you are seeing).
- That's the only solution I've ever been able to find to this "customer has too many hosts/service checks on a single system to process the kernel message queue fast enough across the network" issue.

6. Additional XI Server. Consider a new XI license and split the load.
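
Before truncating, it may help to see how much of the database the candidate tables account for. The sketch below adds up the "Size in MB" column for the tables you name, reading the same `--table` output as the size query earlier in the thread. `sum_table_mb` is a hypothetical helper name, and the figure is only a rough estimate because the sizes reported by information_schema can be approximate.

```shell
# sum_table_mb NAME...: read `mysql --table` size output on stdin and print the
# combined "Size in MB" of the named tables, to two decimal places.
# sum_table_mb is a hypothetical helper, not something shipped with Nagios XI.
sum_table_mb() {
    awk -F'|' -v names="$*" '
        BEGIN {
            # Build a lookup set from the space-separated table names.
            n = split(names, list, " ")
            for (i = 1; i <= n; i++) want[list[i]] = 1
        }
        NF >= 4 {
            name = $2; size = $3
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            gsub(/^[ \t]+|[ \t]+$/, "", size)
            if (name in want) total += size
        }
        END { printf "%.2f\n", total }
    '
}
```

With the two tables quoted above, piping the size query into `sum_table_mb nagios_logentries nagios_notifications` adds 10757.31 and 6351.91, which comes to 17109.22 MB.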

Let us know if you have any questions or if we can clarify anything.

Re: NagiosXI UI 'Last Check' Lagging Behind

Posted: Mon May 06, 2019 9:47 pm
by azreenariff
Hi Benjamin,

You've been really helpful. Thank you so much for your kind support. We will see what we can do based on all your recommendations.

Thanks again.

Re: NagiosXI UI 'Last Check' Lagging Behind

Posted: Tue May 07, 2019 10:37 am
by benjaminsmith
Hi @azreenariff,

Glad to be of service, and I will keep this post open if you have any other questions.

Thanks!