
Nagios Core and XI out of sync

Posted: Tue Feb 09, 2021 12:03 am
by safuanmansor
Hi Support,

We have an incident where Nagios XI is out of sync with Nagios Core again. We encountered this issue before, and based on the MariaDB logs, it is showing "[Warning] Could not increase number of max_open_files to more than 20000 (request: 70011)" (last time the value was 5000). Unfortunately, we are currently also experiencing an "Object does not exist" error that sometimes occurs when clicking on any host or service; the issue goes away, seemingly at random, after hitting the Apply Configuration button.

So I was wondering whether there is any value/guideline/recommendation/benchmark that can be followed to address the open file limits for the Nagios XI databases. Is the second issue related, and do you have any idea how to fix it?


Thanks
Safuan

Re: Nagios Core and XI out of sync

Posted: Tue Feb 09, 2021 2:52 pm
by dchurch
Do you know how many hosts / services you're monitoring? What distro are you running? What version of Nagios XI?

If you PM me a system profile I can diagnose further. Get one by going to Admin (top menu) => System Profile (in the left menu), then clicking the blue button.

If you're unable to generate the profile through the web interface, please try generating it from the command line by running these commands as root:

Code: Select all

rm -rf /usr/local/nagiosxi/var/components/profile*
/usr/local/nagiosxi/scripts/components/getprofile.sh SUPPORT
Then send me the resulting /usr/local/nagiosxi/var/components/profile.zip file.
If the profile script fails, please include the ENTIRE output.

Re: Nagios Core and XI out of sync

Posted: Wed Feb 10, 2021 4:52 am
by safuanmansor
Hi dchurch,

The system info is as below.
XI 5.6.7
Redhat 7.9
Hosts: 2000+
Services: 31000+
I have PMed you the profile for review.


Regards,
Safuan

Re: Nagios Core and XI out of sync

Posted: Wed Feb 10, 2021 2:15 pm
by dchurch
The second issue might be related to the max_open_files error, but it's not guaranteed.

It's likely not due to a corrupted database, but it's a good idea to run the database repair script anyway:

Code: Select all

/usr/local/nagiosxi/scripts/repair_databases.sh
MariaDB issue re:max_open_files

The error means the limit is being hit somewhere. Let’s resolve that by editing any configured limits. Have a look at the following files:

- /etc/systemd/system/mariadb.service.d/migrated-from-my.cnf-settings.conf
- /etc/systemd/system/mysqld.service.d/limits.conf
- /usr/lib/systemd/system/mariadb.service
- /etc/systemd/system/mysql.service
- /etc/systemd/system/mysqld.service

Look within those files for the following config lines:

Code: Select all

LimitNOFILE=
LimitMEMLOCK=
Change these lines to your new limit. For example:

Code: Select all

LimitNOFILE=100000
LimitMEMLOCK=100000
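One way to apply such a limit without editing the packaged unit file is a systemd drop-in. The following is a minimal sketch under assumptions: the service is named `mariadb`, the drop-in file name `limits.conf` mirrors the `mysqld.service.d/limits.conf` path listed above, and `write_limits_dropin` is a hypothetical helper, not part of Nagios XI.

```shell
# Hypothetical helper: write a systemd drop-in that sets the open-file and
# locked-memory limits for a service.
# Arguments: drop-in directory, limit value.
write_limits_dropin() {
    local dir="$1" limit="$2"
    mkdir -p "$dir"
    printf '[Service]\nLimitNOFILE=%s\nLimitMEMLOCK=%s\n' "$limit" "$limit" \
        > "$dir/limits.conf"
}

# Example (run as root; the unit name "mariadb" is an assumption):
#   write_limits_dropin /etc/systemd/system/mariadb.service.d 100000
#   systemctl daemon-reload
#   systemctl restart mariadb
```

systemd only picks up the new values after `systemctl daemon-reload` and a restart of the database service.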
Other issues

Looks like the PHP process is running into issues because it can't modify the files in your ramdisk:

Code: Select all

[Sun Feb 07 21:19:24.688092 2021] [:error] [pid 21763] [client 10.150.1.143:51757] PHP Warning:  unlink(/usr/local/nagiosramdisk//5/3249178946/3130682211): Permission denied in /usr/local/nagiosxi/html/includes/utils-backend.inc.php on line 0, referer: http://10.103.12.94/nagiosxi/includes/components/birdseye/birdseye.php
I'd inspect the permissions in that directory structure to make sure the Apache daemon has access to modify it. Alternatively, since it looks like your config shied away from storing the perf data in the ramdisk, reconfigure birdseye to not use the ramdisk anymore.
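A quick way to inspect that directory tree is to list each directory with its owner, group, and mode. This is a sketch assuming GNU find (as shipped with RHEL); `show_dir_perms` is a hypothetical helper.

```shell
# Hypothetical helper: print "owner:group mode path" for every directory
# under the given root. Requires GNU find for -printf.
show_dir_perms() {
    find "$1" -type d -printf '%u:%g %m %p\n'
}

# Example (path taken from the PHP warning above):
#   show_dir_perms /usr/local/nagiosramdisk | head
```

Deleting a file requires write permission on its parent directory, so the directories to look at are those that the web server account (typically `apache` on RHEL, an assumption here) cannot write to.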

Re: Nagios Core and XI out of sync

Posted: Thu Feb 11, 2021 6:03 am
by safuanmansor
Hi dchurch,

The issue sometimes happens 3 to 4 times a day, and yes, our current workaround is running the database repair script. Running this script multiple times a day is not efficient.

As for the open file limit: it was raised from 5000 to 20000 two months ago and has now hit the maximum again. I understand that increasing this number gives the database more room for open files/file descriptors. Do you have any method or benchmark that we can refer to, so that we can tune the DB prior to adding more services in the future?
E.g.:
10000 services - 30000 to 50000 limit
30000 services - 60000 to 100000 limit


Thanks,
Safuan
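As a rough starting point for sizing, the MySQL reference manual describes the number of descriptors the server tries to obtain as the largest of `10 + max_connections + (table_open_cache * 2)`, `max_connections * 5`, and any configured `open_files_limit`. The request therefore depends on the database settings rather than directly on the number of services. MariaDB's exact calculation may differ by version, so treat the sketch below as an estimate only; `estimate_open_files` is a hypothetical helper and the input values are illustrative, not taken from this system.

```shell
# Hypothetical helper: estimate the file descriptors mysqld will request.
# Arguments: max_connections, table_open_cache, optional open_files_limit.
estimate_open_files() {
    local max_conn="$1" table_cache="$2" configured="${3:-0}"
    local by_tables=$(( 10 + max_conn + table_cache * 2 ))
    local by_conns=$(( max_conn * 5 ))
    local want="$by_tables"
    if [ "$by_conns" -gt "$want" ]; then want="$by_conns"; fi
    if [ "$configured" -gt "$want" ]; then want="$configured"; fi
    echo "$want"
}

# Illustrative values only: 10 + 1000 + 8000 = 9010, larger than 1000 * 5.
estimate_open_files 1000 4000    # prints 9010
```

The operating system or systemd limit (`LimitNOFILE`) then needs to be at least this high; otherwise the server logs the "Could not increase number of max_open_files" warning.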

Re: Nagios Core and XI out of sync

Posted: Thu Feb 11, 2021 6:28 pm
by ssax
Please send a screenshot of Admin > Performance Settings > Databases (the whole page).

Send the output of these commands as root:

Code: Select all

sar
sysctl -p
ulimit -a
su -s /bin/bash -c 'ulimit -a' nagios
su -s /bin/bash -c 'ulimit -a' mysql
Additionally, include the output of these commands:
- NOTE: You may need to adjust the -h 127.0.0.1, the -uroot, and -pnagiosxi in the first command if your DB is offloaded to another server and/or you've changed the root mysql password

Code: Select all

echo "SELECT table_name AS 'Table', round(((data_length + index_length) / 1024 / 1024), 2) 'Size in MB' FROM information_schema.TABLES WHERE table_schema IN ('nagios', 'nagiosql', 'nagiosxi');" | mysql -h 127.0.0.1 -uroot -pnagiosxi --table
This next command may fail, that's okay, not all systems run postgresql:

Code: Select all

echo "SELECT relname as Table, pg_size_pretty(pg_total_relation_size(relid)) As Size, pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid)) as ExternalSize FROM pg_catalog.pg_statio_user_tables ORDER BY pg_total_relation_size(relid) DESC;" | psql nagiosxi nagiosxi

Re: Nagios Core and XI out of sync

Posted: Wed Feb 17, 2021 9:38 pm
by safuanmansor
Hi Support,

Database performance screenshot.
image.png
Command results:
1. sar
sar.PNG
2. sysctl -p
sysctl -p.PNG
3. ulimit -a
ulimit -a.PNG
4. ulimit -a for nagios and mysql
ulimit.PNG
5. Query
echo "SELECT table_name AS 'Table', round(((data_length + index_length) / 1024 / 1024), 2) 'Size in MB' FROM information_schema.TABLES WHERE table_schema IN ('nagios', 'nagiosql', 'nagiosxi');" | mysql -h 127.0.0.1 -uroot -pnagiosxi --table
+--------------------------------------------+------------+
| Table | Size in MB |
+--------------------------------------------+------------+
| alc | 0.00 |
| bdc | 0.00 |
| hla | 0.00 |
| hlib | 0.00 |
| hlisb | 0.00 |
| limit_t1 | 0.00 |
| limit_t2 | 0.00 |
| limit_t3 | 0.00 |
| limit_total | 0.00 |
| nagios_acknowledgements | 0.60 |
| nagios_commands | 0.08 |
| nagios_commenthistory | 2730.27 |
| nagios_comments | 1.13 |
| nagios_configfiles | 0.00 |
| nagios_configfilevariables | 0.01 |
| nagios_conninfo | 1.59 |
| nagios_contact_addresses | 0.00 |
| nagios_contact_notificationcommands | 0.25 |
| nagios_contactgroup_members | 0.04 |
| nagios_contactgroups | 0.01 |
| nagios_contactnotificationmethods | 211.60 |
| nagios_contactnotifications | 221.34 |
| nagios_contacts | 0.06 |
| nagios_contactstatus | 0.04 |
| nagios_customvariables | 0.98 |
| nagios_customvariablestatus | 1.37 |
| nagios_dbversion | 0.00 |
| nagios_downtimehistory | 81.36 |
| nagios_eventhandlers | 0.14 |
| nagios_externalcommands | 0.40 |
| nagios_flappinghistory | 9.54 |
| nagios_host_contactgroups | 0.18 |
| nagios_host_contacts | 0.26 |
| nagios_host_parenthosts | 0.00 |
| nagios_hostchecks | 0.00 |
| nagios_hostdependencies | 0.00 |
| nagios_hostescalation_contactgroups | 0.00 |
| nagios_hostescalation_contacts | 0.00 |
| nagios_hostescalations | 0.00 |
| nagios_hostgroup_members | 0.72 |
| nagios_hostgroups | 0.02 |
| nagios_hosts | 0.48 |
| nagios_hoststatus | 1.21 |
| nagios_instances | 0.00 |
| nagios_logentries | 10719.98 |
| nagios_notifications | 3057.69 |
| nagios_objects | 19.33 |
| nagios_processevents | 1.07 |
| nagios_programstatus | 0.00 |
| nagios_runtimevariables | 0.00 |
| nagios_scheduleddowntime | 0.60 |
| nagios_service_contactgroups | 1.78 |
| nagios_service_contacts | 7.24 |
| nagios_service_parentservices | 0.00 |
| nagios_servicechecks | 0.00 |
| nagios_servicedependencies | 0.00 |
| nagios_serviceescalation_contactgroups | 0.00 |
| nagios_serviceescalation_contacts | 0.00 |
| nagios_serviceescalations | 0.00 |
| nagios_servicegroup_members | 0.08 |
| nagios_servicegroups | 0.00 |
| nagios_services | 5.35 |
| nagios_servicestatus | 16.48 |
| nagios_statehistory | 865.41 |
| nagios_systemcommands | 0.11 |
| nagios_timedeventqueue | 0.00 |
| nagios_timedevents | 0.00 |
| nagios_timeperiod_timeranges | 0.04 |
| nagios_timeperiods | 0.01 |
| profile | 0.00 |
| region | 0.02 |
| tc | 0.00 |
| tbl_command | 0.12 |
| tbl_contact | 0.08 |
| tbl_contactgroup | 0.01 |
| tbl_contacttemplate | 0.01 |
| tbl_domain | 0.01 |
| tbl_host | 0.43 |
| tbl_hostdependency | 0.00 |
| tbl_hostescalation | 0.00 |
| tbl_hostextinfo | 0.00 |
| tbl_hostgroup | 0.03 |
| tbl_hosttemplate | 0.01 |
| tbl_info | 0.13 |
| tbl_lnkcontactgrouptocontact | 0.01 |
| tbl_lnkcontactgrouptocontactgroup | 0.00 |
| tbl_lnkcontacttemplatetocommandhost | 0.00 |
| tbl_lnkcontacttemplatetocommandservice | 0.00 |
| tbl_lnkcontacttemplatetocontactgroup | 0.00 |
| tbl_lnkcontacttemplatetocontacttemplate | 0.00 |
| tbl_lnkcontacttemplatetovariabledefinition | 0.00 |
| tbl_lnkcontacttocommandhost | 0.00 |
| tbl_lnkcontacttocommandservice | 0.00 |
| tbl_lnkcontacttocontactgroup | 0.00 |
| tbl_lnkcontacttocontacttemplate | 0.01 |
| tbl_lnkcontacttovariabledefinition | 0.00 |
| tbl_lnkhostdependencytohost_dh | 0.00 |
| tbl_lnkhostdependencytohost_h | 0.00 |
| tbl_lnkhostdependencytohostgroup_dh | 0.00 |
| tbl_lnkhostdependencytohostgroup_h | 0.00 |
| tbl_lnkhostescalationtocontact | 0.00 |
| tbl_lnkhostescalationtocontactgroup | 0.00 |
| tbl_lnkhostescalationtohost | 0.00 |
| tbl_lnkhostescalationtohostgroup | 0.00 |
| tbl_lnkhostgrouptohost | 0.17 |
| tbl_lnkhostgrouptohostgroup | 0.01 |
| tbl_lnkhosttemplatetocontact | 0.00 |
| tbl_lnkhosttemplatetocontactgroup | 0.00 |
| tbl_lnkhosttemplatetohost | 0.00 |
| tbl_lnkhosttemplatetohostgroup | 0.00 |
| tbl_lnkhosttemplatetohosttemplate | 0.00 |
| tbl_lnkhosttemplatetovariabledefinition | 0.00 |
| tbl_lnkhosttocontact | 0.14 |
| tbl_lnkhosttocontactgroup | 0.11 |
| tbl_lnkhosttohost | 0.00 |
| tbl_lnkhosttohostgroup | 0.02 |
| tbl_lnkhosttohosttemplate | 0.06 |
| tbl_lnkhosttovariabledefinition | 0.02 |
| tbl_lnkservicedependencytohost_dh | 0.00 |
| tbl_lnkservicedependencytohost_h | 0.00 |
| tbl_lnkservicedependencytohostgroup_dh | 0.00 |
| tbl_lnkservicedependencytohostgroup_h | 0.00 |
| tbl_lnkservicedependencytoservice_ds | 0.00 |
| tbl_lnkservicedependencytoservice_s | 0.00 |
| tbl_lnkservicedependencytoservicegroup_ds | 0.02 |
| tbl_lnkservicedependencytoservicegroup_s | 0.02 |
| tbl_lnkserviceescalationtocontact | 0.00 |
| tbl_lnkserviceescalationtocontactgroup | 0.00 |
| tbl_lnkserviceescalationtohost | 0.00 |
| tbl_lnkserviceescalationtohostgroup | 0.00 |
| tbl_lnkserviceescalationtoservice | 0.00 |
| tbl_lnkserviceescalationtoservicegroup | 0.02 |
| tbl_lnkservicegrouptoservice | 0.08 |
| tbl_lnkservicegrouptoservicegroup | 0.00 |
| tbl_lnkservicetemplatetocontact | 0.00 |
| tbl_lnkservicetemplatetocontactgroup | 0.00 |
| tbl_lnkservicetemplatetohost | 0.00 |
| tbl_lnkservicetemplatetohostgroup | 0.00 |
| tbl_lnkservicetemplatetoservicegroup | 0.00 |
| tbl_lnkservicetemplatetoservicetemplate | 0.01 |
| tbl_lnkservicetemplatetovariabledefinition | 0.00 |
| tbl_lnkservicetocontact | 0.19 |
| tbl_lnkservicetocontactgroup | 0.19 |
| tbl_lnkservicetohost | 0.66 |
| tbl_lnkservicetohostgroup | 0.00 |
| tbl_lnkservicetoservicegroup | 0.00 |
| tbl_lnkservicetoservicetemplate | 0.20 |
| tbl_lnkservicetovariabledefinition | 0.12 |
| tbl_lnktimeperiodtotimeperiod | 0.00 |
| tbl_logbook | 0.00 |
| tbl_mainmenu | 0.00 |
| tbl_permission | 0.02 |
| tbl_permission_inactive | 0.02 |
| tbl_service | 1.14 |
| tbl_servicedependency | 0.00 |
| tbl_serviceescalation | 0.00 |
| tbl_serviceextinfo | 0.00 |
| tbl_servicegroup | 0.01 |
| tbl_servicetemplate | 0.02 |
| tbl_session | 0.00 |
| tbl_session_locks | 0.00 |
| tbl_settings | 0.00 |
| tbl_submenu | 0.00 |
| tbl_timedefinition | 0.04 |
| tbl_timeperiod | 0.02 |
| tbl_user | 0.01 |
| tbl_variabledefinition | 0.26 |
| xi_auditlog | 14.85 |
| xi_auth_tokens | 0.03 |
| xi_cmp_trapdata | 0.03 |
| xi_cmp_trapdata_log | 0.03 |
| xi_commands | 0.01 |
| xi_eventqueue | 0.02 |
| xi_events | 0.19 |
| xi_incidents | 0.00 |
| xi_meta | 16.53 |
| xi_mibs | 0.05 |
| xi_options | 0.03 |
| xi_sessions | 0.03 |
| xi_sysstat | 0.01 |
| xi_usermeta | 4.34 |
| xi_users | 0.03 |
+--------------------------------------------+------------+

Re: Nagios Core and XI out of sync

Posted: Thu Feb 18, 2021 5:20 pm
by tgriep
What error did you get when you tried to increase the open file limit before?

The number of open files still has to be increased on the server. It is currently set to 100000, but it still needs to be increased.

There is an MRTG process that runs every 5 minutes to gather the bandwidth data for Nagios, and it has 147305 files.
Depending on how it runs, that may be exceeding the open file limit.

Edit the file /etc/security/limits.conf and define / update the following settings:

Code: Select all

#locked memory 
* hard memlock 128
* soft memlock 128

#open files 
* soft nofile 1000000
* hard nofile 1000000

#max user processes
* hard nproc 100000 
* soft nproc 100000 

#stack size
* hard stack 20480
* soft stack 20480
Once you have made the changes, save the file and restart the server to guarantee the changes are loaded and that all of the processes are restarted.
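Note that /etc/security/limits.conf is generally applied through PAM to login sessions, while services started by systemd take their limits from the unit's `LimitNOFILE` setting. Either way, it is worth confirming what a running process actually received. A minimal sketch, assuming a Linux /proc filesystem; `open_file_limits_of` is a hypothetical helper and the `mysqld` process name in the usage comment is an assumption:

```shell
# Hypothetical helper: print the soft and hard "Max open files" limits
# of a running process, read from /proc/<pid>/limits.
open_file_limits_of() {
    awk '/^Max open files/ { print $4, $5 }' "/proc/$1/limits"
}

# Example (process name "mysqld" is an assumption):
#   open_file_limits_of "$(pidof mysqld)"
open_file_limits_of "$$"    # limits of the current shell
```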


In the following folder are the config files for the MRTG process.

Code: Select all

/etc/mrtg/conf.d/
They are typically named after the IP address that you are polling the bandwidth data from.
If the device is no longer on your network, delete its .cfg file; that should help reduce the number of open files on the server.
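To see how many MRTG configuration files are present before and after the cleanup, a count along these lines may help. This is a sketch; `count_mrtg_cfgs` is a hypothetical helper.

```shell
# Hypothetical helper: count the .cfg files directly inside a directory.
count_mrtg_cfgs() {
    find "$1" -maxdepth 1 -type f -name '*.cfg' | wc -l
}

# Example:
#   count_mrtg_cfgs /etc/mrtg/conf.d
```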

Open a root shell on the Nagios server, run the following command, and post the output so we can see the maximum number of connections to the MySQL database, as that may need to be increased.

Code: Select all

mysql -u root -pnagiosxi -e "show global status like '%used_connections%'; show variables like 'max_connections';"

Re: Nagios Core and XI out of sync

Posted: Mon Feb 22, 2021 9:30 pm
by safuanmansor
Hi Tgriep,

1. The command output is as below.

+----------------------+-------+
| Variable_name | Value |
+----------------------+-------+
| Max_used_connections | 599 |
+----------------------+-------+
+-----------------+-------+
| Variable_name | Value |
+-----------------+-------+
| max_connections | 50000 |
+-----------------+-------+

2. The activity to reconfigure the limit is done. There was no error when trying to increase the open file limit. We just want to understand what value the open files limit actually needs to be set to. It can't just be plucked from the sky, right?

3. The activity to reconfigure the limit is scheduled.

4.
tgriep wrote:There is an MRTG process that runs every 5 minutes to gather the bandwidth data for Nagios, and it has 147305 files.
Depending on how it runs, that may be exceeding the open file limit.
- We noticed that when we add interfaces, the wizard adds all of the interfaces at the backend even though we select only a few out of hundreds of interfaces per switch (the frontend GUI shows the monitored ones correctly). Is this intended to work that way? I suspect this also contributes to the open files hitting the limit.

Thanks,
Safuan

Re: Nagios Core and XI out of sync

Posted: Wed Feb 24, 2021 10:32 am
by ssax
To increase performance I would reduce the size of these tables:

Code: Select all

| nagios_commenthistory | 2730.27 |
| nagios_logentries | 10719.98 |
| nagios_notifications | 3057.69 |
Then go to Admin > Performance Settings > Databases and set ALL THREE Optimize Intervals to 300 and click Update Settings.

FAQ: Can I truncate the tables first before proceeding with database repair (if I have crashed tables)?

You can truncate before repairing the DB, it's up to you. If you want to back it up first, you'll need to repair it. If you don't care, or already have a backup, truncate it first as it will speed up the DB repair process.

NOTE: You may need to adjust the -h 127.0.0.1, the -uroot, and -pnagiosxi in the commands if your DB is offloaded to another server and/or you've changed the root mysql password

If you don't care about the data, or already have a backup, you can just truncate the tables which will essentially drop and recreate the table with zero data in it (removing all historical data for the respective reports):

nagios_logentries - Impacts Event Log report length

Code: Select all

mysql -uroot -pnagiosxi -h 127.0.0.1 -B nagios -e 'TRUNCATE TABLE nagios_logentries;'
nagios_statehistory - Impacts the State History report length

Code: Select all

mysql -uroot -pnagiosxi -h 127.0.0.1 -B nagios -e 'TRUNCATE TABLE nagios_statehistory;'
nagios_notifications - Impacts the Notifications report length

Code: Select all

mysql -uroot -pnagiosxi -h 127.0.0.1 -B nagios -e 'TRUNCATE TABLE nagios_notifications;'
nagios_commenthistory - Impacts the Comment History age

Code: Select all

mysql -uroot -pnagiosxi -h 127.0.0.1 -B nagios -e 'TRUNCATE TABLE nagios_commenthistory;'

These should work to clean up the DB tables manually, provided the tables aren't crashed. If they are crashed, you will need to repair the database FIRST in order to run these queries:

nagios_logentries - Impacts Event Log report length

Code: Select all

mysql -uroot -pnagiosxi -h 127.0.0.1 -B nagios -e 'DELETE FROM nagios_logentries WHERE logentry_time <= (NOW() - INTERVAL 6 MONTH);'
nagios_statehistory - Impacts the State History report length

Code: Select all

mysql -uroot -pnagiosxi -h 127.0.0.1 -B nagios -e 'DELETE FROM nagios_statehistory WHERE state_time <= (NOW() - INTERVAL 6 MONTH);'
nagios_notifications - Impacts the Notifications report length

Code: Select all

mysql -uroot -pnagiosxi -h 127.0.0.1 -B nagios -e 'DELETE FROM nagios_notifications WHERE start_time <= (NOW() - INTERVAL 6 MONTH);'
nagios_commenthistory - Impacts the Comment history age

Code: Select all

mysql -uroot -pnagiosxi -h 127.0.0.1 -B nagios -e 'DELETE FROM nagios_commenthistory WHERE entry_time <= (NOW() - INTERVAL 6 MONTH);'
Then go to the Admin > Performance Settings > Databases tab and adjust ALL of the retention intervals to match your data retention policy. These settings control the retention on those DB tables and will keep them cleaned up.
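On tables as large as those listed earlier, a single DELETE can run for a long time. One option, not mentioned in the posts above, is to delete in batches using DELETE ... LIMIT, which MySQL and MariaDB support for single-table deletes. The sketch below only builds the statement and runs nothing against the database; `build_purge_sql` is a hypothetical helper and the default batch size of 10000 is arbitrary.

```shell
# Hypothetical helper: print a batched DELETE for one history table.
# Arguments: table, time column, months to keep, optional batch size.
build_purge_sql() {
    local table="$1" column="$2" months="$3" batch="${4:-10000}"
    printf 'DELETE FROM %s WHERE %s <= (NOW() - INTERVAL %d MONTH) LIMIT %d;\n' \
        "$table" "$column" "$months" "$batch"
}

# Example: pipe into mysql and re-run until no rows older than the cutoff remain.
#   build_purge_sql nagios_logentries logentry_time 6 | mysql -uroot -pnagiosxi -h 127.0.0.1 nagios
build_purge_sql nagios_logentries logentry_time 6
```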

I would lower them to the smallest possible level and utilize the XI backup/restore process and the Admin > Scheduled Backups process to offload the backups to another server. Since these XI backups contain database backups you can spin them up to grab the data and report on them if needed.

2. I think MRTG is going to be the biggest culprit for this, but when you have MRTG running, Nagios running checks, and all the other processes running, they add up. Given that you have 147,000 MRTG configs, and depending on the total number of ports each has, it can add up quickly. I think 1000000 is a good number, but a lofty goal for a single XI system.

4. That is the way it was designed. It adds all of the ports that aren't administratively down. The only way around that would be to comment out the ones you don't want to monitor in the /etc/mrtg/conf.d files.

Given the size of your system, you're likely getting close to needing extreme mitigation tactics such as implementing mod_gearman to offload the processing of the checks (and the open files for those checks), so that the XI server has the capacity to handle the things it needs to. I don't have access to your profile, so I'm unsure whether you're already running it. You can only do so much on a single system.