RLIMIT_NPROC issue
Posted: Tue Sep 07, 2021 2:16 pm
by hbouma
We are seeing the following error in our logs:
WARNING: RLIMIT_NPROC is 63444, total max estimated processes is 70538! You should increase your limits (ulimit -u, or limits.conf)
This was discovered while investigating why the Nagios instance was not applying commands sent to the Nagios Command File.
Nagios XI 5.8.3 running on RHEL 7.9 64-bit VMs.
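For reference, the size of the gap behind that warning can be read straight off the log line. A minimal sketch, assuming the message format shown above; the function name nproc_shortfall is made up for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helper: pull the two numbers out of a Nagios
# RLIMIT_NPROC warning and print how far the estimate exceeds the limit.
nproc_shortfall() {
    # $1 is one log line in the format quoted in the post above.
    echo "$1" |
        sed -n 's/.*RLIMIT_NPROC is \([0-9]*\), total max estimated processes is \([0-9]*\)!.*/\1 \2/p' |
        awk '{ print $2 - $1 }'
}

# prints 7094
nproc_shortfall 'WARNING: RLIMIT_NPROC is 63444, total max estimated processes is 70538! You should increase your limits (ulimit -u, or limits.conf)'
```

A line that does not match the expected format produces no output, so the helper can be run over a whole log without extra filtering.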
Re: RLIMIT_NPROC issue
Posted: Wed Sep 08, 2021 9:30 am
by pbroste
Hello @hbouma
Thanks for reaching out. Let's start by looking at the current environment values related to limits. Ultimately, we want to address why commands are not being applied.
First, let's run through this (note: these commands are for RHEL/CentOS and may differ on Debian):
Code: Select all
systemctl stop crond
systemctl stop npcd
systemctl stop nagios
systemctl stop ndo2db
pkill -9 -u nagios
for i in $(ipcs -q | grep nagios |awk '{print $2}'); do ipcrm -q $i; done
rm -rf /usr/local/nagiosxi/var/dbmaint.lock
rm -rf /usr/local/nagiosxi/var/event_handler.lock
rm -rf /usr/local/nagiosxi/scripts/reconfigure_nagios.lock
systemctl restart mariadb
systemctl start ndo2db
systemctl start nagios
systemctl start npcd
systemctl start crond
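Before running the ipcrm loop above, it can help to see which message queues it would remove. A minimal sketch, assuming the usual `ipcs -q` column order (key, msqid, owner, ...); the function name nagios_queue_ids is made up for illustration. It matches the owner column exactly, which is slightly stricter than the `grep nagios` in the block above:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `ipcs -q` output on stdin and print the
# IDs of message queues whose owner column is exactly "nagios".
nagios_queue_ids() {
    awk '$3 == "nagios" { print $2 }'
}

# Dry run: list the queue IDs without removing anything.
#   ipcs -q | nagios_queue_ids
```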
Then let's retrieve info on the database and also send the System Profile to us:
Run this and provide the results:
Code: Select all
echo "SELECT table_name AS 'Table', round(((data_length + index_length) / 1024 / 1024), 2) 'Size in MB' FROM information_schema.TABLES WHERE table_schema IN ('nagios', 'nagiosql', 'nagiosxi');" | mysql -h 127.0.0.1 -uroot -pnagiosxi --table
Please PM your updated system profile for us to review.
To send us your system profile:
- Login to the Nagios XI GUI using a web browser.
- Click the "Admin" > "System Profile" Menu
- Click the "Download Profile" button
- Save the profile.zip file and send via Private Message
Please review options to increase the kernel message queue settings and the max connections for the database, then restart and let me know what kind of improvement you see.
1. To increase the kernel message queue settings, follow the steps in the kb article below:
2. To increase the max db connections, follow this guide:
Thanks,
Perry
Re: RLIMIT_NPROC issue
Posted: Wed Sep 08, 2021 9:50 am
by hbouma
The profile and the output of the DB command have been sent through a private message. A reboot of the server does allow it to start responding to the command file again.
We do have an offloaded database on this server.
my.cnf settings from the DB server:
Code: Select all
[mysqld]
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
bind-address=XXX.XXX.XXX.XXX
port=XXXX
query_cache_size=6M
query_cache_limit=4M
tmp_table_size=64M
max_heap_table_size=64M
key_buffer_size=32M
table_open_cache=32
thread_cache_size = 16
#tmpdir=/var/lib/mysql
# Disabling symbolic-links is recommended to prevent assorted security risks
symbolic-links=0
# Settings user and group are ignored when systemd is used.
# If you need to run mysqld under a different user or group,
# customize your systemd unit file for mariadb according to the
# instructions in http://fedoraproject.org/wiki/Systemd
max_connections=818
[mysqld_safe]
log-error=/var/log/mariadb/mariadb.log
pid-file=/var/run/mariadb/mariadb.pid
#
# include all files from the config directory
#
!includedir /etc/my.cnf.d
/etc/sysctl.conf file:
Code: Select all
# sysctl settings are defined through files in
# /usr/lib/sysctl.d/, /run/sysctl.d/, and /etc/sysctl.d/.
#
# Vendors settings live in /usr/lib/sysctl.d/.
# To override a whole file, create a new file with the same in
# /etc/sysctl.d/ and put new settings there. To override
# only specific settings, add a file with a lexically later
# name in /etc/sysctl.d/ and put new settings there.
#
# For more information, see sysctl.conf(5) and sysctl.d(5).
net.ipv6.conf.default.accept_redirects=0
net.ipv4.icmp_echo_ignore_broadcasts=1
net.ipv4.conf.default.secure_redirects=0
net.ipv4.conf.all.secure_redirects=0
net.ipv4.conf.all.rp_filter=1
net.ipv4.conf.all.accept_redirects=0
net.ipv6.conf.all.accept_redirects=0
net.ipv4.tcp_timestamps=0
net.ipv4.conf.all.send_redirects=0
net.ipv4.icmp_ignore_bogus_error_responses=1
net.ipv4.conf.all.accept_source_route=0
net.ipv4.conf.default.accept_redirects=0
net.ipv4.conf.default.send_redirects=0
net.ipv4.conf.default.log_martians=1
net.ipv4.ip_forward=0
net.ipv4.tcp_syncookies=1
net.ipv4.conf.all.log_martians=1
net.ipv6.conf.all.disable_ipv6=1
net.ipv6.conf.default.disable_ipv6=1
kernel.randomize_va_space=2
net.ipv4.conf.default.rp_filter=1
net.ipv4.conf.default.accept_source_route=0
net.ipv6.conf.all.accept_ra=0
net.ipv6.conf.default.accept_ra=0
kernel.msgmnb = 262144000
kernel.msgmax = 262144000
kernel.shmmax = 4294967295
kernel.shmall = 268435456
#kernel.msgmni = 512000
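In the file above, kernel.msgmni is commented out, so only msgmnb and msgmax are set there. A minimal sketch for listing the message queue settings that are actually active in a sysctl-style file; the function name active_msg_settings is made up for illustration, and whether the commented-out value matters here is not established in this thread:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the uncommented kernel.msg* lines
# from a sysctl-style file given as $1.
active_msg_settings() {
    grep -E '^[[:space:]]*kernel\.msg(mnb|max|mni)[[:space:]]*=' "$1"
}

# Usage:
#   active_msg_settings /etc/sysctl.conf
# Compare with the values the running kernel is using:
#   sysctl kernel.msgmnb kernel.msgmax kernel.msgmni
```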
Re: RLIMIT_NPROC issue
Posted: Wed Sep 08, 2021 1:01 pm
by pbroste
Hello @hbouma
Thanks for sending over the System Profile so quickly.
Before we tackle the 'RLIMIT_NPROC' issue, we want to correct the two duplicate service definitions that we see.
1.
Warning: Duplicate definition found for service 'UAT10_MNClaims_Online_APP' on host 'cwladminuat10' (config file '/usr/local/nagios/etc/services/cwladminuat10.cfg'
Please run the following and look through the output to find the duplicate:
Code: Select all
grep -Eir 'UAT10_MNClaims_Online_APP' -A 15 -B 5 --color=always /usr/local/nagios/etc/ | less -SR
Please make the update in the web console > Core Configuration Manager > [Services]: remove the duplicate and then Apply Configuration.
2.
Warning: Duplicate definition found for service 'EP Process Check' on host 'cepbepre507' (config file '/usr/local/nagios/etc/services/cepbepre507.cfg
Code: Select all
grep -Eir 'EP Process Check' -A 15 -B 5 --color=always /usr/local/nagios/etc/ | less -SR
Please make the update in the web console > Core Configuration Manager > [Services]: remove the duplicate and then Apply Configuration.
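As an alternative to reading through the grep output by eye, duplicates can be listed directly. A rough sketch, assuming each service is written as a `define service {` block with one host_name and one service_description line and the opening brace on the same line as `define service`; the function name find_duplicate_services is made up for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "host: description" for every
# host/service pair that is defined more than once in the given files.
find_duplicate_services() {
    awk '
        /^[ \t]*define[ \t]+service[ \t]*[{]/ { in_svc = 1; host = ""; desc = ""; next }
        in_svc && $1 == "host_name"           { $1 = ""; sub(/^[ \t]+/, ""); host = $0; next }
        in_svc && $1 == "service_description" { $1 = ""; sub(/^[ \t]+/, ""); desc = $0; next }
        in_svc && /^[ \t]*[}]/ {
            # Report a pair once, the second time it is seen.
            if (++seen[host SUBSEP desc] == 2) print host ": " desc
            in_svc = 0
        }
    ' "$@"
}

# Usage:
#   find_duplicate_services /usr/local/nagios/etc/services/*.cfg
```

Definitions that list several hosts on one host_name line, or that attach to hostgroups, are not expanded by this sketch.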
Bounce the nagios service:
Verify CCM config passes:
Code: Select all
/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
Check to see if you are still receiving any 'RLIMIT_NPROC' messages; if so, please follow up with an updated System Profile.
Thanks,
Perry
Re: RLIMIT_NPROC issue
Posted: Wed Sep 08, 2021 1:21 pm
by hbouma
Changes made and the check gives no warnings or errors.
Code: Select all
/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
Nagios Core 4.4.6
Copyright (c) 2009-present Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad
Last Modified: 2020-04-28
License: GPL
Website: https://www.nagios.org
Reading configuration data...
Read main config file okay...
Read object config files okay...
Running pre-flight check on configuration data...
Checking objects...
Checked 7364 services.
Checked 731 hosts.
Checked 177 host groups.
Checked 93 service groups.
Checked 520 contacts.
Checked 103 contact groups.
Checked 175 commands.
Checked 530 time periods.
Checked 0 host escalations.
Checked 0 service escalations.
Checking for circular paths...
Checked 731 hosts
Checked 0 service dependencies
Checked 0 host dependencies
Checked 530 timeperiods
Checking global event handlers...
Checking obsessive compulsive processor commands...
Checking misc settings...
Total Warnings: 0
Total Errors: 0
Things look okay - No serious problems were detected during the pre-flight check
We are still seeing the RLIMIT_NPROC warning:
[1631125154] WARNING: RLIMIT_NPROC is 63444, total max estimated processes is 70554! You should increase your limits (ulimit -u, or limits.conf)
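The limit that matters is the one the running process has, which is not necessarily what `ulimit -u` reports in a login shell: limits.conf is applied through PAM at login, so a daemon started by systemd may not pick it up. A minimal sketch for reading the soft limit from a /proc limits file; the function name max_processes_soft is made up for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the soft "Max processes" limit from a
# /proc/<pid>/limits file given as $1.
max_processes_soft() {
    awk '/^Max processes/ { print $3 }' "$1"
}

# Usage (assumes the daemon's process name is exactly "nagios"):
#   max_processes_soft "/proc/$(pgrep -o -x nagios)/limits"
```

If this prints 63444 for the Nagios process, it matches the figure in the warning.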
Re: RLIMIT_NPROC issue
Posted: Wed Sep 08, 2021 5:04 pm
by pbroste
Hello @hbouma
Please check whether Deadpool Settings are enabled. You can find them in the web console > Admin > Monitoring section > Deadpool Settings.
Thanks,
Perry
Re: RLIMIT_NPROC issue
Posted: Thu Sep 09, 2021 6:48 am
by hbouma
Deadpool settings are not enabled.
Re: RLIMIT_NPROC issue
Posted: Thu Sep 09, 2021 3:30 pm
by pbroste
Hello @hbouma
Thanks for following up. The perplexing part is that my research turns up differing theories about what is going on and how to resolve it: some threads state that this message is basically noise and can be disregarded, while others describe ways to increase the ulimit in the systemd config.
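For the systemd route, one common pattern is a drop-in file that sets LimitNPROC for the unit. A minimal sketch, not a confirmed fix for this case; the function name write_nproc_dropin, the drop-in file name limits.conf, and the value 100000 are assumptions for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a systemd drop-in that sets LimitNPROC.
#   $1: drop-in directory for the unit
#   $2: new process limit
write_nproc_dropin() {
    mkdir -p "$1" &&
        printf '[Service]\nLimitNPROC=%s\n' "$2" > "$1/limits.conf"
}

# Usage as root (unit name and value are assumptions):
#   write_nproc_dropin /etc/systemd/system/nagios.service.d 100000
#   systemctl daemon-reload && systemctl restart nagios
```

The value would need to exceed the "total max estimated processes" figure from the warning for the message to stop.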
That leaves us with the next approach: add the following settings to 'my.cnf' on your DB server, then reload the database service and nagios.service as well.
Code: Select all
max_connections=1000
open_files_limit = 4096
Wait for a while and grab the Nagios event logs so we can see what is going on there:
Code: Select all
tar -czvf /tmp/events.tar.gz /usr/local/nagiosxi/var/*.log
Would you please send the events.tar.gz to me via private message?
Thanks,
Perry
Re: RLIMIT_NPROC issue
Posted: Fri Sep 10, 2021 6:59 am
by hbouma
Events.tar.gz will be sent in a PM.
I made the changes and cycled both MariaDB and Nagios, and multiple errors then showed up in the database. I ran the database repair again (probably the fourth time in the past week), fully cycled Nagios XI, and then rebooted the Nagios XI server before things cleared up.
Looking at the MariaDB logs, I find this started after making the change and restarting MariaDB on the offloaded server:
210910 7:31:15 [ERROR] mysqld: Table './nagios/nagios_logentries' is marked as crashed and last (automatic?) repair failed
210910 7:31:48 [ERROR] mysqld: Table './nagios/nagios_logentries' is marked as crashed and last (automatic?) repair failed
210910 7:35:56 [ERROR] mysqld: Table './nagios/nagios_logentries' is marked as crashed and last (automatic?) repair failed
210910 7:35:56 [ERROR] mysqld: Table './nagios/nagios_logentries' is marked as crashed and last (automatic?) repair failed
210910 7:35:56 [ERROR] mysqld: Table './nagios/nagios_logentries' is marked as crashed and last (automatic?) repair failed
210910 7:35:57 [ERROR] mysqld: Table './nagios/nagios_logentries' is marked as crashed and last (automatic?) repair failed
210910 7:35:57 [ERROR] mysqld: Table './nagios/nagios_logentries' is marked as crashed and last (automatic?) repair failed
210910 7:35:57 [ERROR] mysqld: Table './nagios/nagios_logentries' is marked as crashed and last (automatic?) repair failed
210910 7:35:57 [ERROR] mysqld: Table './nagios/nagios_logentries' is marked as crashed and last (automatic?) repair failed
210910 7:35:57 [ERROR] mysqld: Table './nagios/nagios_logentries' is marked as crashed and last (automatic?) repair failed
210910 7:35:57 [ERROR] mysqld: Table './nagios/nagios_logentries' is marked as crashed and last (automatic?) repair failed
210910 7:35:57 [ERROR] mysqld: Table './nagios/nagios_logentries' is marked as crashed and last (automatic?) repair failed
210910 7:35:57 [ERROR] mysqld: Table './nagios/nagios_logentries' is marked as crashed and last (automatic?) repair failed
210910 7:35:57 [ERROR] mysqld: Table './nagios/nagios_logentries' is marked as crashed and last (automatic?) repair failed
210910 7:35:57 [ERROR] mysqld: Table './nagios/nagios_logentries' is marked as crashed and last (automatic?) repair failed
210910 7:35:57 [ERROR] mysqld: Table './nagios/nagios_logentries' is marked as crashed and last (automatic?) repair failed
210910 7:35:57 [ERROR] mysqld: Table './nagios/nagios_logentries' is marked as crashed and last (automatic?) repair failed
210910 7:35:57 [ERROR] mysqld: Table './nagios/nagios_logentries' is marked as crashed and last (automatic?) repair failed
......
210910 7:40:21 [Note] Found 5791662 of 3913273 rows when repairing './nagios/nagios_logentries'
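To see at a glance which tables are affected and how often, the repeated log entries can be summarized. A minimal sketch, assuming the log line format shown above; the function name crashed_tables is made up for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helper: count "marked as crashed" errors per table
# in the MariaDB error log given as $1, most frequent first.
crashed_tables() {
    sed -n "s/.*Table '\([^']*\)' is marked as crashed.*/\1/p" "$1" |
        sort | uniq -c | sort -rn
}

# Usage on the DB server (log path taken from the my.cnf posted earlier):
#   crashed_tables /var/log/mariadb/mariadb.log
```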
Re: RLIMIT_NPROC issue
Posted: Fri Sep 10, 2021 11:21 am
by pbroste
Hello @hbouma
Thanks for following up. In the eventman logs we see "MySQL server has gone away" messages, which indicate that the database connection appears to be dropping. We see instances at 7:20, 7:30, 7:44, and 7:54.
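To track whether these disconnects continue after a change, the messages can be counted per log file. A minimal sketch; the function name gone_away_counts is made up for illustration, and the log location is the one used for events.tar.gz earlier in this thread:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "file:count" for the number of
# "MySQL server has gone away" lines in each file given.
gone_away_counts() {
    grep -cH -F 'MySQL server has gone away' "$@"
}

# Usage:
#   gone_away_counts /usr/local/nagiosxi/var/*.log
```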
Let's see if increasing the numbers and adding the following in my.cnf will resolve the issue.
Code: Select all
max_connections=10000
max_allowed_packet=64M
Bounce the database service on the DB server and then the nagios.service on the Nagios XI server.
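After the restart, it may be worth confirming that the server actually picked up the new values, since the database can start with a lower max_connections than configured if its open-file limit is too small. A minimal sketch that only builds the query; the function name show_limits_sql is made up for illustration, and the host and user in the usage line are placeholders:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a query that shows the effective values
# of the settings changed in my.cnf in this thread.
show_limits_sql() {
    printf "SHOW GLOBAL VARIABLES WHERE Variable_name IN ('max_connections', 'max_allowed_packet', 'open_files_limit');\n"
}

# Usage (replace the placeholders with your DB host and user):
#   show_limits_sql | mysql -h <db-host> -u <user> -p --table
```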
Thanks,
Perry