RLIMIT_NPROC issue

hbouma
Posts: 483
Joined: Tue Feb 27, 2018 9:31 am

RLIMIT_NPROC issue

Post by hbouma »

We are seeing the following error in our logs:

WARNING: RLIMIT_NPROC is 63444, total max estimated processes is 70538! You should increase your limits (ulimit -u, or limits.conf)

This was discovered while investigating why the Nagios instance was not applying commands sent to the Nagios Command File.

Nagios XI 5.8.3 running on RHEL 7.9 64-bit VMs.
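
For reference, the gap between the two numbers in that warning can be pulled out with a small shell helper. This is only a sketch: the function name nproc_shortfall is hypothetical, and it assumes the message format quoted above.

```shell
# Hypothetical helper: print by how many processes the estimate exceeds the limit.
# Assumes the warning format quoted in this post.
nproc_shortfall() {
    # $1: one RLIMIT_NPROC warning line from the Nagios log
    nps_limit=$(printf '%s\n' "$1" | sed -n 's/.*RLIMIT_NPROC is \([0-9][0-9]*\),.*/\1/p')
    nps_estimate=$(printf '%s\n' "$1" | sed -n 's/.*estimated processes is \([0-9][0-9]*\).*/\1/p')
    # Fail if the line does not look like the expected warning.
    [ -n "$nps_limit" ] && [ -n "$nps_estimate" ] || return 1
    echo $(( nps_estimate - nps_limit ))
}

nproc_shortfall 'WARNING: RLIMIT_NPROC is 63444, total max estimated processes is 70538! You should increase your limits (ulimit -u, or limits.conf)'
```

For the line above this prints 7094, so the limit would need to rise by at least that much before the warning stops.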
pbroste
Posts: 1288
Joined: Tue Jun 01, 2021 1:27 pm

Re: RLIMIT_NPROC issue

Post by pbroste »

Hello @hbouma

Thanks for reaching out. We want to start by looking at the current environment values related to limits. Ultimately, we want to address why commands are not being applied.

First, let's run through this (note: these commands are for RHEL/CentOS and may differ on Debian):

Code:

# Stop cron and the Nagios services
systemctl stop crond
systemctl stop npcd
systemctl stop nagios
systemctl stop ndo2db
# Kill any remaining processes owned by the nagios user
pkill -9 -u nagios
# Remove leftover kernel message queues owned by nagios
for i in $(ipcs -q | grep nagios | awk '{print $2}'); do ipcrm -q "$i"; done
# Clear stale lock files
rm -rf /usr/local/nagiosxi/var/dbmaint.lock
rm -rf /usr/local/nagiosxi/var/event_handler.lock
rm -rf /usr/local/nagiosxi/scripts/reconfigure_nagios.lock
# Restart the database, then bring the services back up
systemctl restart mariadb
systemctl start ndo2db
systemctl start nagios
systemctl start npcd
systemctl start crond

Then let's retrieve info on the database and also have you send the System Profile to us:

Run this and provide the results:

Code:

echo "SELECT table_name AS 'Table', round(((data_length + index_length) / 1024 / 1024), 2) 'Size in MB' FROM information_schema.TABLES WHERE table_schema IN ('nagios', 'nagiosql', 'nagiosxi');" | mysql -h 127.0.0.1 -uroot -pnagiosxi --table

Please PM your updated System Profile for us to review. To send it:
  • Log in to the Nagios XI GUI using a web browser.
  • Click the "Admin" > "System Profile" menu.
  • Click the "Download Profile" button.
  • Save the profile.zip file and send it via private message.

Please review the options to increase the kernel message queue settings and the max connections for the database, then restart and let me know what kind of improvement you see.

1. To increase the kernel message queue settings, follow the steps in the KB article below:
2. To increase the max DB connections, follow this guide:

Thanks,
Perry
hbouma
Posts: 483
Joined: Tue Feb 27, 2018 9:31 am

Re: RLIMIT_NPROC issue

Post by hbouma »

The profile and the output of the DB command have been sent by private message. Rebooting the server does get it responding to the command file again.

We do have an offloaded database on this server.

my.cnf settings from the DB server:

Code:

[mysqld]
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
bind-address=XXX.XXX.XXX.XXX
port=XXXX
query_cache_size=6M
query_cache_limit=4M
tmp_table_size=64M
max_heap_table_size=64M
key_buffer_size=32M
table_open_cache=32
thread_cache_size = 16
#tmpdir=/var/lib/mysql

# Disabling symbolic-links is recommended to prevent assorted security risks
symbolic-links=0
# Settings user and group are ignored when systemd is used.
# If you need to run mysqld under a different user or group,
# customize your systemd unit file for mariadb according to the
# instructions in http://fedoraproject.org/wiki/Systemd
max_connections=818

[mysqld_safe]
log-error=/var/log/mariadb/mariadb.log
pid-file=/var/run/mariadb/mariadb.pid

#
# include all files from the config directory
#
!includedir /etc/my.cnf.d

/etc/sysctl.conf file:

Code:

# sysctl settings are defined through files in
# /usr/lib/sysctl.d/, /run/sysctl.d/, and /etc/sysctl.d/.
#
# Vendors settings live in /usr/lib/sysctl.d/.
# To override a whole file, create a new file with the same in
# /etc/sysctl.d/ and put new settings there. To override
# only specific settings, add a file with a lexically later
# name in /etc/sysctl.d/ and put new settings there.
#
# For more information, see sysctl.conf(5) and sysctl.d(5).
net.ipv6.conf.default.accept_redirects=0
net.ipv4.icmp_echo_ignore_broadcasts=1
net.ipv4.conf.default.secure_redirects=0
net.ipv4.conf.all.secure_redirects=0
net.ipv4.conf.all.rp_filter=1
net.ipv4.conf.all.accept_redirects=0
net.ipv6.conf.all.accept_redirects=0
net.ipv4.tcp_timestamps=0
net.ipv4.conf.all.send_redirects=0
net.ipv4.icmp_ignore_bogus_error_responses=1
net.ipv4.conf.all.accept_source_route=0
net.ipv4.conf.default.accept_redirects=0
net.ipv4.conf.default.send_redirects=0
net.ipv4.conf.default.log_martians=1
net.ipv4.ip_forward=0
net.ipv4.tcp_syncookies=1
net.ipv4.conf.all.log_martians=1
net.ipv6.conf.all.disable_ipv6=1
net.ipv6.conf.default.disable_ipv6=1
kernel.randomize_va_space=2
net.ipv4.conf.default.rp_filter=1
net.ipv4.conf.default.accept_source_route=0
net.ipv6.conf.all.accept_ra=0
net.ipv6.conf.default.accept_ra=0
kernel.msgmnb = 262144000
kernel.msgmax = 262144000
kernel.shmmax = 4294967295
kernel.shmall = 268435456
#kernel.msgmni = 512000
pbroste
Posts: 1288
Joined: Tue Jun 01, 2021 1:27 pm

Re: RLIMIT_NPROC issue

Post by pbroste »

Hello @hbouma

Thanks for sending over the System Profile so quickly.

Before we tackle the 'RLIMIT_NPROC' issue, we want to correct the two duplicate service definitions that we see.

1.
Warning: Duplicate definition found for service 'UAT10_MNClaims_Online_APP' on host 'cwladminuat10' (config file '/usr/local/nagios/etc/services/cwladminuat10.cfg'

Please run the following and look through the output to find the duplicate:

Code:

grep -Eir 'UAT10_MNClaims_Online_APP' -A 15 -B 5 --color=always /usr/local/nagios/etc/ | less -SR

Please make the update in the web console under Core Configuration Manager > [Services]: remove the duplicate, then Apply Configuration.

2.
Warning: Duplicate definition found for service 'EP Process Check' on host 'cepbepre507' (config file '/usr/local/nagios/etc/services/cepbepre507.cfg

Code:

grep -Eir 'EP Process Check' -A 15 -B 5 --color=always /usr/local/nagios/etc/ | less -SR

Please make the update in the web console under Core Configuration Manager > [Services]: remove the duplicate, then Apply Configuration.
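
If paging through the grep output is tedious, a per-file count can narrow things down. The sketch below is only an illustration: find_duplicate_services is a hypothetical name, and it assumes each file under /usr/local/nagios/etc/services/ holds the services of a single host, so a repeated service_description within one file means a duplicate.

```shell
# Hypothetical helper: list service_description values that occur more than
# once in a single Nagios object file.
# Assumption: the file holds the services of one host only.
find_duplicate_services() {
    # $1: path to a services .cfg file
    awk '$1 == "service_description" {
             $1 = ""                 # drop the directive name
             sub(/^[ \t]+/, "")      # trim the gap left in front of the value
             seen[$0]++
         }
         END {
             for (name in seen)
                 if (seen[name] > 1) print name
         }' "$1"
}

# Example run against one of the files named in the warnings:
# find_duplicate_services /usr/local/nagios/etc/services/cepbepre507.cfg
```

Any name it prints is defined more than once in that file; the actual removal should still go through Core Configuration Manager so the change is not overwritten on the next apply.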

Bounce the nagios service:

Code:

systemctl restart nagios

Verify that the CCM config passes:

Code:

/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg

Check whether you are still receiving any 'RLIMIT_NPROC' messages; if so, please follow up with an updated System Profile.

Thanks,
Perry
hbouma
Posts: 483
Joined: Tue Feb 27, 2018 9:31 am

Re: RLIMIT_NPROC issue

Post by hbouma »

Changes made and the check gives no warnings or errors.

Code:

/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg

Nagios Core 4.4.6
Copyright (c) 2009-present Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad
Last Modified: 2020-04-28
License: GPL

Website: https://www.nagios.org
Reading configuration data...
   Read main config file okay...
   Read object config files okay...

Running pre-flight check on configuration data...

Checking objects...
        Checked 7364 services.
        Checked 731 hosts.
        Checked 177 host groups.
        Checked 93 service groups.
        Checked 520 contacts.
        Checked 103 contact groups.
        Checked 175 commands.
        Checked 530 time periods.
        Checked 0 host escalations.
        Checked 0 service escalations.
Checking for circular paths...
        Checked 731 hosts
        Checked 0 service dependencies
        Checked 0 host dependencies
        Checked 530 timeperiods
Checking global event handlers...
Checking obsessive compulsive processor commands...
Checking misc settings...

Total Warnings: 0
Total Errors:   0

Things look okay - No serious problems were detected during the pre-flight check

We are still seeing the RLIMIT_NPROC warning:
[1631125154] WARNING: RLIMIT_NPROC is 63444, total max estimated processes is 70554! You should increase your limits (ulimit -u, or limits.conf)
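
To see which limit the running daemon actually has, as opposed to what `ulimit -u` reports in a login shell, the process's own limits file under /proc can be read. A minimal sketch follows; soft_nproc_limit is a hypothetical name, and the pgrep call in the usage comment assumes the daemon runs as user nagios with a process name matching "nagios".

```shell
# Hypothetical helper: print the soft "Max processes" limit from a
# /proc/<pid>/limits style file.
soft_nproc_limit() {
    # $1: path to a limits file, e.g. /proc/1234/limits
    # Columns in that file: limit name, soft limit, hard limit, units.
    awk '$1 == "Max" && $2 == "processes" { print $3 }' "$1"
}

# Assumed usage on the Nagios XI server (oldest nagios process owned by nagios):
# soft_nproc_limit "/proc/$(pgrep -o -u nagios nagios)/limits"
```

If the value printed matches the 63444 in the warning, that is the limit Nagios compares against its estimate.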
pbroste
Posts: 1288
Joined: Tue Jun 01, 2021 1:27 pm

Re: RLIMIT_NPROC issue

Post by pbroste »

Hello @hbouma

Please check whether Deadpool Settings are enabled. You can find them in the web console under Admin > Monitoring section > Deadpool Settings.

Thanks,
Perry
hbouma
Posts: 483
Joined: Tue Feb 27, 2018 9:31 am

Re: RLIMIT_NPROC issue

Post by hbouma »

Deadpool settings are not enabled.
pbroste
Posts: 1288
Joined: Tue Jun 01, 2021 1:27 pm

Re: RLIMIT_NPROC issue

Post by pbroste »

Hello @hbouma

Thanks for following up. The perplexing part is that I am finding conflicting theories about what is going on and how to resolve it. Some threads state that this message is basically noise and can be disregarded, while others describe ways to increase the ulimit in the systemd config.
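
For reference, the systemd route usually means a drop-in file that sets LimitNPROC for the unit, since limits.conf is applied through PAM sessions and generally does not reach services started by systemd. The sketch below is untested against this environment: write_nproc_dropin and the file name limit-nproc.conf are hypothetical, and the unit name nagios.service and the value 80000 in the usage comment are assumptions.

```shell
# Hypothetical helper: write a systemd drop-in that sets LimitNPROC.
write_nproc_dropin() {
    # $1: drop-in directory, $2: new process limit
    mkdir -p "$1" || return 1
    printf '[Service]\nLimitNPROC=%s\n' "$2" > "$1/limit-nproc.conf"
}

# Assumed usage on the Nagios XI server (unit name and value are assumptions;
# the value only needs to exceed the estimate reported in the warning):
# write_nproc_dropin /etc/systemd/system/nagios.service.d 80000
# systemctl daemon-reload
# systemctl restart nagios
```

After the restart, the effective limit of the running process can be compared with the estimate in the log to confirm whether the warning should clear.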

That leaves us with the next approach: add the following settings to 'my.cnf' on your DB server, then restart the database service and nagios.service as well.

Code:

max_connections=1000
open_files_limit = 4096

Wait a while, then grab the Nagios event logs so we can see what is going on there:

Code:

tar -czvf /tmp/events.tar.gz /usr/local/nagiosxi/var/*.log

Please send the events.tar.gz to me via private message.

Thanks,
Perry
hbouma
Posts: 483
Joined: Tue Feb 27, 2018 9:31 am

Re: RLIMIT_NPROC issue

Post by hbouma »

Events.tar.gz will be sent in a PM.

I made the changes and cycled both MariaDB and Nagios, after which multiple errors showed up in the database. I ran the database repair again, probably for the fourth time in the past week, fully cycled Nagios XI, and then rebooted the Nagios XI server before things cleared up.

Looking at the MariaDB logs, I see that this started after making the change and restarting MariaDB on the offloaded server:
210910 7:31:15 [ERROR] mysqld: Table './nagios/nagios_logentries' is marked as crashed and last (automatic?) repair failed
210910 7:31:48 [ERROR] mysqld: Table './nagios/nagios_logentries' is marked as crashed and last (automatic?) repair failed
210910 7:35:56 [ERROR] mysqld: Table './nagios/nagios_logentries' is marked as crashed and last (automatic?) repair failed
210910 7:35:56 [ERROR] mysqld: Table './nagios/nagios_logentries' is marked as crashed and last (automatic?) repair failed
210910 7:35:56 [ERROR] mysqld: Table './nagios/nagios_logentries' is marked as crashed and last (automatic?) repair failed
210910 7:35:57 [ERROR] mysqld: Table './nagios/nagios_logentries' is marked as crashed and last (automatic?) repair failed
210910 7:35:57 [ERROR] mysqld: Table './nagios/nagios_logentries' is marked as crashed and last (automatic?) repair failed
210910 7:35:57 [ERROR] mysqld: Table './nagios/nagios_logentries' is marked as crashed and last (automatic?) repair failed
210910 7:35:57 [ERROR] mysqld: Table './nagios/nagios_logentries' is marked as crashed and last (automatic?) repair failed
210910 7:35:57 [ERROR] mysqld: Table './nagios/nagios_logentries' is marked as crashed and last (automatic?) repair failed
210910 7:35:57 [ERROR] mysqld: Table './nagios/nagios_logentries' is marked as crashed and last (automatic?) repair failed
210910 7:35:57 [ERROR] mysqld: Table './nagios/nagios_logentries' is marked as crashed and last (automatic?) repair failed
210910 7:35:57 [ERROR] mysqld: Table './nagios/nagios_logentries' is marked as crashed and last (automatic?) repair failed
210910 7:35:57 [ERROR] mysqld: Table './nagios/nagios_logentries' is marked as crashed and last (automatic?) repair failed
210910 7:35:57 [ERROR] mysqld: Table './nagios/nagios_logentries' is marked as crashed and last (automatic?) repair failed
210910 7:35:57 [ERROR] mysqld: Table './nagios/nagios_logentries' is marked as crashed and last (automatic?) repair failed
210910 7:35:57 [ERROR] mysqld: Table './nagios/nagios_logentries' is marked as crashed and last (automatic?) repair failed
210910 7:35:57 [ERROR] mysqld: Table './nagios/nagios_logentries' is marked as crashed and last (automatic?) repair failed
......
210910 7:40:21 [Note] Found 5791662 of 3913273 rows when repairing './nagios/nagios_logentries'
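
A quick way to summarize entries like these is to count the "marked as crashed" errors per table. This is a rough sketch: count_crashed_tables is a hypothetical name, it assumes the message format shown above, and the path in the usage comment is the log-error value from the my.cnf posted earlier.

```shell
# Hypothetical helper: count "marked as crashed" errors per table in a
# MariaDB error log. Assumes the message format shown in the log excerpt.
count_crashed_tables() {
    # $1: path to the MariaDB error log
    sed -n "s/.*Table '\([^']*\)' is marked as crashed.*/\1/p" "$1" |
        sort | uniq -c | awk '{ print $1, $2 }'
}

# Assumed usage on the DB server:
# count_crashed_tables /var/log/mariadb/mariadb.log
```

Each output line is a count followed by the table path, which makes it easy to tell whether nagios_logentries is the only table affected.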
pbroste
Posts: 1288
Joined: Tue Jun 01, 2021 1:27 pm

Re: RLIMIT_NPROC issue

Post by pbroste »

Hello @hbouma

Thanks for following up. In the eventman logs we see "MySQL server has gone away" messages, which indicate that the database connection is dropping. We see instances at 7:20, 7:30, 7:44, and 7:54.

Let's see if increasing the numbers and adding the following in my.cnf will resolve the issue.

Code:

max_connections=10000
max_allowed_packet=64M

Bounce the database service on the DB server and then the nagios.service on the Nagios XI server.

Thanks,
Perry