WARNING: RLIMIT_NPROC

junkertf · Post by **junkertf** » Mon Oct 07, 2019 2:28 am

Hello,

We had a strange issue last night (snippets from /var/log/messages below) and also lost performance data from services.

ndo2db: Error: max retries exceeded sending message to queue. Kernel queue parameters may need to be tuned. See README.

I found this article as a possible solution:
https://support.nagios.com/kb/article.php?id=139

So I tried raising kernel.msgmni to 640000, but then I got the error shown below, the same as in the linked forum thread. (I later also found this message in the logs from the time of the original issue.)
eventlog:
WARNING: RLIMIT_NPROC is 95692, total max estimated processes is 200806! You should increase your limits (ulimit -u, or limits.conf)

https://support.nagios.com/forum/viewto ... 16&t=54899

I tried to raise the nproc limit for the nagios user, without success:
/etc/security/limits.d/20-nproc.conf
nagios soft nproc unlimited
nagios hard nproc unlimited

cat /proc/8130/limits
Limit Soft Limit Hard Limit Units
...
Max processes 95692 95692 processes
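
A possible explanation, stated as an assumption rather than a confirmed diagnosis: files under /etc/security/limits.d are applied by pam_limits to PAM login sessions, while a daemon started by systemd takes its limits from the unit's LimitNPROC= setting, so the running nagios process may never see the new values. The sketch below only reads the limit a process actually has. `soft_nproc_limit` is a hypothetical helper name, and 8130 is the PID from the output above.

```shell
# Print the soft "Max processes" limit from a /proc/<pid>/limits style file.
# soft_nproc_limit is a hypothetical helper name, not part of Nagios XI.
soft_nproc_limit() {
    # Fields on the matching line: "Max" "processes" <soft> <hard> <units>
    awk '$1 == "Max" && $2 == "processes" { print $3 }' "$1"
}

# Example (assumes PID 8130 is still the nagios process):
#   soft_nproc_limit /proc/8130/limits
# The value systemd has configured for the unit can be checked with:
#   systemctl show nagios -p LimitNPROC
```

If the unit value turns out to be the limiting one, a drop-in setting LimitNPROC=, followed by systemctl daemon-reload and a restart of nagios, would be the usual systemd way to change it. That has not been tested against this system.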

The Deadpool settings are disabled.
Our settings and sizes are a bit different:

echo "SELECT table_schema as 'Database', table_name AS 'Table', round(((data_length + index_length) / 1024 / 1024), 2) 'Size in MB' FROM information_schema.TABLES ORDER BY (data_length + index_length) DESC;" |mysql -t -u root -pnagiosxi
+--------------------+----------------------------------------------+------------+
| Database | Table | Size in MB |
+--------------------+----------------------------------------------+------------+
| nagios | nagios_logentries | 10774.24 |
| nagios | nagios_commenthistory | 509.28 |
| nagios | nagios_statehistory | 480.31 |
| nagios | nagios_downtimehistory | 63.59 |
| nagios | nagios_notifications | 24.55 |
| nagios | nagios_servicestatus | 13.54 |
| nagiosxi | xi_meta | 9.80 |
| nagios | nagios_services | 5.58 |
| nagios | nagios_flappinghistory | 4.11 |
| nagios | nagios_objects | 3.44 |
| nagios | nagios_contactnotifications | 3.11 |
| nagios | nagios_contactnotificationmethods | 2.93 |
| nagios | nagios_externalcommands | 1.63 |
| nagiosxi | xi_cmp_trapdata_log | 1.53 |
| nagiosxi | xi_auditlog | 1.18 |

mysql -u root -pnagiosxi -e "show global status like '%used_connections%'; show variables like 'max_connections';"
+----------------------+-------+
| Variable_name | Value |
+----------------------+-------+
| Max_used_connections | 320 |
+----------------------+-------+
+-----------------+-------+
| Variable_name | Value |
+-----------------+-------+
| max_connections | 512 |
+-----------------+-------+

Apply config output:
APPLYING NAGIOSCORE CONFIG...
CMDLINE=cd /usr/local/nagiosxi/scripts && ./reconfigure_nagios.sh
No entry for terminal type "unknown";
using dumb terminal settings.

--- reset_config_perms.sh ------------
> Setting CCM script permissions
> Setting script permissions
> Setting special component script permissions
> Setting configuration file/directory permissions
> Setting perfdata directory and RRD permissions
/bin/chmod: cannot access ‘/usr/local/nagios/share/perfdata/MYHUSLHQBPAP013/Open_Files.xml.5113’: No such file or directory
/bin/chmod: cannot access ‘/usr/local/nagios/share/perfdata/MYHUSLHQBPAP272/Memory_Usage.xml.5113’: No such file or directory
> Setting NOM checkpoint user:group permissions
> + Setting CCM configuration file user:group permissions
> + Setting Recurring Downtime file user:group permissions
> + Setting BPI configuration file user:group permissions
--------------------------------------

--- ccm_import.php -------------------
> Setting import directory: /usr/local/nagios/etc/import/
> Importing config files into the CCM
No files to import
--------------------------------------

--- ccm_export.php -------------------
> Writing CCM configuration to Nagios files
Finished writing out configuraton
--------------------------------------

--------------------------------------
> Verifying configuration with Nagios Core
> Output:
Nagios Core 4.4.2
Copyright (c) 2009-present Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad
Last Modified: 2018-08-16
License: GPL

Website: https://www.nagios.org
Reading configuration data...
Read main config file okay...
Read object config files okay...

Running pre-flight check on configuration data...

Checking objects...
Checked 24002 services.
Checked 1459 hosts.
Checked 99 host groups.
Checked 21 service groups.
Checked 65 contacts.
Checked 14 contact groups.
Checked 139 commands.
Checked 70 time periods.
Checked 0 host escalations.
Checked 0 service escalations.
Checking for circular paths...
Checked 1459 hosts
Checked 0 service dependencies
Checked 0 host dependencies
Checked 70 timeperiods
Checking global event handlers...
Checking obsessive compulsive processor commands...
Checking misc settings...

Total Warnings: 0
Total Errors: 0

Things look okay - No serious problems were detected during the pre-flight check
> Return Code: 0
--------------------------------------
OUTPUT=--------------------------------------
RETURNCODE=0
PROCESSING COMMAND ID 5396...
PROCESS COMMAND: CMD=1150, DATA=remove
CMDLINE=php /usr/local/nagiosxi/html/includes/components/nagiosbpi/api_tool.php --cmd=syncall
PHP Notice: Undefined variable: err in /app/nagiosxi/html/includes/components/nagiosbpi/api_tool.php on line 146
CMD: syncall
MSG: Could not get data for objects. NDO or Core may not be running.
OUTPUT=MSG: Could not get data for objects. NDO or Core may not be running.
RETURNCODE=0

mariadb.log

191001 14:25:09 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
191001 14:25:09 [Note] /usr/libexec/mysqld (mysqld 5.5.60-MariaDB) starting as process 21644 ...
191001 14:25:09 InnoDB: The InnoDB memory heap is disabled
191001 14:25:09 InnoDB: Mutexes and rw_locks use GCC atomic builtins
191001 14:25:09 InnoDB: Compressed tables use zlib 1.2.7
191001 14:25:09 InnoDB: Using Linux native AIO
191001 14:25:09 InnoDB: Initializing buffer pool, size = 4.0G
191001 14:25:09 InnoDB: Completed initialization of buffer pool
191001 14:25:09 InnoDB: highest supported file format is Barracuda.
191001 14:25:09 InnoDB: Waiting for the background threads to start
191001 14:25:10 Percona XtraDB (http://www.percona.com) 5.5.59-MariaDB-38.11 started; log sequence number 7262361824
191001 14:25:10 [Note] Plugin 'FEEDBACK' is disabled.
191001 14:25:11 [Note] Server socket created on IP: '0.0.0.0'.
191001 14:25:11 [Note] Event Scheduler: Loaded 0 events
191001 14:25:11 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.5.60-MariaDB' socket: '/var/lib/mysql/mysql.sock' port: 3306 MariaDB Server
191001 14:25:11 [Note] /usr/libexec/mysqld: Normal shutdown
191001 14:25:11 [Note] Event Scheduler: Purging the queue. 0 events
191001 14:25:11 InnoDB: Starting shutdown...
191001 14:25:15 InnoDB: Shutdown completed; log sequence number 7262361824
191001 14:25:15 [Note] /usr/libexec/mysqld: Shutdown complete

191001 14:25:15 mysqld_safe mysqld from pid file /var/run/mariadb/mariadb.pid ended
191001 14:25:15 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
191001 14:25:15 [Note] /usr/libexec/mysqld (mysqld 5.5.60-MariaDB) starting as process 22646 ...
191001 14:25:15 InnoDB: The InnoDB memory heap is disabled
191001 14:25:15 InnoDB: Mutexes and rw_locks use GCC atomic builtins
191001 14:25:15 InnoDB: Compressed tables use zlib 1.2.7
191001 14:25:15 InnoDB: Using Linux native AIO
191001 14:25:15 InnoDB: Initializing buffer pool, size = 4.0G
191001 14:25:16 InnoDB: Completed initialization of buffer pool
191001 14:25:16 InnoDB: highest supported file format is Barracuda.
191001 14:25:16 InnoDB: Waiting for the background threads to start
191001 14:25:17 Percona XtraDB (http://www.percona.com) 5.5.59-MariaDB-38.11 started; log sequence number 7262361824
191001 14:25:17 [Note] Plugin 'FEEDBACK' is disabled.
191001 14:25:17 [Note] Server socket created on IP: '0.0.0.0'.
191001 14:25:17 [Note] Event Scheduler: Loaded 0 events
191001 14:25:17 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.5.60-MariaDB' socket: '/var/lib/mysql/mysql.sock' port: 3306 MariaDB Server

Please help solve the issue, best regards,

Ferenc

benjaminsmith · Post by **benjaminsmith** » Mon Oct 07, 2019 10:58 am

Hello @junkertf,

You have about 24,000 services and 1,459 hosts on this server, so you are getting to the point where it makes sense to split the load across multiple servers. However, you can tune XI to help it handle the higher load.

MSG: Could not get data for objects. NDO or Core may not be running.
OUTPUT=MSG: Could not get data for objects. NDO or Core may not be running.

Looks like ndoutils is not running. Run the following to re-start the whole stack and clear the message queues.

Code: Select all

systemctl stop crond
systemctl stop npcd
systemctl stop nagios
systemctl stop ndo2db
pkill -9 -u nagios
for i in $(ipcs -q | grep nagios |awk '{print $2}'); do ipcrm -q $i; done
rm -rf /usr/local/nagiosxi/var/dbmaint.lock
rm -rf /usr/local/nagiosxi/var/event_handler.lock
rm -rf /usr/local/nagiosxi/scripts/reconfigure_nagios.lock
systemctl restart mariadb
systemctl start ndo2db
systemctl start nagios
systemctl start npcd
systemctl start crond
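
Before removing queues it can help to see which ones the loop above would touch. The sketch below reads `ipcs -q` output on standard input and prints the IDs of queues whose owner column is exactly nagios. It assumes the usual util-linux column order (key, msqid, owner, perms, used-bytes, messages), and `nagios_queue_ids` is a hypothetical helper name.

```shell
# List message queue IDs owned by a user, reading `ipcs -q` output on stdin.
# Assumes the util-linux column order: key msqid owner perms used-bytes messages.
# nagios_queue_ids is a hypothetical helper name.
nagios_queue_ids() {
    owner="${1:-nagios}"
    # Match the owner column exactly instead of grepping the whole line.
    awk -v owner="$owner" '$3 == owner { print $2 }'
}

# Dry run: show the queues without removing anything.
#   ipcs -q | nagios_queue_ids
# Removal, equivalent to the loop above:
#   ipcs -q | nagios_queue_ids | while read -r id; do ipcrm -q "$id"; done
```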

Thanks for posting the table sizes. The large nagios_logentries table is most likely impacting performance, so I would recommend trimming it. This will shorten how far back reports can go. For example, the following query removes anything older than six months.

Code: Select all

mysql -uroot -pnagiosxi -h 127.0.0.1 -B nagios -e 'DELETE FROM nagios_logentries WHERE logentry_time <= (NOW() - INTERVAL 6 MONTH);'
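
On a table of this size a single DELETE can run for a long time. One common alternative, sketched here as an assumption and not as a Nagios-recommended procedure, is to delete in limited batches and repeat until nothing is left. `build_purge_sql` is a hypothetical helper, and the batch size of 10000 is an arbitrary example.

```shell
# Build a batched DELETE for nagios_logentries.
# build_purge_sql is a hypothetical helper; months and batch size are examples.
build_purge_sql() {
    months="$1"
    batch="$2"
    printf 'DELETE FROM nagios_logentries WHERE logentry_time <= (NOW() - INTERVAL %d MONTH) LIMIT %d;' \
        "$months" "$batch"
}

# Example loop (credentials and host as in the command above). ROW_COUNT()
# reports how many rows the preceding DELETE removed in the same session:
#   while :; do
#       n=$(mysql -uroot -pnagiosxi -h 127.0.0.1 -N -B nagios \
#             -e "$(build_purge_sql 6 10000) SELECT ROW_COUNT();")
#       [ "$n" -gt 0 ] || break
#   done
```

Keep in mind that the sizes reported by information_schema are estimates for InnoDB tables, so a row count is a more direct way to confirm whether the DELETE removed anything.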

You'll find additional directions on truncating tables in the following guide.
Nagios XI Repairing The Nagios XI Databases

Take a look at the following guide for increasing performance in Nagios XI. I would recommend following the suggestions for increasing the check_intervals where possible and adjusting the Performance Settings in the Admin menu. Also, implementing a RAM disk is going to be more beneficial than offloading the DB.

Maximizing Performance In Nagios XI

If you're still experiencing issues, please send us your system profile and the mysql config file.

Keep in mind that with every purchase we offer 3 separate activations of the XI license: one for production, one for testing, and one for high availability. We always recommend upgrading on a test server first before making changes to the production server.

Nagios License Entitlements
https://support.nagios.com/kb/article.php?id=145

junkertf · Post by **junkertf** » Tue Oct 08, 2019 5:31 am

Hello,

Here are my answers, one by one (and thank you, of course!).

The six-month logentries MySQL command did not delete any rows from the database.

before

Code: Select all

| nagios             | nagios_logentries                            |   10643.98 |

run

Code: Select all

mysql -uroot -pnagiosxi -h 127.0.0.1 -B nagios -e 'DELETE FROM nagios_logentries WHERE logentry_time <= (NOW() - INTERVAL 6 MONTH);'
echo $?
0

after

Code: Select all

| nagios             | nagios_logentries                            |   10643.98 |

So I checked: the total row count in logentries is about 50 million, of which about 15 million rows are between one and two months old and about 13 million are older than two months.
Can or should these rows be imported into Nagios somehow, or can they be truncated without any consequences?

Code: Select all

MariaDB [nagios]> select count(*) from nagios_logentries where logentry_time <= (NOW() - INTERVAL 1 MONTH);
+----------+
| count(*) |
+----------+
| 27116169 |
+----------+
MariaDB [nagios]> select count(*) from nagios_logentries where logentry_time <= (NOW() - INTERVAL 2 MONTH);
+----------+
| count(*) |
+----------+
| 12786814 |
+----------+

Regarding the RAM disk (set up as described in the guide):
tmpfs 6.0G 61M 6.0G 1% /var/nagiosramdisk
ls -l /usr/lib/systemd/system/ramdisk.service
-rw-r--r-- 1 root root 743 Nov 16 2018 /usr/lib/systemd/system/ramdisk.service
tree -pug /var/nagiosramdisk
/var/nagiosramdisk
├── [-rw-r--r-- nagios nagios ] host-perfdata
├── [-rw-r--r-- nagios nagios ] objects.cache
├── [-rw-r--r-- nagios nagios ] service-perfdata
├── [drwxrwxr-x nagios nagios ] spool
│   ├── [drwxrwxr-x nagios nagios ] checkresults
│   ├── [drwxrwxr-x nagios nagios ] perfdata
│   │   ├── [-rw-r--r-- nagios nagios ] 1570526062.perfdata.service
│   │   └── [-rw-r--r-- nagios nagios ] 1570526063.perfdata.host
│   └── [drwxrwxr-x nagios nagios ] xidpe
├── [-rw-r--r-- nagios nagios ] status.dat
└── [drwxrwxr-x nagios nagios ] tmp

grep -i nagiosramdisk /usr/local/nagios/etc/nagios.cfg
service_perfdata_file=/var/nagiosramdisk/service-perfdata
host_perfdata_file=/var/nagiosramdisk/host-perfdata
check_result_path=/var/nagiosramdisk/spool/checkresults
object_cache_file=/var/nagiosramdisk/objects.cache
status_file=/var/nagiosramdisk/status.dat
temp_path=/var/nagiosramdisk/tmp

grep -i nagiosramdisk /usr/local/nrdp/server/config.inc.php /usr/local/nagiosxi/html/config.inc.php /usr/local/nagios/etc/pnp/npcd.cfg /usr/local/nagiosmobile/include.inc.php
/usr/local/nrdp/server/config.inc.php:$cfg["check_results_dir"]="/var/nagiosramdisk/spool/checkresults";
/usr/local/nagiosxi/html/config.inc.php:$cfg['xidpe_dir'] = '/var/nagiosramdisk/spool/xidpe/';
/usr/local/nagiosxi/html/config.inc.php:$cfg['perfdata_spool'] = '/var/nagiosramdisk/spool/perfdata/';
/usr/local/nagios/etc/pnp/npcd.cfg:perfdata_spool_dir = /var/nagiosramdisk/spool/perfdata/
/usr/local/nagiosmobile/include.inc.php:$STATUS_FILE = "/var/nagiosramdisk/status.dat";
/usr/local/nagiosmobile/include.inc.php:$OBJECTS_FILE = "/var/nagiosramdisk/objects.cache";

The commands also seem OK:
process-service-perfdata-file-bulk /bin/mv /var/nagiosramdisk/service-perfdata /var/nagiosramdisk/spool/xidpe/$TIMET$.perfdata.service
process-host-perfdata-file-bulk /bin/mv /var/nagiosramdisk/host-perfdata /var/nagiosramdisk/spool/xidpe/$TIMET$.perfdata.host

The Apply Configuration output is the same (?):
PROCESSING COMMAND ID 5412...
PROCESS COMMAND: CMD=17, DATA=
APPLYING NAGIOSCORE CONFIG...
CMDLINE=cd /usr/local/nagiosxi/scripts && ./reconfigure_nagios.sh
No entry for terminal type "unknown";
using dumb terminal settings.

--- reset_config_perms.sh ------------
> Setting CCM script permissions
> Setting script permissions
> Setting special component script permissions
> Setting configuration file/directory permissions
> Setting perfdata directory and RRD permissions
> Setting NOM checkpoint user:group permissions
> + Setting CCM configuration file user:group permissions
> + Setting Recurring Downtime file user:group permissions
> + Setting BPI configuration file user:group permissions
--------------------------------------

--- ccm_import.php -------------------
> Setting import directory: /usr/local/nagios/etc/import/
> Importing config files into the CCM
No files to import
--------------------------------------

--- ccm_export.php -------------------
> Writing CCM configuration to Nagios files
Finished writing out configuraton
--------------------------------------

--------------------------------------
> Verifying configuration with Nagios Core
> Output:
Nagios Core 4.4.2
Copyright (c) 2009-present Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad
Last Modified: 2018-08-16
License: GPL

Website: https://www.nagios.org
Reading configuration data...
Read main config file okay...
Read object config files okay...

Running pre-flight check on configuration data...

Checking objects...
Checked 24002 services.
Checked 1459 hosts.
Checked 99 host groups.
Checked 21 service groups.
Checked 65 contacts.
Checked 14 contact groups.
Checked 139 commands.
Checked 70 time periods.
Checked 0 host escalations.
Checked 0 service escalations.
Checking for circular paths...
Checked 1459 hosts
Checked 0 service dependencies
Checked 0 host dependencies
Checked 70 timeperiods
Checking global event handlers...
Checking obsessive compulsive processor commands...
Checking misc settings...

Total Warnings: 0
Total Errors: 0

Things look okay - No serious problems were detected during the pre-flight check
> Return Code: 0
--------------------------------------
OUTPUT=--------------------------------------
RETURNCODE=0
PROCESSING COMMAND ID 5413...
PROCESS COMMAND: CMD=1150, DATA=remove
CMDLINE=php /usr/local/nagiosxi/html/includes/components/nagiosbpi/api_tool.php --cmd=syncall
...........PHP Notice: Undefined variable: err in /app/nagiosxi/html/includes/components/nagiosbpi/api_tool.php on line 146
CMD: syncall
MSG: Could not get data for objects. NDO or Core may not be running.
OUTPUT=MSG: Could not get data for objects. NDO or Core may not be running.
RETURNCODE=0

The notice about the undefined variable err in /app/nagiosxi/html/includes/components/nagiosbpi/api_tool.php on line 146 points at this block:

if (!$is_running) {
echo "CMD: $cmd\n";
echo "MSG: Could not get data for objects. NDO or Core may not be running.\n";
return $err;
}

The RLIMIT messages have actually not appeared since yesterday morning, and the Core Component Status / Monitoring Engine dashlets show that everything is working well.

Regarding the architecture, licences, and possibilities:
We currently have a test and a production environment (one node each).
The system is still running on VMware. The project for the next month is to move to physical hardware with more CPU, more RAM (for the RAM disk), and NVMe drives for better performance. We also want to use DRBD for automatic failover in an active-backup XI cluster.

My question: can I use the third licence (HA / backup) until I have installed our new physical hardware?
After the move, our current XI instance will be destroyed, so the original licence terms would not be violated.

The DRBD-NagiosXI documentation (DRBD_8_HA_Nagios_XI_v5_Cluster_on_RHEL7.pdf) gives a very good starting point for installation and configuration, but I feel that an existing, working XI environment needs a somewhat different approach. The documentation says XI must be installed on the new nodes, but it does not describe how to migrate an existing environment to that DRBD/XI architecture. As I understand it, I must install XI and all of its dependent packages, scripts, and add-ons on the new node(s), test the clustering (as described in the documentation: plug, STONITH, failover, etc.), and finally back up and restore (copy?) our XI data from the original node to the new DRBD-based physical architecture. Can you give me a more detailed milestone list or how-to so that this migration goes through without major mistakes?

Thank You, best regards,

Ferenc

benjaminsmith · Post by **benjaminsmith** » Tue Oct 08, 2019 11:20 am

Hello Ferenc,

The RLIMIT messages have actually not appeared since yesterday morning, and the Core Component Status / Monitoring Engine dashlets show that everything is working well.

That's good to hear.

My question: can I use the third licence (HA / backup) until I have installed our new physical hardware?

Yes. You are given 3 instances, but only one license can be used for active monitoring.

The following command should completely truncate the log entries table.

Code: Select all

mysql -u ndoutils -pn@gweb nagios -e 'TRUNCATE TABLE nagios_logentries'
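
To confirm the result, the size query from the first post can be narrowed to a single table. This is a minimal sketch: `table_size_sql` is a hypothetical helper, and it assumes the ndoutils user is allowed to see the table in information_schema.

```shell
# Build a query reporting one table's size in MB from information_schema.
# table_size_sql is a hypothetical helper name.
table_size_sql() {
    schema="$1"
    table="$2"
    printf "SELECT table_name, ROUND((data_length + index_length) / 1024 / 1024, 2) AS size_mb FROM information_schema.TABLES WHERE table_schema = '%s' AND table_name = '%s';" \
        "$schema" "$table"
}

# Example, using the credentials from the command above:
#   mysql -u ndoutils -pn@gweb -t -e "$(table_size_sql nagios nagios_logentries)"
```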

As for the HA configuration, we recommend using the solution provided by LinBit, as this is a complex setup and not something we cover in technical support.

https://downloads.linbit.com/ha-nagios- ... -on-rhel7/

junkertf · Post by **junkertf** » Wed Oct 09, 2019 5:58 am

Hello,

I have truncated the logentries table successfully.

The bad news: yesterday's restart broke something. No performance data has come in for the last 24 hours.
Restarting all the services did not solve it. Every Core Component status is green, and the Monitoring Engine process and queue are green as well.

I also tried a full OS restart, with some success: the performance graphs came back (only the 24 hours of data are gone).

Regarding the HA setup you mentioned: so I can use our third licence to test the LinBit HA setup without using that instance for monitoring, and as the last step export our data from our current system, shut it down, import the data into the new HA environment, and start it?
Do you mean that I should contact LinBit about the migration steps? I also think this is more complex than what the PDF you mentioned covers, because the PDF does not say when and how the original data gets onto the new servers.
Anyway, thanks. I will contact LinBit about this.

Thank you, best regards,

Ferenc

benjaminsmith · Post by **benjaminsmith** » Wed Oct 09, 2019 10:43 am

Hi Ferenc,

Below is our guide on troubleshooting performance graphs. You'll want to make sure that npcd is running and that the files are not spooling up for some reason.

Code: Select all

systemctl status npcd.service
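
To see whether files are piling up, counting the entries in the spool directories is a quick check. The sketch below assumes the RAM disk paths shown earlier in this thread; on a default install the paths differ. `count_spool_files` is a hypothetical helper name.

```shell
# Count regular files directly inside a spool directory.
# count_spool_files is a hypothetical helper name.
count_spool_files() {
    find "$1" -maxdepth 1 -type f | wc -l | tr -d ' '
}

# Example, using the RAM disk paths from this thread:
#   for d in /var/nagiosramdisk/spool/perfdata /var/nagiosramdisk/spool/xidpe; do
#       printf '%s: %s files\n' "$d" "$(count_spool_files "$d")"
#   done
```

A count that keeps growing between runs would suggest that the files are not being processed as fast as they arrive.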

If you haven't already bumped up your load_threshold and timeout, I would recommend doing that. Let me know if you need any assistance.

Nagios XI - Performance Graph Problem

Regarding HA setup, Nagios XI does not have this as a feature, so if you're looking for a solution with support services, LinBit would be an option. Please see our general guide on HA options below.

Nagios XI How To Achieve High Availability
