Re: Performance issue

Posted: Tue Oct 24, 2017 6:17 am
by lvaillant
Following the information in https://assets.nagios.com/downloads/nag ... ptions.pdf, here is what NPCD logs:

Code: Select all

[10-24-2017 11:56:12] NPCD: ERROR: Executed command exits with return code '7'
[10-24-2017 11:56:12] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /var/nagiosramdisk/spool/perfdata//1508838959.perfdata.service'
[10-24-2017 11:57:24] NPCD: WARN: MAX load reached: load 10.640000/10.000000 at i=0
[10-24-2017 11:57:39] NPCD: WARN: MAX load reached: load 10.260000/10.000000 at i=1
[10-24-2017 12:00:11] NPCD: ERROR: Executed command exits with return code '7'
[10-24-2017 12:00:11] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /var/nagiosramdisk/spool/perfdata//1508839199.perfdata.service'
[10-24-2017 12:00:31] NPCD: ERROR: Executed command exits with return code '7'
[10-24-2017 12:00:31] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /var/nagiosramdisk/spool/perfdata//1508839215.perfdata.service'
There are lots of error lines like these.
Judging from the numbering of the existing files in /var/nagiosramdisk/spool/perfdata/, their timestamps, and the NPCD errors, it looks as if some other process has already handled the perfdata.

Regarding 'MAX load reached', should I really change the default value in the configuration file?
I do not understand this detected load, as the server seems to run at an average CPU usage below 50% (8 CPUs).
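
On the load question: the warning compares a load average with a limit (10.0 in the log above), not CPU utilisation. On Linux the load average also counts processes blocked in uninterruptible I/O, so it can pass the limit while the CPUs are far from saturated. Below is a minimal Python sketch of that comparison. The function name `load_status` is hypothetical, and I believe the limit comes from `load_threshold` in npcd.cfg, but treat that as an assumption to check against your own file.

```python
import os


def load_status(load_avg: float, threshold: float, cpu_count: int) -> dict:
    """Compare a load average with a threshold and normalise it per CPU.

    Hypothetical helper for illustration; not part of NPCD or Nagios XI.
    """
    return {
        "over_threshold": load_avg > threshold,
        "load_per_cpu": round(load_avg / cpu_count, 2),
    }


if __name__ == "__main__":
    # Values from the NPCD log line: load 10.64 against a limit of 10.0, 8 CPUs.
    print(load_status(10.64, 10.0, 8))

    # Live 1-minute load average of the machine running this script (Unix only).
    one_minute, _, _ = os.getloadavg()
    print(load_status(one_minute, 10.0, os.cpu_count() or 1))
```

A load of 10.64 on 8 CPUs is only slightly more than one waiting process per CPU, which fits a box that is busy with many short perfdata jobs and disk or database waits rather than one that is CPU-bound.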

I also have these messages in perfdata.log:

Code: Select all

2017-10-24 12:09:18 [14134] [0] *** TIMEOUT: Timeout after 5 secs. ***
2017-10-24 12:09:18 [14134] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2017-10-24 12:09:18 [14134] [0] *** TIMEOUT: Please check your npcd.cfg
2017-10-24 12:09:18 [14134] [0] *** TIMEOUT: /var/nagiosramdisk/spool/perfdata//1508839749.perfdata.service-PID-14134 deleted
2017-10-24 12:09:18 [14134] [0] *** Timeout while processing Host: "<hostname>" Service: "System_-_Network_bandwidth"
2017-10-24 12:09:18 [14134] [0] *** process_perfdata.pl terminated on signal ALRM

Code: Select all

# grep -i time /usr/local/nagios/etc/pnp/process_perfdata.cfg
TIMEOUT = 5
[root@hq-nagios-xi01 var]# grep -ri time /usr/local/nagios/etc/pnp/npcd.cfg
# sleep_time - how many seconds should npcd wait between dirscans
# sleep_time = 15 (default)
sleep_time = 15
Should I change those values as explained in this old thread?
Could that have an impact on performance?

Re: Performance issue

Posted: Tue Oct 24, 2017 8:47 am
by lvaillant
Regarding Optimize task and locks...
Here are some metrics:

Code: Select all

MariaDB [(none)]> SELECT count(*) from nagios.nagios_logentries;
+----------+
| count(*) |
+----------+
| 10168443 |
+----------+
1 row in set (0.00 sec)

Code: Select all

MariaDB [(none)]> SELECT table_name AS table_name, round(((data_length + index_length) / 1024 / 1024), 2) Size_in_MB FROM information_schema.TABLES WHERE table_schema = 'nagios' AND table_name = 'nagios_logentries';
+-------------------+------------+
| table_name        | Size_in_MB |
+-------------------+------------+
| nagios_logentries |    2465.89 |
+-------------------+------------+
1 row in set (0.00 sec)

Code: Select all

MariaDB [nagios]> select logentry_time from nagios_logentries order by logentry_time ASC limit 1;
+---------------------+
| logentry_time       |
+---------------------+
| 2017-07-26 15:10:01 |
+---------------------+
1 row in set (0.00 sec)

Code: Select all

MariaDB [nagios]> SELECT count(*) from nagios.nagios_notifications;
+----------+
| count(*) |
+----------+
|  6077202 |
+----------+
1 row in set (0.00 sec)

Code: Select all

MariaDB [nagios]> select start_time from nagios_notifications ORDER BY notification_id  ASC limit 1;
+---------------------+
| start_time          |
+---------------------+
| 2017-07-26 15:10:01 |
+---------------------+
1 row in set (0.01 sec)
Given these figures, already mentioned in a previous post, I suppose I can gain precious seconds by reducing the size of those tables.
I'm considering reducing the retention from 90 days to 45 days, for example.
What is the impact on Nagios data if I reduce the log entries history? It affects the 'Audit Log', doesn't it?
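
As a rough check on what a 45-day retention would buy, here is a small Python sketch based on the figures above (10168443 rows, 2465.89 MB, oldest entry 2017-07-26, measured on 2017-10-24). It assumes rows arrive at a uniform rate and that table size scales linearly with row count; both are simplifications, and the function name is hypothetical.

```python
from datetime import date


def estimate_after_retention(rows: int, size_mb: float, oldest: date,
                             today: date, new_retention_days: int) -> dict:
    """Estimate row count and size after shortening retention.

    Assumes a uniform insert rate and size proportional to row count.
    """
    days_kept = (today - oldest).days
    fraction = min(1.0, new_retention_days / days_kept)
    return {
        "days_kept": days_kept,
        "rows_per_day": round(rows / days_kept),
        "est_rows": round(rows * fraction),
        "est_size_mb": round(size_mb * fraction, 1),
    }


if __name__ == "__main__":
    # nagios_logentries figures quoted in this post.
    estimate = estimate_after_retention(
        rows=10168443,
        size_mb=2465.89,
        oldest=date(2017, 7, 26),
        today=date(2017, 10, 24),
        new_retention_days=45,
    )
    for name, value in estimate.items():
        print(f"{name}: {value}")
```

Under those assumptions the table would shrink to roughly half its current rows and size, so the time spent optimizing it should drop noticeably, though not necessarily in proportion.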

Re: Performance issue

Posted: Tue Oct 24, 2017 2:01 pm
by tgriep
If you change the settings in the following files and increase the timeout and load values, that will keep the performance data from getting dropped, so the Performance Graphs will not have any gaps in them.

Code: Select all

/usr/local/nagios/etc/pnp/process_perfdata.cfg
/usr/local/nagios/etc/pnp/npcd.cfg
Here are the instructions for increasing those values so the graphs will not have gaps in the future.
https://support.nagios.com/kb/article/n ... blems.html


The Audit Log entries are stored in another table so if you do truncate the logentries table, those entries will not be lost.

The logentries table is used for displaying the data in the "Event Log" menu and the "Event Log" report.
The Notification table is used for displaying the data in the "Notifications" menu and the "Notifications" report.
If you do not need to keep that information for that length of time, you can decrease the age settings and that will let the DB Maint process run quicker.

Re: Performance issue

Posted: Wed Oct 25, 2017 4:08 am
by lvaillant
Last night (CEST), Nagios went completely out of order...

Message queue was full.

Code: Select all

Oct 25 00:32:12 hq-nagios-xi01 ndo2db: Warning: Retrying message send. This can occur because you have too few messages allowed or too few total bytes allowed in message queues. You are currently using 384000 of 512000 messages and 393216000 of 393216000 bytes in the queue. See README for kernel tuning options.
All nagios/ndo2db processes were up, but I had a lot of messages like this:

Code: Select all

Oct 25 00:40:54 hq-nagios-xi01 nagios: Warning: The check of host 'XXXX' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the host...
I spent an hour restarting it.
I raised the message queue limits:

Code: Select all

kernel.msgmnb = 786432000
kernel.msgmax = 262144000
kernel.shmmax = 4294967295
kernel.shmall = 268435456
kernel.msgmni = 512000
But the message queue fills up quickly...
localhost-message_queue (1).png
And Nagios is stuck again...

In fact, the 'OPTIMIZE TABLE nagios_[logentries|notifications|...]' tasks run again and again...
Since I reduced the max age for these tables, the runs take around 150 s and 90 s respectively.
I also increased the Optimize interval...

I started the repair_databases.sh script, restarted the nagios/ndo2db/httpd services...

Code: Select all

=======================
nagios offloaded database repair succeeded
nagiosql offloaded database repair succeeded
nagiosxi offloaded database repair succeeded
But Nagios XI reports a different DB state:
screenshot-monitoring.zodiac.lan-2017-10-25-10-53-16-065.png
During the repair, I saw a concurrent Optimize on another table:

Code: Select all

30539305 - 56 - REPAIR TABLE `nagios_statehistory`
30539363 - 57 - OPTIMIZE TABLE nagios_logentries
A second message queue was created:

Code: Select all

Every 2.0s: ipcs -q                          Wed Oct 25 10:40:46 2017


------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0xba000002 0          nagios     600        786432000    768000
0x9f000002 32769      nagios     600        175332352    171223
'Optimize table nagios_xxxx' starts again... and again... and again...

How can I stop this behavior?

Nagios is now unusable: the queue is permanently full, the data does not refresh, and the server eventually goes down.
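
To put numbers on "permanently full": in the figures quoted above every queued message takes 1024 bytes (393216000 bytes / 384000 messages), so the first queue in the `ipcs -q` output sits exactly at the new kernel.msgmnb limit. The Python sketch below parses that same output and prints the fill level per queue. The helper name is hypothetical, and the parser assumes the six-column layout shown in this post.

```python
def parse_ipcs_queues(text: str) -> list:
    """Return (key, used_bytes, messages) for each queue row of `ipcs -q` output."""
    queues = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a hex key such as 0xba000002 and have six columns.
        if len(fields) == 6 and fields[0].startswith("0x"):
            queues.append((fields[0], int(fields[4]), int(fields[5])))
    return queues


# Output copied from the post above.
SAMPLE = """\
------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0xba000002 0          nagios     600        786432000    768000
0x9f000002 32769      nagios     600        175332352    171223
"""

MSGMNB = 786432000  # kernel.msgmnb from the sysctl settings quoted above

if __name__ == "__main__":
    for key, used, count in parse_ipcs_queues(SAMPLE):
        per_message = used // count if count else 0
        fill = 100 * used / MSGMNB
        print(f"{key}: {fill:.1f}% of msgmnb, {per_message} bytes per message")
```

Fed with live `ipcs -q` output instead of the sample, the same function can be run periodically to see whether the queue drains between optimize runs.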

Re: Performance issue

Posted: Wed Oct 25, 2017 5:10 am
by lvaillant
Tailing the dbmain.log file in real time...

It seems the DB issue is now resolved, as the status is green in the Nagios XI dashboard.
Some of the following messages occurred before it was restored:

Code: Select all

    <p><pre>SQL Error [nagiosxi] : MySQL server has gone away</pre></p>
Message queue is (very slowly) decreasing.
localhost-message_queue (2).png
Optimize intervals are now the following:
  • Nagios XI Database: 180
  • NDOUtils Database: 90
  • NagiosQL Database: 180
I'm waiting for the next 'optimize' task on the NDOUtils Database to see the real impact.

Code: Select all

# mysqladmin status
Uptime: 15129774  Threads: 95  Questions: 23546418316  Slow queries: 1808  Opens: 739472  Flush tables: 10  Open tables: 251  Queries per second avg: 1556.296
Is there a way to increase ndo2db throughput?
What drives the number of requests ndo2db sends to the DB?

I see no network latency when using tcpdump/wireshark. I'm sure MySQL/MariaDB can handle more queries per second.

Re: Performance issue

Posted: Wed Oct 25, 2017 3:07 pm
by tgriep
Other than the settings in the /etc/sysctl.conf file, there isn't a way to increase the performance of ndo2db.

All of the checks, statuses, performance data, etc. get stored in MySQL tables, and ndo2db is what transfers them.

The speed or network latency may be good, but the extra steps and time it takes to send the info to the remote MySQL server could be the issue.
We have seen the kernel message issue on other customers' servers with a large number of checks, and the fix was to move the MySQL database back to the XI server.

Re: Performance issue

Posted: Mon Oct 30, 2017 8:00 am
by lvaillant
Since I changed the optimize interval and the retention values, the behavior is much better.

Time-stacked perf:
Message Queue - Time stacked perf.png
Last 7 days:
Message Queue - 7d.png
Last 24 hours:
Message Queue - 24h.png
I can clearly identify the interval between the peaks: 90 min.
It is the same as the NDOUtils database's Optimize interval.

As already identified, the logentries and notifications tables are locked during the optimize operation.
That is why ndo2db is no longer able to deliver requests to the DB.

The issue does not seem to be linked to network time, but to long table locks.

I'm not sure that putting MySQL/MariaDB back on the XI server itself will solve this kind of issue.
I can consider this option, as I have a major upgrade to do on my servers (OS and Nagios updates): I will stop the DB server so I can sync the data back to the Nagios server.
But it will affect the performance of the master server itself.
And if my master collapses, I will have to stop the whole Nagios infrastructure again to resync and reactivate my offloaded DB.

Re: Performance issue

Posted: Mon Oct 30, 2017 3:52 pm
by tgriep
Moving the MYSQL database back to the Nagios server was an option that worked for other users so that is why it was suggested.

Re: Performance issue

Posted: Tue Oct 31, 2017 5:13 am
by lvaillant
Is it an option to convert some tables from the MyISAM to the InnoDB engine?
https://mariadb.com/kb/en/library/conve ... to-innodb/

Code: Select all

MariaDB [mysql]> SELECT table_name AS table_name, engine, table_rows, round(((data_length + index_length) / 1024 / 1024), 2) Size_in_MB FROM information_schema.TABLES WHERE table_rows > 1000000;
+----------------------+--------+------------+------------+
| table_name           | engine | table_rows | Size_in_MB |
+----------------------+--------+------------+------------+
| nagios_logentries    | MyISAM |    5942113 |    1496.32 |
| nagios_notifications | MyISAM |    3402418 |    1205.06 |
| nagios_statehistory  | MyISAM |    5707759 |     937.75 |
+----------------------+--------+------------+------------+
3 rows in set (0.35 sec)
InnoDB has no need for CHECK, OPTIMIZE, or ANALYZE. It may be a substantial gain.
Does it impact the Nagios DB abstraction layer in any way?

In your opinion, is it a good idea? Any feedback from somebody who has already done that conversion?

Re: Performance issue

Posted: Tue Oct 31, 2017 12:59 pm
by tgriep
We do not support changing the database to InnoDB, as this has not been tested and it may cause the server to not function correctly.