
Performance issue

Posted: Wed Oct 18, 2017 9:12 am
by lvaillant
Hello,

I'm currently running into several issues, but I think they share a common root cause.

Details on installation:
  • RHEL 7.3 64b - minimal install
  • Manual install of Nagios XI
  • Current version: 5.4.4
  • Proxy configured (system & nagios)
  • Using SSL
  • DB offloaded (MariaDB - RHEL 7.3)
  • Mod_Gearman2 installed / 4 pollers
  • Ramdisk (1GB)
  • 8 vCPU
  • 16 GB RAM
  • +1750 hosts
  • +10650 services
1st visible symptom is the following message in Nagvis:
ERROR: Problem (Backend:ndomy_1): NDO Claims that nagios did no status update ... Make sure that nagios and NDO daemons are running.

Restarting the ndo2db service gets Nagvis working again for a few minutes or hours, but the same issue always comes back.

This behavior only started a few days or weeks ago, when I began getting error messages from the ndo2db service:

Code: Select all

ndo2db: Message sent to queue.
Warning: queue send error, retrying...
ndo2db: Error: max retries exceeded sending message to queue. Kernel queue parameters may need to be tuned. See README.
I changed the kernel settings as explained in the FAQ (https://support.nagios.com/kb/article/n ... eeded.html):

Code: Select all

# sysctl -a | grep kernel.msgm
kernel.msgmax = 262144000
kernel.msgmnb = 262144000
kernel.msgmni = 512000
I also monitored the message queue (cf. screenshot localhost-message_queue.png).
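
For anyone who wants to sample the queue the same way, here is a minimal sketch, assuming a Linux host that exposes System V message queues in /proc/sysvipc/msg (the same data `ipcs -q` prints). This is not necessarily how the graph in the screenshot was produced; the function names are hypothetical, and the percentage assumes the queue uses the system-wide kernel.msgmnb value as its byte limit.

```python
from typing import Dict, List


def parse_sysvipc_msg(text: str) -> List[Dict[str, int]]:
    """Parse the text of /proc/sysvipc/msg into one dict per message queue."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) < 2:
        return []  # header only (or empty): no queues exist
    header = lines[0].split()
    wanted = ("msqid", "cbytes", "qnum")
    # Look the columns up by name so the code does not depend on their order.
    index = {name: header.index(name) for name in wanted}
    queues = []
    for ln in lines[1:]:
        cols = ln.split()
        queues.append({name: int(cols[i]) for name, i in index.items()})
    return queues


def byte_usage_pct(queue: Dict[str, int], msgmnb: int) -> float:
    """Share of the byte limit currently used by one queue, in percent."""
    return 100.0 * queue["cbytes"] / msgmnb


def read_live_queues() -> List[Dict[str, int]]:
    """Read the queues of the local machine (Linux only)."""
    with open("/proc/sysvipc/msg") as f:
        return parse_sysvipc_msg(f.read())
```

Calling `read_live_queues()` from cron or a Nagios check every minute and logging `cbytes` and `qnum` would give the same kind of peak/plateau curve.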

I observe peaks and "plateaux" in the message queue, and ndo2db logs messages like this one:
ndo2db: Warning: Retrying message send. This can occur because you have too few messages allowed or too few total bytes allowed in message queues. You are currently using 256000 of 512000 messages and 262144000 of 262144000 bytes in the queue. See README for kernel tuning options.

So today I changed the kernel settings again:

Code: Select all

# sysctl -a | grep kernel.msgm
kernel.msgmax = 262144000
kernel.msgmnb = 393216000
kernel.msgmni = 512000
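
The numbers in the ndo2db warning quoted above show which limit was actually hit. A minimal sketch, assuming the warning keeps the wording shown in this post (the regular expression and function name are hypothetical, not part of ndoutils):

```python
import re
from typing import Dict

# Matches the "using X of Y messages and A of B bytes" part of the warning.
WARNING_RE = re.compile(
    r"using (?P<msgs>\d+) of (?P<max_msgs>\d+) messages and "
    r"(?P<used>\d+) of (?P<max_bytes>\d+) bytes"
)


def queue_pressure(line: str) -> Dict[str, float]:
    """Return message and byte utilisation (percent) and the average message size."""
    m = WARNING_RE.search(line)
    if m is None:
        raise ValueError("not an ndo2db queue warning")
    msgs = int(m.group("msgs"))
    max_msgs = int(m.group("max_msgs"))
    used = int(m.group("used"))
    max_bytes = int(m.group("max_bytes"))
    return {
        "msg_pct": 100.0 * msgs / max_msgs,
        "byte_pct": 100.0 * used / max_bytes,
        "avg_msg_bytes": used / msgs if msgs else 0.0,
    }
```

For the warning above this gives 50% of the message limit, 100% of the byte limit and an average of 1024 bytes per message, so the byte limit (kernel.msgmnb) was the one being hit. At that average size, the new value of 393216000 bytes holds about 384000 messages.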
During the plateaux, my pollers stop checking the hosts and services assigned to them.
I can see the CPU Load and the number of workers decreasing at that time.

Each time there is a peak, nagvis seems to lose the connection to its ndomy backend.
I also observe that Nagios is slow to refresh perfdata during the peaks...

I saw no issue from the DB side.

I have read the forum and many Google results, but I did not find a single clear answer to this kind of issue.

1) Can you confirm that these issues are linked to ndo2db?
2) Do you have any recommendations for solving them?

Thank you.

Re: Performance issue

Posted: Wed Oct 18, 2017 1:41 pm
by dwasswa
Hi @ lvaillant

What version of ndoutils are you using?

Re: Performance issue

Posted: Thu Oct 19, 2017 1:23 am
by lvaillant
Hello
dwasswa wrote: What version of ndoutils are you using?
The one available in the Nagios XI package.

Code: Select all

# /usr/local/nagios/bin/ndo2db --help

NDO2DB 2.1.2
Copyright (c) 2009 Nagios Core Development Team and Community Contributors
Copyright (c) 2005-2008 Ethan Galstad
Last Modified: 11-14-2016
License: GPL v2
I'm running Nagios XI 5.4.4. The last update was done in early April 2017.
The issue has only been occurring for a few weeks.

Re: Performance issue

Posted: Thu Oct 19, 2017 12:03 pm
by dwasswa
Hi @ lvaillant ,

If changing the kernel settings did not fix the issue, it is possible that it is related to MySQL or MariaDB. Can you please make sure that the database server has enough memory and CPU?

Re: Performance issue

Posted: Fri Oct 20, 2017 2:18 am
by lvaillant
The DB server is:
  • RHEL 7.3 64b
  • MariaDB 5.5.52
  • 2 vCPUs
  • 4 GB RAM
Current MariaDB configuration

Code: Select all

max_connections=250
skip-name-resolve
thread-cache-size=10
query-cache-type=0
join-buffer-size=256
table-open-cache=800
tmp_table_size=512M
max_heap_table_size= 512M
slow-query-log=1
slow-query-log-file=/var/log/mariadb/mariadb-slow.log

Code: Select all

MariaDB [mysql]> SELECT  ENGINE,
    ->         ROUND(SUM(data_length) /1024/1024, 1) AS "Data MB",
    ->         ROUND(SUM(index_length)/1024/1024, 1) AS "Index MB",
    ->         ROUND(SUM(data_length + index_length)/1024/1024, 1) AS "Total MB",
    ->         COUNT(*) "Num Tables"
    ->     FROM  INFORMATION_SCHEMA.TABLES
    ->     WHERE  table_schema not in ("information_schema", "PERFORMANCE_SCHEMA", "SYS_SCHEMA", "ndbinfo")
    ->     GROUP BY  ENGINE;
+--------+---------+----------+----------+------------+
| ENGINE | Data MB | Index MB | Total MB | Num Tables |
+--------+---------+----------+----------+------------+
| CSV    |     0.0 |      0.0 |      0.0 |          2 |
| InnoDB |    19.0 |      1.2 |     20.2 |          9 |
| MyISAM |  4385.5 |   1309.7 |   5695.2 |        173 |
+--------+---------+----------+----------+------------+
3 rows in set (0.01 sec)

Code: Select all

# mysqladmin status
Uptime: 14686008  Threads: 107  Questions: 22669329650  Slow queries: 1671  Opens: 709411  Flush tables: 10  Open tables: 402  Queries per second avg: 1543.600
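
The averages in this status line can be re-derived from its counters. A minimal sketch, assuming the fields are separated by two or more spaces as in the output above (the function name is hypothetical):

```python
import re
from typing import Dict


def parse_mysqladmin_status(line: str) -> Dict[str, float]:
    """Split a `mysqladmin status` line into a {field name: value} mapping."""
    fields = {}
    for chunk in re.split(r"\s{2,}", line.strip()):
        # The value follows the last colon of each "Name: value" chunk.
        key, _, value = chunk.rpartition(":")
        fields[key.strip()] = float(value)
    return fields


status = parse_mysqladmin_status(
    "Uptime: 14686008  Threads: 107  Questions: 22669329650  Slow queries: 1671  "
    "Opens: 709411  Flush tables: 10  Open tables: 402  Queries per second avg: 1543.600"
)
uptime_days = status["Uptime"] / 86400            # seconds to days
qps = status["Questions"] / status["Uptime"]      # average since server start
slow_per_million = 1e6 * status["Slow queries"] / status["Questions"]
```

The server has been up for roughly 170 days and the average of about 1544 queries per second covers that whole period, so it says nothing about the load during the hourly peaks. Slow queries are well under one per million statements.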
Below is some output from MySQLTuner:

Code: Select all

-------- Performance Metrics -----------------------------------------------------------------------
[--] Up for: 169d 23h 10m 10s (22B q [1K qps], 29M conn, TX: 19216G, RX: 11182G)
[--] Reads / Writes: 11% / 89%
[--] Binary logging is disabled
[--] Physical Memory     : 3.7G
[--] Max MySQL memory    : 1.5G
[--] Other process memory: 51.8M
[--] Total buffers: 912.0M global + 2.7M per thread (250 max threads)
[--] P_S Max memory usage: 0B
[--] Galera GCache Max memory usage: 0B
[!!] Sorts requiring temporary tables: 13% (12M temp sorts / 91M sorts)
[!!] Joins performed without indexes: 26825018
[!!] Table cache hit rate: 0% (361 open / 709K opened)
...
-------- MyISAM Metrics ----------------------------------------------------------------------------
[!!] Key buffer used: 29.6% (39M used / 134M cache)
[!!] Write Key buffer hit rate: 53.7% (46B cached / 24B writes)

-------- InnoDB Metrics ----------------------------------------------------------------------------
[--] InnoDB is enabled.
[--] InnoDB Thread Concurrency: 0
[!!] InnoDB File per table is not activated
[!!] Ratio InnoDB log file size / InnoDB Buffer pool size (7.8125 %): 5.0M * 2/128.0M should be equal 25%
[--] InnoDB Buffer Pool Chunk Size not used or defined in your version
[!!] InnoDB Write Log efficiency: 71.14% (127010173 hits/ 178545423 total)
...
-------- Recommendations ---------------------------------------------------------------------------
General recommendations:
    Reduce or eliminate persistent connections to reduce connection usage
    Adjust your join queries to always utilize indexes
    Increase table_open_cache gradually to avoid file descriptor limits
    Read this before increasing table_open_cache over 64: http://bit.ly/1mi7c4C
    Beware that open_files_limit (1861) variable
    should be greater than table_open_cache (800)
    Consider installing Sys schema from https://github.com/mysql/mysql-sys
    Read this before changing innodb_log_file_size and/or innodb_log_files_in_group: http://bit.ly/2wgkDvS
Variables to adjust:
    max_connections (> 250)
    wait_timeout (< 28800)
    interactive_timeout (< 28800)
    sort_buffer_size (> 2M)
    read_rnd_buffer_size (> 256K)
    join_buffer_size (> 256B, or always use indexes with joins)
    table_open_cache (> 800)
    innodb_file_per_table=ON
    innodb_log_file_size should be (=16M) if possible, so InnoDB total log files size equals to 25% of buffer pool size.
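
Two of the MySQLTuner figures above can be checked by hand. A minimal sketch of that arithmetic, using only numbers from the report (the function names are hypothetical, and the formulas are inferred from the report lines "Total buffers" and "Ratio InnoDB log file size / InnoDB Buffer pool size"):

```python
def max_mysql_memory_mb(global_mb: float, per_thread_mb: float, max_threads: int) -> float:
    """Worst case: global buffers plus per-thread buffers for every allowed connection."""
    return global_mb + per_thread_mb * max_threads


def innodb_log_ratio_pct(log_file_mb: float, files_in_group: int, buffer_pool_mb: float) -> float:
    """Total InnoDB log size as a percentage of the buffer pool size."""
    return 100.0 * log_file_mb * files_in_group / buffer_pool_mb


# 912.0M global + 2.7M per thread, 250 max threads
worst_case_mb = max_mysql_memory_mb(912.0, 2.7, 250)   # about 1587 MB, reported as 1.5G
# current setting: two 5.0M log files against a 128.0M buffer pool
current_ratio = innodb_log_ratio_pct(5.0, 2, 128.0)    # 7.8125 %
# recommended setting: two 16M log files
target_ratio = innodb_log_ratio_pct(16.0, 2, 128.0)    # 25 %
```

The worst case of about 1.5G fits within the 3.7G of physical memory. Note also that the 134M key buffer is small compared with the roughly 1.3 GB of MyISAM indexes listed earlier, although only 29.6% of it is reported as used.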
Based on the data Nagios collects, I see no particular issue on the DB server:
DB-MEM.png
DB-CPU.png
I'll try to reduce the temp tables created on disk to avoid IOs.

Let me know if there is any other information I can provide to diagnose this issue.

Re: Performance issue

Posted: Fri Oct 20, 2017 12:57 pm
by dwasswa
Hi @lvaillant,

Let's try repairing the database and restarting some processes.

Please run the following commands to repair database:

Code: Select all

    cd /usr/local/nagiosxi/scripts
    ./repair_databases.sh
Run the following commands to restart the processes:

Code: Select all

service nagios stop
service ndo2db stop
service httpd restart
service ndo2db start
service nagios start
Also, could you please PM me your system profile?

Re: Performance issue

Posted: Mon Oct 23, 2017 1:45 am
by lvaillant
Hello

Code: Select all

# ./repair_databases.sh offloaded
DATABASE: nagios
TABLE:
nagios.nagios_acknowledgements                     OK
nagios.nagios_commands                             OK
nagios.nagios_commenthistory                       OK
nagios.nagios_comments                             OK
nagios.nagios_configfiles                          OK
nagios.nagios_configfilevariables                  OK
nagios.nagios_conninfo                             OK
nagios.nagios_contact_addresses                    OK
nagios.nagios_contact_notificationcommands         OK
nagios.nagios_contactgroup_members                 OK
nagios.nagios_contactgroups                        OK
nagios.nagios_contactnotificationmethods           OK
nagios.nagios_contactnotifications                 OK
nagios.nagios_contacts                             OK
nagios.nagios_contactstatus                        OK
nagios.nagios_customvariables                      OK
nagios.nagios_customvariablestatus                 OK
nagios.nagios_dbversion                            OK
nagios.nagios_downtimehistory                      OK
nagios.nagios_eventhandlers                        OK
nagios.nagios_externalcommands                     OK
nagios.nagios_flappinghistory                      OK
nagios.nagios_host_contactgroups                   OK
nagios.nagios_host_contacts                        OK
nagios.nagios_host_parenthosts                     OK
nagios.nagios_hostchecks                           OK
nagios.nagios_hostdependencies                     OK
nagios.nagios_hostescalation_contactgroups         OK
nagios.nagios_hostescalation_contacts              OK
nagios.nagios_hostescalations                      OK
nagios.nagios_hostgroup_members                    OK
nagios.nagios_hostgroups                           OK
nagios.nagios_hosts                                OK
nagios.nagios_hoststatus                           OK
nagios.nagios_instances                            OK
nagios.nagios_logentries                           OK
nagios.nagios_notifications                        OK
nagios.nagios_objects                              OK
nagios.nagios_processevents                        OK
nagios.nagios_programstatus                        OK
nagios.nagios_runtimevariables                     OK
nagios.nagios_scheduleddowntime                    OK
nagios.nagios_service_contactgroups                OK
nagios.nagios_service_contacts                     OK
nagios.nagios_service_parentservices               OK
nagios.nagios_servicechecks                        OK
nagios.nagios_servicedependencies                  OK
nagios.nagios_serviceescalation_contactgroups      OK
nagios.nagios_serviceescalation_contacts           OK
nagios.nagios_serviceescalations                   OK
nagios.nagios_servicegroup_members                 OK
nagios.nagios_servicegroups                        OK
nagios.nagios_services                             OK
nagios.nagios_servicestatus                        OK
nagios.nagios_statehistory                         OK
nagios.nagios_systemcommands                       OK
nagios.nagios_timedeventqueue                      OK
nagios.nagios_timedevents                          OK
nagios.nagios_timeperiod_timeranges                OK
nagios.nagios_timeperiods                          OK
Issued remote command 'mysqlcheck -f -r -u <user> -p<passwd> -h 11.1.18.117 --databases nagios'
DATABASE: nagiosql
TABLE:
nagiosql.tbl_command                               OK
nagiosql.tbl_contact                               OK
nagiosql.tbl_contactgroup                          OK
nagiosql.tbl_contacttemplate                       OK
nagiosql.tbl_domain                                OK
nagiosql.tbl_host                                  OK
nagiosql.tbl_hostdependency                        OK
nagiosql.tbl_hostescalation                        OK
nagiosql.tbl_hostextinfo                           OK
nagiosql.tbl_hostgroup                             OK
nagiosql.tbl_hosttemplate                          OK
nagiosql.tbl_info                                  OK
nagiosql.tbl_lnkContactToCommandHost               OK
nagiosql.tbl_lnkContactToCommandService            OK
nagiosql.tbl_lnkContactToContactgroup              OK
nagiosql.tbl_lnkContactToContacttemplate           OK
nagiosql.tbl_lnkContactToVariabledefinition        OK
nagiosql.tbl_lnkContactgroupToContact              OK
nagiosql.tbl_lnkContactgroupToContactgroup         OK
nagiosql.tbl_lnkContacttemplateToCommandHost       OK
nagiosql.tbl_lnkContacttemplateToCommandService    OK
nagiosql.tbl_lnkContacttemplateToContactgroup      OK
nagiosql.tbl_lnkContacttemplateToContacttemplate   OK
nagiosql.tbl_lnkContacttemplateToVariabledefinition OK
nagiosql.tbl_lnkHostToContact                      OK
nagiosql.tbl_lnkHostToContactgroup                 OK
nagiosql.tbl_lnkHostToHost                         OK
nagiosql.tbl_lnkHostToHostgroup                    OK
nagiosql.tbl_lnkHostToHosttemplate                 OK
nagiosql.tbl_lnkHostToVariabledefinition           OK
nagiosql.tbl_lnkHostdependencyToHost_DH            OK
nagiosql.tbl_lnkHostdependencyToHost_H             OK
nagiosql.tbl_lnkHostdependencyToHostgroup_DH       OK
nagiosql.tbl_lnkHostdependencyToHostgroup_H        OK
nagiosql.tbl_lnkHostescalationToContact            OK
nagiosql.tbl_lnkHostescalationToContactgroup       OK
nagiosql.tbl_lnkHostescalationToHost               OK
nagiosql.tbl_lnkHostescalationToHostgroup          OK
nagiosql.tbl_lnkHostgroupToHost                    OK
nagiosql.tbl_lnkHostgroupToHostgroup               OK
nagiosql.tbl_lnkHosttemplateToContact              OK
nagiosql.tbl_lnkHosttemplateToContactgroup         OK
nagiosql.tbl_lnkHosttemplateToHost                 OK
nagiosql.tbl_lnkHosttemplateToHostgroup            OK
nagiosql.tbl_lnkHosttemplateToHosttemplate         OK
nagiosql.tbl_lnkHosttemplateToVariabledefinition   OK
nagiosql.tbl_lnkServiceToContact                   OK
nagiosql.tbl_lnkServiceToContactgroup              OK
nagiosql.tbl_lnkServiceToHost                      OK
nagiosql.tbl_lnkServiceToHostgroup                 OK
nagiosql.tbl_lnkServiceToServicegroup              OK
nagiosql.tbl_lnkServiceToServicetemplate           OK
nagiosql.tbl_lnkServiceToVariabledefinition        OK
nagiosql.tbl_lnkServicedependencyToHost_DH         OK
nagiosql.tbl_lnkServicedependencyToHost_H          OK
nagiosql.tbl_lnkServicedependencyToHostgroup_DH    OK
nagiosql.tbl_lnkServicedependencyToHostgroup_H     OK
nagiosql.tbl_lnkServicedependencyToService_DS      OK
nagiosql.tbl_lnkServicedependencyToService_S       OK
nagiosql.tbl_lnkServiceescalationToContact         OK
nagiosql.tbl_lnkServiceescalationToContactgroup    OK
nagiosql.tbl_lnkServiceescalationToHost            OK
nagiosql.tbl_lnkServiceescalationToHostgroup       OK
nagiosql.tbl_lnkServiceescalationToService         OK
nagiosql.tbl_lnkServicegroupToService              OK
nagiosql.tbl_lnkServicegroupToServicegroup         OK
nagiosql.tbl_lnkServicetemplateToContact           OK
nagiosql.tbl_lnkServicetemplateToContactgroup      OK
nagiosql.tbl_lnkServicetemplateToHost              OK
nagiosql.tbl_lnkServicetemplateToHostgroup         OK
nagiosql.tbl_lnkServicetemplateToServicegroup      OK
nagiosql.tbl_lnkServicetemplateToServicetemplate   OK
nagiosql.tbl_lnkServicetemplateToVariabledefinition OK
nagiosql.tbl_lnkTimeperiodToTimeperiod             OK
nagiosql.tbl_logbook                               OK
nagiosql.tbl_mainmenu                              OK
nagiosql.tbl_service                               OK
nagiosql.tbl_servicedependency                     OK
nagiosql.tbl_serviceescalation                     OK
nagiosql.tbl_serviceextinfo                        OK
nagiosql.tbl_servicegroup                          OK
nagiosql.tbl_servicetemplate                       OK
nagiosql.tbl_session                               OK
nagiosql.tbl_session_locks                         OK
nagiosql.tbl_settings                              OK
nagiosql.tbl_submenu                               OK
nagiosql.tbl_timedefinition                        OK
nagiosql.tbl_timeperiod                            OK
nagiosql.tbl_user                                  OK
nagiosql.tbl_variabledefinition                    OK
Issued remote command 'mysqlcheck -f -r -u <user> -p<passwd> -h 11.1.18.117 --databases nagiosql'
DATABASE: nagiosxi
TABLE:
nagiosxi.xi_auditlog
note     : The storage engine for the table doesn't support repair
nagiosxi.xi_commands
note     : The storage engine for the table doesn't support repair
nagiosxi.xi_eventqueue                             OK
nagiosxi.xi_events
note     : The storage engine for the table doesn't support repair
nagiosxi.xi_incidents
note     : The storage engine for the table doesn't support repair
nagiosxi.xi_meta
note     : The storage engine for the table doesn't support repair
nagiosxi.xi_options
note     : The storage engine for the table doesn't support repair
nagiosxi.xi_sysstat
note     : The storage engine for the table doesn't support repair
nagiosxi.xi_usermeta
note     : The storage engine for the table doesn't support repair
nagiosxi.xi_users
note     : The storage engine for the table doesn't support repair
Issued remote command 'mysqlcheck -f -r -u <user> -p<passwd> -h 11.1.18.117 --databases nagiosxi'

=======================
nagios offloaded database repair succeeded
nagiosql offloaded database repair succeeded
nagiosxi offloaded database repair succeeded
The repair took some time on the following tables:
  • nagios.nagios_logentries: 10197026 rows
  • nagios.nagios_notifications: 6050226 rows
  • nagios.nagios_statehistory: 5507836 rows

Code: Select all

# service nagios stop
Stopping nagios (via systemctl):                           [  OK  ]
# service ndo2db stop
Stopping ndo2db (via systemctl):                           [  OK  ]
# service httpd restart
Redirecting to /bin/systemctl restart  httpd.service
# service ndo2db start
Starting ndo2db (via systemctl):                           [  OK  ]
# service nagios start
Starting nagios (via systemctl):                           [  OK  ]
# watch ipcs -q
A new message queue was created, and I dropped the old one with ipcrm.
I'm now waiting to see whether performance improves, but I have already tried this kind of repair and restart before.

Given the size of the listed tables, could the issue be linked to the hourly scheduled "optimize" operation?

Re: Performance issue

Posted: Mon Oct 23, 2017 6:50 am
by lvaillant
Still the same behavior and bad performance.
Nagvis unavailability is the most visible symptom...

Re: Performance issue

Posted: Mon Oct 23, 2017 2:02 pm
by tgriep
We have seen on some larger installs that an offloaded MySQL database may not be fast enough: network latency and the speed of the MySQL server can keep the kernel message queue from clearing out quickly enough.
In this case, moving the MYSQL database back to the Nagios server solved the issue.
Is that an option for you?

Re: Performance issue

Posted: Tue Oct 24, 2017 5:11 am
by lvaillant
Given the pattern of the graph, I don't think it's due to network latency.
It is too regular. It really looks like an hourly scheduled task.

I suspected a DB task, but I saw no optimize task or lock during a message queue peak...
Edit:

Code: Select all

MariaDB [nagios]> show processlist;
+----------+----------+----------+----------+------------+------+---------------------------------+------------------------------------------------------------------------------------------------------+----------+
| Id       | User     | Host     | db       | Command    | Time | State                           | Info                                                                                                 | Progress |
+----------+----------+----------+----------+------------+------+---------------------------------+------------------------------------------------------------------------------------------------------+----------+
| 30384911 | nagios   | xx:56392 | nagios   | Query      |  392 | Waiting for table metadata lock | INSERT INTO nagios_logentries SET instance_id='1', logentry_time=FROM_UNIXTIME(1508847960), entry_ti |    0.000 |
| 30392207 | nagiosxi | xx:54290 | nagiosxi | Sleep      |    2 |                                 | NULL                                                                                                 |    0.000 |
| 30392208 | nagios   | xx:54292 | nagios   | Sleep      |    2 |                                 | NULL                                                                                                 |    0.000 |
| 30392209 | nagiosql | xx:54294 | nagiosql | Sleep      |    3 |                                 | NULL                                                                                                 |    0.000 |
| 30392755 | nagios   | xx:56480 | nagios   | Query      |  400 | Sorting index                   | OPTIMIZE TABLE nagios_logentries                                                                     |    0.000 |
| 30393037 | nagiosxi | xx:57292 | nagiosxi | Sleep      |   10 |                                 | NULL                                                                                                 |    0.000 |
Optimizing nagios_logentries took more than 400s... More than 6min !
Just amazing.
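
To catch this automatically, the processlist output can be scanned for long-running statements and for sessions stuck behind them. A minimal sketch, assuming the tabular format that the mysql client prints above; the function names and the 60-second threshold are hypothetical, and a `|` character inside the Info column would break this simple split.

```python
from typing import Dict, List


def parse_processlist(text: str) -> List[Dict[str, str]]:
    """Turn the ASCII table printed by `show processlist` into a list of dicts."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip the +----+ borders and any prompt lines
        cells = [c.strip() for c in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first table row carries the column names
            continue
        rows.append(dict(zip(header, cells)))
    return rows


def long_running(rows: List[Dict[str, str]], min_seconds: int = 60) -> List[Dict[str, str]]:
    """Statements that have been executing for at least min_seconds."""
    return [r for r in rows
            if r["Command"] == "Query" and int(r["Time"]) >= min_seconds]


def waiting_on_metadata_lock(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Sessions blocked behind another statement's table metadata lock."""
    return [r for r in rows if "metadata lock" in r["State"]]
```

Run against the output above, `long_running` returns both the OPTIMIZE TABLE (400 s, about 6 min 40 s) and the INSERT into nagios_logentries (392 s), and `waiting_on_metadata_lock` returns that INSERT, which has been waiting almost as long as the OPTIMIZE has been running.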

Please refer to the next messages for more information.
As the Nagios XI server and its DB server are VMs, I will place them on the same ESX host to avoid network latency.

I'm also currently auditing the Nagios XI server itself.
httpd & ndo2db are the 2 services that consume CPU, memory and IOs.

I also noticed a lot of writes in the /usr/local directories where Nagios and its dependencies are installed.
Is there a way to reduce disk I/O?
As I'm using a ramdisk, can some files from /usr/local/nagios/var be moved to the /var/nagiosramdisk ramdisk without any impact on Nagios, Nagios XI, Nagvis and so on, as is already done for the objects.cache and status.dat files?