Offloading mysql - performance tuning - size of servers

junkertf · Post by **junkertf** » Mon Nov 12, 2018 8:53 am

Hello,

I need some guidance following on from the previous thread linked here (the advised filesystem sizes for XI, which you answered):

viewtopic.php?f=6&t=50810

We have a working NagiosXI environment with the following setup:

NagiosXI 5.5.2
ESXi-virtualised RHEL 7.5
Currently 8 vCores and 16GB RAM
MariaDB and NagiosXI on the same host
Host count around 1100 and growing
Service count around 18500 and growing (at least 2000-3000 SNMP service checks will come from our FC switch monitoring alone, and I have not counted the *nix and Windows checks that will also be added)
The host-to-service ratio is between 1:10 and 1:500! (The latter is for the SNMP checks against our 5-10 fabric switches)

Currently the load on NagiosXI is between 2.5 and 9, depending on whether we are making configuration changes.
During Apply Configuration the (global) CPU usage is around 55%, with around 10% system time. Apply Configuration takes about 1 minute before normal operation resumes (the ops center views and dashboards hang for that time). Reports run a bit slowly, also taking minutes.

I have read many docs on sizing and tuning, and we have decided to first move our MariaDB instance to a separate VM.

The target VMware configuration (for the Nagios and MariaDB servers) will be installed on Hitachi SSD storage in an HA environment (datastore latency approximately 5-10 ms).
We also know that at this size NagiosXI is better run on physical hardware, but we want to do that only as a last step.
I know that 5.5.6 has significant improvements for Apply Configuration, but if possible I would also like answers to the questions below.

So:
- How should we start with the MariaDB VM hardware specification (CPU? RAM? disk? how should we size the filesystems across the LVM partitions on the MariaDB server?) given the host and service counts above?
- Is there more detailed documentation on measuring and tuning the MariaDB (nagiosxi) instance, or a calculation method for scaling its requirements or performance (parallel CPU usage, thread usage, memory tuning, for example more resident memory for the MariaDB instances, and for which DB instance)?
- What is your advice on our current NagiosXI environment: is it over- or under-provisioned on the hardware side (more or less CPU/RAM)?

Thank you for your time and any help,

Best regards,
Ferenc

ssax · Post by **ssax** » Mon Nov 12, 2018 4:42 pm

Generally at 10K total combined host/service checks we recommend that you set up a RAMDisk, and at around 20K we recommend you start looking at adding an additional XI server, because a single server can only process so much. This may come sooner or later than 20K depending on what type of checks you are running, how many resources they use, your hardware speed, and what you're doing to mitigate the impact.
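
As a rough illustration of those two guideline numbers, here is a minimal sketch. The function name `xi_scaling_hint` is made up for this example, and the 10000/20000 cut-offs are approximate guidance, not hard limits.

```shell
# Sketch: map a combined host + service count to the rough guidance above.
# xi_scaling_hint is a hypothetical helper; the cut-offs are approximate.
xi_scaling_hint() {
    xi_total=$(( $1 + $2 ))
    if [ "$xi_total" -ge 20000 ]; then
        echo "total=$xi_total: consider an additional XI server"
    elif [ "$xi_total" -ge 10000 ]; then
        echo "total=$xi_total: set up a RAMDisk"
    else
        echo "total=$xi_total: below the RAMDisk guideline"
    fi
}
```

With the numbers from the first post, `xi_scaling_hint 1100 18500` prints `total=19600: set up a RAMDisk`, which is close enough to 20K that the second recommendation is worth planning for as well.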

You can read more about setting up a RAMDisk here:

https://assets.nagios.com/downloads/nag ... giosXI.pdf

You should run this check profiler script to see which long-running checks you have. They consume resources the whole time they are running, so reducing them helps a lot:

https://exchange.nagios.org/directory/P ... me/details

The next step would be to look at offloading the checks using mod_gearman to reduce the impact on the XI server; this would be my recommendation for adding more services and alleviating the system issues. There's just so much going on with around 20K checks that you will need to do what you can to mitigate the impact. Please see here for more information:

https://assets.nagios.com/downloads/nag ... ios_XI.pdf
https://support.nagios.com/kb/article.php?id=484

Please read through this doc as well. With the number of checks you are running, though, I would leave the DB local at this point in time, because that many checks require a lot of throughput to the DB (if you do offload it, enabling jumbo frames is recommended):

https://assets.nagios.com/downloads/nag ... ios-XI.pdf

You can only do so much on a single server. Do what you can to mitigate the impact, but start looking at adding another XI server soon if you continue to experience load/performance issues after the mitigation.

- How should we start with the MariaDB VM hardware specification (CPU? RAM? disk? how should we size the filesystems across the LVM partitions on the MariaDB server?) given the host and service counts above?

I would recommend that you start with at least 2-4 CPUs, 8-16GB RAM, and 100GB of disk space. Make sure to look at enabling jumbo frames across your network path/VM if you offload the DB with that many checks. When you restart the nagios service it goes through and updates all of the objects in the DB, and with that many DB entries going across the network it requires a lot of throughput to update them fast enough.
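
One way to check that jumbo frames really work end to end is a ping with fragmentation prohibited. The sketch below only builds the command line and does not run it. It assumes Linux iputils `ping`, IPv4, and a 9000-byte MTU. The helper names and the host `db.example.internal` are placeholders, not anything from your environment.

```shell
# Largest ICMP echo payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes (IPv4 header without options) minus 8 bytes (ICMP header).
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Print (do not run) an iputils ping command that sets "don't fragment"
# (-M do) and sends 3 probes of the computed size to the given host.
# Usage: jumbo_ping_cmd HOST MTU
jumbo_ping_cmd() {
    echo "ping -M do -c 3 -s $(icmp_payload_for_mtu "$2") $1"
}

# Example (hypothetical host name):
#   jumbo_ping_cmd db.example.internal 9000
```

If the printed command is run from the XI server against the DB VM and the replies come back, the path carries 9000-byte frames. An error about the message being too long would suggest some hop is still at a smaller MTU.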

- Is there more detailed documentation on measuring and tuning the MariaDB (nagiosxi) instance, or a calculation method for scaling its requirements or performance (parallel CPU usage, thread usage, memory tuning, for example more resident memory for the MariaDB instances, and for which DB instance)?

We have no guide on this specifically. A lot of people use mysqltuner.pl, but you would want to talk over any changes that you plan to make with your DBA/DB team.

https://github.com/major/MySQLTuner-perl

- What is your advice on our current NagiosXI environment: is it over- or under-provisioned on the hardware side (more or less CPU/RAM)?

Sounds pretty decent currently but you can PM me a copy of your profile so that I can get a better look at your system specs and your configuration to give you a better idea.

Let me know if you have any questions or if I can clarify anything.

junkertf · Post by **junkertf** » Tue Nov 13, 2018 9:03 am

Hello,

I have PM'ed you our profile and mysqltuner output, along with some questions.

I am also trying to find out where to adjust the parameters described in the "Maximizing performance" documentation.

Next I must test the "Profiler for plugin execution" documentation!
At first sight, RAMDisk usage looks good on our test system.

Thank you, best regards,

Ferenc

ssax · Post by **ssax** » Tue Nov 13, 2018 3:10 pm

Based on your profile it looks like you have too many kernel message queues for nagios; you should only have one:

Code: Select all

[XXX@XXX ~]$ ipcs -q

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages    
0xa2000002 294912     nagios     600        0            0           
0x1f000002 327681     nagios     600        1024         1           
0x6e000002 360450     nagios     600        0            0           
0x70000002 2392067    nagios     600        0            0

Please run these commands to fix it:

Code: Select all

service nagios stop
service ndo2db stop
pkill -9 nagios
killall -9 nagios
for i in `ipcs -q | grep nagios |awk '{print $2}'`; do ipcrm -q $i; done
service ndo2db start
service nagios start
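
To confirm the result afterwards, the queues owned by nagios can be counted. This is a small sketch: `count_queues` is a hypothetical helper name, and it assumes the `ipcs -q` column layout shown above (key, msqid, owner, ...).

```shell
# Count message queues owned by a given user in `ipcs -q` output on stdin.
# Only rows whose first column looks like a key (0x...) are counted, so the
# banner and header lines are ignored.
count_queues() {
    awk -v owner="$1" '$3 == owner && $1 ~ /^0x/ { n++ } END { print n + 0 }'
}

# Example:
#   ipcs -q | count_queues nagios
```

Against the listing above this prints 4. After the cleanup and restart it should print 1.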

---

Here are my general recommendations based on your profile:

1. I recommend that you upgrade to XI 5.5.7 when you can as there are a number of performance improvements that have been made.

2. Go to Admin > Performance Settings > Databases tab and set all 3 Optimize Intervals to 300. On your larger system this helps prevent tables from crashing, which can happen when one optimize hasn't finished before the next one starts.

3. Your logs show that you are hitting the performance data load_threshold and TIMEOUT. You shouldn't see an impact, but I recommend the changes below for your larger system:

a.) Edit your /usr/local/nagios/etc/pnp/npcd.cfg and change load_threshold = 10.0 to load_threshold = 40.0.

b.) Edit your /usr/local/nagios/etc/pnp/process_perfdata.cfg and change TIMEOUT = 5 to TIMEOUT = 20.

Then restart NPCD:

Code: Select all

service npcd restart
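
If you prefer to script the two edits in step 3 rather than make them by hand, something like the following could work. It is only a sketch: `set_cfg_value` is a hypothetical helper, it assumes each key already exists on its own `KEY = value` line, and it keeps a `.bak` copy next to the file. Run it before the NPCD restart.

```shell
# set_cfg_value FILE KEY VALUE
# Rewrite an existing "KEY = old" line to "KEY = VALUE" in place, keeping
# the original spacing around "=" and saving a backup as FILE.bak.
# Lines for other keys, and files that lack KEY, are left unchanged.
set_cfg_value() {
    sed -i.bak -E "s|^([[:space:]]*$2[[:space:]]*=[[:space:]]*).*|\1$3|" "$1"
}

# Example, using the paths and values from step 3:
#   set_cfg_value /usr/local/nagios/etc/pnp/npcd.cfg load_threshold 40.0
#   set_cfg_value /usr/local/nagios/etc/pnp/process_perfdata.cfg TIMEOUT 20
```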

4. To limit disk IO issues, implement a RAMDisk as we talked about and edit your /usr/local/nagios/etc/nagios.cfg and change use_syslog=1 to use_syslog=0 and restart the nagios service:

Code: Select all

service nagios restart

That setting logs to /var/log/messages in addition to /usr/local/nagios/var/nagios.log. It's duplicate logging and can slow down the system; any mitigation helps and you should strive to do what you can.

Once you do that stuff, look at implementing mod_gearman to offload the checks to free up resources on the XI server.

---

Your mysqltuner output is pretty normal and I don't have any recommendations on what to change at this point in time. It's something that takes trial and error, and it is not something that we generally get into as the defaults are generally okay. Changing settings can sometimes have a detrimental impact if you're not sure of what they should be, so I hesitate to recommend any modifications without consulting with a DBA first. You may want to spin up a test server and load test any changes you make to make sure that they have a positive instead of negative impact on the mysql processing speed.

junkertf · Post by **junkertf** » Wed Nov 14, 2018 1:35 am

Hello,

Thank you for all the advice!
I need approximately a week, or a few days more, to test the settings.
Please keep the case open until then if possible.

Thank you, best regards,

Ferenc

ssax · Post by **ssax** » Wed Nov 14, 2018 3:54 pm

Sounds good, we'll leave it open, let us know how it goes.

Thank you

junkertf · Post by **junkertf** » Thu Dec 13, 2018 8:57 am

Hello,

A little update on the current situation:

After the patch the load went down from an average of 3.6 to 3.2 on our PROD environment.

I am currently reading the Mod-Gearman documentation and trying to find a good way to implement it.
Is there good documentation or a use case describing how a proper HA solution should work?

Thanks for all the help, best regards,

Ferenc

ssax · Post by **ssax** » Thu Dec 13, 2018 5:53 pm

All of the related documentation for gearman I already posted in my second post.

Are you asking about HA for gearman or XI?

There is no built-in HA functionality for XI. Most people I see roll their own or use DRBD/heartbeat/pacemaker; LINBIT has an excellent guide on this. Here is the doc for that:

https://assets.nagios.com/downloads/nag ... ios-XI.pdf

I don't have any info on HA gearman though.

junkertf · Post by **junkertf** » Wed Jan 09, 2019 10:19 am

Hello,

I hope everything is going well, and I wish you and the team a happy new year!
A quick update on the current state of this case.

I have set up a Mod-Gearman server and worker on our XI server and a separate worker on a standalone server.
This is a step back to Core version 4.2.4 (5.4.13).

I now want to check that this solution works. I configured some hosts and workgroups on the standalone server.

[root@myhuslhqbpmodgmuat mod_gearman2]# grep -v ^# /etc/mod_gearman2/worker.conf | grep -v ^$
server=OURXITESTSERVERIP:4730
eventhandler=no
services=no
hosts=no
hostgroups=HG_NIX_LINUX_TEST,HG_NIX_LINUX_TEST_AppSrvs,HG_NIX_LINUX_PROD,HG_NIX_LINUX_MAILSRVs
servicegroups=SG_BB_NIX_LINUX_MAILqueue_size,SG_BB_NIX_LINUX_NTPd,SG_BB_NIX_LINUX_TEST_MorningCHK,SG_BB_NIX_Linux_AppSrvFSChk
encryption=yes
key=key
...

Our XI-side worker config:
[root@myhuslhqbpxisrvuat libexec]# grep -v ^# /etc/mod_gearman2/worker.conf | grep -v ^$
debug=0
logfile=/var/log/mod_gearman2/mod_gearman_worker.log
server=localhost:4730
eventhandler=yes
services=yes
hosts=yes
encryption=yes
key=key
...

Our XI-side server module config looks like this:
[root@myhuslhqbpxisrvuat libexec]# grep -v ^# /etc/mod_gearman2/module.conf | grep -v ^$
debug=0
logfile=/var/log/mod_gearman2/mod_gearman_neb.log
server=localhost:4730
eventhandler=yes
services=yes
hosts=yes
do_hostchecks=yes
route_eventhandler_like_checks=no
encryption=yes
key=key
...

I want to check that gearman is working, because the standalone server's worker log shows the following output:
[2019-01-09 16:03:34][11273][INFO ] no checks in 2minutes, restarting all workers
[2019-01-09 16:05:35][11273][INFO ] no checks in 2minutes, restarting all workers
[2019-01-09 16:07:36][11273][INFO ] no checks in 2minutes, restarting all workers
[2019-01-09 16:09:37][11273][INFO ] no checks in 2minutes, restarting all workers
[2019-01-09 16:11:38][11273][INFO ] no checks in 2minutes, restarting all workers
[2019-01-09 16:13:39][11273][INFO ] no checks in 2minutes, restarting all workers

So I tried to monitor the worker with the following command from the worker machine, but I don't see any checks going to it:
#gearman_top2 -H 3.193.254.1:4730 -i 1
2019-01-09 16:15:15 - 3.193.254.1:4730 - v0.33

Queue Name | Worker Available | Jobs Waiting | Jobs Running
-----------------------------------------------------------------------------------------------
check_results | 1 | 0 | 0
eventhandler | 8 | 0 | 0
host | 8 | 0 | 0
hostgroup_HG_NIX_LINUX_MAILSRVs | 5 | 0 | 0
hostgroup_HG_NIX_LINUX_PROD | 5 | 0 | 0
hostgroup_HG_NIX_LINUX_TEST | 5 | 0 | 0
hostgroup_HG_NIX_LINUX_TEST_AppSrvs | 5 | 0 | 0
localhost | 0 | 1 | 0
myhuslhqbpngmgu | 0 | 1 | 0
service | 8 | 0 | 3
servicegroup_SG_BB_NIX_LINUX_MAILqueue_size | 5 | 0 | 0
servicegroup_SG_BB_NIX_LINUX_NTPd | 5 | 0 | 0
servicegroup_SG_BB_NIX_LINUX_TEST_MorningCHK | 5 | 0 | 0
servicegroup_SG_BB_NIX_Linux_AppSrvFSChk | 5 | 0 | 0
worker_myhuslhqbpinagt.hu.money.ge.com | 1 | 0 | 0
worker_myhuslhqbpngmgu.hu.money.ge.com | 0 | 0 | 0
-----------------------------------------------------------------------------------------------

I also tried the following checks from our XI server:
[root@myhuslhqbpxisrvuat libexec]# ./check_gearman2 -H localhost -q myhuslhqbpngmgu -t 60 -s check
check_gearman CRITICAL - Job failed: _client_do(GEARMAN_TIMEOUT) occured during gearman_client_run_tasks() -> libgearman/client.cc:174
[root@myhuslhqbpxisrvuat libexec]# ./check_gearman2 -H localhost -q localhost -t 3 -s check
check_gearman CRITICAL - Job failed: _client_do(GEARMAN_TIMEOUT) occured during gearman_client_run_tasks() -> libgearman/client.cc:174

I am trying to route the service checks so that only the checks of the specified hostgroups go to the standalone worker, because many of our checks use the NCPA agent, and I don't know whether it is possible to do it this way...

Can you give me some clues on how to move forward?

Thank you, best regards,

Ferenc

ssax · Post by **ssax** » Wed Jan 09, 2019 2:45 pm

Please change your external worker (myhuslhqbpmodgmuat) /etc/mod_gearman2/worker.conf to:

Code: Select all

eventhandler=yes
services=yes
hosts=yes
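
Before restarting, a quick sanity check that the three keys really are set as intended might look like this. It is a sketch only: `worker_roles_not_enabled` is a hypothetical helper, and it assumes plain `key=value` lines with no spaces around `=`, as in the config you posted.

```shell
# Print each of eventhandler/services/hosts that is not set to "yes" in a
# worker.conf-style file. Comment lines are ignored, and a key that does
# not appear at all is reported as "unset". No output means all three are
# enabled.
worker_roles_not_enabled() {
    awk -F= '
        /^[[:space:]]*#/ { next }
        $1 == "eventhandler" || $1 == "services" || $1 == "hosts" { val[$1] = $2 }
        END {
            n = split("eventhandler services hosts", keys, " ")
            for (i = 1; i <= n; i++) {
                k = keys[i]
                if (!(k in val))
                    print k "=unset"
                else if (val[k] != "yes")
                    print k "=" val[k]
            }
        }
    ' "$1"
}

# Example:
#   worker_roles_not_enabled /etc/mod_gearman2/worker.conf
```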

Then restart the worker (on myhuslhqbpmodgmuat):

Code: Select all

systemctl restart mod-gearman2-worker

See if that works for you.
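
When you look at the gearman_top2 table again, one thing worth filtering for is queues that have jobs waiting but no worker available. Here is a rough sketch that works on the table text (for example saved to a file). `starved_queues` is a hypothetical helper, and it assumes the four pipe-separated columns shown in your post.

```shell
# Read gearman_top2-style rows ("name | workers | waiting | running") on
# stdin and print the names of queues with waiting jobs and zero workers.
# The header row and the dashed separator lines do not match and are skipped.
starved_queues() {
    awk -F'|' '
        NF == 4 && $2 + 0 == 0 && $3 + 0 > 0 {
            gsub(/[[:space:]]/, "", $1)
            print $1
        }
    '
}

# Example (gearman_top.txt is a hypothetical file holding the saved table):
#   starved_queues < gearman_top.txt
```

On the table you posted, this prints `localhost` and `myhuslhqbpngmgu`, the same two queue names used in the check_gearman2 commands that timed out.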

Did you follow this guide as well?

https://support.nagios.com/kb/article.php?id=484

If you follow that guide, the Host groups and Service groups section at the top has you create the groups that should be EXCLUDED from running through gearman. If you do that, once it's up and running just disable the worker service ON THE XI SERVER, since it will not be needed:

Code: Select all

systemctl stop mod-gearman2-worker
systemctl disable mod-gearman2-worker
