Removing services takes a long time

developerx9
Posts: 20
Joined: Sun Jun 05, 2016 9:07 am

Removing services takes a long time

Post by developerx9 »

Hi all,

We have automated our Nagios configuration. Every time an application or service is removed from a server, the automation removes all of that server's services, creates the new template, and reloads Nagios
(since we can't remove only particular services with the nagiosql_delete_service.php script).

The thing is, we have 11k checks. Removing all services (15 of them) from 5 different hosts takes a lot of time.

I noticed that the transfer speed is way too low:

Code:

CMDLINE:
/usr/bin/wget --load-cookies=nagiosql.cookies http://localhost/nagiosxi/includes/components/ccm/ --no-check-certificate --post-data 'type=service&cmd=delete&id=15042' -O nagiosql.delete.service
--2016-10-06 08:29:16--  http://localhost/nagiosxi/includes/components/ccm/
Resolving localhost (localhost)... 127.0.0.1
Connecting to localhost (localhost)|127.0.0.1|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘nagiosql.delete.service’

[ <=> ] 101,592 34.5KB/s in 2.9s

2016-10-06 08:29:19 (34.5 KB/s) - ‘nagiosql.delete.service’ saved [101592]

Is this an issue with the database? I have optimized MySQL, but it appears to have made no difference.
What is strange is that this is sometimes very quick, and sometimes it takes 20 minutes to remove 10 hosts and add them again.

Is there something I'm missing? Should I remove some /tmp files first? Reload MySQL? Re-index it?
It certainly has something to do with the size of the database, as this works perfectly with 10-15 hosts. Now that we have over 700 of them, the issue above appears.

Any suggestions are welcome.
Last edited by tmcdonald on Thu Oct 06, 2016 10:54 am, edited 1 time in total.
Reason: Please use [code][/code] tags around long output
dwhitfield
Former Nagios Staff
Posts: 4583
Joined: Wed Sep 21, 2016 10:29 am
Location: NoLo, Minneapolis, MN

Re: Removing services takes a long time

Post by dwhitfield »

Just for clarity, are you using https://assets.nagios.com/downloads/nag ... gement.pdf for your automation configuration?

Alternatively, are you using the REST API? Something else?
avandemore
Posts: 1597
Joined: Tue Sep 27, 2016 4:57 pm

Re: Removing services takes a long time

Post by avandemore »

The first thing to do is identify the bottleneck. While such a delete operation is in progress, try running top -bcn 1. We are looking for high consumers of any resource, but CPU hogs generally come from some underlying process bottleneck. If you watch the output of top for a while, does the information about swap usage change? What is the output of df -h?
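As a rough sketch of how the checks above could be captured in one place while a delete is running: the helper name snapshot_resources and the use of free -m for the swap figures are assumptions added for illustration, not something supplied with Nagios XI.

```bash
#!/bin/bash
# snapshot_resources OUTFILE   (hypothetical helper name)
# Appends one timestamped snapshot of CPU, memory/swap, and disk usage to OUTFILE.
snapshot_resources() {
    local out=$1
    {
        echo "=== $(date '+%Y-%m-%d %H:%M:%S') ==="
        echo "--- top -bcn 1 (first 20 lines) ---"
        # top and free are skipped quietly if they are not installed.
        if command -v top >/dev/null 2>&1; then top -bcn 1 | head -n 20; fi
        echo "--- free -m ---"
        if command -v free >/dev/null 2>&1; then free -m; fi
        echo "--- df -h ---"
        df -h
    } >>"$out" 2>&1
}
```

Calling snapshot_resources /tmp/delete-snapshots.log every few seconds from a second terminal while the deletes run should show whether swap usage or disk space moves during the slow periods.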

Also can you expand on specifically what you did to optimize mysql?
Previous Nagios employee
developerx9
Posts: 20
Joined: Sun Jun 05, 2016 9:07 am

Re: Removing services takes a long time

Post by developerx9 »

Hello,

We use the command-line service deletion script to automate removing and adding services and/or hosts.
The server has plenty of RAM and CPU free (16 GB RAM, 8 CPUs, which is overkill, I know).

Since we're calling PHP scripts, and they are directly or indirectly writing to the database, I immediately thought this was the issue and tried to optimize MySQL a bit
(changing join buffer size, temp table size, key buffers, etc.).

Also, I enabled slow query logging and only found 2 entries:

# tail -f /var/lib/mysql/mysql-slow.log | grep -ve time -ve \#
use nagios;
SELECT /*!40001 SQL_NO_CACHE */ * FROM `nagios_logentries`;
SELECT /*!40001 SQL_NO_CACHE */ * FROM `nagios_notifications`;

Neither of which is the issue here, I would say.

Has anyone reported anything similar? Last night our cron job finished reapplying 3 servers (removing them, then adding them again) in 17 seconds, which is awesome. Usually it takes a couple of minutes.
dwhitfield
Former Nagios Staff
Posts: 4583
Joined: Wed Sep 21, 2016 10:29 am
Location: NoLo, Minneapolis, MN

Re: Removing services takes a long time

Post by dwhitfield »

Are the scripts you are using the ones mentioned in https://assets.nagios.com/downloads/nag ... gement.pdf?

Have you taken a look at https://assets.nagios.com/downloads/nag ... ios-XI.pdf?

What version of XI are you using? 5.3.0 was released earlier this week and there were a lot of improvements to the API, so this may be a better option for you going forward.
developerx9
Posts: 20
Joined: Sun Jun 05, 2016 9:07 am

Re: Removing services takes a long time

Post by developerx9 »

So, what is happening is that I wrote a script called "nuke_everything" that basically removes every service and host (so I don't have to restore from a backup and potentially add multiple contacts by hand again).
It looks something like this:

Code:

#!/bin/bash
# nuke_everything: delete every service and host definition except localhost.
SCRIPTDIR=/usr/local/nagiosxi/scripts/

cd "$SCRIPTDIR" || exit 1

# Config names are the .cfg file names with the directory and extension stripped.
for SERVICE in \
	$(find /usr/local/nagios/etc/services/ -type f -name "*.cfg" \
		| grep -v localhost \
		| sed -e 's|.*/||' -e 's|\.cfg$||');
do
	echo "$SERVICE"
	./nagiosql_delete_service.php --config="$SERVICE"
done
./reconfigure_nagios.sh

# Same again for the host definitions.
for HOST in \
	$(find /usr/local/nagios/etc/hosts/ -type f -name "*.cfg" \
		| grep -v localhost \
		| sed -e 's|.*/||' -e 's|\.cfg$||');
do
	./nagiosql_delete_host.php --host="$HOST"
done
./reconfigure_nagios.sh
When I comment out "reconfigure_nagios", this removes around 11k checks and 700 hosts in a few minutes, which is awesome.
Once all services are removed and reconfigure_nagios has run, everything slows down so much that it takes almost 1 minute to remove 1 host and its 10 checks, which is silly.

If I run the script over and over again without running "reconfigure_nagios" (so nothing gets applied), it's always fast.

So I'm guessing something happens after running reconfigure_nagios that affects the speed.
We have two potential candidates (indirectly) responsible for this:

./nagiosql_importall.php
./restart_nagios_with_export.sh

These call a bunch of other scripts that call a bunch of other scripts, so troubleshooting this will take a bit of time. Do you have any idea what's happening?
I tried re-indexing the nagios/nagiosql tables and optimizing MySQL altogether, to no avail. There are zero slow queries, so I have no idea what the hell is going on.

Is reconfigure_nagios causing PHP scripts to have a harder time accessing the DB for a while? Does it create a 'checksum' file so every other delete_service.php has to go through a bunch of sh*t before applying changes?

I even went so far as to put /var/lib/mysql on tmpfs so that it runs from memory, and everything is still awfully slow. So that rules out MySQL.
Again, my guess would be that running "reconfigure_nagios" writes something to somewhere that delete_service.php and delete_host.php have to check on the next run, and that is taking a lot of time.
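
To put numbers on the before/after difference, each delete call could be timed individually. This is only a sketch: the helper name timed is made up for illustration, and it assumes GNU date (for the %N nanosecond field), which is what Linux distributions normally ship.

```bash
#!/bin/bash
# timed LABEL COMMAND [ARGS...]   (hypothetical helper name)
# Runs COMMAND, discards its output, and prints "LABEL <elapsed> ms".
# Assumes GNU date, which supports %N (nanoseconds).
timed() {
    local label=$1
    shift
    local start end
    start=$(date +%s%N)
    "$@" >/dev/null 2>&1
    end=$(date +%s%N)
    # Convert the nanosecond difference to whole milliseconds.
    echo "$label $(( (end - start) / 1000000 )) ms"
}
```

Inside the loop this would look like timed "$SERVICE" ./nagiosql_delete_service.php --config="$SERVICE". Comparing the output of a run before reconfigure_nagios with one right after it would show whether every call gets slower or only some of them.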

Any ideas? This is a huge blocker for us.
Last edited by tmcdonald on Tue Oct 18, 2016 2:37 pm, edited 1 time in total.
Reason: Please use [code][/code] tags around code output
avandemore
Posts: 1597
Joined: Tue Sep 27, 2016 4:57 pm

Re: Removing services takes a long time

Post by avandemore »

You can edit the shebang in /usr/local/nagiosxi/scripts/reconfigure_nagios.sh to #!/bin/bash -x. Then rerun your script to see where the bottleneck occurs. Rinse and repeat for other shell scripts. It would be wise to make a backup before editing anything, and to restore scripts to their original states when done.
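
If editing the shebang is not desirable, a similar trace can be produced from the outside by running the script under bash -x and prefixing each line with a timestamp. This is a minimal sketch; the function name trace_with_timestamps is hypothetical. Note that scripts started by the traced script run as separate processes and are not traced this way.

```bash
#!/bin/bash
# trace_with_timestamps SCRIPT [ARGS...]   (hypothetical helper name)
# Runs SCRIPT under "bash -x" and prefixes every output and trace line
# with the wall-clock time at which it was read.
trace_with_timestamps() {
    local script=$1
    shift
    bash -x "$script" "$@" 2>&1 |
    while IFS= read -r line; do
        printf '%s %s\n' "$(date '+%H:%M:%S')" "$line"
    done
}
```

For example, trace_with_timestamps /usr/local/nagiosxi/scripts/reconfigure_nagios.sh > /tmp/reconfigure.trace, then look for the places where the timestamp jumps between two consecutive lines.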
Previous Nagios employee
developerx9
Posts: 20
Joined: Sun Jun 05, 2016 9:07 am

Re: Removing services takes a long time

Post by developerx9 »

Hello,

The issue is not with reconfigure_nagios itself, as that finishes pretty quickly.
The issue is that nagiosql_delete_service.php and nagiosql_delete_host.php take a long time after reconfigure_nagios has been executed.

So nagiosql_delete_service.php and nagiosql_delete_host.php are the bottleneck here.

Example:

URL: http://localhost/nagiosxi/includes/components/ccm/
CMDLINE:
/usr/bin/wget --load-cookies=nagiosql.cookies http://localhost/nagiosxi/includes/components/ccm/ --no-check-certificate --post-data 'type=service&cmd=delete&id=18230' -O nagiosql.delete.service
--2016-10-17 09:43:23-- http://localhost/nagiosxi/includes/components/ccm/
Resolving localhost (localhost)... 127.0.0.1
Connecting to localhost (localhost)|127.0.0.1|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘nagiosql.delete.service’
[ <=> ] 101,127 36.2KB/s in 2.7s


So here it took 2.7 seconds to remove a single service. If there are 10 services per host, that's 27 seconds per host.
(Multiply that by 700 hosts and you get a lot of seconds. To be precise, more than 5 hours!) That's ridiculous.
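
The estimate above can be checked with a quick calculation. The function name estimate_total is just for illustration; the inputs are the 2.7 seconds per delete, 10 services per host, and 700 hosts mentioned in this post.

```bash
#!/bin/bash
# estimate_total SECONDS_PER_DELETE SERVICES_PER_HOST HOSTS   (hypothetical helper name)
# Prints the total time for deleting every service one call at a time.
estimate_total() {
    awk -v t="$1" -v s="$2" -v h="$3" 'BEGIN {
        total = t * s * h
        printf "%.0f seconds (%.2f hours)\n", total, total / 3600
    }'
}

estimate_total 2.7 10 700   # prints: 18900 seconds (5.25 hours)
```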


Again, nagiosql_delete_service.php is ONLY slow AFTER reconfigure_nagios is executed. What is reconfigure_nagios doing that is causing nagiosql_delete_service.php to slow down?
avandemore
Posts: 1597
Joined: Tue Sep 27, 2016 4:57 pm

Re: Removing services takes a long time

Post by avandemore »

Your mysql database is slow. This can be caused by a multitude of reasons. Does your setup have the DB offloaded? Any other tuning done to the system? Can you use the mysql wizard to monitor the db? This document may also provide some information.

What is the output from # mysql -h localhost -p -e 'SELECT table_schema "Data Base Name", sum( data_length + index_length ) / 1024 / 1024 "Data Base Size in MB", sum( data_free )/ 1024 / 1024 "Free Space in MB" FROM information_schema.TABLES GROUP BY table_schema ; '

Default password is nagiosxi
Previous Nagios employee
developerx9
Posts: 20
Joined: Sun Jun 05, 2016 9:07 am

Re: Removing services takes a long time

Post by developerx9 »

Hello,

The slow query log has no entries (I've set the threshold to 1 second).

Before running 'reconfigure_nagios':

# mysql -h localhost -p -e 'SELECT table_schema "Data Base Name", sum( data_length + index_length ) / 1024 / 1024 "Data Base Size in MB", sum( data_free )/ 1024 / 1024 "Free Space in MB" FROM information_schema.TABLES GROUP BY table_schema ; '
Enter password:
+--------------------+----------------------+------------------+
| Data Base Name     | Data Base Size in MB | Free Space in MB |
+--------------------+----------------------+------------------+
| information_schema |           0.07031250 |       0.00000000 |
| mysql              |           0.62756157 |       0.00027847 |
| nagios             |         718.49285698 |       0.01412964 |
| nagiosql           |           2.72396183 |       0.03003693 |
| nagiosxi           |         182.31250000 |     387.00000000 |
| performance_schema |           0.00000000 |       0.00000000 |
+--------------------+----------------------+------------------+


After running 'reconfigure_nagios':

# mysql -h localhost -p -e 'SELECT table_schema "Data Base Name", sum( data_length + index_length ) / 1024 / 1024 "Data Base Size in MB", sum( data_free )/ 1024 / 1024 "Free Space in MB" FROM information_schema.TABLES GROUP BY table_schema ; '
Enter password:
+--------------------+----------------------+------------------+
| Data Base Name     | Data Base Size in MB | Free Space in MB |
+--------------------+----------------------+------------------+
| information_schema |           0.07031250 |       0.00000000 |
| mysql              |           0.62756157 |       0.00027847 |
| nagios             |         718.86934376 |       1.65900898 |
| nagiosql           |           2.72396183 |       0.03003693 |
| nagiosxi           |         182.31250000 |     387.00000000 |
| performance_schema |           0.00000000 |       0.00000000 |
+--------------------+----------------------+------------------+

I don't see much of a difference in the database before and after.

Furthermore, while running delete_service.php I'm watching the queries with "watch -n.1 'mysqladmin proc'" and don't see any queries taking longer than they should.

Another thing worth mentioning: when I leave things alone for an hour or two and then run "delete_service.php" again, everything is super fast again. So is there a temp table that 'reconfigure_nagios' creates that is causing this?