
Nagios Queueing (benjaminsmith, tgriep)

Posted: Thu Sep 03, 2020 1:01 am
by FCC_Nagios_Support
Dear sirs.

ipcs -q shows a lot of pending messages.
The problem is intermittent. We restart the services, remove the *.lock files and even reboot, but the queue stays blocked and only frees up hours later.
We have also opened a ticket with the following information:

1)
Hi,

Our system is not processing.
Here is the screen output of these commands:

top - 15:09:34 up 16 min, 3 users, load average: 4.15, 2.63, 1.82
Tasks: 608 total, 4 running, 604 sleeping, 0 stopped, 0 zombie
%Cpu(s): 12.1 us, 6.4 sy, 0.0 ni, 71.7 id, 9.3 wa, 0.0 hi, 0.5 si, 0.0 st
KiB Mem : 24504324 total, 17058208 free, 3357628 used, 4088488 buff/cache
KiB Swap: 8388604 total, 8388604 free, 0 used. 20192232 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
117472 mysql 20 0 22.4g 1.8g 8996 S 170.8 7.7 3:33.98 mysqld

Every 2.0s: ipcs -q Thu Aug 27 15:10:41 2020


------ Message Queues --------
key msqid owner perms used-bytes messages
0x8b010080 131072 nagios 600 137211904 133996

Can you help us?

1-Answer)

You have higher IO wait:

9.3 wa

Which will cause the CPU to spike because it's waiting on storage:

117472 mysql 20 0 22.4g 1.8g 8996 S 170.8 7.7 3:33.98 mysqld

That could indicate slow storage, slow VM host to VM storage speed, crashed database tables, or a bug. The information below will help me determine which it is.
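
To track the IO wait figure over time rather than reading it off a single screenshot, a small helper along these lines could pull the `wa` value out of the `%Cpu(s)` line of top's batch output. This is only a sketch: the function name `iowait_percent` is made up for illustration, and the parsing assumes the line layout shown in the top output above (comma-separated fields, dot as decimal separator).

```shell
#!/usr/bin/env bash
# Print the IO wait percentage found in top output read from stdin.
# Assumes a summary line like:
#   %Cpu(s): 12.1 us, 6.4 sy, 0.0 ni, 71.7 id, 9.3 wa, ...
# "iowait_percent" is a hypothetical helper name, not a Nagios tool.
iowait_percent() {
    awk -F',' '
        /^%Cpu/ {
            # Walk the comma-separated fields and pick the one ending in "wa".
            for (i = 1; i <= NF; i++) {
                if ($i ~ /wa[ \t]*$/) {
                    split($i, part, " ")   # part[1] is the number, part[2] is "wa"
                    print part[1]
                    exit
                }
            }
        }
    '
}
```

It could be fed with `top -bn1 | iowait_percent`. A single sample is noisy, so repeating it every few seconds gives a better picture of whether storage is the bottleneck.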

Please send me a copy of your profile, you can download it from Admin > System Profile > Download Profile and upload it to the ticket by clicking the "choose item" link at the bottom of the menu. Make sure to wait until the file is finished uploading before clicking the Post Reply button.

You could also be hitting a bug, so please also send me the output of these commands (run as root) so we can check the table sizes:
- NOTE: You may need to adjust the -h 127.0.0.1, the -uroot, and -pnagiosxi in the first command if your DB is offloaded to another server and/or you've changed the root mysql password

echo "SELECT table_name AS 'Table', round(((data_length + index_length) / 1024 / 1024), 2) 'Size in MB' FROM information_schema.TABLES WHERE table_schema IN ('nagios', 'nagiosql', 'nagiosxi');" | mysql -h 127.0.0.1 -uroot -pnagiosxi --table

This next command may fail, that's okay, not all systems have postgresql:

echo "SELECT relname as Table, pg_size_pretty(pg_total_relation_size(relid)) As Size, pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid)) as ExternalSize FROM pg_catalog.pg_statio_user_tables ORDER BY pg_total_relation_size(relid) DESC;" | psql nagiosxi nagiosxi

2)
/var/log/mariadb/mariadb-slow.log

Hello, we have found recurring queries in the database like this one:

SELECT
COUNT(*) as total
FROM nagios_servicestatus
LEFT JOIN nagios_objects as obj1 ON nagios_servicestatus.service_object_id=obj1.object_id
LEFT JOIN nagios_services ON nagios_servicestatus.service_object_id=nagios_services.service_object_id
LEFT JOIN nagios_hosts ON nagios_services.host_object_id=nagios_hosts.host_object_id
WHERE TRUE AND obj1.name1 = 'a2sql162p.fcc.intfcc.local' AND nagios_servicestatus.instance_id = '1' AND nagios_servicestatus.service_object_id IN ( 78499,78469,78420,78419,78418,78417,78356,78355,78353,78297,77341,77340, (with thousands of IDs!!!)

This query locks the table nagios_servicestatus, and the MySQL processlist shows this:
MariaDB [nagios]> show processlist;
+------+----------+-----------+----------+---------+------+------------------------------+------------------------------------------------------------------------------------------------------+----------+
| Id | User | Host | db | Command | Time | State | Info | Progress |
+------+----------+-----------+----------+---------+------+------------------------------+------------------------------------------------------------------------------------------------------+----------+
| 11 | ndoutils | localhost | nagios | Query | 0 | Waiting for table level lock | INSERT INTO nagios_servicestatus SET instance_id='1', service_object_id='25795', status_update_time= | 0.000 |
| 4041 | root | localhost | nagios | Query | 0 | NULL | show processlist | 0.000 |
...
| 4522 | ndoutils | localhost | nagios | Query | 0 | optimizing | SELECT
COUNT(*) as total
FROM nagios_servicestatus
LEFT JOIN nagios_objects as obj1 ON nagios_servic | 0.000 |
...
| 4528 | ndoutils | localhost | nagios | Query | 0 | statistics | SELECT
COUNT(*) as total
FROM nagios_servicestatus
LEFT JOIN nagios_objects as obj1 ON nagios_servic | 0.000 |
...
| 4597 | ndoutils | localhost | nagios | Query | 0 | optimizing | SELECT
COUNT(*) as total
FROM nagios_servicestatus
LEFT JOIN nagios_objects as obj1 ON nagios_servic | 0.000 |
...
+------+----------+-----------+----------+---------+------+------------------------------+------------------------------------------------------------------------------------------------------+----------+
60 rows in set (0.00 sec)

There is an INSERT into the table nagios_servicestatus that is continuously in the state "Waiting for table level lock".

We believe that this situation is the one that is impacting the performance of MySQL and consequently the consumption of message queues.
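
To put a number on that suspicion, the sessions sitting in the lock-wait state could be counted from the processlist output. The sketch below is only an illustration: `count_lock_waiters` is a made-up helper name, and it simply matches the state text shown in the processlist above.

```shell
#!/usr/bin/env bash
# Count processlist rows whose state is "Waiting for table level lock".
# Reads the processlist text from stdin and prints the count (0 if none).
# "count_lock_waiters" is a hypothetical helper name used for illustration.
count_lock_waiters() {
    awk '
        /Waiting for table level lock/ { waiting++ }
        END { print waiting + 0 }
    '
}
```

Assuming the same connection settings as the table-size command earlier in this post, it could be run as `mysql -h 127.0.0.1 -uroot -pnagiosxi -e 'SHOW FULL PROCESSLIST' | count_lock_waiters`. A count that stays above zero across repeated runs would support the idea that the INSERTs from ndoutils are being held up by the long SELECTs.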

The machine currently has 24 CPUs and 32 GB of RAM, and the database is optimized, but the queues are not being consumed correctly.


Is there a parameter to minimize the execution of the query that generates the lock on the table?

2.1)
ps -ef|grep -i panic

The system has this process:
[root@a2nagio001p ~]# ps -ef|grep panic
root 10108 1 0 16:25 ? 00:00:00 /usr/bin/abrt-watch-log -F BUG: WARNING: at WARNING: CPU: INFO: possible recursive locking detected ernel BUG at list_del corruption list_add corruption do_IRQ: stack overflow: ear stack overflow (cur: eneral protection fault nable to handle kernel ouble fault: RTNL: assertion failed eek! page_mapcount(page) went negative! adness at NETDEV WATCHDOG ysctl table check failed : nobody cared IRQ handler type mismatch Kernel panic - not syncing: Machine Check Exception: Machine check events logged divide error: bounds: coprocessor segment overrun: invalid TSS: segment not present: invalid opcode: alignment check: stack segment: fpu exception: simd exception: iret exception: /var/log/messages -- /usr/bin/abrt-dump-oops -xtD

2.1. - Answer)

Please run these commands and let me know if it resolves your issue:

systemctl stop httpd
systemctl stop crond
systemctl stop npcd
systemctl stop nagios
systemctl stop ndo2db
pkill -9 -u nagios
pkill -9 -u apache
for i in $(ipcs -q | grep nagios |awk '{print $2}'); do ipcrm -q $i; done
rm -f /usr/local/nagiosxi/var/dbmaint.lock
rm -f /usr/local/nagiosxi/var/event_handler.lock
rm -f /usr/local/nagiosxi/scripts/reconfigure_nagios.lock
rm -f /usr/local/nagios/var/ndo2db.lock
rm -f /usr/local/nagios/var/ndo2db.pid
rm -f /usr/local/nagios/var/ndo2db.sock
rm -f /usr/local/nagios/var/ndo.sock
rm -f /usr/local/nagiosxi/var/subsys/ndo2db
rm -f /var/run/nagios/nagios.lock
rm -f /var/run/nagios.lock
rm -f /usr/local/nagios/var/nagios.lock
rm -f /var/run/httpd/httpd.pid
rm -f /usr/local/nagiosxi/var/subsys/npcd.pid
systemctl restart mariadb
systemctl start ndo2db
systemctl start nagios
systemctl start npcd
systemctl start crond
systemctl restart httpd
systemctl restart snmptt


2.2)

We ran that procedure several times and also rebooted the system.

3)

After waiting patiently for several hours, the queue emptied and processing returned to normal.
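
For anyone who wants to watch the queue drain instead of checking by hand, the message count in the `ipcs -q` output can be summed per owner. This is a rough sketch: `queued_messages` is a made-up helper name, and it assumes the six-column layout shown in the ipcs output earlier in this post (key, msqid, owner, perms, used-bytes, messages).

```shell
#!/usr/bin/env bash
# Sum the "messages" column of `ipcs -q` output for one queue owner.
# Reads the ipcs output on stdin; the owner is the first argument and
# defaults to "nagios", the owner seen in this thread.
# "queued_messages" is a hypothetical helper name used for illustration.
queued_messages() {
    local owner="${1:-nagios}"
    awk -v owner="$owner" '
        # Data rows have six columns; column 3 is the owner, column 6 the count.
        NF == 6 && $3 == owner { total += $6 }
        END { print total + 0 }
    '
}
```

Running `ipcs -q | queued_messages nagios` every minute or so and logging the result would show whether the backlog is shrinking, flat or growing.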


CONCLUSION:

I REMEMBER THE ADVICE OF benjaminsmith:
-RAMDISK: DONE
-CPU: 2 CORES OF 12 = 24 CPUS


-USE TWO NAGIOS INSTANCES: We are seriously considering it, separating PRODUCTION AND DEVELOPMENT ONTO TWO DIFFERENT HOSTS.



Many thanks!

I attach what the support ticket asked us for:
-Profile
-Select:
[root@a2nagio001p ~]# echo "SELECT table_name AS 'Table', round(((data_length + index_length) / 1024 / 1024), 2) 'Size in MB' FROM information_schema.TABLES WHERE table_schema IN ('nagios', 'nagiosql', 'nagiosxi');" | mysql -h 127.0.0.1 -uroot -pnagiosxi --table

Kind Regards and thanks again

Re: Nagios Queueing (benjaminsmith, tgriep)

Posted: Thu Sep 03, 2020 1:05 am
by FCC_Nagios_Support
Select

Moderator's Note: The profile has been shared with the support team but has been removed from the public forum.

Re: Nagios Queueing (benjaminsmith, tgriep)

Posted: Thu Sep 03, 2020 1:07 am
by FCC_Nagios_Support
Profile


Moderator's Note: The profile has been shared with the support team but has been removed from the public forum.

Re: Nagios Queueing (benjaminsmith, tgriep)

Posted: Thu Sep 03, 2020 4:53 pm
by benjaminsmith
Hi,

I see you have another ticket open with Sean right now, so let's continue to work this through the ticket. Please open one ticket per issue so we can focus our efforts and provide the best service.

https://support.nagios.com/tickets/scp/ ... p?id=10870

I have shared the attached files with the rest of the support team.

Thanks!
Benjamin