Performance issue with Nagios XI
Posted: Wed Jun 19, 2019 4:13 am
Hi Team
Due to poor performance and regular downtime of Nagios XI, we are having doubts about our decision to choose it as our monitoring solution. I would appreciate it if you could help us identify the root cause.
Setup:
Nagios Frontend
Machine A: Nagios frontend - RHEL 7.5, 4 vCPUs, 16 GB memory, 128 GB Standard HDD disk attached.
Machine B: Nagios frontend - RHEL 7.5, 4 vCPUs, 16 GB memory, 128 GB Standard HDD disk attached.
Machines A and B run in active-passive mode using DRBD and Pacemaker (as recommended on the official Nagios site).
Nagios Backend
Machine C: Nagios backend (MariaDB) - RHEL 7.5, 4 vCPUs, 16 GB memory, 256 GB Premium SSD disk attached for MariaDB.
Total hosts configured: 299
Total services configured: 3119
Problem: Cannot log in to Nagios XI; the page is unresponsive. After rebooting all 3 machines, the issue occurs again 15-20 minutes later.
Observations:
1. On the backend machine, I see high I/O wait (sometimes reaching more than 80%). Sometimes we cannot even log in to the machine. All MariaDB connections are exhausted (400 connections).
2. Because of the above behavior, on the active frontend machine I sometimes see connection-exhausted errors toward MariaDB, and sometimes both machines become active (split-brain scenario). STONITH is not enabled.
3. I checked the DB table sizes: the xi_meta table was 72 GB and nagios_logentries was around 12 GB. After spending many days debugging the issue, we truncated the tables, and for the last 2 days it seems to be working fine.
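For reference, the table sizes in observation 3 can be checked with something like the sketch below. It is a rough illustration, not a Nagios XI tool: the schema names `nagios` and `nagiosxi` are assumptions about a default install, `oversized_tables` is a hypothetical helper, and the rows would come from running `SIZE_QUERY` with whatever MariaDB client is available.

```python
# Rough sketch, not part of Nagios XI. Schema names are assumed defaults.
SIZE_QUERY = """
SELECT table_schema, table_name,
       data_length + index_length AS size_bytes
FROM information_schema.tables
WHERE table_schema IN ('nagios', 'nagiosxi')
ORDER BY size_bytes DESC
"""

GIB = 1024 ** 3


def oversized_tables(rows, limit_gb=1.0):
    """Return (schema.table, size in GB) for tables above limit_gb, largest first.

    rows: iterable of (schema, table, size_bytes) tuples, as returned by SIZE_QUERY.
    """
    big = [
        (f"{schema}.{table}", round(size / GIB, 1))
        for schema, table, size in rows
        if size is not None and size > limit_gb * GIB
    ]
    return sorted(big, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Illustrative rows: the two large sizes are the ones reported above,
    # the third table name is a placeholder.
    sample = [
        ("nagiosxi", "xi_meta", 72 * GIB),
        ("nagios", "nagios_logentries", 12 * GIB),
        ("nagiosxi", "some_small_table", 64 * 1024),
    ]
    for name, size_gb in oversized_tables(sample):
        print(f"{name}: {size_gb} GB")
```

Running a check like this on a schedule would show how fast the two tables grow back after a truncate, which is the number I am really after.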
Questions:
1. Truncating the tables every time the issue occurs is not a solution. I know you will say that, because of the DB issue, the dbmaint jobs in Nagios that optimize and clean up the tables don't work. But I would also like to mention that many times, after restarting only the DB, I repaired and optimized the complete DB and then started Nagios, and the issue still recurred after some time.
2. With the default Nagios settings, what is the maximum size you expect the xi_meta table to grow to?
3. I guess the I/O wait on the DB server was mostly caused by the huge size of the xi_meta and nagios_logentries tables.
4. For enabling STONITH in the Nagios frontend DRBD/Pacemaker configuration, what can be used, or what do you recommend? Please note the machines are Azure virtual machines.
Any other suggestions?