XI: NDO3 -- instability / poor performance at scale
Posted: Wed Mar 17, 2021 2:35 pm
Ahoy folks,
We run multiple instances of Nagios XI to monitor our customer environments.
This includes a split in an attempt to distribute load (e.g., ACTIVE vs PASSIVE checks), where each instance is self-contained.
These were all initially installed on the XI 5.6.x release series, on top of RHEL 7.x virtual machines, with off-box dBs (for the two PROD ones; the third is vendor-provided / on-box).
It is understood that hosting the 3 dBs off-box will add some overhead for various actions (e.g., apply_config will take a little longer, due to the interchange across the network vs local traffic within a box).
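(As a rough yardstick for that network cost, even a trivial timing like the following shows the per-statement round-trip overhead; the hostname and credentials are placeholders:)

```sh
# Per-query round-trip cost to the off-box dB host.
time mysql -h db.example.com -u ndoutils -p'PLACEHOLDER' -e 'SELECT 1;' >/dev/null
```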
For a variety of reasons we have been keen to upgrade away from XI 5.6.x.
Given this is for enterprise monitoring, we are of course interested in keeping abreast with security updates and bug fixes, to say nothing of benefitting from new features & functionality.
Also, the APP-to-dB interchange subsystem (NDO2db) included up to the end of the XI 5.6.x release series has historically led to significant grief and instability (all-day impacts, unhappy users & MGMT, etc.).
Through review of the forums, contact with SUPP (via tickets), and independent research on the KB, we have made extensive attempts at performance tuning to accommodate / optimize the OS and APP for NDO2db throughput.
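For context, that NDO2db-era tuning was roughly of the following shape (a representative sketch, not our literal production values; verify the option names against the ndomod / ndo2db sample configs shipped with your version):

```
# /usr/local/nagios/etc/ndomod.cfg -- broker side
# A larger output buffer lets the engine ride out bursts while ndo2db
# catches up on writes (value is illustrative).
output_buffer_items=16384

# /usr/local/nagios/etc/ndo2db.cfg -- daemon side
# Aggressive trimming keeps the hot tables small (values in minutes, illustrative).
max_timedevents_age=60
max_systemcommands_age=1440
max_servicechecks_age=1440
max_hostchecks_age=1440
max_eventhandlers_age=10080
max_externalcommands_age=10080
```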
It was with no small measure of delight, then, that we greeted the announcement of the XI 5.7.x release series (with its much-anticipated NDO2db replacement, NDO3).
Of course, the XI 5.7.x release series has not been without challenges of its own.
Three upgrade attempts were made, all ultimately failing just short of the finish line, for a variety of reasons.
The accelerated release of the XI 5.8.x series gave us hope, as previous bugs continued to be fixed and the new sub-systems saw version upgrades (NDO3, CCM, and most recently NRDP).
However, one issue remains throughout that "kills" our upgrade attempts every time.
Specifically, it would appear that the NDO3 subsystem does not scale well under load in the largest of our PROD deployments (it seems to run fine in our smaller deployments).
Having reviewed the forums, I continue to see references recommending a downgrade back to NDO2db as a "quick fix".
However, I have not observed much in the way of advancement on the topic of improving NDO3 stability / performance (a worthy long-term goal, one might think).
Thus, I wish to seek guidance: how does one go about tuning NDO3 to operate better at scale?
In our most recent attempt to upgrade (last night, 5.6.9 --> 5.8.1, as available in your [vendor repos](https://repo.nagios.com/?repo=rpm-rhel)), we had challenges but "got there" with regard to XI being upgraded.
This includes the following:
- XI apparently started successfully
- reporting version 5.8.1
- "survived" APP restarts
- dB transactions apparent on (off-box) dB host
- monitor engine running
- monitors scheduled in queue
- monitors updating with results when checks run
However, the following was noted / observed:
- server statistics
- CPU load would spike (~10+ load average), then bottom out (below ~0.5)
- CPU Stats would also indicate spikes
- spikes / troughs coincide with next major point
- monitoring engine check statistics
- monitor queues would process briefly, then "drain"
- for example the "1-min" queues for both ACTIVE HOST && PASSIVE SERVICE checks would "zero out"
- followed by the same happening to the "5-min" queues
- would eventually "self-recover" without intervention, run for a time, then manifest all over again
- monitoring engine process
- held steady (green board) throughout
- system component status
- held steady (green board) throughout
- monitoring engine event queue
- scheduled events over time
- the so-called "banana road" would show peaks and troughs
- would appear to indicate "bursts" or "spurts" of monitors scheduled / executed
- conversely, in XI 5.6.x, this "runs steady" with a generally consistent load average of "250" (not accounting for bursting of events)
- service status
- using test PASSIVE service monitors, it was noted that XI would become extremely "sluggish" and altogether "miss" PASSIVE service state changes (the test submissions were of the shape sketched just after this list)
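For reference, the synthetic PASSIVE state changes were pushed with plain NRDP submissions along these lines (hostname, token, and service name are placeholders, and the `XMLDATA` field name follows the stock `send_nrdp.sh` client, so verify it against your NRDP version):

```sh
# Push a synthetic PASSIVE service result at the XI server's NRDP endpoint.
# checktype='1' marks the result as a passive submission.
XML="<?xml version='1.0'?>
<checkresults>
  <checkresult type='service' checktype='1'>
    <hostname>testhost01</hostname>
    <servicename>test_passive</servicename>
    <state>2</state>
    <output>CRITICAL - synthetic state change for latency testing</output>
  </checkresult>
</checkresults>"

curl -fsS "https://xi.example.com/nrdp/" \
  --data-urlencode "token=PLACEHOLDER_TOKEN" \
  --data-urlencode "cmd=submitcheck" \
  --data-urlencode "XMLDATA=${XML}"
```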
Restarting the `nagios` core process did little to correct the issue.
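(For anyone reproducing this, the drain / recover cycles are also visible from the engine side with `nagiostats`; a minimal sampling loop, assuming the stock XI paths:)

```sh
# Sample engine check throughput and latency every 10s while the
# queues "drain" and "self-recover".
watch -n 10 "/usr/local/nagios/bin/nagiostats -c /usr/local/nagios/etc/nagios.cfg \
  | grep -Ei 'checks last|latency'"
```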
The Nagios, XI, & NDO config files were reviewed, and the settings pointing at the off-box dB were confirmed correct.
We confirmed the NDO3 `broker_module` string was properly defined.
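For the record, the verified line was of this general shape (hostname and credentials are placeholders, and the exact key names should be treated as assumptions to verify against the NDO3 README for your release):

```
# /usr/local/nagios/etc/nagios.cfg -- in 5.7.x+ NDO3 loads as a single
# broker module with its dB settings given inline (no separate ndo2db daemon).
broker_module=/usr/local/nagios/bin/ndo.so dbtype=mysql hostname=db.example.com port=3306 username=ndoutils password=PLACEHOLDER database=nagios
```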
We eventually rolled back the change (reverting to snapshots of the APP && dB nodes taken prior to the start of the change), and XI was quickly back in PROD service (at XI 5.6.9 levels).
Prior to rolling back, I collected "XI Profile" and full APP dumps (~13 GB tarballs, via `backup_xi.sh`) of the faulty XI 5.8.1 state and the previously functional XI 5.6.9 state.
In this way, we are preparing now to conduct RCA / post-mortem (as best able) on what went wrong and how we might achieve *lasting success* on the next attempt.
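(The dumps themselves came from the stock XI backup script; the destination directory below is a placeholder:)

```sh
# Full XI application backup -- produces the ~13 GB tarballs referenced above.
/usr/local/nagiosxi/scripts/backup_xi.sh /store/xi-backups/
```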
All of this said, what knowledge / solutions / tuning opportunities exist to improve the performance of the NDO3 sub-system for use in a "large" deployment scenario?
---
For clarification, in my particular ENV, the "problem" large instance consists of the following general points:
- VM hosts:
- XI: 6 vCPUs, 32 GB MEM, SAN disks
- dB: 4 vCPUs, 32 GB MEM, SAN disks
- OS: RHEL 7.x without any special customizations, patched regularly
- dB: MariaDB 5.5.64.x series (off-box, a managed instance run by in-house DBA folks; see the my.cnf sketch after this list)
- Monitors:
- ACTIVE: ~4,200 HOST monitors (`check-host-alive`, which uses `check_icmp`, for UP/DOWN monitors)
- PASSIVE: ~35,000 SERVICE monitors (`check_dummy`; PASSIVE monitors that receive events from NCPA/NRDP deployed across our monitored host plant)
- Notes:
- we are not using `mod_gearman`, as it was found to be ill-suited for our ENV
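On the dB side, the knobs we expect to revisit with the DBAs as part of the RCA are of this general shape (a hypothetical my.cnf fragment with illustrative values, not validated settings):

```ini
[mysqld]
innodb_buffer_pool_size        = 8G    # keep the hot ndoutils tables in memory
innodb_log_file_size           = 512M  # fewer checkpoint stalls under bursty writes
innodb_flush_log_at_trx_commit = 2     # trade strict durability for write throughput
max_connections                = 500   # headroom for XI subsystems + NDO
```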