Performance issue
Posted: Wed Oct 18, 2017 9:12 am
Hello,
I'm currently running into several issues, but I think they share a common root cause.
Details on installation:
- RHEL 7.3 64b - minimal install
- Manual install of Nagios XI
- Current version: 5.4.4
- Proxy configured (system & nagios)
- Using SSL
- DB offloaded (MariaDB - RHEL 7.3)
- Mod_Gearman2 installed / 4 pollers
- Ramdisk (1GB)
- 8 vCPU
- 16 GB RAM
- +1750 hosts
- +10650 services
ERROR: Problem (Backend:ndomy_1): NDO Claims that nagios did no status update ... Make sure that nagios and NDO daemons are running.
Restarting the ndo2db service allows NagVis to work again for a few minutes or hours, but the same issue always comes back.
This behavior started only a few days/weeks ago, when I began getting error messages from the ndo2db service:
Code: Select all
ndo2db: Message sent to queue.
Warning: queue send error, retrying...
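ndo2db: Error: max retries exceeded sending message to queue. Kernel queue parameters may need to be tuned. See README.
I changed the kernel settings as explained in the FAQ (https://support.nagios.com/kb/article/n ... eeded.html):
Code: Select all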
# sysctl -a | grep kernel.msgm
kernel.msgmax = 262144000
kernel.msgmnb = 262144000
kernel.msgmni = 512000
I also monitored the message queue (cf. screenshot localhost-message_queue.png).
Because I observe peaks and "plateaux" in the message queue, and because of ndo2db messages like this one:
ndo2db: Warning: Retrying message send. This can occur because you have too few messages allowed or too few total bytes allowed in message queues. You are currently using 256000 of 512000 messages and 262144000 of 262144000 bytes in the queue. See README for kernel tuning options.
I changed the kernel settings again today:
Code: Select all
# sysctl -a | grep kernel.msgm
kernel.msgmax = 262144000
kernel.msgmnb = 393216000
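kernel.msgmni = 512000
During the plateaux, my pollers stop running the host and service checks they are responsible for.
I can see the CPU load and the number of workers decreasing at that time.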
Each time there is a peak, NagVis seems to lose the connection to its ndomy backend.
I also observe that Nagios is slow to refresh perfdata during the peaks.
I saw no issues on the DB side.
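To make the ndo2db warning quoted above easier to read, here is a minimal Python sketch that pulls the four counters out of the message and turns them into utilization ratios. The function name `queue_usage` and the regular expression are my own illustration, not part of ndo2db; the only input is the warning text exactly as it appears in this post.

```python
import re

# Matches the counters in the ndo2db warning quoted in this post.
USAGE_RE = re.compile(
    r"using (\d+) of (\d+) messages and (\d+) of (\d+) bytes"
)

def queue_usage(line):
    """Return (message_fraction, byte_fraction) from an ndo2db warning, or None."""
    m = USAGE_RE.search(line)
    if m is None:
        return None
    msgs_used, msgs_max, bytes_used, bytes_max = map(int, m.groups())
    return msgs_used / msgs_max, bytes_used / bytes_max

warning = (
    "ndo2db: Warning: Retrying message send. This can occur because you have "
    "too few messages allowed or too few total bytes allowed in message queues. "
    "You are currently using 256000 of 512000 messages and 262144000 of "
    "262144000 bytes in the queue. See README for kernel tuning options."
)

msg_frac, byte_frac = queue_usage(warning)
print(f"messages: {msg_frac:.0%}, bytes: {byte_frac:.0%}")
# prints "messages: 50%, bytes: 100%"
```

Read this way, the message count is only half used while the byte limit is completely used up. That byte limit (262144000) is the same number shown for kernel.msgmnb in the first sysctl output.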
I read the forum and lots of Google links, but I did not find a single clear answer to this kind of issue.
1) Can you confirm that these issues are linked to ndo2db?
2) Do you have any recommendations for solving these issues?
Thank you.
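For reference, the two sysctl outputs quoted above can be compared mechanically. The sketch below is only an illustration: `parse_sysctl` and `changed` are hypothetical helper names, and the values are copied from the two "Code" blocks in this post rather than read from a live system.

```python
def parse_sysctl(text):
    """Parse 'key = value' lines (as printed by sysctl) into a dict of ints."""
    settings = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip blank lines and anything that is not a setting
        key, value = line.split("=", 1)
        settings[key.strip()] = int(value.strip())
    return settings

def changed(before, after):
    """Return {key: (old, new)} for keys present in both whose values differ."""
    return {
        k: (before[k], after[k])
        for k in before
        if k in after and before[k] != after[k]
    }

# Values from the first sysctl output in this post.
before = parse_sysctl("""
kernel.msgmax = 262144000
kernel.msgmnb = 262144000
kernel.msgmni = 512000
""")

# Values from the second sysctl output in this post.
after = parse_sysctl("""
kernel.msgmax = 262144000
kernel.msgmnb = 393216000
kernel.msgmni = 512000
""")

for key, (old, new) in changed(before, after).items():
    print(f"{key}: {old} -> {new} ({new / old:.2f}x, {new // 2**20} MiB)")
# prints "kernel.msgmnb: 262144000 -> 393216000 (1.50x, 375 MiB)"
```

So the only setting that differs between the two outputs is kernel.msgmnb, raised from 250 MiB to 375 MiB; kernel.msgmax and kernel.msgmni are unchanged.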