DNX configuration not working [RESOLVED]
Posted: Mon Jul 09, 2012 5:58 pm
Hello all,
I have a Nagios XI installation with 72 hosts and 1220 services (mostly NRPE and check_nt checks, plus a good amount of SNMP monitoring of perfmon data). As we have added more checks, our Nagios XI machine has become unusable: load averages around 15-17 and it can take as long as 2 or 3 minutes to load a graph. I decided to install DNX rather than ModGearman because the installation on XI is supposed to be simple. That has not proven to be the case, though, as I can't get it to work.
Before I started messing with DNX, the CPU usage was ALWAYS red on the Nagios XI box and the average load was 15-17. Shortly after I started messing with DNX the load dropped to 2-3 and stayed there for a day or two, so I initially thought it was working. Since then, however, the load has increased steadily, and while I do feel like it's better (maybe it's just in my head?), we still have lots of times when the system is unusable. This, combined with the error logs and the fact that the client machine has a load of 0 at all times and top shows no activity related to Nagios or DNX, leads me to conclude there is a problem with the DNX installation. Is there a way to tell definitively whether DNX is working?
Here are the steps I have taken so far to troubleshoot the issue.
Installation Steps:
1. Updated to latest NagiosXI version
2. Installed DNX on the Nagios Core server (DNX client) using the instructions in Using_DNX.pdf
3. Installed DNX on the Nagios XI server (DNX Master) using the instructions in Using_DNX.pdf
4. Noticed no drop in responsiveness and pondered how I could possibly screw up such a simple installation
Things of note/Troubleshooting steps thus far:
1. Both the NagiosXI server and the Nagios Core server have been rebooted
2. I have disabled iptables on both machines, and both machines are on the same subnet with no firewall between them
3. I have reinstalled DNX on the client side by re-running the script NagiosXI-DNX.sh with the -c argument
4. On the client machine, even though I have set the debug level to 3, the dnxplg.debug.log is never created.
--Permissions on the directory are drwxr-xr-x 2 nagios nagios 4.0K Jun 30 16:07 log
5. The Nagios XI server (DNX master) does create a dnxplg.debug.log file, and it indicates that no resource is found. I have included a tail of that log below.
6. Before I started messing with DNX the CPU usage was ALWAYS red on the Nagios XI box and had an average load of 12-15.
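For reference, the mode string in item 4 can be decoded mechanically. The following is a minimal Python sketch; the helper name decode_mode is made up for illustration. It only shows which permission classes may write to the log directory. Which user the client daemon actually runs as after dropping privileges is not shown in the logs, so that part remains an open question.

```python
def decode_mode(mode):
    """Split an ls-style mode string such as 'drwxr-xr-x' into per-class flags."""
    if len(mode) != 10:
        raise ValueError("expected a 10-character mode string like 'drwxr-xr-x'")
    result = {}
    # Characters 1-3 are the owner bits, 4-6 the group bits, 7-9 everyone else.
    for index, name in enumerate(("owner", "group", "other")):
        bits = mode[1 + 3 * index : 4 + 3 * index]
        result[name] = {
            "read": bits[0] == "r",
            "write": bits[1] == "w",
            "execute": bits[2] in ("x", "s", "t"),
        }
    return result


if __name__ == "__main__":
    # Mode string copied from the directory listing in item 4.
    perms = decode_mode("drwxr-xr-x")
    for name, flags in perms.items():
        print(name, "can write" if flags["write"] else "cannot write")
```

With the listing above, only the owner (nagios) has write access, so a daemon running as any other non-root user could not create a log file in that directory.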
<----supporting information ---->
DNX Server information
Nagios XI 2011R3.1
CentOS Release 6.2 (Final)
64-bit, Manual Install
No special configuration, no GUI, no Proxy and no SSL
DNX Client Information
Nagios Core
CentOS 5.7 (Final)
64-bit, Manual Install
No special configuration, no GUI, no Proxy and no SSL
Tail of dnxplg.debug.log on Nagios main XI server
[Mon Jul 9 16:54:33.208 2012] Post failed: Resource was not found. Service check [56716] will execute locally: /usr/local/nagios/libexec/check_nrpe <check info removed for security>
[Mon Jul 9 16:54:33.217 2012] Post failed: Resource was not found. Service check [56717] will execute locally: /usr/local/nagios/libexec/check_nrpe <check info removed for security>
[Mon Jul 9 16:54:38.152 2012] Post failed: Resource was not found. Service check [56718] will execute locally: /usr/local/nagios/libexec/check_nt <check info removed for security>
[Mon Jul 9 16:54:38.154 2012] Post failed: Resource was not found. Service check [56719] will execute locally: /usr/local/nagios/libexec/check_nrpe <check info removed for security>
[Mon Jul 9 16:54:38.155 2012] Post failed: Resource was not found. Service check [56720] will execute locally: /usr/local/nagios/libexec/check_nrpe <check info removed for security>
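One rough way to quantify the problem from this log is to count how many checks fall back to local execution. The sketch below is in Python and relies only on the "Post failed ... will execute locally" line format shown above; the function name summarize_fallbacks and the SAMPLE list are made up for illustration. It cannot tell how many checks were dispatched successfully, since no such lines appear in this log tail.

```python
import re
from collections import Counter

# Pattern derived from the "Post failed" lines in the log tail above.
FALLBACK = re.compile(
    r"Post failed: (?P<reason>.+?)\. Service check \[(?P<id>\d+)\] "
    r"will execute locally: (?P<plugin>\S+)"
)

# Three lines copied from the log tail, used as sample input.
SAMPLE = [
    "[Mon Jul 9 16:54:33.208 2012] Post failed: Resource was not found. Service check [56716] will execute locally: /usr/local/nagios/libexec/check_nrpe <check info removed for security>",
    "[Mon Jul 9 16:54:38.152 2012] Post failed: Resource was not found. Service check [56718] will execute locally: /usr/local/nagios/libexec/check_nt <check info removed for security>",
    "[Mon Jul 9 16:54:38.154 2012] Post failed: Resource was not found. Service check [56719] will execute locally: /usr/local/nagios/libexec/check_nrpe <check info removed for security>",
]


def summarize_fallbacks(lines):
    """Count checks that fell back to local execution, grouped by plugin name."""
    counts = Counter()
    for line in lines:
        match = FALLBACK.search(line)
        if match:
            # Keep only the plugin file name, not the full path.
            counts[match.group("plugin").rsplit("/", 1)[-1]] += 1
    return counts


if __name__ == "__main__":
    for plugin, count in summarize_fallbacks(SAMPLE).most_common():
        print(plugin, count)
```

To run it against the real file, replace SAMPLE with the lines read from dnxplg.debug.log. If every check in a sampling window shows up as a fallback, the master is effectively running everything itself.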
Tail of dnxcld.log on CLIENT
[Mon Jul 9 16:40:36.551 2012] -------- DNX Client Daemon Version 0.20.1 Startup --------
[Mon Jul 9 16:40:36.560 2012] Copyright (c) 2006-2010 Intellectual Reserve. All rights reserved.
[Mon Jul 9 16:40:36.560 2012] Configuration file: /usr/local/nagios/etc/dnxClient.cfg.
[Mon Jul 9 16:40:36.560 2012] Dispatcher: udp://10.100.17.17:12480.
[Mon Jul 9 16:40:36.560 2012] Collector: udp://10.100.17.17:12481.
[Mon Jul 9 16:40:36.560 2012] Agent: udp://10.100.17.7:12482.
[Mon Jul 9 16:40:36.560 2012] Debug logging enabled at level 3 to /usr/local/nagios/var/log/dnxcld.debug.log.
[Mon Jul 9 16:40:36.570 2012] Changed working directory to /usr/local/nagios/var/run/dnx
[Mon Jul 9 16:40:36.572 2012] Running as root; attempting to drop privileges...
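To make it easier to compare the client's view of the network with the master's configuration, the endpoints can be pulled out of the startup lines above. This is a minimal Python sketch based only on the "Dispatcher/Collector/Agent: udp://host:port." lines shown in this log; the names parse_endpoints and same_master are made up for illustration. It only surfaces the values for comparison and makes no claim about what the correct addresses should be.

```python
import re

# Pattern derived from the endpoint lines in the dnxcld.log tail above.
ENDPOINT = re.compile(r"\] (Dispatcher|Collector|Agent): udp://([\d.]+):(\d+)\.")

# Three lines copied from the log tail, used as sample input.
SAMPLE = [
    "[Mon Jul 9 16:40:36.560 2012] Dispatcher: udp://10.100.17.17:12480.",
    "[Mon Jul 9 16:40:36.560 2012] Collector: udp://10.100.17.17:12481.",
    "[Mon Jul 9 16:40:36.560 2012] Agent: udp://10.100.17.7:12482.",
]


def parse_endpoints(lines):
    """Return {role: (host, port)} for each endpoint line found."""
    endpoints = {}
    for line in lines:
        match = ENDPOINT.search(line)
        if match:
            role, host, port = match.groups()
            endpoints[role] = (host, int(port))
    return endpoints


def same_master(endpoints):
    """True when the Dispatcher and Collector entries point at the same host."""
    if "Dispatcher" not in endpoints or "Collector" not in endpoints:
        return False
    return endpoints["Dispatcher"][0] == endpoints["Collector"][0]


if __name__ == "__main__":
    found = parse_endpoints(SAMPLE)
    for role, (host, port) in found.items():
        print(role, host, port)
    print("Dispatcher and Collector on same host:", same_master(found))
```

The host and port values printed here can then be checked by hand against the addresses and ports configured on the Nagios XI side.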