Inconsistent NRDP performance
Posted: Thu Mar 08, 2018 6:01 pm
I recently set up multiple Nagios XI (currently version 5.4.12) servers and wanted to pull their results into a single passive server so my end users could view everything in one place instead of having to find their hosts across the various servers. Looking through the documentation, NRDP looked like the obvious choice since it's supported out of the box with Nagios XI.
On the passive server:
- Configured Passive Hosts for each Host that I wanted to pull from the Active Nagios XI server
- Configured Passive Service Template
-- Set the check command to check_dummy with a state of 2 (critical) instead of the built-in example
-- Enabled the freshness check
-- Set the freshness threshold to 15 minutes (900 seconds) - which I think is FAR too long, but I wanted to give the server a fair chance
- Created the NRDP Authentication Token in the Admin -> Inbound Check Transfer Settings -> NRDP
- Tested the configured Authentication Token via the GUI link with the built-in example and verified that the test check showed up under the unconfigured objects
Config specs: CentOS 7.4.1708 with 4 CPUs, 8GB of memory and plenty-o-disk space. I also configured a 2GB ramdisk (because I was going for overkill just in case) using the automated script.
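For reference, the passive service setup above boils down to a Nagios core definition along these lines (the template and host names here are placeholders, not my actual config):

```
define service {
    host_name               SERVERNAMEHERE              ; placeholder
    service_description     Process - Puppet Memory usage
    use                     xi-service-passive          ; hypothetical template name
    active_checks_enabled   0
    passive_checks_enabled  1
    check_freshness         1
    freshness_threshold     900                         ; 15 minutes
    check_command           check_dummy!2!"No passive result received"
}
```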
On the active servers (total of 3 servers containing 21,943 services in total):
- Enabled Outbound Transfers, cleared the filter, and left the filter mode set to exclude
- Configured NRDP settings to define the IP address of the Passive server and the Authentication token for the Passive Server
Config specs: CentOS 7.4.1708 with 8 CPUs, 16GB of memory and plenty-o-disk space. I also configured a 500MB ramdisk using the automated script.
Seeing that the configuration was pretty simple, I was fairly confident this would work; however, I was sadly mistaken. What I found instead was that several hundred services ended up in a critical state (upwards of 700). I then moved on to change the freshness setting to 45 minutes, cleaned up the critical alerts, and waited. This resulted in fewer critical checks (between 250 and 350), but still VASTLY more non-OK checks than on the 3 active servers (I literally have 11 warnings in my environment right now). What's interesting is that the number varies over time, ranging from as few as a couple hundred to over 1k.
I started troubleshooting by watching /var/log/httpd/error_log (since I'm using HTTP to send data), but nothing showed up that corresponded with the time of a service hitting a critical state. Looking through nagios.log on the passive server, I noticed lots of messages like this:
[1520548225] Warning: The results of service 'Process - Puppet Memory usage' on host 'SERVERNAMEHERE' are stale by 0d 0h 0m 13s (threshold=0d 0h 15m 0s). I'm forcing an immediate check of the service.
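If I'm reading that message right, the math is simple: Nagios takes the age of the last passive result, compares it to the freshness threshold, and the "stale by" figure is the overage. A quick sketch with the numbers from that log line (the last-result timestamp is back-calculated for illustration, not pulled from my logs):

```shell
now=1520548225          # timestamp from the log entry above
last_check=1520547312   # back-calculated last passive result (hypothetical)
threshold=900           # freshness_threshold: 15 minutes

age=$(( now - last_check ))
overage=$(( age - threshold ))
if [ "$age" -gt "$threshold" ]; then
    # matches the log: stale by 13s against a 900s threshold, so Nagios
    # forces the active check (check_dummy), which goes CRITICAL
    echo "stale by ${overage}s (threshold=${threshold}s)"
fi
```

So every one of those "stale" messages means a passive result arrived more than 15 minutes after the previous one, at which point check_dummy flips the service to CRITICAL.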
Each of these lines points to a host/service combination that has hit a critical state on the passive server. On the active server side, I can see that the "/usr/local/nrdp/clients/send_nrdp.sh" script is being called with multiple checks at a time, for example:
Processing perfdata file '/var/nagiosramdisk/spool/xidpe/1520549179.perfdata.service'
Sending passive check data to NRDP server(s)...
Sending to NRDP target host: 10.150.134.222
CMDLINE: cat /tmp/NRDPOUTmAWNHu | /usr/local/nrdp/clients/send_nrdp.sh -u http://10.150.134.222/nrdp/ -t XJiZFVgnQreK
STDOUT: Sent 251 checks to http://10.150.134.222/nrdp/
RETURN CODE: 0
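One way to take send_nrdp.sh out of the equation entirely: NRDP is just an HTTP POST of XML (cmd=submitcheck plus the token), so a single check result can be pushed by hand with curl and the server's raw XML response inspected directly, instead of the "Sent N checks" summary. A sketch using the URL and token from the CMDLINE above; the curl command is echoed here rather than executed:

```shell
NRDP_URL="http://10.150.134.222/nrdp/"   # target from the CMDLINE above
TOKEN="XJiZFVgnQreK"                     # token from the CMDLINE above

# Minimal single-service check result in NRDP's XML format
XML='<?xml version="1.0"?>
<checkresults>
  <checkresult type="service">
    <hostname>SERVERNAMEHERE</hostname>
    <servicename>Process - Puppet Memory usage</servicename>
    <state>0</state>
    <output>OK - manual NRDP test</output>
  </checkresult>
</checkresults>'

# Drop the leading "echo" to actually submit; the passive server should
# answer with an XML status/message you can check for errors.
echo curl -s -d "token=${TOKEN}" -d "cmd=submitcheck" \
     --data-urlencode "XMLDATA=${XML}" "${NRDP_URL}"
```

If a hand-submitted check lands cleanly while batched submissions still go stale, that would point at the perfdata spool/batching side rather than NRDP itself.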
While I could technically modify the send_nrdp.sh script to log its results, I figured I'd start by asking whether there are any baked-in logs I should be looking at instead. Also, is it worth investing the time to switch perfdataproc.php from the bash send_nrdp client to the Python one?
Better question than either of those: what's the best method for troubleshooting this kind of setup?