Inconsistent NRDP performance

Post by **krutaw** » Thu Mar 08, 2018 6:01 pm

I recently set up multiple Nagios XI (currently version 5.4.12) servers and wanted to pull their results into a single passive server where my end users could view everything in one place instead of having to find their hosts on the various servers. Looking through the documentation, NRDP looked like an obvious choice since it's supported out of the box with Nagios XI.

On the passive server:
- Configured Passive Hosts for each Host that I wanted to pull from the Active Nagios XI server
- Configured Passive Service Template
-- Set check to check_dummy to go to state 2 (critical) instead of the built-in example
-- Enable freshness check
-- Set freshness to 15 minutes (900 seconds) - which I think is FAR too long but I wanted to give the server a fair chance
- Created the NRDP Authentication Token in the Admin -> Inbound Check Transfer Settings -> NRDP
- Tested the configured Authentication Token using the GUI link with the built-in example and verified that the test showed up in the unconfigured objects

Config specs: CentOS 7.4.1708 with 4 CPUs, 8GB of memory and plenty-o-disk space. I also configured a 2GB (because I was going for overkill just in case) ramdisk using the automated script.

On the active servers (total of 3 servers containing 21,943 services in total):
- Enabled Outbound Transfers and eliminated the filter and left the filter mode at exclude
- Configured NRDP settings to define the IP address of the Passive server and the Authentication token for the Passive Server

Config specs: CentOS 7.4.1708 with 8 CPUs, 16GB of memory and plenty-o-disk space. I also configured a 500MB ramdisk using the automated script.

Seeing that the configuration was pretty simple, I was fairly confident this would work; however, I was sadly mistaken. What I found instead was that several hundred services (more than 700) ended up in a critical state. I then changed the freshness setting to 45 minutes, cleaned up the critical alerts, and waited. This resulted in fewer critical checks (between 250 and 350), but still VASTLY more non-OK checks than on the 3 active servers (I literally have 11 warnings in my environment right now). What's interesting is that the number varies over time, from as few as a couple of hundred to more than 1,000.

I started troubleshooting by watching the /var/log/httpd/error_log (since I'm using http to send data) but nothing showed up that corresponded with the time of a service hitting a critical state. In looking through the nagios.log on the passive server, I noticed lots of messages like this:

[1520548225] Warning: The results of service 'Process - Puppet Memory usage' on host 'SERVERNAMEHERE' are stale by 0d 0h 0m 13s (threshold=0d 0h 15m 0s). I'm forcing an immediate check of the service.
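For anyone wanting to quantify warnings like the one above, here is a minimal Python sketch that tallies them per host. The regular expression is modelled only on the single log line quoted in this post, so other Nagios versions may word the message differently; the function names are made up for illustration.

```python
import re
from collections import Counter

# Pattern modelled on the one nagios.log line quoted in the post (an assumption:
# other versions may phrase the warning differently).
STALE_RE = re.compile(
    r"^\[(?P<ts>\d+)\] Warning: The results of service '(?P<service>[^']+)' "
    r"on host '(?P<host>[^']+)' are stale by "
    r"(?P<d>\d+)d (?P<h>\d+)h (?P<m>\d+)m (?P<s>\d+)s"
)


def parse_stale_line(line):
    """Return (host, service, seconds_stale), or None if the line does not match."""
    match = STALE_RE.match(line)
    if match is None:
        return None
    seconds = (
        int(match["d"]) * 86400
        + int(match["h"]) * 3600
        + int(match["m"]) * 60
        + int(match["s"])
    )
    return match["host"], match["service"], seconds


def count_stale_by_host(lines):
    """Count stale-result warnings per host across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        parsed = parse_stale_line(line)
        if parsed is not None:
            counts[parsed[0]] += 1
    return counts


SAMPLE = (
    "[1520548225] Warning: The results of service 'Process - Puppet Memory usage' "
    "on host 'SERVERNAMEHERE' are stale by 0d 0h 0m 13s (threshold=0d 0h 15m 0s). "
    "I'm forcing an immediate check of the service."
)
```

Feeding it an open nagios.log file handle, e.g. `count_stale_by_host(open(path))`, should show whether the stale results cluster on particular hosts or are spread evenly.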

Each of these lines points to a host/service combination that has hit a critical state on the passive server. I can see on the active server side that the "/usr/local/nrdp/clients/send_nrdp.sh" script is being called with multiple checks at a time, for example:

Processing perfdata file '/var/nagiosramdisk/spool/xidpe/1520549179.perfdata.service'
Sending passive check data to NRDP server(s)...

Sending to NRDP target host: 10.150.134.222
CMDLINE: cat /tmp/NRDPOUTmAWNHu | /usr/local/nrdp/clients/send_nrdp.sh -u http://10.150.134.222/nrdp/ -t XJiZFVgnQreK
STDOUT: Sent 251 checks to http://10.150.134.222/nrdp/
RETURN CODE: 0
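A rough way to see how many checks each run actually ships is to total the "Sent N checks" lines from this output. The sketch below assumes the output keeps exactly the `STDOUT: Sent N checks to URL` shape shown above; the function name is hypothetical.

```python
import re

# Matches the "STDOUT: Sent 251 checks to http://..." line shown in the post.
SENT_RE = re.compile(r"^STDOUT: Sent (\d+) checks to (\S+)")


def total_checks_sent(lines):
    """Sum the number of checks reported as sent, keyed by NRDP target URL."""
    totals = {}
    for line in lines:
        match = SENT_RE.match(line.strip())
        if match:
            url = match.group(2)
            totals[url] = totals.get(url, 0) + int(match.group(1))
    return totals


SAMPLE_LINES = [
    "Sending to NRDP target host: 10.150.134.222",
    "STDOUT: Sent 251 checks to http://10.150.134.222/nrdp/",
    "RETURN CODE: 0",
]
```

Comparing the per-interval totals against the number of services expected to be checked in that interval would show whether the sender is dropping results before they ever reach the passive server.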

While I could technically modify the send_nrdp.sh script to log the results, I figured I'd start by asking if there were any baked in logs that I should be looking at instead? Also, is it worth investing time in changing the send_nrdp plugin being used to the python script instead of the bash script in perfdataproc.php?

Better question than those, what's the best method for troubleshooting this kind of setup?

Post by **cdienger** » Fri Mar 09, 2018 2:49 pm

I would look at the access_log on the XI server first to make sure the checks are coming in. You should be seeing requests along the lines of:

192.168.4.8 - - [09/Mar/2018:17:48:27 +0000] "POST /nrdp//?token=welcome&cmd=token=welcome&cmd=submitcheck&XMLDATA=<?xml%20version='1.0'?><checkresults><checkresult%20t

Where the IP address would be the IP address of your active server. You could easily grep for these:

grep a.b.c.d /var/log/httpd/access_log

The POSTed data should contain hostname and service name information. It may be a bit difficult to decipher but https://meyerweb.com/eric/tools/dencoder/ can help clear that up. Once we can confirm data is either not coming in or it is coming but not updating the status on the passive server, we can focus attention where it is needed.
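As an alternative to pasting into a web decoder, the URL-decoding and host/service extraction can be sketched in a few lines of Python. The sample payload below is hypothetical (the log line above is truncated), and it assumes the usual NRDP layout of `<checkresult>` elements containing `<hostname>` and `<servicename>` children.

```python
from urllib.parse import unquote
import xml.etree.ElementTree as ET


def list_checks(encoded_xml):
    """URL-decode an XMLDATA value and return (hostname, servicename) pairs."""
    xml_text = unquote(encoded_xml)
    root = ET.fromstring(xml_text)
    results = []
    for check in root.iter("checkresult"):
        host = check.findtext("hostname", default="")
        # Host check results carry no servicename, so default to an empty string.
        service = check.findtext("servicename", default="")
        results.append((host, service))
    return results


# Hypothetical payload for illustration only; host and service names are made up.
SAMPLE = (
    "<?xml%20version='1.0'?><checkresults>"
    "<checkresult%20type='service'><hostname>web01</hostname>"
    "<servicename>PING</servicename><state>0</state>"
    "<output>OK</output></checkresult></checkresults>"
)
```

Note that `unquote` leaves `+` characters alone; if the logged data encodes spaces as `+`, `urllib.parse.unquote_plus` would be the function to try instead.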

There are three send_nrdp scripts - is the command.cfg configured to use the php, py, or sh? https://assets.nagios.com/downloads/nag ... h-NRDP.pdf may be of help if you haven't seen it yet.

Post by **scottwilkerson** » Fri Mar 09, 2018 3:14 pm

krutaw wrote:While I could technically modify the send_nrdp.sh script to log the results, I figured I'd start by asking if there were any baked in logs that I should be looking at instead?

Nothing baked in for logging, besides what you already have found.

krutaw wrote:Also, is it worth investing time in changing the send_nrdp plugin being used to the python script instead of the bash script in perfdataproc.php?

I would say no; I doubt there is much, if any, performance difference between the two.

One thing you may need to do is increase the number of httpd threads on the receiving side.

Post by **krutaw** » Fri Mar 09, 2018 6:50 pm

cdienger wrote:I would look at the access_log on the XI server first to make sure the checks are coming in. You should be seeing requests along the lines of:

192.168.4.8 - - [09/Mar/2018:17:48:27 +0000] "POST /nrdp//?token=welcome&cmd=token=welcome&cmd=submitcheck&XMLDATA=<?xml%20version='1.0'?><checkresults><checkresult%20t

Where the IP address would be the IP address of your active server. You could easily grep for these:

grep a.b.c.d /var/log/httpd/access_log

The POSTed data should contain hostname and service name information. It may be a bit difficult to decipher but https://meyerweb.com/eric/tools/dencoder/ can help clear that up. Once we can confirm data is either not coming in or it is coming but not updating the status on the passive server, we can focus attention where it is needed.

There are three send_nrdp scripts - is the command.cfg configured to use the php, py, or sh? https://assets.nagios.com/downloads/nag ... h-NRDP.pdf may be of help if you haven't seen it yet.

That's actually not where that's configured. In Nagios XI, the send_nrdp script is defined in /usr/local/nagiosxi/cron/perfdataproc.php and is hardcoded (by default) to use send_nrdp.sh. Also, by default that script processes the perfdata (which is what kicks off sending data via either NRDP or NSCA) en masse, which means you can't actually track whether or not a single host/service was sent.

Post by **krutaw** » Fri Mar 09, 2018 6:54 pm

scottwilkerson wrote:
krutaw wrote:While I could technically modify the send_nrdp.sh script to log the results, I figured I'd start by asking if there were any baked in logs that I should be looking at instead?
Nothing baked in for logging, besides what you already have found.

krutaw wrote:Also, is it worth investing time in changing the send_nrdp plugin being used to the python script instead of the bash script in perfdataproc.php?
I would say no; I doubt there is much, if any, performance difference between the two.

One thing you may need to do is increase the number of httpd threads on the receiving side.

Thanks Scott, great idea. I had managed to get it somewhat stable by enabling both the NRDP and NSCA transfers and setting the default freshness setting on the passive services to double the service check interval on the active servers. It had settled things down for the most part but it's far from optimal. I will definitely go down the httpd threads route instead. Thanks for that.

Post by **krutaw** » Sat Mar 10, 2018 2:29 pm

krutaw wrote: Thanks Scott, great idea. I had managed to get it somewhat stable by enabling both the NRDP and NSCA transfers and setting the default freshness setting on the passive services to double the service check interval on the active servers. It had settled things down for the most part but it's far from optimal. I will definitely go down the httpd threads route instead. Thanks for that.

I tried raising the number of forked httpd processes under the prefork MPM (the default in CentOS 7) to as many as 70 (the default starts at 5) and it had absolutely no impact. Interestingly, the Apache process on the receiving server is not showing any overt signs of stress (almost no CPU or memory usage). I'm starting to think that the problem is actually on the sender side of things, given that the results are grouped when doing the performance graphing.

Post by **cdienger** » Mon Mar 12, 2018 1:29 pm

Step through https://support.nagios.com/kb/article/n ... e-611.html and specifically the max_input_vars and memory option. There have been cases where the dashboards are not updated because the amount of data sent via NRDP is too large for the php defaults.

Post by **krutaw** » Mon Mar 12, 2018 6:10 pm

cdienger wrote:Step through https://support.nagios.com/kb/article/n ... e-611.html and specifically the max_input_vars and memory option. There have been cases where the dashboards are not updated because the amount of data sent via NRDP is too large for the php defaults.

Your answer was close enough that I may very well have solved it. Turns out the actual problem was the size of the data packets being posted to Apache from the Active Nagios servers.

Error from the httpd error log:
POST Content-Length of 8632304 bytes exceeds the limit of 8388608 bytes in Unknown on line 0

To mitigate, I had to update the following php settings:
post_max_size = 20M
upload_max_filesize = 20M
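The arithmetic behind that error can be checked with a small Python sketch. It assumes PHP's shorthand size notation (K, M, G as powers of 1024) and that the limit in force was the stock `post_max_size = 8M`, which matches the 8388608 bytes in the error; the helper names are made up for illustration.

```python
def php_size_to_bytes(value):
    """Convert a PHP shorthand size such as '8M' or '20M' to bytes."""
    value = value.strip()
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    suffix = value[-1].upper()
    if suffix in units:
        return int(value[:-1]) * units[suffix]
    # No suffix means the value is already a byte count.
    return int(value)


def post_fits(content_length, post_max_size):
    """True if a POST body of content_length bytes is within post_max_size."""
    return content_length <= php_size_to_bytes(post_max_size)


# Content-Length taken from the httpd error quoted above.
REJECTED_POST_BYTES = 8632304
```

With these numbers, `post_fits(REJECTED_POST_BYTES, "8M")` is false because the POST is 243,696 bytes over the limit, while `post_fits(REJECTED_POST_BYTES, "20M")` is true with plenty of headroom.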

So it looks like I spoke too soon. If I leave only NRDP enabled, then checks from the various active servers simply don't get posted often enough to the passive server. I'm not seeing anything in the error_log for httpd and the server shows no signs of stress in CPU, memory, load, IO, you name it. Also, the message queue between Nagios Core and NDOUtils is holding steady (at or near 0). Any other ideas?

Post by **cdienger** » Tue Mar 13, 2018 10:44 am

How far behind is the posting? The perfdataproc.php cron job that calls send_nrdp should be running every minute. /etc/cron.d/nagiosxi:

* * * * * nagios /usr/bin/php -q /usr/local/nagiosxi/cron/perfdataproc.php >> /usr/local/nagiosxi/var/perfdataproc.log 2>&1

Post by **krutaw** » Wed Mar 14, 2018 8:37 am

cdienger wrote:How far behind is the posting? The perfdataproc.php cron job that calls send_nrdp should be running every minute. /etc/cron.d/nagiosxi:

* * * * * nagios /usr/bin/php -q /usr/local/nagiosxi/cron/perfdataproc.php >> /usr/local/nagiosxi/var/perfdataproc.log 2>&1

I've currently got the freshness threshold set to 1500 seconds, which by my math is 25 minutes. The individual checks themselves run no more than 10 minutes apart (except for rare cases like SSL certificate verification, which happens once a day and has its own freshness threshold). Also, it looks like the feeds are failing for longer than that: I just noticed a check whose state duration was 57 minutes even though it has been checked at least 5 times in that same timeframe. In short, something is still not right here. The only error I'm seeing in the error_log for httpd at this point is:

PHP Notice: Undefined index: service_description in /usr/local/nagiosxi/html/includes/components/ccm/classes/data_class.php on line 1587, referer: http://ashentlmon-p99.advisory.com/nagi ... 26page%3D1
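Going back to the freshness numbers in this post, the math can be spelled out with a tiny Python sketch. It only restates the figures given here (a 1500-second threshold and checks at most 10 minutes apart); the function name is hypothetical.

```python
def freshness_summary(threshold_seconds, check_interval_seconds):
    """Return (threshold in minutes, whole check intervals that fit in the threshold)."""
    minutes = threshold_seconds / 60
    intervals_covered = threshold_seconds // check_interval_seconds
    return minutes, intervals_covered


# Figures from the post: 1500 s threshold, checks no more than 10 minutes apart.
THRESHOLD_SECONDS = 1500
CHECK_INTERVAL_SECONDS = 10 * 60
```

`freshness_summary(THRESHOLD_SECONDS, CHECK_INTERVAL_SECONDS)` gives 25 minutes and 2 intervals, so any service that goes stale on the passive server has had at least two consecutive results fail to arrive.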
