Sudden Check_NRPE failures on monitored hosts

Post by **agenerette** » Thu Aug 14, 2014 10:29 pm

Issue: On 8/12/2014, our Chef/Nagios server kicked out a few alerts like the ones shown in the attached (alert) screen-shots. The "disk space – critical" message started showing up roughly every 30 minutes on the problem hosts.

I next ran "df -h" on a number of the hosts that were kicking out the alerts and found that disk space was fine on all of them. So, it seemed, at the very least, we weren't dealing with a true emergency.

The "Service Status" screen-shot shows what we're now getting. I'm wondering if anyone can help me figure out why CHECK_NRPE is suddenly failing.

I checked /var/log/syslog on the problem hosts. One showed:

Aug 13 18:46:33 ip-10-171-91-234 kernel: [23473304.143760] DROP_AFW_OUTPUT IN= OUT=eth0 SRC=10.171.91.234 DST=50.31.164.240 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=39425 DF PROTO=TCP SPT=56777 DPT=443 SEQ=4014266564 ACK=0 WINDOW=14600 RES=0x00 SYN URGP=0 UID=1009 GID=1010
Aug 13 18:46:55 ip-10-171-91-234 kernel: [23473326.748359] DROP_AFW_OUTPUT IN= OUT=eth0 SRC=10.171.91.234 DST=172.16.0.23 LEN=72 TOS=0x00 PREC=0x00 TTL=64 ID=22744 DF PROTO=UDP SPT=57388 DPT=53 LEN=52 UID=1009 GID=1010
Aug 13 18:46:55 ip-10-171-91-234 kernel: [23473326.755547] DROP_AFW_OUTPUT IN= OUT=eth0 SRC=10.171.91.234 DST=172.16.0.23 LEN=99 TOS=0x00 PREC=0x00 TTL=64 ID=22746 DF PROTO=UDP SPT=49184 DPT=53 LEN=79 UID=1009 GID=1010
Aug 13 18:46:55 ip-10-171-91-234 kernel: [23473326.755890] DROP_AFW_OUTPUT IN= OUT=eth0 SRC=10.171.91.234 DST=172.16.0.23 LEN=72 TOS=0x00 PREC=0x00 TTL=64 ID=22746 DF PROTO=UDP SPT=38722 DPT=53 LEN=52 UID=1009 GID=1010
Aug 13 18:46:55 ip-10-171-91-234 kernel: [23473326.756225] DROP_AFW_OUTPUT IN= OUT=eth0 SRC=10.171.91.234 DST=50.31.164.240 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=19674 DF PROTO=TCP SPT=56778 DPT=443 SEQ=149754880 ACK=0 WINDOW=14600 RES=0x00 SYN URGP=0 UID=1009 GID=1010
Aug 13 18:47:00 ip-10-171-91-234 kernel: [23473330.942884] DROP_AFW_OUTPUT IN= OUT=eth0 SRC=10.171.91.234 DST=172.16.0.23 LEN=73 TOS=0x00 PREC=0x00 TTL=64 ID=23792 DF PROTO=UDP SPT=60985 DPT=53 LEN=53 UID=0 GID=0
Aug 13 18:47:20 ip-10-171-91-234 nrpe[1845]: Host 54.245.143.104 is not allowed to talk to us!

That last line definitely seemed significant. That IP belongs to the Chef/Nagios host. It left me thinking that something might be wrong with /etc/nagios/nrpe.cfg on the problem hosts, but the Chef/Nagios server's IP hasn't changed and everything was fine on 8/11.

-Anthony

Post by **eloyd** » Fri Aug 15, 2014 8:45 am

So something obviously changed. Maybe something you weren't expecting. Can we see the nrpe.cfg file that's being sent out to your servers from Chef?

Post by **agenerette** » Fri Aug 15, 2014 5:39 pm

Yeah, I figured that something must have changed, but I inherited this setup from another sysadmin. Everyone else working for the organization is either a Developer or administrative staff. I've not been able to uncover any recent change that might have been made that would cause the issue that we're seeing.

I'm new to Chef, so I'm not sure exactly what I need to be posting here. Where you see "atlassian-jira" among the hosts on the "Service Status" page that I showed, I did a "chef-client" run. Then, I searched the output for nrpe.cfg. This yielded:

[2014-08-14T22:21:58+00:00] INFO: Processing template[/etc/nagios/nrpe.cfg] action create (nagios::client line 48)

We're running Chef Server version 10 on our own EC2 instance. So, logging into the console on that host, I went to Cookbooks => Recipes and located this block of code, starting at line 48, in the client "recipe":

template "#{node['nagios']['nrpe']['conf_dir']}/nrpe.cfg" do
  source "nrpe.cfg.erb"
  owner node['nagios']['user']
  group node['nagios']['group']
  mode 00644
  variables(
    :mon_host => mon_host,
    :nrpe_directory => "#{node['nagios']['nrpe']['conf_dir']}/nrpe.d"
  )
  notifies :restart, "service[#{node['nagios']['nrpe']['service_name']}]"
end

I went to Cookbooks => Templates and located this, under nrpe.cfg.erb:

# Autogenerated by Chef.

pid_file=<%= node['nagios']['nrpe']['pidfile'] %>
server_port=5666
nrpe_user=<%= node['nagios']['user'] %>
nrpe_group=<%= node['nagios']['group'] %>
dont_blame_nrpe=<%= node['nagios']['nrpe']['dont_blame_nrpe'] %>
debug=0
command_timeout=<%= node['nagios']['nrpe']['command_timeout'] %>
allowed_hosts=<%= @mon_host.join(',') %>
include_dir=<%= @nrpe_directory %>
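For anyone reading along, the allowed_hosts line in that template is just the mon_host list joined with commas, so an address that never makes it into mon_host can never end up in the rendered file. A rough Python equivalent of that one expression is below. How the recipe actually builds mon_host is not shown in this thread, and 10.0.0.5 is a made-up stand-in for the Nagios server's private IP.

```python
def render_allowed_hosts(mon_host):
    """Mimic the ERB expression `@mon_host.join(',')` for the allowed_hosts line."""
    return "allowed_hosts=" + ",".join(mon_host)


# Hypothetical input: loopback plus a placeholder private IP for the Nagios server.
print(render_allowed_hosts(["127.0.0.1", "10.0.0.5"]))
# prints: allowed_hosts=127.0.0.1,10.0.0.5
```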

Now, I just realized, as I was putting together the notes for this posting, that when I look through the "chef-client" output for the "transform-production" host, listed on our Nagios console, I find no instances of "nrpe.cfg". This is even though nagios::client is also in the list of recipes that should be applied to this node in Chef.

So, maybe this is, at least in part, a Chef issue. I also just noticed, though, that the output for "portal-production" does show the "nagios::client" stuff being processed.

-Anthony

Post by **eloyd** » Sun Aug 17, 2014 9:26 am

Would it be possible to see the /etc/nagios/nrpe.cfg file that Chef generates on one of the nodes that is affected?

Post by **agenerette** » Sun Aug 17, 2014 9:21 pm

Hey, not a problem. Thank you for getting back to me...

root@ip-10-170-213-72:~# cat /etc/nagios/nrpe.cfg
# Autogenerated by Chef.

pid_file=/var/run/nagios/nrpe.pid
server_port=5666
nrpe_user=nagios
nrpe_group=nagios
dont_blame_nrpe=0
debug=0
command_timeout=60
allowed_hosts=127.0.0.1,<private ip of nagios server>
include_dir=/etc/nagios/nrpe.d

Now, the public IP of the Nagios server is 54.245.143.104, but that has not been in the monitored nodes' nrpe.cfg file, up to now, and, again, everything worked up until early last week.

From the Nagios server, I just ran the following and got the output shown on the 2nd line:
root@ip-<private ip of Nagios server>:~# /usr/lib/nagios/plugins/check_nrpe -H <ip of monitored host>
CHECK_NRPE: Error - Could not complete SSL handshake.

The last line from "tail /var/log/syslog" run against the monitored node shows:
Aug 18 02:12:00 ip-10-170-213-72 kernel: [23762917.477003] DROP_AFW_OUTPUT IN= OUT=eth0 SRC=10.170.213.72 DST=50.31.164.240 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=9448 DF PROTO=TCP SPT=60467 DPT=443 SEQ=767911798 ACK=0 WINDOW=14600 RES=0x00 SYN URGP=0 UID=1009 GID=1010

Post by **eloyd** » Mon Aug 18, 2014 7:53 am

Look VERY closely at that firewall output in your syslog. Specifically, "DPT=443." That says that your check_nrpe is trying to connect to port 443. NRPE should be trying to connect on port 5666. I don't know why that is the case, but try this to see if it fixes it:

/usr/lib/nagios/plugins/check_nrpe -H <ip> -p 5666

You may also need to add a "-n" to skip using SSL.

I'm pretty sure one or both of those options will make things start working. As for why it stopped working, I couldn't tell you.
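If it helps to separate plain network reachability from NRPE's own checks, a bare TCP connect to port 5666 can be tried first. This is only a sketch: the function name and the 3-second default timeout are assumptions for illustration, not part of NRPE or the Nagios plugins. A successful connect only shows that something is listening and no firewall is blocking the port. NRPE can still reject the peer afterwards based on allowed_hosts or fail the SSL handshake.

```python
import socket


def tcp_port_open(host, port=5666, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Replace with the monitored host's address when running from the Nagios server.
    print(tcp_port_open("127.0.0.1"))
```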

Post by **agenerette** » Mon Aug 18, 2014 12:34 pm

Ah, that makes sense. The check_nrpe call is just something that I was using for testing, though, of course. My primary concern is with why Nagios is suddenly kicking out alerts for the hosts in question. I'm guessing that you saw nothing wrong with the nrpe.cfg file. So, where should I be looking, next?

On that check_nrpe test, with the options that you mentioned, I'm now getting:
root@ip-<ip of nagios server>:~# /usr/lib/nagios/plugins/check_nrpe -n -H <ip of monitored host> -p 5666
CHECK_NRPE: Error receiving data from daemon.

And syslog on the monitored host shows:

Aug 18 17:18:13 ip-10-170-213-72 kernel: [23817290.852666] DROP_AFW_OUTPUT IN= OUT=eth0 SRC=10.170.213.72 DST=172.16.0.23 LEN=91 TOS=0x00 PREC=0x00 TTL=64 ID=44702 DF PROTO=UDP SPT=48913 DPT=53 LEN=71 UID=0 GID=0
Aug 18 17:18:13 ip-10-170-213-72 kernel: [23817290.904119] DROP_AFW_OUTPUT IN= OUT=eth0 SRC=10.170.213.72 DST=172.16.0.23 LEN=91 TOS=0x00 PREC=0x00 TTL=64 ID=44715 DF PROTO=UDP SPT=39880 DPT=53 LEN=71 UID=0 GID=0
Aug 18 17:18:21 ip-10-170-213-72 nrpe[32745]: Host 54.245.143.104 is not allowed to talk to us!

-Anthony

Post by **eloyd** » Mon Aug 18, 2014 12:42 pm

Dude, something is really really whacked out if your firewall is saying "DPT=53" when you explicitly said "-p 5666". Port 53 is DNS traffic. So either these firewall reports are not related (most likely), or else your firewall is messing with you.
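One way to sanity-check whether those DROP_AFW_OUTPUT entries have anything to do with NRPE is to pull the protocol, destination, and destination port out of each line and see whether 5666 ever shows up. A minimal sketch, assuming the KEY=value layout seen in the syslog excerpts in this thread; the function name is made up for illustration.

```python
import re
from collections import Counter

# Matches KEY=value tokens in an iptables LOG line, e.g. DST=172.16.0.23 or IN=.
FIELD_RE = re.compile(r"\b([A-Z]+)=(\S*)")


def summarize_drops(lines, prefix="DROP_AFW_OUTPUT"):
    """Count dropped packets per (protocol, destination IP, destination port)."""
    counts = Counter()
    for line in lines:
        if prefix not in line:
            continue  # skip non-firewall lines such as the nrpe[...] messages
        fields = dict(FIELD_RE.findall(line))
        if "DPT" not in fields:
            continue  # no destination port logged (e.g. ICMP)
        counts[(fields.get("PROTO"), fields.get("DST"), int(fields["DPT"]))] += 1
    return counts


# Sample lines copied from the syslog output posted earlier in the thread.
sample = [
    "Aug 18 17:18:13 ip-10-170-213-72 kernel: [23817290.852666] DROP_AFW_OUTPUT IN= OUT=eth0 "
    "SRC=10.170.213.72 DST=172.16.0.23 LEN=91 TOS=0x00 PREC=0x00 TTL=64 ID=44702 DF "
    "PROTO=UDP SPT=48913 DPT=53 LEN=71 UID=0 GID=0",
    "Aug 18 18:37:31 ip-10-170-213-72 kernel: [23822048.653840] DROP_AFW_OUTPUT IN= OUT=eth0 "
    "SRC=10.170.213.72 DST=50.31.164.240 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=60846 DF "
    "PROTO=TCP SPT=34303 DPT=443 SEQ=951138630 ACK=0 WINDOW=14600 RES=0x00 SYN URGP=0 "
    "UID=1009 GID=1010",
    "Aug 18 17:18:21 ip-10-170-213-72 nrpe[32745]: Host 54.245.143.104 is not allowed to talk to us!",
]

for (proto, dst, dport), n in summarize_drops(sample).items():
    print(proto, dst, dport, n)
```

In the excerpts posted so far, every dropped packet is outbound from the monitored host (OUT=eth0, SRC is the host itself) to port 53 or 443, and none involve port 5666.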

The bigger problem is:

Aug 18 17:18:21 ip-10-170-213-72 nrpe[32745]: Host 54.245.143.104 is not allowed to talk to us!

Edit: This implies that your allowed_hosts line in your nrpe.cfg does not have the IP that your Nagios server is using listed in it.
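To make that concrete, the check amounts to comparing the source address NRPE sees (the one in the "is not allowed to talk to us" message) against the allowed_hosts entries. The sketch below does that comparison offline against the text of an nrpe.cfg. It is not NRPE's own code: the function names are hypothetical, 10.0.0.5 is a placeholder for the Nagios server's private IP, and hostname entries are simply skipped rather than resolved.

```python
import ipaddress


def parse_allowed_hosts(cfg_text):
    """Return the entries of the allowed_hosts directive found in nrpe.cfg text."""
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == "allowed_hosts":
            return [h.strip() for h in value.split(",") if h.strip()]
    return []


def is_allowed(source_ip, cfg_text):
    """Check whether source_ip matches an IP or CIDR entry in allowed_hosts."""
    addr = ipaddress.ip_address(source_ip)
    for entry in parse_allowed_hosts(cfg_text):
        try:
            if addr in ipaddress.ip_network(entry, strict=False):
                return True
        except ValueError:
            continue  # entry is a hostname; this sketch does not resolve it
    return False


# Mirrors the posted nrpe.cfg, with a placeholder private IP.
cfg = """# Autogenerated by Chef.
server_port=5666
allowed_hosts=127.0.0.1,10.0.0.5
include_dir=/etc/nagios/nrpe.d
"""

print(is_allowed("54.245.143.104", cfg))  # public IP from the syslog message
print(is_allowed("10.0.0.5", cfg))        # placeholder private IP
```

With only the private address listed, the public IP from the syslog message does not match, which is consistent with the rejection NRPE is logging.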

Post by **agenerette** » Mon Aug 18, 2014 2:07 pm

Yeah, this makes no sense. You'll notice that the server_port directive, in both nodes' nrpe.cfg files, is set to 5666. Where you see 10.244.20.90 in the monitored host's nrpe.cfg file, that, again, is the Nagios server's private IP. 54.245.143.104 is the server's public IP. I just added the latter address to the host's nrpe.cfg file, restarted the Nagios services on both, and ran check_nrpe again.

Now, I'm seeing the following in /var/log/syslog on the monitored host:

Aug 18 18:37:31 ip-10-170-213-72 kernel: [23822048.653570] DROP_AFW_OUTPUT IN= OUT=eth0 SRC=10.170.213.72 DST=172.16.0.23 LEN=72 TOS=0x00 PREC=0x00 TTL=64 ID=54504 DF PROTO=UDP SPT=41068 DPT=53 LEN=52 UID=1009 GID=1010
Aug 18 18:37:31 ip-10-170-213-72 kernel: [23822048.653840] DROP_AFW_OUTPUT IN= OUT=eth0 SRC=10.170.213.72 DST=50.31.164.240 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=60846 DF PROTO=TCP SPT=34303 DPT=443 SEQ=951138630 ACK=0 WINDOW=14600 RES=0x00 SYN URGP=0 UID=1009 GID=1010
Aug 18 18:38:00 ip-10-170-213-72 kernel: [23822077.475928] DROP_AFW_OUTPUT IN= OUT=eth0 SRC=10.170.213.72 DST=172.16.0.23 LEN=72 TOS=0x00 PREC=0x00 TTL=64 ID=61710 DF PROTO=UDP SPT=60009 DPT=53 LEN=52 UID=1009 GID=1010
Aug 18 18:38:00 ip-10-170-213-72 kernel: [23822077.476386] DROP_AFW_OUTPUT IN= OUT=eth0 SRC=10.170.213.72 DST=172.16.0.23 LEN=99 TOS=0x00 PREC=0x00 TTL=64 ID=61710 DF PROTO=UDP SPT=53162 DPT=53 LEN=79 UID=1009 GID=1010
Aug 18 18:38:00 ip-10-170-213-72 kernel: [23822077.476648] DROP_AFW_OUTPUT IN= OUT=eth0 SRC=10.170.213.72 DST=172.16.0.23 LEN=72 TOS=0x00 PREC=0x00 TTL=64 ID=61710 DF PROTO=UDP SPT=43826 DPT=53 LEN=52 UID=1009 GID=1010
Aug 18 18:38:00 ip-10-170-213-72 kernel: [23822077.476930] DROP_AFW_OUTPUT IN= OUT=eth0 SRC=10.170.213.72 DST=50.31.164.240 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=43289 DF PROTO=TCP SPT=34304 DPT=443 SEQ=1599232023 ACK=0 WINDOW=14600 RES=0x00 SYN URGP=0 UID=1009 GID=1010
Aug 18 18:38:16 ip-10-170-213-72 nrpe[4252]: Error: Could not complete SSL handshake. 1

Beyond the question of why the server's public address isn't getting auto-added to the hosts' nrpe.cfg file, there's this port question. I know of no other way to tell the NRPE utilities which port to use other than using that server_port directive.

Do you happen to know anything about controlling Nagios/NRPE settings and, for that matter, iptables settings, via Chef? Especially with iptables, I'm thinking that something must have changed, but I'm not sure what to look at changing.

-Anthony

Post by **eloyd** » Mon Aug 18, 2014 2:13 pm

Yes, I do know a bit about Chef and controlling Nagios with it. I also know that it doesn't just "break" unless someone mucks with the files. Are your recipes under any form of revision control? It's starting to sound like someone changed/upgraded your Chef recipes without regard for your Nagios configuration.

If that's the case, then it's beyond this forum, and I'd be happy to try to help you out via email. PM me if you want to go that route.
