Hi.
I am using nagios-4.2.0 as the Nagios Core server; it runs CentOS 6.x with all the latest patches.
I can monitor all servers but one. All monitored servers run the same CentOS 6.x (all patches) with nrpe 3.0.
If I downgrade that one server to nrpe 2.14 (with SSL), it works happily.
If I upgrade that one server to nrpe 3.0 (with SSL), it does NOT work.
The Nagios server sits on a local network behind a firewall and runs NRPE 3.0.
The Nagios server happily talks to all other servers on various networks on port 5666. I know the port is open on ALL machines for access from the Nagios Core server.
I also know that port 5666 is open on the machine that has the problem with nrpe 3.0: I can telnet to it on port 5666, but the connection then bails out with 3.0 and does not bail out with 2.14.
It would not work at all if the port were NOT open.
If I downgrade that machine (NRPE_CLIENT) to nrpe 2.14 it happily works, but the server logs are full of
Aug 25 16:43:47 CORESERVER check_nrpe: Remote NRPE_CLIENT does not support Version 3 Packets
Aug 25 16:43:49 CORESERVER check_nrpe: Remote NRPE_CLIENT accepted a Version 2 Packet
and the logs of that NRPE_CLIENT report
Aug 25 16:45:05 NRPE_CLIENT nrpe[23013]: Error: Request packet type/version was invalid!
Aug 25 16:45:05 NRPE_CLIENT nrpe[23013]: Client request was invalid, bailing out...
When I test it, I get:
[root@NAGIOSCORE /var/log] #>/usr/local/nagios/libexec/check_nrpe -H NRPE_CLIENT -c check_procs
PROCS OK : count 1706 |count=706;2000;2100 runqueue=1 blocked=0 running=1 new=0.00
Now if I upgrade the NRPE_CLIENT to nrpe 3.0, the logs are filled with
Aug 25 16:56:15 NAGIOSCORE check_nrpe: Error: Could not complete SSL handshake with NRPE_CLIENT: rc=0 SSL-error=5
and the logs of that NRPE_CLIENT report
Aug 25 16:55:17 NRPE_CLIENT nrpe[23815]: Host NAGIOSCORE is not allowed to talk to us!
What I see on the firewall is a "deny tcp (no connection) from NAGIOSCORE to NRPE_CLIENT flags on interface", which suggests that the SSL handshake traffic going through it is not following the correct TCP session flow (SYN, SYN-ACK, etc.): the firewall sees the second packet but not the first one.
But my issue is that this does NOT happen when using nrpe-2.14.
So what makes the SSL handling so different in 3.0 compared to 2.14?
Does anybody have an idea how I can fix this?
thanks
Jobst
For one server nrpe-2.14 works, nrpe-3.0 does not
Re: For one server nrpe-2.14 works, nrpe-3.0 does not
First, add -n to the check_nrpe command line for that host, and the nrpe command line on that host. With no encryption, that will tell us if it's an SSL problem or not.
If it works with -n, remove the -n from both ends. Add -s -1 to the check_nrpe command line, and set ssl_logging=0x2f in the nrpe.cfg file. This will log extra SSL information to syslog on both the Nagios host and the remote host. Post the resulting logfile lines from both ends so I can take a look.
EDIT: had -x -1, should be -s -1. Corrected above.
Re: For one server nrpe-2.14 works, nrpe-3.0 does not
EDIT: I heavily edited this reply!
I figured out the problem; the error messages are very misleading.
At the moment I believe it's a bug, but I will need to do some more debugging before I can say this for sure.
EDIT: it IS a bug, I now know where it comes from.
The Nagios Core server is a multi-homed host: an outside interface facing the internet and an inside interface connected to the DMZ.
Obviously it talks to many NRPE clients through both interfaces. I know that the routing table on the Nagios server host is 100% correct. I also know that everything works EXCEPT nrpe-3.0, and that it works for nrpe < 3.0.
There is something VERY odd in the setup sequence of the SSL handshake where it picks up one of the IP ADDRESSES from the allowed_hosts list.
EDIT: section below heavily changed after first submission
If I have this:
server_address=NAGIOS_CORE_SERVER
allowed_hosts=127.0.0.1,NAGIOS_CORE_SERVER
it works, if I have this:
server_address=NAGIOS_CORE_SERVER
allowed_hosts=127.0.0.1,localhost, NAGIOS_CORE_SERVER
it does not.
It has to do with "parse_allowed_hosts(char *)", which uses the following three functions:
- add_domain_to_acl()
- add_ipv6_to_acl()
- add_ipv4_to_acl()
If you use IP ADDRESSES ONLY, it does NOT break.
I am filing a couple of bug reports on GitHub.
Jobst
Last edited by jobst on Mon Aug 29, 2016 12:50 am, edited 4 times in total.
- Box293
Re: For one server nrpe-2.14 works, nrpe-3.0 does not
allowed_hosts=127.0.0.1,NAGIOS_CORE_SERVER, ANOTHER_HOST
There is a space after the comma. Does it make a difference if you remove the space?
Are you using IP addresses or DNS entries? Are they IPv6 addresses?
Re: For one server nrpe-2.14 works, nrpe-3.0 does not
Box293 wrote:allowed_hosts=127.0.0.1,NAGIOS_CORE_SERVER, ANOTHER_HOST
There is a space after the comma. Does it make a difference if you remove the space?
Are you using IP addresses or DNS entries? Are they IPv6 addresses?
Hi, thanks for the reply.
I edited my original reply as I found the issue.
The space is not the problem. Also, if you read the code in nrpe.c you will see that spaces are dealt with when the configuration file is read.
Re: For one server nrpe-2.14 works, nrpe-3.0 does not
This is an explanation of the issue.
The problem only occurs with nrpe 3.x; it does not occur with, for example, nrpe 2.14.
If you have this on the NRPE client
Code:
server_address=172.16.1.1
allowed_hosts=127.0.0.1,192.168.1.1
and issue a command on the machine 192.168.1.1 like so
Code:
check_nrpe -H 172.16.1.1 -c check_some-disk
it will work. If you add a HOSTNAME (i.e. not an IP address) to the list of allowed hosts on the NRPE client like so:
Code:
server_address=172.16.1.1
allowed_hosts=127.0.0.1,localhost,192.168.1.1
it will not work, and the nrpe client will show this in its log file:
Code:
Aug 26 17:07:10 172.16.1.1 nrpe[15514]: Connection from 192.168.1.1 port 38593
Aug 26 17:07:10 172.16.1.1 nrpe[15514]: Host 192.168.1.1 is not allowed to talk to us!
The reason is an issue with strtok in "parse_allowed_hosts": processing stops AFTER "localhost", so "192.168.1.1" is NOT added to the allowed ACLs.
"strtok" is NOT re-entrant: it keeps its position in an internal pointer. When "add_ipv4_to_acl" is called and cannot add "localhost" (which is correct, BTW), the flow is handed over to "add_domain_to_acl" (which is correct as well). But because "strtok" loses its pointer (due to the additional function call), processing STOPS after "localhost" is added and "192.168.1.1" is NEVER added.
I have suggested on the nrpe project on GitHub to add a function called "str_split", which prevents this and makes the whole thing more portable.
My problem, too, is that although my system has "strtok_r" (shown when I run the "configure" command), it is not used when "parse_allowed_hosts" runs.
Because "strtok" uses an internal pointer, calls to "strtok" from another function lose that pointer.
So in the code
Code:
#ifdef HAVE_STRTOK_R
    tok = strtok_r(hosts, delim, &saveptr);
#else
    tok = strtok(hosts, delim);
#endif
    while( tok) {
        trimmed_tok = malloc( sizeof( char) * ( strlen( tok) + 1));
        trim( tok, trimmed_tok);
        if( strlen( trimmed_tok) > 0) {
            if (!add_ipv4_to_acl(trimmed_tok) && !add_ipv6_to_acl(trimmed_tok)
                    && !add_domain_to_acl(trimmed_tok)) {
                syslog(LOG_ERR,"Can't add to ACL this record (%s). Check allowed_hosts option!\n",trimmed_tok);
            }
        }
        free( trimmed_tok);
#ifdef HAVE_STRTOK_R
        tok = strtok_r(NULL, delim, &saveptr);
#else
        tok = strtok(NULL, delim);
#endif
    }
the INTERNAL pointer that "strtok" keeps is LOST between these three function calls
Code:
!add_ipv4_to_acl(trimmed_tok)
!add_ipv6_to_acl(trimmed_tok)
!add_domain_to_acl(trimmed_tok)
so it will never work.
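To make the strtok behaviour described above easier to see outside of NRPE, here is a minimal sketch. It is NOT the NRPE code: add_entry(), count_with_strtok() and count_with_strtok_r() are hypothetical names made up for this illustration, and the assumption that the hostname path calls strtok internally is only a stand-in for whatever nested call disturbs the tokenizer state in the real program.

```c
#define _POSIX_C_SOURCE 200809L  /* exposes strtok_r */
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for an ACL helper (not the NRPE implementation).
 * Entries starting with a digit are treated as IP addresses and left alone.
 * Anything else is treated as a hostname and split on '.' with strtok(),
 * which overwrites strtok's single hidden position pointer. */
static void add_entry(const char *entry)
{
    static char copy[64];  /* static so the saved pointer never dangles */

    if (isdigit((unsigned char)entry[0]))
        return;
    snprintf(copy, sizeof copy, "%s", entry);
    for (char *p = strtok(copy, "."); p != NULL; p = strtok(NULL, ".")) {
        /* nothing to do: only the side effect on strtok's state matters */
    }
}

/* Outer loop using plain strtok(); returns how many entries it visited. */
static int count_with_strtok(const char *list)
{
    char buf[128];
    int n = 0;

    snprintf(buf, sizeof buf, "%s", list);
    for (char *tok = strtok(buf, ","); tok != NULL; tok = strtok(NULL, ",")) {
        n++;
        add_entry(tok);  /* may clobber the outer tokenizer's position */
    }
    return n;
}

/* Same loop using strtok_r(); the position lives in the caller's `save`. */
static int count_with_strtok_r(const char *list)
{
    char buf[128];
    char *save = NULL;
    int n = 0;

    snprintf(buf, sizeof buf, "%s", list);
    for (char *tok = strtok_r(buf, ",", &save); tok != NULL;
         tok = strtok_r(NULL, ",", &save)) {
        n++;
        add_entry(tok);  /* cannot touch `save`, so the loop continues */
    }
    return n;
}
```

Called with "127.0.0.1,localhost,192.168.1.1", count_with_strtok() visits only two entries (it stops after "localhost"), while count_with_strtok_r() visits all three. That matches the symptom above, where "192.168.1.1" never makes it into the ACL.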
Re: For one server nrpe-2.14 works, nrpe-3.0 does not
Nice work! Sorry you had to put so much time into it, though. And thanks for entering the issues on github.
There have been several problems with parse_allowed_hosts, so it doesn't surprise me that it's the culprit here.
I'll get to work on these and hopefully have them done in time for the 3.0.1 release coming up shortly.