ssh vs nrpe

Post by **jdalrymple** » Mon Mar 30, 2015 2:54 pm

The clash of the titans... ssh vs. nrpe

My opinion will be obvious

There is lots of support out there for both Puppet and Chef for automating the deployment, not to mention that NFS-mounting configuration directories for nrpe.cfg is just as trivial as NFS-mounting plugin directories.

I leave dont_blame_nrpe=0 in my configs. I prefer NOT to pass arguments, for security's sake; then I don't need to worry about whether my communications are in cleartext or whatever. I guess that assumes there is no iffy data in your Nagios check results, but I can't think of any reason to have any.

Here is my strongest sales pitch though:

[jdalrymple@localhost ~]$ time /usr/local/nagios/libexec/check_nrpe -H 127.0.0.1 -c check_dummy
OK

real    0m0.009s
user    0m0.003s
sys     0m0.000s
[jdalrymple@localhost ~]$ time ssh localhost /usr/local/nagios/libexec/check_dummy 0
OK

real    0m0.089s
user    0m0.014s
sys     0m0.004s

Post by **Box293** » Wed Apr 01, 2015 12:35 am

There are two reasons why I like check_by_ssh over check_nrpe.

#1 is that check_by_ssh does not have a limit on how much data can be returned. This is the primary reason why I chose this method for box293_check_vmware, as some of my checks return a lot of status text and performance data. This can be overcome with check_nrpe, but it requires recompiling with a patch at both ends, and all participants must be on the same patch.

#2 is that check_nrpe requires compiling on the client side whereas check_by_ssh does not require this on the client side.

BanditBBS wrote: Thousands of ssh connections can be a big load hog though...that plus this darn ssh issue we are facing here is just killing me. We can not find the cause or resolution! But yeah, having keys in place already makes ssh so nice and not having to actually install anything on the remote servers is great.

Leland Lammert did a great presentation at the Nagios World Conference 2013 called "Nagios in a Multi-Platform Environment".

https://www.youtube.com/watch?v=hjyJi-PsHw4

It's been a long time since I was in that room listening to his presentation; however, if I remember correctly, he talked about establishing permanent SSH tunnels between Nagios and the hosts. This in turn should reduce the number of connections.

Post by **mp4783** » Wed Apr 01, 2015 11:15 am

Permanent tunnels are a very compelling idea. If you use a muxed (shared) connection, then it will stay up until you tear it down. How many of these connections your server will support is most likely based upon the server resources. I think I've spun up a couple hundred. Most of them will be quiescent.
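For anyone who hasn't set one up, a shared connection can be sketched roughly like this with the standard OpenSSH client options ControlMaster, ControlPath and ControlPersist. The socket directory and the host name web01.example.com are placeholders I made up for illustration, and ControlPersist=yes is just one way to get the "stays up until you tear it down" behaviour.

```shell
#!/bin/sh
# Build the ssh options for a shared (multiplexed) connection.
# ControlMaster, ControlPath and ControlPersist are standard OpenSSH client options.
# The socket directory below is an assumption for illustration only.
MUX_DIR="${MUX_DIR:-$HOME/.ssh/mux}"

mux_opts() {
    # ControlPath tokens: %r = remote user, %h = remote host, %p = remote port
    printf '%s\n' "-o ControlMaster=auto -o ControlPath=$MUX_DIR/%r@%h:%p -o ControlPersist=yes"
}

# Usage sketch (hypothetical host). The first call opens the master connection;
# later calls reuse it and skip connection setup and authentication.
#   mkdir -p "$MUX_DIR" && chmod 700 "$MUX_DIR"
#   ssh $(mux_opts) web01.example.com /usr/local/nagios/libexec/check_dummy 0
#   ssh -O check $(mux_opts) web01.example.com   # is the master still up?
#   ssh -O exit  $(mux_opts) web01.example.com   # tear the master down
```

The unquoted $(mux_opts) relies on word splitting, so this only works as written if the socket directory path contains no spaces.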

There is a technology I have played around with that is part of a very interesting product called GateOne (http://liftoffsoftware.com/Products/GateOne). This is a Python based tool that does some really cool things. Inside it is something called Tornado (I think) that sounds like it's the type of thing to support thousands of simultaneous sessions.

One thing I have not been able to get working is the use of an encrypted private key on the Nagios XI server, accessed via an ssh-agent process. I use this all the time and I know that Apache supports it, but any time I try it, it fails. The only thing I've ever gotten to work was an unencrypted key. By the way, the ssh-agent carries the unencrypted key in memory, and a utility called keychain (http://www.funtoo.org/Keychain) creates a small file with the two environment variables necessary for Apache to access the ssh-agent process. With proper configuration, these environment variables will be set whenever the user logs on, or they can be sourced like this: . ~/.keychain/$(uname -n)-sh. Here are examples of the environment variables:

SSH_AUTH_SOCK=/tmp/ssh-BeD438an5648/agent.5648; export SSH_AUTH_SOCK;
SSH_AGENT_PID=5649; export SSH_AGENT_PID;
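One thing worth ruling out is whether the process running the check sees those variables at all. Below is a minimal sketch of a helper that sources the keychain file and then asks the agent what it holds. The function name load_agent_env is made up for illustration; the default path is the ~/.keychain convention mentioned above.

```shell
#!/bin/sh
# Load the ssh-agent environment that keychain wrote, so a non-interactive
# process can reach the agent. load_agent_env is a hypothetical helper name.
load_agent_env() {
    env_file="${1:-$HOME/.keychain/$(uname -n)-sh}"
    if [ ! -r "$env_file" ]; then
        echo "agent env file not readable: $env_file" >&2
        return 1
    fi
    # The file holds plain 'VAR=value; export VAR;' lines, so sourcing it is enough.
    . "$env_file"
    # Succeed only if the socket variable actually ended up set.
    [ -n "$SSH_AUTH_SOCK" ]
}

# Usage sketch: ssh-add -l exits 0 if keys are loaded, 1 if the agent holds
# no identities, and 2 if the agent cannot be contacted at all.
#   load_agent_env && ssh-add -l
```

Running this as the same user that executes the checks shows whether the failure is a missing environment or something further along.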

Post by **BanditBBS** » Wed Apr 01, 2015 11:29 am

My issue with the permanent ssh tunnels is what happens when I need to start using gearman to distribute the load. We're going to eventually need to use gearman, so when that happens, it'd be a nightmare to maintain, no? Plus it would be 1100 tunnels right now and eventually many more, that might be a bit much.

Post by **ssax** » Wed Apr 01, 2015 4:22 pm

Seems like 1100+ tunnels is a lot, I wonder what type of resources that would consume?

Post by **Box293** » Wed Apr 01, 2015 6:41 pm

BanditBBS wrote: My issue with the permanent ssh tunnels is what happens when I need to start using gearman to distribute the load. We're going to eventually need to use gearman, so when that happens, it'd be a nightmare to maintain, no? Plus it would be 1100 tunnels right now and eventually many more, that might be a bit much.

Let's say you had the XI server running gearman as the master and a worker. You also have two other external workers. So that's 3 workers.

From each worker you would need a tunnel to all remote hosts. That's a lot of tunnels from the worker's perspective. From the remote host's perspective, there are only 3 tunnels established. That's less overhead than establishing an ssh or nrpe session each time a check needs to run.
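To put numbers on that, here is the arithmetic for 3 workers and the 1100 hosts mentioned earlier, as a quick sketch:

```shell
#!/bin/sh
# Tunnel counts for the scenario above: 3 gearman workers, 1100 remote hosts.
# Every worker keeps one tunnel open to every host.
workers=3
hosts=1100

per_worker=$hosts              # tunnels each worker maintains
per_host=$workers              # tunnels each remote host sees
total=$((workers * hosts))     # tunnels across the whole setup

echo "per worker: $per_worker, per host: $per_host, total: $total"
```

So each worker carries 1100 tunnels and the setup as a whole carries 3300, while any single remote host only ever sees 3.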

BanditBBS wrote: We're going to eventually need to use gearman, so when that happens, it'd be a nightmare to maintain, no?

Yeah that's the part I've not had to actually do. But I'm sure it's just as much work as maintaining check_nrpe when upgraded versions come out. I'm sure there is a scripted way to manage it.

Post by **mp4783** » Thu Apr 02, 2015 5:49 pm

There isn't a clear answer here. You're going to have to judge system load and overhead with respect to how many tunnels you can support. I think I've had a couple hundred shared connections up and the server didn't seem to mind. Most of the time, they're not doing anything. When all is said and done, the shared connection just buys you some speed in terms of not having to build the connection and tear it down each time. This time saving may not be worth the hassle.

As for the management aspects of it, I don't know enough about Mod Gearman to comment, but like anything, you can probably script a solution. Sorry I don't have a better answer.

Post by **BanditBBS** » Thu Apr 02, 2015 7:23 pm

I appreciate all the replies. I think I am going to go the nrpe route. I am in charge of all the monitoring here, and I just can't get enough traction behind investigating and fixing this ssh issue we have every so often. I know I can get nrpe to work and work well. I also have automation for deployment and updating all figured out in my head, I just need to put pen to paper.

Post by **rajasegar** » Fri Apr 03, 2015 2:34 am

In an enterprise environment, nrpe will be flagged in every audit.
There is no way around this. For these cases, we use check_by_ssh.

We rarely have to modify nrpe.cfg because all is parameterised.
This is another auditor's favourite topic which we counter by saying we have sufficient mitigation in place via firewall rules and nrpe whitelist.

Our nrpe is not running under inetd or xinetd. It is run under a normal nagios user account.
Never had any problem so far using this method. Easier to get access to nagios account than root.
Any permission issues can be handled via sudo or RBAC (Solaris).

Tons of ways to automate but since most of our nrpe.cfg is different by server we do it manually.

Post by **BanditBBS** » Fri Apr 03, 2015 8:36 am

Yeah, I've had to deal with the SOC audits myself in the past. I was always able to shut them up by showing the security and other stuff setup.

I was planning on writing a check that runs once a day on all my Linux servers and checks a folder for a file. If the file exists, a script will be kicked off that grabs the new plugins and the new nrpe.cfg. It will then restart xinetd. We centrally manage sudo files, so we can easily add the line to allow the nagios user to restart that service globally. So basically, I'll throw an update file in a folder, and remove it 24 hours later. In that time all servers should run the once/1440 minutes check.
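A rough sketch of what that check could look like as a plugin. The flag file path, the updater script and the function name run_update_check are all hypothetical placeholders; nothing here is written yet.

```shell
#!/bin/sh
# Sketch of the once-a-day update check described above.
# FLAG_FILE and UPDATER are hypothetical placeholders, not real paths.
FLAG_FILE="${FLAG_FILE:-/path/to/updates/update.flag}"
UPDATER="${UPDATER:-/path/to/fetch_plugins_and_cfg.sh}"

run_update_check() {
    # No flag file means nothing to do.
    if [ ! -e "$FLAG_FILE" ]; then
        echo "OK - no update pending"
        return 0
    fi
    # The updater is expected to pull the new plugins and nrpe.cfg and then
    # restart xinetd (through the sudo rule for the nagios user).
    if "$UPDATER" >/dev/null 2>&1; then
        echo "OK - update applied"
        return 0
    fi
    echo "WARNING - update flag present but updater failed"
    return 1
}

# Nagios plugins report state through the exit code (0 = OK, 1 = WARNING),
# so a real plugin would end with:
#   run_update_check; exit $?
```

Returning WARNING when the updater fails means a broken rollout shows up in Nagios itself instead of failing silently.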

I need to automate it as we are constantly writing new plugins/expanding our monitoring. That, plus bug fixes or updates to existing plugins. Of course, this plan may all get thrown away and we may do it another way, but that's what is in my mind at the moment.
