Fixing damaged and/or partial installs of Nagios

agenerette
Posts: 50
Joined: Wed Jul 25, 2012 5:09 pm

Post by agenerette »

Hi,

I have a group of 5 Linux hosts that are monitored by a Nagios Core 3.2.3 server. A couple of weeks ago, bogus "Disk Space -- Critical" alerts started being generated for all of the hosts. I've spent a bit of time troubleshooting and found that, at the very least, all but one of the VMs have an empty /etc/nagios/nrpe.d directory. Even when I get basic NRPE communication working between the server and a particular host, I get "NRPE: Command 'check_disk' not defined" on the "Service Status Details for All Hosts" screen for that monitored host.

Could someone please help me with ensuring that each of my nodes has all of the Nagios and NRPE components installed that it needs? The systems are managed by Chef Server 10. I know that help with configuration changes on that system is not something I can look to this forum for. I have a dialogue running with the opscode.com folks, though, so if I can just get the precise commands for setting the nodes up properly, they should be able to help me translate them into Chef/Ruby code.

Thanks,

-Anthony
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: Fixing damaged and/or partial installs of Nagios

Post by abrist »

1) First, how were NRPE and nagios-plugins installed on the remote hosts? Through a repo, package, source, etc.?
2) Were your nrpe configs always stored in /etc/nagios/nrpe.d/ ?
3) Which checks were originally configured?
Former Nagios employee
agenerette
Posts: 50
Joined: Wed Jul 25, 2012 5:09 pm

Re: Fixing damaged and/or partial installs of Nagios

Post by agenerette »

The attached screen-shot shows which checks are currently configured. These are as they have been, at the very least, since I inherited the sysadmin role on this collection of hosts. The setup was completed by my predecessor, so I'm not able to say, for certain, how things were installed. Chef Server 10 is managing the machines, though, so if you're familiar with that system, I could look there for more information.

With regard to the /etc/nagios/nrpe.d directory, though, it's true that the Nagios/Chef server itself and one of the monitored hosts have a number of .cfg files located there. So, I'm guessing that the directory was not empty on the other hosts either, before things went south recently. The Chef configuration will tell us more, though, I suspect. I just don't know exactly where to look. I believe I'll try running "chef-client" on one of the nodes, redirecting the output to a file and searching that file for occurrences of "nrpe" and other likely strings.
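A minimal sketch of that search, assuming the run output is saved to a file first. The log path /tmp/chef-run.log and the function name search_chef_log are hypothetical placeholders, not part of the existing setup:

```shell
# Print the lines (with line numbers) of a saved chef-client log that
# mention NRPE or Nagios, ignoring case.
# Usage: search_chef_log <logfile>
search_chef_log() {
    grep -n -i -E 'nrpe|nagios' "$1"
}

# Example run on a node (needs root; the log path is an assumption):
#   chef-client -l debug > /tmp/chef-run.log 2>&1
#   search_chef_log /tmp/chef-run.log
```

Any template or file resource that touches /etc/nagios should show up in the matches, which would at least narrow down which recipe is responsible for the nrpe.d contents.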

-Anthony
Attachments
Screen Shot 2014-08-28 at 9.03.32 AM.png
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: Fixing damaged and/or partial installs of Nagios

Post by slansing »

agenerette wrote:
Chef Server 10 is managing the machines, though, so if you're familiar with that system, I could look there for more information.
Was all of that copied from an email or something? Some of it is a bit out of context...

Have you looked on those remote hosts and checked their nrpe.cfg files to make sure you actually have a check_disk defined? If so, please share it here.
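One rough way to run that check on a host, as a sketch only: the function name nrpe_command_defined is hypothetical, and the paths in the usage comment assume the /etc/nagios layout mentioned earlier in this thread.

```shell
# Report (via exit status) whether an NRPE command is defined in any of
# the given config files.
# Usage: nrpe_command_defined <command-name> <file>...
nrpe_command_defined() {
    name=$1
    shift
    # Match uncommented lines of the form: command[<name>]=...
    grep -q -E "^[[:space:]]*command\[${name}\][[:space:]]*=" "$@" 2>/dev/null
}

# Example (paths are assumptions based on this thread):
#   if nrpe_command_defined check_disk /etc/nagios/nrpe.cfg /etc/nagios/nrpe.d/*.cfg; then
#       echo "check_disk is defined"
#   else
#       echo "check_disk is NOT defined"
#   fi
```

Note that files under nrpe.d only count if nrpe.cfg actually pulls them in with an include_dir line.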
agenerette
Posts: 50
Joined: Wed Jul 25, 2012 5:09 pm

Re: Fixing damaged and/or partial installs of Nagios

Post by agenerette »

slansing wrote:
agenerette wrote:
Chef Server 10 is managing the machines, though, so if you're familiar with that system, I could look there for more information.
Was all of that copied from an email or something? Some of it is a bit out of context...

Have you looked on those remote hosts and checked their nrpe.cfg files to make sure you actually have a check_disk defined? If so, please share it here.

Where you've mentioned something being out of context, please clarify. My original posting wasn't copied from an email thread or anything like that.

On each of the monitored hosts, /etc/nagios/nrpe.cfg looks like this:

# Autogenerated by Chef.

pid_file=/var/run/nrpe.pid
server_port=5666
nrpe_user=nagios
nrpe_group=nagios
dont_blame_nrpe=0
debug=0
command_timeout=60
allowed_hosts=127.0.0.1,10.244.20.90
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: Fixing damaged and/or partial installs of Nagios

Post by tmcdonald »

Is that the *entire* nrpe.cfg file? If so it is definitely missing some things. For reference, I am linking the default nrpe.cfg file. In particular, you do not have any commands at all defined.

http://sourceforge.net/p/nagios/nrpe/ci ... rpe.cfg.in

Bear in mind that the @something@ placeholders are not to be included literally; they are replaced with their appropriate values at build time. Look more at the format than the literal contents.
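As a sketch of what a command definition looks like once the placeholders are filled in, the function below writes one into a drop-in directory. Everything specific here is an assumption to adjust per host: the function name write_check_disk_cfg, the plugin directory, the 20%/10% thresholds, and the partition being checked.

```shell
# Write a check_disk command definition into an NRPE drop-in directory.
# Both arguments are assumptions to adjust per host:
#   $1  directory named by include_dir in nrpe.cfg (e.g. /etc/nagios/nrpe.d)
#   $2  directory holding the Nagios plugins (e.g. /usr/lib64/nagios/plugins)
write_check_disk_cfg() {
    conf_dir=$1
    plugin_dir=$2
    # Warn below 20% free space and go critical below 10%, on the / filesystem.
    printf 'command[check_disk]=%s/check_disk -w 20%% -c 10%% -p /\n' \
        "$plugin_dir" > "$conf_dir/check_disk.cfg"
}

# Example (needs root; both paths are assumptions):
#   write_check_disk_cfg /etc/nagios/nrpe.d /usr/lib64/nagios/plugins
```

For NRPE to read that file, nrpe.cfg also needs a line such as include_dir=/etc/nagios/nrpe.d, and the NRPE daemon has to be restarted afterwards. The name inside the brackets must match what the Nagios server passes to check_nrpe with -c.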
Former Nagios employee
agenerette
Posts: 50
Joined: Wed Jul 25, 2012 5:09 pm

Re: Fixing damaged and/or partial installs of Nagios

Post by agenerette »

Ok, but I'm not able to determine, from the replies that I've seen on the forum, so far, what my next steps need to be, precisely.

I know of no way that changes would have been made to the setup on any of these hosts. Yet, suddenly, a few weeks ago, bogus alerts started coming out related to the "Disk Space" service; messages saying things like "CHECK_NRPE: Error - Could not complete SSL handshake".

If I add the Nagios server's public IP to /etc/nagios/nrpe.cfg on each of the monitored hosts, that SSL handshake message goes away. I believe I'll be able to work out how to have Chef read that address from the Nagios host.

The SSL handshake error, though, goes away only to be replaced by the "Command 'check_disk' not defined" alert. So, do you happen to know the steps that I would need to take to add this command?
eloyd
Cool Title Here
Posts: 2129
Joined: Thu Sep 27, 2012 9:14 am
Location: Rochester, NY

Re: Fixing damaged and/or partial installs of Nagios

Post by eloyd »

Something changed (for better or worse) in your Chef configuration. If Chef was truly managing these servers before (and I'm sure it was) then it would ALWAYS have been putting the same things out there. Something changed in your Chef configuration (or role configuration within Chef, most likely) to remove the NRPE stuff from the hosts.

Do you have access to knife? If so, can you do this:

Code:

knife node show <nodename>

where <nodename> is the Chef node name of one of the servers that you're having problems with?

Please copy/paste the output. I'm guessing that one of the recipes, roles, or run list for the node is whacked. Has anything changed with your Chef configuration lately? Not the server itself, but the cookbooks on it?
Eric Loyd • http://everwatch.global • 844.240.EVER • @EricLoyd
I'm a Nagios Fanatic!
agenerette
Posts: 50
Joined: Wed Jul 25, 2012 5:09 pm

Re: Fixing damaged and/or partial installs of Nagios

Post by agenerette »

Yeah, it really looks like something must have changed, but I'm the only person in our organization who knows anything about Chef, and it's been many months since I last did an upload of any kind.

"portal-production" is a node that's showing the "NRPE: Command 'check_disk' not defined" error. Here's its "knife node show" output:

ageneretteair2:chef-repo abgenerette$ knife node show portal-production
Node Name: portal-production
Environment: production
FQDN:
IP: 50.18.168.192
Run List: role[portal]
Roles: portal, php_app, server, chef-client, base
Recipes: chef-client::config, chef-client::service, vim, build-essential, git, apt-if-appropriate, sudo, users, users::sysadmins, user::data_bag, nagios::client, application, mysql::client, mysql::ruby, portal::default, portal::backup
Platform: amazon 2013.09
Tags:
eloyd
Cool Title Here
Posts: 2129
Joined: Thu Sep 27, 2012 9:14 am
Location: Rochester, NY

Re: Fixing damaged and/or partial installs of Nagios

Post by eloyd »

I don't suppose the nagios::client recipe has been changed lately?