Remove nodes marked as down
Posted: Thu Nov 16, 2017 8:05 pm
by dlcrites
I was fed a list of nearly 2k hosts to add to Nagios. In theory they were all Linux hosts, but there was no guarantee of that. So I added them with just ping until I could figure out which OS each of them had. The problem is that the list included decommissioned hosts and a plethora of typos, so I now have hundreds of nodes which are showing as DOWN. I spent the better part of a day going through the list of services and removing the ping for these bad hosts, but it is taking more time than I can spend on it.
Please tell me there is a way to say "just remove everything that is marked as down -- the services and nodes, both." If it had to be done in two passes, one for services and one for the hosts/nodes, then that would be okay. I'm not worried about losing something "good" because all of the hosts/nodes marked as "down" are really not there.
Any tips/tricks/ideas would be greatly appreciated!
Re: Remove nodes marked as down
Posted: Fri Nov 17, 2017 10:57 am
by mcapra
Ooof, that stings.
The REST API in Nagios XI is a handy tool for this sort of thing, but I can't think of a "quick and easy" way to do this in Nagios Core.
There are various methods to get a list of all hosts/services in a "DOWN/CRITICAL" state. Some lazy greps can get you there. Step 2 would involve some sort of script that works through your configuration files based on that list (this is slightly less trivial).
For the lazy greps, here's a command that parses status.dat to get a list of all hosts in a "DOWN" state:
Code:
grep 'hoststatus' -A 14 /usr/local/nagios/var/status.dat | grep -B 15 'current_state=1' | grep host_name
In action:
Code:
[root@capra_nag tmp]# grep 'hoststatus' -A 14 /usr/local/nagios/var/status.dat | grep -B 15 'current_state=1' | grep host_name
host_name=CENESPROD00
And the inverse, a list of every host in an "UP" state:
Code:
[root@capra_nag tmp]# grep 'hoststatus' -A 14 /usr/local/nagios/var/status.dat | grep -B 15 'current_state=0' | grep host_name
host_name=CENESPROD01
host_name=CENESPROD02
host_name=CENESPROD03
host_name=COLPROD02
host_name=COLPROD03
host_name=ESMARVELPROD00
host_name=KIBPROD00
host_name=PRODDATAMART
host_name=PRODDELPHISCRIPTS
host_name=PRODHERMES1
host_name=PRODHERMES2
host_name=PRODHERMES3
You can apply roughly the same logic to get a list of services in a "non-OK" state:
Code:
[root@capra_nag tmp]# grep 'servicestatus' -A 15 /usr/local/nagios/var/status.dat | grep -B 15 'current_state=[123]' | grep 'host_name\|service_description'
host_name=CENESPROD00
service_description=Heap Usage
host_name=COLPROD02
service_description=Fail Count
host_name=COLPROD03
service_description=Fail Count
host_name=ESMARVELPROD00
service_description=Cluster Status
For the general question of "how do I identify X objects with Y attributes", parsing status.dat is a really quick-and-easy way to do that.
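The greps above depend on fixed context sizes (-A 14, -B 15), so they can misfire if the number of lines in a status block differs between versions. As a rough alternative sketch, the same list can be built by reading status.dat block by block with awk. The function name list_down_hosts is made up for illustration, and the block layout ("hoststatus {" to open, "}" to close, one key=value per line) is an assumption based on how status.dat normally looks:

```shell
#!/usr/bin/env bash
# Sketch only: print host_name for every hoststatus block with current_state=1 (DOWN).
# list_down_hosts is a hypothetical helper name, not part of Nagios.
list_down_hosts() {
    # Default path is the one used in this thread; pass another path as $1.
    local status_file=${1:-/usr/local/nagios/var/status.dat}
    awk '
        # A new host block starts: reset what we know about it.
        /^[ \t]*hoststatus[ \t]*[{]/ { in_block = 1; name = ""; state = ""; next }
        # The block ends: report the host if it was DOWN.
        in_block && /^[ \t]*[}]/ {
            if (state == "1" && name != "") print name
            in_block = 0
            next
        }
        # Inside a host block: remember the two fields we care about.
        in_block {
            line = $0
            sub(/^[ \t]+/, "", line)
            if (line ~ /^host_name=/)     name  = substr(line, 11)
            if (line ~ /^current_state=/) state = substr(line, 15)
        }
    ' "$status_file"
}
```

Something like "list_down_hosts | sort -u > /tmp/find.out" would then produce the same kind of list as the grep pipeline. Service blocks are ignored because only blocks opened by "hoststatus" are tracked.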
Re: Remove nodes marked as down
Posted: Fri Nov 17, 2017 1:03 pm
by dwasswa
Thanks @mcapra.
@dlcrites, have you tried what @mcapra suggested?
Re: Remove nodes marked as down
Posted: Tue Nov 28, 2017 2:37 pm
by kyang
Hey dlcrites, just checking in to see if your issue is resolved?
Are we okay to close this thread? Or did you have any more questions?
Re: Remove nodes marked as down
Posted: Tue Jan 02, 2018 12:54 pm
by dlcrites
Thanks for the info -- it looks like what I need. Sorry for the delay -- things got way busy on this end because of our peak season, so I just had to deal with having them all hanging around.
I am working through the suggestions at this time. I will reply with my results from each one.
My first result: I took the initial one-liner and changed it a bit to come up with a list of hostnames to work with:
Code:
grep 'hoststatus' -A 14 /usr/local/nagios/var/status.dat | grep -B 15 'current_state=1' | grep host_name | cut -d= -f2 | sort -u 1>/tmp/find.out
Re: Remove nodes marked as down
Posted: Tue Jan 02, 2018 3:53 pm
by dlcrites
WARNING ! ! ! ! !
WARNING ! ! ! ! ! This is a highly specific script which will probably NOT
WARNING ! ! ! ! ! work for you! It only worked for me because of some specific
WARNING ! ! ! ! ! things about my situation:
WARNING ! ! ! ! !
WARNING ! ! ! ! ! 1) Every node in question only had the PING service check running.
WARNING ! ! ! ! !
WARNING ! ! ! ! ! 2) Every node in a failure condition needed to be removed.
WARNING ! ! ! ! !
WARNING ! ! ! ! ! This will not work for any other situation!
WARNING ! ! ! ! !
WARNING ! ! ! ! ! I am just providing this as an example of what worked for me, and
WARNING ! ! ! ! ! am not suggesting you use this without significant modifications.
WARNING ! ! ! ! !
Code:
#!/usr/bin/env bash
# ============================================================================
readonly bzero=$(basename "$0" .sh)
readonly now=$(date '+%Y%m%d.%H%M%S')
readonly tfile=/tmp/$bzero.$now.out
echo "Processing $bzero"
echo "-- temp file: $tfile"
# ============================================================================
# ============================================================================
# Build a list of the nodes that are down.
# ==============================================
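# Only host names go into the temp file; errors stay on stderr so they
# can never be mistaken for host names later on.
grep 'hoststatus' -A 14 /usr/local/nagios/var/status.dat | \
    grep -B 15 'current_state=1' | \
    grep host_name | \
    cut -d= -f2 | \
    sort -u > "$tfile"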
# ============================================================================
# ============================================================================
# Remove the PING service and the host.
# ==============================================
readonly me='http://localhost/nagiosxi/api/v1/config'
readonly svc='service_description=PING&applyconfig=1'
readonly hst='applyconfig=1'
readonly apikey='apikey=put-your-api-key-here'
# ==============================================
function cmRemoveService
{
    local node=$1
    echo "-- removing PING for $node"
    curl -XDELETE "${me}/service?${apikey}&pretty=1&host_name=${node}&${svc}"
}
# ==============================================
function cmRemoveNode
{
    local node=$1
    echo "-- removing NODE $node"
    curl -XDELETE "${me}/host?${apikey}&pretty=1&host_name=${node}&${hst}"
}
# ============================================================================
# ============================================================================
# NOTE: "head -5" limits the run to five hosts while testing.
head -5 "$tfile" | while read -r line ; do
    echo "Processing: [$line]"
    cmRemoveService "$line"
    cmRemoveNode "$line"
    # give it time to restart...
    sleep 30
done
head -10 "$tfile"
rm -f "$tfile"
# ============================================================================
After tinkering around with this script, I added the "sleep 30" so Nagios had a chance to get restarted and stable before the next iteration. I tried 15 seconds, but that wasn't long enough; neither was 20. So I tried 30, and it never "missed" one. Actually, the only "problem" was that it attempted to remove the one it just removed, so that might not have been a "problem," per se, but I didn't want to risk it.
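A fixed "sleep 30" works, but the right number clearly depends on the system. One possible variation, offered only as a sketch, is to poll for a readiness condition with an upper time limit instead. The helper name wait_until is made up for illustration, and what counts as "Nagios is back and stable" is an assumption you would have to choose yourself (for example, status.dat being newer than a marker file touched just before the API call):

```shell
#!/usr/bin/env bash
# Sketch only: retry a command until it succeeds or a time limit is reached.
# Usage: wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as COMMAND succeeds, 1 if TIMEOUT_SECONDS pass first.
wait_until() {
    local timeout=$1 interval=$2
    shift 2
    local waited=0
    until "$@"; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}
```

In the loop, the "sleep 30" could then become something along the lines of: touch a marker file before the curl calls, then run wait_until 60 2 test /usr/local/nagios/var/status.dat -nt "$marker". Whether a fresh status.dat really means Nagios is stable is untested here, so keeping a short sleep as well may be the safer choice.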
I thought of trying to remove all of the services in one pass, then doing the applyconfig, then removing all of the hosts, followed by another applyconfig. But since the one-at-a-time process I started with was working, and I don't need to do this more than once, I decided to just let it go. Yes, with hundreds of hosts in the list, this will take a while -- but nowhere near as long as doing this manually.
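For anyone who does want the two-pass variant described above, here is a rough, untested sketch. It rests on two assumptions that are not confirmed in this thread: that leaving applyconfig=1 off a DELETE call defers the restart, and that the XI API offers a system/applyconfig endpoint for applying the configuration separately. Check both against the API help pages of your XI version. The names api_call, remove_down_hosts_batched, API_BASE, API_KEY and DRY_RUN are invented for illustration, and host names are put into the URL as-is, so this assumes they contain no characters that need URL encoding:

```shell
#!/usr/bin/env bash
# Sketch only: delete all PING services, apply, delete all hosts, apply again.
# ASSUMPTIONS: omitting applyconfig=1 defers the restart, and system/applyconfig exists.
api_base=${API_BASE:-http://localhost/nagiosxi/api/v1}
api_key=${API_KEY:-put-your-api-key-here}

# Run curl, or only print the command when DRY_RUN=1 is set.
api_call() {
    local method=$1 url=$2
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "curl -X${method} ${url}"
    else
        curl -s -X"${method}" "${url}"
        echo
    fi
}

# $1 is a file with one host name per line (like the temp file in the script above).
remove_down_hosts_batched() {
    local list=$1 node
    # Pass 1: remove the PING service from every listed host, then apply once.
    while IFS= read -r node; do
        [ -n "$node" ] || continue
        api_call DELETE "${api_base}/config/service?apikey=${api_key}&host_name=${node}&service_description=PING"
    done < "$list"
    api_call POST "${api_base}/system/applyconfig?apikey=${api_key}"
    # Pass 2: remove the hosts themselves, then apply once more.
    while IFS= read -r node; do
        [ -n "$node" ] || continue
        api_call DELETE "${api_base}/config/host?apikey=${api_key}&host_name=${node}"
    done < "$list"
    api_call POST "${api_base}/system/applyconfig?apikey=${api_key}"
}
```

Running it first as "DRY_RUN=1 remove_down_hosts_batched /tmp/find.out" only prints the curl commands, which makes it easy to review the list before anything is deleted. A pause or readiness check after each apply is probably still needed, for the same reason the one-at-a-time script needed its sleep.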
THANKS ! ! ! ! ! !
Re: Remove nodes marked as down
Posted: Tue Jan 02, 2018 3:56 pm
by dlcrites
Another point is that the "head -5" and such were from my testing... I forgot to remove them. Obviously, once this is tested and working, you'd run it with the whole list.
I am satisfied with the results of this thread, so unless someone else has something to add, I'm done.
Thanks muchly,
DL
Re: Remove nodes marked as down
Posted: Tue Jan 02, 2018 5:10 pm
by dwhitfield
dlcrites wrote:
After tinkering around with this script, I added the "sleep 30" so Nagios had a chance to get restarted and stable before the next iteration. I tried 15 seconds, but that wasn't long enough; neither was 20. So I tried 30, and it never "missed" one.
It seems like your issue is resolved, and I don't know a ton about the size of the system other than what was added, but those numbers make me wonder...do you have a ramdisk?
The script in https://assets.nagios.com/downloads/nag ... giosXI.pdf isn't going to work because it modifies at least one XI-specific file, but the document runs you through setting up a ramdisk manually, and as long as you skip /usr/local/nagiosxi/html/config.inc.php (and I think that's it), you should be good. There's also an NRDP file, but you could be using that with Core. Anyway, I don't think it should be too hard to use that for a Core setup. If you aren't already doing that, I suspect you'll see some performance improvement.