I was fed a list of nearly 2k hosts to add to Nagios. In theory they were all linux hosts, but there was no guarantee of that. So I added them with just ping until I could figure out which OS each of them had. The problem is that the list included decommissioned hosts and a plethora of typos, so I now have hundreds of nodes which are showing as DOWN. I spent the better part of a day going through the list of services and removing the ping for these bad hosts, but it is taking more time than I can spend on it.
Please tell me there is a way to say "just remove everything that is marked as down -- the services and nodes, both." If it had to be done in two passes, one for services and one for the hosts/nodes, then that would be okay. I'm not worried about losing something "good" because all of the hosts/nodes marked as "down" are really not there.
Any tips/tricks/ideas would be greatly appreciated!
Remove nodes marked as down
Re: Remove nodes marked as down
Ooof, that stings.
The REST API in Nagios XI is a handy tool for this sort of thing, but I can't think of a "quick and easy" way to do this in Nagios Core.
There are various methods to get a list of all hosts/services in a "DOWN/CRITICAL" state. Some lazy greps can get you there. Step 2 would involve some sort of script that works through your configuration files based on that list (this is slightly less trivial).
For the lazy greps, here's a command that parses status.dat to get a list of all hosts in a "DOWN" state:
Code: Select all
grep 'hoststatus' -A 14 /usr/local/nagios/var/status.dat | grep -B 15 'current_state=1' | grep host_name
In action:
Code: Select all
[root@capra_nag tmp]# grep 'hoststatus' -A 14 /usr/local/nagios/var/status.dat | grep -B 15 'current_state=1' | grep host_name
host_name=CENESPROD00
And, the inverse, a list of every host in an "UP" state:
Code: Select all
[root@capra_nag tmp]# grep 'hoststatus' -A 14 /usr/local/nagios/var/status.dat | grep -B 15 'current_state=0' | grep host_name
host_name=CENESPROD01
host_name=CENESPROD02
host_name=CENESPROD03
host_name=COLPROD02
host_name=COLPROD03
host_name=ESMARVELPROD00
host_name=KIBPROD00
host_name=PRODDATAMART
host_name=PRODDELPHISCRIPTS
host_name=PRODHERMES1
host_name=PRODHERMES2
host_name=PRODHERMES3
You can apply roughly the same logic to get a list of services in a "non-OK" state:
Code: Select all
[root@capra_nag tmp]# grep 'servicestatus' -A 15 /usr/local/nagios/var/status.dat | grep -B 15 'current_state=[1,2,3]' | grep 'host_name\|service_description'
host_name=CENESPROD00
service_description=Heap Usage
host_name=COLPROD02
service_description=Fail Count
host_name=COLPROD03
service_description=Fail Count
host_name=ESMARVELPROD00
service_description=Cluster Status
For the question of "how do I identify X objects with Y attributes" generally, parsing status.dat is a really quick-and-easy way to do that.
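The grep pipelines above rely on fixed -A/-B line counts, so they depend on where current_state happens to sit inside each block. As a rough alternative, here is a minimal awk sketch that reads each hoststatus block as a unit. The function name hosts_in_state is made up for illustration, and it assumes the usual status.dat layout of "hoststatus {", one key=value per line, then a closing "}".

```shell
#!/usr/bin/env bash
# hosts_in_state FILE STATE
# Print the host_name of every hoststatus block whose current_state equals
# STATE (0 = UP, 1 = DOWN). Hypothetical helper, not part of Nagios.
hosts_in_state() {
    awk -v wanted="$2" '
        # Start of a host block: reset what we know about it.
        /^hoststatus[ \t]*[{]/ { in_host = 1; name = ""; state = ""; next }
        # Inside a host block, remember the two keys we care about.
        in_host && /^[ \t]*host_name=/     { sub(/^[ \t]*host_name=/, "");     name = $0 }
        in_host && /^[ \t]*current_state=/ { sub(/^[ \t]*current_state=/, ""); state = $0 }
        # End of the block: print the name if the state matches.
        in_host && /^[ \t]*[}]/ { if (state == wanted) print name; in_host = 0 }
    ' "$1" | sort -u
}

# Example (path taken from the commands above):
#   hosts_in_state /usr/local/nagios/var/status.dat 1
```

Because it only looks inside hoststatus blocks, service entries with the same current_state value are ignored.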
Former Nagios employee
https://www.mcapra.com/
Re: Remove nodes marked as down
Hey dlcrites, just checking in to see if your issue is resolved?
Are we okay to close this thread? Or did you have any more questions?
Re: Remove nodes marked as down
Thanks for the info -- it looks like what I need. Sorry for the delay -- things got way busy on this end because of our peak season, so I just had to deal with having them all hanging around.
I am working through the suggestions at this time. I will reply with my results from each one.
My first one is that I took the initial one-liner, and changed it a bit to come up with a list of hostnames to work with:
Code: Select all
grep 'hoststatus' -A 14 /usr/local/nagios/var/status.dat | grep -B 15 'current_state=1' | grep host_name | cut -d= -f2 | sort -u 1>/tmp/find.out
Re: Remove nodes marked as down
WARNING ! ! ! ! !
WARNING ! ! ! ! ! This is a highly specific script which will probably NOT
WARNING ! ! ! ! ! work for you! It only worked for me because of some specific
WARNING ! ! ! ! ! things about my situation:
WARNING ! ! ! ! !
WARNING ! ! ! ! ! 1) Every node in question only had the PING service check running.
WARNING ! ! ! ! !
WARNING ! ! ! ! ! 2) Every node in a failure condition needed to be removed.
WARNING ! ! ! ! !
WARNING ! ! ! ! ! This will not work for any other situation!
WARNING ! ! ! ! !
WARNING ! ! ! ! ! I am just providing this as an example of what worked for me, and
WARNING ! ! ! ! ! am not suggesting you use this without significant modifications.
WARNING ! ! ! ! !
After tinkering around with this script, I added the "sleep 30" so Nagios had a chance to get restarted and stable before the next iteration. I tried 15 seconds, but that wasn't long enough; neither was 20. So I tried 30, and it never "missed" one. Actually, the only "problem" was that it attempted to remove the one it just removed, so that might not have been a "problem," per se, but I didn't want to risk it.
Code: Select all
#!/usr/bin/env bash
# ============================================================================
readonly bzero=$(basename "$0" .sh)
readonly now=$(date '+%Y%m%d.%H%M%S')
readonly tfile="/tmp/$bzero.$now.out"
echo "Processing $bzero"
echo "-- temp file: $tfile"
# ============================================================================
# Build a list of the nodes that are down (hoststatus current_state=1).
# ==============================================
# stderr is left on the terminal so error messages never end up in the host list.
grep 'hoststatus' -A 14 /usr/local/nagios/var/status.dat | \
    grep -B 15 'current_state=1' | \
    grep host_name | \
    cut -d= -f2 | \
    sort -u >"$tfile"
# ============================================================================
# Remove the PING service and the host.
# ==============================================
readonly me='http://localhost/nagiosxi/api/v1/config'
readonly svc='service_description=PING&applyconfig=1'
readonly hst='applyconfig=1'
readonly apikey='apikey=put-your-api-key-here'
# ==============================================
# Delete the PING service for one host, then apply the configuration.
function cmRemoveService
{
    local node=$1
    echo "-- removing PING for $node"
    curl -XDELETE "${me}/service?${apikey}&pretty=1&host_name=${node}&${svc}"
}
# ==============================================
# Delete the host itself, then apply the configuration.
function cmRemoveNode
{
    local node=$1
    echo "-- removing NODE $node"
    curl -XDELETE "${me}/host?${apikey}&pretty=1&host_name=${node}&${hst}"
}
# ============================================================================
# NOTE: "head -5" limits the run to the first five hosts (left over from testing).
head -5 "$tfile" | while read -r line ; do
    echo "Processing: [$line]"
    cmRemoveService "$line"
    cmRemoveNode "$line"
    # give it time to restart...
    sleep 30
done
head -10 "$tfile"
rm -f "$tfile"
# ============================================================================
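The fixed "sleep 30" in the script above was found by trial and error. One possible alternative, sketched here under assumptions, is to poll a readiness check until it passes or a timeout expires. The function name wait_until is hypothetical, and what counts as "Nagios is back" is left to the caller; the pgrep example in the comment only shows that a process exists, which may not mean the restart has finished.

```shell
#!/usr/bin/env bash
# wait_until TIMEOUT INTERVAL CMD [ARGS...]
# Run CMD every INTERVAL seconds until it succeeds or roughly TIMEOUT seconds
# have been spent sleeping. Returns 0 on success, 1 on timeout.
wait_until() {
    local timeout=$1 interval=$2 waited=0
    shift 2
    until "$@" >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}

# Example (assumption: a running "nagios" process is a good enough signal):
#   wait_until 60 2 pgrep -x nagios || echo "Nagios did not come back in time"
```

In the loop above, a call like this could stand in for the sleep, with the timeout set to whatever upper bound feels safe.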
I thought of trying to remove all of the services in one pass, then doing the applyconfig, then removing all of the hosts, followed by another applyconfig, but since the one-at-a-time process I started with was working, and I don't need to do this more than once, I decided to just let it go. Yes, with hundreds of hosts in the list, this will take a while -- but nowhere near as long as doing this manually.
THANKS ! ! ! ! ! !
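For anyone who wants to try the two-pass idea described above, here is an untested sketch. It reuses the config/service and config/host DELETE calls from the script, leaves out applyconfig=1 on each call, and applies the configuration once after each pass. The system/applyconfig endpoint is an assumption on my part, so check the API documentation of your XI version. The function names are made up, host names are not URL-encoded, and DRY_RUN defaults to 1 so nothing is sent until you change it.

```shell
#!/usr/bin/env bash
# Two-pass removal sketch. With DRY_RUN=1 (the default) requests are only printed.
api=${API:-http://localhost/nagiosxi/api/v1}
apikey=${APIKEY:-put-your-api-key-here}
DRY_RUN=${DRY_RUN:-1}

# xi_request METHOD PATH [QUERY]  (hypothetical helper)
xi_request() {
    local url="${api}/$2?apikey=${apikey}${3:+&$3}"
    if [ "$DRY_RUN" = 1 ]; then
        echo "curl -s -X$1 '$url'"
    else
        curl -s -X"$1" "$url"
    fi
}

# remove_down_hosts LISTFILE  (one host name per line)
remove_down_hosts() {
    local node
    # Pass 1: remove the PING service from every host, then apply once.
    while read -r node; do
        if [ -n "$node" ]; then
            xi_request DELETE config/service "host_name=${node}&service_description=PING"
        fi
    done < "$1"
    xi_request POST system/applyconfig   # assumed endpoint
    # Pass 2: remove the hosts themselves, then apply once more.
    while read -r node; do
        if [ -n "$node" ]; then
            xi_request DELETE config/host "host_name=${node}"
        fi
    done < "$1"
    xi_request POST system/applyconfig   # assumed endpoint
}

# Example: remove_down_hosts /tmp/find.out
```

Run it in dry-run mode first and read the printed requests before setting DRY_RUN=0.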
Re: Remove nodes marked as down
Another point is that the "head -5" and such was during my testing... I forgot to remove them. Obviously, once this is tested and working, you'd run it with the whole list.
I am satisfied with the results of this thread, so unless someone else has something to add, I'm done.
Thanks muchly,
DL
Re: Remove nodes marked as down
dlcrites wrote: After tinkering around with this script, I added the "sleep 30" so Nagios had a chance to get restarted and stable before the next iteration. I tried 15 seconds, but that wasn't long enough; neither was 20. So I tried 30, and it never "missed" one.
It seems like your issue is resolved, and I don't know a ton about the size of the system other than what was added, but those numbers make me wonder... do you have a ramdisk?
The script in https://assets.nagios.com/downloads/nag ... giosXI.pdf isn't going to work because it modifies at least one XI-specific file, but the document runs you through setting up a ramdisk manually, and as long as you skip /usr/local/nagiosxi/html/config.inc.php (and I think that's it), you should be good. There's also an NRDP file, but you could be using that with Core. Anyway, I don't think it should be too hard to use that for a Core setup. If you aren't already doing that, I suspect you'll see some performance improvement.
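To give a rough idea of what a manual ramdisk setup might involve, here is a sketch that only prints a plan instead of running anything, since the real steps need root. The mount point, size, and the nagios user are assumptions, and the linked document remains the authoritative procedure. The nagios.cfg keys shown (status_file, object_cache_file, check_result_path, temp_path) are standard Nagios Core options; the function name ramdisk_plan is made up.

```shell
#!/usr/bin/env bash
# ramdisk_plan [DIR] [SIZE]
# Print (not execute) the commands and nagios.cfg settings for a tmpfs ramdisk.
# DIR and SIZE defaults are illustrative assumptions.
ramdisk_plan() {
    local dir=${1:-/var/nagiosramdisk} size=${2:-100m}
    cat <<EOF
# Run as root. tmpfs is emptied on reboot, so the directories
# must be recreated at boot before Nagios starts.
mkdir -p $dir
mount -t tmpfs -o size=$size tmpfs $dir
mkdir -p $dir/tmp $dir/spool/checkresults
chown -R nagios:nagios $dir
# nagios.cfg settings pointing at the ramdisk (restart Nagios afterwards):
status_file=$dir/status.dat
object_cache_file=$dir/objects.cache
check_result_path=$dir/spool/checkresults
temp_path=$dir/tmp
EOF
}

# Example: ramdisk_plan /var/nagiosramdisk 100m
```

Note that moving status.dat also changes the path the grep commands earlier in the thread need to read.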