Remove nodes marked as down

Post by **dlcrites** » Thu Nov 16, 2017 8:05 pm

I was fed a list of nearly 2k hosts to add to Nagios. In theory they were all Linux hosts, but there was no guarantee of that, so I added them with just a ping check until I could figure out which OS each of them had. The problem is that the list included decommissioned hosts and a plethora of typos, so I now have hundreds of nodes showing as DOWN. I spent the better part of a day going through the list of services and removing the ping check for these bad hosts, but it is taking more time than I can spend on it.

Please tell me there is a way to say "just remove everything that is marked as down -- the services and nodes, both." If it had to be done in two passes, one for services and one for the hosts/nodes, then that would be okay. I'm not worried about losing something "good" because all of the hosts/nodes marked as "down" are really not there.

Any tips/tricks/ideas would be greatly appreciated!

Post by **mcapra** » Fri Nov 17, 2017 10:57 am

Ooof, that stings.

The REST API in Nagios XI is a handy tool for this sort of thing, but I can't think of a "quick and easy" way to do this in Nagios Core.

There are various methods to get a list of all hosts/services in a "DOWN/CRITICAL" state. Some lazy greps can get you there. Step 2 would involve some sort of script that works through your configuration files based on that list (this is slightly less trivial).

For the lazy greps, here's a command that parses status.dat to get a list of all hosts in a "DOWN" state:

Code:

grep 'hoststatus' -A 14 /usr/local/nagios/var/status.dat | grep -B 15 'current_state=1' | grep host_name

In action:

Code:

[root@capra_nag tmp]# grep 'hoststatus' -A 14 /usr/local/nagios/var/status.dat | grep -B 15 'current_state=1' | grep host_name
        host_name=CENESPROD00

And, the inverse, a list of every host in an "UP" state:

Code:

[root@capra_nag tmp]# grep 'hoststatus' -A 14 /usr/local/nagios/var/status.dat | grep -B 15 'current_state=0' | grep host_name
        host_name=CENESPROD01
        host_name=CENESPROD02
        host_name=CENESPROD03
        host_name=COLPROD02
        host_name=COLPROD03
        host_name=ESMARVELPROD00
        host_name=KIBPROD00
        host_name=PRODDATAMART
        host_name=PRODDELPHISCRIPTS
        host_name=PRODHERMES1
        host_name=PRODHERMES2
        host_name=PRODHERMES3

You can apply roughly the same logic to get a list of services in a "non-OK" state:

Code:

[root@capra_nag tmp]# grep 'servicestatus' -A 15 /usr/local/nagios/var/status.dat | grep -B 15 'current_state=[123]' | grep 'host_name\|service_description'
        host_name=CENESPROD00
        service_description=Heap Usage
        host_name=COLPROD02
        service_description=Fail Count
        host_name=COLPROD03
        service_description=Fail Count
        host_name=ESMARVELPROD00
        service_description=Cluster Status

For the general question of "how do I identify X objects with Y attributes", parsing status.dat is a really quick and easy way to do that.
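
The greps above depend on fixed line counts (-A 14, -B 15) between the start of a block, host_name, and current_state. As a rough sketch of a less position-sensitive variant, the function below reads status.dat one hoststatus block at a time with awk. The name list_hosts_in_state is made up for this example, and it assumes the usual status.dat layout where each block starts with "hoststatus {" and ends with a line containing only "}":

```shell
#!/usr/bin/env bash
# Sketch only: list_hosts_in_state is a hypothetical helper, not part of Nagios.
# Usage: list_hosts_in_state STATE [STATUS_FILE]
#   STATE is the host current_state to match (0 = UP, 1 = DOWN).
list_hosts_in_state() {
    local want="$1"
    local status_file="${2:-/usr/local/nagios/var/status.dat}"

    awk -v want="$want" '
        # A new hoststatus block starts: reset what we know about it.
        $1 == "hoststatus" && $2 == "{" { in_block = 1; host = ""; cur = ""; next }
        # Ignore every line that is not inside a hoststatus block.
        !in_block { next }
        # Remember the two attributes we care about, wherever they appear.
        /^[ \t]*host_name=/     { sub(/^[ \t]*host_name=/, "");     host = $0; next }
        /^[ \t]*current_state=/ { sub(/^[ \t]*current_state=/, ""); cur = $0;  next }
        # End of block: print the host if its state matched.
        $1 == "}" { if (cur == want) print host; in_block = 0 }
    ' "$status_file" | sort -u
}
```

For example, `list_hosts_in_state 1` should print the same host names as the first grep pipeline, one per line and without the `host_name=` prefix.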

Post by **dwasswa** » Fri Nov 17, 2017 1:03 pm

Thanks @mcapra.

@dlcrites, have you tried what @mcapra suggested?

Post by **kyang** » Tue Nov 28, 2017 2:37 pm

Hey dlcrites, just checking in to see if your issue is resolved?

Are we okay to close this thread? Or did you have any more questions?

Post by **dlcrites** » Tue Jan 02, 2018 12:54 pm

Thanks for the info -- it looks like what I need. Sorry for the delay -- things got way busy on this end because of our peak season, so I just had to deal with having them all hanging around.

I am working through the suggestions at this time. I will reply with my results from each one.

As a first step, I took the initial one-liner and changed it a bit to come up with a list of hostnames to work with:

Code:

grep 'hoststatus' -A 14 /usr/local/nagios/var/status.dat | grep -B 15 'current_state=1' | grep host_name | cut -d= -f2 | sort -u 1>/tmp/find.out

Post by **dlcrites** » Tue Jan 02, 2018 3:53 pm

WARNING ! ! ! ! !
WARNING ! ! ! ! ! This is a highly specific script which will probably NOT
WARNING ! ! ! ! ! work for you! It only worked for me because of some specific
WARNING ! ! ! ! ! things about my situation:
WARNING ! ! ! ! !
WARNING ! ! ! ! ! 1) Every node in question only had the PING service check running.
WARNING ! ! ! ! !
WARNING ! ! ! ! ! 2) Every node in a failure condition needed to be removed.
WARNING ! ! ! ! !
WARNING ! ! ! ! ! This will not work for any other situation!
WARNING ! ! ! ! !
WARNING ! ! ! ! ! I am just providing this as an example of what worked for me, and
WARNING ! ! ! ! ! am not suggesting you use this without significant modifications.
WARNING ! ! ! ! !

Code:

#!/usr/bin/env bash
# ============================================================================
readonly bzero=$(basename "$0" .sh)
readonly now=$(date '+%Y%m%d.%H%M%S')
readonly tfile="/tmp/$bzero.$now.out"

echo "Processing $bzero"
echo "-- temp file: $tfile"
# ============================================================================

# ============================================================================
# Build a list of the nodes that are down (host current_state=1).
# Only the sorted host names go into the temp file; errors stay on stderr.
# ==============================================
grep 'hoststatus' -A 14 /usr/local/nagios/var/status.dat | \
    grep -B 15 'current_state=1' | \
    grep host_name | \
    cut -d= -f2 | \
    sort -u > "$tfile"
# ============================================================================

# ============================================================================
# Remove the PING service and the host.
# ==============================================
readonly me='http://localhost/nagiosxi/api/v1/config'
readonly svc='service_description=PING&applyconfig=1'
readonly hst='applyconfig=1'
readonly apikey='apikey=put-your-api-key-here'
# ==============================================
function cmRemoveService
{
    local node=$1
    echo "-- removing PING for $node"
    curl -XDELETE "${me}/service?${apikey}&pretty=1&host_name=${node}&${svc}"
}
# ==============================================
function cmRemoveNode
{
    local node=$1
    echo "-- removing NODE $node"
    curl -XDELETE "${me}/host?${apikey}&pretty=1&host_name=${node}&${hst}"
}
# ============================================================================

# ============================================================================
# NOTE: "head -5" limits the run to the first five hosts (testing only).
head -5 "$tfile" | while IFS= read -r line ; do
    echo "Processing: [$line]"
    cmRemoveService "$line"
    cmRemoveNode "$line"

    # give it time to restart...
    sleep 30
done

# Show the start of the list for review (testing only), then clean up.
head -10 "$tfile"
rm -f "$tfile"
# ============================================================================

After tinkering around with this script, I added the "sleep 30" so Nagios had a chance to get restarted and stable before the next iteration. I tried 15 seconds, but that wasn't long enough; neither was 20. So I tried 30, and it never "missed" one. Actually, the only "problem" was that it attempted to remove the one it just removed, so that might not have been a "problem," per se, but I didn't want to risk it.
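
As an alternative to guessing a fixed sleep, one could poll for a readiness condition with a timeout. The sketch below is only an illustration: wait_until and nagios_is_up are made-up names, and the pgrep check is a placeholder assumption, since a running process does not prove that the restart triggered by applyconfig has finished. Substitute whatever check is meaningful on your system.

```shell
#!/usr/bin/env bash
# Sketch only: wait_until and nagios_is_up are hypothetical helpers.

# wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND once per second until it succeeds (returns 0),
# or gives up with status 1 once TIMEOUT_SECONDS have passed.
wait_until() {
    local timeout="$1"; shift
    local deadline=$(( $(date +%s) + timeout ))

    until "$@"; do
        [ "$(date +%s)" -lt "$deadline" ] || return 1
        sleep 1
    done
}

# Placeholder readiness check (an assumption, not a verified test of
# "restarted and stable"): is a process named exactly "nagios" running?
nagios_is_up() {
    pgrep -x nagios >/dev/null
}

# Possible use inside the removal loop, in place of "sleep 30":
#   wait_until 60 nagios_is_up || echo "-- gave up waiting after 60s"
```

With a check that really reflects a finished restart, the loop would wait only as long as each restart takes, rather than a flat 30 seconds per host.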

I thought of trying to remove all of the services in one pass, then doing the applyconfig, then removing all of the hosts, followed by another applyconfig. But since the one-at-a-time process I started with was working, and I don't need to do this more than once, I decided to just let it go. Yes, with hundreds of hosts in the list, this will take a while -- but nowhere near as long as doing this manually.
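
For anyone who wants to try that batched variant, here is an untested sketch of what it might look like. It reuses the config/service and config/host DELETE calls from the script above but leaves off applyconfig=1, and it assumes your Nagios XI version offers a separate system/applyconfig call via POST; check the API help pages on your own server before relying on that. The names xi_request and remove_down_hosts_batched are made up, host names are not URL-encoded, and DRY_RUN defaults to 1 so that nothing is sent unless you opt in.

```shell
#!/usr/bin/env bash
# Sketch only: batched removal with one applyconfig per pass.
# Assumptions: API base URL, API key placeholder, and the system/applyconfig
# endpoint. xi_request and remove_down_hosts_batched are hypothetical names.
api_base="${API_BASE:-http://localhost/nagiosxi/api/v1}"
apikey="${APIKEY:-put-your-api-key-here}"

# Print the request when DRY_RUN=1 (the default); otherwise send it with curl.
xi_request() {
    local method="$1" url="$2"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$method $url"
    else
        # Defensive: make sure curl cannot read the loop's stdin.
        curl -sS -X "$method" "$url" </dev/null
    fi
}

# remove_down_hosts_batched LIST_FILE
# LIST_FILE holds one host name per line (e.g. the sorted list built earlier).
remove_down_hosts_batched() {
    local list="$1" node

    # Pass 1: remove the PING service from every host, then apply once.
    while IFS= read -r node; do
        [ -n "$node" ] || continue
        xi_request DELETE "${api_base}/config/service?apikey=${apikey}&host_name=${node}&service_description=PING"
    done < "$list"
    xi_request POST "${api_base}/system/applyconfig?apikey=${apikey}"

    # Pass 2: remove the hosts themselves, then apply once more.
    while IFS= read -r node; do
        [ -n "$node" ] || continue
        xi_request DELETE "${api_base}/config/host?apikey=${apikey}&host_name=${node}"
    done < "$list"
    xi_request POST "${api_base}/system/applyconfig?apikey=${apikey}"
}
```

Running `remove_down_hosts_batched /tmp/find.out` with the default DRY_RUN=1 only prints the requests in order, which makes it easy to review the list before running again with DRY_RUN=0.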

THANKS ! ! ! ! ! !

Post by **dlcrites** » Tue Jan 02, 2018 3:56 pm

Another point is that the "head -5" and such were from my testing... I forgot to remove them. Obviously, once this is tested and working, you'd run it with the whole list.

I am satisfied with the results of this thread, so unless someone else has something to add, I'm done.

Thanks muchly,

DL

Post by **dwhitfield** » Tue Jan 02, 2018 5:10 pm

dlcrites wrote: After tinkering around with this script, I added the "sleep 30" so Nagios had a chance to get restarted and stable before the next iteration. I tried 15 seconds, but that wasn't long enough; neither was 20. So I tried 30, and it never "missed" one.

It seems like your issue is resolved, and I don't know a ton about the size of the system other than what was added, but those numbers make me wonder...do you have a ramdisk?

The script in https://assets.nagios.com/downloads/nag ... giosXI.pdf isn't going to work because it modifies at least one XI-specific file, but the document runs you through setting up a ramdisk manually, and as long as you skip /usr/local/nagiosxi/html/config.inc.php (and I think that's it), you should be good. There's also an NRDP file, but you could be using that with Core. Anyway, I don't think it should be too hard to use that for a Core setup. If you aren't already doing that, I suspect you'll see some performance improvement.

Nagios Support Forum
