Remove nodes marked as down

dlcrites
Posts: 5
Joined: Fri Jun 16, 2017 2:36 pm

Remove nodes marked as down

Post by dlcrites »

I was fed a list of nearly 2k hosts to add to Nagios. In theory they were all Linux hosts, but there was no guarantee of that, so I added them with just a ping check until I could figure out which OS each of them had. The problem is that the list included decommissioned hosts and a plethora of typos, so I now have hundreds of nodes showing as DOWN. I spent the better part of a day going through the list of services and removing the ping check for these bad hosts, but it is taking more time than I can spend on it.

Please tell me there is a way to say "just remove everything that is marked as down -- the services and nodes, both." If it had to be done in two passes, one for services and one for the hosts/nodes, then that would be okay. I'm not worried about losing something "good" because all of the hosts/nodes marked as "down" are really not there.

Any tips/tricks/ideas would be greatly appreciated!
mcapra
Posts: 3739
Joined: Thu May 05, 2016 3:54 pm

Re: Remove nodes marked as down

Post by mcapra »

Ooof, that stings.

The REST API in Nagios XI is a handy tool for this sort of thing, but I can't think of a "quick and easy" way to do this in Nagios Core.

There are various ways to get a list of all hosts/services in a "DOWN/CRITICAL" state. Some lazy greps can get you there. Step 2 would involve some sort of script that works through your configuration files based on that list (this is slightly less trivial).

For the lazy greps, here's a command that parses status.dat to get a list of all hosts in a "DOWN" state:

Code: Select all

grep 'hoststatus' -A 14 /usr/local/nagios/var/status.dat | grep -B 15 'current_state=1' | grep host_name
In action:

Code: Select all

[root@capra_nag tmp]# grep 'hoststatus' -A 14 /usr/local/nagios/var/status.dat | grep -B 15 'current_state=1' | grep host_name
        host_name=CENESPROD00
And, the inverse, a list of every host in a "UP" state:

Code: Select all

[root@capra_nag tmp]# grep 'hoststatus' -A 14 /usr/local/nagios/var/status.dat | grep -B 15 'current_state=0' | grep host_name
        host_name=CENESPROD01
        host_name=CENESPROD02
        host_name=CENESPROD03
        host_name=COLPROD02
        host_name=COLPROD03
        host_name=ESMARVELPROD00
        host_name=KIBPROD00
        host_name=PRODDATAMART
        host_name=PRODDELPHISCRIPTS
        host_name=PRODHERMES1
        host_name=PRODHERMES2
        host_name=PRODHERMES3
You can apply roughly the same logic to get a list of services in a "non-OK" state:

Code: Select all

[root@capra_nag tmp]# grep 'servicestatus' -A 15 /usr/local/nagios/var/status.dat | grep -B 15 'current_state=[123]' | grep 'host_name\|service_description'
        host_name=CENESPROD00
        service_description=Heap Usage
        host_name=COLPROD02
        service_description=Fail Count
        host_name=COLPROD03
        service_description=Fail Count
        host_name=ESMARVELPROD00
        service_description=Cluster Status
For the question of "how do I identify X objects with Y attributes" generally, parsing status.dat is a really quick-and-easy way to do that.
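If the fixed -A/-B line counts ever feel fragile (they depend on how many lines sit between host_name and current_state in your version's status.dat), the same idea can be sketched by walking each hoststatus block with awk. Treat this as a rough sketch rather than anything official: the function name list_hosts_by_state is made up for illustration, and the default path is simply the one used in the commands above.

```shell
#!/usr/bin/env bash
# list_hosts_by_state STATE [STATUS_FILE]
# Print the host_name of every "hoststatus { ... }" block in status.dat whose
# current_state equals STATE (0 = UP, 1 = DOWN), one name per line.
# Hypothetical helper; the default path is the one used earlier in this thread.
list_hosts_by_state() {
    local state="$1"
    local status_file="${2:-/usr/local/nagios/var/status.dat}"

    awk -v want="$state" '
        # Start of a host block: reset what we know about it.
        /^hoststatus[ \t]*[{]/ { in_block = 1; name = ""; cur = ""; next }

        # Inside a host block, remember the two fields we care about.
        in_block && /^[ \t]*host_name=/     { sub(/^[ \t]*host_name=/, "");     name = $0 }
        in_block && /^[ \t]*current_state=/ { sub(/^[ \t]*current_state=/, ""); cur  = $0 }

        # End of the block: print the name if the state matched.
        in_block && /^[ \t]*[}]/ {
            if (cur == want) print name
            in_block = 0
        }
    ' "$status_file" | sort -u
}
```

Calling `list_hosts_by_state 1` should give the DOWN hosts and `list_hosts_by_state 0` the UP ones. Because it only looks inside hoststatus blocks, service states are never mixed in.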
Former Nagios employee
https://www.mcapra.com/
dwasswa

Re: Remove nodes marked as down

Post by dwasswa »

Thanks @mcapra.

@dlcrites, have you tried what @mcapra suggested?
kyang

Re: Remove nodes marked as down

Post by kyang »

Hey dlcrites, just checking in to see if your issue is resolved?

Are we okay to close this thread? Or did you have any more questions?
dlcrites
Posts: 5
Joined: Fri Jun 16, 2017 2:36 pm

Re: Remove nodes marked as down

Post by dlcrites »

Thanks for the info -- it looks like what I need. Sorry for the delay -- things got way busy on this end because of our peak season, so I just had to deal with having them all hanging around.

I am working through the suggestions at this time. I will reply with my results from each one.

My first one is that I took the initial one-liner, and changed it a bit to come up with a list of hostnames to work with:

Code: Select all

grep 'hoststatus' -A 14 /usr/local/nagios/var/status.dat | grep -B 15 'current_state=1' | grep host_name | cut -d= -f2 | sort -u 1>/tmp/find.out
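With a list like /tmp/find.out in hand, the "Step 2" mentioned earlier in the thread (working through the configuration files) could start by simply locating where each host is referenced. The following is a read-only sketch for a Core-style setup, not a tested removal tool: the function name find_host_definitions and the default directory /usr/local/nagios/etc are assumptions, and it only matches single-name `host_name` directives (comma-separated lists and hostgroup membership are not handled).

```shell
#!/usr/bin/env bash
# find_host_definitions LIST_FILE [CONFIG_DIR]
# For every *.cfg file under CONFIG_DIR, print "file:line: name" for each
# "host_name <name>" directive whose <name> appears in LIST_FILE.
# Nothing is modified; this only reports where the names occur.
# Hypothetical helper; the default CONFIG_DIR is an assumed Core install path.
find_host_definitions() {
    local list_file="$1"
    local config_dir="${2:-/usr/local/nagios/etc}"

    find "$config_dir" -type f -name '*.cfg' -exec awk -v list="$list_file" '
        # Load the wanted host names once per awk invocation.
        BEGIN {
            while ((getline h < list) > 0)
                if (h != "") wanted[h] = 1
        }
        # Exact (non-regex) comparison, so dots in names are not wildcards.
        $1 == "host_name" && ($2 in wanted) {
            print FILENAME ":" FNR ": " $2
        }
    ' {} +
}
```

Running `find_host_definitions /tmp/find.out` would show both the host definitions and any service definitions that name those hosts, which is roughly the set of places that would need editing by hand or by a follow-up script.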
dlcrites
Posts: 5
Joined: Fri Jun 16, 2017 2:36 pm

Re: Remove nodes marked as down

Post by dlcrites »

WARNING ! ! ! ! !
WARNING ! ! ! ! ! This is a highly specific script which will probably NOT
WARNING ! ! ! ! ! work for you! It only worked for me because of some specific
WARNING ! ! ! ! ! things about my situation:
WARNING ! ! ! ! !
WARNING ! ! ! ! ! 1) Every node in question only had the PING service check running.
WARNING ! ! ! ! !
WARNING ! ! ! ! ! 2) Every node in a failure condition needed to be removed.
WARNING ! ! ! ! !
WARNING ! ! ! ! ! This will not work for any other situation!
WARNING ! ! ! ! !
WARNING ! ! ! ! ! I am just providing this as an example of what worked for me, and
WARNING ! ! ! ! ! am not suggesting you use this without significant modifications.
WARNING ! ! ! ! !

Code: Select all

#!/usr/bin/env bash
# ============================================================================
readonly bzero=$(basename $0 .sh)
readonly now=$(date '+%Y%m%d.%H%M%S')
readonly tfile=/tmp/$bzero.$now.out

echo "Processing $bzero"
echo "-- temp file: $tfile"
# ============================================================================

# ============================================================================
# Build a list of the nodes that are down.
# ==============================================
# Only stdout goes to the temp file, so error text is never read as a host name.
grep 'hoststatus' -A 14 /usr/local/nagios/var/status.dat | \
    grep -B 15 'current_state=1' | \
    grep host_name | \
    cut -d= -f2 | \
    sort -u >"$tfile"
# ============================================================================

# ============================================================================
# Remove the PING service and the host.
# ==============================================
readonly me='http://localhost/nagiosxi/api/v1/config'
readonly svc='service_description=PING&applyconfig=1'
readonly hst='applyconfig=1'
readonly apikey='apikey=put-your-api-key-here'
# ==============================================
function cmRemoveService
{
    node=$1
    echo "-- removing PING for $node"
    curl -XDELETE "${me}/service?${apikey}&pretty=1&host_name=${node}&${svc}"
}
# ==============================================
function cmRemoveNode
{
    node=$1
    echo "-- removing NODE $node"
    curl -XDELETE "${me}/host?${apikey}&pretty=1&host_name=${node}&${hst}"
}
# ============================================================================

# ============================================================================
# NOTE: "head -5" limits the run to five hosts for testing.
head -5 "$tfile" | while IFS= read -r line ; do
    echo "Processing: [$line]"
    cmRemoveService "$line"
    cmRemoveNode "$line"

    # give it time to restart...
    sleep 30

    done
head -10 $tfile
rm -f $tfile
# ============================================================================
After tinkering around with this script, I added the "sleep 30" so Nagios had a chance to get restarted and stable before the next iteration. I tried 15 seconds, but that wasn't long enough; neither was 20. So I tried 30, and it never "missed" one. Actually, the only "problem" was that it attempted to remove the one it just removed, so that might not have been a "problem," per se, but I didn't want to risk it.
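Instead of guessing a fixed delay, one option would be to poll status.dat until the removed host no longer shows up, with a timeout as a safety net. This is only a sketch under some assumptions: wait_until_gone is a made-up helper name, and it assumes status.dat is rewritten shortly after the restart, which I have not verified on every setup.

```shell
#!/usr/bin/env bash
# wait_until_gone NODE STATUS_FILE [TIMEOUT_SECONDS]
# Poll STATUS_FILE once per second until no "host_name=NODE" line remains.
# Returns 0 once the host is gone, 1 if TIMEOUT_SECONDS (default 60) elapse first.
# Hypothetical helper for illustration only.
wait_until_gone() {
    local node="$1"
    local status_file="$2"
    local timeout="${3:-60}"
    local waited=0

    # awk does an exact string comparison, so "web1" does not match "web10".
    while awk -v n="$node" '
              { sub(/^[ \t]+/, "") }
              $0 == "host_name=" n { found = 1; exit }
              END { exit !found }
          ' "$status_file" 2>/dev/null
    do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

In the loop above, something like `wait_until_gone "$line" /usr/local/nagios/var/status.dat 60` could then stand in for the `sleep 30`, so fast restarts finish sooner and slow ones are still given up to a minute.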

I thought of trying to remove all of the services in one pass, then doing the applyconfig, then removing all of the hosts, followed by another applyconfig, but since the one-at-a-time process like I started with was working, and I don't need to do this more than once, I decided to just let it go. Yes, with hundreds of hosts in the list, this will take a while -- but nowhere as long as doing this manually.
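For anyone who wants to try the batched variant described above, here is a rough, untested sketch: delete every PING service without applyconfig, then every host, then apply the configuration once. Assumptions to check before using it: the helper names api_call and batch_remove are made up, the `system/applyconfig` endpoint is my understanding of the XI API and should be confirmed in the API docs on your own XI server, and dry_run defaults to 1 so the script only prints the requests it would make.

```shell
#!/usr/bin/env bash
# Batched removal sketch: queue all deletes, then apply the config once.
api_base='http://localhost/nagiosxi/api/v1'   # same base URL as the script above
api_key='put-your-api-key-here'               # placeholder, as in the script above
dry_run=1                                     # 1 = print requests only, 0 = send them

# Send one request, or just print it when dry_run is 1.
api_call() {
    local method="$1" url="$2"
    if [ "$dry_run" = "1" ]; then
        echo "$method $url"
    else
        curl -sS -X "$method" "$url"
    fi
}

# batch_remove LIST_FILE
# LIST_FILE holds one host name per line (e.g. the temp file built earlier).
batch_remove() {
    local list_file="$1" node

    # Pass 1: remove the PING service from every host, no applyconfig yet.
    while IFS= read -r node; do
        [ -n "$node" ] || continue
        api_call DELETE "${api_base}/config/service?apikey=${api_key}&host_name=${node}&service_description=PING"
    done < "$list_file"

    # Pass 2: remove the hosts themselves, still without applyconfig.
    while IFS= read -r node; do
        [ -n "$node" ] || continue
        api_call DELETE "${api_base}/config/host?apikey=${api_key}&host_name=${node}"
    done < "$list_file"

    # One apply at the end. ASSUMPTION: endpoint name, verify in your XI API docs.
    api_call POST "${api_base}/system/applyconfig?apikey=${api_key}"
}
```

With hundreds of hosts this would mean one restart instead of one per host, but since it is untested, a dry run on a short list first seems wise.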

THANKS ! ! ! ! ! !
dlcrites
Posts: 5
Joined: Fri Jun 16, 2017 2:36 pm

Re: Remove nodes marked as down

Post by dlcrites »

Another point is that the "head -5" and such were left over from my testing... I forgot to remove them. Obviously, once this is tested and working, you'd run it against the whole list.

I am satisfied with the results of this thread, so unless someone else had something to add, I'm done.

Thanks muchly,

DL
dwhitfield
Former Nagios Staff
Posts: 4583
Joined: Wed Sep 21, 2016 10:29 am
Location: NoLo, Minneapolis, MN

Re: Remove nodes marked as down

Post by dwhitfield »

dlcrites wrote: After tinkering around with this script, I added the "sleep 30" so Nagios had a chance to get restarted and stable before the next iteration. I tried 15 seconds, but that wasn't long enough; neither was 20. So I tried 30, and it never "missed" one.
It seems like your issue is resolved, and I don't know a ton about the size of the system other than what was added, but those numbers make me wonder...do you have a ramdisk?

The script in https://assets.nagios.com/downloads/nag ... giosXI.pdf isn't going to work because it modifies at least one XI-specific file, but the document runs you through setting up a ramdisk manually, and as long as you skip /usr/local/nagiosxi/html/config.inc.php (and I think that's it), you should be good. There's also an NRDP file, but you could be using that with Core. Anyway, I don't think it should be too hard to use that for a Core setup. If you aren't already doing that, I suspect you'll see some performance improvement.