Nagios reporting host down but they are not down

usitech
Posts: 5
Joined: Fri Oct 22, 2010 4:10 pm

Nagios reporting host down but they are not down

Post by usitech »

I have a brand-new Nagios XI server (I have never used any version of Nagios before this) and I am having trouble understanding why some host down alerts are occurring.
Aside from the initial setup of Nagios XI (fresh install of 2009R1.3E), all configuration additions, changes, etc. have been made through the UI (no CLI modifications).

I have several hosts (not all) that are showing as "Host Down" even though all of their respective services are up. The hosts are of course not actually down and are fully pingable from the Nagios server.

When viewing the host detail I see "CRITICAL - 10.255.244.182: rta nan, lost 100%" for the ones that are showing as down.

When checking the log at /usr/local/nagios/var/nagios.log I see various entries along these lines, and I am not sure whether they are pertinent or indicative of some other misconfiguration:
[1289861795] Warning: Check result queue contained results for host 'FS001D001A-MPLS', but the host could not be found! Perhaps you forgot to define the host in your config files?
[1289861805] Warning: Check result queue contained results for host 'FS001R001-MPLS', but the host could not be found! Perhaps you forgot to define the host in your config files?
[1289861805] Warning: Check result queue contained results for service 'Port 8 Bandwidth' on host 'FS001S001A-MPLS', but the service could not be found! Perhaps you forgot to define the service in your config files?
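
Note that the bracketed number at the start of each nagios.log line is a Unix epoch timestamp, which makes it hard to line the warnings up against configuration changes. A minimal sketch for making it readable, assuming a date command that accepts -d @SECONDS (GNU coreutils does); humanize_nagios_log is a hypothetical helper name, not part of Nagios:

```shell
#!/bin/sh
# Hypothetical helper: rewrite a leading "[epoch]" on each log line as a
# UTC timestamp. Lines that do not start with "[digits]" pass through as-is.
# Assumes a date(1) that understands "-d @SECONDS" (GNU coreutils).
humanize_nagios_log() {
    while IFS= read -r line; do
        case "$line" in
            \[[0-9]*\]*)
                epoch=${line#\[}        # drop the leading "["
                epoch=${epoch%%\]*}     # keep everything before the first "]"
                rest=${line#*\]}        # everything after the first "]"
                printf '[%s]%s\n' \
                    "$(date -u -d "@$epoch" '+%Y-%m-%d %H:%M:%S')" "$rest"
                ;;
            *)
                printf '%s\n' "$line"
                ;;
        esac
    done
}
```

For example, grep FS001S001A-MPLS /usr/local/nagios/var/nagios.log | humanize_nagios_log would print the matching entries with the first field shown as a UTC date, so [1289861795] becomes [2010-11-15 22:56:35].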

How do I troubleshoot this?

I have some more "how do I configure it to do xyz?" type questions related to these hosts, but I'll save those for a more general thread at another time.
mguthrie
Posts: 4380
Joined: Mon Jun 14, 2010 10:21 am

Re: Nagios reporting host down but they are not down

Post by mguthrie »

Would you be able to show us the configurations for the host definitions for these 'down' hosts? I would imagine the error messages in the log are related to the output you're seeing on the host details, but we'll just have to figure out where specifically the problem is. The host definitions can be found in /usr/local/nagios/etc/hosts.
usitech
Posts: 5
Joined: Fri Oct 22, 2010 4:10 pm

Re: Nagios reporting host down but they are not down

Post by usitech »

I obviously sanitized a couple of entries on the FS001R001-MPLS host (IP and SNMP community). The rest is untouched.

Code: Select all

###############################################################################
#
# Host configuration file
#
# Created by: Nagios QL Version 3.0.3
# Date:       2010-11-15 16:56:32
# Version:    Nagios 3.x config file
#
# --- DO NOT EDIT THIS FILE BY HAND --- 
# Nagios QL will overwite all manual settings during the next update
#
###############################################################################

define host {
        host_name                       FS001S001A-MPLS
        use                             xiwizard_switch_host
        address                         10.21.1.101
        parents                         FS001R001-MPLS
        hostgroups                      FTTX-Switches
        max_check_attempts              2
        check_interval                  1
        retry_interval                  1
        contacts                        Justin Krejci
        notification_interval           60
        icon_image                      switch.png
        statusmap_image                 switch.png
        _xiwizard                       switch
        register                        1
        }

###############################################################################
#
# Host configuration file
#
# END OF FILE
#
###############################################################################

Code: Select all

###############################################################################
#
# Host configuration file
#
# Created by: Nagios QL Version 3.0.3
# Date:       2010-11-15 16:56:32
# Version:    Nagios 3.x config file
#
# --- DO NOT EDIT THIS FILE BY HAND --- 
# Nagios QL will overwite all manual settings during the next update
#
###############################################################################

define host {
        host_name                       FS001R001-MPLS
        use                             xiwizard_switch_host
        address                         x.x.x.11
        parents                         FS001D001B-MPLS
        hostgroups                      FTTX-Routers
        check_command                   Check_IOS!x.x.x.11!xxxx!!!!!!
        max_check_attempts              1
        check_interval                  1
        retry_interval                  1
        contacts                        Justin Krejci
        notification_interval           60
        first_notification_delay        0
        notification_options            d,u,r,f,s
        notifications_enabled           1
        icon_image                      switch.png
        statusmap_image                 switch.png
        _xiwizard                       switch
        register                        1
        }

###############################################################################
#
# Host configuration file
#
# END OF FILE
#
###############################################################################

Code: Select all

###############################################################################
#
# Host configuration file
#
# Created by: Nagios QL Version 3.0.3
# Date:       2010-11-15 16:56:32
# Version:    Nagios 3.x config file
#
# --- DO NOT EDIT THIS FILE BY HAND --- 
# Nagios QL will overwite all manual settings during the next update
#
###############################################################################

define host {
        host_name                       FS001D001B-MPLS
        use                             xiwizard_genericnetdevice_host
        address                         10.255.244.182
        parents                         FS001D001A-MPLS
        hostgroups                      Egress
        max_check_attempts              2
        check_interval                  1
        retry_interval                  1
        check_period                    xi_timeperiod_24x7
        contacts                        Justin Krejci
        notification_interval           60
        notification_period             xi_timeperiod_24x7
        icon_image                      snmp.png
        statusmap_image                 snmp.png
        _xiwizard                       snmp
        register                        1
        }

###############################################################################
#
# Host configuration file
#
# END OF FILE
#
###############################################################################
mguthrie
Posts: 4380
Joined: Mon Jun 14, 2010 10:21 am

Re: Nagios reporting host down but they are not down

Post by mguthrie »

I might have you do a little bit of hunting to see if we can narrow this down. What I'd first like to find is the "check_command" being used for each of these hosts. I see the second one you listed has it defined in the host file:
check_command Check_IOS!x.x.x.11!xxxx!!!!!!
But I'm guessing the others have the check_command defined in the template. I'm wondering if there's either some bad output or a misconfiguration in the host check that's taking place. If the output is bad or unexpected, it will probably show up as the host being down.

Try these steps:
1. Find the check_command that's being used for that host.
2. Look up the full definition of that check_command in the Core Config Manager->Commands page.
3. Try your full check command with the host's arguments from the command line to see what kind of output you get. For example:

Code: Select all

cd /usr/local/nagios/libexec
./check_icmp -H 192.168.5.1
OK - 192.168.5.1: rta 2.365ms, lost 0%|rta=2.365ms;200.000;500.000;0; pl=0%;40;80;;
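
For steps 1 and 2, it can help to see how Nagios reads the check_command value: the text before the first "!" is the command name, and each following "!"-separated field becomes $ARG1$, $ARG2$, and so on in the command definition. A small sketch of that split using awk; split_check_command is a hypothetical helper for illustration, not a Nagios tool:

```shell
#!/bin/sh
# Hypothetical helper: split a check_command value such as
# "Check_IOS!x.x.x.11!xxxx" into the command name and its $ARGn$ values.
split_check_command() {
    printf '%s\n' "$1" | awk -F'!' '{
        # Field 1 is the command name as defined on the Commands page.
        print "command_name=" $1
        # Every later field maps to $ARG1$, $ARG2$, ... in command_line.
        for (i = 2; i <= NF; i++) print "ARG" (i - 1) "=" $i
    }'
}
```

Running split_check_command 'Check_IOS!x.x.x.11!xxxx!!!!!!' prints command_name=Check_IOS, ARG1=x.x.x.11, ARG2=xxxx, and empty values for ARG3 through ARG8. Substituting those values into the command definition gives the exact line to try from the shell in step 3.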
usitech
Posts: 5
Joined: Fri Oct 22, 2010 4:10 pm

Re: Nagios reporting host down but they are not down

Post by usitech »

The others do not have the IOS check command and are not even Cisco IOS devices.

Under Host Management > Common Settings there is a section that shows:
Command view: check_snmp_cisco_ios -h $ARG1$ -c $ARG2$
$ARG1$ field -- I've filled in the host IP address
$ARG2$ field -- I've filled in the SNMP community

Code: Select all

[root@nagios libexec]# ./check_snmp_cisco_ios -h x.x.x.11 -c xxxx
12.2(54)SG
[root@nagios libexec]# 
The other hosts do not, or at least should not, have this check command at all. I manually added this check to this particular host.

Here are some additional nagios.log entries for this particular host and the rest. The "plugin may be missing" seems like a clue but I don't know how to identify what is returning a "127" code.

Code: Select all

[1289973600] CURRENT HOST STATE: FS001R001-MPLS;DOWN;HARD;1;(Return code of 127 is out of bounds - plugin may be missing)
[1289973600] CURRENT SERVICE STATE: FS001R001-MPLS;Ping;OK;HARD;1;OK - 216.17.70.11: rta 6.944ms, lost 0%
[1289973600] CURRENT SERVICE STATE: FS001R001-MPLS;Port 2 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001R001-MPLS;Port 34 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001R001-MPLS;Port 44 Bandwidth;OK;HARD;1;OK - Current BW in: .01Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001R001-MPLS;Port 58 Bandwidth;OK;HARD;1;OK - Current BW in: .01Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001R001-MPLS;Port 59 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001R001-MPLS;Port 60 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001R001-MPLS;Port 60 Status;OK;HARD;1;OK: Interface Vlan900 (index 60) is up.
[1289973992] Warning: The check of service 'Port 60 Status' on host 'FS001R001-MPLS' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the service...
[1289973995] Warning: Check result queue contained results for service 'Port 60 Status' on host 'FS001R001-MPLS', but the service could not be found!  Perhaps you forgot to define the service in your config files?
[1289974712] Warning: The check of service 'Port 60 Status' on host 'FS001R001-MPLS' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the service...
[1289974715] Warning: Check result queue contained results for service 'Port 60 Status' on host 'FS001R001-MPLS', but the service could not be found!  Perhaps you forgot to define the service in your config files?
[1289975432] Warning: The check of service 'Port 60 Status' on host 'FS001R001-MPLS' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the service...
[1289975435] Warning: Check result queue contained results for service 'Port 60 Status' on host 'FS001R001-MPLS', but the service could not be found!  Perhaps you forgot to define the service in your config files?
[1289976152] Warning: The check of service 'Port 60 Status' on host 'FS001R001-MPLS' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the service...
[1289976155] Warning: Check result queue contained results for service 'Port 60 Status' on host 'FS001R001-MPLS', but the service could not be found!  Perhaps you forgot to define the service in your config files?
Here are additional logs for FS001S001A-MPLS:

Code: Select all

[1289973600] CURRENT HOST STATE: FS001S001A-MPLS;DOWN;HARD;2;CRITICAL - 10.21.1.101: rta nan, lost 100%
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Egress Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Egress Status;OK;HARD;1;OK: Interface Ethernet Port on unit 1, port 28 (index 28) is up.
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Ping;OK;HARD;1;OK - 10.21.1.101: rta 5.002ms, lost 0%
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 1 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 10 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 11 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 12 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 13 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 14 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 15 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 16 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 17 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 18 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 19 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 2 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 20 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 21 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 22 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 23 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 24 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 25 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 26 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 27 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 3 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 4 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 5 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 6 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 7 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 8 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 9 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289974112] Warning: The check of service 'Port 8 Bandwidth' on host 'FS001S001A-MPLS' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the service...
[1289974115] Warning: Check result queue contained results for service 'Port 8 Bandwidth' on host 'FS001S001A-MPLS', but the service could not be found!  Perhaps you forgot to define the service in your config files?
[1289974832] Warning: The check of service 'Port 8 Bandwidth' on host 'FS001S001A-MPLS' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the service...
[1289974835] Warning: Check result queue contained results for service 'Port 8 Bandwidth' on host 'FS001S001A-MPLS', but the service could not be found!  Perhaps you forgot to define the service in your config files?
[1289975552] Warning: The check of service 'Port 8 Bandwidth' on host 'FS001S001A-MPLS' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the service...
[1289975555] Warning: Check result queue contained results for service 'Port 8 Bandwidth' on host 'FS001S001A-MPLS', but the service could not be found!  Perhaps you forgot to define the service in your config files?
[1289976272] Warning: The check of service 'Port 8 Bandwidth' on host 'FS001S001A-MPLS' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the service...
Here is the third one, FS001D001B-MPLS:

Code: Select all

[root@nagios var]# grep FS001D001B-MPLS nagios.log | head
[1289973600] CURRENT HOST STATE: FS001D001B-MPLS;DOWN;HARD;1;CRITICAL - x.x.x.182: rta nan, lost 100%
[1289973600] CURRENT SERVICE STATE: FS001D001B-MPLS;CRC Errors;OK;HARD;1;Errors OK - 0 number
[1289973600] CURRENT SERVICE STATE: FS001D001B-MPLS;Eth1 Traffic In;OK;HARD;1;Traffic OK - 2878901419 bytes in
[1289973600] CURRENT SERVICE STATE: FS001D001B-MPLS;Eth1 Traffic Out;OK;HARD;1;Traffic OK - 1608813497 bytes out
[1289973600] CURRENT SERVICE STATE: FS001D001B-MPLS;RSL;OK;HARD;1;Status OK - 532 RSL
[1289973600] CURRENT SERVICE STATE: FS001D001B-MPLS;Uptime;OK;HARD;1;SNMP OK - Timeticks: (160295486) 18 days, 13:15:54.86
[root@nagios var]#
For comparison, here is FS001D001A-MPLS, a host essentially identical to FS001D001B-MPLS that is working as expected:

Code: Select all

[root@nagios var]# grep FS001D001A-MPLS nagios.log | head
[1289973600] CURRENT HOST STATE: FS001D001A-MPLS;UP;HARD;1;OK - x.x.x.181: rta 8.627ms, lost 20%
[1289973600] CURRENT SERVICE STATE: FS001D001A-MPLS;CRC Errors;OK;HARD;1;Errors OK - 0 number
[1289973600] CURRENT SERVICE STATE: FS001D001A-MPLS;Eth1 Traffic In;OK;HARD;1;Traffic OK - 996027575 bytes in
[1289973600] CURRENT SERVICE STATE: FS001D001A-MPLS;Eth1 Traffic Out;OK;HARD;1;Traffic OK - 2898976803 bytes out
[1289973600] CURRENT SERVICE STATE: FS001D001A-MPLS;Ping;OK;HARD;1;OK - 10.255.244.181: rta 6.326ms, lost 0%
[1289973600] CURRENT SERVICE STATE: FS001D001A-MPLS;RSL;OK;HARD;1;Status OK - 532 RSL
[1289973600] CURRENT SERVICE STATE: FS001D001A-MPLS;Uptime;OK;HARD;1;SNMP OK - Timeticks: (160296445) 18 days, 13:16:04.45
[1289974052] Warning: The check of host 'FS001D001A-MPLS' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the host...
[1289974055] Warning: Check result queue contained results for host 'FS001D001A-MPLS', but the host could not be found!  Perhaps you forgot to define the host in your config files?
[1289974172] Warning: The check of service 'Eth1 Traffic In' on host 'FS001D001A-MPLS' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the service...
mguthrie
Posts: 4380
Joined: Mon Jun 14, 2010 10:21 am

Re: Nagios reporting host down but they are not down

Post by mguthrie »

If you have orphaned checks, you might have multiple instances of Nagios running on your machine. Try:

killall -9 nagios

from the command line, then start the Nagios monitoring engine from the web interface again and see if that fixes it. We've had this report before, though I'm not sure if it's related. http://support.nagios.com/wiki/index.ph ... g_Orphaned
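
Before reaching for killall, one way to test the duplicate-daemon theory is to count how many nagios processes have init (PID 1) as their parent. Nagios forks short-lived children to run checks, so seeing several nagios processes is not unusual by itself; the assumption in this sketch is that only a daemon is parented by PID 1. count_nagios_daemons is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: read "PID PPID COMMAND" lines on stdin and count the
# processes named nagios (bare name or full path) whose parent is PID 1.
# Assumption: check workers are children of the daemon, not of PID 1.
count_nagios_daemons() {
    awk '$2 == 1 && $3 ~ /(^|\/)nagios$/ { n++ } END { print n + 0 }'
}
```

Usage would look like: ps -eo pid,ppid,comm | count_nagios_daemons. A result above 1 would suggest that more than one daemon is running and writing check results into the same queue.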
usitech
Posts: 5
Joined: Fri Oct 22, 2010 4:10 pm

Re: Nagios reporting host down but they are not down

Post by usitech »

Restarting Nagios appears to have cleared up one of the "host down" issues but two are still showing down.
FS001D001B-MPLS is now "up"
FS001R001-MPLS and FS001S001A-MPLS are still "down"

The log is much quieter now too. I am seeing this line repeated every 10 to 20 seconds, and nothing else is showing in the log anymore.

Code: Select all

[root@nagios ~]# tail -f /usr/local/nagios/var/nagios.log
[1290094960] Warning: Return code of 127 for check of host 'FS001R001-MPLS' was out of bounds. Make sure the plugin you're trying to run actually exists.
[1290094980] Warning: Return code of 127 for check of host 'FS001R001-MPLS' was out of bounds. Make sure the plugin you're trying to run actually exists.
[1290095000] Warning: Return code of 127 for check of host 'FS001R001-MPLS' was out of bounds. Make sure the plugin you're trying to run actually exists.
[1290095020] Warning: Return code of 127 for check of host 'FS001R001-MPLS' was out of bounds. Make sure the plugin you're trying to run actually exists.
[1290095030] Warning: Return code of 127 for check of host 'FS001R001-MPLS' was out of bounds. Make sure the plugin you're trying to run actually exists.
[1290095050] Warning: Return code of 127 for check of host 'FS001R001-MPLS' was out of bounds. Make sure the plugin you're trying to run actually exists.
[1290095070] Warning: Return code of 127 for check of host 'FS001R001-MPLS' was out of bounds. Make sure the plugin you're trying to run actually exists.
[1290095090] Warning: Return code of 127 for check of host 'FS001R001-MPLS' was out of bounds. Make sure the plugin you're trying to run actually exists.
[1290095100] Warning: Return code of 127 for check of host 'FS001R001-MPLS' was out of bounds. Make sure the plugin you're trying to run actually exists.
[1290095120] Warning: Return code of 127 for check of host 'FS001R001-MPLS' was out of bounds. Make sure the plugin you're trying to run actually exists.
[1290095140] Warning: Return code of 127 for check of host 'FS001R001-MPLS' was out of bounds. Make sure the plugin you're trying to run actually exists.
[1290095160] Warning: Return code of 127 for check of host 'FS001R001-MPLS' was out of bounds. Make sure the plugin you're trying to run actually exists.
[1290095170] Warning: Return code of 127 for check of host 'FS001R001-MPLS' was out of bounds. Make sure the plugin you're trying to run actually exists.

Would it be worth deleting the hosts and re-adding them?
mguthrie
Posts: 4380
Joined: Mon Jun 14, 2010 10:21 am

Re: Nagios reporting host down but they are not down

Post by mguthrie »

You could try doing it that way. From what I'm seeing, the problem lies in the check_command for that host. It's getting some sort of bad output from the check, perhaps an error message. Can you try running that actual check with the parameters for those hosts from the command line and see what you get for output? Nagios expects every check to return a code of 0-3 followed by a string of text. If something else comes back, you'll get an error like that.
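
As a rough sketch of what to look for when running the check by hand: the wrapper below runs any command line, shows its output, and then labels the exit status the way the plugin convention reads it (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function names describe_exit_code and run_check are made up for this example and are not part of Nagios:

```shell
#!/bin/sh
# Hypothetical helper: label a plugin exit status.
# 0-3 are the only values Nagios accepts; anything else is "out of bounds".
describe_exit_code() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        3) echo "UNKNOWN" ;;
        126) echo "out of bounds (command found but not executable)" ;;
        127) echo "out of bounds (command not found)" ;;
        *) echo "out of bounds" ;;
    esac
}

# Hypothetical helper: run a command line exactly as given, let its output
# through, then print the exit status with its label.
run_check() {
    "$@"
    rc=$?
    echo "exit status $rc: $(describe_exit_code "$rc")"
    return "$rc"
}
```

For example: cd /usr/local/nagios/libexec, then run_check ./check_icmp -H 10.21.1.101. An exit status of 127 is what a shell returns when it cannot find the command at all, so for FS001R001-MPLS it may be worth confirming that the command_line in the command definition points at the plugin with a full path (commonly via the $USER1$ macro) rather than a bare file name.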

Just out of curiosity, does anything change when you use the "check-host-alive" command as the host check instead of your current check?