Nagios reporting host down but they are not down
I have a brand-new Nagios XI server (I have never used any version of Nagios before this) and I am having trouble understanding why some host down alerts are happening.
Aside from the initial setup of Nagios XI (fresh install, 2009R1.3E), all configuration additions, changes, etc. have been made through the UI (no CLI modifications).
Several hosts (not all) are showing as "Host Down" even though all of their services are up. The hosts are of course not actually down and are fully pingable from the Nagios server.
When viewing the host detail I see "CRITICAL - 10.255.244.182: rta nan, lost 100%" for the ones that are showing as down.
When checking the log at /usr/local/nagios/var/nagios.log I see various entries along these lines and I am not sure if they are pertinent or maybe otherwise indicative of some other misconfiguration.
[1289861795] Warning: Check result queue contained results for host 'FS001D001A-MPLS', but the host could not be found! Perhaps you forgot to define the host in your config files?
[1289861805] Warning: Check result queue contained results for host 'FS001R001-MPLS', but the host could not be found! Perhaps you forgot to define the host in your config files?
[1289861805] Warning: Check result queue contained results for service 'Port 8 Bandwidth' on host 'FS001S001A-MPLS', but the service could not be found! Perhaps you forgot to define the service in your config files?
How do I troubleshoot this?
I have some more "how do I configure it to do xyz?" type questions related to these hosts, but I'll save those for another, more general thread at another time.
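The bracketed numbers at the start of each nagios.log line are Unix epoch seconds, so converting them can make it easier to line the warnings up with config changes or restarts. A minimal sketch, assuming GNU `date` (the `-d @SECONDS` form is a GNU extension); `epoch_to_utc` is a hypothetical helper name, not something Nagios ships:

```shell
#!/bin/sh
# Hypothetical helper: convert a nagios.log epoch timestamp to UTC.
# Assumes GNU date; BSD date would need `date -r SECONDS` instead.
epoch_to_utc() {
    date -u -d "@$1" '+%Y-%m-%d %H:%M:%S'
}

# Timestamp taken from the first warning quoted above.
epoch_to_utc 1289861795
```

The same helper could be applied to any of the other timestamps in the log excerpts.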
Re: Nagios reporting host down but they are not down
Would you be able to show us the configurations for the host definitions for these 'down' hosts? I would imagine the error messages in the log are related to the output you're seeing on the host details, but we'll just have to figure out where specifically the problem is. The host definitions can be found in /usr/local/nagios/etc/hosts.
Re: Nagios reporting host down but they are not down
I obviously sanitized a couple of entries on the FS001R001-MPLS host (IP and SNMP community). The rest is untouched.
Code:
###############################################################################
#
# Host configuration file
#
# Created by: Nagios QL Version 3.0.3
# Date: 2010-11-15 16:56:32
# Version: Nagios 3.x config file
#
# --- DO NOT EDIT THIS FILE BY HAND ---
# Nagios QL will overwite all manual settings during the next update
#
###############################################################################
define host {
    host_name               FS001S001A-MPLS
    use                     xiwizard_switch_host
    address                 10.21.1.101
    parents                 FS001R001-MPLS
    hostgroups              FTTX-Switches
    max_check_attempts      2
    check_interval          1
    retry_interval          1
    contacts                Justin Krejci
    notification_interval   60
    icon_image              switch.png
    statusmap_image         switch.png
    _xiwizard               switch
    register                1
}
###############################################################################
#
# Host configuration file
#
# END OF FILE
#
###############################################################################
Code:
###############################################################################
#
# Host configuration file
#
# Created by: Nagios QL Version 3.0.3
# Date: 2010-11-15 16:56:32
# Version: Nagios 3.x config file
#
# --- DO NOT EDIT THIS FILE BY HAND ---
# Nagios QL will overwite all manual settings during the next update
#
###############################################################################
define host {
    host_name                 FS001R001-MPLS
    use                       xiwizard_switch_host
    address                   x.x.x.11
    parents                   FS001D001B-MPLS
    hostgroups                FTTX-Routers
    check_command             Check_IOS!x.x.x.11!xxxx!!!!!!
    max_check_attempts        1
    check_interval            1
    retry_interval            1
    contacts                  Justin Krejci
    notification_interval     60
    first_notification_delay  0
    notification_options      d,u,r,f,s
    notifications_enabled     1
    icon_image                switch.png
    statusmap_image           switch.png
    _xiwizard                 switch
    register                  1
}
###############################################################################
#
# Host configuration file
#
# END OF FILE
#
###############################################################################
Code:
###############################################################################
#
# Host configuration file
#
# Created by: Nagios QL Version 3.0.3
# Date: 2010-11-15 16:56:32
# Version: Nagios 3.x config file
#
# --- DO NOT EDIT THIS FILE BY HAND ---
# Nagios QL will overwite all manual settings during the next update
#
###############################################################################
define host {
    host_name               FS001D001B-MPLS
    use                     xiwizard_genericnetdevice_host
    address                 10.255.244.182
    parents                 FS001D001A-MPLS
    hostgroups              Egress
    max_check_attempts      2
    check_interval          1
    retry_interval          1
    check_period            xi_timeperiod_24x7
    contacts                Justin Krejci
    notification_interval   60
    notification_period     xi_timeperiod_24x7
    icon_image              snmp.png
    statusmap_image         snmp.png
    _xiwizard               snmp
    register                1
}
###############################################################################
#
# Host configuration file
#
# END OF FILE
#
###############################################################################
Re: Nagios reporting host down but they are not down
I might have you do a little bit of hunting to see if we can narrow this down. What I'd first like to find is the "check_command" being used for each of these hosts. I see the second one you listed has it defined in the host file:
check_command Check_IOS!x.x.x.11!xxxx!!!!!!
But I'm guessing the others have the check_command defined in the template. I'm wondering if there's either some bad output, or a misconfig with the host check that's taking place. If the output is bad, or unexpected, it will probably show up as the host being down.
Try these steps.
1. Find the check_command that's being used for that host
2. Look up the full definition of that check_command in the Core Config Manager->Commands page.
3. Try your full check command with the host's arguments from the command line to see what kind of output you get. For example:
Code:
cd /usr/local/nagios/libexec
./check_icmp -H 192.168.5.1
OK - 192.168.5.1: rta 2.365ms, lost 0%|rta=2.365ms;200.000;500.000;0; pl=0%;40;80;;
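Since Nagios decides the state from the plugin's exit status as well as its text, it can help to print the exit status explicitly when running a check by hand. A minimal sketch; `run_check` is a hypothetical helper name, and the `check_icmp` invocation in the comment is the one from the example above:

```shell
#!/bin/sh
# Hypothetical helper: run a plugin command line, then print its output
# followed by the exit status that Nagios would see.
run_check() {
    output=$("$@" 2>&1)
    rc=$?
    printf '%s\n' "$output"
    printf 'exit code: %s\n' "$rc"
}

# Real usage would look like:
#   cd /usr/local/nagios/libexec && run_check ./check_icmp -H 192.168.5.1
# Stand-in command so the sketch runs anywhere; it mimics a CRITICAL result.
run_check sh -c 'echo "CRITICAL - example"; exit 2'
```

It may also be worth running the check as the user the monitoring engine runs as (typically `nagios`) rather than root, since permissions and environment can differ.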
Re: Nagios reporting host down but they are not down
The others do not have the IOS check command and are not even Cisco IOS devices.
In Host Management > Common Settings there is a section that shows:
Command view check_snmp_cisco_ios -h $ARG1$ -c $ARG2$
$ARG1$ field -- I've filled in the host IP address
$ARG2$ field -- I've filled in the SNMP community
Code:
[root@nagios libexec]# ./check_snmp_cisco_ios -h x.x.x.11 -c xxxx
12.2(54)SG
[root@nagios libexec]#
The other hosts do not, or at least should not, have this check command at all. I manually added this check to this particular host.
Here are some additional nagios.log entries for this particular host and the rest. The "plugin may be missing" message seems like a clue, but I don't know how to identify what is returning a "127" code.
The excerpts are, in order: this host (FS001R001-MPLS); FS001S001A-MPLS; the third one, FS001D001B-MPLS; and, for comparison, FS001D001A-MPLS, which is essentially identical to FS001D001B-MPLS but is working as expected.
Code:
[1289973600] CURRENT HOST STATE: FS001R001-MPLS;DOWN;HARD;1;(Return code of 127 is out of bounds - plugin may be missing)
[1289973600] CURRENT SERVICE STATE: FS001R001-MPLS;Ping;OK;HARD;1;OK - 216.17.70.11: rta 6.944ms, lost 0%
[1289973600] CURRENT SERVICE STATE: FS001R001-MPLS;Port 2 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001R001-MPLS;Port 34 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001R001-MPLS;Port 44 Bandwidth;OK;HARD;1;OK - Current BW in: .01Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001R001-MPLS;Port 58 Bandwidth;OK;HARD;1;OK - Current BW in: .01Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001R001-MPLS;Port 59 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001R001-MPLS;Port 60 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001R001-MPLS;Port 60 Status;OK;HARD;1;OK: Interface Vlan900 (index 60) is up.
[1289973992] Warning: The check of service 'Port 60 Status' on host 'FS001R001-MPLS' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service...
[1289973995] Warning: Check result queue contained results for service 'Port 60 Status' on host 'FS001R001-MPLS', but the service could not be found! Perhaps you forgot to define the service in your config files?
[1289974712] Warning: The check of service 'Port 60 Status' on host 'FS001R001-MPLS' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service...
[1289974715] Warning: Check result queue contained results for service 'Port 60 Status' on host 'FS001R001-MPLS', but the service could not be found! Perhaps you forgot to define the service in your config files?
[1289975432] Warning: The check of service 'Port 60 Status' on host 'FS001R001-MPLS' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service...
[1289975435] Warning: Check result queue contained results for service 'Port 60 Status' on host 'FS001R001-MPLS', but the service could not be found! Perhaps you forgot to define the service in your config files?
[1289976152] Warning: The check of service 'Port 60 Status' on host 'FS001R001-MPLS' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service...
[1289976155] Warning: Check result queue contained results for service 'Port 60 Status' on host 'FS001R001-MPLS', but the service could not be found! Perhaps you forgot to define the service in your config files?
Code:
[1289973600] CURRENT HOST STATE: FS001S001A-MPLS;DOWN;HARD;2;CRITICAL - 10.21.1.101: rta nan, lost 100%
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Egress Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Egress Status;OK;HARD;1;OK: Interface Ethernet Port on unit 1, port 28 (index 28) is up.
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Ping;OK;HARD;1;OK - 10.21.1.101: rta 5.002ms, lost 0%
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 1 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 10 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 11 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 12 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 13 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 14 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 15 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 16 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 17 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 18 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 19 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 2 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 20 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 21 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 22 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 23 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 24 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 25 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 26 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 27 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 3 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 4 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 5 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 6 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 7 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 8 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289973600] CURRENT SERVICE STATE: FS001S001A-MPLS;Port 9 Bandwidth;OK;HARD;1;OK - Current BW in: 0Mbps Out: 0Mbps
[1289974112] Warning: The check of service 'Port 8 Bandwidth' on host 'FS001S001A-MPLS' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service...
[1289974115] Warning: Check result queue contained results for service 'Port 8 Bandwidth' on host 'FS001S001A-MPLS', but the service could not be found! Perhaps you forgot to define the service in your config files?
[1289974832] Warning: The check of service 'Port 8 Bandwidth' on host 'FS001S001A-MPLS' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service...
[1289974835] Warning: Check result queue contained results for service 'Port 8 Bandwidth' on host 'FS001S001A-MPLS', but the service could not be found! Perhaps you forgot to define the service in your config files?
[1289975552] Warning: The check of service 'Port 8 Bandwidth' on host 'FS001S001A-MPLS' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service...
[1289975555] Warning: Check result queue contained results for service 'Port 8 Bandwidth' on host 'FS001S001A-MPLS', but the service could not be found! Perhaps you forgot to define the service in your config files?
[1289976272] Warning: The check of service 'Port 8 Bandwidth' on host 'FS001S001A-MPLS' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service...
Code:
[root@nagios var]# grep FS001D001B-MPLS nagios.log | head
[1289973600] CURRENT HOST STATE: FS001D001B-MPLS;DOWN;HARD;1;CRITICAL - x.x.x.182: rta nan, lost 100%
[1289973600] CURRENT SERVICE STATE: FS001D001B-MPLS;CRC Errors;OK;HARD;1;Errors OK - 0 number
[1289973600] CURRENT SERVICE STATE: FS001D001B-MPLS;Eth1 Traffic In;OK;HARD;1;Traffic OK - 2878901419 bytes in
[1289973600] CURRENT SERVICE STATE: FS001D001B-MPLS;Eth1 Traffic Out;OK;HARD;1;Traffic OK - 1608813497 bytes out
[1289973600] CURRENT SERVICE STATE: FS001D001B-MPLS;RSL;OK;HARD;1;Status OK - 532 RSL
[1289973600] CURRENT SERVICE STATE: FS001D001B-MPLS;Uptime;OK;HARD;1;SNMP OK - Timeticks: (160295486) 18 days, 13:15:54.86
[root@nagios var]#
Code:
[root@nagios var]# grep FS001D001A-MPLS nagios.log | head
[1289973600] CURRENT HOST STATE: FS001D001A-MPLS;UP;HARD;1;OK - x.x.x.181: rta 8.627ms, lost 20%
[1289973600] CURRENT SERVICE STATE: FS001D001A-MPLS;CRC Errors;OK;HARD;1;Errors OK - 0 number
[1289973600] CURRENT SERVICE STATE: FS001D001A-MPLS;Eth1 Traffic In;OK;HARD;1;Traffic OK - 996027575 bytes in
[1289973600] CURRENT SERVICE STATE: FS001D001A-MPLS;Eth1 Traffic Out;OK;HARD;1;Traffic OK - 2898976803 bytes out
[1289973600] CURRENT SERVICE STATE: FS001D001A-MPLS;Ping;OK;HARD;1;OK - 10.255.244.181: rta 6.326ms, lost 0%
[1289973600] CURRENT SERVICE STATE: FS001D001A-MPLS;RSL;OK;HARD;1;Status OK - 532 RSL
[1289973600] CURRENT SERVICE STATE: FS001D001A-MPLS;Uptime;OK;HARD;1;SNMP OK - Timeticks: (160296445) 18 days, 13:16:04.45
[1289974052] Warning: The check of host 'FS001D001A-MPLS' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the host...
[1289974055] Warning: Check result queue contained results for host 'FS001D001A-MPLS', but the host could not be found! Perhaps you forgot to define the host in your config files?
[1289974172] Warning: The check of service 'Eth1 Traffic In' on host 'FS001D001A-MPLS' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service...
Re: Nagios reporting host down but they are not down
If you have orphaned checks, you might have multiple instances of nagios running on your machine. Try:
killall -9 nagios
From the command line, and then start the Nagios monitoring engine from the web interface again and see if that fixes it. We've had this report before, but I'm not sure if it's related. http://support.nagios.com/wiki/index.ph ... g_Orphaned
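Before killing anything, it may be worth confirming that more than one Nagios daemon is actually running. Nagios forks short-lived children to run checks, so a raw process count can mislead; one rough heuristic is to count only `nagios` processes whose parent is PID 1. A sketch under that assumption (`count_top_level` is a hypothetical helper, and the heuristic would not hold on systems where daemons are re-parented to something other than PID 1):

```shell
#!/bin/sh
# Hypothetical helper: read "PID PPID COMMAND" rows on stdin and count
# processes named $1 whose parent is PID 1 (likely separate daemon instances).
count_top_level() {
    awk -v name="$1" '$3 == name && $2 == 1 { n++ } END { print n + 0 }'
}

# Live usage; the trailing "=" on each column suppresses the ps header line.
ps -eo pid=,ppid=,comm= | count_top_level nagios
```

A result greater than 1 would be consistent with the multiple-instance theory; 1 would suggest looking elsewhere.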
Re: Nagios reporting host down but they are not down
Restarting Nagios appears to have cleared up one of the "host down" issues but two are still showing down.
FS001D001B-MPLS is now "up"
FS001R001-MPLS and FS001S001A-MPLS are still "down"
The log is much quieter now too. I am seeing this line get repeated every 20 seconds and nothing else is showing in the log anymore.
Code:
[root@nagios ~]# tail -f /usr/local/nagios/var/nagios.log
[1290094960] Warning: Return code of 127 for check of host 'FS001R001-MPLS' was out of bounds. Make sure the plugin you're trying to run actually exists.
[1290094980] Warning: Return code of 127 for check of host 'FS001R001-MPLS' was out of bounds. Make sure the plugin you're trying to run actually exists.
[1290095000] Warning: Return code of 127 for check of host 'FS001R001-MPLS' was out of bounds. Make sure the plugin you're trying to run actually exists.
[1290095020] Warning: Return code of 127 for check of host 'FS001R001-MPLS' was out of bounds. Make sure the plugin you're trying to run actually exists.
[1290095030] Warning: Return code of 127 for check of host 'FS001R001-MPLS' was out of bounds. Make sure the plugin you're trying to run actually exists.
[1290095050] Warning: Return code of 127 for check of host 'FS001R001-MPLS' was out of bounds. Make sure the plugin you're trying to run actually exists.
[1290095070] Warning: Return code of 127 for check of host 'FS001R001-MPLS' was out of bounds. Make sure the plugin you're trying to run actually exists.
[1290095090] Warning: Return code of 127 for check of host 'FS001R001-MPLS' was out of bounds. Make sure the plugin you're trying to run actually exists.
[1290095100] Warning: Return code of 127 for check of host 'FS001R001-MPLS' was out of bounds. Make sure the plugin you're trying to run actually exists.
[1290095120] Warning: Return code of 127 for check of host 'FS001R001-MPLS' was out of bounds. Make sure the plugin you're trying to run actually exists.
[1290095140] Warning: Return code of 127 for check of host 'FS001R001-MPLS' was out of bounds. Make sure the plugin you're trying to run actually exists.
[1290095160] Warning: Return code of 127 for check of host 'FS001R001-MPLS' was out of bounds. Make sure the plugin you're trying to run actually exists.
[1290095170] Warning: Return code of 127 for check of host 'FS001R001-MPLS' was out of bounds. Make sure the plugin you're trying to run actually exists.
Would it be worth deleting the hosts and re-adding them?
Re: Nagios reporting host down but they are not down
You could try doing it that way. From what I'm seeing the problem lies in the check_command for that host. It's getting some sort of bad output from the check, perhaps an error message or something. Can you try running that actual check with the parameters for those hosts from the command line and see what you get for output? All checks are looking for a return code of 0-3, and then a string of text. If something else comes back you'll get an error like that.
Just out of curiosity, do you get a change when you use a "check-host-alive" command as the host check instead of your current check?
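To make the 0-3 rule concrete: plugins are expected to exit with 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). Exit status 127 is what a POSIX shell returns when the command it was asked to run cannot be found, which would be consistent with the "plugin may be missing" wording in the log. A small sketch; `explain_rc` is a hypothetical helper and the missing plugin name is made up:

```shell
#!/bin/sh
# Hypothetical helper: translate an exit status into the meaning Nagios
# (or the shell) attaches to it.
explain_rc() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        3) echo "UNKNOWN" ;;
        126) echo "out of bounds: command found but not executable" ;;
        127) echo "out of bounds: command not found" ;;
        *) echo "out of bounds" ;;
    esac
}

# Ask a shell to run a plugin name that does not exist, then explain the status.
sh -c 'check_plugin_that_does_not_exist' 2>/dev/null
explain_rc $?
```

The command view quoted earlier in the thread shows `check_snmp_cisco_ios -h $ARG1$ -c $ARG2$` with no directory prefix, so one thing worth checking is whether that command definition needs the plugin directory in front of it (stock Nagios command definitions usually begin with `$USER1$/`).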