Nagios UI is saying something is down when it is not
Posted: Thu Mar 12, 2020 9:38 am
by Alan
Hello, I have two APC units and a camera that Nagios is monitoring. The Nagios UI shows their status as down, and the Host State Information page shows Host Status as down. I have seen this before, but after a day or so the status changed back to up, even though the devices were never actually down. I am not sure whether I set something incorrectly in my config file. I have attached several screenshots to show what I am seeing.
Re: Nagios UI is saying something is down when it is not
Posted: Thu Mar 12, 2020 3:25 pm
by scottwilkerson
Can you share the configuration you have for the APC1 host and any underlying templates and commands?
Re: Nagios UI is saying something is down when it is not
Posted: Fri Mar 13, 2020 10:10 am
by Alan
One thing I just noticed is that the host definition uses generic-switch while the service definitions use generic-service. I am not sure if this could be the issue?
Code: Select all
define host{
    use generic-switch
    host_name APC1
    alias APC1
    address 172.17.20.100
    hostgroups DataCenter
}
define service{
    use generic-service
    host_name APC1
    service_description PING
    check_command check_ping!200.0,20%!600.0,60%
    normal_check_interval 5
    retry_check_interval 1
}
define service {
    use generic-service ; Inherit values from a template
    host_name APC1
    service_description Air Filter Run Hours
    check_command check_snmp!-C CMPublic -o enterprises.318.1.1.13.3.4.1.3.1.0
}
define service {
    use generic-service ; Inherit values from a template
    host_name APC1
    service_description Air Flow
    freshness_threshold 0
    check_command check_snmp!-C CMPublic -o .1.3.6.1.4.1.318.1.1.13.3.4.1.2.4.0
}
define service {
    use generic-service ; Inherit values from a template
    host_name APC1
    service_description Compressor Status
    check_command check_snmp!-C CMPublic -o .1.3.6.1.4.1.318.1.1.13.3.4.1.2.32.0
}
define service {
    use generic-service ; Inherit values from a template
    host_name APC1
    service_description Rack Inlet Temp
    check_command check_snmp!-C CMPublic -o .1.3.6.1.4.1.318.1.1.13.3.4.1.2.6.0
}
Code: Select all
###############################################################################
# TEMPLATES.CFG - SAMPLE OBJECT TEMPLATES
#
#
# NOTES: This config file provides you with some example object definition
# templates that are refered by other host, service, contact, etc.
# definitions in other config files.
#
# You don't need to keep these definitions in a separate file from your
# other object definitions. This has been done just to make things
# easier to understand.
#
###############################################################################
# Generic service definition template - This is NOT a real service, just a template!
define service{
    name generic-service ; The 'name' of this service template
    active_checks_enabled 1 ; Active service checks are enabled
    passive_checks_enabled 1 ; Passive service checks are enabled/accepted
    parallelize_check 1 ; Active service checks should be parallelized (disabling this can lead to major performance problems)
    obsess_over_service 1 ; We should obsess over this service (if necessary)
    check_freshness 0 ; Default is to NOT check service 'freshness'
    notifications_enabled 1 ; Service notifications are enabled
    event_handler_enabled 1 ; Service event handler is enabled
    flap_detection_enabled 1 ; Flap detection is enabled
    process_perf_data 1 ; Process performance data
    retain_status_information 1 ; Retain status information across program restarts
    retain_nonstatus_information 1 ; Retain non-status information across program restarts
    is_volatile 0 ; The service is not volatile
    check_period 24x7 ; The service can be checked at any time of the day
    max_check_attempts 3 ; Re-check the service up to 3 times in order to determine its final (hard) state
    normal_check_interval 10 ; Check the service every 10 minutes under normal conditions
    retry_check_interval 2 ; Re-check the service every two minutes until a hard state can be determined
    contact_groups admins ; Notifications get sent out to everyone in the 'admins' group
    notification_options w,u,c,r ; Send notifications about warning, unknown, critical, and recovery events
    notification_interval 60 ; Re-notify about service problems every hour
    notification_period 24x7 ; Notifications can be sent out at any time
    register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
}
# Define a template for switches that we can reuse
define host{
    name generic-switch ; The name of this host template
    use generic-host ; Inherit default values from the generic-host template
    check_period 24x7 ; By default, switches are monitored round the clock
    check_interval 0.3 ; Switches are checked every 5 minutes
    retry_interval 1 ; Schedule host check retries at 1 minute intervals
    max_check_attempts 2 ; Check each switch 10 times (max)
    check_command check-host-alive ; Default command to check if routers are "alive"
    notification_period 24x7 ; Send notifications at any time
    notification_interval 30 ; Resend notifications every 30 minutes
    notification_options d,r,u ; Only send notifications for specific host states
    contact_groups admins,calls ; Notifications get sent to the admins by default
    register 0 ; DONT REGISTER THIS - ITS JUST A TEMPLATE
}
Re: Nagios UI is saying something is down when it is not
Posted: Fri Mar 13, 2020 4:23 pm
by scottwilkerson
Looking at the bottom screenshot above, I noticed that Last Check Time is 06-05-2031.
Is the date way off on this machine?
If it was and you have since fixed it, you may need to force an immediate check to reset this.
Re: Nagios UI is saying something is down when it is not
Posted: Mon Mar 16, 2020 11:09 am
by Alan
I looked at the date and time on the CentOS server and they are correct. I attached a screenshot of the time, and this is what it shows when I SSH into the server:
Code: Select all
[alan@Svr-Monitor ~]$ date
Mon Mar 16 09:03:39 PDT 2020
[alan@Svr-Monitor ~]$
How do I force an immediate check to reset it?
Re: Nagios UI is saying something is down when it is not
Posted: Mon Mar 16, 2020 11:15 am
by Alan
I looked at a server that shows as up, and its date is at least in 2020. I checked all three devices that show as down, and all of them have Last State Change: 06-05-2031 19:19:20.
I also looked at several other devices that are up, and they have Last State Change: 03-02-2020 09:31:29. Not sure why those three have that weird date.
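One way to confirm which objects carry a future date is to scan the Nagios status file for timestamps later than the current time. The sketch below is only an illustration: the function name future_timestamps is made up for this example, and the file location (commonly /usr/local/nagios/var/status.dat on source installs) is an assumption that should be checked against the status_file setting in nagios.cfg. It relies on the host_name=, last_check= and last_state_change= fields that the status file stores as epoch seconds.

```shell
#!/bin/sh
# Hypothetical helper: list entries whose last_check or last_state_change
# lies in the future relative to "now" (epoch seconds).
# $1 = path to status.dat (assumed location, verify in nagios.cfg)
# $2 = optional reference time; defaults to the current time
future_timestamps() {
    file="$1"
    now="${2:-$(date +%s)}"
    awk -F= -v now="$now" '
        { sub(/^[ \t]+/, "") }            # strip leading indentation
        $1 == "host_name" { host = $2 }   # remember the current host
        ($1 == "last_check" || $1 == "last_state_change") && $2 + 0 > now + 0 {
            print host, $1, $2
        }
    ' "$file"
}

# Example (path is an assumption):
#   future_timestamps /usr/local/nagios/var/status.dat
```

Service entries in the same file also carry a host_name line, so a service with a future timestamp is reported under its host's name.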
Re: Nagios UI is saying something is down when it is not
Posted: Mon Mar 16, 2020 3:26 pm
by scottwilkerson
Can you go to each of these and under the "Commands" box click "Re-schedule the next check of this host"
Then click "Commit".
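If the Commands box is hard to find in the UI, the same request can usually be sent from the shell through the Nagios external command file. This is a rough sketch, not a confirmed fix for this thread: SCHEDULE_FORCED_HOST_CHECK is a standard Nagios Core external command, but the function names and the command file path below are assumptions (check the command_file setting in nagios.cfg), and external commands must be enabled.

```shell
#!/bin/sh
# Assumed path; confirm against the command_file directive in nagios.cfg.
CMD_FILE="${CMD_FILE:-/usr/local/nagios/var/rw/nagios.cmd}"

# Build the external-command line for a forced host check.
# Format: [timestamp] SCHEDULE_FORCED_HOST_CHECK;<host_name>;<check_time>
# $1 = host name, $2 = optional epoch time (defaults to now)
forced_host_check_line() {
    host="$1"
    now="${2:-$(date +%s)}"
    printf '[%s] SCHEDULE_FORCED_HOST_CHECK;%s;%s\n' "$now" "$host" "$now"
}

# Hypothetical wrapper: write the command into the Nagios command pipe.
# Must run as a user with write access to the command file.
submit_forced_host_check() {
    forced_host_check_line "$1" > "$CMD_FILE"
}

# Example:
#   submit_forced_host_check APC1
```

After the check runs, the host's Last Check Time in the UI should show the current date if the check result was accepted.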
Re: Nagios UI is saying something is down when it is not
Posted: Tue Mar 17, 2020 1:33 pm
by Alan
I apologize but I have not been able to find where that is. Is this in the Nagios UI?
Re: Nagios UI is saying something is down when it is not
Posted: Tue Mar 17, 2020 1:41 pm
by scottwilkerson
Alan wrote:I apologize but I have not been able to find where that is. Is this in the Nagios UI?
Yes. Is your user added to these lines in cgi.cfg?
Code: Select all
authorized_for_all_host_commands=
authorized_for_all_hosts=
authorized_for_all_service_commands=
authorized_for_all_services=
authorized_for_configuration_information=
authorized_for_system_commands=
authorized_for_system_information=
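A quick way to verify this from the shell is to check each of those directives for the user name. The sketch below is an illustration under assumptions: the function name missing_cgi_auth is made up, and the cgi.cfg path (often /usr/local/nagios/etc/cgi.cfg) depends on the install. It treats a value of * as matching every authenticated user.

```shell
#!/bin/sh
# Hypothetical helper: print each authorization directive in cgi.cfg
# that does NOT list the given user (and is not set to "*").
# $1 = user name, $2 = path to cgi.cfg (assumed location)
missing_cgi_auth() {
    user="$1"
    cfg="$2"
    for key in authorized_for_all_host_commands \
               authorized_for_all_hosts \
               authorized_for_all_service_commands \
               authorized_for_all_services \
               authorized_for_configuration_information \
               authorized_for_system_commands \
               authorized_for_system_information; do
        # Take the last uncommented definition of the directive, if any.
        line=$(grep -E "^[[:space:]]*${key}=" "$cfg" | tail -n 1)
        value=${line#*=}
        # Normalize to ",a,b,c," so whole names can be matched.
        case ",$(printf '%s' "$value" | tr -d '[:space:]')," in
            *",$user,"*|*",*,"*) ;;   # user listed, or wildcard
            *) echo "$key" ;;         # directive missing or user absent
        esac
    done
}

# Example (path is an assumption):
#   missing_cgi_auth nagiosadmin /usr/local/nagios/etc/cgi.cfg
```

No output means the user appears in all seven directives.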
Re: Nagios UI is saying something is down when it is not
Posted: Tue Mar 17, 2020 4:48 pm
by Alan
I do have the username added to all of these in the cgi.cfg file. It is just the nagiosadmin user.