I have an issue that's driving me nuts.
Basically, my Nagios server decides to pick on a random host and then bullies it.
By this I mean it will check the host and, if it thinks it's down, rather than waiting the 3 minutes it's told to before rechecking, it starts checking it every 30-50 seconds!
So an example:
I have a custom bash script that will hit one of my websites, log in and click a few links looking for stuff.
Here is its host definition -
define host{
host_name HOST CLEANSED
alias HOST CLEANSED
address https://cleansed/cleansed
use host-energy-website ; See Host Templates section (below)
hostgroups GDE Websites ; See Hostgroup section (below)
parents FTL GNATBOX ; No alerts if parent is in Down state
check_command check_website!"cleansed" ; See Commands section (below)
}
And here is the template it uses -
define host{
name host-energy-website ; The name of this host template
check_period website_24x7 ; Websites are monitored at all times
check_interval 10 ; Websites are checked every 10 minutes when in OK state
retry_interval 3 ; Website re-checked every 3 minutes if in problem state
max_check_attempts 3 ; Websites checked 3 times to determine Up or Down state
notification_period website_24x7 ; Send notifications at any time
notification_interval 10 ; Resend notifications every 10 minutes
notification_options d,r ; Only send notifications for DOWN and RECOVERY states
notifications_enabled 1 ; Host notifications are enabled
contact_groups Website Email, Website sms ; Notifications get sent to these groups
event_handler_enabled 1 ; Host event handler is enabled
process_perf_data 1 ; Process performance data
retain_status_information 1 ; Retain status information across program restarts
retain_nonstatus_information 1 ; Non-Status information is kept between server restarts
passive_checks_enabled 0 ; Passive checks are disabled
obsess_over_host 0 ; We do not obsess over this host
check_freshness 0 ; We do not check this host for freshness
flap_detection_enabled 0 ; Flap Detection is disabled
failure_prediction_enabled 0 ; Failure Prediction is disabled
}
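To spell out what I expect from that template: after a first failed check, the second and third attempts should come 3 and 6 minutes later, so the host should only go HARD DOWN after about 6 minutes. A rough Python sketch of that arithmetic is below. It is only an illustration, not something Nagios runs, and it assumes `interval_length` in nagios.cfg is still at its default of 60 seconds; the function name is made up for the example.

```python
# Values copied from the host-energy-website template above.
RETRY_INTERVAL = 3       # retry_interval, in interval units
MAX_CHECK_ATTEMPTS = 3   # max_check_attempts

# Assumption: interval_length in nagios.cfg is left at its default,
# so one interval unit is 60 seconds.
INTERVAL_LENGTH = 60


def expected_attempt_offsets(retry_interval, max_check_attempts, interval_length):
    """Seconds after the first failed check at which each attempt should run.

    Hypothetical helper: attempt 1 is the failed check itself (offset 0),
    and every later attempt is one retry interval after the previous one.
    """
    step = retry_interval * interval_length
    return [attempt * step for attempt in range(max_check_attempts)]


offsets = expected_attempt_offsets(RETRY_INTERVAL, MAX_CHECK_ATTEMPTS, INTERVAL_LENGTH)
print("expected attempt offsets (seconds):", offsets)
print("expected time to HARD state (seconds):", offsets[-1])
```

With the values above that gives attempts at 0, 180 and 360 seconds.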
And here is what the log shows for that host, newest entry first -
[04-01-2018 14:42:50] HOST ALERT: HOST CLEANSED;UP;SOFT;3;OK - https://cleansed/cleasned/ is online and working
[04-01-2018 14:42:20] HOST ALERT: HOST CLEANSED;DOWN;SOFT;2;(Host Check Timed Out)
[04-01-2018 14:41:50] HOST ALERT: HOST CLEANSED;DOWN;SOFT;1;(Host Check Timed Out)
[04-01-2018 11:44:20] HOST ALERT: HOST CLEANSED;UP;SOFT;3;OK - https://cleansed/cleasned/ is online and working
[04-01-2018 11:43:50] HOST ALERT: HOST CLEANSED;DOWN;SOFT;2;(Host Check Timed Out)
[04-01-2018 11:43:00] HOST ALERT: HOST CLEANSED;DOWN;SOFT;1;CRITICAL - My custom error messgae from my bash script
[04-01-2018 10:52:00] HOST ALERT: HOST CLEANSED;UP;SOFT;3;OK - https://cleansed/cleasned/ is online and working
[04-01-2018 10:51:20] HOST ALERT: HOST CLEANSED;DOWN;SOFT;2;(Host Check Timed Out)
[04-01-2018 10:50:30] HOST ALERT: HOST CLEANSED;DOWN;SOFT;1;(Host Check Timed Out)
[04-01-2018 02:31:40] HOST ALERT: HOST CLEANSED;UP;HARD;1;OK - https://cleansed/cleasned/ is online and working
[04-01-2018 02:31:00] HOST ALERT: HOST CLEANSED;DOWN;HARD;3;CRITICAL - My custom error messgae from my bash script
[04-01-2018 02:30:30] HOST ALERT: HOST CLEANSED;DOWN;SOFT;2;(Host Check Timed Out)
[04-01-2018 02:29:30] HOST ALERT: HOST CLEANSED;DOWN;SOFT;1;(Host Check Timed Out)
And so on. It makes Mrs FTL really angry when my phone goes off at 4am.
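To put numbers on it, here is a rough Python sketch that works out the gap between consecutive alerts. The four lines are the 02:29 incident copied from the log above and reordered oldest first. The function names are made up for the example, and I am assuming the date part is month-day-year (it makes no difference to the gaps, since everything is on the same day).

```python
from datetime import datetime

# One incident copied from the log above, reordered oldest first.
LOG_LINES = [
    "[04-01-2018 02:29:30] HOST ALERT: HOST CLEANSED;DOWN;SOFT;1;(Host Check Timed Out)",
    "[04-01-2018 02:30:30] HOST ALERT: HOST CLEANSED;DOWN;SOFT;2;(Host Check Timed Out)",
    "[04-01-2018 02:31:00] HOST ALERT: HOST CLEANSED;DOWN;HARD;3;CRITICAL - My custom error messgae from my bash script",
    "[04-01-2018 02:31:40] HOST ALERT: HOST CLEANSED;UP;HARD;1;OK - https://cleansed/cleasned/ is online and working",
]


def parse_timestamp(line):
    """Pull the bracketed timestamp off the front of a log line."""
    stamp = line[1:line.index("]")]
    # Assumption: the date part is month-day-year.
    return datetime.strptime(stamp, "%m-%d-%Y %H:%M:%S")


def gaps_in_seconds(lines):
    """Seconds between each pair of consecutive log lines."""
    times = [parse_timestamp(line) for line in lines]
    return [
        int((later - earlier).total_seconds())
        for earlier, later in zip(times, times[1:])
    ]


gaps = gaps_in_seconds(LOG_LINES)
print("gaps between alerts (seconds):", gaps)
```

For that incident the gaps come out at 60, 30 and 40 seconds, nowhere near the 3 minutes that retry_interval 3 is supposed to give.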
If I stop and start the Nagios service, it behaves itself for a few days, and then it picks on another random host and does the same: it ignores the retry interval and checks every 30-50 seconds when it thinks it's got a problem.
The machine is running Ubuntu 12.04 LTS and Nagios is 3.4.1. Yes, I know it's old.
But I have another server, also Ubuntu 12.04 LTS with Nagios Core 3.4.1, in another location, and that one doesn't go around bullying hosts!
Please can somebody help me diagnose this playground bully and bring it in for after-school detention?
Thanks