Nagios is checking too frequently

Posted: Thu Jan 04, 2018 10:12 am
by FTL
Hi,

I have an issue that's driving me nuts.

Basically, my Nagios server decides to pick on a random host and then bullies it.

By this I mean it will check the host and, if it thinks it's down, rather than waiting the 3 minutes it's been told to before rechecking, it starts checking it every 30-50 seconds!

So an example:

I have a custom bash script that will hit one of my websites, log in and click a few links looking for stuff.

Here is its host definition:

Code: Select all

define host{	
	host_name		HOST CLEANSED
	alias			HOST CLEANSED
	address			https://cleansed/cleansed			
	use			host-energy-website				; See Host Templates section (below)	
	hostgroups		GDE Websites					; See Hostgroup section (below)	
	parents			FTL GNATBOX					; No alerts if parent is in Down state
	check_command		check_website!"cleansed"			; See Commands section (below)
	}
Here is its host template

Code: Select all

define host{
	name			     host-energy-website	; The name of this host template
	check_period		     website_24x7		; Websites are monitored at all times
	check_interval		     10				; Websites are checked every 10 minutes when in OK state
	retry_interval		     3				; Website re-checked every 3 minutes if in problem state
	max_check_attempts	     3				; Websites checked 3 times to determine Up or Down state
	notification_period	     website_24x7		; Send notifications at any time
	notification_interval	     10				; Resend notifications every 10 minutes
	notification_options	     d,r			; Only send notifications for DOWN and RECOVERY states
	notifications_enabled        1       			; Host notifications are enabled
	contact_groups		     Website Email, Website sms	; Notifications get sent to these groups
	event_handler_enabled        1       			; Host event handler is enabled
        process_perf_data            1       			; Process performance data
        retain_status_information    1       			; Retain status information across program restarts
        retain_nonstatus_information 1       			; Non-Status information is kept between server restarts
	passive_checks_enabled	     0				; Passive checks are disabled
	obsess_over_host	     0				; We do not obsess over this host
	check_freshness		     0				; We do not check this host for freshness
	flap_detection_enabled	     0				; Flap Detection is disabled
	failure_prediction_enabled   0				; Failure Prediction is disabled
	}
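The intervals in this template are counted in time units rather than minutes as such: Nagios multiplies them by `interval_length` from nagios.cfg, which defaults to 60 seconds. A minimal sketch for confirming what the template means in seconds on a given server follows. The helper names are hypothetical, and the config path in the usage comment is the one reported by `find` later in this thread.

```shell
# Hypothetical helper: print the interval_length (seconds per time unit) set in
# a Nagios main config file, falling back to the default of 60 if it is unset.
interval_length() {
    value=$(sed -n 's/^[[:space:]]*interval_length[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1" | tail -n 1)
    echo "${value:-60}"
}

# Hypothetical helper: convert an interval given in time units to seconds.
# $1 = main config file, $2 = interval in time units (e.g. retry_interval)
interval_seconds() {
    echo $(( $2 * $(interval_length "$1") ))
}

# Usage (adjust the path to your installation):
#   interval_seconds /usr/local/nagios/etc/nagios.cfg 3     # retry_interval
#   interval_seconds /usr/local/nagios/etc/nagios.cfg 10    # check_interval
```

With the default `interval_length` of 60, a `retry_interval` of 3 works out to 180 seconds between retries.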
And here is an example of it being bullied every 30-50 seconds, not every 3 minutes!

[04-01-2018 14:42:50] HOST ALERT: HOST CLEANSED;UP;SOFT;3;OK - https://cleansed/cleasned/ is online and working
[04-01-2018 14:42:20] HOST ALERT: HOST CLEANSED;DOWN;SOFT;2;(Host Check Timed Out)
[04-01-2018 14:41:50] HOST ALERT: HOST CLEANSED;DOWN;SOFT;1;(Host Check Timed Out)

[04-01-2018 11:44:20] HOST ALERT: HOST CLEANSED;UP;SOFT;3;OK - https://cleansed/cleasned/ is online and working
[04-01-2018 11:43:50] HOST ALERT: HOST CLEANSED;DOWN;SOFT;2;(Host Check Timed Out)
[04-01-2018 11:43:00] HOST ALERT: HOST CLEANSED;DOWN;SOFT;1;CRITICAL - My custom error messgae from my bash script

[04-01-2018 10:52:00] HOST ALERT: HOST CLEANSED;UP;SOFT;3;OK - https://cleansed/cleasned/ is online and working
[04-01-2018 10:51:20] HOST ALERT: HOST CLEANSED;DOWN;SOFT;2;(Host Check Timed Out)
[04-01-2018 10:50:30] HOST ALERT: HOST CLEANSED;DOWN;SOFT;1;(Host Check Timed Out)

[04-01-2018 02:31:40] HOST ALERT: HOST CLEANSED;UP;HARD;1;OK - https://cleansed/cleasned/ is online and working
[04-01-2018 02:31:00] HOST ALERT: HOST CLEANSED;DOWN;HARD;3;CRITICAL - My custom error messgae from my bash script
[04-01-2018 02:30:30] HOST ALERT: HOST CLEANSED;DOWN;SOFT;2;(Host Check Timed Out)
[04-01-2018 02:29:30] HOST ALERT: HOST CLEANSED;DOWN;SOFT;1;(Host Check Timed Out)
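
To put numbers on the retry spacing, the gaps between consecutive entries can be computed from the timestamps. This is only a sketch: `alert_gaps` is a hypothetical helper, and it assumes the entries are pasted newest first (as above), that each blank-line-separated group falls on a single day, and that the timestamp is the second whitespace-separated field.

```shell
# Hypothetical helper: read "[DD-MM-YYYY HH:MM:SS] HOST ALERT: ..." lines on
# stdin (newest first) and print the gap in seconds between each entry and the
# older entry directly below it. A blank line starts a new group.
alert_gaps() {
    awk '
        /^[[:space:]]*$/ { seen = 0 }          # blank line: reset for next group
        /HOST ALERT/ {
            t = $2; sub(/\]/, "", t)           # "HH:MM:SS]" -> "HH:MM:SS"
            split(t, hms, ":")
            now = hms[1] * 3600 + hms[2] * 60 + hms[3]
            if (seen) print prev - now         # newer entry minus older entry
            prev = now; seen = 1
        }'
}

# Usage: paste the log excerpt into a file and run
#   alert_gaps < alerts.txt
```

For the 10:50 to 10:52 group this prints 40 and 50, well short of the 180 seconds that a `retry_interval` of 3 implies with the default `interval_length` of 60.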

And so on. It makes Mrs FTL really angry when my phone goes off at 4am :)

If I stop and start the Nagios service, it will behave itself for a few days, and then it will pick on another random host and do the same: ignore the retry interval and check every 30-50 seconds when it thinks it's got a problem.

The machine is running Ubuntu 12.04 LTS and Nagios is 3.4.1. Yes, I know it's old.

But I have another server, also Ubuntu 12.04 LTS with Nagios Core 3.4.1, in another location, and that one doesn't go around bullying hosts! :)

Please can somebody help me diagnose this playground bully and bring it in for after-school detention?

Thanks

Re: Nagios is checking too frequently

Posted: Thu Jan 04, 2018 11:10 am
by dwhitfield
Please post the nagios.cfg files from both the working and non-working servers (and label them appropriately). Are you using the experimental scheduler that 3.x provides?

Also, I have to ask...was the retry_interval 1 in the past? If so, did you restart the nagios service after making the change?
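
One way to compare the two servers is to reduce each nagios.cfg to its active settings and diff the results. The sketch below uses a hypothetical helper and placeholder file names. It also assumes that the "experimental scheduler" refers to the auto-rescheduling options (`auto_reschedule_checks` and friends), which the 3.x sample config marks as experimental; that reading is a guess, not something confirmed in this thread.

```shell
# Hypothetical helper: print the active (uncommented) name=value settings of a
# nagios.cfg file, sorted, so two servers can be compared line by line.
active_settings() {
    grep -v '^[[:space:]]*#' "$1" | grep '=' | sed 's/[[:space:]]*$//' | sort
}

# Usage (file names are placeholders for the two configs):
#   active_settings working-nagios.cfg > working.txt
#   active_settings broken-nagios.cfg  > broken.txt
#   diff working.txt broken.txt
#   grep '^auto_resched' broken.txt    # assumed "experimental scheduler" options
```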

Re: Nagios is checking too frequently

Posted: Thu Jan 04, 2018 11:47 am
by FTL
Hi, please see attached files

It's been a 3-minute retry interval for as long as I can remember.

I don't even know what the experimental scheduler is, so I guess that's a no :)

Thank you

Re: Nagios is checking too frequently

Posted: Thu Jan 04, 2018 12:41 pm
by npolovenko
@FTL, please run the following command and show us the output:

Code: Select all

ps aux | grep nagios
If the output is too big, you may run it like this

Code: Select all

ps aux | grep nagios > 1.txt
to save the output to a text file, and then upload 1.txt instead.
Also, please run this one:

Code: Select all

find / -name nagios.cfg

Re: Nagios is checking too frequently

Posted: Fri Jan 05, 2018 5:41 am
by FTL
Please see the attached ps aux output.

Output of find:

Code: Select all

nagios@XXXXXXXXXX:~$ sudo find / -name nagios.cfg
[sudo] password for nagios:
/home/nagios/Downloads/nagios/sample-config/nagios.cfg
/home/nagios/Downloads/nagios/t/etc/nagios.cfg
/home/nagios/Downloads/nagios/t-tap/smallconfig/nagios.cfg
/usr/local/nagios/etc/nagios.cfg

Re: Nagios is checking too frequently

Posted: Fri Jan 05, 2018 1:24 pm
by npolovenko
@FTL, based on the output, it looks like you may have two instances of Nagios running at the same time. That would explain this strange behavior. Please run the following commands:

Code: Select all

service nagios stop	
killall -9 nagios	
service nagios start
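
Between the stop and the start it may be worth confirming that nothing named nagios is left running. A minimal sketch, assuming the daemon's command name is exactly `nagios`; `nagios_pids` is a hypothetical helper that filters the output of `ps -e -o pid= -o comm=`.

```shell
# Hypothetical helper: read "PID COMMAND" lines on stdin (the format printed by
# `ps -e -o pid= -o comm=`) and print the PIDs whose command is exactly "nagios".
nagios_pids() {
    awk '$2 == "nagios" { print $1 }'
}

# Usage, run as root:
#   service nagios stop
#   ps -e -o pid= -o comm= | nagios_pids    # expect no output before starting
#   service nagios start
#   ps -e -o pid= -o comm= | nagios_pids    # note these PIDs for later comparison
```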

Re: Nagios is checking too frequently

Posted: Tue Jan 09, 2018 11:04 am
by FTL
Hi, which bit in particular shows Nagios running twice please?

Thanks

Re: Nagios is checking too frequently

Posted: Tue Jan 09, 2018 12:19 pm
by npolovenko
@FTL, I was looking at the following lines:

Code: Select all

nagios   12847  0.0  0.0  13392  1504 ?        S    10:30   0:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios   12920  0.0  0.0  13392  1504 ?        S    10:30   0:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios   26600  0.1  0.1  13388  2324 ?        Ssl  Jan04   1:43 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
It's not 100% certain, because those could still be parent and child processes, but I saw two different start times. Running ps -aef | grep nagios would show the parent PIDs and whether such a relationship exists.
Either way, killing all Nagios processes and restarting Nagios is a good first troubleshooting step. Have you done that already?
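
The parent and child question can be answered from the PPID column. As a rough sketch, the hypothetical helper below reads `ps -e -o pid= -o ppid= -o args=` output and prints only those nagios processes whose parent is not another nagios process. It assumes the daemon is started by its full path ending in `/nagios`, as in the lines quoted above.

```shell
# Hypothetical helper: read "PID PPID COMMAND..." lines on stdin (as printed by
# `ps -e -o pid= -o ppid= -o args=`) and print the PID of every nagios process
# whose parent is not itself one of the nagios processes in the listing.
nagios_roots() {
    awk '
        $3 ~ /\/nagios$/ { ppid[$1] = $2 }     # remember each nagios process
        END {
            for (pid in ppid)
                if (!(ppid[pid] in ppid)) print pid
        }' | sort -n
}

# Usage:
#   ps -e -o pid= -o ppid= -o args= | nagios_roots
```

A single PID in the output suggests one daemon with children; more than one suggests separate instances running side by side.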