Nagios Support Forum

Posted: **Thu Jan 04, 2018 10:12 am**

Hi,

I have an issue that's driving me nuts.

Basically, my Nagios server decides to pick on a random host and then bullies it.

By this I mean it will check the host and, if it thinks it's down, rather than waiting the 3 minutes it's told to before rechecking, it starts checking it every 30-50 seconds!

So an example:

I have a custom bash script that will hit one of my websites, log in and click a few links looking for stuff.

Here is its host definition:

define host{	
	host_name		HOST CLEANSED
	alias			HOST CLEANSED
	address			https://cleansed/cleansed			
	use			host-energy-website				; See Host Templates section (below)	
	hostgroups		GDE Websites					; See Hostgroup section (below)	
	parents			FTL GNATBOX					; No alerts if parent is in Down state
	check_command		check_website!"cleansed"			; See Commands section (below)
	}

Here is its host template:

define host{
	name			     host-energy-website	; The name of this host template
	check_period		     website_24x7		; Websites are monitored at all times
	check_interval		     10				; Websites are checked every 10 minutes when in OK state
	retry_interval		     3				; Website re-checked every 3 minutes if in problem state
	max_check_attempts	     3				; Websites checked 3 times to determine Up or Down state
	notification_period	     website_24x7		; Send notifications at any time
	notification_interval	     10				; Resend notifications every 10 minutes
	notification_options	     d,r			; Only send notifications for DOWN and RECOVERY states
	notifications_enabled        1       			; Host notifications are enabled
	contact_groups		     Website Email, Website sms	; Notifications get sent to these groups
	event_handler_enabled        1       			; Host event handler is enabled
        process_perf_data            1       			; Process performance data
        retain_status_information    1       			; Retain status information across program restarts
        retain_nonstatus_information 1       			; Non-Status information is kept between server restarts
	passive_checks_enabled	     0				; Passive checks are disabled
	obsess_over_host	     0				; We do not obsess over this host
	check_freshness		     0				; We do not check this host for freshness
	flap_detection_enabled	     0				; Flap Detection is disabled
	failure_prediction_enabled   0				; Failure Prediction is disabled
	}

And here is an example of it being bullied every 30-50 seconds instead of every 3 minutes:

[04-01-2018 14:42:50] HOST ALERT: HOST CLEANSED;UP;SOFT;3;OK - https://cleansed/cleasned/ is online and working
[04-01-2018 14:42:20] HOST ALERT: HOST CLEANSED;DOWN;SOFT;2;(Host Check Timed Out)
[04-01-2018 14:41:50] HOST ALERT: HOST CLEANSED;DOWN;SOFT;1;(Host Check Timed Out)

[04-01-2018 11:44:20] HOST ALERT: HOST CLEANSED;UP;SOFT;3;OK - https://cleansed/cleasned/ is online and working
[04-01-2018 11:43:50] HOST ALERT: HOST CLEANSED;DOWN;SOFT;2;(Host Check Timed Out)
[04-01-2018 11:43:00] HOST ALERT: HOST CLEANSED;DOWN;SOFT;1;CRITICAL - My custom error messgae from my bash script

[04-01-2018 10:52:00] HOST ALERT: HOST CLEANSED;UP;SOFT;3;OK - https://cleansed/cleasned/ is online and working
[04-01-2018 10:51:20] HOST ALERT: HOST CLEANSED;DOWN;SOFT;2;(Host Check Timed Out)
[04-01-2018 10:50:30] HOST ALERT: HOST CLEANSED;DOWN;SOFT;1;(Host Check Timed Out)

[04-01-2018 02:31:40] HOST ALERT: HOST CLEANSED;UP;HARD;1;OK - https://cleansed/cleasned/ is online and working
[04-01-2018 02:31:00] HOST ALERT: HOST CLEANSED;DOWN;HARD;3;CRITICAL - My custom error messgae from my bash script
[04-01-2018 02:30:30] HOST ALERT: HOST CLEANSED;DOWN;SOFT;2;(Host Check Timed Out)
[04-01-2018 02:29:30] HOST ALERT: HOST CLEANSED;DOWN;SOFT;1;(Host Check Timed Out)

And so on. It makes Mrs FTL really angry when my phone goes off at 4am.
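
For anyone who wants to measure the retry spacing rather than read it off by eye, here is a minimal shell sketch. It assumes log lines in the format shown above (`[DD-MM-YYYY HH:MM:SS]`, newest first, with a blank line between incidents), and it only compares entries from the same day. The function name `retry_gaps` is hypothetical and not part of Nagios.

```shell
#!/bin/sh
# retry_gaps (hypothetical helper): print the number of seconds between
# consecutive alert lines. Assumes "[DD-MM-YYYY HH:MM:SS] ..." lines,
# newest first, with a blank line between incidents.
retry_gaps() {
    awk '
        /^\[/ {
            day = substr($0, 2, 10)
            t = substr($0, 13, 2) * 3600 + substr($0, 16, 2) * 60 + substr($0, 19, 2)
            # Only compare entries within the same incident and the same day.
            if (have && day == prevday) print prev - t
            prev = t; prevday = day; have = 1
            next
        }
        { have = 0 }   # any other line (such as a blank one) ends the incident
    '
}

# Example: the 14:41-14:42 incident from the log above.
retry_gaps <<'EOF'
[04-01-2018 14:42:50] HOST ALERT: HOST CLEANSED;UP;SOFT;3;OK - https://cleansed/cleasned/ is online and working
[04-01-2018 14:42:20] HOST ALERT: HOST CLEANSED;DOWN;SOFT;2;(Host Check Timed Out)
[04-01-2018 14:41:50] HOST ALERT: HOST CLEANSED;DOWN;SOFT;1;(Host Check Timed Out)
EOF
```

For this incident the sketch prints 30 twice. With a retry_interval of 3, and assuming interval_length is left at its default of 60 seconds, the expected spacing would be about 180.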

If I stop and start the Nagios service, it will behave itself for a few days and then pick on another random host and do the same: ignore the retry interval and check every 30-50 seconds when it thinks it's got a problem.

The machine is running Ubuntu 12.04 LTS and Nagios is 3.4.1. Yes, I know it's old.

But I have another server, also Ubuntu 12.04 LTS with Core 3.4.1, in another location, and that one doesn't go around bullying hosts!

Please can somebody help me diagnose this playground bully and bring it in for after school detention.

Thanks

Posted: **Thu Jan 04, 2018 11:10 am**

Please post the nagios.cfg files from both the working and non-working servers (and label them appropriately). Are you using the experimental scheduler that 3.x provides?

Also, I have to ask...was the retry_interval 1 in the past? If so, did you restart the nagios service after making the change?

Posted: **Thu Jan 04, 2018 11:47 am**

Hi, please see attached files

It has been a 3-minute retry interval for as long as I can remember.

I don't even know what the experimental scheduler is, so I guess that's a no.

Thank you

Posted: **Thu Jan 04, 2018 12:41 pm**

@FTL, please run the following command and show us the output:

ps aux | grep nagios

If the output is too big, you may run it like this:

ps aux | grep nagios > 1.txt

to save the output to a text file, and then upload 1.txt instead.
Also, this one:

find / -name nagios.cfg

Posted: **Fri Jan 05, 2018 5:41 am**

Please see ps aux file attached

Output of find:

nagios@XXXXXXXXXX:~$ sudo find / -name nagios.cfg
[sudo] password for nagios:
/home/nagios/Downloads/nagios/sample-config/nagios.cfg
/home/nagios/Downloads/nagios/t/etc/nagios.cfg
/home/nagios/Downloads/nagios/t-tap/smallconfig/nagios.cfg
/usr/local/nagios/etc/nagios.cfg

Posted: **Fri Jan 05, 2018 1:24 pm**

@FTL, Based on the output it looks like you may have 2 instances of Nagios running at the same time. That would explain this strange behavior. Please run the following commands:

service nagios stop	
killall -9 nagios	
service nagios start

Posted: **Tue Jan 09, 2018 11:04 am**

Hi, which bit in particular shows Nagios running twice please?

Thanks

Posted: **Tue Jan 09, 2018 12:19 pm**

@FTL, I was looking at the following lines:

nagios   12847  0.0  0.0  13392  1504 ?        S    10:30   0:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios   12920  0.0  0.0  13392  1504 ?        S    10:30   0:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios   26600  0.1  0.1  13388  2324 ?        Ssl  Jan04   1:43 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg

It's not 100% certain, because those could still be parent and child processes, but I saw two different start times (10:30 and Jan04). Running ps -aef | grep nagios would show the parent PID (PPID) of each process and tell us whether such a relationship exists.
But killing all Nagios processes and restarting Nagios is a good first step in a troubleshooting process anyway. Have you done it already?
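
To tell a parent/child pair apart from two independent daemons, the PPID column is what matters. Below is a minimal sketch of that check, assuming a procps-style ps. The function name `count_nagios_roots` is hypothetical. It counts processes whose executable path ends in /nagios and whose parent is not itself one of those processes. A single daemon with forked children should give 1, while two separately started daemons should give 2.

```shell
#!/bin/sh
# count_nagios_roots (hypothetical helper): count Nagios processes whose
# parent is not another Nagios process, i.e. independently started daemons.
count_nagios_roots() {
    # stdin: one process per line as "PID PPID COMMAND [ARGS...]"
    awk '
        $3 ~ /\/nagios$/ { parent[$1] = $2 }
        END {
            roots = 0
            for (p in parent)
                if (!(parent[p] in parent)) roots++
            print roots
        }
    '
}

# Typical use (assumes a ps that supports -e and -o with empty headers):
#   ps -eo pid=,ppid=,args= | count_nagios_roots
```

A result above 1 would support the two-instances theory. A result of 1 would suggest the extra lines are just children forked by the running daemon.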

Nagios is checking too frequently