Network broke and killed my nagios

Post by **WillemDH** » Tue Jan 27, 2015 4:09 pm

Hello,

Just a thought, but maybe there is another path to look at if the whole "service dependency thing" is not very practical to organize, or if the "host unreachable makes services stop checking" thing is also too difficult to apply to all types of hosts.

Couldn't there be some max queue number that causes the Nagios server to stop accepting new checks when too many checks are getting queued? Let's say we have on average 500 checks / minute. If this number suddenly jumps to 1200, could the check interval be multiplied by some number? Or lengthened by X time? Or could the checks just be skipped?

Let's say we know our Nagios server is not able to process more than 1000 checks / minute. Why would we let it go beyond that, knowing that the server will crash?

IMHO it seems better to have a working Nagios server with a few fewer checks than a completely unusable server. With a little more self-monitoring, this should be possible, no?

Grtz

Willem
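To make the throttling idea above concrete, here is a minimal sketch of how a stretched check interval could be computed. It is only an illustration: `scaled_interval` is a hypothetical helper, not a Nagios feature, and the 500 / 1000 / 1200 figures are the example numbers from the post.

```sh
# Hypothetical sketch: stretch a check interval when the observed check
# rate exceeds what the server is known to handle. Not a Nagios feature.

# scaled_interval RATE MAX_RATE BASE_INTERVAL
# Prints BASE_INTERVAL unchanged while RATE <= MAX_RATE. Otherwise it
# scales the interval by RATE/MAX_RATE (rounded up), so the effective
# rate drops back to roughly MAX_RATE.
scaled_interval() {
	rate=$1
	max_rate=$2
	base=$3
	if [ "$rate" -le "$max_rate" ]; then
		echo "$base"
	else
		# Integer ceiling of base * rate / max_rate
		echo $(( (base * rate + max_rate - 1) / max_rate ))
	fi
}

scaled_interval 500 1000 5     # normal load, interval stays at 5
scaled_interval 1200 1000 5    # overload, interval grows to 6
```

With a 5-minute base interval, 1200 checks / minute against a 1000 checks / minute ceiling gives a 6-minute interval, which brings the rate back to about 1000. Skipping checks instead of stretching them would be a different policy and is not shown here.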

Post by **BanditBBS** » Tue Jan 27, 2015 4:46 pm

How does this look to everyone?

#!/bin/sh
#
# Event handler script for enabling/disabling active
# checks on services depending on the host status
now=$(date +%s)
commandfile='/usr/local/nagios/var/rw/nagios.cmd'

case "$1" in

OK)
	# The host came back up, re-enable its service checks
	/bin/printf "[%lu] ENABLE_HOST_SVC_CHECKS;%s\n" "$now" "$3" > "$commandfile"
	;;

WARNING)
	case "$2" in
	SOFT)
		# Do nothing
		;;
	HARD)
		/bin/printf "[%lu] ENABLE_HOST_SVC_CHECKS;%s\n" "$now" "$3" > "$commandfile"
		;;
	esac
	;;

UNKNOWN)
	case "$2" in
	SOFT)
		# Do nothing
		;;
	HARD)
		/bin/printf "[%lu] DISABLE_HOST_SVC_CHECKS;%s\n" "$now" "$3" > "$commandfile"
		;;
	esac
	;;
	
CRITICAL)
	case "$2" in
	SOFT)
		# Do nothing
		;;
	HARD)
		/bin/printf "[%lu] DISABLE_HOST_SVC_CHECKS;%s\n" "$now" "$3" > "$commandfile"
		;;
	esac
	;;
esac
exit 0
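Before pointing a handler like this at the live command pipe, it may help to exercise the decision logic on its own. The sketch below mirrors the state / state-type table of the script above in a single function that only prints what would be submitted. `svc_check_command` and the host name `web01` are hypothetical names used for illustration.

```sh
# Dry-run sketch of the handler's decision table. It prints the external
# command that would be sent, or nothing when no action is taken.
# svc_check_command and web01 are illustrative names only.

# svc_check_command STATE STATE_TYPE HOST
svc_check_command() {
	case "$1:$2" in
	OK:*|WARNING:HARD)
		printf 'ENABLE_HOST_SVC_CHECKS;%s\n' "$3"
		;;
	UNKNOWN:HARD|CRITICAL:HARD)
		printf 'DISABLE_HOST_SVC_CHECKS;%s\n' "$3"
		;;
	esac
}

svc_check_command CRITICAL HARD web01   # would disable service checks
svc_check_command CRITICAL SOFT web01   # prints nothing
svc_check_command OK HARD web01         # would re-enable service checks
```

Keeping the decision separate from the write to the command file makes it easy to check each state combination from a shell prompt without touching a running Nagios instance.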

Still waiting on Ludmil to go into detail on this before I put this into production:

lmiltchev wrote: This creates some "latency" and other issues, so I guess this is not a great solution either.

Post by **scottwilkerson** » Tue Jan 27, 2015 5:15 pm

This actually looks good to accomplish what you were asking for.

A point on the question of not running the service checks if a host is in a DOWN state:

This is not what many people want. In many environments it is quite conceivable that a host check could be in a DOWN state while a service check can still be performed.

If the checks are not performed, it would be impossible to accurately report the percentage of time a service was in a given state.

There are settings in nagios.cfg that you can use to throttle checks and keep the number of processes from spinning out of control, such as max_concurrent_checks
http://nagios.sourceforge.net/docs/nagi ... gmain.html

Post by **BanditBBS** » Tue Jan 27, 2015 5:31 pm

scottwilkerson wrote: This actually looks good to accomplish what you were asking for.

A point on the question of not running the service checks if a host is in a DOWN state:

This is not what many people want. In many environments it is quite conceivable that a host check could be in a DOWN state while a service check can still be performed.

If the checks are not performed, it would be impossible to accurately report the percentage of time a service was in a given state.

There are settings in nagios.cfg that you can use to throttle checks and keep the number of processes from spinning out of control, such as max_concurrent_checks
http://nagios.sourceforge.net/docs/nagi ... gmain.html

Scott,

I can't even imagine wanting it that way, but I definitely understand that others may. That's why my feature request was to add a new variable that, if set on services, would disable checks when the host is in whatever state is specified. That shouldn't affect anyone's current setup at all; it would just mean those of us who want it this way have to add a new variable.

I'm going to take a guess that if max_concurrent_checks=0 then that means unlimited?

Thanks!

Post by **scottwilkerson** » Tue Jan 27, 2015 5:40 pm

BanditBBS wrote: I'm going to take a guess that if max_concurrent_checks=0 then that means unlimited?

Correct
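For anyone wanting to check what their own install is set to, here is a small sketch that pulls max_concurrent_checks out of a config file and reports it, treating 0 as unlimited as confirmed above. `cfg_value` and `describe_limit` are hypothetical helper names, and the demo runs against a throwaway file rather than a real nagios.cfg.

```sh
# Sketch: read a KEY=VALUE setting from a Nagios-style config file.
# cfg_value and describe_limit are illustrative helpers, not Nagios tools.

# cfg_value KEY FILE
# Prints the value of the last uncommented KEY=VALUE line in FILE.
cfg_value() {
	sed -n "s/^[[:space:]]*$1[[:space:]]*=[[:space:]]*//p" "$2" | tail -n 1
}

# describe_limit VALUE
# Turns a max_concurrent_checks value into a human-readable summary.
describe_limit() {
	if [ -z "$1" ]; then
		echo "not set"
	elif [ "$1" = "0" ]; then
		echo "unlimited"
	else
		echo "at most $1 concurrent checks"
	fi
}

# Demo against a temporary file instead of a live nagios.cfg
tmp=$(mktemp)
printf '%s\n' '# max_concurrent_checks=50' 'max_concurrent_checks=0' > "$tmp"
describe_limit "$(cfg_value max_concurrent_checks "$tmp")"
rm -f "$tmp"
```

To use it on a real system, pass the path to your own nagios.cfg as the second argument to `cfg_value`.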

Post by **BanditBBS** » Tue Jan 27, 2015 5:42 pm

Cool...thanks to everyone for replying.....I'm done with this thread if you want to close it.

Nagios Support Forum
