Network broke and killed my nagios

WillemDH
Posts: 2320
Joined: Wed Mar 20, 2013 5:49 am
Location: Ghent

Re: Network broke and killed my nagios

Post by WillemDH »

Hello,

Just a thought, but maybe there is another path to look at if the whole "service dependency thing" is not very practical to organize, or if the "host unreachable makes services stop checking" thing is also too difficult to apply to all types of hosts.

Couldn't there be some maximum queue size that causes the Nagios server to stop accepting new checks when too many are getting queued? Let's say we run 500 checks / minute on average. If this number suddenly jumps to 1200, could the check interval be multiplied by some factor? Or lengthened by X time? Or could the excess checks just be skipped?

Let's say we know our Nagios server is not able to process more than 1000 checks / minute. Why would we let it go beyond that, knowing that the server will crash?

IMHO it seems better to have a working Nagios server with a few fewer checks than a completely unusable server. With a little more self-monitoring, this should be possible, no?
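
To make the "multiply the interval" idea concrete, here is a minimal sketch of the arithmetic only. It is not an existing Nagios feature, and the function name `stretched_interval` and its arguments are hypothetical; it simply scales a check interval by the ratio of the observed rate to the rate the server can sustain.

```shell
#!/bin/sh
# Hypothetical helper, not a Nagios feature: stretch a check interval
# when the observed check rate exceeds what the server can handle.
#   $1 = observed checks per minute
#   $2 = maximum checks per minute the server can sustain
#   $3 = normal check interval in minutes
stretched_interval() {
	rate=$1
	cap=$2
	interval=$3
	if [ "$rate" -le "$cap" ]; then
		# Under the cap: leave the interval alone
		echo "$interval"
	else
		# Over the cap: scale by rate/cap, rounding up to a whole minute
		echo $(( ($interval * $rate + $cap - 1) / $cap ))
	fi
}

stretched_interval 500 1000 5    # prints 5
stretched_interval 1200 1000 5   # prints 6
```

With the numbers from this post (a cap of 1000 checks / minute and a spike to 1200), a 5-minute interval would be stretched to 6 minutes, which brings the rate back to roughly the cap.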

Grtz

Willem
Nagios XI 5.8.1
https://outsideit.net
BanditBBS
Posts: 2474
Joined: Tue May 31, 2011 12:57 pm
Location: Scio, OH

Post by BanditBBS »

How does this look to everyone?

Code:

#!/bin/sh
#
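# Event handler script for enabling/disabling active
# checks on services depending on the host status.
# Arguments: $1 = state, $2 = state type (SOFT/HARD), $3 = host name
# Note: OK/WARNING/UNKNOWN/CRITICAL are service states; $HOSTSTATE$
# reports UP/DOWN/UNREACHABLE and would match none of the branches below.
now=$(date +%s)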
commandfile='/usr/local/nagios/var/rw/nagios.cmd'

case "$1" in

OK)
	# Back to OK: re-enable active checks of the host's services
	/bin/printf "[%lu] ENABLE_HOST_SVC_CHECKS;%s\n" "$now" "$3" > "$commandfile"
	;;

WARNING)
	case "$2" in
	SOFT)
		# Do nothing
		;;
	HARD)
		/bin/printf "[%lu] ENABLE_HOST_SVC_CHECKS;%s\n" "$now" "$3" > "$commandfile"
		;;
	esac
	;;

UNKNOWN)
	case "$2" in
	SOFT)
		# Do nothing
		;;
	HARD)
		/bin/printf "[%lu] DISABLE_HOST_SVC_CHECKS;%s\n" "$now" "$3" > "$commandfile"
		;;
	esac
	;;
	
CRITICAL)
	case "$2" in
	SOFT)
		# Do nothing
		;;
	HARD)
		/bin/printf "[%lu] DISABLE_HOST_SVC_CHECKS;%s\n" "$now" "$3" > "$commandfile"
		;;
	esac
	;;
esac
exit 0
Still waiting on Ludmil to go into detail on this before I put this into production:
lmiltchev wrote:This creates some "latency" and other issues, so I guess this is not a great solution either.
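
Before a handler like this goes into production, its branching can be exercised without touching a running Nagios. The sketch below mirrors the same decisions but prints the external command instead of writing it to the command file. The function name `handler_command` and the host name `web01` are made up for illustration.

```shell
#!/bin/sh
# Dry-run sketch: same decisions as the handler above, but the command
# is printed instead of written to the Nagios command file.
#   $1 = state, $2 = state type (SOFT/HARD), $3 = host name
handler_command() {
	case "$1:$2" in
	OK:*|WARNING:HARD)
		action=ENABLE_HOST_SVC_CHECKS ;;
	UNKNOWN:HARD|CRITICAL:HARD)
		action=DISABLE_HOST_SVC_CHECKS ;;
	*)
		# SOFT problem states and unrecognized states: do nothing
		return 0 ;;
	esac
	printf '[%s] %s;%s\n' "$(date +%s)" "$action" "$3"
}

handler_command CRITICAL HARD web01   # web01 is a hypothetical host name
handler_command CRITICAL SOFT web01   # prints nothing
handler_command OK HARD web01
```

Running it for each state/type pair shows at a glance which combinations would disable checks and which would be ignored.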
2 of XI5.6.14 Prod/DR/DEV - Nagios LogServer 2 Nodes
See my projects on the Exchange at BanditBBS - Also check out my Nagios stuff on my personal page at Bandit's Home and at github
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Post by scottwilkerson »

This actually looks good to accomplish what you were asking for.

A point on the question of not running the service checks if a host is in a DOWN state:

This is not what many people want. In many environments it is quite conceivable that a host check could be in a DOWN state while a service check could still be performed.

If the checks were not performed, it would be impossible to accurately report the percentage of time a service spent in each state.

There are settings in nagios.cfg you can use to throttle checks and keep the number of processes from spinning out of control, such as max_concurrent_checks, which caps how many service checks run in parallel.
http://nagios.sourceforge.net/docs/nagi ... gmain.html
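
As a quick way to see what a server is currently using, the sketch below reads `max_concurrent_checks` from the main config file. The default path `/usr/local/nagios/etc/nagios.cfg` is an assumption based on a standard source install, and the function name is made up. If the option appears more than once, this sketch takes the last occurrence, which is also an assumption about how the file is read.

```shell
#!/bin/sh
# Print the effective max_concurrent_checks from a Nagios main config file.
#   $1 = path to nagios.cfg (the default below is an assumed install path)
max_concurrent_checks() {
	cfg=${1:-/usr/local/nagios/etc/nagios.cfg}
	# Match uncommented assignments only; tolerate spaces around "="
	value=$(sed -n 's/^[[:space:]]*max_concurrent_checks[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$cfg" 2>/dev/null | tail -n 1)
	# Missing file or missing option: fall back to 0, i.e. no limit
	echo "${value:-0}"
}

limit=$(max_concurrent_checks "$@")
if [ "$limit" -eq 0 ]; then
	echo "max_concurrent_checks=0: no limit on parallel service checks"
else
	echo "max_concurrent_checks=$limit: at most $limit service checks in parallel"
fi
```

If this reports 0 on a server that has fallen over under load, setting an explicit ceiling is probably the first thing to try.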
Former Nagios employee

Post by BanditBBS »

Scott,

I can't even imagine wanting it that way, but I definitely understand that others may. That's why my feature request was to add a new variable that, if set on a service, would disable its checks when the host is in whatever state is specified. That shouldn't affect anyone's current setup at all; those of us who want it this way would just have to add the new variable.

I'm going to take a guess that if max_concurrent_checks=0 then that means unlimited?

Thanks!

Post by scottwilkerson »

BanditBBS wrote:I'm going to take a guess that if max_concurrent_checks=0 then that means unlimited?
Correct

Post by BanditBBS »

Cool, thanks to everyone for replying. I'm done with this thread if you want to close it.
Locked