Re: check_icmp false alerts
Posted: Tue Mar 17, 2015 6:27 pm
by rajasegar
jdalrymple wrote:Sounds like it's a very infrequent problem that occurs? That is going to make it even tougher to sort out.
Any chance you could find one of the false alerts in your nagios.log and share the contents exactly? Like I mentioned check_icmp is pretty solid so I doubt the actual plugin is where the problem lies. I'm wondering if there is some useful output coming back.
It is a nightmare to troubleshoot.
Anyway, the support team cited cases on the net where check_icmp was giving false alarms.
I just need the parameters that will minimize the alerts when there is intermittent packet loss.
Ideally, check_icmp would confirm that the loss persists for about 10 seconds before alerting.
Right now the host is configured to be checked every 5 minutes, with a retry after 5 minutes before alerting.
However, we get a "host is down" alert and then, the very next minute, a "host is up" alert. I am not sure what happened to the recheck after 5 minutes.
Please advise.
Code:
define host {
    host_name                       My Server
    alias                           Staging Server
    address                         10.10.10.10
    check_period                    24x7
    check_command                   check-host-fping!!!!!!!!
    contact_groups                  CGRP_INFRA_SM1_WINTEL,CGRP_TMC
    notification_period             24x7
    initial_state                   o
    importance                      0
    check_interval                  5.000000
    retry_interval                  5.000000
    max_check_attempts              2
    active_checks_enabled           1
    passive_checks_enabled          1
    obsess                          1
    event_handler_enabled           1
    low_flap_threshold              0.000000
    high_flap_threshold             0.000000
    flap_detection_enabled          1
    flap_detection_options          a
    freshness_threshold             0
    check_freshness                 0
    notification_options            r,d
    notifications_enabled           1
    notification_interval           44640.000000
    first_notification_delay        0.000000
    stalking_options                n
    process_perf_data               1
    retain_status_information       1
    retain_nonstatus_information    1
..
...
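To make the timing in the host definition above concrete, here is a minimal Python sketch of how retry_interval and max_check_attempts nominally combine. It assumes interval_length is left at the Nagios default of 60 seconds and that only regularly scheduled checks run; the function name is made up for illustration. Nagios can also run on-demand host checks (for example when a service on the host changes state), which is one possible reason a recheck might arrive sooner than the configured 5 minutes.

```python
# Nominal delay between the first failed host check and the final retry
# that puts the host into a HARD DOWN state.
# Assumptions: interval_length = 60 s (Nagios default), scheduled checks only.

def seconds_until_hard_state(max_check_attempts: int,
                             retry_interval: float,
                             interval_length: int = 60) -> float:
    """Seconds from the first failing check to the last retry."""
    retries = max(max_check_attempts - 1, 0)
    return retries * retry_interval * interval_length


if __name__ == "__main__":
    # Values taken from the host definition posted above.
    delay = seconds_until_hard_state(max_check_attempts=2, retry_interval=5.0)
    print(f"Nominal delay before a HARD DOWN: {delay:.0f} s")
```

With max_check_attempts 2 and retry_interval 5, the nominal delay is 300 seconds, so a DOWN alert followed by an UP alert one minute later suggests that something other than the regular retry schedule triggered a check.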
Re: check_icmp false alerts
Posted: Wed Mar 18, 2015 10:07 am
by jdalrymple
If you have a particularly troublesome group of hosts you could specify a custom check for them:
Increase your thresholds for packet loss
Increase your number of packets
Increase your max packet interval
Code:
Usage:
check_icmp [options] [-H] host1 host2 hostN
<snip>
-w
warning threshold (currently 200.000ms,40%)
-c
critical threshold (currently 500.000ms,80%)
<snip>
-n
number of packets to send (currently 5)
<snip>
-t
timeout value (seconds, currently 10)
<snip>
Threshold format for -w and -c is 200.25,60% for 200.25 msec RTA and 60%
packet loss. The default values should work well for most users.
You can specify different RTA factors using the standardized abbreviations
us (microseconds), ms (milliseconds, default) or just plain s for seconds.
<snip>
Something like this:
Code:
check_icmp -H $HOSTADDRESS$ -w 1000,80% -c 1500,90% -n 20 -t 30
Would check your host as follows:
Warn if the round-trip average is 1 second or more, or if packet loss reaches 80% (16 or more of the 20 packets lost before the 30-second timeout)
Critical if the round-trip average is 1.5 seconds or more, or if packet loss reaches 90% (18 or more of the 20 packets lost before the 30-second timeout)
Send 20 ICMP packets (this is the max the plugin supports)
Is that maybe something that would help the problem?
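As a rough cross-check of the threshold arithmetic, the following Python sketch converts a packet-loss percentage into the number of lost packets needed to trip it. It assumes check_icmp triggers when measured loss is greater than or equal to the threshold; the function name is hypothetical and not part of the plugin.

```python
import math


def packets_lost_to_trigger(packets_sent: int, loss_threshold_pct: float) -> int:
    """Smallest number of lost packets whose loss percentage reaches the threshold.

    Assumes the plugin compares loss with >= against the threshold.
    """
    return math.ceil(packets_sent * loss_threshold_pct / 100)


if __name__ == "__main__":
    # -n 20 with the 80% / 90% thresholds from the example command,
    # then the plugin defaults quoted in the usage text (-n 5, 40% / 80%).
    for sent, pct in [(20, 80), (20, 90), (5, 40), (5, 80)]:
        lost = packets_lost_to_trigger(sent, pct)
        print(f"-n {sent}, {pct}% loss threshold: {lost} lost packets")
```

For -n 20 this gives 16 lost packets at 80% and 18 at 90%. With the default of 5 packets, 2 lost packets already reach the 40% warning threshold, which is why a larger packet count makes the check less sensitive to brief loss.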
Re: check_icmp false alerts
Posted: Wed Mar 18, 2015 10:55 am
by tmcdonald
This issue is similar to one that has been affecting several users recently. We are not yet sure of the cause, so we are gathering information to try and determine a common attribute across all affected systems.
Please send us an updated system profile if possible. Go to Admin -> System Profile and click the blue "Download Profile" button. Then attach that profile.zip file to a PM to me.
Please also run the following commands:
Code:
ipcs -q >> /tmp/nagios.txt
/usr/local/nagios/bin/ndo2db --version >> /tmp/nagios.txt
/usr/local/nagios/bin/nagios --version >> /tmp/nagios.txt
ps -ef | grep ndo >> /tmp/nagios.txt
ps -ef | grep nagios.cfg >> /tmp/nagios.txt
yum list installed >> /tmp/nagios.txt
And also send me the /tmp/nagios.txt file.
Re: check_icmp false alerts
Posted: Wed Mar 18, 2015 6:11 pm
by rajasegar
jdalrymple wrote:If you have a particularly troublesome group of hosts you could specify a custom check.
<snip>
Already tried this, with the same result.
Re: check_icmp false alerts
Posted: Wed Mar 18, 2015 6:15 pm
by rajasegar
tmcdonald wrote:This issue is similar to one that has been affecting several users recently. We are not yet sure of the cause, so we are gathering information to try and determine a common attribute across all affected systems.
<snip>
Sent you a PM. It seems to be stuck in the Outbox though.
Re: check_icmp false alerts
Posted: Thu Mar 19, 2015 9:19 am
by tgriep
On Tue Mar 17, 2015 5:27 pm you posted one of your host checks and it looks like you are using fping and not icmp checks.
If that is true, could you post how your check-host-fping command is defined?
Re: check_icmp false alerts
Posted: Thu Mar 19, 2015 9:39 am
by tmcdonald
And for the record, the Outbox just means the person has not read the message yet. I received it.
Re: check_icmp false alerts
Posted: Thu Mar 19, 2015 5:58 pm
by rajasegar
tgriep wrote:On Tue Mar 17, 2015 5:27 pm you posted one of your host checks and it looks like you are using fping and not icmp checks.
If that is true, could you post how your check-host-fping command is defined?
We are trying various types of checks to improve check time and accuracy.
Code:
define command {
    command_name    check-host-alive
    command_line    $USER1$/check_icmp -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5
}
define command {
    command_name    check-host-alive-http
    command_line    $USER1$/check_http -H $HOSTADDRESS$
}
define command {
    command_name    check-host-fping
    command_line    $USER1$/check_icmp -H $HOSTADDRESS$ -w 3000.0 -c 5000.0 -n 3
}
define command {
    command_name    check-host-fping-custom
    command_line    $USER1$/check_icmp -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -n $ARG3$
}
define command {
    command_name    check-host-fping-temp
    command_line    $USER1$/check_fping -H $HOSTADDRESS$ -T 5000 -w 3000,50% -c 5000,100%
}
Re: check_icmp false alerts
Posted: Fri Mar 20, 2015 10:07 am
by tgriep
Have you captured any false alerts that we can look at?
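One possible way to capture them is to scan nagios.log for a host that goes DOWN and comes back UP within a couple of minutes. The Python sketch below assumes the usual "[epoch] HOST ALERT: host;state;state_type;attempt;output" line layout; the function name, the 120-second window, and the sample lines are made up for illustration and are not taken from the affected system.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Assumed line layout: [epoch] HOST ALERT: host;state;state_type;attempt;output
HOST_ALERT = re.compile(
    r"^\[(?P<ts>\d+)\] HOST ALERT: (?P<host>[^;]+);(?P<state>[^;]+);"
    r"(?P<type>[^;]+);(?P<attempt>\d+);(?P<output>.*)$"
)


def find_short_outages(lines: Iterable[str],
                       max_seconds: int = 120) -> List[Tuple[str, int, int, str]]:
    """Return (host, down_ts, up_ts, down_output) for DOWN->UP pairs within max_seconds."""
    first_down: Dict[str, Tuple[int, str]] = {}
    hits: List[Tuple[str, int, int, str]] = []
    for line in lines:
        m = HOST_ALERT.match(line.strip())
        if not m:
            continue  # not a host alert line
        host, state, ts = m.group("host"), m.group("state"), int(m.group("ts"))
        if state == "DOWN":
            # Remember only the first DOWN of an outage, with its plugin output.
            first_down.setdefault(host, (ts, m.group("output")))
        elif state == "UP" and host in first_down:
            down_ts, output = first_down.pop(host)
            if ts - down_ts <= max_seconds:
                hits.append((host, down_ts, ts, output))
    return hits


if __name__ == "__main__":
    # Made-up sample lines; in practice pass an open nagios.log file instead.
    sample = [
        "[1426612020] HOST ALERT: My Server;DOWN;HARD;2;sample plugin output",
        "[1426612080] HOST ALERT: My Server;UP;HARD;1;sample plugin output",
    ]
    for host, down_ts, up_ts, output in find_short_outages(sample):
        print(f"{host}: down for {up_ts - down_ts} s, output at DOWN: {output}")
```

The plugin output recorded with the DOWN entry is the part worth sharing, since it shows what check_icmp actually reported at the time of the alert.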
Re: check_icmp false alerts
Posted: Mon Mar 23, 2015 12:33 am
by rajasegar
tgriep wrote:Have you captured any false alerts that we can look at?
Have not looked into this yet. Busy trying to solve the other BAU issues.
Will update this week.