Nagios Support Forum

Posted: **Tue Mar 17, 2015 6:27 pm**

jdalrymple wrote:Sounds like it's a very infrequent problem that occurs? That is going to make it even tougher to sort out.

Any chance you could find one of the false alerts in your nagios.log and share the contents exactly? Like I mentioned check_icmp is pretty solid so I doubt the actual plugin is where the problem lies. I'm wondering if there is some useful output coming back.

It is a nightmare to troubleshoot.
Anyway, the support team cited cases on the net where check_icmp was giving false alarms.

I just need the parameters that will minimize the alerts when there is intermittent packet loss.
check_icmp should confirm that the loss persists for about 10 seconds before alerting.

Right now it is configured to check every 5 minutes and to retry after 5 minutes before alerting.
However, we get a "host is down" alert and then, the very next minute, a "host is up" alert. I am not sure what happened to the recheck after 5 minutes.
Please advise.

Code:

define host {
        host_name       My Server
        alias   Staging Server
        address 10.10.10.10
        check_period    24x7
        check_command   check-host-fping!!!!!!!!
        contact_groups  CGRP_INFRA_SM1_WINTEL,CGRP_TMC
        notification_period     24x7
        initial_state   o
        importance      0
        check_interval  5.000000
        retry_interval  5.000000
        max_check_attempts      2
        active_checks_enabled   1
        passive_checks_enabled  1
        obsess  1
        event_handler_enabled   1
        low_flap_threshold      0.000000
        high_flap_threshold     0.000000
        flap_detection_enabled  1
        flap_detection_options  a
        freshness_threshold     0
        check_freshness 0
        notification_options    r,d
        notifications_enabled   1
        notification_interval   44640.000000
        first_notification_delay        0.000000
        stalking_options        n
        process_perf_data       1
        retain_status_information       1
        retain_nonstatus_information    1
..
...
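
For reference, the way `max_check_attempts 2` gates notifications can be sketched as a small state machine. This is a simplified model of the soft/hard state logic and not Nagios's actual implementation: the function name `simulate_host_states` is hypothetical, and the sketch deliberately ignores on-demand host checks, flap detection and notification filters.

```python
def simulate_host_states(results, max_check_attempts=2):
    """Return (index, state, state_type, notification) for each check result.

    Simplified model (an assumption, not Nagios source code): a failing host
    stays SOFT DOWN until max_check_attempts consecutive checks have failed,
    then becomes HARD DOWN and a PROBLEM notification is sent once.
    """
    timeline = []
    attempt = 0        # consecutive failed checks so far
    hard_down = False  # True once a HARD DOWN state has been reached
    for i, ok in enumerate(results):
        if ok:
            if hard_down:
                timeline.append((i, "UP", "HARD", "RECOVERY"))
            elif attempt > 0:
                # Recovered before reaching a hard state: nobody is notified.
                timeline.append((i, "UP", "SOFT", None))
            else:
                timeline.append((i, "UP", "HARD", None))
            attempt, hard_down = 0, False
        else:
            attempt = min(attempt + 1, max_check_attempts)
            if attempt >= max_check_attempts:
                notification = None if hard_down else "PROBLEM"
                hard_down = True
                timeline.append((i, "DOWN", "HARD", notification))
            else:
                timeline.append((i, "DOWN", "SOFT", None))
    return timeline


# One failed check followed by a success: no notification in this model.
single_blip = simulate_host_states([True, False, True])
# Two failed checks in a row, then a success: PROBLEM, then RECOVERY.
two_failures = simulate_host_states([True, False, False, True])

for row in two_failures:
    print(row)
```

Under this model a DOWN notification implies that two consecutive checks failed, so the log entries around a false alert should show both attempts and their timestamps.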

Posted: **Wed Mar 18, 2015 10:07 am**

If you have a particularly troublesome group of hosts you could specify a custom check.

Increase your thresholds for packet loss
Increase your number of packets
Increase your max packet interval

Code:

Usage:
 check_icmp [options] [-H] host1 host2 hostN
<snip>
 -w
    warning threshold (currently 200.000ms,40%)
 -c
    critical threshold (currently 500.000ms,80%)
<snip>
 -n
    number of packets to send (currently 5)
<snip>
 -t
    timeout value (seconds, currently  10)
<snip>
 Threshold format for -w and -c is 200.25,60% for 200.25 msec RTA and 60%
 packet loss.  The default values should work well for most users.
 You can specify different RTA factors using the standardized abbreviations
 us (microseconds), ms (milliseconds, default) or just plain s for seconds.
<snip>

Something like this:

Code:

check_icmp -H $HOSTADDRESS$ -w 1000,80% -c 1500,90% -n 20 -t 30

Would check your host as follows:
Warn if the average round-trip time reached 1 second or if packet loss reached 80% (16 or more of the 20 packets lost before the 30-second timeout)
Critical if the average round-trip time reached 1.5 seconds or if packet loss reached 90% (18 or more of the 20 packets lost before the 30-second timeout)
Send 20 ICMP packets (this is the max the plugin supports)
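
To make the percentages concrete, here is a short sketch of the arithmetic behind those thresholds. The helper names `packets_lost_at` and `classify` are made up for illustration, and the sketch assumes a threshold trips when the measured value is equal to or above it, which is worth verifying against your plugin version.

```python
import math


def packets_lost_at(threshold_percent, packets_sent):
    """Smallest number of lost packets whose loss percentage reaches the threshold."""
    return math.ceil(threshold_percent * packets_sent / 100)


def classify(rta_ms, lost, sent, warn=(1000.0, 80), crit=(1500.0, 90)):
    """Classify one result; thresholds are (RTA in ms, packet loss in %).

    Assumption: a value equal to the threshold already trips it.
    """
    loss_percent = 100.0 * lost / sent
    if rta_ms >= crit[0] or loss_percent >= crit[1]:
        return "CRITICAL"
    if rta_ms >= warn[0] or loss_percent >= warn[1]:
        return "WARNING"
    return "OK"


print(packets_lost_at(80, 20))  # lost packets needed for the warning threshold
print(packets_lost_at(90, 20))  # lost packets needed for the critical threshold
print(classify(rta_ms=35.0, lost=15, sent=20))
```

With 20 packets, a host that drops 15 of them (75% loss) but answers quickly would still be reported as OK under these settings.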

Is that maybe something that would help the problem?

Posted: **Wed Mar 18, 2015 10:55 am**

This issue is similar to one that has been affecting several users recently. We are not yet sure of the cause, so we are gathering information to try to determine a common attribute across all affected systems.

Please send us an updated system profile if possible. Go to Admin -> System Profile and click the blue "Download Profile" button. Then attach that profile.zip file to a PM to me.

Please also run the following commands:

Code:

ipcs -q >> /tmp/nagios.txt
/usr/local/nagios/bin/ndo2db --version >> /tmp/nagios.txt
/usr/local/nagios/bin/nagios --version >> /tmp/nagios.txt
ps -ef | grep ndo >> /tmp/nagios.txt
ps -ef | grep nagios.cfg >> /tmp/nagios.txt
yum list installed >> /tmp/nagios.txt

And also send me the /tmp/nagios.txt file.

Posted: **Wed Mar 18, 2015 6:11 pm**

jdalrymple wrote:If you have a particularly troublesome group of hosts you could specify a custom check.


I already tried this, with the same result.

Posted: **Wed Mar 18, 2015 6:15 pm**

tmcdonald wrote:This issue is similar to one that has been affecting several users recently. We are not yet sure of the cause, so we are gathering information to try and determine a common attribute across all affected systems.


Sent you a PM. It seems to be stuck in the Outbox, though.

Posted: **Thu Mar 19, 2015 9:19 am**

On Tue Mar 17, 2015 5:27 pm you posted one of your host checks and it looks like you are using fping and not icmp checks.
If that is true, could you post how your check-host-fping command is defined?

Posted: **Thu Mar 19, 2015 9:39 am**

And for the record, the Outbox just means the person has not read the message yet. I received it.

Posted: **Thu Mar 19, 2015 5:58 pm**

tgriep wrote:On Tue Mar 17, 2015 5:27 pm you posted one of your host checks and it looks like you are using fping and not icmp checks.
If that is true, could you post how your check-host-fping command is defined?

We are trying various types of checks to improve the check time and accuracy.

Code:

define command {
       command_name                  		check-host-alive
       command_line                  		$USER1$/check_icmp -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5
}

define command {
       command_name                  		check-host-alive-http
       command_line                  		$USER1$/check_http -H $HOSTADDRESS$
}

define command {
       command_name                  		check-host-fping
       command_line                  		$USER1$/check_icmp -H $HOSTADDRESS$ -w 3000.0 -c 5000.0 -n 3
}

define command {
       command_name                  		check-host-fping-custom
       command_line                  		$USER1$/check_icmp -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -n $ARG3$
}

define command {
       command_name                  		check-host-fping-temp
       command_line                  		$USER1$/check_fping -H $HOSTADDRESS$ -T 5000 -w 3000,50% -c 5000,100%
}
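
As an aside, the relationship between a host's `check_command` value and these definitions can be sketched as follows. This is a simplified illustration of `!`-separated arguments and `$ARGn$` substitution, not Nagios's real macro processor; the function `expand_check_command` is hypothetical, and the `$USER1$` path is an assumed typical value that does not appear in this thread.

```python
import re

# Command lines copied from the definitions above.
COMMAND_LINES = {
    "check-host-fping":
        "$USER1$/check_icmp -H $HOSTADDRESS$ -w 3000.0 -c 5000.0 -n 3",
    "check-host-fping-custom":
        "$USER1$/check_icmp -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -n $ARG3$",
}

# Assumed values for illustration only.
MACROS = {"USER1": "/usr/local/nagios/libexec", "HOSTADDRESS": "10.10.10.10"}


def expand_check_command(check_command, command_lines=COMMAND_LINES, macros=MACROS):
    """Split 'name!arg1!arg2' and substitute $ARGn$ and other $MACRO$ tokens."""
    name, *args = check_command.split("!")
    values = dict(macros)
    for n, arg in enumerate(args, start=1):
        values[f"ARG{n}"] = arg

    def substitute(match):
        # Unknown or unset macros expand to an empty string in this sketch.
        return values.get(match.group(1), "")

    return re.sub(r"\$(\w+)\$", substitute, command_lines[name])


# The empty arguments in "check-host-fping!!!!!!!!" are never referenced.
print(expand_check_command("check-host-fping!!!!!!!!"))
# The custom command takes its thresholds and packet count from the host.
print(expand_check_command("check-host-fping-custom!1000,80%!1500,90%!20"))
```

In this sketch the trailing `!` characters on `check-host-fping` have no effect, because that command line contains no `$ARGn$` macros; per-host thresholds would require the `check-host-fping-custom` form.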

Posted: **Fri Mar 20, 2015 10:07 am**

Have you captured any false alerts that we can look at?
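
If it helps, short DOWN/UP pairs can be pulled out of nagios.log with a small script along these lines. It is only a sketch: it assumes the usual `[epoch] HOST ALERT: host;state;type;attempt;output` line layout, the function names are hypothetical, and the sample lines are invented for illustration rather than taken from an affected system.

```python
import re

HOST_ALERT = re.compile(
    r"^\[(\d+)\] HOST ALERT: ([^;]+);([^;]+);([^;]+);(\d+);(.*)$"
)


def parse_host_alerts(lines):
    """Extract HOST ALERT entries as dictionaries, skipping everything else."""
    alerts = []
    for line in lines:
        match = HOST_ALERT.match(line.strip())
        if match:
            ts, host, state, state_type, attempt, output = match.groups()
            alerts.append({
                "time": int(ts), "host": host, "state": state,
                "state_type": state_type, "attempt": int(attempt),
                "output": output,
            })
    return alerts


def short_outages(alerts, max_seconds=120):
    """Pair each HARD DOWN with the next UP for that host; keep the brief ones."""
    down_since = {}
    found = []
    for alert in alerts:
        if alert["state"] == "DOWN" and alert["state_type"] == "HARD":
            down_since.setdefault(alert["host"], alert)
        elif alert["state"] == "UP" and alert["host"] in down_since:
            start = down_since.pop(alert["host"])
            duration = alert["time"] - start["time"]
            if duration <= max_seconds:
                found.append((alert["host"], duration, start["output"]))
    return found


# Invented sample lines, for illustration only.
SAMPLE = [
    "[1426616820] HOST ALERT: My Server;DOWN;SOFT;1;CRITICAL - 10.10.10.10: rta nan, lost 100%",
    "[1426616880] HOST ALERT: My Server;DOWN;HARD;2;CRITICAL - 10.10.10.10: rta nan, lost 100%",
    "[1426616900] SERVICE ALERT: My Server;PING;CRITICAL;HARD;1;timeout",
    "[1426616940] HOST ALERT: My Server;UP;HARD;1;OK - 10.10.10.10: rta 0.512ms, lost 0%",
]

for host, seconds, output in short_outages(parse_host_alerts(SAMPLE)):
    print(f"{host}: down for {seconds}s ({output})")
```

To run it against a real log, replace `SAMPLE` with the lines read from your nagios.log; the plugin output attached to each DOWN entry is the part worth posting here.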

Posted: **Mon Mar 23, 2015 12:33 am**

tgriep wrote:Have you captured any false alerts that we can look at?

I have not looked into this yet; I am busy trying to solve other BAU issues.
I will update this week.

check_icmp false alerts
