check_icmp false alerts

rajasegar
Posts: 1018
Joined: Sun Mar 30, 2014 10:49 pm

Re: check_icmp false alerts

Post by rajasegar »

jdalrymple wrote:Sounds like it's a very infrequent problem that occurs? That is going to make it even tougher to sort out.

Any chance you could find one of the false alerts in your nagios.log and share the contents exactly? Like I mentioned, check_icmp is pretty solid, so I doubt the actual plugin is where the problem lies. I'm wondering if there is some useful output coming back.
It is a nightmare to troubleshoot.
Anyway, the support team cited cases on the net where check_icmp was giving false alarms.

I just need the parameters that will minimize alerts when there is intermittent packet loss.
check_icmp should only alert if the loss persists for about 10 seconds.

Right now the host is configured to be checked every 5 minutes and retried after 5 minutes before alerting.
However, we get a "host is down" alert and then, the very next minute, a "host is up" alert. I am not sure what happened to the re-check after 5 minutes.
Please advise.

Code: Select all

define host {
        host_name       My Server
        alias   Staging Server
        address 10.10.10.10
        check_period    24x7
        check_command   check-host-fping!!!!!!!!
        contact_groups  CGRP_INFRA_SM1_WINTEL,CGRP_TMC
        notification_period     24x7
        initial_state   o
        importance      0
        check_interval  5.000000
        retry_interval  5.000000
        max_check_attempts      2
        active_checks_enabled   1
        passive_checks_enabled  1
        obsess  1
        event_handler_enabled   1
        low_flap_threshold      0.000000
        high_flap_threshold     0.000000
        flap_detection_enabled  1
        flap_detection_options  a
        freshness_threshold     0
        check_freshness 0
        notification_options    r,d
        notifications_enabled   1
        notification_interval   44640.000000
        first_notification_delay        0.000000
        stalking_options        n
        process_perf_data       1
        retain_status_information       1
        retain_nonstatus_information    1
..
...
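For reference, the retry settings above can be turned into a rough time-to-alert figure. This is only a sketch: the function name is made up for illustration, it assumes the default interval_length of 60 seconds, and it assumes retries are scheduled at retry_interval (on-demand host checks triggered by a service state change may run sooner).

```python
def minutes_until_hard_down(max_check_attempts: int,
                            retry_interval: float,
                            interval_length_s: int = 60) -> float:
    """Rough time between the first failed check and the hard DOWN state.

    Hypothetical helper. Assumes every retry is scheduled retry_interval
    time units after the previous attempt, with interval_length_s seconds
    per time unit (60 is the Nagios default).
    """
    # The first failure is attempt 1; only the remaining attempts add delay.
    retries = max(max_check_attempts - 1, 0)
    return retries * retry_interval * interval_length_s / 60


if __name__ == "__main__":
    # Values from the host definition above: max_check_attempts 2, retry_interval 5
    print(minutes_until_hard_down(2, 5.0))
```

With these settings a DOWN notification would be expected about 5 minutes after the first failed check, so a recovery alert only one minute after the DOWN alert suggests that something other than the regular schedule ran a check.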
5 x Nagios 5.6.9 Enterprise Edition
RHEL 6 & 7
rrdcached & ramdisk optimisation
jdalrymple
Skynet Drone
Posts: 2620
Joined: Wed Feb 11, 2015 1:56 pm

Re: check_icmp false alerts

Post by jdalrymple »

If you have a particularly troublesome group of hosts you could specify a custom check.

Increase your thresholds for packet loss
Increase your number of packets
Increase your max packet interval

Code: Select all

Usage:
 check_icmp [options] [-H] host1 host2 hostN
<snip>
 -w
    warning threshold (currently 200.000ms,40%)
 -c
    critical threshold (currently 500.000ms,80%)
<snip>
 -n
    number of packets to send (currently 5)
<snip>
 -t
    timeout value (seconds, currently  10)
<snip>
 Threshold format for -w and -c is 200.25,60% for 200.25 msec RTA and 60%
 packet loss.  The default values should work well for most users.
 You can specify different RTA factors using the standardized abbreviations
 us (microseconds), ms (milliseconds, default) or just plain s for seconds.
<snip>
Something like this:

Code: Select all

check_icmp -H $HOSTADDRESS$ -w 1000,80% -c 1500,90% -n 20 -t 30
Would check your host as follows:
Warn if the round-trip average was greater than 1 second or if packet loss reached 80% (16 or more of the 20 packets lost)
Critical if the round-trip average was greater than 1.5 seconds or if packet loss reached 90% (18 or more of the 20 packets lost)
Send 20 ICMP packets (this is the max the plugin supports), with an overall timeout of 30 seconds

Is that maybe something that would help the problem?
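To make the packet-loss arithmetic concrete, here is a minimal sketch. The helper name is hypothetical, and it assumes the plugin flags a state once the loss percentage reaches the threshold (a >= comparison), which is not stated in the usage text.

```python
def lost_packets_to_trip(packets: int, loss_threshold_pct: int) -> int:
    """Smallest number of lost packets whose loss percentage reaches the threshold.

    Hypothetical helper. Assumes loss is compared with >= against the
    threshold; integer arithmetic avoids floating-point rounding surprises.
    """
    # Ceiling of packets * pct / 100 using integers only.
    return (packets * loss_threshold_pct + 99) // 100


if __name__ == "__main__":
    for packets, pct in [(20, 80), (20, 90), (5, 40), (5, 80)]:
        print(f"-n {packets}, {pct}% loss threshold: "
              f"{lost_packets_to_trip(packets, pct)} lost packets trip it")
```

With the defaults from the usage text (5 packets, 40% warning, 80% critical) a warning needs only 2 lost packets, which is why a larger -n makes the check less sensitive to a brief burst of loss.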
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: check_icmp false alerts

Post by tmcdonald »

This issue is similar to one that has been affecting several users recently. We are not yet sure of the cause, so we are gathering information to try and determine a common attribute across all affected systems.

Please send us an updated system profile if possible. Go to Admin -> System Profile and click the blue "Download Profile" button. Then attach that profile.zip file to a PM to me.

Please also run the following commands:

Code: Select all

ipcs -q >> /tmp/nagios.txt
/usr/local/nagios/bin/ndo2db --version >> /tmp/nagios.txt
/usr/local/nagios/bin/nagios --version >> /tmp/nagios.txt
ps -ef | grep ndo >> /tmp/nagios.txt
ps -ef | grep nagios.cfg >> /tmp/nagios.txt
yum list installed >> /tmp/nagios.txt
And also send me the /tmp/nagios.txt file.
Former Nagios employee
rajasegar
Posts: 1018
Joined: Sun Mar 30, 2014 10:49 pm

Re: check_icmp false alerts

Post by rajasegar »

jdalrymple wrote:If you have a particularly troublesome group of hosts you could specify a custom check.
Already tried this; same result.
rajasegar
Posts: 1018
Joined: Sun Mar 30, 2014 10:49 pm

Re: check_icmp false alerts

Post by rajasegar »

tmcdonald wrote:This issue is similar to one that has been affecting several users recently. We are not yet sure of the cause, so we are gathering information to try and determine a common attribute across all affected systems.
Sent you a PM. It seems to be stuck in the Outbox, though.
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: check_icmp false alerts

Post by tgriep »

On Tue Mar 17, 2015 5:27 pm you posted one of your host checks and it looks like you are using fping and not icmp checks.
If that is true, could you post how your check-host-fping command is defined?
Be sure to check out our Knowledgebase for helpful articles and solutions!
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: check_icmp false alerts

Post by tmcdonald »

And for the record, the Outbox just means the person has not read the message yet. I received it.
rajasegar
Posts: 1018
Joined: Sun Mar 30, 2014 10:49 pm

Re: check_icmp false alerts

Post by rajasegar »

tgriep wrote:On Tue Mar 17, 2015 5:27 pm you posted one of your host checks and it looks like you are using fping and not icmp checks.
If that is true, could you post how your check-host-fping command is defined?
We are trying various types of checks to improve the check time and accuracy.

Code: Select all

define command {
       command_name                  		check-host-alive
       command_line                  		$USER1$/check_icmp -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5
}

define command {
       command_name                  		check-host-alive-http
       command_line                  		$USER1$/check_http -H $HOSTADDRESS$
}

define command {
       command_name                  		check-host-fping
       command_line                  		$USER1$/check_icmp -H $HOSTADDRESS$ -w 3000.0 -c 5000.0 -n 3
}

define command {
       command_name                  		check-host-fping-custom
       command_line                  		$USER1$/check_icmp -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -n $ARG3$
}

define command {
       command_name                  		check-host-fping-temp
       command_line                  		$USER1$/check_fping -H $HOSTADDRESS$ -T 5000 -w 3000,50% -c 5000,100%
}
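As an illustration of how the check-host-fping-custom definition above takes its arguments, the sketch below mimics the way a check_command string is split on "!" and substituted into $ARGn$. The function is a hypothetical stand-in, not Nagios code, and the argument values are the ones jdalrymple suggested earlier in the thread.

```python
def expand_check_command(check_command: str, command_line: str,
                         host_address: str) -> str:
    """Toy expansion of $HOSTADDRESS$ and $ARGn$ macros.

    Hypothetical helper for illustration only. Other macros such as
    $USER1$ are left untouched.
    """
    # "name!arg1!arg2!..." -> command name followed by its arguments
    _name, *args = check_command.split("!")
    expanded = command_line.replace("$HOSTADDRESS$", host_address)
    for index, arg in enumerate(args, start=1):
        expanded = expanded.replace(f"$ARG{index}$", arg)
    return expanded


if __name__ == "__main__":
    line = "$USER1$/check_icmp -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -n $ARG3$"
    print(expand_check_command(
        "check-host-fping-custom!1000,80%!1500,90%!20", line, "10.10.10.10"))
```

A host using check-host-fping-custom with those three arguments would therefore run check_icmp with the looser thresholds and 20 packets, without changing the shared check-host-fping definition.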
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: check_icmp false alerts

Post by tgriep »

Have you captured any false alerts that we can look at?
rajasegar
Posts: 1018
Joined: Sun Mar 30, 2014 10:49 pm

Re: check_icmp false alerts

Post by rajasegar »

tgriep wrote:Have you captured any false alerts that we can look at?
Have not looked into this yet. Busy trying to solve the other BAU issues.
Will update this week.