check_icmp false alerts

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
rajasegar
Posts: 1018
Joined: Sun Mar 30, 2014 10:49 pm

check_icmp false alerts

Post by rajasegar

Nagios XI 2014R1.2
check_gearman: version 1.4_nagios4 running on libgearman 0.25

Recently we have been having a lot of issues with check_icmp.
It keeps giving intermittent false alarms.

Please advise if there is any way to reduce this.

 $USER1$/check_icmp -H $HOSTADDRESS$ -w 3000.0 -c 5000.0 -n 3
Last edited by rajasegar on Thu Mar 12, 2015 5:59 pm, edited 1 time in total.
5 x Nagios 5.6.9 Enterprise Edition
RHEL 6 & 7
rrdcached & ramdisk optimisation

Re: check_icmp false alerts

Post by rajasegar

Not sure if this is related, but I am seeing these messages in the system log.

Please advise on the tuning if required.

Mar 12 11:00:01 nagiosprodxi1 ndo2db: Warning: Retrying message send. This can occur because you have too few messages allowed or too few total bytes allowed in message queues. You are currently using 128000 of 23815 messages and 131072000 of 131072000 bytes in the queue. See README for kernel tuning options.
Mar 12 11:00:08 nagiosprodxi1 ndo2db: Message sent to queue.
Mar 12 11:00:08 nagiosprodxi1 ndo2db: Warning: queue send error, retrying...
Mar 12 11:00:09 nagiosprodxi1 ndo2db: Message sent to queue.
Mar 12 11:00:09 nagiosprodxi1 ndo2db: Warning: queue send error, retrying...
Mar 12 11:00:10 nagiosprodxi1 ndo2db: Message sent to queue.
Mar 12 11:00:10 nagiosprodxi1 ndo2db: Warning: queue send error, retrying...
Mar 12 11:00:11 nagiosprodxi1 ndo2db: Message sent to queue.
------
Mar 12 11:04:07 nagiosprodxi1 automount[2273]: key "<" not found in map source(s).
Mar 12 11:04:12 nagiosprodxi1 automount[2273]: create_client: hostname lookup failed: Name or service not known
Mar 12 11:04:12 nagiosprodxi1 automount[2273]: create_client: hostname lookup failed: Name or service not known
Mar 12 11:04:12 nagiosprodxi1 automount[2273]: get_exports: lookup(hosts): exports lookup failed for <
Mar 12 11:04:12 nagiosprodxi1 automount[2273]: key "<" not found in map source(s).
Mar 12 11:09:08 nagiosprodxi1 automount[2273]: key "<" not found in map source(s).
Mar 12 11:09:08 nagiosprodxi1 automount[2273]: create_client: hostname lookup failed: Name or service not known
Mar 12 11:09:08 nagiosprodxi1 automount[2273]: create_client: hostname lookup failed: Name or service not known
Mar 12 11:09:08 nagiosprodxi1 automount[2273]: get_exports: lookup(hosts): exports lookup failed for <
Mar 12 11:09:08 nagiosprodxi1 automount[2273]: key "<" not found in map source(s).
Mar 12 11:14:07 nagiosprodxi1 automount[2273]: key "<" not found in map source(s).

/etc/sysctl.conf:

[nagios@nagiosprodxi1 local]$ cat /etc/sysctl.conf
# Kernel sysctl configuration file for Red Hat Linux
#
# For binary values, 0 is disabled, 1 is enabled.  See sysctl(8) and
# sysctl.conf(5) for more details.

# Controls IP packet forwarding
net.ipv4.ip_forward = 0

# Controls source route verification
net.ipv4.conf.default.rp_filter = 1

# Do not accept source routing
net.ipv4.conf.default.accept_source_route = 0

# Controls the System Request debugging functionality of the kernel
kernel.sysrq = 0

# Controls whether core dumps will append the PID to the core filename.
# Useful for debugging multi-threaded applications.
kernel.core_uses_pid = 1

# Controls the use of TCP syncookies
net.ipv4.tcp_syncookies = 1

# Disable netfilter on bridges.
net.bridge.bridge-nf-call-ip6tables = 0
net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-arptables = 0

# Controls the default maxmimum size of a mesage queue
kernel.msgmnb = 131072000

# Controls the maximum size of a message, in bytes
kernel.msgmax = 131072000

# Controls the maximum shared segment size, in bytes
kernel.shmmax = 4294967295

# Controls the maximum number of shared memory segments, in pages
kernel.shmall = 268435456

jomann
Development Lead
Posts: 611
Joined: Mon Apr 22, 2013 10:06 am
Location: Nagios Enterprises

Re: check_icmp false alerts

Post by jomann

It looks like you might be having hostname resolution issues with your DNS. Other services are reporting "hostname lookup failed":

Mar 12 11:04:12 nagiosprodxi1 automount[2273]: create_client: hostname lookup failed: Name or service not known

As for the queue size, it looks like you're fine on the number of messages (23.8k of 128k). With the setup you have (if this is the same server with 12 CPU cores, etc.), you should be able to raise the kernel.msgmnb and kernel.msgmax values in /etc/sysctl.conf and then run sysctl -p to apply them. That should help with the ndoutils queue maxing out.
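Before and after editing /etc/sysctl.conf and running sysctl -p, it is worth confirming what the running kernel actually holds; `sysctl kernel.msgmnb kernel.msgmax kernel.msgmni` prints this directly. The sketch below is a minimal equivalent (the function name is hypothetical) that reads the same values from /proc/sys/kernel, which is where sysctl gets them, so it needs no root. kernel.msgmni is included because the sysctl.conf posted above does not set it, so it is running at the kernel default.

```shell
# Hypothetical helper: print the SysV message-queue limits the running kernel
# is using, in the same "name = value" form that sysctl prints.
# Reads /proc/sys/kernel directly, so no root access is needed.
show_msgq_limits() {
    for name in msgmnb msgmax msgmni; do
        printf 'kernel.%s = %s\n' "$name" "$(cat "/proc/sys/kernel/$name")"
    done
}
```

Running `show_msgq_limits` once before and once after `sysctl -p` shows whether the new values took effect. `ipcs -q` complements it by listing each existing queue with its current used-bytes and message count.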

Re: check_icmp false alerts

Post by rajasegar

jomann wrote: As for the queue size, it looks like you're good on amount of messages (23.8k of 128k) ...

The log says: "You are currently using 128000 of 23815 messages."
Isn't this statement saying the system is using 128k of 23.8k messages?
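The two readings differ only in which figure of each "X of Y" pair is the limit. Taking the message literally (first figure = current usage, second = limit), the sketch below pulls the four numbers out of an ndo2db warning and prints usage as a percentage of the limit. The function name is hypothetical and the literal reading is an assumption, although the later log line in this thread, where the second message figure became 256000 right after kernel.msgmni was set to 256000, is consistent with it.

```shell
# Hypothetical helper: extract the figures from an ndo2db
# "You are currently using X of Y messages and A of B bytes" warning.
# Assumption: in each "X of Y" pair, X is current usage and Y is the limit.
ndo_queue_usage() {
    # $1: one syslog line; prints nothing if the line does not match.
    printf '%s\n' "$1" |
        sed -n 's/.*using \([0-9][0-9]*\) of \([0-9][0-9]*\) messages and \([0-9][0-9]*\) of \([0-9][0-9]*\) bytes.*/\1 \2 \3 \4/p' |
        awk '
            # Integer percentage of used against limit; 0 if the limit is 0.
            function pct(used, limit) { return (limit > 0) ? int(used * 100 / limit) : 0 }
            {
                printf "messages used=%s limit=%s (%d%%)\n", $1, $2, pct($1, $2)
                printf "bytes used=%s limit=%s (%d%%)\n", $3, $4, pct($3, $4)
            }'
}
```

Fed the 11:00:01 warning from the first log excerpt, this reports the message count at 537% of the 23815 limit and the byte count at 100% of 131072000, i.e. under the literal reading both limits were exhausted at that moment.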

Re: check_icmp false alerts

Post by rajasegar

kernel.msgmnb = 231072000
kernel.msgmax = 231072000
kernel.shmmax = 4294967295
kernel.shmall = 268435456
kernel.msgmni = 256000
Updated to the above.

You are currently using 128000 of 256000 messages and 131072000 of 131072000 bytes in the queue
Messages issue solved. Memory total updated. So far it looks OK.
No more wild spikes in scheduling (screenshot attached: 2015-03-13_07-18-41.png).
Any update on the check_icmp issue?
jdalrymple
Skynet Drone
Posts: 2620
Joined: Wed Feb 11, 2015 1:56 pm

Re: check_icmp false alerts

Post by jdalrymple

Are you getting false alerts or is it timing out?

If you're getting false alerts, what is the alert status and output?

Re: check_icmp false alerts

Post by rajasegar

jdalrymple wrote: Are you getting false alerts or is it timing out?
False alert; it is not timing out.
I can't remember exactly, but it is something like:
CRITICAL - Host down 100% packet loss

Re: check_icmp false alerts

Post by jdalrymple

That's going to be a tough one; check_icmp is usually very reliable. I think the first troubleshooting step is to run it from the command line on the gearman server and analyze both the input you're giving it and its output. It would also be useful to see a ping to the host in question right after it fails:

[jdalrymple@localhost libexec]$ ./check_icmp -H 8.8.8.8
OK - 8.8.8.8: rta 23.670ms, lost 0%|rta=23.670ms;200.000;500.000;0; pl=0%;40;80;; rtmax=26.253ms;;;; rtmin=21.982ms;;;;
[jdalrymple@localhost libexec]$ ping 8.8.8.8 -c 10
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=128 time=22.1 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=128 time=24.5 ms
64 bytes from 8.8.8.8: icmp_seq=3 ttl=128 time=20.6 ms
64 bytes from 8.8.8.8: icmp_seq=4 ttl=128 time=22.6 ms
64 bytes from 8.8.8.8: icmp_seq=5 ttl=128 time=26.8 ms
64 bytes from 8.8.8.8: icmp_seq=6 ttl=128 time=22.7 ms
64 bytes from 8.8.8.8: icmp_seq=7 ttl=128 time=40.1 ms
64 bytes from 8.8.8.8: icmp_seq=8 ttl=128 time=25.2 ms
64 bytes from 8.8.8.8: icmp_seq=9 ttl=128 time=45.0 ms
64 bytes from 8.8.8.8: icmp_seq=10 ttl=128 time=27.1 ms

--- 8.8.8.8 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9041ms
rtt min/avg/max/mdev = 20.673/27.733/45.096/7.782 ms

Re: check_icmp false alerts

Post by rajasegar

jdalrymple wrote: ... It would also be useful to see a ping to the host in question right after it fails.
OK. I will set up a continuous check to capture the error.
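jdalrymple's two suggestions (run check_icmp by hand on the gearman server, and ping the host right after a failure) can be combined into one continuous capture. This is a minimal sketch under assumptions: the plugin path is the usual Nagios XI libexec location, the thresholds mirror the command definition in the first post, and the function names, log file and interval are hypothetical. Source it on the gearman worker that actually executes the checks, then call `capture_loop`.

```shell
# Hypothetical capture helper: runs check_icmp repeatedly against one host
# and, on any non-OK result, logs the plugin output plus an immediate ping.
# Assumptions: plugin path (usual Nagios XI libexec) and log location;
# both can be overridden through the environment.
CHECK_ICMP=${CHECK_ICMP:-/usr/local/nagios/libexec/check_icmp}
PING=${PING:-ping}
LOG=${LOG:-/tmp/check_icmp_capture.log}

# Run one check with the same thresholds as the production command.
# Returns the plugin's exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).
capture_once() {
    host=$1
    output=$("$CHECK_ICMP" -H "$host" -w 3000.0 -c 5000.0 -n 3 2>&1)
    rc=$?
    if [ "$rc" -ne 0 ]; then
        {
            printf '%s rc=%s %s\n' "$(date '+%F %T')" "$rc" "$output"
            # Ping straight away to see whether the host really stopped answering.
            "$PING" -c 10 "$host" 2>&1
        } >> "$LOG"
    fi
    return "$rc"
}

# Check forever, sleeping $2 seconds (default 60) between runs.
capture_loop() {
    while :; do
        capture_once "$1" || :
        sleep "${2:-60}"
    done
}
```

For example, `capture_loop 192.0.2.10 30` checks every 30 seconds; each non-OK result lands in the log together with a 10-packet ping taken immediately afterwards, which is exactly the comparison needed to tell a real outage from a false alert.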

Re: check_icmp false alerts

Post by jdalrymple

It sounds like this is a very infrequent problem? That is going to make it even tougher to sort out.

Any chance you could find one of the false alerts in your nagios.log and share the exact contents? As I mentioned, check_icmp is pretty solid, so I doubt the plugin itself is where the problem lies. I'm wondering if there is some useful output coming back.
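Alert entries in nagios.log start with an epoch timestamp in square brackets, followed by HOST ALERT: or SERVICE ALERT: and semicolon-separated fields ending in the plugin output. A minimal sketch for pulling out every alert for one host with readable timestamps, assuming the usual Nagios XI log path and GNU date (standard on RHEL); the function name is hypothetical.

```shell
# Hypothetical helper: show HOST/SERVICE ALERT lines for one host from
# nagios.log, with the leading epoch timestamp converted to local time.
# Assumptions: Nagios XI log path below (override via NAGIOS_LOG), GNU date.
NAGIOS_LOG=${NAGIOS_LOG:-/usr/local/nagios/var/nagios.log}

show_alerts() {
    host=$1
    # Lines look like: [1426158252] HOST ALERT: host;DOWN;SOFT;1;output
    grep -F "ALERT: $host;" "$NAGIOS_LOG" |
        while IFS= read -r line; do
            ts=${line#?}        # drop the leading '['
            ts=${ts%%]*}        # keep only the digits before ']'
            printf '[%s]%s\n' "$(date -d "@$ts" '+%F %T')" "${line#*]}"
        done
}
```

Running `show_alerts <hostname>` around the time of a false alert shows the state, state type, attempt number and the exact plugin output Nagios recorded, which is the detail requested here.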