check_icmp false alerts

rajasegar · Post by **rajasegar** » Wed Mar 11, 2015 10:21 pm

Nagios XI 2014R1.2
check_gearman: version 1.4_nagios4 running on libgearman 0.25

Recently we have been having a lot of issues with check_icmp.
It keeps giving intermittent false alarms.

Please advise if there is any way to reduce this.

Code:

 $USER1$/check_icmp -H $HOSTADDRESS$ -w 3000.0 -c 5000.0 -n 3

rajasegar · Post by **rajasegar** » Wed Mar 11, 2015 10:43 pm

Not sure if this is related, but I am seeing this message in the system log.

Please advise on the tuning if required.

Code:

Mar 12 11:00:01 nagiosprodxi1 ndo2db: Warning: Retrying message send. This can occur because you have too few messages allowed or too few total bytes allowed in message queues. You are currently using 128000 of 23815 messages and 131072000 of 131072000 bytes in the queue. See README for kernel tuning options.
Mar 12 11:00:08 nagiosprodxi1 ndo2db: Message sent to queue.
Mar 12 11:00:08 nagiosprodxi1 ndo2db: Warning: queue send error, retrying...
Mar 12 11:00:09 nagiosprodxi1 ndo2db: Message sent to queue.
Mar 12 11:00:09 nagiosprodxi1 ndo2db: Warning: queue send error, retrying...
Mar 12 11:00:10 nagiosprodxi1 ndo2db: Message sent to queue.
Mar 12 11:00:10 nagiosprodxi1 ndo2db: Warning: queue send error, retrying...
Mar 12 11:00:11 nagiosprodxi1 ndo2db: Message sent to queue.
------
Mar 12 11:04:07 nagiosprodxi1 automount[2273]: key "<" not found in map source(s).
Mar 12 11:04:12 nagiosprodxi1 automount[2273]: create_client: hostname lookup failed: Name or service not known
Mar 12 11:04:12 nagiosprodxi1 automount[2273]: create_client: hostname lookup failed: Name or service not known
Mar 12 11:04:12 nagiosprodxi1 automount[2273]: get_exports: lookup(hosts): exports lookup failed for <
Mar 12 11:04:12 nagiosprodxi1 automount[2273]: key "<" not found in map source(s).
Mar 12 11:09:08 nagiosprodxi1 automount[2273]: key "<" not found in map source(s).
Mar 12 11:09:08 nagiosprodxi1 automount[2273]: create_client: hostname lookup failed: Name or service not known
Mar 12 11:09:08 nagiosprodxi1 automount[2273]: create_client: hostname lookup failed: Name or service not known
Mar 12 11:09:08 nagiosprodxi1 automount[2273]: get_exports: lookup(hosts): exports lookup failed for <
Mar 12 11:09:08 nagiosprodxi1 automount[2273]: key "<" not found in map source(s).
Mar 12 11:14:07 nagiosprodxi1 automount[2273]: key "<" not found in map source(s).

/etc/sysctl.conf

Code:

[nagios@nagiosprodxi1 local]$ cat /etc/sysctl.conf
# Kernel sysctl configuration file for Red Hat Linux
#
# For binary values, 0 is disabled, 1 is enabled.  See sysctl(8) and
# sysctl.conf(5) for more details.

# Controls IP packet forwarding
net.ipv4.ip_forward = 0

# Controls source route verification
net.ipv4.conf.default.rp_filter = 1

# Do not accept source routing
net.ipv4.conf.default.accept_source_route = 0

# Controls the System Request debugging functionality of the kernel
kernel.sysrq = 0

# Controls whether core dumps will append the PID to the core filename.
# Useful for debugging multi-threaded applications.
kernel.core_uses_pid = 1

# Controls the use of TCP syncookies
net.ipv4.tcp_syncookies = 1

# Disable netfilter on bridges.
net.bridge.bridge-nf-call-ip6tables = 0
net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-arptables = 0

# Controls the default maxmimum size of a mesage queue
kernel.msgmnb = 131072000

# Controls the maximum size of a message, in bytes
kernel.msgmax = 131072000

# Controls the maximum shared segment size, in bytes
kernel.shmmax = 4294967295

# Controls the maximum number of shared memory segments, in pages
kernel.shmall = 268435456

jomann · Post by **jomann** » Thu Mar 12, 2015 11:20 am

It looks like you might be having hostname resolution issues with your DNS. Other services are reporting "hostname lookup failed":

Mar 12 11:04:12 nagiosprodxi1 automount[2273]: create_client: hostname lookup failed: Name or service not known

As for the queue size, it looks like you're fine on the number of messages (23.8k of 128k). With the setup you have (if this is the same server with 12 CPU cores, etc.), you should be able to raise the kernel.msgmnb and kernel.msgmax values in the sysctl.conf file and then run the sysctl -p command to apply them. That should help with the ndoutils queue maxing out.
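
To make the numbers in the ndo2db warning easier to read, a small sketch like the one below could pull the byte counters out of a log line and express them as a percentage. The function names `ndo2db_bytes` and `pct_used` are made up for illustration and are not part of NDOUtils; the log path in the usage comment is also an assumption.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical, not part of NDOUtils.

# Read ndo2db warning lines on stdin and print "used limit" for the byte counters.
ndo2db_bytes() {
    sed -n 's/.* and \([0-9][0-9]*\) of \([0-9][0-9]*\) bytes in the queue.*/\1 \2/p'
}

# Print used/limit as a whole-number percentage.
pct_used() {
    used=$1
    limit=$2
    if [ "$limit" -le 0 ]; then
        echo "n/a"
        return 1
    fi
    echo $(( used * 100 / limit ))
}

# Example (log path is an assumption; adjust for your syslog setup):
#   grep 'ndo2db: Warning: Retrying' /var/log/messages | ndo2db_bytes | tail -n 1
```

For the warning line posted earlier in this thread, `ndo2db_bytes` prints `131072000 131072000` and `pct_used 131072000 131072000` prints `100`, which would be consistent with the byte limit being the part of the queue that is full.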

rajasegar · Post by **rajasegar** » Thu Mar 12, 2015 5:51 pm

jomann wrote: It looks like you might be having hostname resolution issues with your DNS. Other services are reporting hostname lookup failed:

Mar 12 11:04:12 nagiosprodxi1 automount[2273]: create_client: hostname lookup failed: Name or service not known

As for the queue size, it looks like you're good on amount of messages (23.8k of 128k) and with the setup you have (if this is the same server with 12cpu cores, etc) then you should be able to up your kernel.msgmnb and kernel.msgmax values in the sysctl.conf file and then run the sysctl -p command to update them. That should help with the ndoutils queue maxing out.

You are currently using 128000 of 23815 messages.
Isn't this statement saying the system is using 128k of 23k messages?

rajasegar · Post by **rajasegar** » Thu Mar 12, 2015 6:19 pm

Code:

kernel.msgmnb = 231072000
kernel.msgmax = 231072000
kernel.shmmax = 4294967295
kernel.shmall = 268435456
kernel.msgmni = 256000

Updated to the above.

Code:

You are currently using 128000 of 256000 messages and 131072000 of 131072000 bytes in the queue
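
One way to confirm that the values written to sysctl.conf are the ones the running kernel is using is sketched below. The helper names `conf_value` and `check_key` are hypothetical, and the sketch assumes simple single-value keys that map onto `/proc/sys` by replacing dots with slashes. The output above still reports 131072000 bytes as the limit; one possible explanation, offered as an assumption and not verified here, is that an existing System V message queue keeps the byte limit it was created with until the process that owns it is restarted.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical.

# Print the last value assigned to KEY in a sysctl.conf-style FILE.
conf_value() {
    key=$1
    file=$2
    awk -F= -v k="$key" '
        /^[ \t]*[#;]/ { next }            # skip comment lines
        {
            name = $1
            gsub(/[ \t]/, "", name)
            if (name == k) {
                val = $2
                gsub(/^[ \t]+|[ \t]+$/, "", val)
                found = 1
            }
        }
        END {
            if (!found) exit 1
            print val
        }
    ' "$file"
}

# Compare the configured value with what the running kernel reports.
check_key() {
    key=$1
    file=${2:-/etc/sysctl.conf}
    want=$(conf_value "$key" "$file") || { echo "$key: not set in $file"; return 1; }
    have=$(cat "/proc/sys/$(echo "$key" | tr . /)" 2>/dev/null) || have="unreadable"
    echo "$key: configured=$want running=$have"
    [ "$want" = "$have" ]
}

# Example: for k in kernel.msgmnb kernel.msgmax kernel.msgmni; do check_key "$k"; done
```

`ipcs -q` can then be used to look at the queues that currently exist and how many bytes each one holds.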

Messages issue solved. Memory total updated. So far it looks OK.
No more wild spikes in scheduling.

2015-03-13_07-18-41.png

Any update on the check_icmp issue?

jdalrymple · Post by **jdalrymple** » Fri Mar 13, 2015 8:55 am

Are you getting false alerts or is it timing out?

If you're getting false alerts, what is the alert status and output?

rajasegar · Post by **rajasegar** » Sun Mar 15, 2015 9:02 pm

jdalrymple wrote: Are you getting false alerts or is it timing out?

If you're getting false alerts, what is the alert status and output?

False alert. It is not timing out.
I can't remember exactly, but it was something like:
CRITICAL - Host down 100% packet loss

jdalrymple · Post by **jdalrymple** » Mon Mar 16, 2015 8:56 am

That's going to be a tough one; check_icmp is usually very reliable. I think the first troubleshooting step is to run it from the command line on the gearman server and analyze both the input you're giving it and its output. It would also be useful to see a ping to the host in question right after it fails:

Code:

[jdalrymple@localhost libexec]$ ./check_icmp -H 8.8.8.8
OK - 8.8.8.8: rta 23.670ms, lost 0%|rta=23.670ms;200.000;500.000;0; pl=0%;40;80;; rtmax=26.253ms;;;; rtmin=21.982ms;;;;
[jdalrymple@localhost libexec]$ ping 8.8.8.8 -c 10
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=128 time=22.1 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=128 time=24.5 ms
64 bytes from 8.8.8.8: icmp_seq=3 ttl=128 time=20.6 ms
64 bytes from 8.8.8.8: icmp_seq=4 ttl=128 time=22.6 ms
64 bytes from 8.8.8.8: icmp_seq=5 ttl=128 time=26.8 ms
64 bytes from 8.8.8.8: icmp_seq=6 ttl=128 time=22.7 ms
64 bytes from 8.8.8.8: icmp_seq=7 ttl=128 time=40.1 ms
64 bytes from 8.8.8.8: icmp_seq=8 ttl=128 time=25.2 ms
64 bytes from 8.8.8.8: icmp_seq=9 ttl=128 time=45.0 ms
64 bytes from 8.8.8.8: icmp_seq=10 ttl=128 time=27.1 ms

--- 8.8.8.8 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9041ms
rtt min/avg/max/mdev = 20.673/27.733/45.096/7.782 ms

rajasegar · Post by **rajasegar** » Mon Mar 16, 2015 6:23 pm

jdalrymple wrote: That's going to be a tough one, check_icmp is usually very reliable. I think the first troubleshooting step will be to see it run from the command line on the gearman server and analyze the input you're giving it and the output. It would also be useful to see a ping to the host in question right after it fails:

Code:

[jdalrymple@localhost libexec]$ ./check_icmp -H 8.8.8.8
OK - 8.8.8.8: rta 23.670ms, lost 0%|rta=23.670ms;200.000;500.000;0; pl=0%;40;80;; rtmax=26.253ms;;;; rtmin=21.982ms;;;;
[jdalrymple@localhost libexec]$ ping 8.8.8.8 -c 10
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=128 time=22.1 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=128 time=24.5 ms
64 bytes from 8.8.8.8: icmp_seq=3 ttl=128 time=20.6 ms
64 bytes from 8.8.8.8: icmp_seq=4 ttl=128 time=22.6 ms
64 bytes from 8.8.8.8: icmp_seq=5 ttl=128 time=26.8 ms
64 bytes from 8.8.8.8: icmp_seq=6 ttl=128 time=22.7 ms
64 bytes from 8.8.8.8: icmp_seq=7 ttl=128 time=40.1 ms
64 bytes from 8.8.8.8: icmp_seq=8 ttl=128 time=25.2 ms
64 bytes from 8.8.8.8: icmp_seq=9 ttl=128 time=45.0 ms
64 bytes from 8.8.8.8: icmp_seq=10 ttl=128 time=27.1 ms

--- 8.8.8.8 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9041ms
rtt min/avg/max/mdev = 20.673/27.733/45.096/7.782 ms

Ok. I will put a continuous check to capture the error.
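
A continuous check of that kind could look roughly like the sketch below: it runs check_icmp with the same arguments as the command at the top of the thread and, whenever the result is not OK, appends the plugin output and a follow-up ping to a log file. The function name `capture_once`, the default plugin path, the log location, and the 60-second interval in the usage comment are all assumptions chosen for illustration.

```shell
#!/bin/sh
# Sketch only: names, paths and defaults below are assumptions.
PLUGIN=${PLUGIN:-/usr/local/nagios/libexec/check_icmp}
PING=${PING:-ping}
LOG=${LOG:-/tmp/check_icmp_capture.log}

# Run one check; on a non-OK result, log the plugin output and a follow-up ping.
# Returns the plugin's exit status (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN).
capture_once() {
    host=$1
    out=$("$PLUGIN" -H "$host" -w 3000.0 -c 5000.0 -n 3 2>&1)
    rc=$?
    if [ "$rc" -ne 0 ]; then
        {
            echo "$(date '+%Y-%m-%d %H:%M:%S') host=$host rc=$rc output=$out"
            "$PING" -c 5 "$host" 2>&1
        } >> "$LOG"
    fi
    return "$rc"
}

# Example loop (run it on the gearman worker that executes the check):
#   while :; do capture_once 192.0.2.10; sleep 60; done
```

Running the loop as the same user that the worker uses should keep the results comparable, since check_icmp needs raw-socket privileges that can differ between accounts.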

jdalrymple · Post by **jdalrymple** » Tue Mar 17, 2015 9:04 am

Sounds like it's a very infrequent problem? That is going to make it even tougher to sort out.

Any chance you could find one of the false alerts in your nagios.log and share the contents exactly? Like I mentioned, check_icmp is pretty solid, so I doubt the actual plugin is where the problem lies. I'm wondering if there is some useful output coming back.
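
To locate those entries, something like the following sketch could filter nagios.log for problem-state alert lines. The function name `find_problem_alerts` is hypothetical, and the default log path is an assumption about a typical Nagios XI install; the pattern relies on the standard `[epoch] HOST ALERT: host;state;...` and `[epoch] SERVICE ALERT: host;service;state;...` line layout.

```shell
#!/bin/sh
# Sketch only: function name and default path are assumptions.

# Print HOST ALERT lines in DOWN/UNREACHABLE state and SERVICE ALERT lines
# in WARNING/CRITICAL state from the given Nagios log.
find_problem_alerts() {
    log=${1:-/usr/local/nagios/var/nagios.log}
    grep -E '^\[[0-9]+\] (HOST ALERT: [^;]+;(DOWN|UNREACHABLE)|SERVICE ALERT: [^;]+;[^;]+;(WARNING|CRITICAL));' "$log"
}

# Example: narrow the result to one host (hostname is a placeholder):
#   find_problem_alerts | grep ': myhost;'
```

The trailing field of each matching line is the plugin output as Nagios recorded it, which is the part worth posting verbatim.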

Nagios Support Forum
