check_icmp false alerts
Posted: Wed Mar 11, 2015 10:21 pm
by rajasegar
Nagios XI 2014R1.2
check_gearman: version 1.4_nagios4 running on libgearman 0.25
Recently we have been having a lot of issues with check_icmp.
It keeps giving intermittent false alarms.
Please advise if there is any way to reduce this.
Code:
$USER1$/check_icmp -H $HOSTADDRESS$ -w 3000.0 -c 5000.0 -n 3
Re: check_icmp false alerts
Posted: Wed Mar 11, 2015 10:43 pm
by rajasegar
Not sure if this is related, but I am seeing this message in the system log.
Please advise on the tuning if required.
Code:
Mar 12 11:00:01 nagiosprodxi1 ndo2db: Warning: Retrying message send. This can occur because you have too few messages allowed or too few total bytes allowed in message queues. You are currently using 128000 of 23815 messages and 131072000 of 131072000 bytes in the queue. See README for kernel tuning options.
Mar 12 11:00:08 nagiosprodxi1 ndo2db: Message sent to queue.
Mar 12 11:00:08 nagiosprodxi1 ndo2db: Warning: queue send error, retrying...
Mar 12 11:00:09 nagiosprodxi1 ndo2db: Message sent to queue.
Mar 12 11:00:09 nagiosprodxi1 ndo2db: Warning: queue send error, retrying...
Mar 12 11:00:10 nagiosprodxi1 ndo2db: Message sent to queue.
Mar 12 11:00:10 nagiosprodxi1 ndo2db: Warning: queue send error, retrying...
Mar 12 11:00:11 nagiosprodxi1 ndo2db: Message sent to queue.
------
Mar 12 11:04:07 nagiosprodxi1 automount[2273]: key "<" not found in map source(s).
Mar 12 11:04:12 nagiosprodxi1 automount[2273]: create_client: hostname lookup failed: Name or service not known
Mar 12 11:04:12 nagiosprodxi1 automount[2273]: create_client: hostname lookup failed: Name or service not known
Mar 12 11:04:12 nagiosprodxi1 automount[2273]: get_exports: lookup(hosts): exports lookup failed for <
Mar 12 11:04:12 nagiosprodxi1 automount[2273]: key "<" not found in map source(s).
Mar 12 11:09:08 nagiosprodxi1 automount[2273]: key "<" not found in map source(s).
Mar 12 11:09:08 nagiosprodxi1 automount[2273]: create_client: hostname lookup failed: Name or service not known
Mar 12 11:09:08 nagiosprodxi1 automount[2273]: create_client: hostname lookup failed: Name or service not known
Mar 12 11:09:08 nagiosprodxi1 automount[2273]: get_exports: lookup(hosts): exports lookup failed for <
Mar 12 11:09:08 nagiosprodxi1 automount[2273]: key "<" not found in map source(s).
Mar 12 11:14:07 nagiosprodxi1 automount[2273]: key "<" not found in map source(s).
/etc/sysctl.conf
Code:
[nagios@nagiosprodxi1 local]$ cat /etc/sysctl.conf
# Kernel sysctl configuration file for Red Hat Linux
#
# For binary values, 0 is disabled, 1 is enabled. See sysctl(8) and
# sysctl.conf(5) for more details.
# Controls IP packet forwarding
net.ipv4.ip_forward = 0
# Controls source route verification
net.ipv4.conf.default.rp_filter = 1
# Do not accept source routing
net.ipv4.conf.default.accept_source_route = 0
# Controls the System Request debugging functionality of the kernel
kernel.sysrq = 0
# Controls whether core dumps will append the PID to the core filename.
# Useful for debugging multi-threaded applications.
kernel.core_uses_pid = 1
# Controls the use of TCP syncookies
net.ipv4.tcp_syncookies = 1
# Disable netfilter on bridges.
net.bridge.bridge-nf-call-ip6tables = 0
net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-arptables = 0
# Controls the default maxmimum size of a mesage queue
kernel.msgmnb = 131072000
# Controls the maximum size of a message, in bytes
kernel.msgmax = 131072000
# Controls the maximum shared segment size, in bytes
kernel.shmmax = 4294967295
# Controls the maximum number of shared memory segments, in pages
kernel.shmall = 268435456
Re: check_icmp false alerts
Posted: Thu Mar 12, 2015 11:20 am
by jomann
It looks like you might be having hostname resolution issues with your DNS. Other services are reporting hostname lookup failed:
Mar 12 11:04:12 nagiosprodxi1 automount[2273]: create_client: hostname lookup failed: Name or service not known
As for the queue size, it looks like you're good on the number of messages (23.8k of 128k). With the setup you have (if this is the same server with 12 CPU cores, etc.), you should be able to raise the kernel.msgmnb and kernel.msgmax values in /etc/sysctl.conf and then run sysctl -p to apply them. That should help with the ndoutils queue maxing out.
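Before and after raising the limits, it can help to confirm that the running kernel actually uses the values written in the file. The following is a minimal sketch, assuming a POSIX shell with awk; `conf_value` is a hypothetical helper name introduced here, not part of Nagios or ndoutils.

```shell
# Sketch: print the value a sysctl.conf-style file assigns to one key.
# conf_value is a hypothetical helper; arguments: key name, file path.
conf_value() {
    key=$1
    file=$2
    # Skip comment lines, split on "=", strip blanks, keep the last assignment seen.
    awk -F'=' -v k="$key" '
        /^[ \t]*#/ { next }
        {
            name = $1
            gsub(/[ \t]/, "", name)
            if (name == k) {
                val = $2
                gsub(/[ \t]/, "", val)
                found = val
            }
        }
        END { if (found != "") print found }
    ' "$file"
}

# Example use on a Linux host (commented out so the sketch has no side effects):
# for key in kernel.msgmnb kernel.msgmax kernel.msgmni; do
#     echo "$key file=$(conf_value "$key" /etc/sysctl.conf) running=$(sysctl -n "$key")"
# done
```

If the "file" and "running" columns differ, the file has probably been edited without sysctl -p being run afterwards.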
Re: check_icmp false alerts
Posted: Thu Mar 12, 2015 5:51 pm
by rajasegar
jomann wrote:It looks like you might be having hostname resolution issues with your DNS. Other services are reporting hostname lookup failed:
Mar 12 11:04:12 nagiosprodxi1 automount[2273]: create_client: hostname lookup failed: Name or service not known
As for the queue size, it looks like you're good on amount of messages (23.8k of 128k) and with the setup you have (if this is the same server with 12cpu cores, etc) then you should be able to up your kernel.msgmnb and kernel.msgmax values in the sysctl.conf file and then run the sysctl -p command to update them. That should help with the ndoutils queue maxing out.
You are currently using 128000 of 23815 messages.
Isn't this statement saying the system is using 128k of 23k messages?
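To compare these figures over time without reading the full warning each time, the four numbers can be pulled out of the log line. This is a minimal sketch, assuming a POSIX shell with sed; `queue_usage` is a hypothetical helper name, and the labels only repeat the wording of the message. It makes no claim about which figure is the real limit.

```shell
# Sketch: extract the four counters from ndo2db "Retrying message send" lines on stdin.
# queue_usage is a hypothetical helper; lines that do not match are dropped.
queue_usage() {
    sed -n 's/.*using \([0-9][0-9]*\) of \([0-9][0-9]*\) messages and \([0-9][0-9]*\) of \([0-9][0-9]*\) bytes.*/messages: \1 of \2, bytes: \3 of \4/p'
}

# Example use (the syslog path is an assumption; adjust to your system):
# grep 'ndo2db: Warning: Retrying' /var/log/messages | queue_usage
```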
Re: check_icmp false alerts
Posted: Thu Mar 12, 2015 6:19 pm
by rajasegar
Code:
kernel.msgmnb = 231072000
kernel.msgmax = 231072000
kernel.shmmax = 4294967295
kernel.shmall = 268435456
kernel.msgmni = 256000
Updated to the above.
Code:
You are currently using 128000 of 256000 messages and 131072000 of 131072000 bytes in the queue
The message-count issue is solved and the memory total is updated. So far it looks OK.
No more wild spikes in scheduling.
[Attachment: 2015-03-13_07-18-41.png]
Any update on the check_icmp issue?
Re: check_icmp false alerts
Posted: Fri Mar 13, 2015 8:55 am
by jdalrymple
Are you getting false alerts or is it timing out?
If you're getting false alerts, what is the alert status and output?
Re: check_icmp false alerts
Posted: Sun Mar 15, 2015 9:02 pm
by rajasegar
jdalrymple wrote:Are you getting false alerts or is it timing out?
If you're getting false alerts, what is the alert status and output?
False alert. It is not timing out.
I can't remember exactly, but it is something like:
CRITICAL - Host down 100% packet loss
Re: check_icmp false alerts
Posted: Mon Mar 16, 2015 8:56 am
by jdalrymple
That's going to be a tough one, check_icmp is usually very reliable. I think the first troubleshooting step will be to see it run from the command line on the gearman server and analyze the input you're giving it and the output. It would also be useful to see a ping to the host in question right after it fails:
Code:
[jdalrymple@localhost libexec]$ ./check_icmp -H 8.8.8.8
OK - 8.8.8.8: rta 23.670ms, lost 0%|rta=23.670ms;200.000;500.000;0; pl=0%;40;80;; rtmax=26.253ms;;;; rtmin=21.982ms;;;;
[jdalrymple@localhost libexec]$ ping 8.8.8.8 -c 10
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=128 time=22.1 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=128 time=24.5 ms
64 bytes from 8.8.8.8: icmp_seq=3 ttl=128 time=20.6 ms
64 bytes from 8.8.8.8: icmp_seq=4 ttl=128 time=22.6 ms
64 bytes from 8.8.8.8: icmp_seq=5 ttl=128 time=26.8 ms
64 bytes from 8.8.8.8: icmp_seq=6 ttl=128 time=22.7 ms
64 bytes from 8.8.8.8: icmp_seq=7 ttl=128 time=40.1 ms
64 bytes from 8.8.8.8: icmp_seq=8 ttl=128 time=25.2 ms
64 bytes from 8.8.8.8: icmp_seq=9 ttl=128 time=45.0 ms
64 bytes from 8.8.8.8: icmp_seq=10 ttl=128 time=27.1 ms
--- 8.8.8.8 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9041ms
rtt min/avg/max/mdev = 20.673/27.733/45.096/7.782 ms
Re: check_icmp false alerts
Posted: Mon Mar 16, 2015 6:23 pm
by rajasegar
jdalrymple wrote:That's going to be a tough one, check_icmp is usually very reliable. I think the first troubleshooting step will be to see it run from the command line on the gearman server and analyze the input you're giving it and the output. It would also be useful to see a ping to the host in question right after it fails:
Ok. I will put a continuous check to capture the error.
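One way to set up such a continuous check is a small loop that runs the plugin repeatedly and records only the runs that exit non-zero (Nagios plugins exit 0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN). This is a minimal sketch, assuming a POSIX shell; `capture_failures` is a hypothetical helper name, and the plugin path and host address in the usage comment are placeholders.

```shell
# Sketch: run a command repeatedly and log every run that exits non-zero.
# capture_failures is a hypothetical helper.
# Arguments: number of runs, delay in seconds, log file, then the command itself.
capture_failures() {
    count=$1
    delay=$2
    logfile=$3
    shift 3
    i=0
    while [ "$i" -lt "$count" ]; do
        # Capture plugin output and its exit status together.
        output=$("$@" 2>&1)
        status=$?
        if [ "$status" -ne 0 ]; then
            printf '%s exit=%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$status" "$output" >> "$logfile"
        fi
        i=$((i + 1))
        sleep "$delay"
    done
}

# Example use (path and address are assumptions, arguments copied from the check above):
# capture_failures 1000 30 /tmp/check_icmp_failures.log \
#     /usr/local/nagios/libexec/check_icmp -H 192.0.2.10 -w 3000.0 -c 5000.0 -n 3
```

Running this on the gearman worker, as suggested above, keeps the test on the same network path as the real checks.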
Re: check_icmp false alerts
Posted: Tue Mar 17, 2015 9:04 am
by jdalrymple
Sounds like the problem occurs very infrequently? That is going to make it even tougher to sort out.
Any chance you could find one of the false alerts in your nagios.log and share the contents exactly? Like I mentioned, check_icmp is pretty solid, so I doubt the actual plugin is where the problem lies. I'm wondering if there is some useful output coming back.
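To find those entries, the alert lines for a single host can be filtered out of nagios.log. This is a minimal sketch, assuming a POSIX shell with awk and the usual log layout of `[epoch] HOST ALERT: host;...` and `[epoch] SERVICE ALERT: host;service;...`; `alerts_for_host` is a hypothetical helper name, and the log path in the usage comment is an assumption.

```shell
# Sketch: print HOST ALERT and SERVICE ALERT lines for one host from nagios.log text on stdin.
# alerts_for_host is a hypothetical helper; argument: the host name as configured in Nagios.
alerts_for_host() {
    awk -F';' -v h="$1" '
        /(HOST|SERVICE) ALERT: / {
            # Field 1 looks like "[epoch] HOST ALERT: hostname"; keep the text after "ALERT: ".
            name = $1
            sub(/^.*ALERT: /, "", name)
            if (name == h) print
        }
    '
}

# Example use (log path is an assumption; adjust to your install):
# alerts_for_host myhost < /usr/local/nagios/var/nagios.log
```

The matching is on the exact host name, so a host called web1 does not also pull in lines for web10.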