Nagios timeouts on random service checks

Post by **tek0tron** » Wed Jan 29, 2020 11:44 pm

For at least the past few weeks, roughly every 15 to 30 minutes, Nagios has shown a critical timeout for a check on some server. The servers and services that time out are random. We have upgraded Nagios Core to the latest version available on CentOS 7, and likewise the NRPE agents on the remote servers. Many of the servers that time out are on the same network and on the same host (Proxmox VMs).

Current Nagios® Core™ 4.4.3
Host: CentOS Linux release 7.7.1908 (Core)
Processors: 2
Memory: 3GB
Disk space: 40GB, available space: 30GB

Any suggestions on what we can check/do to get over this issue?

Post by **mbellerue** » Thu Jan 30, 2020 4:37 pm

Welcome to the forum!

How many service checks are running on that Core instance? And what's the load like? Try running sar 1 10 and let's see what the result looks like. My first suspicion would be performance on the Core machine. 2 cores and 3GB of memory is pretty thin. But if you don't have too many services, and/or they're not being checked too often, it can be enough.

The next thing to potentially look at is the Proxmox host, and its VMs. Everything is probably using the same network adapter to get out to the network. Might be worth checking Proxmox to make sure there are no errors in dmesg, and get the output of netstat -i

Post by **tek0tron** » Thu Jan 30, 2020 10:13 pm

Thanks for the response, mbellerue.

Following your suggestions, output/results below:

Hosts: 79
Services: 724
Type of service checks: CPU Load, Memory, disk i/o, IP reputation test, Total process, logged in users, zombie processes, disk partition checks, URL (http/https) checks, SMTP, POP3, Mysql, IMAP, FTP, mail queue

service check interval: 1 Minute
max check attempts: 3
retry interval: 1 min
we use the ramdisk setup on the nagios server host

sar 1 10

[root@nagiosrv ~]# sar 1 10
Linux 3.10.0-1062.9.1.el7.x86_64 (nagiosrv) 01/30/2020 _x86_64_ (2 CPU)

10:01:51 PM CPU %user %nice %system %iowait %steal %idle
10:01:52 PM all 12.56 0.00 4.52 0.00 0.00 82.91
10:01:53 PM all 3.02 0.00 3.02 0.00 0.00 93.97
10:01:54 PM all 1.01 0.00 1.52 0.00 0.00 97.47
10:01:55 PM all 3.52 0.00 1.51 0.00 0.00 94.97
10:01:56 PM all 2.01 0.00 3.02 0.00 0.00 94.97
10:01:57 PM all 2.02 0.00 3.03 0.00 0.00 94.95
10:01:58 PM all 3.02 0.00 3.02 0.00 0.00 93.97
10:01:59 PM all 4.06 0.00 1.52 0.00 0.00 94.42
10:02:00 PM all 4.50 0.00 1.50 0.00 0.00 94.00
10:02:01 PM all 5.05 0.00 3.03 0.00 0.00 91.92
Average: all 4.08 0.00 2.57 0.00 0.00 93.35

ProxMox Host: netstat -i

root@pm5:~# netstat -i
Kernel Interface table
Iface MTU RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
enp95s0f 1500 15019964241 0 3024341 0 13527130075 0 0 0 BMRU
enp95s0f 1500 1144637904 0 0 0 2214819051 0 0 0 BMRU
fwbr163i 1500 33900772 0 135 0 2 0 0 0 BMRU
fwbr169i 1500 33902645 0 145 0 2 0 0 0 BMRU
fwln163i 1500 62571650 0 0 0 17859333 0 0 0 BMRU
fwln169i 1500 34211846 0 0 0 139970 0 0 0 BMRU
fwpr163p 1500 17859333 0 0 0 62571650 0 0 0 BMRU
fwpr169p 1500 139970 0 0 0 34211846 0 0 0 BMRU
lo 65536 613305224 0 0 0 613305224 0 0 0 LRU
tap100i0 1500 23394064 0 0 0 63652073 0 0 0 BMPRU
tap101i0 1500 15970540 0 0 0 52674890 0 0 0 BMPRU
tap105i0 1500 34212295 0 0 0 68126318 0 31105 0 BMPRU
tap110i0 1500 129155 0 0 0 34076251 0 0 0 BMPRU
tap114i0 1500 2835173 0 0 0 36800210 0 0 0 BMPRU
tap115i0 1500 22725583 0 0 0 29782196 0 0 0 BMPRU
tap116i0 1500 1588050 0 0 0 35732956 0 0 0 BMPRU
tap118i0 1500 3638737 0 0 0 38319580 0 8355 0 BMPRU
tap122i0 1500 2659422 0 0 0 37469473 0 0 0 BMPRU
tap123i0 1500 9488064 0 0 0 45014284 0 276 0 BMPRU
tap124i0 1500 96693059 0 0 0 132099153 0 0 0 BMPRU
tap125i0 1500 54980391 0 0 0 126315855 0 0 0 BMPRU
tap130i0 1500 7426525 0 0 0 42909591 0 0 0 BMPRU
tap134i0 1500 22974287 0 0 0 56035652 0 0 0 BMPRU
tap137i0 1500 51922856 0 0 0 76055158 0 0 0 BMPRU
tap138i0 1500 6976726 0 0 0 44548652 0 0 0 BMPRU
tap139i0 1500 77978192 0 0 0 123949735 0 0 0 BMPRU
tap140i0 1500 112054712 0 0 0 171454891 0 0 0 BMPRU
tap143i0 1500 2798297 0 0 0 39485888 0 0 0 BMPRU
tap144i0 1500 15850317 0 0 0 52679014 0 0 0 BMPRU
tap151i0 1500 8915133 0 0 0 44873725 0 0 0 BMPRU
tap157i0 1500 947338 0 0 0 34804385 0 53 0 BMPRU
tap160i0 1500 43342345 0 0 0 87140986 0 2569 0 BMPRU
tap163i0 1500 17860240 0 0 0 62571023 0 0 0 BMPRU
tap167i0 1500 423053686 0 0 0 82063225 0 0 0 BMPRU
tap169i0 1500 139971 0 0 0 34211412 0 0 0 BMPRU
tap173i0 1500 4009079 0 0 0 37616104 0 730 0 BMPRU
vmbr0 1500 195258674 0 155 0 157202904 0 0 0 BMRU
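
With this many interfaces it can be easier to filter the table than to read it by eye. The following is only a sketch: the function name `nonzero_counters` is made up for illustration, and it assumes the 11-column `netstat -i` layout shown above (Iface, MTU, RX-OK, RX-ERR, RX-DRP, RX-OVR, TX-OK, TX-ERR, TX-DRP, TX-OVR, Flg). It prints only the interfaces with a non-zero error or drop counter, plus the RX drops as a percentage of RX-OK.

```shell
# Hypothetical helper: list interfaces with non-zero error/drop counters.
# Reads `netstat -i` output on stdin. Assumes the 11-column layout above;
# the banner line, the header line, and any malformed rows are skipped.
nonzero_counters() {
    awk 'NF >= 11 && $2 ~ /^[0-9]+$/ && ($4 + $5 + $8 + $9) > 0 {
        # $3=RX-OK  $4=RX-ERR  $5=RX-DRP  $8=TX-ERR  $9=TX-DRP
        pct = ($3 > 0) ? 100 * $5 / $3 : 0
        printf "%s rx_err=%s rx_drp=%s tx_err=%s tx_drp=%s rx_drp_pct=%.2f\n",
               $1, $4, $5, $8, $9, pct
    }'
}
```

Usage would be `netstat -i | nonzero_counters` on the Proxmox host or on the Nagios VM. The counters are cumulative since boot, so comparing two runs taken a few minutes apart says more than a single snapshot.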

Post by **tgriep** » Tue Feb 04, 2020 3:47 pm

What is the memory usage on the server and the CPU load?

Have you looked in the /var/log/messages file to see if there are any errors that can help troubleshoot this?

You have a lot of network interfaces in the system; is any one of them failing the most?

Have you tried to increase the check_nrpe plugin's timeout by adding a -t 60 to the command?
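
Before or alongside raising the timeout, it may help to measure how long the check takes when run by hand from the Nagios server. A minimal sketch, assuming a bash or POSIX shell with `date +%s` available; the helper name `time_runs` is made up for illustration.

```shell
# Hypothetical helper: time_runs N CMD [ARGS...]
# Runs CMD N times and prints the run number, exit status, and elapsed
# whole seconds for each run. The command's own output is discarded.
time_runs() {
    runs=$1
    shift
    i=1
    while [ "$i" -le "$runs" ]; do
        start=$(date +%s)
        "$@" >/dev/null 2>&1
        rc=$?
        end=$(date +%s)
        echo "run=$i rc=$rc seconds=$((end - start))"
        i=$((i + 1))
    done
}
```

For example, `time_runs 20 /usr/lib64/nagios/plugins/check_nrpe -H someserver.net -t 60 -c check_load`, where the plugin path and the `check_load` command name are assumptions to be replaced with whatever the installation actually uses. Runs that take close to the timeout, or that return a non-zero status while the remote service is fine, would point at the network or the agent rather than at Nagios scheduling.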

Post by **tek0tron** » Tue Feb 04, 2020 9:53 pm

Hi tgriep,

The CPU is almost always 92-96% idle. Since the first post, I have increased the CPU to 4 processors and memory increased to 5GB. The memory usage continues to show 3GB free at all times. After this increase, the timeouts have reduced but are still there and as random as they were earlier.

The netstat output in my previous post relates to the Proxmox host, not the Nagios VM itself. The output from the Nagios VM is below:
[root@ ~]# netstat -i
Kernel Interface table
Iface MTU RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0 1500 33822027 0 213686 0 28040194 0 0 0 BMRU
lo 65536 192276 0 0 0 192276 0 0 0 LRU

The most recent timeout errors from /var/log/messages are listed below. Although the log and the Nagios UI showed timeout alerts, the servers and services were actually up.

Feb 2 03:10:15 nagiosrv nagios: wproc: Core Worker 13468: job 185342 (pid=26808) timed out. Killing it
Feb 2 03:10:15 nagiosrv nagios: job 185342 (pid=26808): read() returned error 11
Feb 2 03:10:15 nagiosrv nagios: wproc: CHECK job 185342 from worker Core Worker 13468 timed out after 30.01s
Feb 2 03:10:15 nagiosrv nagios: wproc: host=xyz.com; service=(null);
Feb 2 03:10:15 nagiosrv nagios: wproc: early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
Feb 2 03:10:15 nagiosrv nagios: Warning: Check of host 'xyz.com' timed out after 30.01 seconds
Feb 2 03:10:15 nagiosrv nagios: HOST ALERT: xyz.com;DOWN;SOFT;1;(Host check timed out after 30.01 seconds)

Feb 2 03:54:31 nagiosrv nagios: job 188032 (pid=15777): read() returned error 11
Feb 2 03:54:31 nagiosrv nagios: wproc: Core Worker 13467: job 188032 (pid=15777) timed out. Killing it
Feb 2 03:54:31 nagiosrv nagios: wproc: CHECK job 188032 from worker Core Worker 13467 timed out after 60.01s
Feb 2 03:54:31 nagiosrv nagios: wproc: host=abc.com; service=Opendrive check_NFS_service;
Feb 2 03:54:31 nagiosrv nagios: wproc: early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
Feb 2 03:54:31 nagiosrv nagios: Warning: Check of service 'Opendrive check_NFS_service' on host 'abc.com' timed out after 60.006s!
Feb 2 03:54:31 nagiosrv nagios: SERVICE ALERT: abc.com; check_NFS_service;CRITICAL;SOFT;2;(Service check timed out after 60.01 seconds)
Feb 2 03:54:31 nagiosrv nagios: wproc: Core Worker 13467: job 188032 (pid=15777): Dormant child reaped
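
To see whether the timeouts really are spread evenly or cluster on a few hosts, the "Warning: Check of ... timed out" lines can be counted per host. This is a sketch that assumes the log lines look like the ones quoted above; the function name `timeouts_per_host` is made up for illustration.

```shell
# Hypothetical helper: count Nagios check timeouts per host.
# Reads the named log file(s), or stdin when no file is given.
# Matches both "Check of host 'H' timed out" and
# "Check of service 'S' on host 'H' timed out" warnings.
timeouts_per_host() {
    sed -n "s/.*Warning: Check of .*host '\([^']*\)' timed out.*/\1/p" "$@" |
        sort | uniq -c | sort -rn |
        awk '{ print $1, $2 }'   # normalise to "COUNT HOST"
}
```

Usage would be `timeouts_per_host /var/log/messages`. If the counts turn out to be dominated by VMs on one Proxmox host or one network segment, that narrows the search considerably.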

Post by **tek0tron** » Tue Feb 04, 2020 10:03 pm

This just happened as I was checking the logs on the Nagios VM. I could ping the remote server, but Nagios itself showed the server and its services as down due to a timeout. It was back up in 35 or so seconds, just as I tested ping and telnet from the Nagios VM to the remote host (name changed to someserver.net in the example below).

Feb 4 21:59:20 nagiosrv nagios: wproc: host=someserver.net; service=(null);
Feb 4 21:59:20 nagiosrv nagios: Warning: Check of host 'someserver.net' timed out after 30.01 seconds
Feb 4 21:59:20 nagiosrv nagios: HOST ALERT: someserver.net;DOWN;SOFT;1;(Host check timed out after 30.01 seconds)
Feb 4 21:59:54 nagiosrv nagios: HOST ALERT: someserver.net;UP;SOFT;1;PING OK - Packet loss = 0%, RTA = 0.73 ms

Post by **tgriep** » Wed Feb 05, 2020 4:49 pm

What version of the check_nrpe plugin is installed on the Nagios server?
What version of the NRPE agent is installed on the remote system?

For the host timeout, what plugin are you using for the host check?
It looks like the check_ping plugin but I want to confirm it.
What version is it?
How is it configured?

Any chance of upgrading to Core 4.4.5?

Post by **tek0tron** » Wed Feb 05, 2020 10:42 pm

Hi tgriep,

Nagios server:
check_nrpe
NRPE Plugin for Nagios
Version: 3.2.1

Remote server:
nrpe-3.2.1-8.el7.x86_64

For the host timeout, what plugin are you using for the host check?
==> This is with check_ping defined as a command. check_ping v2.3.1 (nagios-plugins 2.3.1)
define command {
    command_name check-host-alive
    command_line $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5
}

Any chance of upgrading to Core 4.4.5?
==> This would require us to compile from source. So far, we have used rpm/yum repositories to install, update and maintain the Nagios Core installation. It can be explored, though, if this is the only alternative.

Thanks

Post by **tgriep** » Thu Feb 06, 2020 11:07 am

Try running a continuous ping to a host that times out the most and pipe it to a file.
Wait for a timeout Alert in Nagios to happen and see if the continuous ping fails around the same time.

Sample script to use and when you see the issue, stop the loop and check the ping.txt file.

#!/bin/bash

while :
do
   date >> ping.txt
   ping -c 1 xxx.xxx.xxx.xxx >>ping.txt
   sleep 1
done

See this link that talks about the check_ping plugin and the failure you are seeing.
https://github.com/nagios-plugins/nagio ... issues/419

Another suggestion is to use the check_icmp plugin instead of check_ping.
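
Once the ping loop above has been running for a while, ping.txt gets long. A small filter can pull out just the timestamps where the ping failed, ready to compare against the Nagios alert times. This is a sketch under a few assumptions: the helper name `failed_pings` is made up, `date` is printing in its default English format (for example a line starting with "Thu Feb"), and ping is the Linux iputils version that reports "100% packet loss" when the single packet is lost.

```shell
# Hypothetical helper: print the timestamp of every failed ping in the
# file written by the loop above (default: ping.txt).
# A line starting with "Day Mon " is taken to be a `date` stamp; the
# stamp is printed when the ping summary that follows it reports
# 100% packet loss.
failed_pings() {
    awk '
        /^[A-Z][a-z][a-z] [A-Z][a-z][a-z] / { stamp = $0; next }
        /100% packet loss/                  { print stamp }
    ' "${1:-ping.txt}"
}
```

Usage would be `failed_pings ping.txt`. No output means every ping in the file got its reply, which would suggest the timeouts come from the plugin or from Nagios rather than from packet loss.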
