
A lot of Service check timed out.

Posted: Thu Aug 22, 2019 3:08 pm
by paulol
Hi guys,

I'm getting a lot of service check timeouts on multiple services...

I have 2 machines for Nagios XI:

1. Nagios XI 5.6.5
(CentOS 7.6) VM(HyperV)
20GB Memory
18 VCPUs
SSD

2. Nagios XI database
(CentOS 7.6 MySQL 5.7) VM(HyperV)
12GB Memory
8 VCPUs
SSD

I have already read the following documents, but neither of them fixed the problem...
https://assets.nagios.com/downloads/nag ... ios-XI.pdf
https://assets.nagios.com/downloads/nag ... Server.pdf

Re: A lot of Service check timed out.

Posted: Thu Aug 22, 2019 4:07 pm
by mbellerue
What does the networking look like on your VMs and physical host?

Re: A lot of Service check timed out.

Posted: Fri Aug 23, 2019 8:19 am
by paulol
What does the networking look like on your VMs and physical host?

They are on the same physical host but in separate networks.

The Nagios XI is in 10.0.74.xx
The MySQL DB is in 10.0.64.xx

But they used to be on the same server (VM), and the problem was happening the same way. After reading https://assets.nagios.com/downloads/nag ... Server.pdf, I separated them into two VM servers...

Re: A lot of Service check timed out.

Posted: Fri Aug 23, 2019 9:33 am
by bheden
What happens if you run one of the plugins from the command line on the XI server?

Are you able to ping the hosts you're attempting to check from the XI server?
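
If it helps, here is a rough sketch of one way to time a plugin run from the command line and see whether it ever gets near the timeout. It is only an illustration, not an official Nagios tool: the function name `time_check` is made up, and the plugin path in the commented example is an assumption that you would replace with the real plugin and arguments for one of the failing services.

```python
import subprocess
import time


def time_check(argv, timeout=60.0):
    """Run a check command and return (exit_code, elapsed_seconds, first_output_line).

    exit_code is None when the command ran longer than `timeout` seconds.
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
        code, output = proc.returncode, proc.stdout.strip()
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this.
        code, output = None, "timed out after %.0f s" % timeout
    except FileNotFoundError:
        # 3 is the conventional plugin exit code for UNKNOWN.
        code, output = 3, "command not found: %s" % argv[0]
    elapsed = time.monotonic() - start
    first_line = output.splitlines()[0] if output else ""
    return code, elapsed, first_line


# Example (assumed plugin path and arguments, adjust to your install):
# for _ in range(10):
#     print(time_check(["/usr/local/nagios/libexec/check_ping",
#                       "-H", "127.0.0.1", "-w", "100,20%", "-c", "500,60%"]))
```

Running the same check several times in a row and comparing the elapsed times should show whether the slowness is constant or only happens now and then.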

Re: A lot of Service check timed out.

Posted: Fri Aug 23, 2019 10:35 am
by mbellerue
So they're on separate logical networks, but are all of the VMs using the same NIC to access the physical network? I'm wondering if the physical NIC on the host is being overloaded with requests. Do all of the VMs on the host communicate through the same NIC? Is it a 1 gigabit NIC, or a 10 gigabit NIC? Is there a team of NICs that they use to help deal with the load?

Also, have the number of timed out checks increased recently, or has this been a problem for some time?

Re: A lot of Service check timed out.

Posted: Fri Aug 23, 2019 12:12 pm
by paulol
bheden, thanks for the reply...

Answering your questions...

What happens if you run one of the plugins from the command line on the XI server?
The plugins are executing normally, but sometimes I get "Service check timed out after 60.01 seconds".

Are you able to ping the hosts you're attempting to check from the XI server?
Yes, I am.

Re: A lot of Service check timed out.

Posted: Fri Aug 23, 2019 12:44 pm
by paulol
mbellerue,

Are all of the VMs using the same NIC to access the physical network?
Yes, they are...

I'm wondering if the physical NIC on the host is being overloaded with requests.
I asked my coworker to verify whether the physical NIC is being overloaded. He said it is impossible for that machine to use 20 gigabit. He showed me the network load graph and it was normal...

Is it a 1 gigabit NIC, or a 10 gigabit NIC?
The physical server has a NIC team with two 10 gigabit NICs, so the NIC team has 20 gigabit of capacity.

Do all of the VMs on the host communicate through the same NIC?
Yes, in the same NIC team...

Let me explain better...
We bought some new servers and we are moving our VM environment to the new ones.
Nagios was on CentOS 6.9, so I needed to install Nagios from scratch on CentOS 7.6. So I formatted the server and installed it on CentOS 7.6.

I have used Nagios XI for 3 years and I have never seen this problem...

Re: A lot of Service check timed out.

Posted: Fri Aug 23, 2019 1:00 pm
by mbellerue
20 gigabit is definitely a lot of bandwidth. Could you PM me your system profile? I will take a look at it on this side and see if I can find out why these are timing out.

Could you also send the output of dmesg along with the system profile?

Re: A lot of Service check timed out.

Posted: Mon Aug 26, 2019 3:26 pm
by mbellerue
Okay, we've looked over the profile a little bit, and we've come up with a few more questions.

First, could you also send me the database logs from the database server?

Next, we're seeing a lot of "SSL handshake failed" messages. We've tied a few of them to Windows hosts. Are you able to find out if this issue is only happening to Windows hosts? If it's only happening to Windows hosts, maybe an update to NSClient++ would help.

Those are the two big ones. We're going to continue to look at the profile to see if there's anything else to be found.
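
One way to check whether the handshake failures are limited to Windows hosts is to count them per host in the Nagios log. The sketch below is only an illustration: it assumes the usual Nagios Core log line layout `[timestamp] SERVICE ALERT: host;service;state;type;attempt;output`, the function name `hosts_with_phrase` is made up, and the exact wording of the message in your log may differ from the default phrase used here.

```python
import re
from collections import Counter

# Assumed layout: "[1566500000] SERVICE ALERT: host;service;state;type;attempt;output"
ALERT_RE = re.compile(r"^\[\d+\] SERVICE ALERT: ([^;]+);")


def hosts_with_phrase(lines, phrase="SSL handshake failed"):
    """Count log lines containing `phrase`, grouped by host name.

    Lines that contain the phrase but do not match the assumed
    SERVICE ALERT layout are counted under "(unparsed)".
    """
    counts = Counter()
    for line in lines:
        if phrase not in line:
            continue
        match = ALERT_RE.match(line)
        counts[match.group(1) if match else "(unparsed)"] += 1
    return counts


# Example (the log path is an assumption, adjust to your install):
# with open("/usr/local/nagios/var/nagios.log") as log:
#     for host, n in hosts_with_phrase(log).most_common():
#         print(host, n)
```

Comparing the resulting host list against the list of Windows hosts should show whether any non-Windows hosts are affected.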

Re: A lot of Service check timed out.

Posted: Fri Aug 30, 2019 8:33 am
by paulol
We are already on the latest version of NSClient++ (5.2) on almost all servers. Those messages come from some servers that don't have NSClient installed yet.

I think it is something related to NRPE. I have set the NRPE command timeout to 50 seconds and I see only "NRPE timeout error" in the Nagios event log.

Support edit: Shared 1567171962-eventlog.pdf and mysqld.log with team.