A lot of Service check timed out.

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
paulol
Posts: 159
Joined: Wed Jul 02, 2014 11:39 am

A lot of Service check timed out.

Post by paulol »

Hi guys,

I'm getting a lot of service check timeouts on multiple services...

I have 2 machines for Nagios XI:

1. Nagios XI 5.6.5
(CentOS 7.6) VM (Hyper-V)
20GB Memory
18 VCPUs
SSD

2. Nagios XI database
(CentOS 7.6, MySQL 5.7) VM (Hyper-V)
12GB Memory
8 VCPUs
SSD

I have already read the following documents, but neither of them fixed the problem...
https://assets.nagios.com/downloads/nag ... ios-XI.pdf
https://assets.nagios.com/downloads/nag ... Server.pdf
mbellerue
Posts: 1403
Joined: Fri Jul 12, 2019 11:10 am

Re: A lot of Service check timed out.

Post by mbellerue »

What does the networking look like on your VMs and physical host?
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.

Be sure to check out our Knowledgebase for helpful articles and solutions!
paulol
Posts: 159
Joined: Wed Jul 02, 2014 11:39 am

Re: A lot of Service check timed out.

Post by paulol »

What does the networking look like on your VMs and physical host?

They are on the same physical host but on separate networks.

The Nagios XI is in 10.0.74.xx
The MySQL DB is in 10.0.64.xx

They used to be on the same server (VM), and the problem happened the same way. After reading https://assets.nagios.com/downloads/nag ... Server.pdf I separated them into two VMs...
bheden
Product Development Manager
Posts: 179
Joined: Thu Feb 13, 2014 9:50 am
Location: Nagios Enterprises

Re: A lot of Service check timed out.

Post by bheden »

What happens if you run one of the plugins from the command line on the XI server?

Are you able to ping the hosts you're attempting to check from the XI server?
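
For anyone following along: because the timeouts here are intermittent, a single manual run may not reproduce them. Below is a minimal Python sketch, not an official Nagios tool, that runs the same plugin command several times and reports how many runs exceeded a timeout and how long the slowest one took. The function names `time_command` and `summarize` are made up for this example, and the 60-second default simply mirrors the "timed out after 60.01 seconds" message reported in this thread.

```python
import subprocess
import time


def time_command(argv, timeout_s=60.0):
    """Run argv once; return (elapsed_seconds, exit_code), exit_code is None on timeout."""
    start = time.monotonic()
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
        code = proc.returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left running.
        code = None
    return time.monotonic() - start, code


def summarize(argv, runs=5, timeout_s=60.0):
    """Run the same command several times and summarize timeouts and the slowest run."""
    results = [time_command(argv, timeout_s) for _ in range(runs)]
    timeouts = sum(1 for _, code in results if code is None)
    slowest = max(elapsed for elapsed, _ in results)
    return {"runs": runs, "timeouts": timeouts, "slowest_s": slowest}
```

Pass the exact plugin command line you would type in the shell as a list of arguments, for example `summarize([plugin_path, "-H", host_address], runs=20)`, where `plugin_path` and `host_address` are placeholders for your own values. Running it as the same user that Nagios uses to execute checks should give the most comparable result.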

Nagios Enterprises
Senior Developer
mbellerue
Posts: 1403
Joined: Fri Jul 12, 2019 11:10 am

Re: A lot of Service check timed out.

Post by mbellerue »

So they're on separate logical networks, but are all of the VMs using the same NIC to access the physical network? I'm wondering if the physical NIC on the host is being overloaded with requests. Do all of the VMs on the host communicate through the same NIC? Is it a 1 gigabit NIC, or a 10 gigabit NIC? Is there a team of NICs that they use to help deal with the load?

Also, have the number of timed out checks increased recently, or has this been a problem for some time?
paulol
Posts: 159
Joined: Wed Jul 02, 2014 11:39 am

Re: A lot of Service check timed out.

Post by paulol »

bheden, thanks for the reply...

Answering your questions...

What happens if you run one of the plugins from the command line on the XI server?
The plugins are executing normally, but sometimes I get "Service check timed out after 60.01 seconds".

Are you able to ping the hosts you're attempting to check from the XI server?
Yes, I am.
paulol
Posts: 159
Joined: Wed Jul 02, 2014 11:39 am

Re: A lot of Service check timed out.

Post by paulol »

mbellerue,

Are all of the VMs using the same NIC to access the physical network?
Yes, they are...

I'm wondering if the physical NIC on the host is being overloaded with requests.
I asked my coworker to verify whether the physical NIC is being overloaded. He said it is impossible for that machine to use 20 gigabits... He showed me the network load graph, and it was normal...

Is it a 1 gigabit NIC, or a 10 gigabit NIC?
The physical server has a NIC team with two 10-gigabit NICs, so the NIC team has 20 gigabits of capacity.

Do all of the VMs on the host communicate through the same NIC?
Yes, through the same NIC team...

Let me explain in more detail...
We bought some new servers, and we are moving our VM environment to the new ones.
Nagios was on CentOS 6.9, so I needed to install Nagios from scratch on CentOS 7.6. I formatted the server and installed it on CentOS 7.6.

I have used Nagios XI for 3 years, and I have never seen this problem...
mbellerue
Posts: 1403
Joined: Fri Jul 12, 2019 11:10 am

Re: A lot of Service check timed out.

Post by mbellerue »

20 gigabit is definitely a lot of bandwidth. Could you PM me your system profile? I will take a look at it on this side and see if I can find out why these are timing out.

Could you also send the output of dmesg along with the system profile?
mbellerue
Posts: 1403
Joined: Fri Jul 12, 2019 11:10 am

Re: A lot of Service check timed out.

Post by mbellerue »

Okay, we've looked over the profile a little bit, and we've come up with a few more questions.

First, could you also send me the database logs from the database server?

Next, we're seeing a lot of "SSL handshake failed" messages. We've tied a few of them to Windows hosts. Are you able to find out if this issue is only happening to Windows hosts? If it's only happening to Windows hosts, maybe an update to NSClient++ would help.

Those are the two big ones. We're going to continue to look at the profile to see if there's anything else to be found.
paulol
Posts: 159
Joined: Wed Jul 02, 2014 11:39 am

Re: A lot of Service check timed out.

Post by paulol »

We are already on the latest version of NSClient++, 5.2, on almost all servers. Those messages come from some servers that don't have NSClient++ installed yet.

I think it is something related to NRPE. I have set the NRPE command timeout to 50 seconds, and now I see only "NRPE timeout error" in the Nagios event log.
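
To get a rough sense of which message dominates, something like the following Python sketch could tally the messages mentioned in this thread across exported event log lines. It is only an illustration: `count_messages` and `PATTERNS` are names invented here, the match is a plain case-insensitive substring test, and it assumes the event log has been saved as text with one entry per line, which may not match how your export looks.

```python
from collections import Counter

# Message fragments quoted earlier in this thread.
PATTERNS = ("Service check timed out", "NRPE timeout", "SSL handshake failed")


def count_messages(lines, patterns=PATTERNS):
    """Count how many lines contain each pattern (case-insensitive substring match)."""
    counts = Counter({p: 0 for p in patterns})
    for line in lines:
        lowered = line.lower()
        for p in patterns:
            if p.lower() in lowered:
                counts[p] += 1
    return counts
```

With a text export, usage would look like `count_messages(open(path, encoding="utf-8", errors="replace"))`, where `path` is wherever you saved the log. If the NRPE timeouts far outnumber the SSL handshake failures, that would support treating them as separate issues.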

Support edit: Shared 1567171962-eventlog.pdf and mysqld.log with team.
Locked