A lot of Service check timed out.

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
mbellerue
Posts: 1403
Joined: Fri Jul 12, 2019 11:10 am

Re: A lot of Service check timed out.

Post by mbellerue »

Alright, first and foremost, can you run these commands and then see if the number of timeouts decreases?

Code: Select all

systemctl stop crond
systemctl stop npcd
systemctl stop nagios
systemctl stop ndo2db
pkill -9 -u nagios
for i in $(ipcs -q | grep nagios |awk '{print $2}'); do ipcrm -q $i; done
rm -rf /usr/local/nagiosxi/var/dbmaint.lock
rm -rf /usr/local/nagiosxi/var/event_handler.lock
rm -rf /usr/local/nagiosxi/scripts/reconfigure_nagios.lock
rm -rf /usr/local/nagios/var/ndo2db.lock
rm -rf /usr/local/nagios/var/ndo2db.pid
rm -rf /usr/local/nagios/var/ndo2db.sock
rm -rf /usr/local/nagios/var/ndo.sock
rm -rf /usr/local/nagiosxi/var/subsys/ndo2db
rm -rf /var/run/nagios.lock
rm -rf /usr/local/nagios/var/nagios.lock
systemctl start ndo2db
systemctl start nagios
systemctl start npcd
systemctl start crond
systemctl restart httpd
systemctl restart snmptt
If the timeouts persist, I will need the following information from you:

Your /etc/hosts file
The version of MySQL or MariaDB running on the database server
The output of this command:

Code: Select all

rpm -qa | grep "nagios\|httpd\|php"
I was also wondering why the database was offloaded. Was it in response to these timeouts?
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.

Be sure to check out our Knowledgebase for helpful articles and solutions!
paulol
Posts: 159
Joined: Wed Jul 02, 2014 11:39 am

Re: A lot of Service check timed out.

Post by paulol »

The problem continues...

/etc/hosts file

Code: Select all

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
MySQL Version

Code: Select all

mysql-community-common-5.7.26-1.el7.x86_64
mysql-community-libs-compat-5.7.26-1.el7.x86_64
mysql-community-libs-5.7.26-1.el7.x86_64
mysql-community-server-5.7.26-1.el7.x86_64
mysql-community-devel-5.7.26-1.el7.x86_64
mysql-community-client-5.7.26-1.el7.x86_64
rpm -qa | grep "nagios\|httpd\|php"

Code: Select all

php-devel-5.4.16-46.el7.x86_64
php-process-5.4.16-46.el7.x86_64
php-imap-5.4.16-9.el7.x86_64
php-mssql-5.4.16-9.el7.x86_64
httpd-2.4.6-89.el7.centos.x86_64
php-cli-5.4.16-46.el7.x86_64
php-mbstring-5.4.16-46.el7.x86_64
php-pdo-5.4.16-46.el7.x86_64
php-mysql-5.4.16-46.el7.x86_64
php-snmp-5.4.16-46.el7.x86_64
php-pear-1.9.4-21.el7.noarch
php-5.4.16-46.el7.x86_64
nagios-repo-7-3.el7.noarch
php-common-5.4.16-46.el7.x86_64
php-ldap-5.4.16-46.el7.x86_64
php-xml-5.4.16-46.el7.x86_64
php-gd-5.4.16-46.el7.x86_64
php-pgsql-5.4.16-46.el7.x86_64
php-pecl-ssh2-0.12-1.el7.x86_64
httpd-tools-2.4.6-89.el7.centos.x86_64
mbellerue wrote:I was also wondering why the database was offloaded. Was it in response to these timeouts?
Yes, it was. The same error occurred when the database was on the same server, and I offloaded the database to try to improve performance.
mbellerue
Posts: 1403
Joined: Fri Jul 12, 2019 11:10 am

Re: A lot of Service check timed out.

Post by mbellerue »

Alright, so the software versions look good. Let's try running the repair script on the databases.

On the Nagios XI server, run:

Code: Select all

/usr/local/nagiosxi/scripts/repair_databases.sh
That may take a moment to run. So let's gather a couple of other things as well.

If you could PM me a copy of the nsclient.ini file from one of your Windows machines, that would be great.
Once the database repair script has finished, run the following query to get the sizes of the tables:

Code: Select all

SELECT TABLE_NAME AS `Table`, ROUND((DATA_LENGTH + INDEX_LENGTH) / 1024 / 1024) AS `Size (MB)` FROM   information_schema.TABLES WHERE TABLE_SCHEMA in ("nagiosql", "nagios", "nagiosxi") ORDER BY (DATA_LENGTH + INDEX_LENGTH) DESC;
Also, once the repair script is finished, let the system just run for about an hour, and then let's get a fresh profile from Nagios XI, and a fresh copy of the database logs.
paulol
Posts: 159
Joined: Wed Jul 02, 2014 11:39 am

Re: A lot of Service check timed out.

Post by paulol »

Attached are the nsclient.ini file, the table sizes, and the output of the database repair script.

Support Edit: Files downloaded and shared with team
mbellerue
Posts: 1403
Joined: Fri Jul 12, 2019 11:10 am

Re: A lot of Service check timed out.

Post by mbellerue »

Thank you for those files! Could we also get another copy of your mysqld.log file?

One thing we noticed was that the message queues were a little high. You could try working through this KB article to increase the queues.

https://support.nagios.com/kb/article/n ... d-139.html
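
A rough way to see how full a queue is, as a sketch only: the helper below (the name `queue_fill` is hypothetical) reads `ipcs -q` output on stdin and prints each queue's used bytes as a percentage of a limit passed as the first argument. It assumes the Linux `ipcs -q` column order (key, msqid, owner, perms, used-bytes, messages) and that the per-queue byte limit in /proc/sys/kernel/msgmnb is the relevant ceiling, which may not be the only setting the KB article covers.

```shell
# Hypothetical helper: report how full each System V message queue is.
# Reads `ipcs -q` output on stdin; $1 is the per-queue byte limit to compare against.
queue_fill() {
    limit="$1"
    # Data rows start with a hex key such as 0x40000002; header rows are skipped.
    awk -v limit="$limit" '$1 ~ /^0x/ {
        printf "%s %s %d%%\n", $2, $3, ($5 * 100) / limit
    }'
}
```

It could be run as `ipcs -q | queue_fill "$(cat /proc/sys/kernel/msgmnb)"`. Each output line is the queue ID, the owner, and the percentage of the limit in use.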
paulol
Posts: 159
Joined: Wed Jul 02, 2014 11:39 am

Re: A lot of Service check timed out.

Post by paulol »

I've made the change, but the problem continues...

[root@TREVOUX var]# ipcs -q

Code: Select all

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0x40000002 2392064    nagios     600        25600        25
[root@TREVOUX var]# tail -f nagios.log | grep timeout

Code: Select all

[1568379101] SERVICE ALERT: SVCENSURA_ULA;Counter PhysicalDisk_Total - Disk Writes/sec;UNKNOWN;SOFT;1;CHECK_NRPE STATE UNKNOWN: Socket timeout after 50 seconds.
[1568379189] SERVICE ALERT: SAEBY;Counter PhysicalDisk_Total - Disk Reads/sec;UNKNOWN;SOFT;1;CHECK_NRPE STATE UNKNOWN: Socket timeout after 50 seconds.
[1568379199] SERVICE ALERT: ARRAS;Qtd. de canais ativos locais;UNKNOWN;HARD;1;CHECK_NRPE STATE UNKNOWN: Socket timeout after 50 seconds.
[1568379269] SERVICE ALERT: BOLONHA;Counter PhysicalDisk_Total - Idle Time;UNKNOWN;SOFT;1;CHECK_NRPE STATE UNKNOWN: Socket timeout after 50 seconds.
[1568379286] SERVICE ALERT: LILLE;Sincronizacao de Horario;UNKNOWN;SOFT;1;CHECK_NRPE STATE UNKNOWN: Socket timeout after 50 seconds.
[1568379325] SERVICE ALERT: TARANTO;Memoria;UNKNOWN;HARD;1;CHECK_NRPE STATE UNKNOWN: Socket timeout after 50 seconds.
[1568379334] SERVICE ALERT: ANTAKYA;Counter PhysicalDisk_Total - Disk Reads/sec;UNKNOWN;SOFT;1;CHECK_NRPE STATE UNKNOWN: Socket timeout after 50 seconds.
[1568379357] SERVICE ALERT: NANCY;/tmp Disk Usage;UNKNOWN;SOFT;1;CHECK_NRPE STATE UNKNOWN: Socket timeout after 50 seconds.
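
To see whether the timeouts cluster on particular hosts, log lines like these can be tallied per host. This is a minimal sketch: the function name `count_nrpe_timeouts` is hypothetical, and it assumes every relevant line follows the `SERVICE ALERT: host;service;state;type;attempt;output` layout.

```shell
# Hypothetical helper: count NRPE socket timeouts per host.
# Reads Nagios log lines on stdin; assumes the
# "SERVICE ALERT: host;service;state;type;attempt;output" layout.
count_nrpe_timeouts() {
    awk -F';' '/SERVICE ALERT:/ && /Socket timeout/ {
        host = $1
        sub(/^.*SERVICE ALERT: /, "", host)   # strip the timestamp and label
        count[host]++
    }
    END { for (h in count) print count[h], h }' | sort -rn
}
```

For example, `count_nrpe_timeouts < nagios.log` run from the Nagios var directory prints one line per host, highest count first. An even spread across many hosts would point away from any single agent and toward something shared, such as the network path.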
mbellerue
Posts: 1403
Joined: Fri Jul 12, 2019 11:10 am

Re: A lot of Service check timed out.

Post by mbellerue »

Is that the mysqld.log file from the offloaded server? The entries in there stop at 8/22.
paulol
Posts: 159
Joined: Wed Jul 02, 2014 11:39 am

Re: A lot of Service check timed out.

Post by paulol »

I found the problem.

It was my network firewall. I don't know why, but sometimes the firewall's IPS starts blocking traffic that matches certain signatures (TCP.Out.Of.Range.Timestamp, SSL.Anonymous.Ciphers.Negotiation and TCP.Overlapping.Fragments).

Sometimes when NRPE checks services or hosts, the network firewall thinks that it's an attack coming from Nagios XI, maybe because of the number of checks coming from one place.

Thanks for the help.
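
For anyone chasing a similar intermittent block, repeated connection attempts from the Nagios XI server to an agent can show whether something in between is dropping traffic. The sketch below is only illustrative: the function name `probe_nrpe` is hypothetical, it assumes `bash` and `timeout` are available, and it uses 5666 as the NRPE port, which is the usual default but may differ on your agents. A bare TCP connect does not perform the NRPE/SSL exchange, so it may not trigger the same IPS signatures. It only shows whether connections are being dropped.

```shell
# Hypothetical helper: try a TCP connection to an NRPE agent several times
# and count how many attempts fail. 5666 is assumed as the NRPE port.
probe_nrpe() {
    host="$1"; attempts="${2:-10}"; port="${3:-5666}"
    failed=0
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # bash's /dev/tcp opens a plain TCP connection; give up after 5 seconds.
        if ! timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null; then
            failed=$((failed + 1))
        fi
        i=$((i + 1))
    done
    echo "$failed of $attempts connection attempts to $host:$port failed"
    # Succeed only if every attempt connected.
    [ "$failed" -eq 0 ]
}
```

Running `probe_nrpe <agent-host> 20` while timeouts are occurring, and comparing the result with the firewall's IPS log for the same period, would help tie the failures to the firewall.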
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: A lot of Service check timed out.

Post by scottwilkerson »

paulol wrote:I found the problem.

Glad to hear you found it!

Locking thread