NXI 5.5.2 - Host check timeout on Worker

This board serves as an open discussion and support collaboration point for Nagios XI. NOTE: Nagios XI customers should use the Customer Support forum to obtain expedited support.

Postby Aezox » Thu Jun 13, 2019 4:17 am

Hello everyone,

For the past few days we have been seeing monitoring checks fail with one of these error states: "Host Check Timeout on Worker" or "CHECK_NRPE: Socket timeout after 45 seconds".

We are running Nagios XI 5.5.2 along with 5 MG_2 workers

NXI Server :
Code: Select all
nagiosxi-nrds-5.5.2-1.el6.x86_64
nagiosxi-5.5.2-1.el6.x86_64
nagiosxi-wkhtmltox-5.5.2-1.el6.x86_64
nagiosxi-nsca-5.5.2-1.el6.x86_64
nagiosxi-pnp-5.5.2-1.el6.x86_64
nagiosxi-shellinabox-5.5.2-1.el6.x86_64
nagiosxi-nxti-5.5.2-1.el6.x86_64
nagiosxi-nagioscore-5-4.13.el6.x86_64
nagiosxi-nrpe-5.5.2-1.el6.x86_64
nagiosxi-nagiosmobile-5.5.2-1.el6.x86_64
nagiosxi-mrtg-5.5.2-1.el6.x86_64
nagiosxi-nagvis-5.5.2-1.el6.x86_64
nagiosxi-wmic-5.5.2-1.el6.x86_64
nagiosxi-ndoutils-5.5.2-1.el6.x86_64
nagiosxi-nagiosplugins-5.5.2-1.el6.x86_64


MG worker servers :
Code: Select all
mod_gearman2-2.1.1-1.el6.x86_64


While digging into the errors I found that:
- checks run on the client side work 100% of the time
- checks run from the NXI server through PuTTY work 100% of the time
- checks run from the MG server through PuTTY fail one time out of two
- checks run from the NXI console in the web browser fail one time out of two

MG worker.conf :
Code: Select all
# Default job timeout in seconds. Currently this value is only used for
# eventhandler. The worker will use the values from the core for host and
# service checks.
job_timeout=120

# Minimum number of worker processes which should
# run at any time.
min-worker=25

# Maximum number of worker processes which should
# run at any time. You may set this equal to
# min-worker setting to disable dynamic starting of
# workers. When setting this to 1, all services from
# this worker will be executed one after another.
max-worker=200

# Time after which an idling worker exists
# This parameter controls how fast your waiting workers will
# exit if there are no jobs waiting.
idle-timeout=30

# Controls the amount of jobs a worker will do before he exits
# Use this to control how fast the amount of workers will go down
# after high load times
max-jobs=1000


I think my worker is overloaded, but you might have a clue that could help me out :)

Regards
Aezox
 
Posts: 21
Joined: Fri Feb 09, 2018 9:31 am

Re: NXI 5.5.2 - Host check timeout on Worker

Postby ssax » Thu Jun 13, 2019 4:37 pm

Is it an external gearman worker that's failing or is it the local gearman worker (on the XI server)?

Is it all workers or just some?

What version of nagios core is it running on the XI system?

Code: Select all
/usr/local/nagios/bin/nagios -V


If it's higher than 4.2.4 then you are REQUIRED to upgrade mod_gearman (server and workers) and Nagios Core to at least 4.4.3 following this guide:

Code: Select all
https://assets.nagios.com/downloads/nagiosxi/docs/Integrating_Mod_Gearman_with_Nagios_XI.pdf


Please send your logs from here:

Code: Select all
/var/log/gearmand/
Be sure to check out our Knowledgebase for helpful articles and solutions!
ssax
Dreams In Code
 
Posts: 4108
Joined: Wed Feb 11, 2015 12:54 pm

Re: NXI 5.5.2 - Host check timeout on Worker

Postby Aezox » Fri Jun 14, 2019 2:48 am

@ssax

It is the output on the XI server console (in the web browser) that is failing. The local check from the worker via PuTTY works fine.

Testing from worker:
worker_check.png
worker_check.png (3.49 KiB) Viewed 268 times

Nagios XI output for the same check:
worker_timeout.png
worker_timeout.png (4.15 KiB) Viewed 268 times


Only one Gearman server out of 3 is affected, and only 4 of the 833 hosts running on that Gearman server.
EDIT: I've just found that I have 30+ services in error with the error type "Service Check Timed Out On Worker", spread across the 3 running Gearman servers.

We are running Nagios Core 4.2.4

There have been no new entries in the gearmand logs for a month, and the issue began 2 days ago.

I might have a clue regarding a difference between : host_check_timeout vs worker_check_timeout (terms may not be accurate)

Re: NXI 5.5.2 - Host check timeout on Worker

Postby ssax » Fri Jun 14, 2019 2:11 pm

Aezox wrote:I might have a clue regarding a difference between : host_check_timeout vs worker_check_timeout (terms may not be accurate)


That may be the case, good point!

What is the timeout set to (or is there even a timeout set) for:

Code: Select all
/usr/local/nagios/etc/nagios.cfg


These:

Code: Select all
host_check_timeout
service_check_timeout


On the check command, edit the host/service and click the Run Check Command button and send us the entire command with arguments (I'm looking to see if there's a timeout on the command).

Then compare the timeouts in your gearman worker.conf file from the worker having issues.

How long does the check run from the worker take from the CLI manually?

Code: Select all
time /full/command -with arguments
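
To repeat that timing over several runs rather than a single `time` call, something like the following could be used. This is only a minimal sketch, assuming Python 3 is available on the worker. The helper name `time_check` and the placeholder command are hypothetical and are not part of Nagios or Mod-Gearman.

```python
import subprocess
import sys
import time


def time_check(argv, timeout_s):
    """Run a command once; return (elapsed_seconds, exit_code or None on timeout)."""
    start = time.monotonic()
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
        code = proc.returncode
    except subprocess.TimeoutExpired:
        code = None  # the command did not finish within timeout_s
    return time.monotonic() - start, code


if __name__ == "__main__":
    # Placeholder command so the sketch runs anywhere; replace it with the
    # full check command and arguments shown by Run Check Command.
    cmd = [sys.executable, "-c", "print('placeholder check')"]
    for attempt in range(3):
        elapsed, code = time_check(cmd, timeout_s=30)
        print(f"run {attempt + 1}: exit={code} elapsed={elapsed:.3f}s")
```

A run that reports `exit=None` did not finish within the limit passed as `timeout_s`, which is the case worth comparing against the timeouts configured on the core and the worker.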

Re: NXI 5.5.2 - Host check timeout on Worker

Postby Aezox » Mon Jun 17, 2019 3:07 am

nagios.cfg :
Code: Select all
host_check_timeout=30
service_check_timeout=60


worker.conf :
Code: Select all
job_timeout=120
idle-timeout=30

These are the only timeouts I found in the worker.conf file.
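
For comparing the two files side by side, the `key=value` lines can be pulled out with a few lines of Python. This is a rough sketch only: the function name `read_settings` is hypothetical, and the sample text is just the values posted above rather than the full files.

```python
def read_settings(text, wanted):
    """Return {name: int} for the wanted key=value lines, skipping comments."""
    found = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        if key in wanted:
            found[key] = int(value.strip())
    return found


# Values as posted in this thread, not the complete files.
nagios_cfg = "host_check_timeout=30\nservice_check_timeout=60\n"
worker_conf = "job_timeout=120\nidle-timeout=30\n"

core = read_settings(nagios_cfg, {"host_check_timeout", "service_check_timeout"})
worker = read_settings(worker_conf, {"job_timeout", "idle-timeout"})
print(core)
print(worker)
```

To use it on a real system, the contents of nagios.cfg and worker.conf would be read from disk and passed in as `text`.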

We use host templates to run host checks. For the error below, this is the check_command used:
Code: Select all
$USER1$/check_tcp -H $HOSTADDRESS$ -p $ARG1$ -t 30


Run Check Command from the host template configuration returns OK:
Code: Select all
[nagios@*********** ~]$ /usr/local/nagios/libexec/check_tcp -H ************** -p 443 -t 30
TCP OK - 0.058 second response time on *********** port 443|time=0.057710s;;;0.000000;30.000000


While Nagios output in browser says :
Code: Select all
Advanced Status Details
Host State:   Down
Duration:   5d 0h 3m 8s
State Type:   Hard
Current Check:   2 of 2
Last Check:   17/06/2019 09:56:47
Next Check:   17/06/2019 09:59:15
Last State Change:   12/06/2019 09:56:20
Last Notification:   16/06/2019 10:03:08
Check Type:   Active
Check Latency:   0.72588 seconds
Execution Time:   30.00018 seconds
State Change:   0%
Performance Data:   
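
One thing the numbers above suggest: the reported Execution Time (30.00018 seconds) sits right at host_check_timeout=30 and at the plugin's -t 30, while the manual run took about 0.058 seconds. A minimal sketch of that comparison follows. The function name `hit_limit` and the 0.05 second tolerance are assumptions made for illustration, not anything Nagios defines.

```python
def hit_limit(execution_time_s, limit_s, tolerance_s=0.05):
    """True when a check ran for at least the limit, within a small tolerance."""
    return execution_time_s >= limit_s - tolerance_s


execution_time = 30.00018   # "Execution Time" from the status details
host_check_timeout = 30     # nagios.cfg
plugin_timeout = 30         # -t 30 on check_tcp
manual_run = 0.057710       # time= perfdata from the manual run

print(hit_limit(execution_time, host_check_timeout))
print(hit_limit(execution_time, plugin_timeout))
print(hit_limit(manual_run, plugin_timeout))
```

Because both limits are 30 seconds, this comparison alone cannot tell which of the two fired first.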

Re: NXI 5.5.2 - Host check timeout on Worker

Postby scottwilkerson » Mon Jun 17, 2019 8:28 am

Aezox wrote:Run Check Command from the host template configuration returns OK:


Code: Select all
[nagios@*********** ~]$ /usr/local/nagios/libexec/check_tcp -H ************** -p 443 -t 30
    TCP OK - 0.058 second response time on *********** port 443|time=0.057710s;;;0.000000;30.000000



Did you run this test from the worker or the Nagios server?
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
scottwilkerson
DevOps Engineer
 
Posts: 15057
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: NXI 5.5.2 - Host check timeout on Worker

Postby Aezox » Mon Jun 17, 2019 8:50 am

According to the Run Check Command output, it was run from the Nagios server.

Re: NXI 5.5.2 - Host check timeout on Worker

Postby scottwilkerson » Mon Jun 17, 2019 9:04 am

Aezox wrote:According to the Run Check Command output, it was run from the Nagios server.


You would want to test it from the worker to see if it can be run successfully from the worker (which is where it is timing out)

Re: NXI 5.5.2 - Host check timeout on Worker

Postby Aezox » Mon Jun 17, 2019 10:28 am

@scottwilkerson

I already tried from the worker; see the screenshot in my second post.
Checks from the worker are working :)

Re: NXI 5.5.2 - Host check timeout on Worker

Postby scottwilkerson » Mon Jun 17, 2019 10:44 am

Aezox wrote:@scottwilkerson

I already tried from the worker; see the screenshot in my second post.
Checks from the worker are working :)


I guess I wasn't clear. You mentioned having 4 workers; I was wondering whether the check works properly from all 4.
