NXI 5.5.2 - Host check timeout on Worker

Aezox
Posts: 25
Joined: Fri Feb 09, 2018 9:31 am

Post by Aezox »

Hello everyone,

For the past few days we have had checks going into an error state, reporting either "Host Check Timeout on Worker" or "CHECK_NRPE: Socket timeout after 45 seconds".

We are running Nagios XI 5.5.2 along with 5 Mod-Gearman 2 (MG_2) workers.

NXI server:

Code: Select all

nagiosxi-nrds-5.5.2-1.el6.x86_64
nagiosxi-5.5.2-1.el6.x86_64
nagiosxi-wkhtmltox-5.5.2-1.el6.x86_64
nagiosxi-nsca-5.5.2-1.el6.x86_64
nagiosxi-pnp-5.5.2-1.el6.x86_64
nagiosxi-shellinabox-5.5.2-1.el6.x86_64
nagiosxi-nxti-5.5.2-1.el6.x86_64
nagiosxi-nagioscore-5-4.13.el6.x86_64
nagiosxi-nrpe-5.5.2-1.el6.x86_64
nagiosxi-nagiosmobile-5.5.2-1.el6.x86_64
nagiosxi-mrtg-5.5.2-1.el6.x86_64
nagiosxi-nagvis-5.5.2-1.el6.x86_64
nagiosxi-wmic-5.5.2-1.el6.x86_64
nagiosxi-ndoutils-5.5.2-1.el6.x86_64
nagiosxi-nagiosplugins-5.5.2-1.el6.x86_64

MG worker servers:

Code: Select all

mod_gearman2-2.1.1-1.el6.x86_64

While digging into the errors, I found that:
- checks run from the client side work 100% of the time
- checks run from the NXI server via PuTTY work 100% of the time
- checks run from the MG server via PuTTY fail one time out of two
- checks run from the NXI console in a web browser fail one time out of two

MG worker.conf :

Code: Select all

# Default job timeout in seconds. Currently this value is only used for
# eventhandler. The worker will use the values from the core for host and
# service checks.
job_timeout=120

# Minimum number of worker processes which should
# run at any time.
min-worker=25

# Maximum number of worker processes which should
# run at any time. You may set this equal to
# min-worker setting to disable dynamic starting of
# workers. When setting this to 1, all services from
# this worker will be executed one after another.
max-worker=200

# Time after which an idling worker exists
# This parameter controls how fast your waiting workers will
# exit if there are no jobs waiting.
idle-timeout=30

# Controls the amount of jobs a worker will do before he exits
# Use this to control how fast the amount of workers will go down
# after high load times
max-jobs=1000

I think my worker is overloaded, but perhaps you have a clue that could help me out :)

Regards
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Post by ssax »

Is it an external gearman worker that's failing or is it the local gearman worker (on the XI server)?

Is it all workers or just some?

What version of nagios core is it running on the XI system?

Code: Select all

/usr/local/nagios/bin/nagios -V
If it's higher than 4.2.4 then you are REQUIRED to upgrade mod_gearman (server and workers) and Nagios Core to at least 4.4.3 following this guide:

Code: Select all

https://assets.nagios.com/downloads/nagiosxi/docs/Integrating_Mod_Gearman_with_Nagios_XI.pdf
Please send your logs from here:

Code: Select all

/var/log/gearmand/
Post by Aezox »

@ssax

It is the output on the XI server console (in the web browser) that is failing. A local check from the worker via PuTTY works fine.

Testing from worker:
worker_check.png
Nagios XI output for the same check:
worker_timeout.png
Only one gearman server out of 3 is affected, and only 4 of the 833 hosts running on that gearman.
EDIT: I've just found that I have 30+ services in error with the error type "Service Check Timed Out On Worker", spread across the 3 running gearmans.

We are running Nagios Core 4.2.4.

The gearmand logs contain nothing from the past month, and the issue began 2 days ago.

I might have a clue regarding a difference between : host_check_timeout vs worker_check_timeout (terms may not be accurate)
Post by ssax »

Aezox wrote:I might have a clue regarding a difference between : host_check_timeout vs worker_check_timeout (terms may not be accurate)

That may be the case, good point!

What are the timeouts set to (if any are set at all) in:

Code: Select all

/usr/local/nagios/etc/nagios.cfg

Specifically, these settings:

Code: Select all

host_check_timeout
service_check_timeout

For the check command: edit the host/service, click the Run Check Command button, and send us the entire command with its arguments (I want to see whether the command itself has a timeout).

Then compare the timeouts in your gearman worker.conf file from the worker having issues.
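
To make that comparison easier, here is a minimal shell sketch that prints every uncommented timeout setting found in a config file. The function name show_timeouts is made up for illustration, and the worker.conf path in the usage comment is an assumption; adjust it to wherever your worker configuration lives.

```shell
# Hypothetical helper: print "key=value" for each timeout-related setting.
# Usage: show_timeouts FILE
show_timeouts() {
    # Keep only uncommented key=value lines whose key contains "timeout",
    # then strip leading whitespace and any spaces around the "=" sign.
    grep -E '^[[:space:]]*[A-Za-z_-]*timeout[A-Za-z_-]*[[:space:]]*=' "$1" |
        sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*=[[:space:]]*/=/'
}

# Example (the second path is an assumption, not taken from this thread):
# show_timeouts /usr/local/nagios/etc/nagios.cfg
# show_timeouts /etc/mod_gearman2/worker.conf
```

Running it against both files on the Nagios server and on the affected worker gives two short lists that can be compared side by side.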

How long does the check take when run manually from the CLI on the worker?

Code: Select all

time /full/command -with arguments
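
Because the failures reported in this thread are intermittent, a single timed run may not catch one. A minimal sketch, assuming a POSIX-style shell and a `date` that supports `+%s` on the worker: the hypothetical helper repeat_check runs the same command several times and prints the exit code, elapsed whole seconds, and first output line of each run.

```shell
# Hypothetical helper: run a check COUNT times and report each result.
# Usage: repeat_check COUNT COMMAND [ARGS...]
repeat_check() {
    count=$1
    shift
    i=1
    while [ "$i" -le "$count" ]; do
        start=$(date +%s)
        # Capture output and exit code without aborting under "set -e".
        output=$("$@" 2>&1) && rc=0 || rc=$?
        end=$(date +%s)
        first_line=$(printf '%s\n' "$output" | head -n 1)
        echo "run=$i rc=$rc seconds=$((end - start)) $first_line"
        i=$((i + 1))
    done
}

# Example (replace the placeholders with the full command and its arguments):
# repeat_check 10 /full/command -with arguments
```

Nagios plugins conventionally exit with 0 for OK, 1 for WARNING, 2 for CRITICAL, and 3 for UNKNOWN, so any run with a non-zero rc or an elapsed time near the configured timeout is worth a closer look.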
Post by Aezox »

nagios.cfg :

Code: Select all

host_check_timeout=30
service_check_timeout=60

worker.conf :

Code: Select all

job_timeout=120
idle-timeout=30

These are the only timeouts I found in the worker.conf file.

We use host templates to run host checks. For the error below, this is the check_command in use:

Code: Select all

$USER1$/check_tcp -H $HOSTADDRESS$ -p $ARG1$ -t 30

Run Check Command from the host template configuration returns OK:

Code: Select all

[nagios@*********** ~]$ /usr/local/nagios/libexec/check_tcp -H ************** -p 443 -t 30
TCP OK - 0.058 second response time on *********** port 443|time=0.057710s;;;0.000000;30.000000

Meanwhile, the Nagios output in the browser says:

Code: Select all

Advanced Status Details
Host State:	Down
Duration:	5d 0h 3m 8s
State Type:	Hard
Current Check:	2 of 2
Last Check:	17/06/2019 09:56:47
Next Check:	17/06/2019 09:59:15
Last State Change:	12/06/2019 09:56:20
Last Notification:	16/06/2019 10:03:08
Check Type:	Active
Check Latency:	0.72588 seconds
Execution Time:	30.00018 seconds
State Change:	0%
Performance Data:	
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises
Contact:

Post by scottwilkerson »

Aezox wrote:Run Check Command from host template configuration gives positive :

Code: Select all

[nagios@*********** ~]$ /usr/local/nagios/libexec/check_tcp -H ************** -p 443 -t 30
TCP OK - 0.058 second response time on *********** port 443|time=0.057710s;;;0.000000;30.000000

Did you run this test from the worker or the Nagios server?
Former Nagios employee
Creator:
ahumandesign.com
enneagrams.com
Post by Aezox »

According to the Run Check Command output, it was run from the Nagios server.
Post by scottwilkerson »

Aezox wrote:According to output run check command it has been ran from Nagios server

You will want to test it from the worker (which is where it is timing out) to see whether it runs successfully there.
Post by Aezox »

@scottwilkerson

I already tried from the worker; see the screenshot in my second post.
Checks from the worker are working :)
Post by scottwilkerson »

Aezox wrote:@scottwilkerson

Already tried from worker, check the screenshot from my second post.
Check from worker are working :)

I guess I wasn't clear. You mentioned having 4 workers, and I was wondering whether the check works properly from all 4.
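
One way to answer that for every worker at once is sketched below. It assumes key-based ssh access from the machine you run it on to each worker, which is not something this thread confirms. The helper name check_on_workers, the REMOTE_SHELL override, and the worker hostnames in the usage comment are all hypothetical.

```shell
# Hypothetical helper: run the same check command on each worker and
# print the worker name, the exit code, and the first line of output.
# Usage: check_on_workers "COMMAND WITH ARGS" HOST [HOST...]
check_on_workers() {
    cmd=$1
    shift
    for host in "$@"; do
        # REMOTE_SHELL defaults to ssh; stdin is closed so ssh cannot
        # swallow input meant for the loop.
        output=$("${REMOTE_SHELL:-ssh}" "$host" "$cmd" 2>&1 </dev/null) && rc=0 || rc=$?
        echo "$host rc=$rc $(printf '%s\n' "$output" | head -n 1)"
    done
}

# Example (hostnames and host address are placeholders):
# check_on_workers "/usr/local/nagios/libexec/check_tcp -H <host address> -p 443 -t 30" \
#     worker1 worker2 worker3 worker4
```

A worker that returns a non-zero rc while the others return 0 would point to a problem specific to that worker, such as its network path to the monitored host.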
Locked