NLS goes to hung mode every 2 hours

gsl_ops_practice · Post by **gsl_ops_practice** » Tue Apr 02, 2024 7:27 am

NLS goes into a hung state every 2 hours.
We currently have 2 NLS instances in the cluster, in 2 different data centers.
Roughly every 2 hours we see the errors below in the cluster log.

Errors from /var/log/elasticsearch/<cluster.log>:

[2024-04-02 06:50:07,764][INFO ][discovery.zen ] [528b9c0b-d2fd-4a90-8889-2cd35ca64b70] master_left [[a40b0d7b-a79a-4c37-9225-264a71974bb9][CZ6PtUXlQ2a8OSb8GCy7Ug] [<<NLSnode2>>] [inet[/<<NLSnode2 IP>>:9300]]{max_local_storage_nodes=1}], reason [transport disconnected]

[2024-04-02 06:50:07,764][WARN ][discovery.zen ] [528b9c0b-d2fd-4a90-8889-2cd35ca64b70] master left (reason = transport disconnected), current nodes: {[528b9c0b-d2fd-4a90-8889-2cd35ca64b70][mtBZ6nFMQu6fLAguPux8aw][<<NLSnode1 IP>>][inet[/<IP>:9300]]{max_local_storage_nodes=1},}

[2024-04-02 06:50:07,764][INFO ][cluster.service ] [528b9c0b-d2fd-4a90-8889-2cd35ca64b70] removed {[a40b0d7b-a79a-4c37-9225-264a71974bb9][CZ6PtUXlQ2a8OSb8GCy7Ug] [<<NLSnode2>>][inet[/<<NLSnode2 IP>>:9300]]{max_local_storage_nodes=1},}, reason: zen-disco-master_failed ([a40b0d7b-a79a-4c37-9225-264a71974bb9][CZ6PtUXlQ2a8OSb8GCy7Ug] [<<NLSnode2>>] [inet[/<<NLSnode2 IP>>:9300]]{max_local_storage_nodes=1})

-------------------------------------------
The same error, reporting that NLS node2 left the cluster, recurs about every 2 hours 11 minutes:

# grep master_left 90bdcc06-402e-429e-87e2-a1de1745ecc7.log
[2024-04-02 04:38:30,675][INFO ][discovery.zen ] [528b9c0b-d2fd-4a90-8889-2cd35ca64b70] master_left [[a40b0d7b-a79a-4c37-9225-264a71974bb9][CZ6PtUXlQ2a8OSb8GCy7Ug][<<NLSnode1 name >>][inet[/<<NLSnode1 IP>>:9300]]{max_local_storage_nodes=1}], reason [transport disconnected]
[2024-04-02 06:50:07,764][INFO ][discovery.zen ] [528b9c0b-d2fd-4a90-8889-2cd35ca64b70] master_left [[a40b0d7b-a79a-4c37-9225-264a71974bb9][CZ6PtUXlQ2a8OSb8GCy7Ug][<<NLSnode1 name >>][inet[/<<NLSnode1 IP>>:9300]]{max_local_storage_nodes=1}], reason [transport disconnected]
[2024-04-02 09:01:44,851][INFO ][discovery.zen ] [528b9c0b-d2fd-4a90-8889-2cd35ca64b70] master_left [[a40b0d7b-a79a-4c37-9225-264a71974bb9][CZ6PtUXlQ2a8OSb8GCy7Ug][<<NLSnode1 name >>][inet[/<<NLSnode1 IP>>:9300]]{max_local_storage_nodes=1}], reason [transport disconnected]
#
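
The spacing between these events can be checked directly from the timestamps. A minimal sketch, assuming GNU `date` (for `-d`) and the log line format shown above; the function name `master_left_gaps` is made up for illustration:

```shell
#!/bin/sh
# Print the gap, in seconds, between consecutive log lines read from stdin.
# Assumes each line starts with "[YYYY-MM-DD HH:MM:SS,mmm]" and GNU date.
master_left_gaps() {
  prev=""
  # Keep only "YYYY-MM-DD HH:MM:SS" from the leading bracket.
  sed -n 's/^\[\([0-9-]* [0-9:]*\),.*/\1/p' |
  while IFS= read -r ts; do
    # -u avoids daylight-saving jumps skewing the difference.
    now=$(date -u -d "$ts" +%s)
    if [ -n "$prev" ]; then
      echo $((now - prev))
    fi
    prev=$now
  done
}

# Usage:
#   grep master_left 90bdcc06-402e-429e-87e2-a1de1745ecc7.log | master_left_gaps
```

For the three lines above this gives 7897 seconds for both gaps, i.e. about 2 h 11 min 37 s.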

-------------------------------------------

# cat /proc/sys/net/ipv4/tcp_keepalive_time /proc/sys/net/ipv4/tcp_keepalive_intvl /proc/sys/net/ipv4/tcp_keepalive_probes
7200
75
9
#
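
With these values the kernel waits `tcp_keepalive_time` seconds of idle time before sending the first probe, then sends up to `tcp_keepalive_probes` probes, `tcp_keepalive_intvl` seconds apart, before treating the peer as dead. A quick sketch of that arithmetic (the function name `keepalive_dead_after` is hypothetical):

```shell
#!/bin/sh
# Approximate seconds before an idle TCP connection with keepalive enabled
# is declared dead: idle time + (probe interval * probe count).
keepalive_dead_after() {
  echo $(( $1 + $2 * $3 ))
}

# Values from the output above: time=7200, intvl=75, probes=9
keepalive_dead_after 7200 75 9   # prints 7875
```

7875 seconds is about 2 h 11 min, which is consistent with the interval between the disconnects, although that alone does not prove keepalive is the cause.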

jmichaelson · Post by **jmichaelson** » Tue Apr 02, 2024 4:31 pm

Is there by chance a stateful firewall between the two data centers that might be timing out?

Since you brought up the keepalive settings, I found this Elasticsearch forum post:

https://discuss.elastic.co/t/possible-c ... ry/13651/3

which led me to this old Linux Documentation Project page (the site needs to renew its certificate) https://tldp.org/HOWTO/TCP-Keepalive-HO ... alive.html

about how to change the keepalive settings. Try reducing tcp_keepalive_time to 600 and let us know if that helps.

gsl_ops_practice · Post by **gsl_ops_practice** » Tue Apr 02, 2024 11:01 pm

Thanks, I have reduced the keepalive to 3600.
I will share updates.

gsl_ops_practice · Post by **gsl_ops_practice** » Mon Apr 15, 2024 4:30 am

After amending the values as below, NLS is working as expected. Thanks for the help.
----------------
net.ipv4.tcp_keepalive_time=601
net.ipv4.tcp_keepalive_intvl=21
net.ipv4.tcp_keepalive_probes=21
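
To keep values like these across reboots, they are usually written to a sysctl configuration file. A minimal sketch, assuming a procps-style `sysctl` that reads `/etc/sysctl.d/`; the file name `90-tcp-keepalive.conf` and the function `render_keepalive_conf` are illustrative and not taken from this thread:

```shell
#!/bin/sh
# Emit sysctl settings for the three keepalive knobs.
# Arguments: idle time, probe interval, probe count.
render_keepalive_conf() {
  printf 'net.ipv4.tcp_keepalive_time=%s\n' "$1"
  printf 'net.ipv4.tcp_keepalive_intvl=%s\n' "$2"
  printf 'net.ipv4.tcp_keepalive_probes=%s\n' "$3"
}

# As root (hypothetical file name), then reload all sysctl files:
#   render_keepalive_conf 601 21 21 > /etc/sysctl.d/90-tcp-keepalive.conf
#   sysctl --system
```

With a `tcp_keepalive_time` of 601, an idle connection is first probed after about 10 minutes instead of 2 hours.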

jmichaelson · Post by **jmichaelson** » Mon Apr 15, 2024 11:27 am

Good to know! To be honest, I'm not certain that Elasticsearch is meant to be used in the manner you're using it (with the two nodes of the cluster divided across data centers). Personally I like the idea for availability, but I'm just not sure it's officially supported for that. It's great that adjusting timeouts got it working for you!

Nagios Support Forum
