NLS goes to hung mode every 2 hours

Posted: Tue Apr 02, 2024 7:27 am
by gsl_ops_practice
NLS hangs roughly every 2 hours.
We currently have 2 NLS instances in the cluster, in 2 different data centers.
Roughly every 2 hours we see the errors below in the cluster log.


Errors from /var/log/elasticsearch/<cluster.log>:


[2024-04-02 06:50:07,764][INFO ][discovery.zen ] [528b9c0b-d2fd-4a90-8889-2cd35ca64b70] master_left [[a40b0d7b-a79a-4c37-9225-264a71974bb9][CZ6PtUXlQ2a8OSb8GCy7Ug] [<<NLSnode2>>] [inet[/<<NLSnode2 IP>>:9300]]{max_local_storage_nodes=1}], reason [transport disconnected]

[2024-04-02 06:50:07,764][WARN ][discovery.zen ] [528b9c0b-d2fd-4a90-8889-2cd35ca64b70] master left (reason = transport disconnected), current nodes: {[528b9c0b-d2fd-4a90-8889-2cd35ca64b70][mtBZ6nFMQu6fLAguPux8aw][<<NLSnode1 IP>>][inet[/<IP>:9300]]{max_local_storage_nodes=1},}

[2024-04-02 06:50:07,764][INFO ][cluster.service ] [528b9c0b-d2fd-4a90-8889-2cd35ca64b70] removed {[a40b0d7b-a79a-4c37-9225-264a71974bb9][CZ6PtUXlQ2a8OSb8GCy7Ug] [<<NLSnode2>>][inet[/<<NLSnode2 IP>>:9300]]{max_local_storage_nodes=1},}, reason: zen-disco-master_failed ([a40b0d7b-a79a-4c37-9225-264a71974bb9][CZ6PtUXlQ2a8OSb8GCy7Ug] [<<NLSnode2>>] [inet[/<<NLSnode2 IP>>:9300]]{max_local_storage_nodes=1})

-------------------------------------------
About once every 2 hours 11 minutes we see the same error, reporting that NLS node2 left the cluster:

# grep master_left 90bdcc06-402e-429e-87e2-a1de1745ecc7.log
[2024-04-02 04:38:30,675][INFO ][discovery.zen ] [528b9c0b-d2fd-4a90-8889-2cd35ca64b70] master_left [[a40b0d7b-a79a-4c37-9225-264a71974bb9][CZ6PtUXlQ2a8OSb8GCy7Ug][<<NLSnode1 name >>][inet[/<<NLSnode1 IP>>:9300]]{max_local_storage_nodes=1}], reason [transport disconnected]
[2024-04-02 06:50:07,764][INFO ][discovery.zen ] [528b9c0b-d2fd-4a90-8889-2cd35ca64b70] master_left [[a40b0d7b-a79a-4c37-9225-264a71974bb9][CZ6PtUXlQ2a8OSb8GCy7Ug][<<NLSnode1 name >>][inet[/<<NLSnode1 IP>>:9300]]{max_local_storage_nodes=1}], reason [transport disconnected]
[2024-04-02 09:01:44,851][INFO ][discovery.zen ] [528b9c0b-d2fd-4a90-8889-2cd35ca64b70] master_left [[a40b0d7b-a79a-4c37-9225-264a71974bb9][CZ6PtUXlQ2a8OSb8GCy7Ug][<<NLSnode1 name >>][inet[/<<NLSnode1 IP>>:9300]]{max_local_storage_nodes=1}], reason [transport disconnected]
#
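
To check how regular the disconnects are, the gap between consecutive master_left events can be computed from the log timestamps. This is a minimal sketch only: master_left_gaps is a hypothetical helper name, and it assumes the timestamp format shown above and that all events fall on the same day.

```shell
# Print the number of seconds between consecutive master_left log lines.
# Assumes lines start with "[YYYY-MM-DD HH:MM:SS,mmm]" and no midnight rollover.
master_left_gaps() {
    awk '/master_left/ {
        # $2 looks like "04:38:30,675][INFO"; take HH, MM and SS by position.
        t = substr($2, 1, 2) * 3600 + substr($2, 4, 2) * 60 + substr($2, 7, 2)
        if (seen) print t - prev
        prev = t; seen = 1
    }' "$@"
}

# Usage (reads the named files, or stdin if none are given):
#   master_left_gaps 90bdcc06-402e-429e-87e2-a1de1745ecc7.log
```

For the three lines shown above this prints 7897 twice, i.e. 2 hours 11 minutes 37 seconds between each pair of events.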

-------------------------------------------


# cat /proc/sys/net/ipv4/tcp_keepalive_time /proc/sys/net/ipv4/tcp_keepalive_intvl /proc/sys/net/ipv4/tcp_keepalive_probes
7200
75
9
#
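
For context on these three values: on a socket with keepalive enabled, Linux sends the first probe after tcp_keepalive_time seconds of idle time, then gives up after tcp_keepalive_probes unanswered probes sent tcp_keepalive_intvl seconds apart. A rough sketch of that arithmetic follows (the function name is hypothetical, and the result is an approximation, not an exact kernel timer value).

```shell
# Approximate seconds before the kernel drops an idle, unresponsive TCP connection.
# $1 = tcp_keepalive_time, $2 = tcp_keepalive_intvl, $3 = tcp_keepalive_probes
keepalive_detect_seconds() {
    echo $(( $1 + $2 * $3 ))
}

# Values from the output above
keepalive_detect_seconds 7200 75 9   # prints 7875
```

7875 seconds is 2 hours 11 minutes 15 seconds, which is close to the interval of about 2 hours 11 minutes between the master_left events. That is consistent with the keepalive timers being involved, but it does not prove it.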

Re: NLS goes to hung mode every 2 hours

Posted: Tue Apr 02, 2024 4:31 pm
by jmichaelson
Is there by chance a stateful firewall between the two data centers that might be timing out?

Since you brought up the keepalive settings, I found this Elasticsearch post:

https://discuss.elastic.co/t/possible-c ... ry/13651/3

which led me to this old Linux Documentation Project HOWTO (the site needs to renew its certificate):

https://tldp.org/HOWTO/TCP-Keepalive-HO ... alive.html

It covers how to change the keepalive settings. Try reducing tcp_keepalive_time to 600 and let us know if that helps.

Re: NLS goes to hung mode every 2 hours

Posted: Tue Apr 02, 2024 11:01 pm
by gsl_ops_practice
Thanks, I have reduced the keepalive to 3600.
I will share updates.

Re: NLS goes to hung mode every 2 hours

Posted: Mon Apr 15, 2024 4:30 am
by gsl_ops_practice
After amending the values as below, NLS is working as expected. Thanks for the help.
----------------
net.ipv4.tcp_keepalive_time=601
net.ipv4.tcp_keepalive_intvl=21
net.ipv4.tcp_keepalive_probes=21
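
For anyone wanting to make the same change survive a reboot, one common approach is a drop-in file under /etc/sysctl.d/. This is only a sketch: render_keepalive_conf is a hypothetical helper and the file name 99-tcp-keepalive.conf is an arbitrary choice, neither of which comes from this thread.

```shell
# Render the three keepalive settings in sysctl.conf syntax.
# $1 = tcp_keepalive_time, $2 = tcp_keepalive_intvl, $3 = tcp_keepalive_probes
render_keepalive_conf() {
    printf 'net.ipv4.tcp_keepalive_time=%s\n' "$1"
    printf 'net.ipv4.tcp_keepalive_intvl=%s\n' "$2"
    printf 'net.ipv4.tcp_keepalive_probes=%s\n' "$3"
}

# Example, run as root (file name is a hypothetical choice):
#   render_keepalive_conf 601 21 21 > /etc/sysctl.d/99-tcp-keepalive.conf
#   sysctl --system    # reload all sysctl configuration files
```

With these values the first keepalive probe goes out after about 10 minutes of idle time instead of 2 hours.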

Re: NLS goes to hung mode every 2 hours

Posted: Mon Apr 15, 2024 11:27 am
by jmichaelson
Good to know! To be honest, I'm not certain that Elasticsearch is meant to be used the way you're using it (with the two nodes of the cluster split across data centers). Personally I like the idea for availability, but I'm just not sure it's officially supported. It's great that adjusting the timeouts got it working for you!