NLS goes to hung mode every 2 hours
Currently we have 2 NLS instances in the cluster, in 2 different datacenters.
But every 2 hours we see the errors below in the cluster log.
Errors from /var/log/elasticsearch/<cluster.log>:
[2024-04-02 06:50:07,764][INFO ][discovery.zen ] [528b9c0b-d2fd-4a90-8889-2cd35ca64b70] master_left [[a40b0d7b-a79a-4c37-9225-264a71974bb9][CZ6PtUXlQ2a8OSb8GCy7Ug] [<<NLSnode2>>] [inet[/<<NLSnode2 IP>>:9300]]{max_local_storage_nodes=1}], reason [transport disconnected]
[2024-04-02 06:50:07,764][WARN ][discovery.zen ] [528b9c0b-d2fd-4a90-8889-2cd35ca64b70] master left (reason = transport disconnected), current nodes: {[528b9c0b-d2fd-4a90-8889-2cd35ca64b70][mtBZ6nFMQu6fLAguPux8aw][<<NLSnode1 IP>>][inet[/<IP>:9300]]{max_local_storage_nodes=1},}
[2024-04-02 06:50:07,764][INFO ][cluster.service ] [528b9c0b-d2fd-4a90-8889-2cd35ca64b70] removed {[a40b0d7b-a79a-4c37-9225-264a71974bb9][CZ6PtUXlQ2a8OSb8GCy7Ug] [<<NLSnode2>>][inet[/<<NLSnode2 IP>>:9300]]{max_local_storage_nodes=1},}, reason: zen-disco-master_failed ([a40b0d7b-a79a-4c37-9225-264a71974bb9][CZ6PtUXlQ2a8OSb8GCy7Ug] [<<NLSnode2>>] [inet[/<<NLSnode2 IP>>:9300]]{max_local_storage_nodes=1})
-------------------------------------------
Roughly once every 2 hours 11 minutes we see the same error, saying that NLS node2 left the cluster:
# grep master_left 90bdcc06-402e-429e-87e2-a1de1745ecc7.log
[2024-04-02 04:38:30,675][INFO ][discovery.zen ] [528b9c0b-d2fd-4a90-8889-2cd35ca64b70] master_left [[a40b0d7b-a79a-4c37-9225-264a71974bb9][CZ6PtUXlQ2a8OSb8GCy7Ug][<<NLSnode1 name >>][inet[/<<NLSnode1 IP>>:9300]]{max_local_storage_nodes=1}], reason [transport disconnected]
[2024-04-02 06:50:07,764][INFO ][discovery.zen ] [528b9c0b-d2fd-4a90-8889-2cd35ca64b70] master_left [[a40b0d7b-a79a-4c37-9225-264a71974bb9][CZ6PtUXlQ2a8OSb8GCy7Ug][<<NLSnode1 name >>][inet[/<<NLSnode1 IP>>:9300]]{max_local_storage_nodes=1}], reason [transport disconnected]
[2024-04-02 09:01:44,851][INFO ][discovery.zen ] [528b9c0b-d2fd-4a90-8889-2cd35ca64b70] master_left [[a40b0d7b-a79a-4c37-9225-264a71974bb9][CZ6PtUXlQ2a8OSb8GCy7Ug][<<NLSnode1 name >>][inet[/<<NLSnode1 IP>>:9300]]{max_local_storage_nodes=1}], reason [transport disconnected]
#
-------------------------------------------
# cat /proc/sys/net/ipv4/tcp_keepalive_time /proc/sys/net/ipv4/tcp_keepalive_intvl /proc/sys/net/ipv4/tcp_keepalive_probes
7200
75
9
#
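One possible reading of these numbers, offered as a sketch rather than a diagnosis: if keepalive probes on an idle connection go unanswered (for example because something between the datacenters silently discards them), the kernel waits tcp_keepalive_time seconds before the first probe and then one tcp_keepalive_intvl per unanswered probe before dropping the connection. The helper name keepalive_dead_after below is hypothetical; the three values are the ones printed above.

```shell
#!/bin/sh
# Sketch: seconds an idle TCP connection survives before the kernel drops it
# when every keepalive probe goes unanswered.
# keepalive_dead_after is a hypothetical helper name, not an existing tool.
keepalive_dead_after() {
    time_s="$1"; intvl_s="$2"; probes="$3"
    # Idle time before the first probe, plus one interval per unanswered probe.
    echo $(( time_s + intvl_s * probes ))
}

total=$(keepalive_dead_after 7200 75 9)   # values from the cat output above
printf '%d s = %dh %dm %ds\n' "$total" \
    $(( total / 3600 )) $(( total % 3600 / 60 )) $(( total % 60 ))
# prints: 7875 s = 2h 11m 15s
```

That figure is close to the spacing of the master_left entries in the grep output (about 2 hours 11 minutes apart), which would be consistent with idle connections being dropped by the keepalive timeout, though it does not prove it.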
-
- Posts: 151
- Joined: Thu Apr 09, 2015 9:14 pm
- jmichaelson
- Posts: 133
- Joined: Wed Aug 23, 2023 1:02 pm
Re: NLS goes to hung mode every 2 hours
Is there by chance a stateful firewall between the two data centers that might be timing out?
You did bring up the keepalive settings, and I came across this Elasticsearch post:
https://discuss.elastic.co/t/possible-c ... ry/13651/3
which led me to this old Linux Documentation Project page (the site needs to renew its certificate): https://tldp.org/HOWTO/TCP-Keepalive-HO ... alive.html
about how to change the keepalive settings. Try reducing tcp_keepalive_time to 600 and let us know if that helps.
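As a rough sketch of one way to apply that value: the snippet below writes the setting to a sysctl drop-in file and then loads it. The file name 90-tcp-keepalive.conf and the function name write_keepalive_conf are assumptions made for illustration, not something from the linked pages, so adjust them to your environment.

```shell
#!/bin/sh
# Sketch: persist a lower TCP keepalive idle time in a sysctl drop-in file.
# CONF_FILE and write_keepalive_conf are hypothetical names for illustration.
CONF_FILE="${CONF_FILE:-/etc/sysctl.d/90-tcp-keepalive.conf}"
KEEPALIVE_TIME="${KEEPALIVE_TIME:-600}"   # the value suggested above

write_keepalive_conf() {
    # Emit the setting in sysctl "key=value" syntax.
    printf 'net.ipv4.tcp_keepalive_time=%s\n' "$KEEPALIVE_TIME" > "$CONF_FILE"
}

# As root, write the file, load it, and verify the running value:
#   write_keepalive_conf && sysctl -p "$CONF_FILE"
#   cat /proc/sys/net/ipv4/tcp_keepalive_time
```

For a quick trial that does not survive a reboot, `sysctl -w net.ipv4.tcp_keepalive_time=600` run as root changes only the running value.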
Please let us know if you have any other questions or concerns.
-Jason
-
- Posts: 151
- Joined: Thu Apr 09, 2015 9:14 pm
Re: NLS goes to hung mode every 2 hours
Thanks, I have reduced the keepalive to 3600.
I will share updates.
-
- Posts: 151
- Joined: Thu Apr 09, 2015 9:14 pm
Re: NLS goes to hung mode every 2 hours
After amending the values as below, NLS is working as expected. Thanks for the help.
----------------
net.ipv4.tcp_keepalive_time=601
net.ipv4.tcp_keepalive_intvl=21
net.ipv4.tcp_keepalive_probes=21
----------------
- jmichaelson
- Posts: 133
- Joined: Wed Aug 23, 2023 1:02 pm
Re: NLS goes to hung mode every 2 hours
Good to know! To be honest, I'm not certain that Elasticsearch is meant to be used in the manner you're using it (with the two nodes of the cluster divided across data centers). Personally I like the idea for availability, but I'm just not sure it's officially supported for that. It's great that adjusting the timeouts got it working for you!
Please let us know if you have any other questions or concerns.
-Jason