NLS goes to hung mode every 2 hours

gsl_ops_practice
Posts: 151
Joined: Thu Apr 09, 2015 9:14 pm


Post by gsl_ops_practice »

NLS hangs roughly every 2 hours.
We currently have 2 NLS instances in the cluster, located in 2 different data centers.
Every 2 hours we see the errors below in the cluster log.


Errors from /var/log/elasticsearch/<cluster.log>:


[2024-04-02 06:50:07,764][INFO ][discovery.zen ] [528b9c0b-d2fd-4a90-8889-2cd35ca64b70] master_left [[a40b0d7b-a79a-4c37-9225-264a71974bb9][CZ6PtUXlQ2a8OSb8GCy7Ug] [<<NLSnode2>>] [inet[/<<NLSnode2 IP>>:9300]]{max_local_storage_nodes=1}], reason [transport disconnected]

[2024-04-02 06:50:07,764][WARN ][discovery.zen ] [528b9c0b-d2fd-4a90-8889-2cd35ca64b70] master left (reason = transport disconnected), current nodes: {[528b9c0b-d2fd-4a90-8889-2cd35ca64b70][mtBZ6nFMQu6fLAguPux8aw][<<NLSnode1 IP>>][inet[/<IP>:9300]]{max_local_storage_nodes=1},}

[2024-04-02 06:50:07,764][INFO ][cluster.service ] [528b9c0b-d2fd-4a90-8889-2cd35ca64b70] removed {[a40b0d7b-a79a-4c37-9225-264a71974bb9][CZ6PtUXlQ2a8OSb8GCy7Ug] [<<NLSnode2>>][inet[/<<NLSnode2 IP>>:9300]]{max_local_storage_nodes=1},}, reason: zen-disco-master_failed ([a40b0d7b-a79a-4c37-9225-264a71974bb9][CZ6PtUXlQ2a8OSb8GCy7Ug] [<<NLSnode2>>] [inet[/<<NLSnode2 IP>>:9300]]{max_local_storage_nodes=1})

-------------------------------------------
Once every 2 hours 11 minutes we see the same error, reporting that NLS node2 left the cluster:

# grep master_left 90bdcc06-402e-429e-87e2-a1de1745ecc7.log
[2024-04-02 04:38:30,675][INFO ][discovery.zen ] [528b9c0b-d2fd-4a90-8889-2cd35ca64b70] master_left [[a40b0d7b-a79a-4c37-9225-264a71974bb9][CZ6PtUXlQ2a8OSb8GCy7Ug][<<NLSnode1 name >>][inet[/<<NLSnode1 IP>>:9300]]{max_local_storage_nodes=1}], reason [transport disconnected]
[2024-04-02 06:50:07,764][INFO ][discovery.zen ] [528b9c0b-d2fd-4a90-8889-2cd35ca64b70] master_left [[a40b0d7b-a79a-4c37-9225-264a71974bb9][CZ6PtUXlQ2a8OSb8GCy7Ug][<<NLSnode1 name >>][inet[/<<NLSnode1 IP>>:9300]]{max_local_storage_nodes=1}], reason [transport disconnected]
[2024-04-02 09:01:44,851][INFO ][discovery.zen ] [528b9c0b-d2fd-4a90-8889-2cd35ca64b70] master_left [[a40b0d7b-a79a-4c37-9225-264a71974bb9][CZ6PtUXlQ2a8OSb8GCy7Ug][<<NLSnode1 name >>][inet[/<<NLSnode1 IP>>:9300]]{max_local_storage_nodes=1}], reason [transport disconnected]
#

-------------------------------------------


# cat /proc/sys/net/ipv4/tcp_keepalive_time /proc/sys/net/ipv4/tcp_keepalive_intvl /proc/sys/net/ipv4/tcp_keepalive_probes
7200
75
9
#
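
The keepalive values above can be compared with the log timestamps using a little shell arithmetic. This is only a sketch, not a diagnosis: it assumes the connection between the nodes sits idle and that something on the path silently drops it, so the kernel notices only after every keepalive probe has gone unanswered. The variable names are illustrative.

```shell
#!/bin/sh
# Values reported above from /proc/sys/net/ipv4/tcp_keepalive_*
keepalive_time=7200
keepalive_intvl=75
keepalive_probes=9

# Idle seconds before the kernel gives up on a connection whose probes go unanswered
dead_after=$(( keepalive_time + keepalive_intvl * keepalive_probes ))

# Gap between two consecutive master_left entries: 06:50:07 minus 04:38:30
# (fractional seconds dropped; no leading zeros, which the shell would read as octal)
observed=$(( (6 * 3600 + 50 * 60 + 7) - (4 * 3600 + 38 * 60 + 30) ))

printf 'keepalive gives up after %d s (%dh %dm %ds)\n' \
    "$dead_after" $(( dead_after / 3600 )) $(( dead_after % 3600 / 60 )) $(( dead_after % 60 ))
printf 'observed interval is     %d s (%dh %dm %ds)\n' \
    "$observed" $(( observed / 3600 )) $(( observed % 3600 / 60 )) $(( observed % 60 ))
```

The two figures come out as 7875 s (2h 11m 15s) and 7897 s (2h 11m 37s). The gap of about 22 seconds is consistent with the disconnects being driven by the keepalive timers, although it does not prove it.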
jmichaelson
Posts: 110
Joined: Wed Aug 23, 2023 1:02 pm

Re: NLS goes to hung mode every 2 hours

Post by jmichaelson »

Is there by chance a stateful firewall between the two data centers that might be timing out?

You brought up the keepalive settings, which led me to this Elasticsearch forum post:

https://discuss.elastic.co/t/possible-c ... ry/13651/3

That in turn led me to this old Linux Documentation Project page (the site needs to renew its certificate) about how to change the keepalive settings:

https://tldp.org/HOWTO/TCP-Keepalive-HO ... alive.html

Try reducing the keepalive time to 600 and let us know if that helps.
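
For reference, one way to apply that suggestion might look like the sketch below. It only writes the setting to a local file so it can be reviewed first; the file name 99-tcp-keepalive.conf is an assumption chosen for illustration, and the commands that actually change the kernel setting are left as comments because they need root.

```shell
#!/bin/sh
# Hypothetical drop-in file name; written to the current directory for review
conf_file="99-tcp-keepalive.conf"

# Suggested value from this thread: start keepalive probes after 600 s of idle time
cat > "$conf_file" <<'EOF'
net.ipv4.tcp_keepalive_time = 600
EOF

echo "Wrote $conf_file:"
cat "$conf_file"

# To apply immediately (as root, lost on reboot):
#   sysctl -w net.ipv4.tcp_keepalive_time=600
# To persist across reboots (as root):
#   cp 99-tcp-keepalive.conf /etc/sysctl.d/
#   sysctl -p /etc/sysctl.d/99-tcp-keepalive.conf
```

The interval and probe count are left at their current values here, since only the keepalive time was mentioned.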
Please let us know if you have any other questions or concerns.

-Jason
gsl_ops_practice
Posts: 151
Joined: Thu Apr 09, 2015 9:14 pm

Re: NLS goes to hung mode every 2 hours

Post by gsl_ops_practice »

Thanks, I have reduced the keepalive to 3600.
I will share an update.
gsl_ops_practice
Posts: 151
Joined: Thu Apr 09, 2015 9:14 pm

Re: NLS goes to hung mode every 2 hours

Post by gsl_ops_practice »

After amending the values as below, NLS is working as expected. Thanks for the help.
----------------
net.ipv4.tcp_keepalive_time=601
net.ipv4.tcp_keepalive_intvl=21
net.ipv4.tcp_keepalive_probes=21
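
To confirm that the running kernel has picked up these values, the live settings can be compared with the expected ones. A minimal sketch follows; the function name check_keepalive is made up for illustration, and it reports "unavailable" on systems that do not expose /proc/sys/net/ipv4. With these values, an idle connection whose probes go unanswered would be dropped after 601 + 21 × 21 = 1042 seconds, a little over 17 minutes.

```shell
#!/bin/sh
# check_keepalive NAME EXPECTED
# Compares /proc/sys/net/ipv4/NAME with EXPECTED and prints one line per setting.
check_keepalive() {
    name=$1
    want=$2
    file="/proc/sys/net/ipv4/$name"
    if [ -r "$file" ]; then
        got=$(cat "$file")
    else
        got="unavailable"
    fi
    if [ "$got" = "$want" ]; then
        echo "$name: OK ($got)"
    else
        echo "$name: expected $want, found $got"
    fi
}

# Expected values are the ones posted above
check_keepalive tcp_keepalive_time 601
check_keepalive tcp_keepalive_intvl 21
check_keepalive tcp_keepalive_probes 21
```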
jmichaelson
Posts: 110
Joined: Wed Aug 23, 2023 1:02 pm

Re: NLS goes to hung mode every 2 hours

Post by jmichaelson »

Good to know! To be honest, I'm not certain that Elasticsearch is meant to be used the way you're using it (with the two nodes of the cluster divided across data centers). Personally I like the idea for availability, but I'm just not sure it's officially supported. It's great that adjusting the timeouts got it working for you!
Please let us know if you have any other questions or concerns.

-Jason