Cluster failure and UDP syslogs

CFT6Server · Post by **CFT6Server** » Tue Jul 26, 2016 5:31 pm

It looks like our cluster came to a complete halt a few days back. We have different data sources going to 3 different nodes, but all of them seem to have stopped. I see the following errors about 1 day prior to all the logs stopping on the log server cluster:

Time of first observable event - 2016-07-23T05:25:01.000Z
(CRON) ERROR (setreuid failed): Resource temporarily unavailable

I was not seeing any logs until the entire cluster was restarted. It looks like at some point the cluster had a memory issue.

Code: Select all

[2016-07-22 14:16:12,425][WARN ][transport.netty          ] [4521585a-88af-47c9-81e5-c4d13cffb148] exception caught on transport layer [[id: 0x4140ed8a, /<IP>:49192 => /<IP>:9300]], closing connection
java.lang.OutOfMemoryError: Java heap space

On another node I can see..

Code: Select all

[2016-07-22 11:26:20,768][DEBUG][action.admin.cluster.node.stats] [9a92d6ef-d554-49d8-9191-dcf886382926] failed to execute on node [6r8jhfnZSnqEcG7h59h-SQ]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [e63648a3-d912-4f5d-a867-1b99282a5e7c][inet[/<IP>:9300]][cluster:monitor/nodes/stats[n]] request_id [233583455] timed out after [15000ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:529)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

Before I rebooted the cluster, it was even having trouble showing me the nodes with "curl localhost:9200/_cat/nodes"
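For anyone who wants to run the same kind of check from a script, here is a minimal sketch (not something the original posters used). It asks Elasticsearch for `_cluster/health` with a short timeout, so a hung node produces an error instead of a stalled terminal. The function names, the 5-second timeout, and the fields picked out of the response are assumptions made for illustration.

```python
import json
import urllib.request


def summarize_health(body: str) -> str:
    """Turn a _cluster/health JSON body into a one-line summary."""
    health = json.loads(body)
    return "status={} nodes={} unassigned_shards={}".format(
        health.get("status"),
        health.get("number_of_nodes"),
        health.get("unassigned_shards"),
    )


def fetch_health(base_url: str = "http://localhost:9200", timeout: float = 5.0) -> str:
    """Query the node and summarize its view of the cluster.

    Raises OSError (URLError, timeout, connection refused) if the node
    does not answer within `timeout` seconds.
    """
    with urllib.request.urlopen(base_url + "/_cluster/health", timeout=timeout) as resp:
        return summarize_health(resp.read().decode("utf-8"))
```

Calling `fetch_health()` on each node in turn and catching `OSError` gives a quick picture of which nodes still respond at all.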

So I guess the cluster was low on memory. But looking at the "Resource temporarily unavailable" message, could this be caused by other memory settings, such as kernel memory parameters, that can be tweaked here? Suggestions?

What we are seeing is that servers sending syslogs to Nagios Log Server had issues over time and would stop logging locally as well, which seems to happen about a day after the first cluster resource event. Have you guys seen this, where sending UDP syslogs could cause the source to stop logging if Nagios Log Server is not responding via UDP?

CFT6Server · Post by **CFT6Server** » Tue Jul 26, 2016 5:42 pm

A couple more messages from another node. It looks like the master just disappeared and the cluster was left without a master?

Code: Select all

[2016-07-22 08:48:29,302][INFO ][monitor.jvm              ] [e63648a3-d912-4f5d-a867-1b99282a5e7c] [gc][old][768200][46200] duration [11.4m], collections [97]/[11.5m], total [11.4m]/[1.6h], memory [12.9gb]->[12.8gb]/[12.9gb], all_pools {[young] [399.4mb]->[339.2mb]/[399.4mb]}{[survivor] [49.4mb]->[0b]/[49.8mb]}{[old] [12.5gb]->[12.5gb]/[12.5gb]}
[2016-07-22 08:48:29,449][INFO ][discovery.zen            ] [e63648a3-d912-4f5d-a867-1b99282a5e7c] master_left [[9a92d6ef-d554-49d8-9191-dcf886382926][SYbQ13ObSNSAosQoTVrWMw][kdcnagls2n1.bchydro.bc.ca][inet[/10.242.102.123:9300]]{max_local_storage_nodes=1}], reason [do not exists on master, act as master failure]
[2016-07-22 08:48:29,452][WARN ][discovery.zen            ] [e63648a3-d912-4f5d-a867-1b99282a5e7c] master left (reason = do not exists on master, act as master failure), current nodes: {[2db4ce89-4c01-4a30-9bc8-66e987b7d613][gW4H7VmzRDC_QeexUTz5JA][kdcnagls2n3.bchydro.bc.ca][inet[/10.242.102.125:9300]]{max_local_storage_nodes=1},[c424515a-16b3-43f9-866e-19daedef8a63][p8t5Q_6pQCCUeI4m60TZkQ][kdcnagls2n2.bchydro.bc.ca][inet[/10.242.102.124:9300]]{max_local_storage_nodes=1},[e63648a3-d912-4f5d-a867-1b99282a5e7c][6r8jhfnZSnqEcG7h59h-SQ][kdcnagls1n3.bchydro.bc.ca][inet[/10.242.102.109:9300]]{max_local_storage_nodes=1},[30ab2b2c-439f-4bcc-977d-7c0e9a90f3a5][f9bV0Ej0TsG3L_JTpGNOSA][kdcnagls1n2.bchydro.bc.ca][inet[/10.242.102.108:9300]]{max_local_storage_nodes=1},[4521585a-88af-47c9-81e5-c4d13cffb148][WLH664xeS8efPqtsyZ56lQ][kdcnagls1n1.bchydro.bc.ca][inet[/10.242.102.107:9300]]{max_local_storage_nodes=1},}
[2016-07-22 08:48:29,454][INFO ][cluster.service          ] [e63648a3-d912-4f5d-a867-1b99282a5e7c] removed {[9a92d6ef-d554-49d8-9191-dcf886382926][SYbQ13ObSNSAosQoTVrWMw][kdcnagls2n1.bchydro.bc.ca][inet[/10.242.102.123:9300]]{max_local_storage_nodes=1},}, reason: zen-disco-master_failed ([9a92d6ef-d554-49d8-9191-dcf886382926][SYbQ13ObSNSAosQoTVrWMw][kdcnagls2n1.bchydro.bc.ca][inet[/10.242.102.123:9300]]{max_local_storage_nodes=1})
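
The `monitor.jvm` line above is worth reading closely: an old-generation collection ran for 11.4 minutes and the heap went from 12.9gb to only 12.8gb out of 12.9gb. As a rough sketch of how that number can be pulled out of such lines automatically, the snippet below parses the `memory [before]->[after]/[max]` part and reports how full the heap still is after the collection. The regular expression and the function names are assumptions for illustration and only cover the unit suffixes seen in this log (b, kb, mb, gb).

```python
import re

# Byte multipliers for the unit suffixes that appear in the log line.
UNIT_BYTES = {"b": 1, "kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}

# Matches e.g. "memory [12.9gb]->[12.8gb]/[12.9gb]".
HEAP_RE = re.compile(
    r"memory \[([\d.]+)([kmg]?b)\]->\[([\d.]+)([kmg]?b)\]/\[([\d.]+)([kmg]?b)\]"
)


def to_bytes(number: str, unit: str) -> float:
    """Convert a value such as ("12.9", "gb") to bytes."""
    return float(number) * UNIT_BYTES[unit]


def heap_after_gc(log_line: str):
    """Return heap sizes and the percentage still used after the GC.

    Returns None if the line has no "memory [..]->[..]/[..]" section.
    """
    match = HEAP_RE.search(log_line)
    if match is None:
        return None
    before = to_bytes(match.group(1), match.group(2))
    after = to_bytes(match.group(3), match.group(4))
    maximum = to_bytes(match.group(5), match.group(6))
    return {
        "before_bytes": before,
        "after_bytes": after,
        "max_bytes": maximum,
        "used_pct_after": 100.0 * after / maximum,
    }


line = ("[gc][old][768200][46200] duration [11.4m], collections [97]/[11.5m], "
        "total [11.4m]/[1.6h], memory [12.9gb]->[12.8gb]/[12.9gb]")
print(round(heap_after_gc(line)["used_pct_after"], 1))  # prints 99.2
```

A heap that is still about 99% full after a collection lasting minutes means the JVM is reclaiming almost nothing, which fits the `OutOfMemoryError` and the timeouts seen on the other nodes.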

Post by **Box293** » Tue Jul 26, 2016 10:31 pm

Has the cluster returned to an OK status after a reboot?

Do these KB articles help at all?

https://support.nagios.com/kb/article.php?id=469

https://support.nagios.com/kb/article.php?id=90

https://support.nagios.com/kb/article.php?id=132

CFT6Server · Post by **CFT6Server** » Wed Jul 27, 2016 10:01 am

Unfortunately, I already know those symptoms and cluster statuses/messages, so the articles don't really help us. It isn't a disk space or sharding issue; the sharding error looks like it was a result of not having a master node. I am looking to find out a couple of things:

1. Is that resource-related message (which happens outside of Log Server) tied to another setting, or perhaps to some internal kernel parameters that we can tweak?
2. The master failed, so how come the other nodes didn't take over as master?
3. Can a node that is not collecting logs cause the clients' rsyslogd to queue up and fail, even if the logs are sent via UDP?

rkennedy · Post by **rkennedy** » Wed Jul 27, 2016 1:05 pm

(CRON) ERROR (setreuid failed): Resource temporarily unavailable

What is the output of sysctl fs.file-nr? You may be hitting a file descriptor limit.
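
A related check, offered as a sketch rather than a diagnosis: "Resource temporarily unavailable" is the text for `EAGAIN`, and the Linux man page for `setreuid` lists the target user's `RLIMIT_NPROC` (max processes/threads) as a cause of that error. The snippet below compares an approximate task count for a user against the process limit. The function names are made up for this example, the count is only an approximation taken from `/proc` ownership, and the limit shown is the one inherited by the Python process itself, which may differ from the limit cron applies.

```python
import os
import resource


def count_tasks_for_uid(uid: int, proc_root: str = "/proc") -> int:
    """Approximate the number of tasks (threads) owned by `uid`."""
    if not os.path.isdir(proc_root):
        return 0  # No procfs available (non-Linux system or restricted sandbox).
    total = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # Only numeric entries are process directories.
        proc_dir = os.path.join(proc_root, entry)
        try:
            if os.stat(proc_dir).st_uid != uid:
                continue
            total += len(os.listdir(os.path.join(proc_dir, "task")))
        except OSError:
            # The process exited or is not readable; skip it.
            continue
    return total


def nproc_report(uid=None) -> dict:
    """Report the task count for `uid` next to this process's RLIMIT_NPROC."""
    uid = os.getuid() if uid is None else uid
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return {
        "uid": uid,
        "tasks": count_tasks_for_uid(uid),
        "soft_limit": soft,  # resource.RLIM_INFINITY means unlimited
        "hard_limit": hard,
    }


print(nproc_report())
```

If the task count sits close to the soft limit for the user running the Java process or the cron jobs, `ulimit -u` and `/etc/security/limits.d/` would be the places to look next.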

2. What specs do your different nodes have? How are you inputting all of your logs to spread them between the cluster members? I believe it should have designated another master, but with the out of memory error, I wonder if that's related.

3. If rsyslog is unable to forward them, it should queue them until they can be sent. I believe they are held in memory until they can be forwarded, but this could depend on how your rsyslog.conf file is set up.

CFT6Server · Post by **CFT6Server** » Wed Jul 27, 2016 5:43 pm

Here's the limit.

Code: Select all

fs.file-nr = 7296       0       2033478
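
For reference, the three numbers in `fs.file-nr` are the allocated file handles, the allocated-but-unused handles, and the system-wide maximum. A small sketch of turning that line into a usage percentage follows; the function name and the returned keys are assumptions for this example.

```python
def parse_file_nr(line: str) -> dict:
    """Parse 'fs.file-nr = <allocated> <unused> <max>' sysctl output."""
    # Keep only the part after '=', if present, then split on whitespace.
    values = line.split("=")[-1].split()
    allocated, unused, maximum = (int(v) for v in values)
    return {
        "allocated": allocated,
        "unused": unused,
        "max": maximum,
        "used_pct": 100.0 * (allocated - unused) / maximum,
    }


report = parse_file_nr("fs.file-nr = 7296       0       2033478")
print(round(report["used_pct"], 2))  # prints 0.36
```

At well under 1% of the system-wide maximum, this reading does not point to file handle exhaustion, although it was taken after the reboot and says nothing about per-process limits.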

Each of the 6 nodes has 6 CPUs and 20GB of RAM.
Inputs are spread between 3 of the nodes.

As for the client-side issue, that is no longer a concern: I found out that the remote client was sending via TCP, which would have caused issues on the client's end if the listening node on this side was not responsive.
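
The difference between the two transports can be seen with a few lines of socket code. This is only a sketch of the transport behaviour on the loopback interface, with made-up helper names; it does not model rsyslog's own queues. A UDP datagram is handed to the kernel and the send returns even if nothing is listening, while a TCP sender needs the receiver to accept the connection first.

```python
import socket


def free_port(kind: int) -> int:
    """Ask the OS for an unused loopback port, then release it again."""
    with socket.socket(socket.AF_INET, kind) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def send_udp(message: bytes, port: int) -> int:
    """Fire-and-forget: returns the number of bytes handed to the kernel."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        return s.sendto(message, ("127.0.0.1", port))


def send_tcp(message: bytes, port: int, timeout: float = 2.0) -> bool:
    """Returns True only if a listener accepted the connection and the data."""
    try:
        with socket.create_connection(("127.0.0.1", port), timeout=timeout) as s:
            s.sendall(message)
        return True
    except OSError:
        # Connection refused or timed out: the sender has to deal with it.
        return False


msg = b"<13>test message"
print(send_udp(msg, free_port(socket.SOCK_DGRAM)))   # bytes sent, no listener needed
print(send_tcp(msg, free_port(socket.SOCK_STREAM)))  # False when nothing is listening
```

With TCP forwarding, what the client does while the receiver is down (block, queue, or drop) depends on how its rsyslog queues are configured.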

hsmith · Post by **hsmith** » Thu Jul 28, 2016 10:09 am

Since we have a few issues we're discussing here, I would like to make sure we're on the same page. Is the only issue you're experiencing right now the one with another node not taking over as master when the master did fail?

CFT6Server · Post by **CFT6Server** » Thu Jul 28, 2016 10:32 am

I had to hard reboot the entire cluster, so it is back at green.

Now I am trying to get to the bottom of 2 main issues.

1. What is the resource unavailable message, and what can we do about it?
2. Why could the cluster not elect a master?

hsmith · Post by **hsmith** » Thu Jul 28, 2016 1:46 pm

Are these servers in the same geographic location?

CFT6Server · Post by **CFT6Server** » Thu Jul 28, 2016 5:28 pm

They are. Same network/subnet.
