Cluster failure and UDP syslogs
Posted: Tue Jul 26, 2016 5:31 pm
Looks like a few days back our cluster came to a complete halt. We have different data sources going to 3 different nodes, but all of them seem to have stopped. I see the following errors about a day prior to all the logs stopping on the log server cluster:
Time of first observable event: 2016-07-23T05:25:01.000Z
(CRON) ERROR (setreuid failed): Resource temporarily unavailable
I was not seeing any logs until the entire cluster was restarted. Looks like at some point the cluster had a memory issue:
Code:
[2016-07-22 14:16:12,425][WARN ][transport.netty ] [4521585a-88af-47c9-81e5-c4d13cffb148] exception caught on transport layer [[id: 0x4140ed8a, /<IP>:49192 => /<IP>:9300]], closing connection
java.lang.OutOfMemoryError: Java heap space
On another node I can see:
Code:
[2016-07-22 11:26:20,768][DEBUG][action.admin.cluster.node.stats] [9a92d6ef-d554-49d8-9191-dcf886382926] failed to execute on node [6r8jhfnZSnqEcG7h59h-SQ]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [e63648a3-d912-4f5d-a867-1b99282a5e7c][inet[/<IP>:9300]][cluster:monitor/nodes/stats[n]] request_id [233583455] timed out after [15000ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:529)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Before I rebooted the cluster, it was even having trouble showing me the nodes with "curl localhost:9200/_cat/nodes".
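From what I have read, that cron error is EAGAIN from setreuid(), which on Linux usually means the target user was already at its per-user process limit (RLIMIT_NPROC) when cron tried to switch to it, which would fit the box being starved of resources. A quick illustrative sketch (same information as "ulimit -u") to read that limit:

```python
import resource

# RLIMIT_NPROC is the per-user process limit that setreuid() enforces; when
# the user is already at this limit, setreuid() fails with EAGAIN, which
# crond reports as "Resource temporarily unavailable".
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print("max user processes (soft/hard):", soft, hard)
```

If that limit turns out to be low, it can typically be raised via the nproc entry in /etc/security/limits.conf, though I have not confirmed that is the limit being hit here.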
So I guess the cluster was low on memory, but looking at the "Resource temporarily unavailable" message, could this be caused by other memory settings, such as kernel limits, that can be tweaked here? Suggestions?
What we are seeing is that servers sending syslogs to Nagios Log Server over UDP developed issues over time and stopped logging locally as well; this seems to happen about a day after the first cluster resource event. Have you seen a case where sending UDP syslogs to a Nagios Log Server that is not responding causes the source to stop logging?
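To clarify why the UDP part puzzles me: as I understand it, plain UDP syslog is fire-and-forget, so the sender should not block just because the receiver is down. A minimal sketch showing that a UDP send succeeds even with nothing listening (port 55514 is an arbitrary example port I am assuming has no listener):

```python
import socket

# UDP is connectionless: sendto() hands the datagram to the kernel and
# returns immediately, even if no one is listening on the destination port.
msg = b"<13>Jul 26 17:31:00 host test: hello"
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sent = sock.sendto(msg, ("127.0.0.1", 55514))  # returns len(msg) on success
sock.close()
```

Given that, I would have expected the senders' own queue or rate-limit settings (e.g. in rsyslog/syslog-ng) to be the thing that stops local logging, rather than UDP itself, but I would like confirmation.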