Cluster failure and UDP syslogs

This support forum board is for support questions relating to Nagios Log Server, our solution for managing and monitoring critical log data.
CFT6Server
Posts: 506
Joined: Wed Apr 15, 2015 4:21 pm

Cluster failure and UDP syslogs

Post by CFT6Server »

A few days back, our cluster came to a complete halt. We have different data sources going to 3 different nodes, but all of them seem to have stopped. I see the following errors starting about 1 day before all logs stopped arriving on the Log Server cluster:

Time of first observable event - 2016-07-23T05:25:01.000Z
(CRON) ERROR (setreuid failed): Resource temporarily unavailable

I was not seeing any logs until the entire cluster was restarted. It looks like at some point the cluster had a memory issue.

Code: Select all

[2016-07-22 14:16:12,425][WARN ][transport.netty          ] [4521585a-88af-47c9-81e5-c4d13cffb148] exception caught on transport layer [[id: 0x4140ed8a, /<IP>:49192 => /<IP>:9300]], closing connection
java.lang.OutOfMemoryError: Java heap space

On another node I can see:

Code: Select all

[2016-07-22 11:26:20,768][DEBUG][action.admin.cluster.node.stats] [9a92d6ef-d554-49d8-9191-dcf886382926] failed to execute on node [6r8jhfnZSnqEcG7h59h-SQ]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [e63648a3-d912-4f5d-a867-1b99282a5e7c][inet[/<IP>:9300]][cluster:monitor/nodes/stats[n]]] request_id [233583455] timed out after [15000ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:529)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

Before I rebooted the cluster, it was even having trouble listing the nodes with "curl localhost:9200/_cat/nodes".

So I guess the cluster was low on memory. But looking at the "Resource temporarily unavailable" message, could this be caused by other memory settings, such as kernel memory parameters, that can be tweaked here? Any suggestions?

What we are seeing is that servers sending syslogs to Nagios Log Server developed issues over time and would stop logging locally as well; this seems to happen about a day after the first cluster resource event. Have you seen a case where sending UDP syslogs causes the source to stop logging if Nagios Log Server is not responding via UDP?
CFT6Server
Posts: 506
Joined: Wed Apr 15, 2015 4:21 pm

Re: Cluster failure and UDP syslogs

Post by CFT6Server »

A couple more messages from another node. It looks like the master just disappeared and the cluster was left without a master?

Code: Select all

[2016-07-22 08:48:29,302][INFO ][monitor.jvm              ] [e63648a3-d912-4f5d-a867-1b99282a5e7c] [gc][old][768200][46200] duration [11.4m], collections [97]/[11.5m], total [11.4m]/[1.6h], memory [12.9gb]->[12.8gb]/[12.9gb], all_pools {[young] [399.4mb]->[339.2mb]/[399.4mb]}{[survivor] [49.4mb]->[0b]/[49.8mb]}{[old] [12.5gb]->[12.5gb]/[12.5gb]}
[2016-07-22 08:48:29,449][INFO ][discovery.zen            ] [e63648a3-d912-4f5d-a867-1b99282a5e7c] master_left [[9a92d6ef-d554-49d8-9191-dcf886382926][SYbQ13ObSNSAosQoTVrWMw][kdcnagls2n1.bchydro.bc.ca][inet[/10.242.102.123:9300]]{max_local_storage_nodes=1}], reason [do not exists on master, act as master failure]
[2016-07-22 08:48:29,452][WARN ][discovery.zen            ] [e63648a3-d912-4f5d-a867-1b99282a5e7c] master left (reason = do not exists on master, act as master failure), current nodes: {[2db4ce89-4c01-4a30-9bc8-66e987b7d613][gW4H7VmzRDC_QeexUTz5JA][kdcnagls2n3.bchydro.bc.ca][inet[/10.242.102.125:9300]]{max_local_storage_nodes=1},[c424515a-16b3-43f9-866e-19daedef8a63][p8t5Q_6pQCCUeI4m60TZkQ][kdcnagls2n2.bchydro.bc.ca][inet[/10.242.102.124:9300]]{max_local_storage_nodes=1},[e63648a3-d912-4f5d-a867-1b99282a5e7c][6r8jhfnZSnqEcG7h59h-SQ][kdcnagls1n3.bchydro.bc.ca][inet[/10.242.102.109:9300]]{max_local_storage_nodes=1},[30ab2b2c-439f-4bcc-977d-7c0e9a90f3a5][f9bV0Ej0TsG3L_JTpGNOSA][kdcnagls1n2.bchydro.bc.ca][inet[/10.242.102.108:9300]]{max_local_storage_nodes=1},[4521585a-88af-47c9-81e5-c4d13cffb148][WLH664xeS8efPqtsyZ56lQ][kdcnagls1n1.bchydro.bc.ca][inet[/10.242.102.107:9300]]{max_local_storage_nodes=1},}
[2016-07-22 08:48:29,454][INFO ][cluster.service          ] [e63648a3-d912-4f5d-a867-1b99282a5e7c] removed {[9a92d6ef-d554-49d8-9191-dcf886382926][SYbQ13ObSNSAosQoTVrWMw][kdcnagls2n1.bchydro.bc.ca][inet[/10.242.102.123:9300]]{max_local_storage_nodes=1},}, reason: zen-disco-master_failed ([9a92d6ef-d554-49d8-9191-dcf886382926][SYbQ13ObSNSAosQoTVrWMw][kdcnagls2n1.bchydro.bc.ca][inet[/10.242.102.123:9300]]{max_local_storage_nodes=1})
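
The [gc][old] line in the log above is worth reading closely: the old-generation pool is reported as 12.5gb before and 12.5gb after a collection that took 11.4 minutes, so the collector appears to have reclaimed essentially nothing. The following is a rough Python sketch of how such a line could be checked automatically. The helper names (`to_bytes`, `pool_usage`, `old_gen_fill_after_gc`) are made up for illustration, and the regular expression only assumes the `{[pool] [before]->[after]/[max]}` layout seen in this particular log line, not every Elasticsearch version.

```python
import re

# Unit suffixes as they appear in the monitor.jvm line quoted in this thread.
_UNITS = {"b": 1, "kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}

# Matches one pool entry such as "{[old] [12.5gb]->[12.5gb]/[12.5gb]}".
_POOL_RE = re.compile(r"\{\[(\w+)\] \[([^\]]+)\]->\[([^\]]+)\]/\[([^\]]+)\]\}")


def to_bytes(text):
    """Convert a size like '12.5gb' or '0b' to a number of bytes."""
    match = re.fullmatch(r"([\d.]+)(b|kb|mb|gb)", text)
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    return float(match.group(1)) * _UNITS[match.group(2)]


def pool_usage(line):
    """Return {pool_name: (before, after, maximum)} in bytes for one log line."""
    return {
        name: (to_bytes(before), to_bytes(after), to_bytes(maximum))
        for name, before, after, maximum in _POOL_RE.findall(line)
    }


def old_gen_fill_after_gc(line):
    """Fraction of the old generation still in use after the collection."""
    _, after, maximum = pool_usage(line)["old"]
    return after / maximum


# The all_pools part of the log line posted above.
sample = (
    "all_pools {[young] [399.4mb]->[339.2mb]/[399.4mb]}"
    "{[survivor] [49.4mb]->[0b]/[49.8mb]}"
    "{[old] [12.5gb]->[12.5gb]/[12.5gb]}"
)

if __name__ == "__main__":
    print(f"old gen fill after GC: {old_gen_fill_after_gc(sample):.0%}")
```

A fill ratio at or near 100% after an old-generation collection is consistent with the java.lang.OutOfMemoryError reported on the other node, although it does not by itself say what was holding the heap.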
Box293
Too Basu
Posts: 5126
Joined: Sun Feb 07, 2010 10:55 pm
Location: Deniliquin, Australia

Re: Cluster failure and UDP syslogs

Post by Box293 »

As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
CFT6Server
Posts: 506
Joined: Wed Apr 15, 2015 4:21 pm

Re: Cluster failure and UDP syslogs

Post by CFT6Server »

Unfortunately, I already know what those symptoms and cluster status messages mean, but they don't really help us. It isn't a disk space or sharding issue; the sharding error looks like it was a result of not having a master node. I am looking to find out a few things:

1. Is that resource-related message (which happens outside of Log Server) tied to another setting, or perhaps some internal kernel parameter that we can tune?
2. The master failed, so why didn't one of the other nodes take over as master?
3. Can a node that is not collecting logs cause the clients' rsyslogd to queue up and fail, even if the logs are sent via UDP?
rkennedy
Posts: 6579
Joined: Mon Oct 05, 2015 11:45 am

Re: Cluster failure and UDP syslogs

Post by rkennedy »

(CRON) ERROR (setreuid failed): Resource temporarily unavailable
What is the output of sysctl fs.file-nr? You may be hitting a file descriptor limit.

2. What specs do your different nodes have? How are you inputting all of your logs to spread them between the cluster members? I believe it should have designated another master, but with the out of memory error, I wonder if that's related.

3. If it's unable to forward them, it should queue them until they can be sent. I believe they are stored in memory until they can be forwarded, but this could depend on how your rsyslog.conf file is set up.
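
On the "(CRON) ERROR (setreuid failed): Resource temporarily unavailable" message: as far as the Linux setreuid(2) man page describes it, that error (EAGAIN) can be returned when switching to the target user would push that user over its RLIMIT_NPROC (process/thread) limit, so the per-user process limit is worth checking alongside file descriptors. Below is a minimal Python sketch that prints both limits for whichever user runs it. The function names are hypothetical, and it only reports the limits of the current process, so it would need to be run as the user whose cron jobs are failing.

```python
import resource


def describe_limit(label, rlimit_id):
    """Format the soft and hard values of one resource limit."""
    soft, hard = resource.getrlimit(rlimit_id)

    def fmt(value):
        return "unlimited" if value == resource.RLIM_INFINITY else str(value)

    return f"{label}: soft={fmt(soft)} hard={fmt(hard)}"


def current_limits():
    """Limits of the current process for processes/threads and open files."""
    return [
        describe_limit("max user processes (RLIMIT_NPROC)", resource.RLIMIT_NPROC),
        describe_limit("max open files (RLIMIT_NOFILE)", resource.RLIMIT_NOFILE),
    ]


if __name__ == "__main__":
    for line in current_limits():
        print(line)
```

If the soft RLIMIT_NPROC value turns out to be low relative to the number of threads the Java process runs under the same user, that would be one candidate explanation; the thread does not contain enough data to confirm it.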
Former Nagios Employee
CFT6Server
Posts: 506
Joined: Wed Apr 15, 2015 4:21 pm

Re: Cluster failure and UDP syslogs

Post by CFT6Server »

Here's the limit.

Code: Select all

fs.file-nr = 7296       0       2033478

Each of the 6 nodes has 6 CPUs and 20 GB of RAM.
Inputs are spread across 3 of the nodes.

As for the client-side issue, that is no longer a concern: I found out that the remote client was sending via TCP, which would have caused issues on the client's end if the listening node on this side was not responsive.
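
For anyone reading the fs.file-nr output above: on Linux the three numbers are the count of allocated file handles, the count of allocated but unused handles, and the system-wide maximum (fs.file-max). A small Python sketch of the arithmetic follows; `parse_file_nr` and `FileNr` are illustrative names, not part of any tool.

```python
from collections import namedtuple

FileNr = namedtuple("FileNr", ["allocated", "unused", "maximum"])


def parse_file_nr(sysctl_line):
    """Parse a line like 'fs.file-nr = 7296       0       2033478'."""
    _, _, values = sysctl_line.partition("=")
    allocated, unused, maximum = (int(field) for field in values.split())
    return FileNr(allocated, unused, maximum)


def usage_ratio(file_nr):
    """Fraction of the system-wide file handle limit currently allocated."""
    return file_nr.allocated / file_nr.maximum


# The value posted above.
posted = parse_file_nr("fs.file-nr = 7296       0       2033478")

if __name__ == "__main__":
    print(f"{posted.allocated} of {posted.maximum} handles "
          f"({usage_ratio(posted):.2%}) in use")
```

With these numbers the system-wide handle table is well under 1% used at the time of sampling. Since the sample was taken after the cluster had been restarted, it says little about the state during the failure, and it does not cover per-process or per-user limits.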
hsmith
Agent Smith
Posts: 3539
Joined: Thu Jul 30, 2015 11:09 am
Location: 127.0.0.1

Re: Cluster failure and UDP syslogs

Post by hsmith »

Since we have a few issues we're discussing here, I would like to make sure we're on the same page. Is the only issue you're experiencing right now the one with another node not taking over as master when the master did fail?
Former Nagios Employee.
me.
CFT6Server
Posts: 506
Joined: Wed Apr 15, 2015 4:21 pm

Re: Cluster failure and UDP syslogs

Post by CFT6Server »

I had to hard reboot the entire cluster, so it is back to green.

Now I am trying to get to the bottom of 2 main issues.

1. What is the "Resource temporarily unavailable" message, and what can we do about it?
2. Why was the cluster unable to elect a master?
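
Some background that may help with question 2: Elasticsearch's zen discovery (the discovery.zen component in the logs earlier in this thread) only elects a master when enough master-eligible nodes can reach each other, controlled by the `discovery.zen.minimum_master_nodes` setting, and the usual recommendation is a majority of master-eligible nodes. The Python sketch below just works through that arithmetic. It assumes all 6 nodes are master-eligible and that the setting is a simple majority; neither is confirmed anywhere in this thread, and the function names are hypothetical.

```python
def master_quorum(master_eligible_nodes):
    """Majority of master-eligible nodes: floor(n / 2) + 1."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


def can_elect_master(responsive_nodes, master_eligible_nodes):
    """True if enough nodes are responsive to form a quorum."""
    return responsive_nodes >= master_quorum(master_eligible_nodes)


if __name__ == "__main__":
    total = 6  # assumption: every node in the cluster is master-eligible
    print("quorum:", master_quorum(total))
    for responsive in range(total, 0, -1):
        print(responsive, "responsive ->", can_elect_master(responsive, total))
```

Under those assumptions, losing the master alone leaves 5 of 6 nodes, which is still above the quorum of 4. An election would only stall if several nodes were unresponsive at the same time, for example while stuck in multi-minute garbage collections like the one logged above. That is a possibility to investigate, not a diagnosis.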
hsmith
Agent Smith
Posts: 3539
Joined: Thu Jul 30, 2015 11:09 am
Location: 127.0.0.1

Re: Cluster failure and UDP syslogs

Post by hsmith »

Are these servers in the same geographic location?
Former Nagios Employee.
me.
CFT6Server
Posts: 506
Joined: Wed Apr 15, 2015 4:21 pm

Re: Cluster failure and UDP syslogs

Post by CFT6Server »

They are. Same network/subnet.