Nagios Support Forum

Posted: **Fri Jul 29, 2016 9:43 am**

How much data do you have incoming per day split between the 3 machines? Also, are you using local disks or NAS / SAN attached mounts?

I have a theory that you hit a file descriptor limit, which in turn caused the machine to fall out of sync with the cluster, and since resources became unavailable it didn't know what to do. It's hard to say, though, since everything is working at this point.
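One way to check that theory is to compare a process's open descriptor count with its limit. Here is a minimal sketch, assuming a Linux `/proc` filesystem; the function names are illustrative helpers, not part of the product:

```shell
#!/bin/sh
# Sketch: compare a process's open file descriptors with its soft limit.
# Assumes Linux (/proc); the function names are hypothetical helpers.

# Print the soft "Max open files" limit from limits-formatted text on stdin.
soft_nofile_limit() {
    awk '/^Max open files/ { print $4 }'
}

# Print the number of descriptors currently open by PID $1.
open_fd_count() {
    ls "/proc/$1/fd" 2>/dev/null | wc -l | tr -d ' '
}

# Print usage as "open/limit" for PID $1.
fd_usage() {
    limit=$(soft_nofile_limit < "/proc/$1/limits")
    echo "$(open_fd_count "$1")/$limit"
}
```

Something like `fd_usage "$(pgrep -f elasticsearch | head -n 1)"` would then print the two numbers side by side. Reading another user's `/proc/<pid>/fd` generally requires root, and the `pgrep` pattern is only a guess at how the process appears on your nodes.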

Posted: **Fri Jul 29, 2016 10:26 am**

I think adjusting the limit is a good start. We have quite a large number of inputs and are probably pushing the limits a bit. Here are some details. Last night it looks like the inputs stopped again, although there are no logs on the Elasticsearch or Logstash side. (I am still looking through the nodes.) It just seems that certain nodes stop taking any logs. Cluster health in this case was still green, so it's slightly different, but I am guessing that's perhaps how it starts? I also noticed that our local configurations are all gone. (This consistently happens after a crash.) So the local file input configurations are just nowhere to be found.
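For reference, the health color comes from Elasticsearch's `_cluster/health` API. Here is a minimal sketch of reading it from the shell, assuming Elasticsearch is reachable on localhost:9200; the function names and the `ES_HOST` variable are made up for illustration:

```shell
#!/bin/sh
# Sketch: read the cluster status from Elasticsearch's _cluster/health API.
# Assumes Elasticsearch listens on localhost:9200 unless ES_HOST is set
# (ES_HOST is a hypothetical variable, not a product setting).

# Extract the "status" value from a _cluster/health JSON document on stdin.
health_status() {
    sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p'
}

# Query the node and print green, yellow, or red.
cluster_status() {
    curl -s "http://${ES_HOST:-localhost:9200}/_cluster/health" | health_status
}
```

Green only means that all primary and replica shards are allocated. It says nothing about whether the Logstash inputs on a given node are still receiving, which would fit what we saw.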

Overall statistics

index status.JPG

Indices: we should be doing anywhere from 160 to 200 GB or so per day on average; anything less than that means logs are being dropped or something isn't working. Notice the 22nd to the 26th; that's where the cluster hard crashed.

indices.JPG
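To spot the short days without reading the chart, the per-index sizes can be pulled from `_cat/indices` and compared against the expected minimum. Here is a rough sketch, assuming daily indices and Elasticsearch on localhost:9200; the function name and the 160 default are illustrative:

```shell
#!/bin/sh
# Sketch: flag daily indices whose size falls below an expected minimum.
# Input: lines of "<index> <size-in-bytes>", e.g. from
#   curl -s 'http://localhost:9200/_cat/indices?h=index,store.size&bytes=b'
# The host, the 160 GB default, and the function name are assumptions.

# Print indices smaller than $1 gigabytes (default 160), with size in whole GB.
low_volume_indices() {
    awk -v min_gb="${1:-160}" '
        $2 ~ /^[0-9]+$/ {
            gb = $2 / (1024 * 1024 * 1024)
            if (gb < min_gb) printf "%s %d\n", $1, gb
        }'
}
```

Piping the `curl` output from the comment into `low_volume_indices 160` would list only the days under the threshold. Note that `store.size` includes replica copies, so `pri.store.size` may be closer to the raw incoming volume.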

Posted: **Fri Jul 29, 2016 1:55 pm**

Increasing those limits won't hurt, and it will help us see whether the same thing happens in the future.

CFT6Server wrote: I also noticed that our local configurations are all gone. (This consistently happens after a crash.) So the local file input configurations are just nowhere to be found.

Which configurations are you referring to?

CFT6Server wrote: Indices: we should be doing anywhere from 160 to 200 GB or so per day on average; anything less than that means logs are being dropped or something isn't working. Notice the 22nd to the 26th; that's where the cluster hard crashed.

Can you post a screenshot of your backup & maintenance page(s) (all pages if they are different between machines)? With this much data, I have a feeling that's part of the culprit as well.

Another thought - is there a reason you're sending logs to only 3 of the 6 members?

Posted: **Thu Aug 04, 2016 3:42 pm**

I mean the local configurations that are node-specific. They don't seem to stick.

Our backup and maintenance settings are the same for all the nodes in the cluster.

backup and maintenance.JPG

We are only sending to 3 nodes because the other 3 were not going to be permanent when we first implemented this. However, since there isn't a native way to load balance the sources across all nodes, we are sending to nodes by source type, so each type goes to one node.

Posted: **Thu Aug 04, 2016 4:43 pm**

CFT6Server wrote: The local configurations that are node specific. They don't seem to stick.

Could you please clarify which configuration you're talking about? Just trying to understand what part of the local configuration you're referring to.

Has increasing those limits stopped the error, or does it still persist?

Posted: **Fri Aug 05, 2016 12:08 pm**

These are the local configurations (per instance), where you can specify inputs specific to the local node.

CONFIG.JPG

I have not increased the file descriptors yet, but I have not seen any issues with the cluster thus far.

Posted: **Fri Aug 05, 2016 1:01 pm**

CFT6Server wrote: (I am still looking through the nodes.) It just seems that certain nodes stop taking any logs.

I'm going to throw this into the mix: with this volume of data coming into 3 instances, you may want to bump up the heap allocation for Logstash by editing

change this

    #LS_HEAP_SIZE="256m"

to something like this

    LS_HEAP_SIZE="2048m"

then

    service logstash restart
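If you would rather script the edit across several instances, here is a minimal sketch of the same change. `set_ls_heap` is a hypothetical helper, not something shipped with the product; it takes the file path and the new size as arguments:

```shell
#!/bin/sh
# Sketch: uncomment and set LS_HEAP_SIZE in a Logstash sysconfig-style file.
# set_ls_heap is a hypothetical helper; pass the file path and the new size.

set_ls_heap() {
    file=$1
    size=$2
    tmp=$(mktemp) || return 1
    # Rewrite any commented or uncommented LS_HEAP_SIZE line; keep other lines.
    sed "s/^[#[:space:]]*LS_HEAP_SIZE=.*/LS_HEAP_SIZE=\"$size\"/" "$file" > "$tmp" &&
        cat "$tmp" > "$file"   # overwrite in place to preserve owner and mode
    status=$?
    rm -f "$tmp"
    return $status
}
```

For example, `set_ls_heap /path/to/logstash/sysconfig 2048m && service logstash restart`, where the path is a placeholder for wherever that file lives on your system. If the file has no `LS_HEAP_SIZE` line at all, the helper leaves it unchanged, so check the result before restarting.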

Posted: **Fri Aug 05, 2016 2:36 pm**

Thanks. For our implementation, I have the LS heap set to 1024m, but I'll increase it. I edited the config in /etc/sysconfig/logstash.

Posted: **Mon Aug 08, 2016 9:50 am**

Did that help, or are you still experiencing issues?

Posted: **Thu Aug 18, 2016 10:26 am**

We did not change the setting. The LS heap was already at 1024m.

Cluster failure and UDP syslogs
