Logstash dying after new log inputs added

This support forum board is for support questions relating to Nagios Log Server, our solution for managing and monitoring critical log data.
gormank
Posts: 1114
Joined: Tue Dec 02, 2014 12:00 pm

Logstash dying after new log inputs added

Post by gormank »

This is a continuation of ticket 980762. I had been adding to that old ticket and went back to look when there was no response. I see it was closed, and then reopened by my reply. I also see the email responses in the ticket are formatted so that no one can read them. I figured it might be time to start fresh.

We have two of these systems in the production environment: one is live and receiving ~55k messages/min, the other is a standby receiving ~10k messages/min.
Both have the same issue.

NLS 2.1.6
3 instance clusters
RHEL 7.8, 8 cores, 64G RAM
All are virtual machines
/etc/sysconfig/elasticsearch:ES_HEAP_SIZE=$(expr $(free -m|awk '/^Mem:/{print $2}') / 2 )m
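For reference, the ES_HEAP_SIZE line above just assigns half of total RAM (in MB) to the Elasticsearch heap. A minimal sketch of the same arithmetic, factored into a function so the result can be checked against a known value (the 64176 figure is just an illustrative number for a 64G VM):

```shell
# Reproduces the arithmetic in the ES_HEAP_SIZE line above: half of the
# total RAM reported by `free -m`, suffixed with "m" for megabytes.
heap_size() {
  total_mb=$1    # on a live node: $(free -m | awk '/^Mem:/{print $2}')
  echo "$(expr "$total_mb" / 2)m"
}

heap_size 64176   # prints 32088m
```

On a 64G box that lands around a 32G heap, which is the usual "half of RAM" guidance for Elasticsearch.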

Back in maybe June, I added a number of new file inputs, and logstash began to die roughly 5 times a day. I made a script to restart it after a few minutes when it did. I see the infamous bad file descriptor error in the log. Originally each system had 2 nodes, but we added a third to each cluster as you recommended. Then it took a while to get the load balancers set to distribute the data across all nodes.
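For what it's worth, the restart script is nothing fancy. A minimal sketch of the idea — the status strings matched and the 3-minute delay are assumptions, so adjust them to whatever your nodes actually print:

```shell
# Watchdog sketch: decide from the service status text whether logstash
# needs a restart. The matched strings are assumptions based on typical
# SysV/systemd status output; check what `service logstash status` prints.
needs_restart() {
  case "$1" in
    *dead*|*stopped*|*"not running"*) echo yes ;;
    *) echo no ;;
  esac
}

# Run from cron every few minutes, e.g.:
#   status=$(service logstash status 2>&1)
#   if [ "$(needs_restart "$status")" = yes ]; then
#     sleep 180                      # "restart after a few minutes"
#     service logstash restart
#   fi
```

Keeping the decision in a small function makes it easy to test the matching logic without touching the real service.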
One thing to note is that logstash only dies on the 01 and 02 nodes. The new 03 node doesn't seem to have the issue.
We had a remote support session before adding the 3rd node, and I thought one of the config files might be inconsistent after adding the node. But /etc/sysconfig/elasticsearch and elasticsearch.yml look the same on all nodes.
In the instance overview, the total instance count is 2, yet there are 3 in the list of instances.
Another oddity is the ports in the list of instances: on one cluster, all 3 show port 2001; on the other, 2 show 9300 and one shows 8244.
I thought all instances were created on all nodes, so the storage used for data would be about the same everywhere. That's not the case: the 3rd node uses far less space than the other two, and one cluster shows node 3 using no space at all on /data. I suppose that is the cluster that says there are 2 instances.

I guess the instance count, disk usage, and port issues need to be sorted first; then we can come back to logstash.
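In the meantime, a quick way to cross-check the instance/disk oddity from the command line is the _cat API that ships with the ES 1.x under NLS. The little awk filter below assumes the `_cat/allocation?v` layout where the shard count is the first column and the node name is the last — verify that against your own output before trusting it:

```shell
# Flag any node holding zero shards (a likely match for the node showing
# no space used on /data). Reads `_cat/allocation?v` output; assumes the
# shard count is column 1 and the node name is the last column.
flag_empty_nodes() {
  awk 'NR > 1 && $1 == 0 { print $NF }'
}

# On any cluster member:
#   curl -s 'localhost:9200/_cat/nodes?v'
#   curl -s 'localhost:9200/_cat/allocation?v' | flag_empty_nodes
```

`_cat/nodes` also prints the transport port each node is actually using, which should make the 9300-vs-8244 discrepancy easy to confirm.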

# curl -s localhost:9200/_cluster/health?pretty; echo
{
  "cluster_name" : "39b36e88-7460-492e-bdda-3adda329e4d7",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 56,
  "active_shards" : 112,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0
}

# grep -v ^# /usr/local/nagioslogserver/elasticsearch/config/elasticsearch.yml | sort -u

bootstrap.mlockall: true
cluster.name: nagios_elasticsearch
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["localhost"]
http.host: "localhost"
node.max_local_storage_nodes: 1
transport.tcp.compress: true
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: Logstash dying after new log inputs added

Post by cdienger »

We'll continue troubleshooting this through the ticket.