I'm having some weird problems with an unstable ingestion that I was hoping you fellas could help me with:
After a seemingly random timespan, often a couple of hours, my ingestion almost crawls to a halt. Following a logstash restart, I see a huge spike from all the logs that have been rejected from my hosts up until that point. It's always the same cycle: a big ingestion of the backlog, followed by a more expected rate, and then finishing off with an almost dead line (see pics).
After the restart, my CPU starts working hard on the backlog, then drops to 25-35% during healthy ingestion, and then to 5-15% when ingestion is almost completely lost (you can see the 3 'stages' of ingestion in the pics). I'm on an M-instance in AWS so I shouldn't be limited on credits or the like. Resources should be good as well. The M-instance itself seems to be doing fine during periods of heavy ingestion.
At first, I thought it was due to my filters or my logstash settings. However, at this point I believe I've adjusted all the settings that have been suggested online: heap size, ES index throttling and worker/batch size. I checked for I/O wait as well, and there's none.
As a last effort, I've tried disabling all filters and just running the raw ingestion to see if the problem persists. My logic is that if the raw ingestion fails in the same way as the filtered one, it's either a backend, NagiosLS image or AWS instance problem.
I just confirmed today that it does: although I do see a general performance increase, the raw input follows the same pattern as the filtered one.
The logs from both logstash and ES show no errors when the ingestion slows down. We have plenty of headroom before we hit the 90% HDD watermark as well.
At this point I'm pretty stumped on the possible sources of this and I'm looking into whether there could be some problem with my AWS instance.
My instance:
- Installed using the Nagios LS AMI from the marketplace
- Instance: M5.2xlarge (8 cores and 32 gigs RAM + NVMeSSD)
- Nagios ver: 2.1.4 (I experienced the problem on previous versions as well)
- Single Instance NagiosLS license
Only input/filter active atm:
tcp {
  codec => json { charset => ["CP1252"] }
  port => "5052"
  ssl_cert => "crt.path"
  ssl_key => "key.path"
  ssl_enable => true
  ssl_verify => false
  type => "nxlog-json"
  tags => [ "tcpjson" ]
}

/etc/sysconfig/logstash:
LS_OPTS=" -b 12000 -w 8"
LS_HEAP_SIZE="5120m"

ES_HEAP_SIZE=20480m

index.number_of_replicas: 0
indices.store.throttle.type: none

If you want more info, I'll happily provide it.
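For reference, here is the arithmetic those settings imply, as a rough Python sketch. The numbers are copied from the settings above. It assumes "32 gigs" means 32 * 1024 MiB, and that `-b` and `-w` are the Logstash pipeline batch size and worker count, so that workers times batch size is an upper bound on events held in worker batches. The function names are just mine for illustration.

```python
# Values copied from the settings in this post.
INSTANCE_RAM_MB = 32 * 1024  # assumption: "32 gigs RAM" = 32 GiB
LS_HEAP_MB = 5120            # LS_HEAP_SIZE="5120m"
ES_HEAP_MB = 20480           # ES_HEAP_SIZE=20480m
BATCH_SIZE = 12000           # -b 12000
WORKERS = 8                  # -w 8


def memory_budget(ram_mb, *heaps_mb):
    """Return (total JVM heap, RAM left for OS and filesystem cache) in MiB."""
    total_heap = sum(heaps_mb)
    return total_heap, ram_mb - total_heap


def max_in_flight_events(batch_size, workers):
    """Upper bound on events held in worker batches at the same time."""
    return batch_size * workers


total_heap, remaining = memory_budget(INSTANCE_RAM_MB, LS_HEAP_MB, ES_HEAP_MB)
print(f"JVM heaps: {total_heap} MiB, left for OS/page cache: {remaining} MiB")
print(f"Max in-flight events: {max_in_flight_events(BATCH_SIZE, WORKERS)}")
```

So the two heaps together take 25600 MiB of the 32768 MiB, and up to 96000 events can sit in worker batches at once. I don't know whether either number is related to the slowdown, but it may help when judging the settings.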
Thanks in advance!