Troubles with unstable ingestion on AWS M5 image

Post by **Wintermute** » Fri Jan 31, 2020 8:55 am

Hi there,

I'm having some weird problems with an unstable ingestion that I was hoping you fellas could help me with:

After a seemingly random amount of time, often a couple of hours, my ingestion slows to a crawl. Following a Logstash restart, I see a huge spike from all the logs that have been rejected from my hosts up until that point. It's always the same cycle: a big ingestion of the backlog, followed by a more normal rate, and then an almost flat line (see pics).

ingestion problem4.PNG

ingestion problem2.PNG

ingestion problem1.PNG

After the restart, my CPU works hard on the backlog, then drops to 25-35% during healthy ingestion, and then to 5-15% when ingestion is almost completely lost (you can see the three 'stages' of ingestion in the pics). I'm on an M-instance in AWS, so I shouldn't be limited by CPU credits or the like. Resources should be sufficient as well: the instance itself seems to be doing fine during periods of heavy ingestion.

At first I thought it was due to my filters or my Logstash settings. However, at this point I believe I've adjusted all the settings that have been suggested online: heap size, ES index throttling, and worker count/batch size. I checked for I/O wait as well, and there's none.

As a last effort, I've tried disabling all filters and running just the raw ingestion to see if the problem persists. My logic is that if the raw ingestion fails in the same way as the filtered one, it's a backend, NagiosLS image, or AWS instance problem.

I just confirmed today that it does. Although I do see a general performance increase, the raw input follows the same pattern as the filtered one.

The logs from both Logstash and ES show no errors when the ingestion slows down. We also have plenty of headroom before we hit the 90% disk watermark.
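For anyone wanting to double-check the watermark headroom mentioned above, here is a minimal Python sketch. It assumes the Elasticsearch data sits on the filesystem containing `DATA_PATH` (a placeholder, not a path from this setup), and it only approximates what Elasticsearch itself computes, so treat the number as a sanity check.

```python
import shutil


def disk_used_percent(path="/"):
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def watermark_headroom(used_percent, watermark_percent=90.0):
    """Percentage points left before the watermark; negative means it is exceeded."""
    return watermark_percent - used_percent


if __name__ == "__main__":
    # Assumption: replace DATA_PATH with the directory that holds the ES indices.
    DATA_PATH = "/"
    used = disk_used_percent(DATA_PATH)
    left = watermark_headroom(used)
    print(f"{DATA_PATH}: {used:.1f}% used, {left:.1f} points below the 90% watermark")
```

Running this while ingestion is healthy and again once it has stalled would show whether disk usage changes between the two states.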

At this point I'm pretty stumped on the possible sources of this and I'm looking into whether there could be some problem with my AWS instance.

My instance:
- Installed using the Nagios LS AMI from the marketplace
- Instance: M5.2xlarge (8 cores, 32 GB RAM, NVMe SSD)
- Nagios ver: 2.1.4 (I experienced the problem on previous versions as well)
- Single Instance NagiosLS license

Only input/filter active atm:

tcp {
    # Decode incoming JSON as Windows-1252 (CP1252)
    codec => json { charset => "CP1252" }
    port => 5052

    # TLS is enabled; client certificates are not verified
    ssl_cert => "crt.path"
    ssl_key => "key.path"
    ssl_enable => true
    ssl_verify => false

    type => "nxlog-json"
    tags => [ "tcpjson" ]
}

Settings adjusted:

/etc/sysconfig/logstash:

LS_OPTS=" -b 12000 -w 8"
LS_HEAP_SIZE="5120m"

/etc/sysconfig/elasticsearch:

ES_HEAP_SIZE=20480m
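To put the heap and pipeline settings above in context, here is a small Python sketch of the arithmetic. It assumes the full 32 GB of the instance is available, and it uses the general Elasticsearch guidance of keeping the heap at or below roughly half of RAM so the rest can serve as filesystem cache. Whether either number is related to the slowdown is not established here; the function names are made up for illustration.

```python
def parse_mb(value):
    """Parse a JVM-style size such as '5120m' or '20g' into megabytes."""
    value = value.strip().lower()
    if value.endswith("g"):
        return int(value[:-1]) * 1024
    if value.endswith("m"):
        return int(value[:-1])
    raise ValueError(f"unsupported size suffix in {value!r}")


def memory_budget(total_ram_mb, ls_heap, es_heap):
    """Summarise how much RAM the two JVM heaps claim and what is left over."""
    ls_mb = parse_mb(ls_heap)
    es_mb = parse_mb(es_heap)
    return {
        "heap_total_mb": ls_mb + es_mb,
        "es_heap_share": es_mb / total_ram_mb,
        "left_for_os_mb": total_ram_mb - ls_mb - es_mb,
    }


def in_flight_events(batch_size, workers):
    """Upper bound on events held in the pipeline at once (-b times -w)."""
    return batch_size * workers


# Values taken from the settings posted above; 32 GB total RAM is an assumption.
budget = memory_budget(32 * 1024, "5120m", "20480m")
print(budget)
print(in_flight_events(12000, 8))
```

With these settings the two heaps claim 25600 MB, the ES heap alone is 62.5% of RAM, about 7 GB is left for the OS and filesystem cache, and up to 96000 events can be in flight in Logstash at once.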

/usr/local/nagioslogserver/elasticsearch/config/elasticsearch.yml:

index.number_of_replicas: 0
indices.store.throttle.type: none

I use nxlog to ship my logs.

If you want more info, I'll happily provide it.

Thanks in advance!

Post by **cdienger** » Fri Jan 31, 2020 3:05 pm

How many clients are using NXLog to send logs? Please gather a profile and PM it to me the next time the system is in a state of low ingestion.
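One rough way to answer the client-count question is to count distinct peer addresses connected to the Logstash TCP input. The Python sketch below parses text in the column layout printed by `ss -tn`; the sample output and the `count_clients` helper are invented for illustration, and only port 5052 comes from the config posted above.

```python
def count_clients(ss_output, port=5052):
    """Count distinct peer hosts with an established connection to `port`."""
    clients = set()
    for line in ss_output.splitlines():
        fields = line.split()
        # Expected columns: State Recv-Q Send-Q Local:Port Peer:Port
        if len(fields) < 5 or fields[0] != "ESTAB":
            continue
        local_port = fields[3].rsplit(":", 1)[-1]
        peer_host = fields[4].rsplit(":", 1)[0]
        if local_port == str(port):
            clients.add(peer_host)
    return len(clients)


# Made-up sample in the layout of `ss -tn`; not taken from the real system.
SAMPLE = """\
State  Recv-Q Send-Q Local Address:Port  Peer Address:Port
ESTAB  0      0      10.0.0.5:5052       10.0.1.17:49812
ESTAB  0      0      10.0.0.5:5052       10.0.1.17:49813
ESTAB  0      0      10.0.0.5:5052       10.0.1.18:50110
ESTAB  0      0      10.0.0.5:22         10.0.9.9:60222
"""

print(count_clients(SAMPLE))  # prints 2
```

On the server itself, the real output of `ss -tn` could be passed in instead of `SAMPLE`. Clients behind a NAT or load balancer would share one address, so the result is a lower bound.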
