Troubles with unstable ingestion on AWS M5 image

This support forum board is for support questions relating to Nagios Log Server, our solution for managing and monitoring critical log data.
Locked
Wintermute
Posts: 13
Joined: Fri Feb 22, 2019 4:25 am

Troubles with unstable ingestion on AWS M5 image

Post by Wintermute »

Hi there,

I'm having some weird problems with unstable ingestion that I was hoping you fellas could help me with:

Over a seemingly random timespan, often a couple of hours, my ingestion slows almost to a halt. Following a Logstash restart, I see a huge spike from all the logs that have been rejected from my hosts up until that point. It's always the same cycle: a big ingestion of the backlog, followed by a more expected rate, and finished off with an almost flat line (see pics).
ingestion problem4.PNG
ingestion problem2.PNG
ingestion problem1.PNG
After the restart, my CPU works hard on the backlog, then drops to 25-35% during healthy ingestion, and finally to 5-15% when ingestion is almost completely lost (you can see the three 'stages' of ingestion in the pics). I'm on an M-instance in AWS, so I shouldn't be limited by CPU credits or the like. Resources should be good as well; the instance itself seems to be doing fine during periods of heavy ingestion.

At first I thought it was due to my filters or my Logstash settings, but at this point I believe I've adjusted all the settings that have been suggested online: heap size, ES index throttling, and worker/batch size. I've checked for I/O wait as well, and there's none.
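For what it's worth, this is roughly how I've been checking for silent ES backpressure, since bulk thread-pool rejections don't show up in the Logstash or ES logs. A minimal sketch, assuming ES is on localhost:9200 (the Nagios LS default):

```shell
# Probe for silent Elasticsearch backpressure: bulk thread-pool rejections can
# climb without anything being written to the ES or Logstash logs.
es="localhost:9200"
stats=$(curl -s --max-time 5 "http://$es/_nodes/stats/thread_pool?pretty" || echo "unreachable")
# A steadily growing "rejected" count means ES is pushing back on Logstash.
echo "$stats" | grep -E '"(rejected|queue)"' || echo "no thread-pool stats (is ES up at $es?)"
```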

As a last effort, I've tried disabling all filters and running just the raw ingestion to see if the problem persists. My logic is that if raw ingestion fails in the same way as the filtered, it's either a backend, Nagios LS image, or AWS instance problem.

And I just confirmed today that it does: although I do see a general performance increase, the raw input follows the same pattern as the filtered.

The logs from both Logstash and ES show no errors when the ingestion slows down. We also have plenty of headroom before we hit the 90% HDD watermark.
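Here's the quick check I've been using for the watermark headroom, straight from ES rather than the OS. A sketch assuming the default localhost:9200 endpoint (the `_cat/allocation` columns can differ a bit between ES versions):

```shell
# Show per-node disk use as Elasticsearch itself sees it, to compare against
# the 90% disk watermark.
alloc=$(curl -s --max-time 5 'http://localhost:9200/_cat/allocation?v' || echo "unreachable")
echo "$alloc"
```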

At this point I'm pretty stumped as to the possible sources of this, and I'm looking into whether there could be some problem with my AWS instance.

My instance:
- Installed using the Nagios LS AMI from the marketplace
- Instance: m5.2xlarge (8 vCPUs and 32 GB RAM + NVMe SSD)
- Nagios LS version: 2.1.4 (I experienced the problem on previous versions as well)
- Single Instance NagiosLS license

Only input/filter active atm:

Code:

tcp {
    port => "5052"
    # json codec takes a single charset string
    codec => json { charset => "CP1252" }

    ssl_enable => true
    ssl_cert => "crt.path"
    ssl_key => "key.path"
    ssl_verify => false

    type => "nxlog-json"
    tags => [ "tcpjson" ]
}
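When ingestion stalls, a quick way I test whether that input still accepts events is to hand-feed it one JSON line over SSL. A sketch only; the host and the event fields are placeholders:

```shell
# Push a single JSON event at the SSL TCP input (port 5052 from the config
# above) while ingestion is stalled, to see whether Logstash still accepts it.
event='{"message":"ingestion probe","type":"nxlog-json"}'
printf '%s\n' "$event" | openssl s_client -connect localhost:5052 -quiet 2>/dev/null \
  || echo "could not reach port 5052"
```

If the event shows up in the dashboard, the input itself is fine and the stall is further down the pipeline.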
Settings adjusted:

/etc/sysconfig/logstash:

Code:

LS_OPTS=" -b 12000 -w 8"
LS_HEAP_SIZE="5120m"
/etc/sysconfig/elasticsearch:

Code:

ES_HEAP_SIZE=20480m
/usr/local/nagioslogserver/elasticsearch/config/elasticsearch.yml:

Code:

index.number_of_replicas: 0
indices.store.throttle.type: none
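To make sure these settings actually took effect on the running node rather than just sitting in the files, I've been querying the node API. Again a sketch assuming localhost:9200:

```shell
# Read back the live node config from Elasticsearch: the jvm section reports
# the heap actually allocated, and node settings include anything picked up
# from elasticsearch.yml.
node=$(curl -s --max-time 5 'http://localhost:9200/_nodes?pretty' || echo "unreachable")
echo "$node" | grep -E 'heap_init|throttle' || echo "no match (is ES up?)"
```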
I use nxlog to ship my logs.

If you want more info, I'll happily provide it.

Thanks in advance! :)
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: Troubles with unstable ingestion on AWS M5 image

Post by cdienger »

How many clients are using NXLog to send logs? Please gather a profile and PM it to me the next time the system is in a state of low ingestion.