Problems when forwarding certain logs.

weveland · Post by **weveland** » Wed Jan 13, 2016 10:51 am

So I've got a strange issue I'm trying to get to the bottom of. Yesterday around 11 AM the CPU load on my NagiosLS server seemed to hit max and level out (400%, i.e. 4 cores @ 100%). Initially, resolving this required restarting the logstash process (`service logstash restart`), because logstash appears to stop processing incoming logs from logstash-forwarder (Lumberjack, encrypted via SSL). I've managed to track this down to the logs coming from one particular server. As soon as I turn forwarding from that server back on, the CPU load on the Nagios Log Server maxes out, and on the sending server I can see the connection failing after a few thousand lines.

logstash-forwarder.log

On the Nagios Log Server side, nothing gets logged to /var/log/logstash/logstash.log until I restart the service. Then I get entries that look like this:

logstash.log

I'm not sure if those messages are even related to the problem. But obviously at this point I can't get the logs forwarded over without causing problems for all my other servers.

Any thoughts?

hsmith · Post by **hsmith** » Wed Jan 13, 2016 11:02 am

I'm having a little trouble determining what the issue is here.

weveland · Post by **weveland** » Wed Jan 13, 2016 11:09 am

hsmith wrote: I'm having a little trouble determining what the issue is here.

Sorry about that. Hit enter on the wrong window and it submitted.

jolson · Post by **jolson** » Wed Jan 13, 2016 11:41 am

Could you check the time on both of your servers and ensure that they're synchronized? Timeouts like this can be a result of mismatched clocks between the two servers.

According to elastic here: https://github.com/elastic/logstash-for ... issues/134
"timeouts waiting for an ack means either the network is broken or the receiving server (logstash) is stuck doing other tasks and was not able to acknowledge receipt of the events in an appropriate time."

Can you check the date on both servers and look for any network issues that could be the culprit?

weveland · Post by **weveland** » Wed Jan 13, 2016 11:49 am

They are definitely both synchronized, I just verified. They also use the same time source for ntp.

I know the issue is with logstash or elasticsearch on the nagios log server. With the CPU load as high as it is, it's just not sending responses back to the sender. I just can't seem to find out what it's spending cycles on.

For the time being I've rotated the old logs out of the way and it's processing the new ones just fine. So there's got to be something specific in those logs causing the problem.

weveland · Post by **weveland** » Wed Jan 13, 2016 11:52 am

Also, I used hexdump to find the position in the logfiles, then put the line position into less. But the entries at that point look quite normal so I couldn't see anything in the logs themselves that would indicate why they cause so much fail.
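For anyone repeating that lookup, here is a minimal sketch of going from a byte offset to the surrounding lines without leaving the shell. The function names `line_at_offset` and `show_context` are made up for illustration, the offset is assumed to be 0-based, and `cat -v` is used so that control characters, which are easy to miss in less, show up as visible `^X` sequences.

```shell
#!/bin/sh
# Hypothetical helper: print the 1-based number of the line that
# contains the byte at 0-based OFFSET in FILE.
line_at_offset() {
    file=$1
    offset=$2
    # Every newline before the offset ends one earlier line, so the
    # byte at the offset sits on line (newlines + 1).
    newlines=$(head -c "$offset" "$file" | wc -l)
    echo $(( $newlines + 1 ))
}

# Hypothetical helper: print CONTEXT lines (default 5) on either side
# of the line at OFFSET, with non-printing characters made visible.
show_context() {
    file=$1
    offset=$2
    context=${3:-5}
    line=$(line_at_offset "$file" "$offset")
    start=$(( $line - $context ))
    if [ "$start" -lt 1 ]; then
        start=1
    fi
    end=$(( $line + $context ))
    sed -n "${start},${end}p" "$file" | cat -v
}
```

Usage would look like `show_context /path/to/rotated.log 1234567 10`, where both the path and the offset are placeholders.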

jolson · Post by **jolson** » Wed Jan 13, 2016 2:24 pm

weveland wrote: I just can't seem to find out what it's spending cycles on.

I assume it's Logstash, but there's no way to be certain without further evidence. How many logs does this server start sending when it's turned on? It's possible that Logstash is choking on the sheer volume of logs that the server is sending. It looks like it's reading in a log file and outputting that file to NLS. How large is that log file?

weveland wrote: So there's got to be something specific in those logs causing the problem.

The logs can contain whatever they want - my bet is that there's something wrong with the input/amount of logs inbound.
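A quick way to put numbers on "how large" is sketched below. The helper name `log_stats` is hypothetical; it reports the byte size, the line count, and the length of the longest line, since a single oversized entry is one way a file can be unusual without looking odd when paged through.

```shell
#!/bin/sh
# Hypothetical helper: summarize a log file as
#   bytes=<size> lines=<count> longest=<max line length>
log_stats() {
    file=$1
    # Arithmetic expansion strips the padding some wc builds emit.
    bytes=$(( $(wc -c < "$file") ))
    lines=$(( $(wc -l < "$file") ))
    # Track the longest line seen; "+ 0" prints 0 for an empty file.
    longest=$(awk 'length($0) > max { max = length($0) } END { print max + 0 }' "$file")
    echo "bytes=$bytes lines=$lines longest=$longest"
}
```

Running `log_stats` against both the rotated log and a log that forwards cleanly would give a like-for-like comparison.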

I'm interested in the following data from the node that is having issues:

free -m
top | head -n6
grep HEAP /etc/sysconfig/logstash /etc/sysconfig/elasticsearch

The logfiles provided are unfortunately inconclusive. I think we'll have to dig deep to resolve this one; if we can't get to the root of the problem in a few forum posts, this will likely need to be turned into a ticket/remote session. Thanks!

Jesse

weveland · Post by **weveland** » Wed Jan 13, 2016 2:48 pm

             total       used       free     shared    buffers     cached
Mem:         32240      27215       5025          0        327       7205
-/+ buffers/cache:      19681      12559
Swap:          255          0        255

top - 14:46:03 up 15:54, 1 user, load average: 0.29, 0.27, 0.20
Tasks: 234 total, 1 running, 232 sleeping, 1 stopped, 0 zombie
Cpu(s): 2.0%us, 0.5%sy, 4.7%ni, 92.0%id, 0.8%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 33014412k total, 27951224k used, 5063188k free, 335544k buffers
Swap: 262136k total, 0k used, 262136k free, 7451896k cached

# grep HEAP /etc/sysconfig/logstash /etc/sysconfig/elasticsearch
/etc/sysconfig/logstash:LS_HEAP_SIZE="1024m"
/etc/sysconfig/elasticsearch:ES_HEAP_SIZE=$(expr $(free -m|awk '/^Mem:/{print $2}') / 2 )m
/etc/sysconfig/elasticsearch:#ES_HEAP_NEWSIZE=
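The `ES_HEAP_SIZE` line is computed rather than fixed, so it may help to work out what it resolves to on this box. The sketch below repeats the same `expr` arithmetic with the total passed in as an argument (the function name `half_ram_heap` is hypothetical), using the 32240 MB total from the `free -m` output above.

```shell
#!/bin/sh
# Hypothetical helper: half of TOTAL_MB, formatted as a JVM heap size.
# The sysconfig line reads the total from `free -m`; it is passed in
# here so the arithmetic can be checked by hand.
half_ram_heap() {
    total_mb=$1
    echo "$(expr "$total_mb" / 2)m"
}

half_ram_heap 32240   # prints 16120m
```

So Elasticsearch is running with roughly a 16 GB heap next to the 1024m configured for Logstash.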

weveland · Post by **weveland** » Wed Jan 13, 2016 3:47 pm

So as a test I added more logfiles to monitor in my logstash-forwarder configuration. It chewed through those logs without a single problem. It's got to be some specific log entry/entries from yesterday.

jolson · Post by **jolson** » Wed Jan 13, 2016 3:59 pm

The setup looks super healthy. Check out this post that I found: https://github.com/elastic/logstash-for ... issues/293

The theory is that if a single host is connecting with an invalid cert, it could force disconnects on other hosts. If this happened frequently enough, it could overwhelm Logstash. I don't think that's the case here, but it's worth knowing about.

Are you using `codec => json` in your Logstash input? Could you attempt to remove it and see if that makes a difference?

If the above doesn't help, please send an email to [email protected] and reference this thread - I'll pick the ticket up and we can troubleshoot from there. It's very interesting to me that a particular log could cause logstash to spin, but I suppose I have seen similar problems before. Case in point: https://github.com/logstash-plugins/log ... /issues/15 (they don't seem to be addressing the issue as quickly as I'd hoped).

Jesse

Nagios Support Forum
