So I've got a strange issue I'm trying to get to the bottom of. Yesterday around 11AM the CPU load on my NagiosLS server seemed to hit max and level out (400%, i.e. 4 cores @ 100%). Initially, resolving this required restarting the logstash process (`service logstash restart`), because logstash appears to stop processing incoming logs from logstash-forwarder (Logjam encrypted via SSL). I've managed to track this down to the logs coming from one particular server. As soon as I turn forwarding from that server to the Nagios Log Server back on, the CPU load maxes out, and on the sending server I can see the connection failing after a few thousand lines.
On the Nagios Log Server side, nothing gets logged to /var/log/logstash/logstash.log until I restart the service. Then I get entries that look like this.
I'm not sure if those messages are even related to the problem, but obviously at this point I can't get the logs forwarded over without causing problems for all my other servers.
Any thoughts?
Problems when forwarding certain logs.
Last edited by weveland on Wed Jan 13, 2016 11:07 am, edited 1 time in total.
Re: Problems when forwarding certain logs.
I'm having a little trouble determining what the issue is here..
Former Nagios Employee.
Re: Problems when forwarding certain logs.
hsmith wrote: I'm having a little trouble determining what the issue is here..
Sorry about that. Hit enter on the wrong window and it submitted.
Re: Problems when forwarding certain logs.
Could you check the time on both of your servers and ensure that they're synchronized? Timeouts like this can be a result of improper times between the two servers.
According to elastic here: https://github.com/elastic/logstash-for ... issues/134
"timeouts waiting for an ack means either the network is broken or the receiving server (logstash) is stuck doing other tasks and was not able to acknowledge receipt of the events in an appropriate time."
Can you check the date on both servers and look for any network issues that could be causing this problem?
Re: Problems when forwarding certain logs.
They are definitely both synchronized, I just verified. They also use the same time source for ntp.
I know the issue is with logstash or elasticsearch on the nagios log server. With the CPU load as high as it is, it's just not sending responses back to the sender. I just can't seem to find out what it's spending cycles on.
For the time being I've rotated the old logs out of the way and it's processing the new ones just fine. So there's got to be something specific in those logs causing the problem.
Re: Problems when forwarding certain logs.
Also, I used hexdump to find the position in the logfiles, then put the line position into less. But the entries at that point look quite normal so I couldn't see anything in the logs themselves that would indicate why they cause so much fail.
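For anyone retracing that step, a byte offset read from hexdump can be converted to a line number directly instead of counting by hand. This is only a minimal sketch, assuming a `head` that supports `-c` (GNU and BSD both do); the function name `offset_to_line` is hypothetical and not something from this thread.

```shell
# Print the 1-based line number that contains a given byte offset.
# $1 = log file, $2 = 0-based byte offset (as reported by hexdump)
offset_to_line() {
    # Count the newlines that occur before the offset ...
    newlines=$(head -c "$2" "$1" | wc -l)
    # ... the offset then sits on the line after the last of them.
    echo $((newlines + 1))
}
```

The result can be handed to less as `less +N file`, which opens the file at line N.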
Re: Problems when forwarding certain logs.
"I just can't seem to find out what it's spending cycles on."
I assume it's Logstash, but there's no way to be certain without further evidence. How many logs does this server start sending when it's turned on? It's possible that Logstash is choking on the sheer volume of logs that the server is sending - it looks like it's reading in a log file and outputting that file to NLS - how large is that log file?
"So there's got to be something specific in those logs causing the problem."
The logs can contain whatever they want - my bet is that there's something wrong with the input or the amount of logs inbound.
I'm interested in the following data from the node that is having issues:
Code: Select all
free -m
top | head -n6
grep HEAP /etc/sysconfig/logstash /etc/sysconfig/elasticsearch
Jesse
Re: Problems when forwarding certain logs.
             total       used       free     shared    buffers     cached
Mem:         32240      27215       5025          0        327       7205
-/+ buffers/cache:      19681      12559
Swap:          255          0        255
top - 14:46:03 up 15:54, 1 user, load average: 0.29, 0.27, 0.20
Tasks: 234 total, 1 running, 232 sleeping, 1 stopped, 0 zombie
Cpu(s): 2.0%us, 0.5%sy, 4.7%ni, 92.0%id, 0.8%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 33014412k total, 27951224k used, 5063188k free, 335544k buffers
Swap: 262136k total, 0k used, 262136k free, 7451896k cached
# grep HEAP /etc/sysconfig/logstash /etc/sysconfig/elasticsearch
/etc/sysconfig/logstash:LS_HEAP_SIZE="1024m"
/etc/sysconfig/elasticsearch:ES_HEAP_SIZE=$(expr $(free -m|awk '/^Mem:/{print $2}') / 2 )m
/etc/sysconfig/elasticsearch:#ES_HEAP_NEWSIZE=
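The ES_HEAP_SIZE line is computed rather than fixed: it takes the total from the `Mem:` row of `free -m` and halves it. As a worked example using the 32240 MB total reported in this post (the helper name `half_of_mem` exists only for illustration):

```shell
# Reproduce the arithmetic of the ES_HEAP_SIZE expression for a given total.
# $1 = total memory in MB, as shown in the Mem: row of `free -m`
half_of_mem() {
    echo "$(( $1 / 2 ))m"
}

half_of_mem 32240   # prints 16120m
```

So on this box Elasticsearch gets roughly half of RAM as heap, while Logstash is capped at 1024m.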
Re: Problems when forwarding certain logs.
So as a test I added more logfiles to monitor in my logstash-forwarder configuration. It chewed through those logs without a single problem. It's got to be some specific log entry/entries from yesterday.
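One way to narrow that down might be to replay the rotated log in fixed-size pieces and watch which piece makes the CPU spike. The following is a rough sketch using standard `split`; the function name, chunk size, and output prefix are assumptions for illustration, and pointing logstash-forwarder at each chunk in turn is left as a manual step.

```shell
# Split a log file into chunks of N lines and list the resulting files.
# $1 = source log, $2 = lines per chunk, $3 = output prefix
split_log() {
    # split appends suffixes aa, ab, ac, ... to the prefix
    split -l "$2" "$1" "$3"
    ls "$3"*
}
```

Once a chunk reproduces the problem, splitting that chunk again with a smaller line count should close in on the offending entries.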
Re: Problems when forwarding certain logs.
The setup looks super healthy. Check out this post that I found: https://github.com/elastic/logstash-for ... issues/293
The theory is that if a single host is connecting with an invalid cert, it could force disconnects on other hosts. If this happened frequently enough, it could overwhelm Logstash. I don't think that's the case here, but it's worth knowing about.
Are you using codec => json in your Logstash input? Could you attempt to remove it and see if that makes a difference?
If the above doesn't help, please send an email to [email protected] and reference this thread - I'll pick the ticket up and we can troubleshoot from there. It's very interesting to me that a particular log could cause logstash to spin, but I suppose I have seen similar problems before. Case in point: https://github.com/logstash-plugins/log ... /issues/15 (they don't seem to be addressing the issue as quickly as I'd hoped).
Jesse