Logstash errors and crashes

This support forum board is for support questions relating to Nagios Log Server, our solution for managing and monitoring critical log data.
Locked
CFT6Server
Posts: 506
Joined: Wed Apr 15, 2015 4:21 pm

Logstash errors and crashes

Post by CFT6Server »

I have been noticing that logstash on one of our nodes is constantly crashing. I captured the logs from yesterday and found a few issues.

This node handles syslogs and file input of apache log files.

Seeing a lot of these...

Code: Select all

{:timestamp=>"2016-02-08T13:31:47.687000-0800", :message=>"retrying failed action with response code: 503", :level=>:warn}
{:timestamp=>"2016-02-08T13:31:47.687000-0800", :message=>"retrying failed action with response code: 503", :level=>:warn}
{:timestamp=>"2016-02-08T13:31:47.687000-0800", :message=>"retrying failed action with response code: 503", :level=>:warn}
Then I am seeing these warnings...

Code: Select all

{:timestamp=>"2016-02-08T17:21:07.070000-0800", :message=>"retrying failed action with response code: 429", :level=>:warn}
{:timestamp=>"2016-02-08T17:21:07.070000-0800", :message=>"retrying failed action with response code: 429", :level=>:warn}
{:timestamp=>"2016-02-08T17:21:07.071000-0800", :message=>"retrying failed action with response code: 429", :level=>:warn}
{:timestamp=>"2016-02-08T17:21:07.071000-0800", :message=>"retrying failed action with response code: 429", :level=>:warn}
I then see a lot of permission denied errors...

Code: Select all

{:timestamp=>"2016-02-08T23:55:26.035000-0800", :message=>"failed to open /nfs/shared/somelogs: Permission denied - /nfs/shared/somelogs", :level=>:warn}
Then it looks like eventually the listener dies...

Code: Select all

{:timestamp=>"2016-02-08T23:55:26.509000-0800", :message=>"syslog listener died", :protocol=>:tcp, :address=>"0.0.0.0:5544", :exception=>#<SocketError: problem when accepting>, :backtrace=>["org/jruby/ext/socket/RubyTCPServer.java:174:in `accept'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-syslog-0.1.6/lib/logstash/inputs/syslog.rb:155:in `tcp_listener'", "org/jruby/RubyKernel.java:1511:in `loop'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-syslog-0.1.6/lib/logstash/inputs/syslog.rb:154:in `tcp_listener'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-syslog-0.1.6/lib/logstash/inputs/syslog.rb:117:in `server'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-syslog-0.1.6/lib/logstash/inputs/syslog.rb:101:in `run'"], :level=>:warn}
{:timestamp=>"2016-02-08T23:55:26.791000-0800", :message=>"A plugin had an unrecoverable error. Will restart this plugin.\n  Plugin: <LogStash::Inputs::File path=>[\"<PATHS>"], type=>\"iplanet\", tags=>[\"iplanet-DMZ\"], start_position=>\"end\", delimiter=>\"\\n\">\n  Error: Bad file descriptor - Bad file descriptor", :level=>:error}
{:timestamp=>"2016-02-08T23:55:26.824000-0800", :message=>"UDP listener died", :exception=>#<SocketError: recvfrom: name or service not known>, :backtrace=>["/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-udp-0.1.4/lib/logstash/inputs/udp.rb:79:in `udp_listener'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-udp-0.1.4/lib/logstash/inputs/udp.rb:49:in `run'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-1.5.1-java/lib/logstash/pipeline.rb:176:in `inputworker'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-1.5.1-java/lib/logstash/pipeline.rb:170:in `start_input'"], :level=>:warn}
Then the last entry was an error looking up GeoIP data...

Code: Select all

{:timestamp=>"2016-02-08T23:55:26.878000-0800", :message=>"Unknown error while looking up GeoIP data", :exception=>#<Errno::EBADF: Bad file descriptor - Bad file descriptor>, :field=>nil, :event=>#<LogStash::Event:0x22437c03 @metadata={}, @accessors=#<LogStash::Util::Accessors:0x3ebf475c 

Now, some of these access and error files don't have many entries in them and only contain the format header:
format=%Ses->client.ip% - %Req->vars.auth-user% [%SYSDATE%] "%Req->reqpb.clf-request%" %Req->srvhdrs.clf-status% %Req->srvhdrs.content-length%

snapshot of the permissions...

Code: Select all

-rw-r--r-- 1 30002 300       143 Feb  1 23:55 access.201602022355
-rw-r--r-- 1 30002 300       143 Feb  2 23:55 access.201602032355
-rw-r--r-- 1 30002 300       143 Feb  3 23:55 access.201602042355
-rw-r--r-- 1 30002 300       143 Feb  4 23:55 access.201602052355
-rw-r--r-- 1 30002 300       143 Feb  5 23:55 access.201602062355
-rw-r--r-- 1 30002 300       143 Feb  6 23:55 access.201602072355
-rw-r--r-- 1 30002 300       143 Feb  7 23:55 access.201602082355
At this point I suspect the constant read errors are causing logstash to fall over, but I'm just not sure why it would start all of a sudden after running fine for a while.
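For reference, the retry warnings above can be tallied by response code with a quick shell helper (a sketch; `tally_retries` is just an illustrative function that reads logstash log lines on stdin):

```shell
# Tally "retrying failed action" warnings by HTTP response code.
tally_retries() {
  grep -o 'response code: [0-9]*' | sort | uniq -c | sort -rn
}

# Feeding it a few of the captured lines:
tally_retries <<'EOF'
{:timestamp=>"2016-02-08T13:31:47.687000-0800", :message=>"retrying failed action with response code: 503", :level=>:warn}
{:timestamp=>"2016-02-08T17:21:07.070000-0800", :message=>"retrying failed action with response code: 429", :level=>:warn}
{:timestamp=>"2016-02-08T17:21:07.071000-0800", :message=>"retrying failed action with response code: 429", :level=>:warn}
EOF
```

Running the same pipeline against the full day's logstash log shows at a glance whether 503s (service unavailable) or 429s (indexing throttled) dominate.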
rkennedy
Posts: 6579
Joined: Mon Oct 05, 2015 11:45 am

Re: Logstash errors and crashes

Post by rkennedy »

What is the result of these commands?

Code: Select all

top -b -n 1 | head -5
free -m
{:timestamp=>"2016-02-08T23:55:26.035000-0800", :message=>"failed to open /nfs/shared/somelogs: Permission denied - /nfs/shared/somelogs", :level=>:warn}
To add to that, did the nfs mount go down at all? What shows up by running mount|grep nfs?
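It's also worth checking whether the file is readable at all from that node; with NFS, `root_squash` on the export can make files unreadable even to root, so run the check as the user logstash actually runs under. A minimal sketch (`check_readable` is just an illustrative helper; the path is the one from your log):

```shell
# Report whether the current user can read a given path.
check_readable() {
  if [ -r "$1" ]; then
    echo "readable: $1"
  else
    echo "NOT readable: $1"
  fi
}

check_readable /nfs/shared/somelogs
```

If this reports readable as root but logstash still logs Permission denied, re-run it via `sudo -u <logstash-user>` to see what the service itself is allowed to do.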
{:timestamp=>"2016-02-08T17:21:07.071000-0800", :message=>"retrying failed action with response code: 429", :level=>:warn}
This leads me to believe that your IO is very high, as indexing is hitting a throttle. Is your storage writing to a local disk, or over the network?
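For context, Elasticsearch returns a 429 when its indexing queue is full and it starts rejecting bulk requests. You can watch the rejection counters via the cat API (a sketch; localhost:9200 is an assumption based on the NLS defaults, and `check_es` is just an illustrative wrapper):

```shell
# If Elasticsearch answers on the given host:port, dump the thread-pool
# stats -- a climbing bulk "rejected" column confirms indexing back-pressure.
check_es() {
  es="${1:-localhost:9200}"
  if curl -s --max-time 2 "http://$es/" >/dev/null 2>&1; then
    curl -s "http://$es/_cat/thread_pool?v"
  else
    echo "elasticsearch not reachable on $es"
  fi
}

check_es
```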
Former Nagios Employee
CFT6Server
Posts: 506
Joined: Wed Apr 15, 2015 4:21 pm

Re: Logstash errors and crashes

Post by CFT6Server »

As far as I know the NFS mount did not go down, but I think IO is probably high due to the amount of incoming logs. Storage is writing to an NFS mount.

Code: Select all

top - 13:01:51 up 23:33,  2 users,  load average: 5.13, 3.48, 3.00
Tasks: 190 total,   1 running, 189 sleeping,   0 stopped,   0 zombie
Cpu(s):  8.5%us,  2.1%sy, 10.3%ni, 77.4%id,  1.1%wa,  0.1%hi,  0.4%si,  0.0%st
Mem:  20603836k total, 20379084k used,   224752k free,     2384k buffers
Swap:   262136k total,    40736k used,   221400k free,  4319968k cached

Code: Select all

# free -m
             total       used       free     shared    buffers     cached
Mem:         20120      19935        185          0          2       4258
-/+ buffers/cache:      15674       4446
Swap:          255         39        216
rkennedy
Posts: 6579
Joined: Mon Oct 05, 2015 11:45 am

Re: Logstash errors and crashes

Post by rkennedy »

Were you taking in more logs than normal? I believe IO is the underlying issue. It also seems that your memory usage is rather high (though you still have some cache available).

How many GB of logs do you have coming in each day? And when you say 'Storage is writing to NFS mount', are you referring to all of your NLS data?
Former Nagios Employee
CFT6Server
Posts: 506
Joined: Wed Apr 15, 2015 4:21 pm

Re: Logstash errors and crashes

Post by CFT6Server »

All NLS data is written to NFS. We currently have over 150GB of incoming data in total for the cluster, but I don't know how much of that volume is coming to this node specifically. We have a pretty big environment with a lot of firewall events, and I have split the two major sources across two nodes. On each of them I am seeing dropped logs, more so on this one, since logstash crashes.

I have fixed the file inputs and seeing if that will at least keep logstash running.
rkennedy
Posts: 6579
Joined: Mon Oct 05, 2015 11:45 am

Re: Logstash errors and crashes

Post by rkennedy »

I don't think NLS should run over NFS, as it introduces a lot more issues. The errors indicate high IO, which leads me to believe the throughput between the NFS share and NLS is hitting a threshold.

With a small amount of incoming logs this MAY work temporarily, but on a large scale you'll run into issues like this because of NFS.

Are you able to reallocate to local storage?
Former Nagios Employee