
logstash failed action with response of 400, dropping action

Posted: Mon Mar 06, 2017 5:33 pm
by burkm
Hello,

The logstash.log file is filling rapidly with very long messages beginning with "failed action with response of 400, dropping action". Literally 99.9% of the messages are like this:

Code: Select all

{:timestamp=>"2017-03-06T15:12:23.292000-0700", :message=>"failed action with response of 400, dropping action: [\"index\", {:_id=>nil, :_index=>\"logstash-2017.03.06\", :_type=>\"syslog\", :_routing=>nil}, #<LogStash::Event:0x49bd9288 @metadata_accessors=#<LogStash::Util::Accessors:0x77e9b08c @store={\"retry_count\"=>0}, @lut={}>, @cancelled=false, @data={\"message\"=>\"blk_cloned_rq_check_limits: over max size limit.\\n\", \"@version\"=>\"1\", \"@timestamp\"=>\"2017-03-06T22:12:22.000Z\", \"type\"=>\"syslog\", \"host\"=>\"10.30.216.7\", \"priority\"=>3, \"timestamp\"=>\"Mar  6 15:12:22\", \"logsource\"=>\"nitrogen\", \"program\"=>\"kernel\", \"severity\"=>3, \"facility\"=>0, \"facility_label\"=>\"kernel\", \"severity_label\"=>\"Error\"}, @metadata={\"retry_count\"=>0}, @accessors=#<LogStash::Util::Accessors:0x12f372ac @store={\"message\"=>\"blk_cloned_rq_check_limits: over max size limit.\\n\", \"@version\"=>\"1\", \"@timestamp\"=>\"2017-03-06T22:12:22.000Z\", \"type\"=>\"syslog\", \"host\"=>\"10.30.216.7\", \"priority\"=>3, \"timestamp\"=>\"Mar  6 15:12:22\", \"logsource\"=>\"nitrogen\", \"program\"=>\"kernel\", \"severity\"=>3, \"facility\"=>0, \"facility_label\"=>\"kernel\", \"severity_label\"=>\"Error\"}, @lut={\"type\"=>[{\"message\"=>\"blk_cloned_rq_check_limits: over max size limit.\\n\", \"@version\"=>\"1\", \"@timestamp\"=>\"2017-03-06T22:12:22.000Z\", \"type\"=>\"syslog\", \"host\"=>\"10.30.216.7\", \"priority\"=>3, \"timestamp\"=>\"Mar  6 15:12:22\", \"logsource\"=>\"nitrogen\", \"program\"=>\"kernel\", \"severity\"=>3, \"facility\"=>0, \"facility_label\"=>\"kernel\", \"severity_label\"=>\"Error\"}, \"type\"], \"host\"=>[{\"message\"=>\"blk_cloned_rq_check_limits: over max size limit.\\n\", \"@version\"=>\"1\", \"@timestamp\"=>\"2017-03-06T22:12:22.000Z\", \"type\"=>\"syslog\", \"host\"=>\"10.30.216.7\", \"priority\"=>3, \"timestamp\"=>\"Mar  6 15:12:22\", \"logsource\"=>\"nitrogen\", \"program\"=>\"kernel\", \"severity\"=>3, \"facility\"=>0, \"facility_label\"=>\"kernel\", \"severity_label\"=>\"Error\"}, \"host\"], \"message\"=>[{\"message\"=>\"blk_cloned_rq_check_limits: over max size limit.\\n\", \"@version\"=>\"1\", \"@timestamp\"=>\"2017-03-06T22:12:22.000Z\", \"type\"=>\"syslog\", \"host\"=>\"10.30.216.7\", \"priority\"=>3, \"timestamp\"=>\"Mar  6 15:12:22\", \"logsource\"=>\"nitrogen\", \"program\"=>\"kernel\", \"severity\"=>3, \"facility\"=>0, \"facility_label\"=>\"kernel\", \"severity_label\"=>\"Error\"}, \"message\"], \"priority\"=>[{\"message\"=>\"blk_cloned_rq_check_limits: over max size limit.\\n\", \"@version\"=>\"1\", \"@timestamp\"=>\"2017-03-06T22:12:22.000Z\", \"type\"=>\"syslog\", \"host\"=>\"10.30.216.7\", \"priority\"=>3, \"timestamp\"=>\"Mar  6 15:12:22\", \"logsource\"=>\"nitrogen\", \"program\"=>\"kernel\", \"severity\"=>3, \"facility\"=>0, \"facility_label\"=>\"kernel\", \"severity_label\"=>\"Error\"}, \"priority\"], \"timestamp\"=>[{\"message\"=>\"blk_cloned_rq_check_limits: over max size limit.\\n\", \"@version\"=>\"1\", \"@timestamp\"=>\"2017-03-06T22:12:22.000Z\", \"type\"=>\"syslog\", \"host\"=>\"10.30.216.7\", \"priority\"=>3, \"timestamp\"=>\"Mar  6 15:12:22\", \"logsource\"=>\"nitrogen\", \"program\"=>\"kernel\", \"severity\"=>3, \"facility\"=>0, \"facility_label\"=>\"kernel\", \"severity_label\"=>\"Error\"}, \"timestamp\"], \"logsource\"=>[{\"message\"=>\"blk_cloned_rq_check_limits: over max size limit.\\n\", \"@version\"=>\"1\", \"@timestamp\"=>\"2017-03-06T22:12:22.000Z\", \"type\"=>\"syslog\", \"host\"=>\"10.30.216.7\", \"priority\"=>3, 
\"timestamp\"=>\"Mar  6 15:12:22\", \"logsource\"=>\"nitrogen\", \"program\"=>\"kernel\", \"severity\"=>3, \"facility\"=>0, \"facility_label\"=>\"kernel\", \"severity_label\"=>\"Error\"}, \"logsource\"], \"program\"=>[{\"message\"=>\"blk_cloned_rq_check_limits: over max size limit.\\n\", \"@version\"=>\"1\", \"@timestamp\"=>\"2017-03-06T22:12:22.000Z\", \"type\"=>\"syslog\", \"host\"=>\"10.30.216.7\", \"priority\"=>3, \"timestamp\"=>\"Mar  6 15:12:22\", \"logsource\"=>\"nitrogen\", \"program\"=>\"kernel\", \"severity\"=>3, \"facility\"=>0, \"facility_label\"=>\"kernel\", \"severity_label\"=>\"Error\"}, \"program\"], \"tags\"=>[{\"message\"=>\"blk_cloned_rq_check_limits: over max size limit.\\n\", \"@version\"=>\"1\", \"@timestamp\"=>\"2017-03-06T22:12:22.000Z\", \"type\"=>\"syslog\", \"host\"=>\"10.30.216.7\", \"priority\"=>3, \"timestamp\"=>\"Mar  6 15:12:22\", \"logsource\"=>\"nitrogen\", \"program\"=>\"kernel\", \"severity\"=>3, \"facility\"=>0, \"facility_label\"=>\"kernel\", \"severity_label\"=>\"Error\"}, \"tags\"], \"severity\"=>[{\"message\"=>\"blk_cloned_rq_check_limits: over max size limit.\\n\", \"@version\"=>\"1\", \"@timestamp\"=>\"2017-03-06T22:12:22.000Z\", \"type\"=>\"syslog\", \"host\"=>\"10.30.216.7\", \"priority\"=>3, \"timestamp\"=>\"Mar  6 15:12:22\", \"logsource\"=>\"nitrogen\", \"program\"=>\"kernel\", \"severity\"=>3, \"facility\"=>0, \"facility_label\"=>\"kernel\", \"severity_label\"=>\"Error\"}, \"severity\"], \"facility\"=>[{\"message\"=>\"blk_cloned_rq_check_limits: over max size limit.\\n\", \"@version\"=>\"1\", \"@timestamp\"=>\"2017-03-06T22:12:22.000Z\", \"type\"=>\"syslog\", \"host\"=>\"10.30.216.7\", \"priority\"=>3, \"timestamp\"=>\"Mar  6 15:12:22\", \"logsource\"=>\"nitrogen\", \"program\"=>\"kernel\", \"severity\"=>3, \"facility\"=>0, \"facility_label\"=>\"kernel\", \"severity_label\"=>\"Error\"}, \"facility\"], \"timestamp8601\"=>[{\"message\"=>\"blk_cloned_rq_check_limits: over max size limit.\\n\", \"@version\"=>\"1\", \"@timestamp\"=>\"2017-03-06T22:12:22.000Z\", \"type\"=>\"syslog\", \"host\"=>\"10.30.216.7\", \"priority\"=>3, \"timestamp\"=>\"Mar  6 15:12:22\", \"logsource\"=>\"nitrogen\", \"program\"=>\"kernel\", \"severity\"=>3, \"facility\"=>0, \"facility_label\"=>\"kernel\", \"severity_label\"=>\"Error\"}, \"timestamp8601\"], \"@timestamp\"=>[{\"message\"=>\"blk_cloned_rq_check_limits: over max size limit.\\n\", \"@version\"=>\"1\", \"@timestamp\"=>\"2017-03-06T22:12:22.000Z\", \"type\"=>\"syslog\", \"host\"=>\"10.30.216.7\", \"priority\"=>3, \"timestamp\"=>\"Mar  6 15:12:22\", \"logsource\"=>\"nitrogen\", \"program\"=>\"kernel\", \"severity\"=>3, \"facility\"=>0, \"facility_label\"=>\"kernel\", \"severity_label\"=>\"Error\"}, \"@timestamp\"], \"facility_label\"=>[{\"message\"=>\"blk_cloned_rq_check_limits: over max size limit.\\n\", \"@version\"=>\"1\", \"@timestamp\"=>\"2017-03-06T22:12:22.000Z\", \"type\"=>\"syslog\", \"host\"=>\"10.30.216.7\", \"priority\"=>3, \"timestamp\"=>\"Mar  6 15:12:22\", \"logsource\"=>\"nitrogen\", \"program\"=>\"kernel\", \"severity\"=>3, \"facility\"=>0, \"facility_label\"=>\"kernel\", \"severity_label\"=>\"Error\"}, \"facility_label\"], \"severity_label\"=>[{\"message\"=>\"blk_cloned_rq_check_limits: over max size limit.\\n\", \"@version\"=>\"1\", \"@timestamp\"=>\"2017-03-06T22:12:22.000Z\", \"type\"=>\"syslog\", \"host\"=>\"10.30.216.7\", \"priority\"=>3, \"timestamp\"=>\"Mar  6 15:12:22\", \"logsource\"=>\"nitrogen\", \"program\"=>\"kernel\", \"severity\"=>3, \"facility\"=>0, \"facility_label\"=>\"kernel\", 
\"severity_label\"=>\"Error\"}, \"severity_label\"], \"[program]\"=>[{\"message\"=>\"blk_cloned_rq_check_limits: over max size limit.\\n\", \"@version\"=>\"1\", \"@timestamp\"=>\"2017-03-06T22:12:22.000Z\", \"type\"=>\"syslog\", \"host\"=>\"10.30.216.7\", \"priority\"=>3, \"timestamp\"=>\"Mar  6 15:12:22\", \"logsource\"=>\"nitrogen\", \"program\"=>\"kernel\", \"severity\"=>3, \"facility\"=>0, \"facility_label\"=>\"kernel\", \"severity_label\"=>\"Error\"}, \"program\"]}>>]", :level=>:warn}
I see this question has been asked before, so I tried the suggestion of changing the port in /etc/rsyslog.d/nagioslogserver.conf to this:

Code: Select all

*.* @@localhost:2057
but that didn't seem to help.
This is not just a nuisance message - it's actually filling up logstash.log at a rate of more than 1 GB/hr (sometimes much faster) on the primary node.
The secondary node has no such messages in the log file.

Some things about this system:
2-node cluster, CentOS 7.3, NLS 1.4.4

Load average is light:

Code: Select all

08:50:01 AM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15   blocked
   ...
Average:            8       695      0.36      0.36      0.36         0
$ cat /etc/hosts

Code: Select all

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

$ head -n100 /usr/local/nagioslogserver/logstash/etc/conf.d/*

Code: Select all

==> /usr/local/nagioslogserver/logstash/etc/conf.d/000_inputs.conf <==
#
# Logstash Configuration File
# Dynamically created by Nagios Log Server
#
# DO NOT EDIT THIS FILE. IT WILL BE OVERWRITTEN.
#
# Created Wed, 22 Jun 2016 17:27:47 -0600
#

#
# Global inputs
#

input {
    syslog {
        type => 'syslog'
        port => 5544
    }
    tcp {
        type => 'eventlog'
        port => 3515
        codec => json {
            charset => 'CP1252'
        }
    }
    tcp {
        type => 'import_raw'
        tags => 'import_raw'
        port => 2056
    }
    tcp {
        type => 'import_json'
        tags => 'import_json'
        port => 2057
        codec => json
    }
    syslog {
        type => 'syslog'
        port => 1514
    }
    http {
        port => 8544
        codec => 'json'
    }
}

#
# Local inputs
#



==> /usr/local/nagioslogserver/logstash/etc/conf.d/500_filters.conf <==
#
# Logstash Configuration File
# Dynamically created by Nagios Log Server
#
# DO NOT EDIT THIS FILE. IT WILL BE OVERWRITTEN.
#
# Created Wed, 22 Jun 2016 17:27:47 -0600
#

#
# Global filters
#

filter {
    if [program] == 'apache_access' {
        grok {
            match => [ 'message', '%{COMBINEDAPACHELOG}']
        }
        date {
            match => [ 'timestamp', 'dd/MMM/yyyy:HH:mm:ss Z' ]
        }
        mutate {
            replace => [ 'type', 'apache_access' ]
             convert => [ 'bytes', 'integer' ]
             convert => [ 'response', 'integer' ]
        }
    }

    if [program] == 'apache_error' {
        grok {
            match => [ 'message', '\[(?<timestamp>%{DAY:day} %{MONTH:month} %{MONTHDAY} %{TIME} %{YEAR})\] \[%{WORD:class}\] \[%{WORD:originator} %{IP:clientip}\] %{GREEDYDATA:errmsg}']
        }
        mutate {
            replace => [ 'type', 'apache_error' ]
        }
    }
    if [program] == 'nginx_access' {
        grok {
            match => [ 'message', '%{COMBINEDAPACHELOG}']
        }
        date {
            match => [ 'timestamp', 'dd/MMM/yyyy:HH:mm:ss Z' ]
        }
        mutate {
            replace => [ 'type', 'nginx_access' ]
             convert => [ 'bytes', 'integer' ]
             convert => [ 'response', 'integer' ]
        }
    }

    if [program] == 'nginx_error' {
        grok {
            match => [ 'message', '\[(?<timestamp>%{DAY:day} %{MONTH:month} %{MONTHDAY} %{TIME} %{YEAR})\] \[%{WORD:class}\] \[%{WORD:originator} %{IP:clientip}\] %{GREEDYDATA:errmsg}']
        }
        mutate {
            replace => [ 'type', 'nginx_error' ]
        }
    }
}

#
# Local filters
#



==> /usr/local/nagioslogserver/logstash/etc/conf.d/999_outputs.conf <==
#
# Logstash Configuration File
# Dynamically created by Nagios Log Server
#
# DO NOT EDIT THIS FILE. IT WILL BE OVERWRITTEN.
#
# Created Wed, 22 Jun 2016 17:27:47 -0600
#

#
# Required output for Nagios Log Server
#

output {
    elasticsearch {
        cluster => '76900ee2-f769-413c-9948-850204a96b32'
        host => 'localhost'
        document_type => '%{type}'
        node_name => '707a55e2-e8cb-4073-a800-2fbc5bd4c6df'
        protocol => 'transport'
        workers => 4
    }
}

#
# Global outputs
#



#
# Local outputs
#

Not sure if this is relevant, but in a 15-minute snapshot, 11,841 messages came from 127.0.0.1, with only 379 from the next-highest host.

This problem started just a few days ago, and I am not aware of any recent changes.

Re: logstash failed action with response of 400, dropping action

Posted: Mon Mar 06, 2017 5:42 pm
by mcapra
Can we see the contents of your Elasticsearch logs? They should be located in /var/log/elasticsearch.

Re: logstash failed action with response of 400, dropping action

Posted: Mon Mar 06, 2017 6:14 pm
by burkm
Here's a sample from the main log; the two slowlogs are empty.
Although the first part of each message varies, they all end with the same MapperParsingException:

Code: Select all

[2017-03-06 16:11:07,778][DEBUG][action.bulk              ] [707a55e2-e8cb-4073-a800-2fbc5bd4c6df] [logstash-2017.03.06][2] failed to execute bulk item (index) index {[logstash-2017.03.06][syslog][AVql5AIA8OQvuOv9Cyt_], source[{"message":"Remote 10.30.216.22 accepted a Version 3 Packet\n","@version":"1","@timestamp":"2017-03-06T23:11:07.000Z","type":"syslog","host":"10.30.217.55","priority":13,"timestamp":"Mar  6 16:11:07","logsource":"cesium","program":"check_nrpe","severity":5,"facility":1,"facility_label":"user-level","severity_label":"Notice"}]}
org.elasticsearch.index.mapper.MapperParsingException: failed to parse [timestamp]
        at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:411)
        at org.elasticsearch.index.mapper.object.ObjectMapper.serializeValue(ObjectMapper.java:706)
        at org.elasticsearch.index.mapper.object.ObjectMapper.parse(ObjectMapper.java:497)
        at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:544)
        at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:493)
        at org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:466)
        at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:418)
        at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:148)
        at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$PrimaryPhase.performOnPrimary(TransportShardReplicationOperationAction.java:574)
        at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$PrimaryPhase$1.doRun(TransportShardReplicationOperationAction.java:440)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.index.mapper.MapperParsingException: failed to parse date field [Mar  6 16:11:07], tried both date format [dateOptionalTime], and timestamp number with locale []
        at org.elasticsearch.index.mapper.core.DateFieldMapper.parseStringValue(DateFieldMapper.java:617)
        at org.elasticsearch.index.mapper.core.DateFieldMapper.innerParseCreateField(DateFieldMapper.java:535)
        at org.elasticsearch.index.mapper.core.NumberFieldMapper.parseCreateField(NumberFieldMapper.java:239)
        at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:401)
        ... 13 more
Caused by: java.lang.IllegalArgumentException: Invalid format: "Mar  6 16:11:07"
        at org.elasticsearch.common.joda.time.format.DateTimeParserBucket.doParseMillis(DateTimeParserBucket.java:187)
        at org.elasticsearch.common.joda.time.format.DateTimeFormatter.parseMillis(DateTimeFormatter.java:780)
        at org.elasticsearch.index.mapper.core.DateFieldMapper.parseStringValue(DateFieldMapper.java:612)
        ... 16 more
The messages are accumulating at a rate of about one per second.

Re: logstash failed action with response of 400, dropping action

Posted: Tue Mar 07, 2017 10:45 am
by mcapra
This is a known issue with the logstash-input-syslog plugin. I would suggest pointing that host (10.30.217.55) to a generic tcp/udp input for the time being (NLS uses 2056 by default). If you need help writing a filter to capture the fields, provide some sample messages being collected and I can assist with that.
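
For example, on each sending host the forwarding rule would point at the raw TCP input instead of the syslog input. A minimal sketch, assuming the same /etc/rsyslog.d/nagioslogserver.conf you edited earlier (nls-server is a placeholder for your actual NLS address):

Code: Select all

# Forward everything to the NLS raw TCP input on port 2056 instead of
# the syslog input on 5544 ('@@' means TCP, a single '@' means UDP)
*.* @@nls-server:2056

Port 2056 is the import_raw tcp input in your 000_inputs.conf; it applies no codec, so the whole raw line lands in the message field for a filter to pick apart.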

Re: logstash failed action with response of 400, dropping action

Posted: Tue Mar 07, 2017 12:47 pm
by burkm
That host is just one of many generating these Java errors. Do you suggest that I change them all from port 5544 to port 2056?

Re: logstash failed action with response of 400, dropping action

Posted: Tue Mar 07, 2017 2:33 pm
by mcapra
Yes. You'll probably need to write a custom filter rule to properly handle the date formatting presented by these logs. We can assist with that if needed. I just want to be sure that we can at least get the logs going to the right place first.
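
For reference, the filter usually takes roughly this shape. This is only a sketch and not tested against your data; the tag check and the grok pattern are assumptions until you post some sample messages:

Code: Select all

filter {
    if 'import_raw' in [tags] {
        grok {
            # SYSLOGLINE extracts timestamp, logsource, program, and message
            # from a standard BSD syslog line; adjust once we see real samples
            match => [ 'message', '%{SYSLOGLINE}' ]
            overwrite => [ 'message' ]
        }
        date {
            # Joda patterns covering both "Mar  6 15:12:22" (space-padded day)
            # and "Mar 16 15:12:22"; on success this sets @timestamp
            match => [ 'timestamp', 'MMM  d HH:mm:ss', 'MMM dd HH:mm:ss' ]
        }
        mutate {
            # The raw "Mar  6 ..." string in the timestamp field is what the
            # Elasticsearch date mapping chokes on, so drop it once parsed
            remove_field => [ 'timestamp' ]
        }
    }
}

Once @timestamp is set from the parsed value and the unparseable timestamp field is removed, the bulk index requests should stop tripping the MapperParsingException you posted.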