Logs are not indexed anymore

This support forum board is for support questions relating to Nagios Log Server, our solution for managing and monitoring critical log data.
Johan159
Posts: 43
Joined: Mon Mar 30, 2015 8:05 am

Logs are not indexed anymore

Post by Johan159 »

Hello,

Nagios LS is no longer displaying new logs on the web interface. I know logs are still being sent and received (if I stop logstash, one of my devices starts complaining that the destination is unavailable).

When I reboot one of our log servers (we have a two-node cluster), it sometimes processes a bulk of logs, then does nothing anymore...

I can see the following errors in the elasticsearch logs; I don't know if they might be related. All the errors are from the source "vmwnagioslog1", which is the Nagios LS itself (well, one of them).

Code: Select all

[2015-05-05 13:55:24,383][DEBUG][action.bulk              ] [bed86ca0-2b78-4d69-a1da-0e63846227a8] [logstash-2015.05.05][0] failed to execute bulk item (index) index {[logstash-2015.05.05][syslog][Bt0RskshSkGn-oEoZm0aXQ], source[{"message":"  apache : TTY=unknown ; PWD=/var/www/html/nagioslogserver/www ; USER=root ; COMMAND=/etc/init.d/elasticsearch status","@version":"1","@timestamp":"2015-05-05T11:55:23.000Z","type":"syslog","host":"127.0.0.1","priority":85,"timestamp":"May  5 13:55:23","logsource":"vmwnagioslog1","program":"sudo","severity":5,"facility":10,"facility_label":"security/authorization","severity_label":"Notice"}]}
org.elasticsearch.index.mapper.MapperParsingException: failed to parse [timestamp]
        at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:414)
        at org.elasticsearch.index.mapper.object.ObjectMapper.serializeValue(ObjectMapper.java:648)
        at org.elasticsearch.index.mapper.object.ObjectMapper.parse(ObjectMapper.java:501)
        at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:534)
        at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:483)
        at org.elasticsearch.index.shard.service.InternalIndexShard.prepareCreate(InternalIndexShard.java:376)
        at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:430)
        at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:158)
        at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:522)
        at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:421)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.index.mapper.MapperParsingException: failed to parse date field [May  5 13:55:23], tried both date format [dateOptionalTime], and timestamp number with locale []
        at org.elasticsearch.index.mapper.core.DateFieldMapper.parseStringValue(DateFieldMapper.java:610)
        at org.elasticsearch.index.mapper.core.DateFieldMapper.innerParseCreateField(DateFieldMapper.java:538)
        at org.elasticsearch.index.mapper.core.NumberFieldMapper.parseCreateField(NumberFieldMapper.java:223)
        at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:404)
        ... 12 more
Caused by: java.lang.IllegalArgumentException: Invalid format: "May  5 13:55:23"
        at org.elasticsearch.common.joda.time.format.DateTimeFormatter.parseMillis(DateTimeFormatter.java:754)
        at org.elasticsearch.index.mapper.core.DateFieldMapper.parseStringValue(DateFieldMapper.java:604)
        ... 15 more
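The root cause in the trace above is that the failing value "May  5 13:55:23" is a classic syslog timestamp: no year, and the day of month padded with a space. Elasticsearch's default dateOptionalTime mapping expects ISO 8601 (e.g. 2015-05-05T13:55:23), so the parse fails. A quick sketch of the format mismatch, assuming GNU date is available:

```shell
# The syslog-style string parses fine with GNU date, which is what
# Logstash's date filter normally handles before indexing; Elasticsearch's
# default date mapping alone cannot parse it.
date -d 'May  5 13:55:23' '+%m-%dT%H:%M:%S'
# prints: 05-05T13:55:23
```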
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: Logs are not indexed anymore

Post by ssax »

Can you zip up and PM me your last couple elasticsearch logs so that I can look through them?

Edit: Received the files and placed them in our shared support directory.
Johan159
Posts: 43
Joined: Mon Mar 30, 2015 8:05 am

Re: Logs are not indexed anymore

Post by Johan159 »

Thanks for your support, I've sent the requested logs.

I noticed that the problem started sometime between April 29th and 30th (the number of docs in the index dropped on the 30th).

Our average number of docs per day is about 3M; here is the histogram for the last couple of days:

04/29 : 3M docs
04/30 : 900K docs
05/01 : 15K docs
05/02 : 8K docs
05/03 : 12K docs
05/04 : 1M docs (my first attempts to restart the server)
05/05 : 6M docs (second day of attempts)

I assume all the unprocessed logs are kept in a queue, as we can see that yesterday I managed to receive twice the average amount of logs.


Could it be that there are so many messages in the queue that Elasticsearch crashes after a couple of minutes of processing?

By the way, I boosted the two Nagios LS VMs yesterday (gave them 4 CPUs and 8GB RAM each), but it didn't do the trick...
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Logs are not indexed anymore

Post by jolson »

Johan159,

Please run the following commands on one of your nodes and report the results to us:

Code: Select all

cat /var/log/messages
cat /etc/sysconfig/elasticsearch
tail /var/log/logstash/logstash.log
curl -XGET 'http://localhost:9200/_cluster/health/*?level=shards'
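The cluster health call above returns a JSON document whose "status" field (green/yellow/red) is the first thing to check. A sketch of pulling that field out, run here against an abridged sample response with hypothetical values rather than a live node:

```shell
# Abridged, hypothetical example of what _cluster/health returns
health='{"cluster_name":"nagios_elasticsearch","status":"yellow","number_of_nodes":2}'
# Extract just the status field; on a live system you would pipe the
# curl output above into this grep instead
printf '%s\n' "$health" | grep -o '"status":"[a-z]*"'
# prints: "status":"yellow"
```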
"I assume all the unprocessed logs are kept in a queue, as we can see that yesterday I managed to receive twice the average amount of logs."
That depends on the agent sending the logs, but in general, yes.
"Could it be that there are so many messages in the queue that Elasticsearch crashes after a couple of minutes of processing?"
Definitely: your elasticsearch sysconfig file is what we want to look at to determine whether or not elasticsearch might be overloaded.

At this point, I think that elasticsearch is being reaped by the kernel, or it's starved for resources. Please post the above and we'll come up with a way to troubleshoot this.
Twits Blog
Show me a man who lives alone and has a perpetually clean kitchen, and 8 times out of 9 I'll show you a man with detestable spiritual qualities.
Johan159
Posts: 43
Joined: Mon Mar 30, 2015 8:05 am

Re: Logs are not indexed anymore

Post by Johan159 »

Elasticsearch config :

Code: Select all

# Directory where the Elasticsearch binary distribution resides
APP_DIR="/usr/local/nagioslogserver"
ES_HOME="$APP_DIR/elasticsearch"

# Heap Size (defaults to 256m min, 1g max)
# Nagios Log Server Default to 0.5 physical Memory
ES_HEAP_SIZE=$(expr $(free -m|awk '/^Mem:/{print $2}') / 2 )m

# Heap new generation
#ES_HEAP_NEWSIZE=

# max direct memory
#ES_DIRECT_SIZE=

# Additional Java OPTS
#ES_JAVA_OPTS=

# Maximum number of open files
MAX_OPEN_FILES=65535

# Maximum amount of locked memory
MAX_LOCKED_MEMORY=unlimited

# Maximum number of VMA (Virtual Memory Areas) a process can own
MAX_MAP_COUNT=262144

# Elasticsearch log directory
LOG_DIR=/var/log/elasticsearch

# Elasticsearch data directory
DATA_DIR="$ES_HOME/data"

# Elasticsearch work directory
WORK_DIR="$APP_DIR/tmp/elasticsearch"

# Elasticsearch conf directory
CONF_DIR="$ES_HOME/config"

# Elasticsearch configuration file (elasticsearch.yml)
CONF_FILE="$ES_HOME/config/elasticsearch.yml"

# User to run as, change this to a specific elasticsearch user if possible
# Also make sure, this user can write into the log directories in case you change them
# This setting only works for the init script, but has to be configured separately for systemd startup
ES_USER=nagios
ES_GROUP=nagios

# Configure restart on package upgrade (true, every other setting will lead to not restarting)
#RESTART_ON_UPGRADE=true

if [ "x$1" == "xstart" -o "x$1" == "xrestart" -o "x$1" == "xreload" -o "x$1" == "xforce-reload" ];then
        GET_ES_CONFIG_MESSAGE="$( php $APP_DIR/scripts/get_es_config.php )"
        GET_ES_CONFIG_RETURN=$?

        if [ "$GET_ES_CONFIG_RETURN" != "0" ]; then
                echo $GET_ES_CONFIG_MESSAGE
                exit 1
        else
                ES_JAVA_OPTS="$GET_ES_CONFIG_MESSAGE"
        fi
fi
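The ES_HEAP_SIZE line near the top of that file gives Elasticsearch half of physical memory, in megabytes. Reproducing the arithmetic for an 8GB host (a sketch: the free -m lookup is replaced by a fixed stand-in value):

```shell
# Stand-in for: free -m | awk '/^Mem:/{print $2}' on an 8GB host
total_mb=8192
# Same expression the sysconfig file uses: half of physical RAM, suffixed "m"
heap_size="$(expr $total_mb / 2)m"
echo "$heap_size"
# prints: 4096m
```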
logstash.log

Code: Select all

{:timestamp=>"2015-05-05T15:08:03.124000+0200", :message=>"Using milestone 1 input plugin 'syslog'. This plugin should work, but would benefit from use by folks like you. Please let us know if you find bugs or have suggestions on how to improve this plugin.  For more information on plugin milestones, see http://logstash.net/docs/1.4.2/plugin-milestones", :level=>:warn}
{:timestamp=>"2015-05-05T15:08:03.237000+0200", :message=>"Using milestone 2 input plugin 'tcp'. This plugin should be stable, but if you see strange behavior, please let us know! For more information on plugin milestones, see http://logstash.net/docs/1.4.2/plugin-milestones", :level=>:warn}
I will attach the /var/log/messages file as well as the cluster health status.
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Logs are not indexed anymore

Post by jolson »

Your configuration looks fine - did anything happen between the 29th and 30th that you're aware of - perhaps an NLS upgrade or similar? I look forward to your results.
Johan159
Posts: 43
Joined: Mon Mar 30, 2015 8:05 am

Re: Logs are not indexed anymore

Post by Johan159 »

And here are the two other files...
Johan159
Posts: 43
Joined: Mon Mar 30, 2015 8:05 am

Re: Logs are not indexed anymore

Post by Johan159 »

Nothing special happened, at least that I am aware of. I was out of the office and none of my colleagues touched the server.

I performed the upgrade two days ago, hoping my problem was a known issue fixed by the upgrade.
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Logs are not indexed anymore

Post by jolson »

In your messages file:

Code: Select all

	Line 917: May  4 16:03:04 vmwnagioslog1 kernel: Out of memory: Kill process 2281 (java) score 633 or sacrifice child
	Line 1014: May  4 16:03:56 vmwnagioslog1 kernel: Out of memory: Kill process 2487 (java) score 629 or sacrifice child
	Line 1110: May  4 16:03:56 vmwnagioslog1 kernel: Out of memory: Kill process 1111 (java) score 260 or sacrifice child
	Line 2028: May  4 16:12:06 vmwnagioslog1 kernel: Out of memory: Kill process 1053 (java) score 644 or sacrifice child
	Line 2121: May  4 16:23:10 vmwnagioslog1 kernel: Out of memory: Kill process 2498 (java) score 631 or sacrifice child
	Line 2215: May  4 16:23:10 vmwnagioslog1 kernel: Out of memory: Kill process 2500 (java) score 631 or sacrifice child
	Line 2313: May  4 16:36:22 vmwnagioslog1 kernel: Out of memory: Kill process 3540 (java) score 633 or sacrifice child
	Line 2409: May  4 16:36:22 vmwnagioslog1 kernel: Out of memory: Kill process 3595 (java) score 633 or sacrifice child
	Line 2502: May  4 16:38:48 vmwnagioslog1 kernel: Out of memory: Kill process 3832 (java) score 632 or sacrifice child
You'll definitely need more memory; it looks like elasticsearch is using so much that the kernel has to kill it. The alternative is adding another node to your cluster to help balance out the work the nodes need to do. I understand that you recently increased each node to 8GB - you can check free memory with the command below:

Code: Select all

free -m
It's definitely worth looking at the 'Instance Status' screen from the GUI as well. You can click on a server and check its memory usage and heap size - very important bits to note.

In general, though, I would either bump both servers up to 16GB of RAM or add another 8GB node to your cluster. Elasticsearch is RAM-heavy, and with 3M logs daily it makes sense that you would need more resources.
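Counting the kernel's OOM kills makes the pattern in the messages excerpt obvious at a glance. A sketch, run here against two lines copied from that excerpt rather than the real /var/log/messages:

```shell
# Sample input: two OOM-killer lines from the messages excerpt above
sample='May  4 16:03:04 vmwnagioslog1 kernel: Out of memory: Kill process 2281 (java) score 633 or sacrifice child
May  4 16:03:56 vmwnagioslog1 kernel: Out of memory: Kill process 2487 (java) score 629 or sacrifice child'
# Count OOM kills of java processes (Elasticsearch/Logstash); on a live
# system you would grep /var/log/messages instead of this sample
printf '%s\n' "$sample" | grep -c 'Out of memory: Kill process.*(java)'
# prints: 2
```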

Best,


Jesse
Johan159
Posts: 43
Joined: Mon Mar 30, 2015 8:05 am

Re: Logs are not indexed anymore

Post by Johan159 »

Thanks for your reply. I upgraded both nodes to 16GB and things now seem to be working fine. I did not expect Elasticsearch to need that much memory.

Thanks a lot for your support!
Locked