Occasional missing replica warnings

Post by **WillemDH** » Mon Feb 15, 2016 5:02 am

Hello,

It seems we occasionally get alerts from the check_elasticsearch plugin. (https://support.nagios.com/forum/viewto ... h&start=10) I added the -vv option so they are listed.

Code: Select all

One or more indexes are missing replica shards. Use -vv to list them. Index 'logstash-2016.02.15' replica down on shard 2

What could be causing this? The problem started this morning, 15/02/16, around 01:00. Nothing related has changed in the last few weeks.

Grtz

Willem

Post by **jolson** » Mon Feb 15, 2016 10:45 am

Could we get the output of your elasticsearch log around the time of the shard detachment?

Code: Select all

tail -n500 /var/log/elasticsearch/*.log

Post by **WillemDH** » Tue Feb 16, 2016 9:06 am

Well, typical. It seems that as of today the issue has magically resolved itself. There must have been something wrong with yesterday's index. Since 00:40 this morning we have had no more warnings about missing replicas.

Post by **jolson** » Tue Feb 16, 2016 11:21 am

Naturally.

It's probably still worth taking a look at the logs to see what happened that day. I'd be happy to look at it - maybe we'll get lucky?

Post by **WillemDH** » Tue Feb 16, 2016 12:21 pm

Well, I tried looking through the output of that tail command, but it didn't go back very far. Let's agree not to pursue this for now. If it comes back I'll spend some more time on it. Going through those logs makes my head spin, just thinking about everything I still have to do.
Unless you can provide me with a command that only shows the logs in /var/log/elasticsearch/*.log between specific dates/times, but I'm not really sure that is possible. Maybe I need another log server to monitor the log server's logs?

Post by **hsmith** » Tue Feb 16, 2016 3:34 pm

You could set up rsyslog on localhost to forward the elasticsearch log to itself, so you can read through it. Providing a command to sort through those logs may be difficult though, especially since elasticsearch compresses its own logs live. You could probably do a fancy grep with -B and -A to get a time period, but I think the rsyslog idea is the best. You could set up another log server to monitor the logs of your current log server, but what would monitor the logs of the log server monitoring the logs of the log server you're currently running?
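As an alternative to grep with -B and -A, the time-window filtering asked about above could be sketched in a few lines of Python. This is only a minimal sketch: it assumes each log entry starts with a timestamp of the form `[2016-02-15 01:00:03,123]`, which should be checked against the actual files, and the function name `lines_between` is made up for illustration.

```python
import re
from datetime import datetime

# Assumed line prefix: [2016-02-15 01:00:03,123]
# Verify this against your own elasticsearch log before relying on it.
TIMESTAMP_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\]")


def lines_between(lines, start, end):
    """Yield lines whose timestamp falls within [start, end].

    Lines without a leading timestamp (for example stack-trace
    continuation lines) follow the decision made for the most
    recent timestamped line.
    """
    keep = False
    for line in lines:
        match = TIMESTAMP_RE.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            keep = start <= stamp <= end
        if keep:
            yield line
```

The function accepts any iterable of lines, so an open file object can be passed in directly. For rotated logs that have been gzip-compressed, opening them with `gzip.open(path, "rt")` from the standard library should work the same way.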

Post by **WillemDH** » Tue Feb 16, 2016 3:41 pm

Well, considering the huge number of things I need to do in the coming weeks, I think I'm going to pass. Thanks for the suggestion though. I fear that sending the elasticsearch logs to NLS might be a bad idea and could create some kind of nasty feedback loop if things go wrong.
A bit like sending logs of your F5 load balancer through the F5 load balancer to NLS.

Let's put this thread on hold for a bit. If the issue doesn't return, I'll let you know it can be closed.

Post by **hsmith** » Tue Feb 16, 2016 3:43 pm

That makes sense, I could see where there could be some sort of problem with it. I'll probably do this on mine as it's of little consequence. I'll leave the thread open, let us know if the issue comes back.

Post by **WillemDH** » Mon Feb 22, 2016 5:07 am

OK, the issue is back and I was able to get some logs. I'll PM them to Jesse, as they contain some sensitive info.

Post by **jolson** » Mon Feb 22, 2016 11:37 am

WillemDH wrote: OK, the issue is back and I was able to get some logs. I'll PM them to Jesse, as they contain some sensitive info.

These look like simple timestamp parse failures - do these errors directly correspond with the detaching replica problem? If so, it's likely worth resolving the timestamp parser so that the issue might be resolved.

This kind of timestamp issue is almost always caused by the 'syslog' input not matching incoming logs properly. The resolution I like to use is replacing the 'syslog' input with two inputs - one bare tcp and one bare udp. After that you can assign the syslog filter yourself, like so:

Code: Select all

"match" => { "message" => "<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}" }

The 'SYSLOGTIMESTAMP' pattern is the one to be concerned about. Feel free to try TIMESTAMP here if you still experience parse failures. I'm not convinced that timestamp parsing is causing your replicas to detach, but if there's a correlation it's worth a try.
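To check offline which incoming lines a pattern like this would accept, a rough Python approximation can help. This is a sketch only: the regular expression below is a simplified stand-in for the grok definitions (for example, it accepts any three-letter capitalised month and any non-space hostname), and the helper name `parse` and the sample lines are made up for illustration.

```python
import re

# Simplified approximation of the grok pattern above; the real grok
# definitions of SYSLOGTIMESTAMP and SYSLOGHOST are stricter.
SYSLOG_RE = re.compile(
    r"<(?P<syslog_pri>[1-9][0-9]*)>"
    r"(?P<syslog_timestamp>[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<syslog_hostname>\S+) "
    r"(?P<syslog_program>.*?)"
    r"(?:\[(?P<syslog_pid>[1-9][0-9]*)\])?: "
    r"(?P<syslog_message>.*)"
)


def parse(line):
    """Return the captured fields as a dict, or None on a parse failure."""
    match = SYSLOG_RE.match(line)
    return match.groupdict() if match else None


# Hypothetical sample lines: a classic syslog timestamp parses,
# an ISO 8601 timestamp does not.
classic = parse("<13>Feb 22 05:07:01 web01 sshd[1234]: Accepted publickey for root")
iso = parse("<13>2016-02-22T05:07:01+01:00 web01 sshd[1234]: Accepted publickey for root")
```

Here `classic` holds the six named fields, while `iso` is `None`, which is the kind of mismatch that would show up as a timestamp parse failure.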

Otherwise, I'd like the following output:

Code: Select all

free -m
top -b -n 1 | head -n5
curl 'localhost:9200/_cluster/health?level=indices&pretty'
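The cluster health output requested above can also be reduced to just the affected indices. The following is a minimal sketch, assuming the index-level health response contains an `indices` object whose entries carry an `unassigned_shards` count; the function names and the default `base_url` (taken from the curl command) are illustrative assumptions, and a replica that is still initializing would not be caught by this check.

```python
import json
from urllib.request import urlopen


def indices_missing_shards(health):
    """Return {index_name: unassigned_shard_count} for affected indices."""
    return {
        name: info["unassigned_shards"]
        for name, info in health.get("indices", {}).items()
        if info.get("unassigned_shards", 0) > 0
    }


def fetch_health(base_url="http://localhost:9200"):
    """Fetch index-level cluster health (base_url is an assumption)."""
    with urlopen(base_url + "/_cluster/health?level=indices") as response:
        return json.load(response)
```

Calling `indices_missing_shards(fetch_health())` on the log server would then list only the indices that currently have unassigned shards.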

Nagios Support Forum
