Occasional missing replica warnings
Posted: Mon Feb 15, 2016 5:02 am
by WillemDH
Hello,
It seems we occasionally get alerts from the check_elasticsearch plugin (https://support.nagios.com/forum/viewto ... h&start=10). I added the -vv option so the affected indexes are listed.
Code: Select all
One or more indexes are missing replica shards. Use -vv to list them. Index 'logstash-2016.02.15' replica down on shard 2
What could be causing this? The problem started this morning, 15/02/16, around 01:00. Nothing related has changed in the last few weeks.
Grtz
Willem
Re: Occasional missing replica warnings
Posted: Mon Feb 15, 2016 10:45 am
by jolson
Could we get the output of your elasticsearch log around the time of the shard detachment?
Code: Select all
tail -n500 /var/log/elasticsearch/*.log
Re: Occasional missing replica warnings
Posted: Tue Feb 16, 2016 9:06 am
by WillemDH
Well, typical. It seems that since today the issue has magically resolved itself. There must have been something wrong with yesterday's index. Since 00:40 this morning we have had no more warnings about missing replicas.
Re: Occasional missing replica warnings
Posted: Tue Feb 16, 2016 11:21 am
by jolson
Naturally.

It's probably still worth taking a look at the logs to see what happened that day. I'd be happy to look at it - maybe we'll get lucky?
Re: Occasional missing replica warnings
Posted: Tue Feb 16, 2016 12:21 pm
by WillemDH
Well, I tried looking through those tail logs, but they didn't go back very far. Let's agree not to pursue this for now. If it comes back I'll spend some more time on it. Going through those logs makes my head spin, just thinking about all the things I still have to do.
Unless you can provide me with a command that only shows the logs in /var/log/elasticsearch/*.log between specific dates and times, but I'm not really sure this is possible. Maybe I need another log server to monitor the log server's logs?

Re: Occasional missing replica warnings
Posted: Tue Feb 16, 2016 3:34 pm
by hsmith
You could set up rsyslog on localhost to forward the elasticsearch log to itself, so you can read through it. Providing a command to sort through those logs may be difficult though, especially since elasticsearch compresses its own logs live. You could probably do a fancy grep with -B and -A to get a time period, but I think the rsyslog idea is the best. You could also set up another log server to monitor the logs of your current log server, but then what would monitor the logs of the log server that monitors the log server you're currently running?
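For the "only show the logs between specific date/times" question, a small script is one alternative to grep. The sketch below is only an illustration: it assumes each log entry starts with a bracketed timestamp such as `[2016-02-15 01:00:03,123]`, and the function name `lines_between` is made up for this example. Check the prefix against your own log files before relying on it.

```python
import re
from datetime import datetime

# Assumed line prefix: "[2016-02-15 01:00:03,123][WARN ]..."
# Adjust the regex if your log layout differs.
STAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\]")


def lines_between(lines, start, end):
    """Yield the lines whose timestamp falls within [start, end].

    Lines without a timestamp (for example stack-trace continuation
    lines) follow the decision made for the last timestamped line.
    """
    keep = False
    for line in lines:
        match = STAMP.match(line)
        if match:
            ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            keep = start <= ts <= end
        if keep:
            yield line
```

To use it, open one of the uncompressed files under /var/log/elasticsearch/ and pass the file object as `lines`, with `start` and `end` as `datetime` values around the time of the alert. It reads the file line by line, so large logs should not be a problem.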

Re: Occasional missing replica warnings
Posted: Tue Feb 16, 2016 3:41 pm
by WillemDH
Well, considering the huge number of things I need to do in the coming weeks, I think I'm going to pass. Thanks for the suggestion though. I fear that sending the elasticsearch logs to NLS might be a bad idea and could create some kind of nasty feedback loop if things go wrong.
A bit like sending the logs of your F5 load balancer through the F5 load balancer to NLS.
Let's put this thread on hold for a bit. If the issue doesn't return, I'll let you know it can be closed.
Re: Occasional missing replica warnings
Posted: Tue Feb 16, 2016 3:43 pm
by hsmith
That makes sense, I could see where there could be some sort of problem with it. I'll probably do this on mine as it's of little consequence. I'll leave the thread open, let us know if the issue comes back.
Re: Occasional missing replica warnings
Posted: Mon Feb 22, 2016 5:07 am
by WillemDH
OK, the issue is back and I was able to get some logs. I'll PM them to jesse, as they contain some sensitive info.
Re: Occasional missing replica warnings
Posted: Mon Feb 22, 2016 11:37 am
by jolson
WillemDH wrote: OK, the issue is back and I was able to get some logs. I'll PM them to jesse, as they contain some sensitive info.
These look like simple timestamp parse failures - do these errors directly correspond with the detaching replica problem? If so, it's likely worth resolving the timestamp parser so that the issue might be resolved.
This kind of timestamp issue is almost always caused by the 'syslog' input not matching incoming logs properly. The resolution I like to use is replacing the 'syslog' input with two inputs - one bare tcp and one bare udp. After that you can assign the syslog filter yourself, like so:
Code: Select all
"match" => { "message" => "<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}" }
The 'SYSLOGTIMESTAMP' pattern is the one to be concerned about. Feel free to try TIMESTAMP here if you still experience parse failures. I'm not convinced that timestamp parsing is causing your replicas to detach, but if there's a correlation it's worth a try.
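To see why the timestamp part is the fragile piece, here is a rough Python equivalent of that match. It is a sketch, not the real grok definitions: each sub-pattern is a simplified stand-in (the actual SYSLOGTIMESTAMP, SYSLOGHOST and POSINT patterns are more precise), and the `parse` helper is made up for the illustration.

```python
import re

# Simplified stand-ins for the grok patterns used in the match above.
SYSLOG_LINE = re.compile(
    r"<(?P<syslog_pri>\d+)>"
    # Classic syslog timestamp, e.g. "Feb 22 05:07:01" (no year, no zone).
    r"(?P<syslog_timestamp>[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<syslog_hostname>\S+) "
    r"(?P<syslog_program>.*?)"
    r"(?:\[(?P<syslog_pid>\d+)\])?: "
    r"(?P<syslog_message>.*)"
)


def parse(line):
    """Return the captured fields as a dict, or None on a parse failure."""
    match = SYSLOG_LINE.match(line)
    return match.groupdict() if match else None
```

A line that starts with a classic "Mon DD HH:MM:SS" timestamp is split into fields, while a sender that emits an ISO 8601 style timestamp makes the whole line fail to match. That is the kind of mismatch that would call for a different timestamp pattern.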
Otherwise, I'd like the following output:
Code: Select all
free -m
top -b -n1 | head -n5
curl 'localhost:9200/_cluster/health?level=indices&pretty'
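If the warnings keep coming and going, the same cluster health call can be polled from a script and reduced to just the indices that are not green. This is a minimal sketch: the helper names are invented for the example, `localhost:9200` is taken from the curl command above, and it assumes the response has an `indices` object whose entries carry `status` and `unassigned_shards` fields.

```python
import json
from urllib.request import urlopen


def degraded_indices(health):
    """Map each non-green index name to its count of unassigned shards."""
    return {
        name: info.get("unassigned_shards", 0)
        for name, info in health.get("indices", {}).items()
        if info.get("status") != "green"
    }


def fetch_health(base_url="http://localhost:9200"):
    """Fetch per-index cluster health; base_url is an assumption."""
    with urlopen(base_url + "/_cluster/health?level=indices") as resp:
        return json.load(resp)
```

Calling `degraded_indices(fetch_health())` from cron every minute and appending the result to a file with a timestamp would show exactly when a replica detaches, which makes it easier to line up with the elasticsearch log.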