Occasional missing replica warnings

This support forum board is for support questions relating to Nagios Log Server, our solution for managing and monitoring critical log data.
WillemDH
Posts: 2320
Joined: Wed Mar 20, 2013 5:49 am
Location: Ghent

Occasional missing replica warnings

Post by WillemDH »

Hello,

It seems we occasionally get alerts from the check_elasticsearch plugin (https://support.nagios.com/forum/viewto ... h&start=10). I added the -vv option so the affected shards are listed.

Code:

One or more indexes are missing replica shards. Use -vv to list them. Index 'logstash-2016.02.15' replica down on shard 2
What could be causing this? The problem started this morning, 15/02/16, around 01:00. Nothing related has changed in the last few weeks.

Grtz

Willem
Nagios XI 5.8.1
https://outsideit.net
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Occasional missing replica warnings

Post by jolson »

Could we get the output of your elasticsearch log around the time of the shard detachment?

Code:

tail -n500 /var/log/elasticsearch/*.log
Twits Blog
Show me a man who lives alone and has a perpetually clean kitchen, and 8 times out of 9 I'll show you a man with detestable spiritual qualities.

Re: Occasional missing replica warnings

Post by WillemDH »

Well, typical. It seems that since today the issue has magically resolved itself. There must have been something wrong with yesterday's index. Since 00:40 this morning we have had no more warnings about missing replicas.
Last edited by WillemDH on Tue Feb 16, 2016 12:07 pm, edited 1 time in total.

Re: Occasional missing replica warnings

Post by jolson »

Naturally. ;) It's probably still worth taking a look at the logs to see what happened that day. I'd be happy to look at it - maybe we'll get lucky?

Re: Occasional missing replica warnings

Post by WillemDH »

Well, I tried looking through those tail logs, but they didn't go back very far. Let's agree not to pursue this for now. If it comes back I'll spend some more time on it. Going through those logs makes my head spin, just thinking about all the things I still have to do.
Unless you can provide me with a command to show only the entries in /var/log/elasticsearch/*.log between specific dates and times, but I'm not really sure if this is possible. Maybe I need another log server to monitor the log server's logs? :)
hsmith
Agent Smith
Posts: 3539
Joined: Thu Jul 30, 2015 11:09 am
Location: 127.0.0.1

Re: Occasional missing replica warnings

Post by hsmith »

You could set up rsyslog on localhost to forward the Elasticsearch log to the log server itself, so you can read through it there. Providing a command to sort through those logs may be difficult, though, especially since Elasticsearch compresses its own logs live. You could probably do a fancy grep with -B and -A to get a time period, but I think the rsyslog idea is the best. You could set up another log server to monitor the logs of your current log server, but what would monitor the logs of the log server monitoring the logs of the log server you're currently running? :P
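
As an alternative to grep with -B and -A, a time window can be cut out of the log with awk. The following is only a minimal sketch: it assumes each Elasticsearch log entry starts with a timestamp of the form "[YYYY-MM-DD HH:MM:SS,mmm]", and the function name log_window is made up for this example. Lines without a leading timestamp (such as stack traces) are treated as belonging to the entry before them.

```shell
#!/usr/bin/env bash
# log_window FROM TO [FILE...]   (hypothetical helper, not part of any Nagios tool)
# Prints entries whose leading "[YYYY-MM-DD HH:MM:SS" timestamp lies within
# FROM..TO (inclusive), plus any continuation lines that follow them.
# Assumption: log lines start with "[2016-02-15 01:00:03,123]" style timestamps.
log_window() {
    local from=$1 to=$2
    shift 2
    awk -v from="$from" -v to="$to" '
        /^\[[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] / {
            ts = substr($0, 2, 19)           # "YYYY-MM-DD HH:MM:SS"
            keep = (ts >= from && ts <= to)  # plain string comparison sorts this format correctly
        }
        keep                                 # print while inside the window
    ' "$@"
}

# Example call (the path is taken from the thread, the times are illustrative):
#   log_window '2016-02-15 00:30:00' '2016-02-15 01:30:00' /var/log/elasticsearch/*.log
```

With no file arguments the function reads standard input, so rotated and compressed logs could be piped through zcat first.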
Former Nagios Employee.
me.

Re: Occasional missing replica warnings

Post by WillemDH »

Well, considering the huge number of things I need to do in the coming weeks, I think I'm going to pass. Thanks for the suggestion though. I fear that sending the Elasticsearch logs to NLS might be a bad idea and could create some kind of very bad loop if things go wrong.
A bit like sending the logs of your F5 load balancer through the F5 load balancer to NLS.

Let's put this thread on hold for a bit. If the issue doesn't return, I'll let you know it can be closed.

Re: Occasional missing replica warnings

Post by hsmith »

That makes sense, I could see where there could be some sort of problem with it. I'll probably do this on mine as it's of little consequence. I'll leave the thread open, let us know if the issue comes back.

Re: Occasional missing replica warnings

Post by WillemDH »

Ok, the issue is back and I was able to get some logs. I'll PM them to Jesse, as they contain some sensitive info.

Re: Occasional missing replica warnings

Post by jolson »

WillemDH wrote: Ok, the issue is back and I was able to get some logs. I'll PM them to Jesse, as they contain some sensitive info.
These look like simple timestamp parse failures. Do these errors directly correspond with the detaching replica problem? If so, it's likely worth fixing the timestamp parsing to see whether that resolves the issue.

This kind of timestamp issue is almost always caused by the 'syslog' input not matching incoming logs properly. The resolution I like to use is replacing the 'syslog' input with two inputs - one bare tcp and one bare udp. After that you can assign the syslog filter yourself, like so:

Code:

"match" => { "message" => "<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}" }
The 'SYSLOGTIMESTAMP' pattern is the one to be concerned about. Feel free to try a different timestamp pattern here (TIMESTAMP_ISO8601, for example) if you still experience parse failures. I'm not convinced that timestamp parsing is causing your replicas to detach, but if there's a correlation it's worth a try.
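
To get a feel for which senders trip the timestamp parser, a sample of raw incoming messages could be screened before touching the filter. The sketch below is only an approximation: the regular expression imitates the "<PRI>Mon dd hh:mm:ss " prefix that the grok expression expects and is not the real grok pattern, and the function name non_syslog_lines is invented for this example.

```shell
#!/usr/bin/env bash
# non_syslog_lines (hypothetical helper): reads raw messages on stdin and
# prints those that do NOT begin with "<PRI>Mon dd hh:mm:ss ".
# Assumption: this regex only approximates grok's SYSLOGTIMESTAMP; for
# instance, it does not accept fully spelled-out month names.
non_syslog_lines() {
    grep -Ev '^<[0-9]+>(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) +[0-9]{1,2} [0-9]{2}:[0-9]{2}:[0-9]{2} '
}

# Example call (sample.log is a placeholder for a capture of raw messages):
#   non_syslog_lines < sample.log
```

Any line it prints is a candidate for a different timestamp pattern or a separate input.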

Otherwise, I'd like the following output:

Code:

free -m
top -b -n 1 | head -n 5
curl 'localhost:9200/_cluster/health?level=indices&pretty'
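
Alongside the cluster health output, the Elasticsearch _cat/shards API can show exactly which replica shards are not started at a given moment. This is a rough sketch rather than an official procedure: the host and port are copied from the health command in this thread, the column list is pinned with the h parameter so the awk field numbers hold, and the function name unhealthy_replicas is made up for this example.

```shell
#!/usr/bin/env bash
# unhealthy_replicas (hypothetical helper): reads "index shard prirep state"
# rows on stdin and prints "index shard state" for every replica shard
# (prirep == "r") whose state is anything other than STARTED.
unhealthy_replicas() {
    awk '$3 == "r" && $4 != "STARTED" { print $1, $2, $4 }'
}

# Example call against a local node (host and port are assumptions):
#   curl -s 'localhost:9200/_cat/shards?h=index,shard,prirep,state' | unhealthy_replicas
```

Running it from cron every few minutes and appending the output to a file would give timestamps to correlate with the Elasticsearch log.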