
Error: All shards failed for phase

Posted: Tue Mar 19, 2019 3:01 am
by li_alm
Hello,

Our Nagios Log Server has stopped working (the elasticsearch service seemed to have stopped).
We restarted the elasticsearch service and a lot of messages of the following type appeared in the log:
MSG1:
All shards failed for phase: [query] org.elasticsearch.action.NoShardAvailableActionException: [nagioslogserver][4] null

Then, at some point, the following message appeared:
MSG2:
[2019-03-19 07:46:56,742][DEBUG][action.search.type ] [04c4efb4-9365-45d3-9c7b-162e3cbcc051] All shards failed for phase: [query]
org.elasticsearch.index.shard.IllegalIndexShardStateException: [nagioslogserver][0] CurrentState[RECOVERING] operations only allowed when started/relocated

After this message, we were able to use Nagios Log Server.

Q1: What is the meaning of MSG1 and MSG2?
Q2: How can we find out what happened, so that we can avoid this kind of issue in the future?

Important note: very often, we receive a lot of messages of the following type:
org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution (queue capacity 1000)
Nagios Log Server Forum discussion on the 'rejected execution' issue:
https://support.nagios.com/forum/viewto ... 37&t=49189
Q3: Is the current issue ('All shards failed for phase') related to the 'rejected execution (queue capacity 1000)' issue?

I am using Nagios Log Server on one node only: Nagios Log Server 1.4.4, Elasticsearch 1.6.0

Thank you!
Regards,
Liviu

Re: Error: All shards failed for phase

Posted: Tue Mar 19, 2019 4:06 pm
by npolovenko
Hello, @li_alm. The first two messages you showed can be normal while Elasticsearch is starting up. The third error message, however, could indicate a lack of system resources, such as CPU or RAM. How much RAM and how many CPU cores does this server have? Can you generate a system profile by running the attached script from the /tmp/ folder on the log server? It should generate a system profile archive that you can share with us in this thread.
profile.sh
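
For reference, one way to see whether Elasticsearch is rejecting work is to look at its thread pool counters. The following is only a minimal sketch: it assumes Elasticsearch answers on localhost:9200 and that the _cat/thread_pool output has a column named search.rejected, which is looked up by header name rather than by position. The function names are made up for illustration.

```shell
#!/bin/sh
# Assumption: Elasticsearch listens on localhost:9200; override ES_URL if not.
ES_URL="${ES_URL:-http://localhost:9200}"

# column_value NAME: read a whitespace-separated table with a header row on
# stdin and print the value of column NAME for every data row.
column_value() {
    awk -v name="$1" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == name) col = i; next }
        col     { print $col }
    '
}

# show_search_rejections: print the search.rejected counter of each node.
# Prints nothing if the column is not present in the output.
show_search_rejections() {
    curl -s "$ES_URL/_cat/thread_pool?v" | column_value search.rejected
}
```

Running show_search_rejections a few minutes apart shows whether the counter is still growing; a value that keeps increasing would point to searches being rejected on an ongoing basis rather than only at startup.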

Re: Error: All shards failed for phase

Posted: Wed Mar 20, 2019 3:32 am
by li_alm
Hello, @npolovenko,

Thank you for your reply.

I have 2 Nagios deployments (completely independent of each other); both behave the same way (a lot of "rejected" messages in the logs).
Deployment1:
1 CPU core (Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz)
2 GB RAM
Deployment2:
1 CPU core (Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz)
4 GB RAM

I ran the script you gave me.

I attached the result.

Regards,
Liviu

Re: Error: All shards failed for phase

Posted: Thu Mar 21, 2019 11:05 am
by cdienger
The profile provided doesn't contain any rejected messages, but it does show that the Java heap usage is at 65%, which is pretty high for an idle machine. A large query could cause a spike and trigger the reject message. Can you increase the memory on this machine to 4GB to match the other? By default, Elasticsearch will only use half of the total system memory, so with only 2GB on the system, Elasticsearch is limited to just 1GB.
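
To make the arithmetic concrete, here is a small sketch of the "half of system memory" rule applied to the two deployments in this thread. It is only an illustration of the calculation, not how Nagios Log Server itself sets the heap; the function names are made up, and total_ram_mb assumes a Linux /proc/meminfo.

```shell
#!/bin/sh
# half_of_ram_mb TOTAL_MB: print half of TOTAL_MB (integer division),
# i.e. the heap size implied by the "half of system memory" rule.
half_of_ram_mb() {
    echo $(( $1 / 2 ))
}

# total_ram_mb: print the machine's total memory in MB (Linux only,
# reads MemTotal from /proc/meminfo, which is reported in kB).
total_ram_mb() {
    awk '/^MemTotal:/ { print int($2 / 1024) }' /proc/meminfo
}

half_of_ram_mb 2048   # Deployment1, 2 GB RAM: prints 1024
half_of_ram_mb 4096   # Deployment2, 4 GB RAM: prints 2048
```

On the server itself, half_of_ram_mb "$(total_ram_mb)" gives the figure for the actual installed memory, which is usually slightly below the nominal size.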

Re: Error: All shards failed for phase

Posted: Fri Mar 22, 2019 4:03 am
by li_alm
OK, @cdienger, thanks, we will try to increase the RAM for the machine using only 2GB.

My main concern was about MSG1 and MSG2 (see my initial post), because I had the impression that Nagios Log Server would not start.

Regards,
Liviu

Re: Error: All shards failed for phase

Posted: Fri Mar 22, 2019 9:10 am
by cdienger
Those messages are typical of a service restarting. You can verify the services are up from the command line:

service elasticsearch status

or in the web UI under Admin > System > System Status.
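
If you want to check from a script whether Elasticsearch has finished recovering after a restart, something along these lines may help. This is a sketch under assumptions: Elasticsearch is taken to listen on localhost:9200, and a cluster status of yellow or green is treated as "primary shards are available", which is when the 'All shards failed' messages would be expected to stop. The function names are made up for illustration.

```shell
#!/bin/sh
# Assumption: Elasticsearch listens on localhost:9200; override ES_URL if not.
ES_URL="${ES_URL:-http://localhost:9200}"

# extract_status: read cluster health JSON on stdin, print the "status" value.
extract_status() {
    sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p' | head -n 1
}

# is_usable STATUS: succeed for green or yellow, fail for anything else
# (red, or empty when Elasticsearch is not answering yet).
is_usable() {
    case "$1" in
        green|yellow) return 0 ;;
        *)            return 1 ;;
    esac
}

# wait_for_cluster [TRIES]: poll the health endpoint once per second,
# up to TRIES times (default 60), until the status is usable.
wait_for_cluster() {
    tries="${1:-60}"
    status=""
    while [ "$tries" -gt 0 ]; do
        status=$(curl -s "$ES_URL/_cluster/health" | extract_status)
        if is_usable "$status"; then
            echo "cluster status: $status"
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    echo "cluster not ready (last status: ${status:-none})" >&2
    return 1
}
```

For example, running wait_for_cluster 120 right after service elasticsearch restart waits up to about two minutes and reports the status once searches should work again.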