Error: All shards failed for phase

This support forum board is for support questions relating to Nagios Log Server, our solution for managing and monitoring critical log data.
li_alm
Posts: 19
Joined: Thu Oct 13, 2016 4:44 am

Post by li_alm »

Hello,

Our Nagios Log Server has stopped working (the elasticsearch service seemed to have stopped).
We restarted the elasticsearch service and a lot of messages of the following type appeared in the log:
MSG1:
All shards failed for phase: [query] org.elasticsearch.action.NoShardAvailableActionException: [nagioslogserver][4] null

Then, at some point, the following message appeared:
MSG2:
[2019-03-19 07:46:56,742][DEBUG][action.search.type ] [04c4efb4-9365-45d3-9c7b-162e3cbcc051] All shards failed for phase: [query]
org.elasticsearch.index.shard.IllegalIndexShardStateException: [nagioslogserver][0] CurrentState[RECOVERING] operations only allowed when started/relocated

After this message, we were able to use Nagios Log Server.

Q1: What is the meaning of MSG1 and MSG2?
Q2: How can we find out what happened, so that we can avoid this kind of issue in the future?

Important note: very often, we receive a lot of messages of the following type:
org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution (queue capacity 1000)
Nagios Log Server Forum discussion on the 'rejected execution' issue:
https://support.nagios.com/forum/viewto ... 37&t=49189
Q3: Is the current issue ('All shards failed for phase') related to the 'rejected execution (queue capacity 1000)' issue?

I am running Nagios Log Server on a single node: Nagios Log Server 1.4.4, Elasticsearch 1.6.0

Thank you!
Regards,
Liviu
npolovenko
Support Tech
Posts: 3457
Joined: Mon May 15, 2017 5:00 pm

Re: Error: All shards failed for phase

Post by npolovenko »

Hello, @li_alm. The first two messages you showed can be normal at Elasticsearch startup. The third error message, however, could indicate a lack of system resources, such as CPU or RAM. How much RAM and how many CPU cores does this server have? Can you generate a system profile by running the script I attached from the /tmp/ folder on the Log Server? That should generate a system profile archive that you can share with us in this thread.
profile.sh
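
For anyone following along, one way to see whether search requests are actually being rejected is the thread pool cat API. This is only a minimal sketch: it assumes Elasticsearch is listening on localhost:9200 and that the `_cat/thread_pool` endpoint with these column names is available in your version. The helper names `sum_rejected` and `show_search_rejections` are made up for illustration.

```shell
#!/bin/sh
# Sum the last column of _cat/thread_pool-style output read from stdin.
# Assumes the first line is a header (as produced by the ?v flag) and that
# the last whitespace-separated column is the rejected count.
sum_rejected() {
    awk 'NR > 1 { total += $NF } END { print total + 0 }'
}

# Query the search thread pool on a local node.
# Host and port are assumptions; adjust them for your setup.
show_search_rejections() {
    curl -s 'http://localhost:9200/_cat/thread_pool?v&h=host,search.active,search.queue,search.rejected'
}

# Usage (nothing runs automatically):
#   show_search_rejections
#   show_search_rejections | sum_rejected
```

A rejected count that keeps growing would suggest the search queue is filling up faster than the node can drain it.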
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
li_alm
Posts: 19
Joined: Thu Oct 13, 2016 4:44 am

Re: Error: All shards failed for phase

Post by li_alm »

Hello, @npolovenko,

Thank you for your reply.

I have two Nagios deployments (completely independent of each other), and both behave the same way (a lot of "rejected" messages in the logs).
Deployment1:
1 CPU core (Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz)
2 GB RAM
Deployment2:
1 CPU core (Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz)
4 GB RAM

I ran the script you gave me.

I attached the result.

Regards,
Liviu
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: Error: All shards failed for phase

Post by cdienger »

The profile provided doesn't contain any rejected messages, but it does show that the Java heap usage is at 65%, which is fairly high for an idle machine. A large query could cause a spike and trigger the rejected message. Can you increase the memory on this machine to 4GB to match the other? By default Elasticsearch will only use half of the total system memory, so with only 2GB on the system, Elasticsearch is limited to just 1GB.
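
To make the arithmetic concrete, here is a small sketch of the half-of-RAM rule described above. The function names `es_heap_mb` and `total_mem_mb` are hypothetical, and the rule is taken from this post rather than read from the Log Server configuration.

```shell
#!/bin/sh
# Heap that Elasticsearch would get under the "half of system memory" rule
# described in this thread. Argument: total system memory in MB.
es_heap_mb() {
    echo $(( $1 / 2 ))
}

# Total system memory in MB, read from /proc/meminfo (Linux only).
total_mem_mb() {
    awk '/^MemTotal:/ { print int($2 / 1024) }' /proc/meminfo
}

# Usage:
#   es_heap_mb 2048                # the 2GB deployment, prints 1024
#   es_heap_mb 4096                # the 4GB deployment, prints 2048
#   es_heap_mb "$(total_mem_mb)"   # the machine you are on
```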
li_alm
Posts: 19
Joined: Thu Oct 13, 2016 4:44 am

Re: Error: All shards failed for phase

Post by li_alm »

OK, @cdienger, thanks, we will try to increase the RAM for the machine using only 2GB.

My main concern was about MSG1 and MSG2 (see my initial post), because I had the impression that Nagios Log Server would not start.

Regards,
Liviu
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: Error: All shards failed for phase

Post by cdienger »

Those messages are typical of a service restarting. You can verify the services are up from the command line:

service elasticsearch status

or in the web UI under Admin > System > System Status.
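
If you also want to see whether the shards have finished recovering, the cluster health API reports an overall status. The following is a rough sketch that assumes Elasticsearch answers on localhost:9200. The helper names `health_status` and `es_health` are made up for illustration, and the sed match is a shortcut rather than a real JSON parser.

```shell
#!/bin/sh
# Extract the "status" field (green, yellow or red) from cluster health
# JSON read on stdin. A plain sed match keeps this dependency-free.
health_status() {
    sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p'
}

# Ask a local node for cluster health.
# Host and port are assumptions; adjust them for your setup.
es_health() {
    curl -s 'http://localhost:9200/_cluster/health'
}

# Usage (nothing runs automatically):
#   es_health | health_status
```

A red status shortly after a restart would be consistent with shards that are still recovering. On a single-node setup the status may stay yellow if replica shards are configured, since there is no second node to hold them.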