
Re: Alerting

Posted: Mon Aug 24, 2015 4:55 pm
by CFT6Server
Thanks for the explanation. I have successfully upgraded both nodes in our test instance by clearing the state files it complained about; this was required on both nodes.

For bigger implementations, do you guys recommend temporarily disabling shard allocation? I've been doing that for our production cluster, since that's the elasticsearch-recommended method for rolling restarts.

Re: Alerting

Posted: Mon Aug 24, 2015 5:02 pm
by jolson
do you guys recommend temporarily disabling shard allocation? I've been doing that for our production cluster, just because that's the elasticsearch recommended method for rolling restarts.
We're using a largely unmodified version of elasticsearch under the hood, so most elasticsearch recommendations are also applicable to Nagios Log Server.

You may also be interested in recovery settings: https://www.elastic.co/guide/en/elastic ... y_settings

Specifically, gateway.recover_after_time and gateway.expected_nodes.
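For reference, those gateway settings live in elasticsearch.yml on each node. A minimal sketch for a two-node cluster (the values here are illustrative, not recommendations for any particular deployment):

```yaml
# elasticsearch.yml - illustrative values only.
# Delay recovery until the whole cluster is expected back, instead of
# starting shard recovery as soon as the first node comes up.
gateway.recover_after_nodes: 1
gateway.expected_nodes: 2
gateway.recover_after_time: 5m
```

With these, recovery begins once one node is up and five minutes have passed, or immediately once both expected nodes have rejoined.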

Re: Alerting

Posted: Mon Aug 24, 2015 5:15 pm
by CFT6Server
I would suggest adding that to the update instructions for larger clustered implementations. We have millions of events coming in, and without temporarily disabling shard allocation, the restarts can take a long time because shard reallocation happens during the reboots. These are the steps I perform.

Disabling allocation

Code:

curl -XPUT localhost:9200/_cluster/settings -d '
{
    "transient" : {
        "cluster.routing.allocation.enable" : "none"
    }
}'
Shut down the node and reboot (or perform the maintenance task)

Code:

curl -XPOST 'http://localhost:9200/_cluster/nodes/_local/_shutdown'
Re-enable allocation

Code:

curl -XPUT localhost:9200/_cluster/settings -d '
{
    "transient" : {
        "cluster.routing.allocation.enable" : "all"
    }
}'
Then I wait until the cluster status is back to green before proceeding to the next node. Depending on how much data there is and how long the maintenance takes, this can be anywhere from minutes to hours, but it ensures a healthy cluster and fewer issues across all the restarts.
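The "wait for green" step can be scripted so the next node isn't touched until recovery finishes. A hedged sketch (the helper names are my own, and it assumes elasticsearch is answering on localhost:9200 as in the blocks above):

```shell
# Hypothetical helpers, not from the official docs.
# is_green: check a _cluster/health JSON body for status "green".
is_green() {
    echo "$1" | grep -q '"status" *: *"green"'
}

# wait_for_green: poll cluster health until the cluster reports green.
wait_for_green() {
    until is_green "$(curl -s 'http://localhost:9200/_cluster/health')"; do
        echo "cluster not green yet, sleeping 30s..."
        sleep 30
    done
}
```

Calling wait_for_green after re-enabling allocation blocks until recovery completes, which makes it easy to drop into a per-node maintenance script.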

Re: Alerting

Posted: Tue Aug 25, 2015 9:59 am
by jolson
Great idea. I'll propose that the documentation be changed. I can imagine this being very helpful for our larger clients - thanks for your detailed procedure!

Would it be alright if I went ahead and locked this thread?

Jesse

Re: Alerting

Posted: Tue Aug 25, 2015 10:11 am
by CFT6Server
Sure thing. FYI, I am still working on our upgrades. After a node is restarted following the upgrade, it sees a number of shards as unassigned and works through reassigning them. We have very large daily indexes, so it takes a long time for the cluster to go green. For large implementations, rolling upgrades could take a couple of days.
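Recovery progress can be watched from the shell by tracking the unassigned_shards counter in the _cluster/health output. A small sketch (the helper name is hypothetical; it assumes the same localhost:9200 endpoint as the earlier commands):

```shell
# Hypothetical helper: extract the unassigned_shards counter from a
# _cluster/health JSON body, so the count can be watched falling to zero.
# Usage: unassigned_shards "$(curl -s 'http://localhost:9200/_cluster/health')"
unassigned_shards() {
    echo "$1" | grep -o '"unassigned_shards" *: *[0-9][0-9]*' | grep -o '[0-9][0-9]*$'
}
```

Running it in a loop during the upgrade gives a rough sense of how far along shard reassignment is.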

Re: Alerting

Posted: Tue Aug 25, 2015 11:08 am
by jolson
Understood - of course, be sure to let us know if you run into any trouble.

In the meantime, the document has been updated: https://assets.nagios.com/downloads/nag ... Server.pdf

Let me know if there's anything I might have missed.