Alerting

This support forum board is for support questions relating to Nagios Log Server, our solution for managing and monitoring critical log data.
CFT6Server
Posts: 506
Joined: Wed Apr 15, 2015 4:21 pm

Re: Alerting

Post by CFT6Server »

Thanks for the explanation. I successfully upgraded both nodes in our test instance by clearing the state files it complained about; that step was required on both nodes.

For bigger implementations, do you guys recommend temporarily disabling shard allocation? I've been doing that for our production cluster, since that's the Elasticsearch-recommended method for rolling restarts.
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Alerting

Post by jolson »

do you guys recommend temporarily disabling shard allocation? I've been doing that for our production cluster, just because that's the elasticsearch recommended method for rolling restarts.
We're using a largely unmodified version of Elasticsearch under the hood - that is to say, most Elasticsearch recommendations are also applicable to Nagios Log Server.

You may also be interested in recovery settings: https://www.elastic.co/guide/en/elastic ... y_settings

Specifically, gateway.recover_after_time and gateway.expected_nodes.
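For reference, those settings go in elasticsearch.yml on each node. The values below are hypothetical and should be tuned to your cluster; the idea is to delay shard recovery after a full-cluster restart until the expected nodes have rejoined or the timeout expires:

```
# Hypothetical values for a two-node cluster -- tune to your environment.
gateway.recover_after_time: 5m
gateway.expected_nodes: 2
```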
Twits Blog
Show me a man who lives alone and has a perpetually clean kitchen, and 8 times out of 9 I'll show you a man with detestable spiritual qualities.
CFT6Server
Posts: 506
Joined: Wed Apr 15, 2015 4:21 pm

Re: Alerting

Post by CFT6Server »

I would suggest adding that to the update instructions for larger clustered implementations. We have millions of events coming in, and without temporarily disabling shard allocation, the restarts can take a long time because shard reallocation happens during the reboots. These are the steps I perform.

Disabling allocation

Code:

curl -XPUT localhost:9200/_cluster/settings -d '
{
    "transient" : {
        "cluster.routing.allocation.enable" : "none"
    }
}'
Shut down the node and reboot (or perform the maintenance task)

Code:

curl -XPOST 'http://localhost:9200/_cluster/nodes/_local/_shutdown'
Re-enable allocation

Code:

curl -XPUT localhost:9200/_cluster/settings -d '
{
    "transient" : {
        "cluster.routing.allocation.enable" : "all"
    }
}'
Then I wait until the cluster status is back to green before proceeding to the next node. Depending on how much data there is and how long the maintenance window is, this can take anywhere from minutes to hours, but it ensures a healthy cluster and fewer issues across all the restarts.
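A minimal sketch of that wait step, assuming the standard _cluster/health endpoint on localhost:9200. The parsing helper is demonstrated on a canned response; the commented loop is what you would run on a live node:

```shell
#!/bin/sh
# Sketch of the wait-for-green step. health_status extracts the "status"
# field from a cluster-health JSON document read on stdin.
health_status() {
    sed -n 's/.*"status" *: *"\([a-z]*\)".*/\1/p'
}

# Canned response (what a healthy cluster returns), to show the parsing:
sample='{"cluster_name":"nagioslogserver","status":"green","number_of_nodes":2}'
echo "$sample" | health_status

# On a live node, poll until green before moving to the next node:
# until curl -s localhost:9200/_cluster/health | health_status | grep -qx green; do
#     sleep 30
# done
```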
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Alerting

Post by jolson »

Great idea. I'll propose that the documentation be changed. I can imagine this being very helpful for our larger clients - thanks for your detailed procedure!

Would it be alright if I went ahead and locked this thread?

Jesse
CFT6Server
Posts: 506
Joined: Wed Apr 15, 2015 4:21 pm

Re: Alerting

Post by CFT6Server »

Sure thing. FYI, I am still working on our upgrades. After a node is restarted following the upgrade, it sees a number of shards as unassigned and works through them. We have very large daily indexes, and it takes a long time for the cluster to go green. For large implementations, rolling upgrades could take a couple of days.
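A quick way to watch that recovery, assuming the _cat/shards endpoint is available in the bundled Elasticsearch version. The live command is shown as a comment; the counting is demonstrated on canned output:

```shell
#!/bin/sh
# Sketch: count shards still unassigned after a node restart.
# On a live node (assumes _cat/shards on localhost:9200):
#   curl -s localhost:9200/_cat/shards | grep -c UNASSIGNED
# Demonstrated on canned _cat/shards output:
sample='logstash-2015.06.01 0 p STARTED    1200000 1.1gb 10.0.0.1 node-1
logstash-2015.06.01 0 r UNASSIGNED
logstash-2015.06.02 1 r UNASSIGNED'
echo "$sample" | grep -c UNASSIGNED
```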
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Alerting

Post by jolson »

Understood - of course, be sure to let us know if you run into any trouble.

In the meantime, the document has been updated: https://assets.nagios.com/downloads/nag ... Server.pdf

Let me know if there's anything I might have missed.
Locked