Nagios Support Forum

Posted: **Thu Dec 20, 2018 11:55 am**

Hello,

I think I might be experiencing an issue after a recent logstash failure.

I ran the following command on the primary node in my Log Server cluster: curl -XGET 'http://localhost:9200/_cat/shards?v'

It appears that nagioslogserver_log is stuck INITIALIZING, and it's keeping my cluster status at Yellow.
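For anyone reading along, here is a minimal sketch of how the `_cat/shards?v` output could be filtered down to just the shards that are not STARTED. It assumes the header row contains the columns `index`, `shard`, `prirep` and `state`; the function name and the sample rows are made up for illustration and are not taken from this cluster.

```python
def unhealthy_shards(cat_output):
    """Return (index, shard, prirep, state) for every shard whose state is not STARTED."""
    lines = [ln for ln in cat_output.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    # Locate the columns by name instead of assuming a fixed order.
    wanted = [header.index(col) for col in ("index", "shard", "prirep", "state")]
    result = []
    for ln in lines[1:]:
        fields = ln.split()
        if len(fields) <= max(wanted):
            continue  # malformed or truncated row
        index, shard, prirep, state = (fields[i] for i in wanted)
        if state != "STARTED":
            result.append((index, shard, prirep, state))
    return result


# Hypothetical sample, shaped like `_cat/shards?v` output (not real data).
SAMPLE = """index shard prirep state docs store ip node
nagioslogserver_log 0 p STARTED 100 1mb 10.0.0.1 node-a
nagioslogserver_log 0 r INITIALIZING 10.0.0.2 node-b
logstash-2018.12.20 1 r UNASSIGNED
"""

if __name__ == "__main__":
    for row in unhealthy_shards(SAMPLE):
        print(*row)
```

To use it against a live node, save the output of the curl command above to a file and pass the file contents to `unhealthy_shards`.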

Please see attached export from Log Server.

Not sure what to do to get it back to Green. Also, I'm having trouble with snapshots again. It seems that once we try to snap 30 days, the system just can't handle it and we get logstash failures.

Posted: **Thu Dec 20, 2018 6:01 pm**

Was the output modified to remove the node information (UUID/IP)? I would start by restarting the elasticsearch service on the node where the replicas are found. If that doesn't do the trick, restart the service on the primary (after the first server's service comes back).

Posted: **Thu Dec 20, 2018 6:06 pm**

Yes, I omitted the IPs. I didn't think they would matter for my question.

Thank you, I'll try that.

Posted: **Thu Dec 20, 2018 7:02 pm**

Unfortunately, that did not work. There are still two log shards that are stuck initializing.

I even took the cluster completely down and brought it back up.

Any other ideas?

Posted: **Fri Dec 21, 2018 9:48 am**

Actually, disregard my last post. We're back. It looks like taking down Log Server completely and bringing it back up did the trick. There's a snapshot running now and the system is Green.

Now on to my second question: we seem to be having a lot of inconsistency with our snapshots. We're fine with anything between 14 and 29 days, but as soon as we try to snapshot 30 days, logstash starts failing or the system becomes unresponsive.

I used this article to resolve our issues before: https://support.nagios.com/kb/article/n ... g-576.html

Is it possible we need to increase those numbers even further?

Thank you.

Posted: **Fri Dec 21, 2018 4:04 pm**

If you see those errors in the console, you can try increasing those variables again to see if it helps.

Also, check the log files in the following folders when the server is having the issue and if you find any errors, post them here.
/var/log/logstash/ and /var/log/elasticsearch/
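As a rough sketch of the log check described above, the following scans the plain-text files in those two folders for lines that look like errors. The keyword list is an assumption (adjust it to whatever you actually see in your logs), and the function name is made up for illustration.

```python
import os

# Assumed keywords; these are guesses at what an error line contains.
KEYWORDS = ("ERROR", "Exception", "OutOfMemory")


def find_error_lines(directories, keywords=KEYWORDS):
    """Return (path, line_number, text) for each line containing one of the keywords."""
    hits = []
    for directory in directories:
        if not os.path.isdir(directory):
            continue  # skip folders that do not exist on this host
        for name in sorted(os.listdir(directory)):
            path = os.path.join(directory, name)
            # Skip subdirectories and rotated, compressed logs.
            if not os.path.isfile(path) or name.endswith(".gz"):
                continue
            try:
                with open(path, errors="replace") as fh:
                    for lineno, line in enumerate(fh, 1):
                        if any(k in line for k in keywords):
                            hits.append((path, lineno, line.rstrip()))
            except OSError:
                continue  # unreadable file (for example, permissions)
    return hits


if __name__ == "__main__":
    for path, lineno, text in find_error_lines(
        ["/var/log/logstash/", "/var/log/elasticsearch/"]
    ):
        print(f"{path}:{lineno}: {text}")
```

Run it while the server is having the issue, so the most recent entries reflect the failure.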

Posted: **Fri Dec 21, 2018 4:15 pm**

Thank you.

What command would I use to see whether a snapshot is currently running, or to check the status of a previous snapshot?

Posted: **Fri Dec 21, 2018 4:43 pm**

What does it mean when the Command Subsystem says WAITING for the snapshot, but Snapshots and Maintenance still shows the snapshot IN PROGRESS?

This is the issue we keep seeing. The subsystem says the snapshot is complete, but when we go to check, the system locks up and says the snapshot is still running.

Posted: **Fri Dec 21, 2018 5:09 pm**

I ran the following command and received this output. Currently, my log server system is completely locked up...

[root@nagioslscc2 xxxxxxx]# curl -s -XGET 'http://localhost:9200/_cluster/state?pretty' | grep snapshot -A 100
"snapshot" : {
"_timestamp" : {
"enabled" : true
},
"properties" : {
"path" : {
"type" : "string"
},
"auto" : {
"type" : "long"
},
"filename" : {
"type" : "string"
},
"created" : {
"type" : "long"
},
"name" : {
"type" : "string"
},
"clean_filename" : {
"type" : "string"
}
}
},
"_default_" : {
"_timestamp" : {
"enabled" : true
},
"properties" : { }
},
"commands" : {
"_timestamp" : {
"enabled" : true
},
"properties" : {
"last_run_status" : {
"type" : "string"
},
"run_time" : {
"type" : "long"
},
"created" : {
"type" : "string"
},
"active" : {
"type" : "long"
},
"type" : {
"type" : "string"
},
"created_by" : {
"type" : "string"
},
"command" : {
"type" : "string"
},
"last_run_output" : {
"type" : "string"
},
"last_run_time" : {
"type" : "string"
},
"frequency" : {
"type" : "string"
},
"args" : {
"properties" : {
"sh_id" : {
"type" : "string"
},
"path" : {
"type" : "string"
},
"timezone" : {
"type" : "string"
},
"sh_created" : {
"type" : "long"
},
"id" : {
"type" : "string"
}
}
},
"node" : {
"index" : "not_analyzed",
"type" : "string"
},
"modified_by" : {
"type" : "string"
},
"modified" : {
"type" : "string"
},
"status" : {
"type" : "string"
}
}
},
"node" : {
--
"snapshots" : {
"snapshots" : [ {
"repository" : "nlsrep",
"snapshot" : "curator-20181221123022",
"include_global_state" : true,
"state" : "STARTED",
"indices" : [ "logstash-2018.11.21", "logstash-2018.11.22", "logstash-2018.11.23", "logstash-2018.11.24", "logstash-2018.11.25", "logstash-2018.11.26", "logstash-2018.11.27", "logstash-2018.11.28", "logstash-2018.11.29", "logstash-2018.11.30", "logstash-2018.12.01", "logstash-2018.12.02", "logstash-2018.12.03", "logstash-2018.12.04", "logstash-2018.12.05", "logstash-2018.12.06", "logstash-2018.12.07", "logstash-2018.12.08", "logstash-2018.12.09", "logstash-2018.12.10", "logstash-2018.12.11", "logstash-2018.12.12", "logstash-2018.12.13", "logstash-2018.12.14", "logstash-2018.12.15", "logstash-2018.12.16", "logstash-2018.12.17", "logstash-2018.12.18", "logstash-2018.12.19", "logstash-2018.12.20" ],
"shards" : [ {
"index" : "logstash-2018.11.22",
"shard" : 2,
"state" : "SUCCESS",
"node" : "OMSbsV-8RXeTUT1LcgeotQ"
}, {
"index" : "logstash-2018.12.03",
"shard" : 2,
"state" : "SUCCESS",
"node" : "OMSbsV-8RXeTUT1LcgeotQ"
}, {
"index" : "logstash-2018.12.15",
"shard" : 3,
"state" : "SUCCESS",
"node" : "OMSbsV-8RXeTUT1LcgeotQ"
}, {
"index" : "logstash-2018.11.22",
"shard" : 1,
"state" : "SUCCESS",
"node" : "OMSbsV-8RXeTUT1LcgeotQ"
}, {
"index" : "logstash-2018.12.03",
"shard" : 1,
"state" : "SUCCESS",
"node" : "OMSbsV-8RXeTUT1LcgeotQ"
}, {
"index" : "logstash-2018.12.15",
"shard" : 2,
"state" : "SUCCESS",
"node" : "OMSbsV-8RXeTUT1LcgeotQ"
}, {
"index" : "logstash-2018.11.22",
"shard" : 4,
"state" : "SUCCESS",
"node" : "OMSbsV-8RXeTUT1LcgeotQ"
}, {
"index" : "logstash-2018.12.03",
"shard" : 0,
"state" : "SUCCESS",
"node" : "OMSbsV-8RXeTUT1LcgeotQ"
}, {
"index" : "logstash-2018.12.15",
"shard" : 1,
"state" : "SUCCESS",
"node" : "OMSbsV-8RXeTUT1LcgeotQ"
}, {
"index" : "logstash-2018.11.22",
"shard" : 3,
"state" : "SUCCESS",
"node" : "OMSbsV-8RXeTUT1LcgeotQ"
}, {
"index" : "logstash-2018.12.15",
"shard" : 0,
"state" : "SUCCESS",
"node" : "OMSbsV-8RXeTUT1LcgeotQ"
}, {
"index" : "logstash-2018.12.02",
"shard" : 4,
"state" : "SUCCESS",
"node" : "OMSbsV-8RXeTUT1LcgeotQ"
}, {
"index" : "logstash-2018.11.23",
"shard" : 1,
"state" : "SUCCESS",
"node" : "OMSbsV-8RXeTUT1LcgeotQ"
}, {
"index" : "logstash-2018.12.02",
"shard" : 3,
"state" : "SUCCESS",
"node" : "OMSbsV-8RXeTUT1LcgeotQ"
}, {
"index" : "logstash-2018.12.14",
"shard" : 4,
"state" : "SUCCESS",
"node" : "OMSbsV-8RXeTUT1LcgeotQ"
}, {
"index" : "logstash-2018.11.23",
"shard" : 0,
"state" : "SUCCESS",
"node" : "OMSbsV-8RXeTUT1LcgeotQ"
}, {
"index" : "logstash-2018.12.02",
"shard" : 2,
"state" : "SUCCESS",
"node" : "OMSbsV-8RXeTUT1LcgeotQ"
}, {
"index" : "logstash-2018.12.14",
"shard" : 3,
"state" : "SUCCESS",
"node" : "OMSbsV-8RXeTUT1LcgeotQ"
}, {
"index" : "logstash-2018.11.23",
"shard" : 3,
"state" : "SUCCESS",
"node" : "OMSbsV-8RXeTUT1LcgeotQ"
}, {
"index" : "logstash-2018.12.04",

Posted: **Fri Dec 21, 2018 6:47 pm**

See attached screenshots for issue described. There are two snapshots that show IN_PROCESS simultaneously even though the Command Subsystem says WAITING.
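The cluster state output posted earlier in this thread lists every shard of the running snapshot with its own state, which is hard to read by eye. Below is a minimal sketch that tallies those per-shard states from the parsed JSON. It does not assume where the snapshot entries sit in the document; it simply searches for objects that have both a `snapshot` name and a `shards` list, as seen in that output. The function names are made up for illustration.

```python
from collections import Counter


def find_snapshot_entries(node):
    """Recursively collect objects that have a 'snapshot' name and a 'shards' list."""
    found = []
    if isinstance(node, dict):
        if "snapshot" in node and isinstance(node.get("shards"), list):
            found.append(node)
        else:
            for value in node.values():
                found.extend(find_snapshot_entries(value))
    elif isinstance(node, list):
        for value in node:
            found.extend(find_snapshot_entries(value))
    return found


def summarize_snapshots(cluster_state):
    """Map each snapshot name to its overall state and a count of shard states."""
    summary = {}
    for entry in find_snapshot_entries(cluster_state):
        counts = Counter(shard.get("state") for shard in entry["shards"])
        summary[entry["snapshot"]] = {
            "state": entry.get("state"),
            "shards": dict(counts),
        }
    return summary


if __name__ == "__main__":
    import json
    import sys

    # Expects the full cluster state JSON on stdin, e.g. piped from curl.
    print(json.dumps(summarize_snapshots(json.load(sys.stdin)), indent=2))
```

Piping the complete `_cluster/state` response into the script (without the `grep`) should show at a glance how many shards are finished and how many are still pending for each snapshot that the cluster considers in progress.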

nagioslogserver_log INITIALIZING - after logstash failure
