Re: Nagios Log Server is showing RED in it's status

Posted: Tue May 09, 2017 3:42 pm
by mcapra
It looks like the primary shard for the nagioslogserver index is stuck on INITIALIZING:

Code:

nagioslogserver     0 p INITIALIZING                  127.0.0.1 2e8d09bc-4a49-4284-a85c-16159954531a 
Can you try running the following command from the CLI of your Nagios Log Server machine and see if it allows you to log in afterwards:

Code:

curl -XPOST 'localhost:9200/_cluster/reroute' -d '{"commands":[{"allocate":{"index":"nagioslogserver","shard":0,"node":"2e8d09bc-4a49-4284-a85c-16159954531a","allow_primary":true}}]}'
If that doesn't work, please share the output of the following command:

Code:

curl -s localhost:9200/_cat/shards
Please also share the most recent contents of your Elasticsearch log (/var/log/elasticsearch/*.log). We don't need all the tarballs again, just the most recent log.
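
If the shard list is long, it may help to filter it down to the shards that are not healthy. This is only a rough sketch: `unhealthy_shards` is a hypothetical helper name, and it assumes the default `_cat/shards` column order (index, shard, prirep, state, ...), so the state is the fourth column.

```shell
# Hypothetical helper: print only the shards whose state is not STARTED.
# Assumes the default _cat/shards column order, where state is column 4.
unhealthy_shards() {
    awk 'NF && $4 != "STARTED"'
}

# Usage on the Log Server machine:
#   curl -s localhost:9200/_cat/shards | unhealthy_shards
```

An empty result would mean every shard in the listing reported STARTED.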

Re: Nagios Log Server is showing RED in it's status

Posted: Wed May 10, 2017 6:15 am
by srinivasmandalika
When I run that command, I get the error shown in the attached file. The file also contains the output of the command you asked for.

Srinivas Mandalika

Re: Nagios Log Server is showing RED in it's status

Posted: Wed May 10, 2017 12:24 pm
by mcapra
I would try restarting the Elasticsearch service, then waiting 5-10 minutes for the cluster to quiesce.

Afterwards, try rerouting the shard again:

Code:

curl -XPOST 'localhost:9200/_cluster/reroute' -d '{"commands":[{"allocate":{"index":"nagioslogserver","shard":0,"node":"2e8d09bc-4a49-4284-a85c-16159954531a","allow_primary":true}}]}'
If none of that works, can you share a fresh copy of the Elasticsearch logs? It's very strange for a shard to be stuck "Initializing" without memory/storage issues present in the logs.
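
For the "wait 5-10 minutes" step, a small polling loop can stand in for watching the clock. This is a sketch under assumptions, not an official procedure: `fetch_health`, `health_status` and `wait_until_not_red` are hypothetical helper names, and the defaults (20 checks, 30 seconds apart) are simply chosen to add up to about 10 minutes.

```shell
# Hypothetical helper: fetch the cluster health JSON from the local node.
fetch_health() {
    curl -s 'localhost:9200/_cluster/health'
}

# Hypothetical helper: pull the "status" value out of health JSON on stdin.
health_status() {
    sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p' | head -n 1
}

# Poll until the cluster reports something other than red.
# $1 = number of checks (default 20), $2 = seconds between checks (default 30).
wait_until_not_red() {
    attempts=${1:-20}
    delay=${2:-30}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        status=$(fetch_health | health_status)
        if [ -n "$status" ] && [ "$status" != "red" ]; then
            echo "$status"
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "still red (or unreachable) after $attempts checks" >&2
    return 1
}
```

If `wait_until_not_red` returns non-zero, the cluster is still red after the wait, and that is the point at which trying the reroute again makes sense.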

Re: Nagios Log Server is showing RED in it's status

Posted: Wed May 10, 2017 1:22 pm
by srinivasmandalika
I did as you said, but I get the same error when I enter the command. The logs are on JumpShare:

http://jmp.sh/5oAXPYH

Srinivas Mandalika

Re: Nagios Log Server is showing RED in it's status

Posted: Wed May 10, 2017 1:33 pm
by mcapra
I think the nagioslogserver index is totally busted. The log shows translog corruption when it tries to recover shard 0:

Code:

[2017-05-10 13:51:47,517][WARN ][indices.cluster          ] [2e8d09bc-4a49-4284-a85c-16159954531a] [[nagioslogserver][0]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [nagioslogserver][0] failed to recover shard
	at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:297)
	at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:112)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.index.translog.TranslogCorruptedException: translog corruption while reading from stream
	at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:72)
	at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:267)
	... 4 more
Caused by: org.elasticsearch.ElasticsearchException: failed to read [alert][AVn6Gp1RrenxnjZ7-11S]
	at org.elasticsearch.index.translog.Translog$Index.readFrom(Translog.java:522)
	at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:68)
	... 5 more
Caused by: org.elasticsearch.ElasticsearchIllegalArgumentException: No version type match [48]
	at org.elasticsearch.index.VersionType.fromValue(VersionType.java:307)
	at org.elasticsearch.index.translog.Translog$Index.readFrom(Translog.java:519)
	... 6 more
You might try to restore from a backup. You can find your backups here:

Code:

/store/backups/nagioslogserver
And use our restore script found here:

Code:

/usr/local/nagioslogserver/scripts/restore_backup.sh
For example, to restore from a backup I have from March 27th:

Code:

/usr/local/nagioslogserver/scripts/restore_backup.sh /store/backups/nagioslogserver/nagioslogserver.2017-03-27.1490648552.tar.gz
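
If there are several archives in the backup directory, something like the following could pick out the newest one to pass to the restore script. This is a sketch only: `latest_backup` is a hypothetical helper name, and it assumes the archives follow the `nagioslogserver.<date>.<epoch>.tar.gz` naming seen in the example above.

```shell
# Hypothetical helper: print the newest backup archive in a directory.
# Assumes names like nagioslogserver.<date>.<epoch>.tar.gz and sorts on
# the epoch field of the file name (not on the file's modification time).
latest_backup() {
    dir=$1
    for f in "$dir"/nagioslogserver.*.tar.gz; do
        [ -e "$f" ] || continue
        epoch=$(basename "$f" | cut -d. -f3)
        printf '%s %s\n' "$epoch" "$f"
    done | sort -n | tail -n 1 | cut -d' ' -f2-
}

# Usage:
#   latest_backup /store/backups/nagioslogserver
```

Keep in mind that the newest archive may have been taken after the index was already damaged, so an older one can be the better choice.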

Re: Nagios Log Server is showing RED in it's status

Posted: Wed May 10, 2017 2:24 pm
by srinivasmandalika
I can see only one backup in that location:

[root@localhost nagioslogserver]# ls
nagioslogserver.2017-05-10.1494439148.tar.gz


Any other way?

Srinivas Mandalika

Re: Nagios Log Server is showing RED in it's status

Posted: Thu May 11, 2017 11:22 am
by mcapra
Damage to databases is tricky to resolve, particularly so in distributed databases. You might try starting from scratch with a brand new nagioslogserver index, but this would remove any queries/alerts/configurations you had previously defined. You might be able to restore the configurations from these files, and I would suggest backing them up if you have complex configurations:

Code:

[root@nls1 templates]# ls -al /usr/local/nagioslogserver/logstash/etc/conf.d/
total 12
drwxrwxr-x. 2 nagios nagios   74 Mar 16 13:31 .
drwxrwxr-x. 3 nagios nagios   19 Aug 31  2016 ..
-rwxrwxr-x  1 nagios nagios 1606 Apr 24 14:51 000_inputs.conf
-rwxrwxr-x  1 nagios nagios 1132 Apr 24 14:51 500_filters.conf
-rwxrwxr-x  1 nagios nagios  537 Apr 24 14:51 999_outputs.conf
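
One way to back those files up before touching the index is to archive the whole directory. This is a minimal sketch: `backup_config_dir` is a hypothetical helper name, and the destination directory in the usage line is just an example path.

```shell
# Hypothetical helper: archive a config directory into a timestamped tarball.
# $1 = directory to save, $2 = directory to write the archive into.
# Prints the path of the archive it created.
backup_config_dir() {
    src=$1
    dest=$2
    archive="$dest/$(basename "$src").$(date +%Y-%m-%d.%H%M%S).tar.gz"
    mkdir -p "$dest" &&
        tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" &&
        echo "$archive"
}

# Usage (destination path is an example):
#   backup_config_dir /usr/local/nagioslogserver/logstash/etc/conf.d /root/nls-config-backup
```
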
The following commands should start fresh for version 1.4.4. Again, this will clear out your queries/alerts/users/configs. In case someone reads this months from now, I absolutely would not do this on systems newer than 1.4.4.

Code:

#This will delete all the configs/alerts/queries/users etc
curl -XDELETE 'http://localhost:9200/nagioslogserver/'

#This will create the appropriate command subsystem jobs 
curl -XPUT 'http://localhost:9200/nagioslogserver/commands/backups' -d '{"created":"2016-11-22 00:00:00","created_by":"1","active":1,"status":"waiting","type":"system","node":"global","command":"do_backups","run_time":1479846591,"frequency":"86400","last_run_time":"1970-01-01 00:00:00","last_run_status":"SUCCESS"}'
curl -XPUT 'http://localhost:9200/nagioslogserver/commands/backup_maintenance' -d '{"created":"2016-11-22 00:00:00","created_by":"1","active":1,"status":"waiting","type":"system","node":"global","command":"do_maintenance","run_time":1479846572,"frequency":"86400","last_run_output":"Maintenance and Backup jobs are being executed","last_run_time":"1970-01-01 00:00:00","last_run_status":"SUCCESS"}'
curl -XPUT 'http://localhost:9200/nagioslogserver/commands/cleanup_cmdsubsys' -d '{"created":"2016-11-22 00:00:00","created_by":"1","active":1,"status":"waiting","type":"system","node":"global","command":"cleanup","run_time":1479839486,"frequency":"3600","last_run_time":"1970-01-01 00:00:00","last_run_status":"SUCCESS"}'
curl -XPUT 'http://localhost:9200/nagioslogserver/commands/run_all_alerts' -d '{"created":"2016-11-22 00:00:00","created_by":"1","active":1,"status":"waiting","type":"system","node":"global","command":"run_alerts","run_time":1479838091,"frequency":"20","last_run_time":"1970-01-01 00:00:00","last_run_status":"SUCCESS"}'
curl -XPUT 'http://localhost:9200/nagioslogserver/commands/run_update_check' -d '{"created":"2016-11-22 00:00:00","created_by":"1","active":1,"status":"waiting","type":"system","node":"global","command":"update_check","run_time":1479846591,"frequency":"86400","last_run_time":"1970-01-01 00:00:00","last_run_status":"SUCCESS"}'

# This will restore the default configurations/queries/filters:
curl -XPUT 'http://localhost:9200/nagioslogserver/node/global' -d '{"config_inputs":[{"raw":"syslog {\r\n    type => '\''syslog'\''\r\n}","name":"Syslog (Default)","active":"1"},{"raw":"tcp {\r\n    type => '\''eventlog'\''\r\n    port => 3515\r\n    codec => json {\r\n        charset => '\''CP1252'\''\r\n    }\r\n}","name":"Windows Event Log (Default)","active":"1"},{"raw":"tcp {\r\n    type => '\''import_raw'\''\r\n    tags => '\''import_raw'\''\r\n    port => 2056\r\n}\r\nudp {\r\n    type => '\''import_raw'\''\r\n    tags => '\''import_raw'\''\r\n    port => 2056\r\n}","name":"Import Files - Raw (Default)","active":"1"},{"raw":"tcp {\r\n    type => '\''import_json'\''\r\n    tags => '\''import_json'\''\r\n    port => 2057\r\n    codec => json\r\n}","name":"Import Files - JSON (Default)","active":"0"}],"config_filters":[{"raw":"if [program] == '\''apache_access'\'' {\r\n    grok {\r\n        match => [ '\''message'\'', '\''%{COMBINEDAPACHELOG}'\'']\r\n    }\r\n    date {\r\n        match => [ '\''timestamp'\'', '\''dd/MMM/yyyy:HH:mm:ss Z'\'', '\''MMM dd HH:mm:ss'\'', '\''ISO8601'\'' ]\r\n    }\r\n    mutate {\r\n        replace => [ '\''type'\'', '\''apache_access'\'' ]\r\n         convert => [ '\''bytes'\'', '\''integer'\'' ]\r\n         convert => [ '\''response'\'', '\''integer'\'' ]\r\n    }\r\n}\r\n \r\nif [program] == '\''apache_error'\'' {\r\n    grok {\r\n        match => [ '\''message'\'', '\''\\[(?<timestamp>%{DAY:day} %{MONTH:month} %{MONTHDAY} %{TIME} %{YEAR})\\] \\[%{WORD:class}\\] \\[%{WORD:originator} %{IP:clientip}\\] %{GREEDYDATA:errmsg}'\'']\r\n    }\r\n    mutate {\r\n        replace => [ '\''type'\'', '\''apache_error'\'' ]\r\n    }\r\n}","name":"Apache (Default)","active":"1"}],"config_outputs":[]}'


#This will create a user named "someuser" with the password "nagiosls".  You can use this account to log in and optionally create the accounts that you need, then delete the "someuser" account as one of the other users when done.
curl -XPUT 'http://localhost:9200/nagioslogserver/user/1' -d '{"username":"someuser","password":"c678bcf3b5138b9263a95c44d28097f22c2e02877193d2c25313478821d45c19","auth_type":"admin","email":"[email protected]","language":"default","apiaccess":"1","apikey":"1396e08757545557073844695e5b64caa0bd3ad3","created":"2015-01-23 10:00:00","created_by":0,"default_dashboard":"/dashboard/elasticsearch/default"}'

The other option is to start with a completely fresh installation.

Re: Nagios Log Server is showing RED in it's status

Posted: Mon May 15, 2017 8:38 am
by srinivasmandalika
Where can I find the alerts? I am not able to find them in the conf.d folder. We have created a large number of alerts, and I would like to keep a copy of them to shorten the restore, if possible.

Thanks!

Srinivas Mandalika

Re: Nagios Log Server is showing RED in it's status

Posted: Mon May 15, 2017 8:50 am
by srinivasmandalika
Fortunately, after I restarted my Nagios server a few minutes ago, I was able to log in again. When I ran the following command to check the status of the server, it showed yellow, and the Log Collector service was stopped. I started the Log Collector service and it is up and running now, but the status is still yellow. Any suggestions for getting this to green? Or is it okay for it to stay yellow?

[root@localhost ~]# curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
{
"cluster_name" : "4aa88b36-6e32-4c15-992e-78e1784d646c",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 341,
"active_shards" : 341,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 341,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0
}


Let me know...

Srinivas Mandalika

Re: Nagios Log Server is showing RED in it's status

Posted: Mon May 15, 2017 11:26 am
by cdienger
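The yellow status is expected if you have only one data instance. With a single node there is nowhere to allocate the replica shards, so they stay unassigned while all the primaries are active.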
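
To double-check that the unassigned shards on a yellow single-instance cluster are all replicas rather than primaries, the shard listing can be tallied by type. This is a rough sketch: `count_unassigned` is a hypothetical helper name, and it assumes the default `_cat/shards` column order, where column 3 is `p` or `r` and column 4 is the state.

```shell
# Hypothetical helper: count unassigned primary and replica shards.
# Assumes the default _cat/shards columns: index, shard, prirep, state, ...
count_unassigned() {
    awk '$4 == "UNASSIGNED" { c[$3]++ }
         END { printf "primary=%d replica=%d\n", c["p"], c["r"] }'
}

# Usage:
#   curl -s localhost:9200/_cat/shards | count_unassigned
```

A non-zero primary count would point to a real problem (a red cluster), while a count made up only of replicas matches the expected yellow state.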