Nagios Support Forum

Posted: **Tue May 09, 2017 3:42 pm**

It looks like the primary shard for the nagioslogserver index is stuck on INITIALIZING:

nagioslogserver     0 p INITIALIZING                  127.0.0.1 2e8d09bc-4a49-4284-a85c-16159954531a

Can you try running the following command from the CLI of your Nagios Log Server machine and see if it allows you to log in afterwards:

Code: Select all

curl -XPOST 'localhost:9200/_cluster/reroute' -d '{"commands":[{"allocate":{"index":"nagioslogserver","shard":0,"node":"2e8d09bc-4a49-4284-a85c-16159954531a","allow_primary":true}}]}'
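The request body above is easier to check when it is built from its three inputs. The following is a minimal sketch, not part of Nagios Log Server: `reroute_body` is a hypothetical helper name, and the node ID is the one shown in the shard listing. Note that `allow_primary` forces a primary to be allocated and may discard whatever data the shard held, so treat it as a last resort.

```shell
# Print the JSON body for a _cluster/reroute "allocate" command.
# $1 = index name, $2 = shard number, $3 = node name or ID.
# (reroute_body is a hypothetical helper, used here only for illustration.)
reroute_body() {
    printf '{"commands":[{"allocate":{"index":"%s","shard":%s,"node":"%s","allow_primary":true}}]}\n' "$1" "$2" "$3"
}

# Usage (assumes Elasticsearch is listening on localhost:9200):
#   curl -XPOST 'localhost:9200/_cluster/reroute' \
#        -d "$(reroute_body nagioslogserver 0 2e8d09bc-4a49-4284-a85c-16159954531a)"
```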

If that doesn't work, please share the output of the following command:

Code: Select all

curl -s localhost:9200/_cat/shards
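On a node with hundreds of shards, the full listing is long. A small filter, sketched here on the assumption that the shard state is the fourth column (as in the line quoted at the top of this post), shows only the shards that are not STARTED. `non_started_shards` is a hypothetical helper name.

```shell
# Print only shards whose state (column 4 of _cat/shards) is not STARTED.
# Reads the _cat/shards output on stdin and skips blank lines.
non_started_shards() {
    awk 'NF && $4 != "STARTED"'
}

# Usage (assumes Elasticsearch is listening on localhost:9200):
#   curl -s localhost:9200/_cat/shards | non_started_shards
```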

Please also share the most recent contents of your Elasticsearch log (/var/log/elasticsearch/*.log). We don't need all the tarballs again, just the most recent log.

Posted: **Wed May 10, 2017 6:15 am**

When I run that command, I get the error shown in the attached file. The file also contains the output of the other command you gave...

Srinivas Mandalika

Posted: **Wed May 10, 2017 12:24 pm**

I would try restarting the Elasticsearch service, then waiting 5-10 minutes for the cluster to quiesce.
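Rather than timing the wait by hand, the cluster health endpoint can be polled until the status is no longer red. This is only a sketch: `cluster_status` and `wait_for_not_red` are hypothetical helper names, the 10-second polling interval is arbitrary, and Elasticsearch is assumed to be listening on localhost:9200.

```shell
# Extract the "status" value from _cluster/health JSON read on stdin.
cluster_status() {
    sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p' | head -n 1
}

# Poll cluster health every 10 seconds until it reports green or yellow.
# $1 = maximum number of seconds to wait (default 600, i.e. 10 minutes).
wait_for_not_red() {
    timeout_s=${1:-600}
    waited=0
    while [ "$waited" -lt "$timeout_s" ]; do
        status=$(curl -s 'localhost:9200/_cluster/health' | cluster_status)
        case "$status" in
            green|yellow) echo "cluster is $status"; return 0 ;;
        esac
        sleep 10
        waited=$((waited + 10))
    done
    echo "cluster still not green/yellow after ${timeout_s}s" >&2
    return 1
}

# Usage:
#   wait_for_not_red 600
```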

Afterwards, try rerouting the shard again:

Code: Select all

curl -XPOST 'localhost:9200/_cluster/reroute' -d '{"commands":[{"allocate":{"index":"nagioslogserver","shard":0,"node":"2e8d09bc-4a49-4284-a85c-16159954531a","allow_primary":true}}]}'

If none of that works, can you share a fresh copy of the Elasticsearch logs? It's very strange for a shard to be stuck "Initializing" without memory/storage issues present in the logs.

Posted: **Wed May 10, 2017 1:22 pm**

I did as you said, but I get the same error when I enter the command. Please find the logs on JumpShare:

http://jmp.sh/5oAXPYH

Srinivas Mandalika

Posted: **Wed May 10, 2017 1:33 pm**

I think the nagioslogserver index is totally busted. The log shows shard 0 failing recovery because its translog is corrupted:

Code: Select all

[2017-05-10 13:51:47,517][WARN ][indices.cluster          ] [2e8d09bc-4a49-4284-a85c-16159954531a] [[nagioslogserver][0]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [nagioslogserver][0] failed to recover shard
	at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:297)
	at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:112)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.index.translog.TranslogCorruptedException: translog corruption while reading from stream
	at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:72)
	at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:267)
	... 4 more
Caused by: org.elasticsearch.ElasticsearchException: failed to read [alert][AVn6Gp1RrenxnjZ7-11S]
	at org.elasticsearch.index.translog.Translog$Index.readFrom(Translog.java:522)
	at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:68)
	... 5 more
Caused by: org.elasticsearch.ElasticsearchIllegalArgumentException: No version type match [48]
	at org.elasticsearch.index.VersionType.fromValue(VersionType.java:307)
	at org.elasticsearch.index.translog.Translog$Index.readFrom(Translog.java:519)
	... 6 more

You might try to restore from a backup. You can find your backups here:

Code: Select all

/store/backups/nagioslogserver

And use our restore script found here:

Code: Select all

/usr/local/nagioslogserver/scripts/restore_backup.sh

For example, to restore from a backup I have from March 27th:

Code: Select all

/usr/local/nagioslogserver/scripts/restore_backup.sh /store/backups/nagioslogserver/nagioslogserver.2017-03-27.1490648552.tar.gz
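Before restoring, it may be worth checking that the chosen tarball can be read end to end. A minimal sketch using plain `tar`; `backup_is_readable` is a hypothetical helper name and is not part of the restore script.

```shell
# Succeed only if the gzip-compressed tarball at $1 can be listed in full.
backup_is_readable() {
    tar -tzf "$1" > /dev/null 2>&1
}

# Usage:
#   backup_is_readable /store/backups/nagioslogserver/nagioslogserver.2017-03-27.1490648552.tar.gz \
#       && echo "backup looks readable"
```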

Posted: **Wed May 10, 2017 2:24 pm**

I can see only one backup in that location:

[root@localhost nagioslogserver]# ls
nagioslogserver.2017-05-10.1494439148.tar.gz

Any other way?

Srinivas Mandalika

Posted: **Thu May 11, 2017 11:22 am**

Damage to databases is tricky to resolve, particularly so in distributed databases. You might try starting from scratch with a brand new nagioslogserver index, but this would remove any queries/alerts/configurations you had previously defined. You might be able to restore the configurations from these files, and I would suggest backing them up if you have complex configurations:

Code: Select all

[root@nls1 templates]# ls -al /usr/local/nagioslogserver/logstash/etc/conf.d/
total 12
drwxrwxr-x. 2 nagios nagios   74 Mar 16 13:31 .
drwxrwxr-x. 3 nagios nagios   19 Aug 31  2016 ..
-rwxrwxr-x  1 nagios nagios 1606 Apr 24 14:51 000_inputs.conf
-rwxrwxr-x  1 nagios nagios 1132 Apr 24 14:51 500_filters.conf
-rwxrwxr-x  1 nagios nagios  537 Apr 24 14:51 999_outputs.conf
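A minimal sketch for copying that directory into a timestamped tarball before touching the index. `backup_dir` is a hypothetical helper name, and the destination directory in the usage comment is an assumption; pick any location outside the Nagios Log Server tree.

```shell
# Archive a directory into a timestamped .tar.gz and print the archive path.
# $1 = directory to back up, $2 = existing directory to write the archive into.
backup_dir() {
    src=$1
    dest_dir=$2
    archive="$dest_dir/$(basename "$src").$(date +%Y-%m-%d.%s).tar.gz"
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" && echo "$archive"
}

# Usage (/root/nls-config-backup is a hypothetical destination):
#   mkdir -p /root/nls-config-backup
#   backup_dir /usr/local/nagioslogserver/logstash/etc/conf.d /root/nls-config-backup
```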

The following commands should start fresh for version 1.4.4. Again, this will clear out your queries/alerts/users/configs. In case someone reads this months from now, I absolutely would not do this on systems newer than 1.4.4.

Code: Select all

#This will delete all the configs/alerts/queries/users etc
curl -XDELETE 'http://localhost:9200/nagioslogserver/'

#This will create the appropriate command subsystem jobs 
curl -XPUT 'http://localhost:9200/nagioslogserver/commands/backups' -d '{"created":"2016-11-22 00:00:00","created_by":"1","active":1,"status":"waiting","type":"system","node":"global","command":"do_backups","run_time":1479846591,"frequency":"86400","last_run_time":"1970-01-01 00:00:00","last_run_status":"SUCCESS"}'
curl -XPUT 'http://localhost:9200/nagioslogserver/commands/backup_maintenance' -d '{"created":"2016-11-22 00:00:00","created_by":"1","active":1,"status":"waiting","type":"system","node":"global","command":"do_maintenance","run_time":1479846572,"frequency":"86400","last_run_output":"Maintenance and Backup jobs are being executed","last_run_time":"1970-01-01 00:00:00","last_run_status":"SUCCESS"}'
curl -XPUT 'http://localhost:9200/nagioslogserver/commands/cleanup_cmdsubsys' -d '{"created":"2016-11-22 00:00:00","created_by":"1","active":1,"status":"waiting","type":"system","node":"global","command":"cleanup","run_time":1479839486,"frequency":"3600","last_run_time":"1970-01-01 00:00:00","last_run_status":"SUCCESS"}'
curl -XPUT 'http://localhost:9200/nagioslogserver/commands/run_all_alerts' -d '{"created":"2016-11-22 00:00:00","created_by":"1","active":1,"status":"waiting","type":"system","node":"global","command":"run_alerts","run_time":1479838091,"frequency":"20","last_run_time":"1970-01-01 00:00:00","last_run_status":"SUCCESS"}'
curl -XPUT 'http://localhost:9200/nagioslogserver/commands/run_update_check' -d '{"created":"2016-11-22 00:00:00","created_by":"1","active":1,"status":"waiting","type":"system","node":"global","command":"update_check","run_time":1479846591,"frequency":"86400","last_run_time":"1970-01-01 00:00:00","last_run_status":"SUCCESS"}'

# This will restore the default configurations/queries/filters:
curl -XPUT 'http://localhost:9200/nagioslogserver/node/global' -d '{"config_inputs":[{"raw":"syslog {\r\n    type => '\''syslog'\''\r\n}","name":"Syslog (Default)","active":"1"},{"raw":"tcp {\r\n    type => '\''eventlog'\''\r\n    port => 3515\r\n    codec => json {\r\n        charset => '\''CP1252'\''\r\n    }\r\n}","name":"Windows Event Log (Default)","active":"1"},{"raw":"tcp {\r\n    type => '\''import_raw'\''\r\n    tags => '\''import_raw'\''\r\n    port => 2056\r\n}\r\nudp {\r\n    type => '\''import_raw'\''\r\n    tags => '\''import_raw'\''\r\n    port => 2056\r\n}","name":"Import Files - Raw (Default)","active":"1"},{"raw":"tcp {\r\n    type => '\''import_json'\''\r\n    tags => '\''import_json'\''\r\n    port => 2057\r\n    codec => json\r\n}","name":"Import Files - JSON (Default)","active":"0"}],"config_filters":[{"raw":"if [program] == '\''apache_access'\'' {\r\n    grok {\r\n        match => [ '\''message'\'', '\''%{COMBINEDAPACHELOG}'\'']\r\n    }\r\n    date {\r\n        match => [ '\''timestamp'\'', '\''dd/MMM/yyyy:HH:mm:ss Z'\'', '\''MMM dd HH:mm:ss'\'', '\''ISO8601'\'' ]\r\n    }\r\n    mutate {\r\n        replace => [ '\''type'\'', '\''apache_access'\'' ]\r\n         convert => [ '\''bytes'\'', '\''integer'\'' ]\r\n         convert => [ '\''response'\'', '\''integer'\'' ]\r\n    }\r\n}\r\n \r\nif [program] == '\''apache_error'\'' {\r\n    grok {\r\n        match => [ '\''message'\'', '\''\\[(?<timestamp>%{DAY:day} %{MONTH:month} %{MONTHDAY} %{TIME} %{YEAR})\\] \\[%{WORD:class}\\] \\[%{WORD:originator} %{IP:clientip}\\] %{GREEDYDATA:errmsg}'\'']\r\n    }\r\n    mutate {\r\n        replace => [ '\''type'\'', '\''apache_error'\'' ]\r\n    }\r\n}","name":"Apache (Default)","active":"1"}],"config_outputs":[]}'


#This will create a user named "someuser" with the password "nagiosls".  You can use this account to log in and optionally create the accounts that you need, then delete the "someuser" account as one of the other users when done.
curl -XPUT 'http://localhost:9200/nagioslogserver/user/1' -d '{"username":"someuser","password":"c678bcf3b5138b9263a95c44d28097f22c2e02877193d2c25313478821d45c19","auth_type":"admin","email":"[email protected]","language":"default","apiaccess":"1","apikey":"1396e08757545557073844695e5b64caa0bd3ad3","created":"2015-01-23 10:00:00","created_by":0,"default_dashboard":"/dashboard/elasticsearch/default"}'

The other option is to start with a completely fresh installation.

Posted: **Mon May 15, 2017 8:38 am**

Where can I find the alerts? I can't find them in the conf.d folder. We have created a large number of alerts, and I would like to keep a copy of them to minimize the restore time, if possible...

Thanks!

Srinivas Mandalika

Posted: **Mon May 15, 2017 8:50 am**

Fortunately, after I restarted my Nagios server a few minutes ago, I was able to log in to Nagios Server. When I ran the following command to check the status of the server, the status was yellow and the Log Collector service was stopped. I started the Log Collector service and it is up and running now, but the status is still yellow. Any suggestions for getting it to green? Or is it okay for it to stay yellow?

[root@localhost ~]# curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
{
"cluster_name" : "4aa88b36-6e32-4c15-992e-78e1784d646c",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 341,
"active_shards" : 341,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 341,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0
}

Let me know...

Srinivas Mandalika

Posted: **Mon May 15, 2017 11:26 am**

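The yellow status is expected if you have only one data instance. With a single node, replica shards have nowhere to be allocated, so they stay unassigned while all primaries remain active.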
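The 341 unassigned shards in the health output above match the 341 active primaries, which suggests that every unassigned shard is a replica. A minimal sketch to confirm this from the shard listing, assuming column 3 is `p` or `r` and column 4 is the state; `unassigned_by_type` is a hypothetical helper name.

```shell
# Count UNASSIGNED shards by type from _cat/shards output read on stdin.
# Column 3 is "p" (primary) or "r" (replica); column 4 is the shard state.
unassigned_by_type() {
    awk '$4 == "UNASSIGNED" { n[$3]++ }
         END { printf "primary=%d replica=%d\n", n["p"], n["r"] }'
}

# Usage (assumes Elasticsearch is listening on localhost:9200):
#   curl -s localhost:9200/_cat/shards | unassigned_by_type
```

If the primary count is zero, the yellow status comes only from replicas that a single-node cluster cannot place.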

Nagios Log Server is showing RED in it's status