
Nagios XI alert for Log Server backup failure

Posted: Tue Jun 20, 2017 10:17 pm
by james.liew
Hi Support,

I'm not quite sure where to post this, but it looks like my Nagios XI is showing some backup snapshot alerts for my Log Server cluster.

The strange thing is that when I look at Nagios Log Server, the backup snapshot ran successfully. This is occurring across two datacentres.

Where do I start troubleshooting this?


***** Nagios XI Alert *****

Nagios has detected a problem with this service.

Notification Type: PROBLEM

Service: Last NLS Backup
Host: hs3-nagcluster
Address: *ip removed*
State: UNKNOWN
Info:
UNKNOWN: Unable to determine result within last 25 hours: No snapshots found in Elasticsearch.
Date/Time: 21/06/2017 13:10:29

Re: Nagios XI alert for Log Server backup failure

Posted: Wed Jun 21, 2017 11:00 am
by cdienger
Run the following on the NLS:


curator --loglevel warn show snapshots --repository REPONAME --newer-than 25 --time-unit hours
curl -v -XGET "http://localhost:9200/_snapshot/REPONAME/_all?pretty"
REPONAME is the name of the repository as seen under Administration > System > Backup & Maintenance > Repositories. The first command is the command that the check runs and may give us more information as to why it is unable to detect a current snapshot. The second command will show us all available snapshots.
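If it helps to cross-check the second command's output by hand, here is a rough Python sketch that takes the JSON body returned by the `_snapshot/REPONAME/_all` request and lists the snapshots that finished successfully within the last 25 hours. It is not what the check itself runs (that is the curator command above). The function name `recent_successful_snapshots` is made up for illustration, and the sketch assumes each snapshot entry carries the `snapshot`, `state`, and `end_time_in_millis` fields.

```python
import json
from datetime import datetime, timedelta, timezone


def recent_successful_snapshots(response_text, max_age_hours=25, now=None):
    """Return names of SUCCESS snapshots that ended within max_age_hours.

    response_text is the raw JSON body from GET /_snapshot/REPONAME/_all.
    The helper name and the 25-hour default are illustrative assumptions.
    """
    now = now or datetime.now(timezone.utc)
    # Elasticsearch reports snapshot times as epoch milliseconds.
    cutoff_ms = (now - timedelta(hours=max_age_hours)).timestamp() * 1000
    snapshots = json.loads(response_text).get("snapshots", [])
    return [
        snap["snapshot"]
        for snap in snapshots
        if snap.get("state") == "SUCCESS"
        and snap.get("end_time_in_millis", 0) >= cutoff_ms
    ]
```

An empty list from this helper means the repository has no recent successful snapshot, which is the situation the "Last NLS Backup" service reports as UNKNOWN.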

Re: Nagios XI alert for Log Server backup failure

Posted: Wed Jun 28, 2017 8:25 pm
by james.liew
Output as below:
[root@hs3-log-01 ~]# curator --loglevel warn show snapshots --repository SharedBackupRepo --newer-than 25 --time-unit hours
2017-06-29 11:20:16,516 ERROR No snapshots found in Elasticsearch.
No snapshots found in Elasticsearch.
[root@hs3-log-01 ~]#
[root@hs3-log-01 ~]# curl -v -XGET "http://localhost:9200/_snapshot/SharedB ... all?pretty"
* About to connect() to localhost port 9200 (#0)
* Trying ::1...
* Connection refused
* Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 9200 (#0)
> GET /_snapshot/SharedBackupRepo/_all?pretty HTTP/1.1
> User-Agent: curl/7.29.0
> Host: localhost:9200
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Type: application/json; charset=UTF-8
< Content-Length: 24
<
{
"snapshots" : [ ]
}

Re: Nagios XI alert for Log Server backup failure

Posted: Thu Jun 29, 2017 11:15 am
by cdienger
Well, that's interesting. It seems the backups are in fact failing despite the backup script returning a success message. It wasn't explicitly stated, but I assume you don't see any snapshots listed under Backup & Maintenance? It looks like we'll have to dig into this a bit, and it may be best to do so through our ticketing system. If you'd like to send an email to [email protected] I'd be glad to take the case. If you do open a ticket, please provide the logs in /var/log/elasticsearch/*, /var/log/logstash/*, and /var/log/httpd/*, plus the profiles gathered under Administration > System > System Status. Gather all of this from every NLS server.

Re: Nagios XI alert for Log Server backup failure

Posted: Thu Jun 29, 2017 7:30 pm
by james.liew
Will do.

Thanks!

Re: Nagios XI alert for Log Server backup failure

Posted: Fri Jun 30, 2017 9:22 am
by tmcdonald
Got the ticket, going to close this up and we will continue there.