Daily Snapshot Missing/Backups Job Failing

This support forum board is for support questions relating to Nagios Log Server, our solution for managing and monitoring critical log data.
lazzarinof
Posts: 50
Joined: Thu Sep 23, 2021 12:26 pm

Daily Snapshot Missing/Backups Job Failing

Post by lazzarinof »

Good morning Nagios Team,

I noticed yesterday there was no daily snapshot in Log Server. I decided to verify/restart services, and check again today. Today, again, there was no daily snapshot.

I see in the Command Subsystem that the job "Backups" has failed. It says to check the audit log for JOBS. When I check for jobs in the audit log, I see only the message "Created 'create_backup' jobs for all nodes" (no error/fail message).

In digging, I found I can watch the Log Server job log live using:

Code:

tail -f /usr/local/nagioslogserver/var/jobs.log

However, that shows a success, per the attached picture (although I don't trust that, as it also shows it processing zero jobs).

It looks like someone had a similar issue back in 2016: https://support.nagios.com/forum/viewto ... 38&t=40570
When running that curl command (modified to point to our repository), I'm getting the same output of HTTP/1.1 400 Bad Request. However, it would be problematic (at best!) to follow the advice in that forum entry and delete the repository. I am currently backing up our snapshots outside of our cluster, just in case, but would appreciate any guidance on this!

Thank you,
-Frank
lazzarinof
Posts: 50
Joined: Thu Sep 23, 2021 12:26 pm

Re: Daily Snapshot Missing/Backups Job Failing

Post by lazzarinof »

As an additional note: we also checked the space on the drives: we're not even to 80%, so they should be good on space for quite a while. Finally, we confirmed the nagios account has RWX access to the repository.
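
For anyone repeating these two checks, here is a minimal sketch. The function names `usage_pct` and `has_rwx` are made up for illustration, and the checks run as whichever user invokes them, so they need to be run as the nagios account to be meaningful.

```shell
# Print the used-space percentage (digits only) of the filesystem holding a path.
usage_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed only if the current user can read, write, and enter the directory.
has_rwx() {
  [ -d "$1" ] && [ -r "$1" ] && [ -w "$1" ] && [ -x "$1" ]
}

# Example (commented out so sourcing this file has no side effects):
#   usage_pct /store/backups/nagioslogserver
#   sudo -u nagios test -w /store/backups/nagioslogserver && echo writable
```

This only confirms space and directory permissions on the local node; it says nothing about why the backup job itself fails.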

Thank you,
-Frank
lazzarinof
Posts: 50
Joined: Thu Sep 23, 2021 12:26 pm

Re: Daily Snapshot Missing/Backups Job Failing

Post by lazzarinof »

I apologize for the constant additions, but I know more info is always (usually/sometimes) better:

It appears that, in /store/backups/nagioslogserver, I can see the last good backups. But I can also now see dozens of folders, each with a .json file for creating backups. So Nagios can clearly write to the directory, but something is going awry between that initial command, and the actual backup being created.
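
A read-only sketch for listing those leftover folders is below. It assumes the layout described above (a folder holding nothing but .json files is a backup that never completed); `json_only_dirs` is a hypothetical helper, not part of Log Server, and it deletes nothing.

```shell
# Print each immediate subdirectory of BASE that contains at least one
# *.json file and no other entries.
json_only_dirs() {
  local base=$1 d
  for d in "$base"/*/; do
    [ -d "$d" ] || continue
    if [ -n "$(find "$d" -mindepth 1 -maxdepth 1 -name '*.json' | head -n 1)" ] &&
       [ -z "$(find "$d" -mindepth 1 -maxdepth 1 ! -name '*.json' | head -n 1)" ]; then
      printf '%s\n' "${d%/}"
    fi
  done
}

# Example (commented out):
#   json_only_dirs /store/backups/nagioslogserver
```

Review the list by hand before removing anything, since the assumption about what a failed backup leaves behind comes only from this thread.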

Thank you,
-Frank
lazzarinof
Posts: 50
Joined: Thu Sep 23, 2021 12:26 pm

Re: Daily Snapshot Missing/Backups Job Failing

Post by lazzarinof »

Good morning,

We found we had a snapshot stuck...snapshotting, I guess?
After cancelling the job, it took a couple hours to actually stop, so we let the system run over the weekend.
This morning, we still don't have any new snapshots, and the backup is still failing (despite still appearing to succeed when we watch it run, although I'll admit that, other than "Success", I'm not sure what specifics I'm looking for).
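
A sketch for spotting a snapshot that is still running, assuming the standard Elasticsearch snapshot endpoints are reachable on localhost:9200. The repository name `myrepo` and the helper names are placeholders; check the calls against the Elasticsearch version bundled with your Log Server.

```shell
ES_URL=${ES_URL:-http://localhost:9200}

# URL that lists every snapshot in a repository (repository name is a placeholder).
list_snapshots_url() {
  printf '%s/_snapshot/%s/_all?pretty' "$ES_URL" "$1"
}

# Read that listing on stdin and print how many snapshots are still IN_PROGRESS.
count_in_progress() {
  { grep -o '"state" *: *"IN_PROGRESS"' || true; } | wc -l | tr -d ' '
}

# Example (commented out so nothing touches the cluster when sourced):
#   curl -s "$(list_snapshots_url myrepo)" | count_in_progress
#   # Cancelling a running snapshot is a DELETE on that snapshot:
#   curl -s -X DELETE "$ES_URL/_snapshot/myrepo/<stuck_snapshot_name>"
```

Counting with grep is a shortcut that relies on the pretty-printed output; a JSON parser would be more robust.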

While we wait for an update on your end: is there a command to take a snapshot now? That could speed up troubleshooting on our end, instead of waiting for the scheduled snapshot.

Thank you,
-Frank
pbroste
Posts: 1288
Joined: Tue Jun 01, 2021 1:27 pm

Re: Daily Snapshot Missing/Backups Job Failing

Post by pbroste »

Hello @lazzarinof

Thanks for reaching out and following up with the details on the failed snapshots.

Yes, you can take a snapshot on demand through the Elasticsearch API. It would look like this:

To get information about your 'snapshot' repositories:

Code:

curl -X GET "localhost:9200/_snapshot/_all?pretty=true" --verbose

Sample output:

{
  "nameofyourrepository" : {
    "type" : "fs",
    "settings" : {
      "compress" : "true",
      "location" : "/your/store/logs/for/repositories/"
    }
  }
}

To run a snapshot job:

Code:

curl -X PUT "localhost:9200/_snapshot/<nameofyourrepository>/<nameofyoursnapshot>?wait_for_completion=true&pretty" --verbose

Sample output:

{
  "snapshot" : {
    "snapshot" : "snapshot_test",
    "version_id" : 1070699,
    "version" : "1.7.6",
    "indices" : [ "kibana-int", "logstash-2021.10.09", "logstash-2021.10.06", "logstash-2021.10.10", "logstash-2021.10.05", "logstash-2021.10.07", "nagioslogserver", "logstash-2021.09.30", "logstash-2021.10.11", "logstash-2021.10.02", "logstash-2021.10.04", "logstash-2021.10.01", "nagioslogserver_log", "logstash-2021.10.08", "logstash-2021.10.03", "nagioslogserver_history" ],
    "state" : "SUCCESS",
    "start_time" : "2021-10-11T16:19:23.820Z",
    "start_time_in_millis" : 1633969163820,
    "end_time" : "2021-10-11T16:19:28.218Z",
    "end_time_in_millis" : 1633969168218,
    "duration_in_millis" : 4398,
    "failures" : [ ],
    "shards" : {
      "total" : 72,
      "failed" : 0,
      "successful" : 72
    }
  }
}

Let us know the results,
Perry
lazzarinof
Posts: 50
Joined: Thu Sep 23, 2021 12:26 pm

Re: Daily Snapshot Missing/Backups Job Failing

Post by lazzarinof »

Running now...a couple hours in and it seems to be stuck at the attached screen. I'll let it run through the night (I'm not sure how long it normally takes, as it runs at night).
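
Because the call uses wait_for_completion=true, curl does not return until the snapshot finishes, so a quiet terminal for a long time is expected on a large cluster. A sketch of the alternative is below: start the snapshot without that parameter, then poll its state separately. The helper names and the `myrepo` / `manual_test` names are placeholders, and the state parsing assumes pretty-printed output like the sample above.

```shell
ES_URL=${ES_URL:-http://localhost:9200}

# URL for one snapshot in one repository (both names are placeholders).
snapshot_url() {
  printf '%s/_snapshot/%s/%s' "$ES_URL" "$1" "$2"
}

# Read a snapshot JSON document on stdin and print the first "state" value.
snapshot_state() {
  grep -o '"state" *: *"[A-Z_]*"' | head -n 1 | sed 's/.*"\([A-Z_]*\)"$/\1/'
}

# Example (commented out so sourcing this file has no side effects):
#   curl -s -X PUT "$(snapshot_url myrepo manual_test)?pretty"    # returns at once
#   curl -s -X GET "$(snapshot_url myrepo manual_test)?pretty" | snapshot_state
```

Polling this way shows whether the snapshot is progressing or has ended, without tying up the terminal that started it.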

I shall return tomorrow, hopefully with better news!

Thank you,
-Frank
lazzarinof
Posts: 50
Joined: Thu Sep 23, 2021 12:26 pm

Re: Daily Snapshot Missing/Backups Job Failing

Post by lazzarinof »

Good morning,

Okay: it looks like the snapshot portion is back up and running. However, I'm getting the same error when I run the "Backups" job ("FAILED: Check audit log for JOBS"). When I check the audit log, I'm still seeing "Error creating LS backup. Check permissions of backup directory /store/backups/nagioslogserver and disk space." The nagios account has rwx on the repository, and there's plenty of space, so it appears to just be a general error message.

Thank you,
-Frank
pbroste
Posts: 1288
Joined: Tue Jun 01, 2021 1:27 pm

Re: Daily Snapshot Missing/Backups Job Failing

Post by pbroste »

Hello @lazzarinof

Could you run the backup script with tracing enabled and let me know what the output looks like?

Code:

bash -x /usr/local/nagioslogserver/scripts/create_backup.sh

Thanks,
Perry
lazzarinof
Posts: 50
Joined: Thu Sep 23, 2021 12:26 pm

Re: Daily Snapshot Missing/Backups Job Failing

Post by lazzarinof »

Good afternoon Perry,

It looks like we're closing in on the culprit!
I've attached just part of the output, as it looped 61 times then exited.

It appears that Log Server sees that the maximum number of backup slots are filled and, instead of deleting the oldest, just exits.

I went to /store/backups/nagioslogserver, and can see four backups (and I'm guessing those other files are the .json files from failed backups, that seem to stay until manually removed).


Because the August 6th backup appears to be smaller than I'd expect it to be, I deleted it, and reran the backup script:

Code:

bash -x /usr/local/nagioslogserver/scripts/create_backup.sh
Unfortunately, I saw the same behavior: it cycled through waiting for an available slot until it timed out.

Thank you,
-Frank
pbroste
Posts: 1288
Joined: Tue Jun 01, 2021 1:27 pm

Re: Daily Snapshot Missing/Backups Job Failing

Post by pbroste »

Hello @lazzarinof

Thanks for the details.

It appears to be getting hung up in the export-slot check. The script first reads the number of active exports:

curl -s -XPOST 'http://localhost:9200/_export/state' > state.json
count=$($PYTHON -c "import sys; import json; print(json.loads(sys.stdin.read())['count'])" < state.json)

It then loops for as long as that count is above zero (excerpt only; the rest of the loop is not shown):

while [[ $count -gt 0 ]]; do
    echo "Waiting for available slot."
    sleep 5
    timeout=$(($timeout+1))
    if [[ $timeout -gt 60 ]]; then
        echo "Timeout. Could not get any export slots."
        exit 1
    fi

Is there a particular '_export' that it is hanging on? One option is to increase the timeout on the loop: change if [[ $timeout -gt 60 ]]; to if [[ $timeout -gt 300 ]]; and then run the backup script again to check the results. With the 5-second sleep, that raises the wait from roughly 5 minutes to roughly 25 minutes.
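
To experiment with the timeout without editing create_backup.sh, the same wait logic can be sketched as a standalone function. This is an illustration modeled on the excerpt, not the script's actual code; `wait_for_free_slot` and `export_count` are hypothetical names, and re-reading the count on each pass is an assumption about how the full loop behaves.

```shell
# wait_for_free_slot COUNT_CMD [MAX_TRIES] [INTERVAL_SECONDS]
# Runs COUNT_CMD (which must print the number of active exports) until it
# prints 0, sleeping between checks. Gives up once MAX_TRIES is exceeded.
wait_for_free_slot() {
  local count_cmd=$1 max_tries=${2:-60} interval=${3:-5} tries=0 count
  count=$($count_cmd)
  while [ "${count:-0}" -gt 0 ]; do
    echo "Waiting for available slot ($count export(s) active)." >&2
    sleep "$interval"
    tries=$((tries + 1))
    if [ "$tries" -gt "$max_tries" ]; then
      echo "Timeout. Could not get any export slots." >&2
      return 1
    fi
    count=$($count_cmd)
  done
  return 0
}

# Example (commented out; the endpoint and parsing are taken from the excerpt):
#   export_count() {
#     curl -s -XPOST 'http://localhost:9200/_export/state' |
#       python -c "import sys, json; print(json.load(sys.stdin)['count'])"
#   }
#   wait_for_free_slot export_count 300 5
```

Printing the active count on each pass also makes it easier to tell whether an export is draining slowly or is stuck at the same number.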

Let us know what export it is hanging on too.

Thanks,
Perry