Re: Jobs refreshing slower on one cluster?

Posted: Mon Oct 26, 2015 6:25 pm
by Jklre
jolson wrote:
If you guys have any other ideas in the meantime let me know thanks.
Until the memory is added, I can't think of anything off of the top of my head. Thanks for the detailed information, as always.

The fact that your jobs system is backing up on the CA server is concerning to me. I understand that your CA server has more alerts than the US server - do the alerts also look back over large periods of time? It's possible that the system is backing up because long lookups cost a decent amount of server time.

The CA server actually has fewer alerts (948 vs 2000 in the US), and they are a bit less complicated: only jboss alerts vs jboss and various other rules. Part of me is tempted to just blow up this cluster and make a new one and reload all the alerts into it. Or make another temporary cluster in tandem and compare. I can't see anything different but we will see what the additional memory will do.

Re: Jobs refreshing slower on one cluster?

Posted: Tue Oct 27, 2015 1:22 pm
by jolson
Part of me is tempted to just blow up this cluster and make a new one and reload all the alerts into it.
I'm tempted to agree with you since the slowness is causing problems. I can give you exact instructions on how to approach this in the most sane way possible if you're interested.
I can't see anything different but we will see what the additional memory will do.
Sounds good! Looking forward to your results. ;)

Re: Jobs refreshing slower on one cluster?

Posted: Tue Oct 27, 2015 3:40 pm
by Jklre
jolson wrote:
Part of me is tempted to just blow up this cluster and make a new one and reload all the alerts into it.
I'm tempted to agree with you since the slowness is causing problems. I can give you exact instructions on how to approach this in the most sane way possible if you're interested.
I like sane. If you guys have any magic tricks you can clue me in on, that would be awesome. Currently we bulk load our alerts in using a "load test" and LoadRunner. It takes about a night to load them up. If there is a way to export them and import them into a different instance that would be ideal.

So far I don't see any effect from the additional memory.
25hours.jpg

Re: Jobs refreshing slower on one cluster?

Posted: Wed Oct 28, 2015 10:41 am
by jolson
If there is a way to export them and import them into a different instance that would be ideal.
You're in luck - there's a configuration backup stored in /store/backups/nagioslogserver. This configuration backup contains all of your dashboards, alerts, users, logstash configs, etc. What I recommend is setting up your new cluster side-by-side with the old one and moving one of those backups to the new cluster. After it has been moved, run our restore script:

Code:

cd /usr/local/nagioslogserver/scripts
./restore_backup.sh /store/backups/nagioslogserver/nagioslogserver.2015-10-27.1445979677.tar.gz
This will restore all of your configuration data. After you've verified that everything duplicated properly, you should be good to go!
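
The copy-and-restore steps above could be scripted roughly as follows. This is only a sketch: the `latest_backup` helper and the `new-node` hostname are made up for illustration, the directory and filename pattern are taken from the example path in this post, and "newest" here means most recently modified, which may not match how your backups are rotated.

```shell
# Print the most recently modified configuration backup in a directory.
# Hypothetical helper; the default directory and the filename pattern follow
# the example path shown in this thread. Adjust both for your install.
latest_backup() {
  local dir="${1:-/store/backups/nagioslogserver}"
  # ls -t sorts newest first; head keeps only the newest entry.
  ls -1t "$dir"/nagioslogserver.*.tar.gz 2>/dev/null | head -n 1
}

# Possible usage on the OLD cluster ("new-node" is a placeholder hostname):
#   scp "$(latest_backup)" root@new-node:/store/backups/nagioslogserver/
# Then on the NEW cluster:
#   cd /usr/local/nagioslogserver/scripts
#   ./restore_backup.sh "$(latest_backup)"
```

If the directory contains no matching backups the helper prints nothing, so it is worth checking for an empty result before handing it to the restore script.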

Notes:
If you need to change the IP address of any node in your new cluster _after_ the cluster is live, there is no problem with doing so - the cluster should pick up the changed IP with no issues.

If you need to restore your already-made logs to the new cluster, design an NFS backup repository per this document: https://assets.nagios.com/downloads/nag ... enance.pdf
After you've designed the repository and backed up your current information, hook your new cluster up to the repository and restore the information.

Be sure to enable logstash privileged ports if you're using them: https://assets.nagios.com/downloads/nag ... Server.pdf

Re: Jobs refreshing slower on one cluster?

Posted: Mon Nov 02, 2015 12:49 pm
by Jklre
Also on a side note do you guys have an eta for a hotfix or something for the alert skipping issue we are experiencing? My management has pretty much put this project on hold until this is resolved. Thank you!
I spoke with a developer, and have word that the fix will likely be out within the month.

Just wanted to check up and see if that hot fix is available. Thank you.

Re: Jobs refreshing slower on one cluster?

Posted: Mon Nov 02, 2015 6:12 pm
by jolson
This hot fix is not available yet - I will get you an updated roadmap, but the developers have left for the day. I will update this thread tomorrow with the relevant information. Thanks!

Re: Jobs refreshing slower on one cluster?

Posted: Wed Nov 11, 2015 2:12 pm
by Jklre
jolson wrote:This hot fix is not available yet - I will get you an updated roadmap, but the developers have left for the day. I will update this thread tomorrow with the relevant information. Thanks!
Looks like the hotfix has fixed the skipping issue in both our US environments, but we still have the issue with slowness and skipping in Canada. Canada has not alerted since Monday. Only 2 alerts should have fired, but neither came through. The audit logs show it detecting the alerts and going critical, but the email alerts are not received. The test email button works fine in the settings menu. We are using SMTP.

Code:

Alert 1
{
  "_index": "nagioslogserver_log",
  "_type": "ALERT",
  "_id": "AVDzc4FlCzVc0qk_F3xY",
  "_score": null,
  "_source": {
    "created": 1447193444708,
    "type": "ALERT",
    "message": "Alert Name 159401 CANADA: Work Process Service - Error   alert, Priority 3/2 returned CRITICAL: 1 matching entries found |logs=1;0;0",
    "source": "Nagios Log Server"
  },
  "sort": [
    1447193444708,
    1447193444708
  ]
}

Alert 2
{
  "_index": "nagioslogserver_log",
  "_type": "ALERT",
  "_id": "AVDzOPPFCzVc0qk_F3Ic",
  "_score": null,
  "_source": {
    "created": 1447189607365,
    "type": "ALERT",
    "message": "Alert Name 106929 CANADA: StdAssignmentDeliveryWF Error -   - Priority 3/2 returned CRITICAL: 1 matching entries found |logs=1;0;0",
    "source": "Nagios Log Server"
  },
  "sort": [
    1447189607365,
    1447189607365
  ]
}
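
The `created` values in the two audit entries above are epoch timestamps in milliseconds. When lining them up against mail logs or server clocks, it can help to convert them to a readable UTC time. A minimal sketch, assuming GNU `date` (the `-d @SECONDS` form is not available on BSD/macOS); the function name is just illustrative:

```shell
# Convert an epoch-milliseconds value (like the "created" field) to UTC.
# Assumes GNU date; epoch_ms_to_utc is a hypothetical helper name.
epoch_ms_to_utc() {
  local ms="$1"
  local secs=$(( ms / 1000 ))   # integer division drops the milliseconds
  date -u -d "@${secs}" '+%Y-%m-%d %H:%M:%S UTC'
}

epoch_ms_to_utc 1447193444708   # Alert 1 -> 2015-11-10 22:10:44 UTC
epoch_ms_to_utc 1447189607365   # Alert 2 -> 2015-11-10 21:06:47 UTC
```

Printing in UTC keeps the comparison independent of whatever time zone each cluster node is configured with.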

We rebuilt the cluster prior to applying the hotfix. After the rebuild, the freshness went up dramatically, from around a 400 second wait to over 1000. I'm not sure this really matters as long as it is alerting properly. We used the template for the rebuild, which was the same process that we used for all the other sites.
canadajobtimes.jpg

Re: Jobs refreshing slower on one cluster?

Posted: Wed Nov 11, 2015 6:12 pm
by Jklre
I also found something odd looking through the log files: /var/log/logstash/logstash.log has been ballooning. Today's log is 37000KB as opposed to 4000KB in the US.

Not sure if this would be related though. They are all messages that state ":message=>"failed action with response of 400, dropping action:" followed by a few pages of additional detail per line.
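
One rough way to put a number on this, rather than going by file size alone, is to count the lines containing that message on each cluster. A sketch only: the function name is made up, the search string is copied from the message quoted above, and the default path is the log file mentioned in this post.

```shell
# Count log lines reporting a dropped action after an HTTP 400 response.
# count_dropped_actions is a hypothetical helper; -F treats the pattern as a
# fixed string and -c prints the number of matching lines.
count_dropped_actions() {
  local logfile="${1:-/var/log/logstash/logstash.log}"
  grep -c -F 'failed action with response of 400, dropping action' -- "$logfile"
}

# Possible usage, run on a node from each cluster to compare:
#   count_dropped_actions /var/log/logstash/logstash.log
```

Note that `grep -c` prints 0 and returns a non-zero exit status when nothing matches, which matters if this is called from a script running with `set -e`.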

Re: Jobs refreshing slower on one cluster?

Posted: Thu Nov 12, 2015 1:52 pm
by scottwilkerson
Jklre wrote:I also found something odd looking through the log files: /var/log/logstash/logstash.log has been ballooning. Today's log is 37000KB as opposed to 4000KB in the US.

Not sure if this would be related though. They are all messages that state ":message=>"failed action with response of 400, dropping action:" followed by a few pages of additional detail per line.
Are these related to syslog messages? It appears that there are messages coming into one of your inputs that cannot be processed properly.

Re: Jobs refreshing slower on one cluster?

Posted: Thu Nov 12, 2015 1:53 pm
by Jklre
OK, more weirdness, stay with me on this. When I click on the run now button for the alerts, the "last run time" jumps about 10-15 minutes into the future. If I keep clicking it, it just goes further and further into the future.
runalertnow.jpg
Here is the date from the console and all of my time and date settings. The time from the console seems to match the time in the subsystem, but the last run times on the alerts are all off. I tested this on one of my working clusters and the same thing happens. I was thinking it was just this bad one, but that's not the case.
timezoneanddatesettings.jpg