Jklre wrote: I also found something odd while looking through the log files: /var/log/logstash/logstash.log has been ballooning. Today's log is 37000KB, as opposed to 4000KB on the US clusters.
Not sure if this is related, though. They are all messages that state ":message=>"failed action with response of 400, dropping action:" followed by a few pages of additional detail per line.
Are these related to syslog messages? It appears that there are messages coming into one of your inputs that cannot be processed properly.
This looks like it stopped after a few hours. They are all syslog messages being processed. I saw a drop in the number of messages being received by the cluster during that time.
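To check whether the 400 errors line up with the drop in received messages, the dropped actions can be counted per hour. The following is a minimal sketch, not a supported tool: it assumes each log line carries a field of the form `:timestamp=>"YYYY-MM-DDTHH:..."`, which is not shown in this thread, and the function name `count_400s_per_hour` is made up for illustration. Only the error text itself is taken from the log excerpt quoted above.

```python
import re
from collections import Counter

# Assumption: every log line includes a field like :timestamp=>"2015-06-01T10:15:00..."
# Only the date and hour are captured so that counts are grouped per hour.
TIMESTAMP_RE = re.compile(r':timestamp=>"(\d{4}-\d{2}-\d{2}T\d{2})')

# Error text as quoted from /var/log/logstash/logstash.log in this thread.
ERROR_TEXT = "failed action with response of 400"


def count_400s_per_hour(lines):
    """Return a Counter mapping 'YYYY-MM-DDTHH' to the number of dropped actions."""
    counts = Counter()
    for line in lines:
        if ERROR_TEXT not in line:
            continue
        match = TIMESTAMP_RE.search(line)
        # Lines without a recognizable timestamp are grouped under "unknown".
        key = match.group(1) if match else "unknown"
        counts[key] += 1
    return counts
```

Passing an open file handle for the log (for example `open("/var/log/logstash/logstash.log", errors="replace")`) and printing the sorted items gives an hourly profile that can be compared against the message-rate graph.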
Jklre wrote: I saw a drop in the number of messages being received by the cluster during that time.
Did that have an effect on the freshness? I assumed the problems were related.
No effect on freshness. I decided to revert to the original cluster. We built a new cluster because we were having issues with freshness: it was taking about 400 seconds instead of the 30-60 seconds on our two US clusters.
revert.jpg
As you can see, the red line marks where we reverted to the old cluster. We decided 400 seconds is much better than 1500 seconds. We are going to see whether this cluster experiences the same issue as the new one after the hotfix. It was detecting critical events, but no emails were being sent for the alerts. Any ideas where to troubleshoot this? We are ready to go to production once we solve this last issue with this Canadian cluster. I'm pretty anxious to have this resolved. Thank you.
Here is the original cluster back in place. This behavior is very odd and I'm not sure where to look.
The new cluster was just the same template VM with the settings restored from backup: same specs, same IP address, same syslog stream with the same rules. Very weird.
Jklre wrote:This behavior is very odd and I'm not sure where to look.
Concur
1) While the VM hardware was the same, was the hardware itself the same? Same host? Same disks/SAN?
2) Do you still have the old server? I'd love to see a comparison of the elasticsearch logs and also a comparison of sar output (of course you'll have to choose sar output for a prior day on the old server)
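For the sar comparison requested above, a short script can pull the `Average:` row out of two saved `sar -u` text reports and show the difference per column. This is only a sketch under stated assumptions: it expects the standard text layout where a header row lists percentage columns such as `%iowait` and a final row begins with `Average:`, it assumes a dot as the decimal separator, and the helper names `sar_averages` and `compare_hosts` are hypothetical.

```python
def sar_averages(sar_text):
    """Map each percentage column of `sar -u` text output to its Average value."""
    header = None
    for line in sar_text.splitlines():
        tokens = line.split()
        if "%iowait" in tokens:
            # Header row: keep only the percentage column names.
            header = [t for t in tokens if t.startswith("%")]
        elif tokens and tokens[0] == "Average:" and header:
            # The percentage values are the trailing tokens of the Average row.
            values = tokens[-len(header):]
            return dict(zip(header, (float(v) for v in values)))
    return {}


def compare_hosts(old_text, new_text):
    """Return new-minus-old for every column present in both reports."""
    old, new = sar_averages(old_text), sar_averages(new_text)
    return {col: round(new[col] - old[col], 2) for col in old if col in new}
```

A large positive difference in `%iowait` on one host would point toward storage rather than CPU, which is the kind of gap the hardware questions above are trying to uncover.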
The backend hardware for the VMs is different, but if anything it should be faster: faster CPUs and better-performing storage. We ran some benchmarks with dd, and they showed better performance on this VM than on the one in the US.
We still have the old servers; they are just powered off. Let me know what logs you need and I can pull that data. Thank you.
I'm back from a company trip, and I'd like to catch up on this thread. I've read through the history, and I would like to clarify a few points. Let me know if I'm off-base anywhere.
1. Your US cluster is up and fully functioning with the hotfix applied.
2. Your Canada cluster is experiencing strange alerting issues as well as increased freshness.
3. You stood up a completely fresh replacement Canada cluster and it had the same issues with the hotfix applied.
4. You re-implemented the old cluster without the hotfix.
I would like you to use a cluster with the hotfix applied and navigate to the 'Alerts' screen. Create a new alert. Please be sure that the 'next check date' looks proper and is not 1969 - I noticed this issue on a test cluster of mine with the hotfix just before I left. I want to make sure you don't have the same problems (new alerts have their next runtime stuck in 1969).
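A date in 1969 is what a Unix timestamp of 0 (or close to it) looks like when rendered in a timezone west of UTC, so a 'next check date' in 1969 usually suggests the field was never set to a real time. The thread does not state how the value is stored, so the following is only an illustrative sketch; `render_local` and `looks_unset` are hypothetical helper names.

```python
from datetime import datetime, timedelta, timezone


def render_local(epoch_seconds, utc_offset_hours):
    """Render a Unix timestamp in a fixed-offset timezone."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(epoch_seconds, tz)


def looks_unset(epoch_seconds, slack_seconds=86400):
    """True when a timestamp sits within a day of the Unix epoch.

    Assumption: a real alert schedule never falls this close to 1970-01-01.
    """
    return abs(epoch_seconds) <= slack_seconds


# Timestamp 0 rendered at UTC-6 lands on the evening of 31 December 1969.
print(render_local(0, -6).isoformat())
```

In UTC or any timezone east of it, the same zero value would display as 1 January 1970 instead, so either date is worth flagging.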
Let me know - we can do another remote if you need one.
Jesse
TwitsBlog Show me a man who lives alone and has a perpetually clean kitchen, and 8 times out of 9 I'll show you a man with detestable spiritual qualities.
1. Your US cluster is up and fully functioning with the hotfix applied.
Correct. There have been no confirmed skipped alerts on either of the US clusters since the hotfix was applied.
2. Your Canada cluster is experiencing strange alerting issues as well as increased freshness.
3. You stood up a completely fresh replacement Canada cluster and it had the same issues with the hotfix applied.
Yes. Freshness was around 300-400 seconds, so we decided to build out the replacement cluster and restore the settings from backup. That seemed to make things worse, as the freshness went from 300-400 seconds to 1500+ seconds.
4. You re-implemented the old cluster without the hotfix.
We re-implemented the old cluster with the hotfix applied.
I would like you to use a cluster with the hotfix applied and navigate to the 'Alerts' screen. Create a new alert. Please be sure that the 'next check date' looks proper and is not 1969 - I noticed this issue on a test cluster of mine with the hotfix just before I left. I want to make sure you don't have the same problems (new alerts have their next runtime stuck in 1969).
Looks like we are experiencing the issue of new alerts being stuck in the past as well.
test alert.jpg
I'm letting the old Canadian cluster run with the hotfix to gather more alerts, so we can validate whether it is still skipping. Since the issue is so intermittent and the volume of alerts for Canada is so low, it may be a few days before we have enough data to see a problem.
Let me know what you find out. When we fixed the alert subsystem with the hotfix, a new bug cropped up that set the alert runtime in the past. I noticed this and it has been resolved for our next release.
If you'd like to perform another remote, I'd be happy to take a look at your Canada cluster. I think that this remote should happen after you have collected data and verified whether or not alerts are still skipping. When you're ready, feel free to email customersupport@nagios.com with a reference to this thread and I'll pick it up. Thanks!
Definitely. I'm going to collect a few more days' worth of data.
I have no idea what happened, but the freshness on the Canadian cluster seems to have stabilized on its own since Monday. To my knowledge, no changes have been made to the syslog stream or the system, and I'm going to do some digging to see whether other teams have made any changes. I'd still be curious for you to take a look at these systems and see if there is anything you can find. Hopefully this is resolved, but I'm pretty sure it will recur. I'll update this thread with more info as I get it, and I can set something up via email with you guys later. Looking forward to the next release. Any rough estimates on the timeframe for that? Thank you.
freshness.jpg