Jobs refreshing slower on one cluster?

This support forum board is for support questions relating to Nagios Log Server, our solution for managing and monitoring critical log data.
Jklre
Posts: 163
Joined: Wed May 28, 2014 1:56 pm

Re: Jobs refreshing slower on one cluster?

Post by Jklre »

scottwilkerson wrote:
Jklre wrote:I also found something odd looking through the log files: /var/log/logstash/logstash.log has been ballooning. Today's log is 37000KB, as opposed to 4000KB on the US cluster.

Not sure if this would be related, though. They are all messages that state :message=>"failed action with response of 400, dropping action:" followed by several more pages of detail per line.
Are these related to syslog messages? It appears that there are messages coming into one of your inputs that cannot be processed properly.

This looks like it stopped after a few hours. They are all syslog messages being processed. I saw a drop in the number of messages being received by the cluster during that time.
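
For anyone wanting to check whether the 400 drops line up with the dip in received messages, here is a rough Python sketch that counts the dropped actions per hour. It assumes each log line carries a Logstash 1.x style `:timestamp=>"..."` field; that pattern and the function name `count_dropped_actions` are illustrative assumptions, so adjust them if your log lines look different.

```python
import re
from collections import Counter

# Assumption: lines look like {:timestamp=>"2015-06-10T14:03:22...", :message=>"...", ...}
# Only the date and hour are captured so counts are grouped per hour.
TIMESTAMP_RE = re.compile(r':timestamp=>"(\d{4}-\d{2}-\d{2}T\d{2})')

# The message text quoted in this thread.
DROP_MARKER = "failed action with response of 400"


def count_dropped_actions(lines):
    """Return a Counter mapping 'YYYY-MM-DDTHH' to the number of dropped actions."""
    counts = Counter()
    for line in lines:
        if DROP_MARKER not in line:
            continue
        match = TIMESTAMP_RE.search(line)
        # Lines without a recognizable timestamp are grouped under "unknown".
        hour = match.group(1) if match else "unknown"
        counts[hour] += 1
    return counts


def summarize(path):
    """Print per-hour drop counts for a log file such as /var/log/logstash/logstash.log."""
    with open(path, errors="replace") as handle:
        for hour, count in sorted(count_dropped_actions(handle).items()):
            print(hour, count)
```

Calling `summarize("/var/log/logstash/logstash.log")` on each cluster would give two per-hour tables to compare against the freshness graph.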
jdalrymple
Skynet Drone
Posts: 2620
Joined: Wed Feb 11, 2015 1:56 pm

Re: Jobs refreshing slower on one cluster?

Post by jdalrymple »

Jklre wrote:I saw a drop in the number of messages being received by the cluster during that time.
Did that have an effect on the freshness? I assumed the problems were related.
Jklre
Posts: 163
Joined: Wed May 28, 2014 1:56 pm

Re: Jobs refreshing slower on one cluster?

Post by Jklre »

jdalrymple wrote:
Jklre wrote:I saw a drop in the number of messages being received by the cluster during that time.
Did that have an effect on the freshness? I assumed the problems were related.

No effect on freshness. I decided to revert to the original cluster. We built a new cluster because we were having issues with freshness: it was taking about 400 seconds instead of the 30 - 60 seconds we see on our two US clusters.
revert.jpg
As you can see, the red line marks where we reverted to the old cluster. We decided 400 seconds is much better than 1500 seconds. We are going to see if this cluster is experiencing the same issue as the new one after the hotfix. It was detecting critical events, but no emails were being sent for the alerts. Any ideas where to start troubleshooting this? We are ready to go to production once we solve this last issue with this Canadian cluster. I'm pretty anxious to have this resolved. Thank you.
Jklre
Posts: 163
Joined: Wed May 28, 2014 1:56 pm

Re: Jobs refreshing slower on one cluster?

Post by Jklre »

oldclusterbackonline.jpg
Here is the original cluster back in place. This behavior is very odd and I'm not sure where to look.

The new cluster was just the same template VM with the settings restored from backup: same specs, same IP address, same syslog stream with the same rules. Very weird.
jdalrymple
Skynet Drone
Posts: 2620
Joined: Wed Feb 11, 2015 1:56 pm

Re: Jobs refreshing slower on one cluster?

Post by jdalrymple »

Jklre wrote:This behavior is very odd and I'm not sure where to look.
Concur

1) While the VM hardware was the same, was the hardware itself the same? Same host? Same disks/SAN?
2) Do you still have the old server? I'd love to see a comparison of the elasticsearch logs and also a comparison of sar output (of course you'll have to choose sar output for a prior day on the old server)
Jklre
Posts: 163
Joined: Wed May 28, 2014 1:56 pm

Re: Jobs refreshing slower on one cluster?

Post by Jklre »

jdalrymple wrote:
Jklre wrote:This behavior is very odd and I'm not sure where to look.
Concur

1) While the VM hardware was the same, was the hardware itself the same? Same host? Same disks/SAN?
2) Do you still have the old server? I'd love to see a comparison of the elasticsearch logs and also a comparison of sar output (of course you'll have to choose sar output for a prior day on the old server)

The hardware on the backend for the VMs is different, but if anything it should be faster: faster CPUs and better-performing storage. We ran some benchmarks with dd, and they showed better performance on this VM than on the one in the US.

We still have the old servers; they are just in a powered-off state. Let me know what logs you guys need and I can pull that data. Thank you.
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Jobs refreshing slower on one cluster?

Post by jolson »

Jklre,

I'm back from a company trip, and I'd like to catch up on this thread. I've read through the history, and I would like to clarify a few points. Let me know if I'm off-base anywhere.

1. Your US cluster is up and fully functioning with the hotfix applied.

2. Your Canada cluster is experiencing strange alerting issues as well as increased freshness.

3. You stood up a completely fresh replacement Canada cluster and it had the same issues with the hotfix applied.

4. You re-implemented the old cluster without the hotfix.


I would like you to use a cluster with the hotfix applied and navigate to the 'Alerts' screen. Create a new alert. Please be sure that the 'next check date' looks proper and is not 1969 - I noticed this issue on a test cluster of mine with the hotfix just before I left. I want to make sure you don't have the same problems (new alerts have their next runtime stuck in 1969).
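
As a side note on why a bad runtime would show up as 1969: a date in late 1969 is what a zero or unset Unix timestamp looks like when rendered in a timezone behind UTC. Whether the hotfix actually stores a zero here is an assumption, not something confirmed in this thread; the Python sketch below only illustrates the rendering, and `render_timestamp` is a made-up helper name.

```python
from datetime import datetime, timedelta, timezone

# The Unix epoch: 1970-01-01 00:00:00 UTC.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def render_timestamp(epoch_seconds, utc_offset_hours):
    """Format a Unix timestamp as local time for a fixed UTC offset."""
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    moment = EPOCH + timedelta(seconds=epoch_seconds)
    return moment.astimezone(local_zone).strftime("%Y-%m-%d %H:%M:%S")


# A zero timestamp viewed from UTC-6 lands on the last evening of 1969.
print(render_timestamp(0, -6))
# The same zero timestamp viewed from UTC is the start of 1970.
print(render_timestamp(0, 0))
```

So if a new alert shows a next check date in 1969, the stored value is probably zero or empty rather than a real schedule time.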

Let me know - we can do another remote if you need one.

Jesse
Twits Blog
Show me a man who lives alone and has a perpetually clean kitchen, and 8 times out of 9 I'll show you a man with detestable spiritual qualities.
Jklre
Posts: 163
Joined: Wed May 28, 2014 1:56 pm

Re: Jobs refreshing slower on one cluster?

Post by Jklre »

1. Your US cluster is up and fully functioning with the hotfix applied.
Correct. There have been no confirmed skipped alerts on either of the US clusters since the hotfix was applied.

2. Your Canada cluster is experiencing strange alerting issues as well as increased freshness.

3. You stood up a completely fresh replacement Canada cluster and it had the same issues with the hotfix applied.

Yes. Freshness was around 300 - 400 seconds, so we decided to build out the replacement cluster and restore the settings from backup. That seemed to make things worse, as the freshness went from 300 - 400 to 1500+ seconds.

4. You re-implemented the old cluster without the hotfix.
We re-implemented the old cluster with the hotfix applied.
I would like you to use a cluster with the hotfix applied and navigate to the 'Alerts' screen. Create a new alert. Please be sure that the 'next check date' looks proper and is not 1969 - I noticed this issue on a test cluster of mine with the hotfix just before I left. I want to make sure you don't have the same problems (new alerts have their next runtime stuck in 1969).
Looks like we are experiencing the issue of new alerts being stuck in the past as well.
test alert.jpg
I'm letting the old Canadian cluster with the hotfix run to gather more alerts, so we can validate whether it's still skipping or not. Since the issue is so intermittent and the volume of alerts is so low for Canada, it may be a few days before we have enough data to see a problem.
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Jobs refreshing slower on one cluster?

Post by jolson »

I'm letting the old Canadian cluster with the hotfix run to gather more alerts, so we can validate whether it's still skipping or not. Since the issue is so intermittent and the volume of alerts is so low for Canada, it may be a few days before we have enough data to see a problem.
Let me know what you find out. When we fixed the alert subsystem with the hotfix, a new bug cropped up that set the alert runtime in the past. I noticed this and it has been resolved for our next release.

If you'd like to perform another remote, I'd be happy to take a look at your Canada cluster. I think that this remote should happen after you have collected data and verified whether or not alerts are still skipping. When you're ready, feel free to email customersupport@nagios.com with a reference to this thread and I'll pick it up. Thanks!
Jklre
Posts: 163
Joined: Wed May 28, 2014 1:56 pm

Re: Jobs refreshing slower on one cluster?

Post by Jklre »

jolson wrote:
I'm letting the old Canadian cluster with the hotfix run to gather more alerts, so we can validate whether it's still skipping or not. Since the issue is so intermittent and the volume of alerts is so low for Canada, it may be a few days before we have enough data to see a problem.
Let me know what you find out. When we fixed the alert subsystem with the hotfix, a new bug cropped up that set the alert runtime in the past. I noticed this and it has been resolved for our next release.

If you'd like to perform another remote, I'd be happy to take a look at your Canada cluster. I think that this remote should happen after you have collected data and verified whether or not alerts are still skipping. When you're ready, feel free to email customersupport@nagios.com with a reference to this thread and I'll pick it up. Thanks!
Definitely. I'm going to collect a few more days' worth of data.

I have no idea what happened, but the freshness on the Canadian cluster seems to have stabilized on its own since Monday. To my knowledge, no changes have been made to the syslog stream or the system, and I'm going to do some digging to see if there have been any changes from other teams. I'd still be curious for you to take a look at these systems and see if there is anything you can find. Hopefully this is resolved, but I'm pretty sure it will recur. I'll update this thread with more info as I get it, and I can set something up via email with you guys later. Looking forward to the next release. Any rough estimates on the timeframe for that? Thank you.

freshness.jpg