Jobs refreshing slower on one cluster?

Jklre
Posts: 163
Joined: Wed May 28, 2014 1:56 pm

Jobs refreshing slower on one cluster?

Post by Jklre »

I've been seeing jobs refresh more slowly on one cluster. The settings are exactly the same as far as I can tell. I tried resetting the subsystem, checking the timezone, and also bouncing the system.

US: [attachment: freshnessUS.jpg]

Canada: [attachment: freshnessCAN.jpg]
I do notice a difference in the number of messages being processed by each node.

Canada (the slow one) has fewer rules but around 94,491 messages in the past 12 hours, vs. the US, which is only getting 37,981 messages in the past 12 hours. The US does have 2,000 or so rules where Canada only has 900. The weird thing is that the UI for both shows the jobs refreshing every minute.

The freshness data is based on a check you guys helped me put together, which seems to be working great. I did change it a little bit to be called by the check_mk agent we are using. Do you guys think that the amount of traffic it's receiving is affecting how long it takes these alerts to run? It's quite a difference, a factor of 10 or so. Just looking for ideas.

Thank you guys.

Code:

#!/bin/bash

# Grab the "created" timestamp of the most recent alert run from the
# Log Server Elasticsearch index. The cut chain pulls the epoch value
# out of the raw JSON response (brittle, but it matches this response
# layout); the final cut trims the value to 10 digits (whole seconds).
latestalerttime=$(curl -s -XGET 'localhost:9200/nagioslogserver_log/_search?q=type:alert' -d '{
  "query": {
    "match_all": {}
  },
  "size": 1,
  "sort": [
    {
      "created": {
        "order": "desc"
      }
    }
  ]
}' | cut -d":" -f17 | cut -d"," -f1 | cut -c 1-10)

currenttime=$(date +%s)

#echo "$latestalerttime"
#echo "$currenttime"

# Diff current time vs. the last alert runtime; flag CRITICAL if the
# alert subsystem has not run in the last 300 seconds (5 minutes).
diff=$(($currenttime - $latestalerttime))
if [ "$diff" -gt 300 ]; then
        echo "2" " " "NagiosLogServerJobs" "Freshness=$diff" "All Jobs are Not Happy Freshness=$diff"
else
        echo "0" " " "NagiosLogServerJobs" "Freshness=$diff" "All Jobs are Happy Freshness=$diff"
fi
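
(For anyone wiring a script like this into check_mk: a minimal sketch, assuming the standard Linux agent local-check directory and a hypothetical filename; verify the path against your agent version.)

Code:

# Install the freshness script as a check_mk local check (default Linux
# agent path; the script filename here is hypothetical).
install -m 0755 check_nls_freshness.sh /usr/lib/check_mk_agent/local/

# The agent runs everything in that directory on each poll and parses
# local-check output of the form: <status> <item> <perfdata> <text>,
# which is what the echo lines in the script above emit.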
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Jobs refreshing slower on one cluster?

Post by jolson »

This is very interesting information, and I'm wondering if we can figure out what's causing the discrepancy by using correlation.

The US server seems just fine, but the CA server needs a little bit of tweaking. I'm wondering if the high log volume has anything to do with the alert subsystem not being as responsive as it should be.

Are the clusters similar in terms of total resources - CPU/memory/disk speed/latency? 800 seconds is concerning, as it means that alerts could be hanging for quite some time (resulting in 'missed' alerts).
Do you guys think that the amount of traffic it's receiving is affecting how long it takes these alerts to run?
Possibly. I would recommend taking one of two steps.

A) Increase the amount of memory in your CA cluster.

B) Increase the interval at which the subsystem fires (from 1m to 2m).
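
(A minimal sketch of option B, assuming the subsystem is driven by a root cron entry; the actual file and command on a given install may differ.)

Code:

# Hypothetical cron entry for the alert subsystem; paths and commands
# vary by install. Change the minute field from every minute:
#   * * * * *   root  <subsystem command>
# to every two minutes:
#   */2 * * * * root  <subsystem command>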

How have your alerts been? Are you still missing one or so per day, or has it been less noticeable?
Jklre
Posts: 163
Joined: Wed May 28, 2014 1:56 pm

Re: Jobs refreshing slower on one cluster?

Post by Jklre »

jolson wrote:This is very interesting information, and I'm wondering if we can figure out what's causing the discrepancy by using correlation.

The US server seems just fine, but the CA server needs a little bit of tweaking. I'm wondering if the high log volume has anything to do with the alert subsystem not being as responsive as it should be.

Are the clusters similar in terms of total resources - CPU/memory/disk speed/latency? 800 seconds is concerning, as it means that alerts could be hanging for quite some time (resulting in 'missed' alerts).
Do you guys think that the amount of traffic it's receiving is affecting how long it takes these alerts to run?
Possibly. I would recommend taking one of two steps.

A) Increase the amount of memory in your CA cluster.

B) Increase the interval at which the subsystem fires (from 1m to 2m).

How have your alerts been? Are you still missing one or so per day, or has it been less noticeable?
The missing alerts for production have been reduced to about one every 3 days. CA is still missing one to three a day. I'm thinking it's related to this issue.

The systems should be identical as far as specs go. I actually found that the majority of logs we were getting were from two network devices that didn't need to be logging. We disabled them, and the amount of traffic has decreased dramatically. So far it does not seem to affect the job freshness, but it's only been a few hours.

I can see about adding more memory. Would adding 2 GB seem fair? Right now it has 6 GB. Also, I will change the subsystem to run every 2 minutes.
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Jobs refreshing slower on one cluster?

Post by jolson »

Sure, any amount of memory would be a good amount - 2 GB seems like a good starting point. After adding the memory, you will need to restart the elasticsearch and logstash services. Let me know about the status of those graphs - thanks for the good information. Hopefully this will further decrease the number of missed alerts until the system is reworked.
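
(A minimal sketch of those restarts, assuming the sysvinit service names; on systemd-based systems use systemctl instead.)

Code:

# Restart both services after the memory change so the new resources
# take effect (systemd equivalent: systemctl restart elasticsearch logstash).
service elasticsearch restart
service logstash restart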
Jklre
Posts: 163
Joined: Wed May 28, 2014 1:56 pm

Re: Jobs refreshing slower on one cluster?

Post by Jklre »

jolson wrote:Sure, any amount of memory would be a good amount - 2 GB seems like a good starting point. After adding the memory, you will need to restart the elasticsearch and logstash services. Let me know about the status of those graphs - thanks for the good information. Hopefully this will further decrease the number of missed alerts until the system is reworked.

I haven't increased the memory on this system yet, but with the reduced traffic we definitely noticed a drop in the freshness time. It's still averaging around the 400-second mark, though.
[attachment: freshnessCANdrop.jpg]
This cluster is now only getting about 4,000 messages in the past 12 hours vs. 34,000 in the US. So it seems to have helped, but it still seems like something is up.

Plus, looking at the job queue, it seems really high as well.

CAN
[root@pcanls01lxv ~]# ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
16
[root@pcanls01lxv ~]# ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
16
[root@pcanls01lxv ~]# ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
16
[root@pcanls01lxv ~]# ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
16

vs

USA
[root@pnls01lxv ~]# ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
4
[root@pnls01lxv ~]# ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
4
[root@pnls01lxv ~]# ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
4
[root@pnls01lxv ~]# ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
4
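
(For reference, pgrep can take the same count in one step, without the grep -v grep filter - a minimal equivalent, assuming a procps pgrep that supports -f and -c.)

Code:

# Count processes whose full command line matches "jobs" or "poller";
# pgrep excludes itself, so no grep -v grep is needed.
pgrep -fc "jobs|poller"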

I was just checking the stats of the US system, and it actually has less memory than the CA cluster: the US has 4 GB per node vs. CA's 6 GB.
jdalrymple
Skynet Drone
Posts: 2620
Joined: Wed Feb 11, 2015 1:56 pm

Re: Jobs refreshing slower on one cluster?

Post by jdalrymple »

Right now it's a bit confusing what's going on. It's possible that the difference is related to the types of messages, and thus the applied filters, or it could simply be a matter of resources. I think it would be best to put the troubleshooting process on hold until after we see the results of adding memory, so that we can tell which aspect of troubleshooting correlates most strongly with the desired outcome.

Make sense?
Jklre
Posts: 163
Joined: Wed May 28, 2014 1:56 pm

Re: Jobs refreshing slower on one cluster?

Post by Jklre »

jdalrymple wrote:Right now it's a bit confusing what's going on. It's possible that the difference is related to the types of messages, and thus the applied filters, or it could simply be a matter of resources. I think it would be best to put the troubleshooting process on hold until after we see the results of adding memory, so that we can tell which aspect of troubleshooting correlates most strongly with the desired outcome.

Make sense?

Makes sense. I'm leaning more toward the message type. Looking at the chart today, it seems it has dropped to a more acceptable level just with the reduced traffic from muting those network devices (they shouldn't have been sending this cluster syslog messages anyway). Now this cluster should only be receiving similar types of messages (syslog, JBoss). I could add additional memory to the system, but that would bring it up to 8 GB vs. the US, which only has 4 GB. I'm going to let this run as-is over the weekend first and see if it continues to drop, then I will add the memory on Monday. That way we can be sure of what is causing the drop in refresh time: messages vs. memory.

The job queue still seems high:
[root@pcanls01lxv ~]# ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
16
[root@pcanls01lxv ~]# ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
16
[root@pcanls01lxv ~]# ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
18
[root@pcanls01lxv ~]# ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
18

But the overall freshness has gotten better:
[attachment: chartupdate.jpg]
Also, on a side note, do you guys have an ETA for a hotfix or something for the alert-skipping issue we are experiencing? My management has pretty much put this project on hold until this is resolved. Thank you!
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Jobs refreshing slower on one cluster?

Post by jolson »

Makes sense. I'm leaning more toward the message type. Looking at the chart today, it seems it has dropped to a more acceptable level just with the reduced traffic from muting those network devices (they shouldn't have been sending this cluster syslog messages anyway). Now this cluster should only be receiving similar types of messages (syslog, JBoss). I could add additional memory to the system, but that would bring it up to 8 GB vs. the US, which only has 4 GB. I'm going to let this run as-is over the weekend first and see if it continues to drop, then I will add the memory on Monday. That way we can be sure of what is causing the drop in refresh time: messages vs. memory.
Sounds good - looking forward to your results!
Also, on a side note, do you guys have an ETA for a hotfix or something for the alert-skipping issue we are experiencing? My management has pretty much put this project on hold until this is resolved. Thank you!
I spoke with a developer and have word that the fix will likely be out within the month.
Jklre
Posts: 163
Joined: Wed May 28, 2014 1:56 pm

Re: Jobs refreshing slower on one cluster?

Post by Jklre »

Here's what we have over the weekend. Quite the drop from just muting those messages, but still above what we would expect. Canada is getting around 6,000 messages in a 12-hour period vs. the USA, which is getting 35,000+. I'm going to go ahead and increase the memory on this box (bringing it up from 6 GB to 8 GB) and see if it makes any difference. The USA is currently running with 4 GB. We did some benchmarks on the underlying virtual hardware and storage behind these VMs, and we actually got slightly better results from Canada than the US, which is the opposite of what we were expecting to see.

Jobs in Canada seem to be backing up more:

Code:

[root@pcanls01lxv ~]# ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
24
[root@pcanls01lxv ~]# ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
24
[root@pcanls01lxv ~]# ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
24
[root@pcanls01lxv ~]# ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
24
vs USA

Code:

[root@pnls01lxv ~]# ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
4
[root@pnls01lxv ~]# ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
4
[root@pnls01lxv ~]# ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
4
[root@pnls01lxv ~]# ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
4
[attachment: drop.jpg]

The average is still around 150 seconds:

[attachment: drop1.jpg]

vs. the US, which is around 40:

[attachment: usa.jpg]
If you guys have any other ideas in the meantime, let me know. Thanks.
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Jobs refreshing slower on one cluster?

Post by jolson »

If you guys have any other ideas in the meantime, let me know. Thanks.
Until the memory is added, I can't think of anything off the top of my head. Thanks for the detailed information, as always.

The fact that your jobs system is backing up on the CA server is concerning to me. I understand that your CA server has more alerts than the US server - do the alerts also look back over large periods of time? It's possible that the system is backing up because long lookups cost a decent amount of server time.
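
(To illustrate the lookback cost, a hypothetical range query against the logstash indices Log Server writes to - the real queries are generated from your alert definitions, so the index pattern and field here are assumptions.)

Code:

# An alert window of now-24h scans far more data per run than now-5m;
# "logstash-*" and "@timestamp" are the assumed defaults.
curl -s -XGET 'localhost:9200/logstash-*/_search' -d '{
  "size": 0,
  "query": { "range": { "@timestamp": { "gte": "now-24h" } } }
}'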