Jobs refreshing slower on one cluster?

Jklre
Posts: 163
Joined: Wed May 28, 2014 1:56 pm

Jobs refreshing slower on one cluster?

Post by Jklre »

I've been seeing jobs refresh more slowly on one cluster. The settings are exactly the same as far as I can tell. I tried resetting the subsystem, checking the timezone, and also bouncing the system.

US: [attachment: freshnessUS.jpg]

Canada: [attachment: freshnessCAN.jpg]
I do notice a difference in the number of messages being processed by each node.

Canada (the slow one) has fewer rules but around 94,491 messages in the past 12 hours, vs. the US, which is only getting 37,981 messages in the past 12 hours. The US does have 2,000 or so rules where Canada only has 900. The weird thing is that the UI for both shows the jobs refreshing every minute.

The freshness data is based on a check you guys helped me put together, which seems to be working great. I did change it a little bit to be called by the check_mk agent we are using. Do you guys think that the amount of traffic it's receiving is affecting how long it takes these alerts to run? It's quite a difference, a factor of 10 or so. Just looking for ideas.

Thank you guys.

Code:

#!/bin/bash

# Grab the "created" timestamp of the most recent alert run from the
# Log Server Elasticsearch index. The cut chain pulls the epoch value
# out of the raw JSON response (brittle, but it matches this response
# layout); the final cut trims the value to 10 digits (whole seconds).
latestalerttime=$(curl -s -XGET 'localhost:9200/nagioslogserver_log/_search?q=type:alert' -d '{
  "query": {
    "match_all": {}
  },
  "size": 1,
  "sort": [
    {
      "created": {
        "order": "desc"
      }
    }
  ]
}' | cut -d":" -f17 | cut -d"," -f1 | cut -c 1-10)

currenttime=$(date +%s)

#echo "$latestalerttime"
#echo "$currenttime"

# Diff current time vs. the last alert runtime; flag CRITICAL if the
# alert subsystem has not run in the last 300 seconds (5 minutes).
diff=$(($currenttime - $latestalerttime))
if [ "$diff" -gt 300 ]; then
        echo "2" " " "NagiosLogServerJobs" "Freshness=$diff" "All Jobs are Not Happy Freshness=$diff"
else
        echo "0" " " "NagiosLogServerJobs" "Freshness=$diff" "All Jobs are Happy Freshness=$diff"
fi
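
(For anyone wiring a script like this into check_mk: a minimal sketch, assuming the standard Linux agent local-check directory and a hypothetical filename; verify the path against your agent version.)

Code:

# Install the freshness script as a check_mk local check (default Linux
# agent path; the script filename here is hypothetical).
install -m 0755 check_nls_freshness.sh /usr/lib/check_mk_agent/local/

# The agent runs everything in that directory on each poll and parses
# local-check output of the form: <status> <item> <perfdata> <text>,
# which is what the echo lines in the script above emit.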
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Jobs refreshing slower on one cluster?

Post by jolson »

This is very interesting information, and I'm wondering if we can figure out what's causing the discrepancy by using correlation.

The US server seems just fine, but the CA server needs a little bit of tweaking. I'm wondering if the high log volume has anything to do with the alert subsystem not being as responsive as it should be.

Are the clusters similar in terms of total resources - CPU/memory/disk speed/latency? 800 seconds is concerning, as it means that alerts could be hanging for quite some time (resulting in 'missed' alerts).
Do you guys think that the amount of traffic it's receiving is affecting how long it takes these alerts to run?
Possibly. I would recommend taking one of two steps.

A) Increase the amount of memory in your CA cluster.

B) Increase the interval at which the subsystem fires (from 1m to 2m).
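
(A minimal sketch of option B, assuming the subsystem is driven by a root cron entry; the actual file and command on a given install may differ.)

Code:

# Hypothetical cron entry for the alert subsystem; paths and commands
# vary by install. Change the minute field from every minute:
#   * * * * *   root  <subsystem command>
# to every two minutes:
#   */2 * * * * root  <subsystem command>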

How have your alerts been? Are you still missing one or so per day, or has it been less noticeable?
Jklre
Posts: 163
Joined: Wed May 28, 2014 1:56 pm

Re: Jobs refreshing slower on one cluster?

Post by Jklre »

jolson wrote:This is very interesting information, and I'm wondering if we can figure out what's causing the discrepancy by using correlation.

The US server seems just fine, but the CA server needs a little bit of tweaking. I'm wondering if the high log volume has anything to do with the alert subsystem not being as responsive as it should be.

Are the clusters similar in terms of total resources - CPU/memory/disk speed/latency? 800 seconds is concerning, as it means that alerts could be hanging for quite some time (resulting in 'missed' alerts).
Do you guys think that the amount of traffic it's receiving is affecting how long it takes these alerts to run?
Possibly. I would recommend taking one of two steps.

A) Increase the amount of memory in your CA cluster.

B) Increase the interval at which the subsystem fires (from 1m to 2m).

How have your alerts been? Are you still missing one or so per day, or has it been less noticeable?
The missing alerts for production have been reduced to about one every 3 days. CA is still missing one to three a day. I'm thinking it's related to this issue.

The systems should be identical as far as specs go. I actually found that the majority of logs we were getting were from two network devices that didn't need to be logging. We disabled them, and the amount of traffic has decreased dramatically. So far it does not seem to affect the job freshness, but it's only been a few hours.

I can see about adding more memory. Would adding 2 GB seem fair? Right now it has 6 GB. Also, I will change the subsystem to run every 2 minutes.
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Jobs refreshing slower on one cluster?

Post by jolson »

Sure, any amount of memory would be a good amount - 2 GB seems like a good starting point. After adding the memory, you will need to restart the elasticsearch and logstash services. Let me know about the status of those graphs - thanks for the good information. Hopefully this will further decrease the number of missed alerts until the system is reworked.
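
(A minimal sketch of those restarts, assuming the sysvinit service names; on systemd-based systems use systemctl instead.)

Code:

# Restart both services after the memory change so the new resources
# take effect (systemd equivalent: systemctl restart elasticsearch logstash).
service elasticsearch restart
service logstash restart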
Jklre
Posts: 163
Joined: Wed May 28, 2014 1:56 pm

Re: Jobs refreshing slower on one cluster?

Post by Jklre »

jolson wrote:Sure, any amount of memory would be a good amount - 2 GB seems like a good starting point. After adding the memory, you will need to restart the elasticsearch and logstash services. Let me know about the status of those graphs - thanks for the good information. Hopefully this will further decrease the number of missed alerts until the system is reworked.

I haven't increased the memory on this system yet, but with the reduced traffic we definitely noticed a drop in the freshness time. It's still averaging around the 400-second mark, though.
[attachment: freshnessCANdrop.jpg]
This cluster is now only getting about 4,000 messages in the past 12 hours vs. 34,000 in the US. So it seems to have helped, but it still seems like something is up.

Plus, looking at the job queue, it seems really high as well.

CAN
[root@pcanls01lxv ~]# ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
16
[root@pcanls01lxv ~]# ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
16
[root@pcanls01lxv ~]# ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
16
[root@pcanls01lxv ~]# ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
16

vs

USA
[root@pnls01lxv ~]# ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
4
[root@pnls01lxv ~]# ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
4
[root@pnls01lxv ~]# ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
4
[root@pnls01lxv ~]# ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
4
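
(For reference, pgrep can take the same count in one step, without the grep -v grep filter - a minimal equivalent, assuming a procps pgrep that supports -f and -c.)

Code:

# Count processes whose full command line matches "jobs" or "poller";
# pgrep excludes itself, so no grep -v grep is needed.
pgrep -fc "jobs|poller"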

I was just checking the stats of the US system, and it actually has less memory than the CA cluster: the US has 4 GB per node vs. CA's 6 GB.
jdalrymple
Skynet Drone
Posts: 2620
Joined: Wed Feb 11, 2015 1:56 pm

Re: Jobs refreshing slower on one cluster?

Post by jdalrymple »

Right now it's a bit confusing what's going on. It's possible that the difference is related to the types of messages, and thus the applied filters, or it could simply be a matter of resources. I think it would be best to put the troubleshooting process on hold until after we see the results of adding memory, so that we can tell which aspect of troubleshooting correlates most strongly with the desired outcome.

Make sense?
Jklre
Posts: 163
Joined: Wed May 28, 2014 1:56 pm

Re: Jobs refreshing slower on one cluster?

Post by Jklre »

jdalrymple wrote:Right now it's a bit confusing what's going on. It's possible that the difference is related to the types of messages, and thus the applied filters, or it could simply be a matter of resources. I think it would be best to put the troubleshooting process on hold until after we see the results of adding memory, so that we can tell which aspect of troubleshooting correlates most strongly with the desired outcome.

Make sense?

Makes sense. I'm leaning more toward the message type. Looking at the chart today, it seems it has dropped to a more acceptable level just with the reduced traffic from muting those network devices (they shouldn't have been sending this cluster syslog messages anyway). Now this cluster should only be receiving similar types of messages (syslog, JBoss). I could add additional memory to the system, but that would bring it up to 8 GB vs. the US, which only has 4 GB. I'm going to let this run as-is over the weekend first and see if it continues to drop, then I will add the memory on Monday. That way we can be sure of what is causing the drop in refresh time: messages vs. memory.

The job queue still seems high:
[root@pcanls01lxv ~]# ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
16
[root@pcanls01lxv ~]# ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
16
[root@pcanls01lxv ~]# ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
18
[root@pcanls01lxv ~]# ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
18

But the overall freshness has gotten better:
[attachment: chartupdate.jpg]
Also, on a side note, do you guys have an ETA for a hotfix or something for the alert-skipping issue we are experiencing? My management has pretty much put this project on hold until this is resolved. Thank you!
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Jobs refreshing slower on one cluster?

Post by jolson »

Makes sense. I'm leaning more toward the message type. Looking at the chart today, it seems it has dropped to a more acceptable level just with the reduced traffic from muting those network devices (they shouldn't have been sending this cluster syslog messages anyway). Now this cluster should only be receiving similar types of messages (syslog, JBoss). I could add additional memory to the system, but that would bring it up to 8 GB vs. the US, which only has 4 GB. I'm going to let this run as-is over the weekend first and see if it continues to drop, then I will add the memory on Monday. That way we can be sure of what is causing the drop in refresh time: messages vs. memory.
Sounds good - looking forward to your results!
Also, on a side note, do you guys have an ETA for a hotfix or something for the alert-skipping issue we are experiencing? My management has pretty much put this project on hold until this is resolved. Thank you!
I spoke with a developer and have word that the fix will likely be out within the month.
Jklre
Posts: 163
Joined: Wed May 28, 2014 1:56 pm

Re: Jobs refreshing slower on one cluster?

Post by Jklre »

Here's what we have over the weekend. Quite the drop from just muting those messages, but still above what we would expect. Canada is getting around 6,000 messages in a 12-hour period vs. the USA, which is getting 35,000+. I'm going to go ahead and increase the memory on this box (bringing it up from 6 GB to 8 GB) and see if it makes any difference. The USA is currently running with 4 GB. We did some benchmarks on the underlying virtual hardware and storage behind these VMs, and we actually got slightly better results from Canada than the US, which is the opposite of what we were expecting to see.

Jobs in Canada seem to be backing up more:

Code:

[root@pcanls01lxv ~]# ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
24
[root@pcanls01lxv ~]# ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
24
[root@pcanls01lxv ~]# ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
24
[root@pcanls01lxv ~]# ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
24
vs USA

Code:

[root@pnls01lxv ~]# ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
4
[root@pnls01lxv ~]# ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
4
[root@pnls01lxv ~]# ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
4
[root@pnls01lxv ~]# ps -ef | egrep "jobs|poller" | grep -v grep | wc -l
4
[attachment: drop.jpg]

The average is still around 150 seconds:

[attachment: drop1.jpg]

vs. the US, which is around 40:

[attachment: usa.jpg]
If you guys have any other ideas in the meantime, let me know. Thanks.
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Jobs refreshing slower on one cluster?

Post by jolson »

If you guys have any other ideas in the meantime, let me know. Thanks.
Until the memory is added, I can't think of anything off the top of my head. Thanks for the detailed information, as always.

The fact that your jobs system is backing up on the CA server is concerning to me. I understand that your CA server has more alerts than the US server - do the alerts also look back over large periods of time? It's possible that the system is backing up because long lookups cost a decent amount of server time.
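
(To illustrate the lookback cost, a hypothetical range query against the logstash indices Log Server writes to - the real queries are generated from your alert definitions, so the index pattern and field here are assumptions.)

Code:

# An alert window of now-24h scans far more data per run than now-5m;
# "logstash-*" and "@timestamp" are the assumed defaults.
curl -s -XGET 'localhost:9200/logstash-*/_search' -d '{
  "size": 0,
  "query": { "range": { "@timestamp": { "gte": "now-24h" } } }
}'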