Trying to figure out why logstash changed to active (exited)

Post by **cdienger** » Mon Nov 25, 2019 1:49 pm

How many CPUs are on the machine? From what I've found researching the garbage collection options, increasing the number of CPUs can speed this process up.

Post by **rferebee** » Thu Dec 05, 2019 11:20 am

So, while looking into this I found some discrepancies between the nodes. I don't know how much they matter:

LSCC2

Code:

root@nagioslscc2:/root> lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                6
On-line CPU(s) list:   0-5
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             6
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
Stepping:              0
CPU MHz:               2199.998
BogoMIPS:              4399.99
Hypervisor vendor:     VMware
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              56320K

LSCC1 and LSCC3

Code:

root@nagioslscc1:/root> lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                6
On-line CPU(s) list:   0-5
Thread(s) per core:    1
Core(s) per socket:    2
Socket(s):             3
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
Stepping:              0
CPU MHz:               2199.998
BogoMIPS:              4399.99
Hypervisor vendor:     VMware
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              56320K

If the available CPU MHz matches on each node, do the socket and core differences matter?

Post by **mbellerue** » Thu Dec 05, 2019 12:52 pm

It shouldn't make a difference, though I would personally recommend giving the VMs the same layout. The only thing the socket and core layout would affect is the number of NUMA nodes, and that's more of a performance thing than anything else.

Edit:
Also, Craig mentioned a couple of things. First he was wondering if this is still happening regularly. And if so, does it also correlate to when Java is doing garbage collection?

The second thing was that, apparently, the VMs originally had more CPUs to work with. Would it make sense to try adding a couple more cores to the VMs to see if that helps?

Post by **rferebee** » Thu Dec 05, 2019 3:01 pm

mbellerue wrote: Also, Craig mentioned a couple of things. First he was wondering if this is still happening regularly. And if so, does it also correlate to when Java is doing garbage collection?

It occurs semi-regularly. I PM'd Craig the log file he requested on November 22nd, so you folks would need to tell me whether or not it correlates with garbage collection. I honestly have no clue.

mbellerue wrote: The second thing was that, apparently, the VMs originally had more CPUs to work with. Would it make sense to try adding a couple more cores to the VMs to see if that helps?

These servers have always had 36 cores each, at least since I moved over to this group over a year ago. Do you think they need more than 36 CPU cores?

Post by **mbellerue** » Thu Dec 05, 2019 3:17 pm

I can't imagine that they would need more than 36 cores. But right now they definitely do not have 36 cores.

LSCC2

Code:

Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             6

LSCC1 and LSCC3

Code:

Thread(s) per core:    1
Core(s) per socket:    2
Socket(s):             3

Looks like each VM has 6 cores to play with (threads per core x cores per socket x sockets: 1 x 1 x 6 on LSCC2, and 1 x 2 x 3 on LSCC1 and LSCC3). The important piece here is that the more cores a VM has, the more threads Java will spin up. If there is a correlation between garbage collection and logstash crashing, then having more cores could help speed up garbage collection, which could shorten the window in which logstash crashes.
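To double-check the arithmetic on each node, the three lscpu fields quoted above can be multiplied together. This is only a minimal sketch: `logical_cpus` is a hypothetical helper name, and it assumes lscpu prints the English field labels shown in this thread.

```shell
# Hypothetical helper: read lscpu-style text on stdin and print
# threads per core x cores per socket x sockets.
# Assumes the English field labels shown in this thread.
logical_cpus() {
    awk -F: '
        /^Thread\(s\) per core/ { t = $2 + 0 }   # "+ 0" drops the padding
        /^Core\(s\) per socket/ { c = $2 + 0 }
        /^Socket\(s\)/          { s = $2 + 0 }
        END { print t * c * s }
    '
}

# Usage on a node (not run here):
#   LC_ALL=C lscpu | logical_cpus
#   nproc    # should normally report the same number of online CPUs
```

If the result differs from what the hypervisor is supposed to be providing, the VM settings are the place to look rather than Log Server itself.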

Post by **rferebee** » Thu Dec 05, 2019 3:43 pm

Oh ok, I was interpreting that output completely differently.

Let me speak with my bosses and figure out if upping the core count is an option for us at this time.

Thank you.

Post by **mbellerue** » Thu Dec 05, 2019 4:33 pm

Okay, excellent. We will keep this open and wait to hear back.

Post by **rferebee** » Fri Dec 13, 2019 1:17 pm

Good morning. We experienced a crash this morning, but it was entirely user-related and not an issue with Log Server.

My question, though: when someone is in the console running a query and they attempt to queue up more than 7 days' worth of logs, we experience extreme system slowness.

Is there a way to remove the 30-day search option? Or, even better, how can we provide more resources to the environment so that a 14-day query doesn't bog it down as much? Do those queries get queued up in memory, or is it the CPU that is taxed when users run a large query like that?

Thank you!

Post by **rferebee** » Fri Dec 13, 2019 1:21 pm

Also, I'm still having the issue (on just one of my nodes) where it won't let me restart the elasticsearch service. I made the proposed changes to the memory config for elasticsearch and logstash, but when I attempted to restart the elasticsearch service, it failed to stop. I had to manually run systemctl stop elasticsearch to ensure it was stopped.

I don't know if that's a memory issue or what, but since each node is identical I doubt that.
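One way to make the manual workaround repeatable is to stop the service, confirm that systemd really reports it as stopped, and only then start it again. The sketch below is not from this thread: `wait_until_stopped` and `safe_restart` are hypothetical helper names, and the 60-second default timeout is an assumption. It relies only on `systemctl stop`, `systemctl start`, and `systemctl is-active`.

```shell
# Hypothetical helper: poll until systemd reports the unit as
# "inactive" or "failed". $1 = unit name, $2 = max seconds (default 60).
wait_until_stopped() {
    unit=$1
    remaining=${2:-60}
    while :; do
        state=$(systemctl is-active "$unit")
        case $state in
            inactive|failed) return 0 ;;
        esac
        [ "$remaining" -le 0 ] && return 1   # timed out, still not stopped
        sleep 1
        remaining=$((remaining - 1))
    done
}

# Hypothetical helper: stop, verify the stop, then start.
safe_restart() {
    unit=$1
    systemctl stop "$unit"
    if wait_until_stopped "$unit" "${2:-60}"; then
        systemctl start "$unit"
    else
        echo "$unit did not stop in time; not starting it again" >&2
        return 1
    fi
}

# Usage as root (not run here):
#   safe_restart elasticsearch 120
```

If the stop times out on one node only, `journalctl -u elasticsearch` on that node is a reasonable next place to look.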

Post by **mbellerue** » Fri Dec 13, 2019 3:22 pm

If you go to Admin -> Snapshots & Maintenance -> Maintenance and Repository Settings, what do you have set for your Maintenance Settings? I'm wondering if you have indexes closing after 7 days. I think that's default. But let's take a look at all of the options.

Other than that, searching is going to rely on 2 things:
1. CPU power
2. How fast you can get the data to the CPU

You've got 6 CPU cores to work with, so let's make sure we're making the most of them. As root, run:

Code:

ulimit -a

Let's see what that outputs. There shouldn't be any real restrictions on what root can do, but let's just check to be sure. Assuming root can spawn several thousand processes, we should be good there.
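For a quicker read than the full `ulimit -a` listing, the two limits that matter most for this check can be compared against a threshold. This is a rough sketch for bash: `limit_at_least` and `report_limits` are hypothetical helper names, and the 4096 threshold is only an assumption standing in for "several thousand".

```shell
# Hypothetical helper: succeed if a ulimit value is "unlimited"
# or numerically at least the given minimum.
limit_at_least() {
    value=$1
    minimum=$2
    [ "$value" = "unlimited" ] && return 0
    [ "$value" -ge "$minimum" ]
}

# Hypothetical helper: check the current shell's soft limits.
# The 4096 default is an assumption, not a documented requirement.
report_limits() {
    minimum=${1:-4096}
    procs=$(ulimit -u)   # max user processes (bash builtin)
    files=$(ulimit -n)   # max open file descriptors
    if limit_at_least "$procs" "$minimum"; then
        echo "max user processes: $procs (ok)"
    else
        echo "max user processes: $procs (below $minimum)"
    fi
    if limit_at_least "$files" "$minimum"; then
        echo "open files: $files (ok)"
    else
        echo "open files: $files (below $minimum)"
    fi
}

# Usage as root in bash (not run here):
#   report_limits
```

Note that this reports the limits of the shell it runs in; a service started by systemd can have different limits from root's login shell.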

The other thing is, if I recall correctly, your log data is actually on network attached storage. Is that on something like a 10 gigabit connection or better? Or was that just for backups and snapshots?
