Re: Nagios user java command using over 200% CPU

Posted: Fri Mar 22, 2019 12:25 pm
by rferebee
Good morning...

I'm having a major issue. One of the nodes in my primary cluster has crashed every day for the last 3 days. I don't know what's going on.

This all started because we had a VMware issue and lost communication with our SAN storage while the system was running. I think that all the indexes might be corrupt on the secondary node, but I have no clue how to get the system stable again.

I have attached an updated system profile. Can you please help me?

Thank you.

Re: Nagios user java command using over 200% CPU

Posted: Fri Mar 22, 2019 2:55 pm
by npolovenko
@rferebee, The indices don't appear to be corrupt. However, the drive is getting full and hitting the low watermark.
Please run the following command in the console and show us the output:
Thank you.

Re: Nagios user java command using over 200% CPU

Posted: Fri Mar 22, 2019 3:01 pm
by rferebee
Please see screenshots from both nodes in the cluster. Thank you.

Re: Nagios user java command using over 200% CPU

Posted: Fri Mar 22, 2019 3:44 pm
by rferebee
I just ran 'top -H' on one of my nodes and I was wondering why there are so many separate nagios java processes running? See attached screenshot.

Is this typical behavior for a Nagios Log Server system?

Re: Nagios user java command using over 200% CPU

Posted: Mon Mar 25, 2019 7:48 am
by scottwilkerson
The -H argument to top shows individual threads.

This is normal because it is a multi-threaded application: you have a different thread for each connection to both Elasticsearch and Logstash, which are both Java applications.
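
To make the thread-versus-process distinction concrete, here is a minimal Python sketch that groups per-thread rows by PID. It assumes input in the format printed by `ps -eLo pid,lwp,comm` (one row per thread); the sample rows and the `threads_per_process` helper are hypothetical and only for illustration, not output from the system in this thread.

```python
from collections import Counter


def threads_per_process(ps_output: str) -> dict:
    """Count thread rows per (pid, command) in `ps -eLo pid,lwp,comm`-style text."""
    counts = Counter()
    for line in ps_output.strip().splitlines():
        fields = line.split(None, 2)
        # Skip the header row and anything that does not look like a thread row.
        if len(fields) < 3 or not fields[0].isdigit():
            continue
        pid, _lwp, comm = fields
        counts[(int(pid), comm.strip())] += 1
    return dict(counts)


# Hypothetical sample: two java processes, each shown as several thread rows.
SAMPLE = """\
  PID   LWP COMMAND
 1201  1201 java
 1201  1210 java
 1201  1211 java
 2305  2305 java
 2305  2310 java
"""

if __name__ == "__main__":
    for (pid, comm), n in sorted(threads_per_process(SAMPLE).items()):
        print(f"{comm} (pid {pid}): {n} threads")
```

Many rows in a thread view that share the same PID are a single process, which is why the list looks much longer than the number of services actually running.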

Re: Nagios user java command using over 200% CPU

Posted: Mon Mar 25, 2019 12:56 pm
by rferebee
Are there any maintenance tasks that you would recommend to ensure we are keeping our Log Server as "junk free" as possible?

For example, are there any log files we can purge or errant files/directories we can remove?

We plan on expanding the drive this afternoon, but we would like to ensure we free up as much space as possible beforehand.

Thank you.

Re: Nagios user java command using over 200% CPU

Posted: Mon Mar 25, 2019 1:35 pm
by scottwilkerson
The only logs to clean up would be in the following directories, but these should already be taken care of by logrotate on the system:

Code:

/var/log/logstash/*
/var/log/elasticsearch/*
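
Before purging anything, it can help to see how much space those directories actually use. The following is a minimal Python sketch, assuming read access to the two directories above; `dir_size_bytes` is a hypothetical helper written for this example and is not part of Nagios Log Server.

```python
import os


def dir_size_bytes(path: str) -> int:
    """Sum the sizes of regular files under path (symlinks are not followed)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                if not os.path.islink(full):
                    total += os.path.getsize(full)
            except OSError:
                # File vanished (e.g. rotated) or is unreadable; skip it.
                pass
    return total


if __name__ == "__main__":
    for d in ("/var/log/logstash", "/var/log/elasticsearch"):
        print(f"{d}: {dir_size_bytes(d) / 1024 ** 2:.1f} MiB")
```

A missing or unreadable directory simply reports 0, so the script is safe to run on either node.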

Re: Nagios user java command using over 200% CPU

Posted: Mon Apr 01, 2019 10:39 am
by rferebee
Good morning, we're trying to make a decision internally and would like your assistance.

Currently, we have a single two-node cluster, and we were considering adding an additional two-node cluster and breaking off our WAN monitoring devices onto that cluster. You're aware of the performance issues we've been facing and the fact that we've been throwing resources at this thing to no avail.

Would it be better to add the two additional nodes we have to our existing cluster or have the two separate 2 node clusters like we were considering?

If we do decide to go with a 4 node cluster, how does the data get spread across the nodes? I have limited knowledge of ELK, but from what I've read it seems like it stripes the data across the nodes. Is the only difference that no more than 1 node can go offline at a time?

Re: Nagios user java command using over 200% CPU

Posted: Mon Apr 01, 2019 2:30 pm
by scottwilkerson
rferebee wrote: Would it be better to add the two additional nodes we have to our existing cluster or have the two separate 2 node clusters like we were considering?
I personally would add them to the existing cluster; the cluster is more efficient the larger it is.

Each index is split into 5 shards, and the default behavior is to have one primary and one replica of each shard (the replica is stored on a different server than the primary). With a larger cluster, this data is spread across all 4 nodes, still keeping just one replica. If one of the instances in your cluster goes down, the cluster automatically relocates the shards to make sure you still have one primary and one replica.
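
As a rough illustration of how 5 shards with one replica each could spread over 2 versus 4 nodes, here is a toy Python model. It is only a sketch: it uses simple round-robin placement and is not Elasticsearch's real allocator, and the node names and the `place_shards` function are hypothetical. The one rule it does keep from the description above is that a replica never sits on the same node as its primary.

```python
from collections import defaultdict


def place_shards(nodes, num_shards=5, num_replicas=1):
    """Toy round-robin placement; copies of one shard always land on different nodes."""
    copies = num_replicas + 1
    if copies > len(nodes):
        raise ValueError("need at least num_replicas + 1 nodes")
    layout = defaultdict(list)
    for shard in range(num_shards):
        for copy in range(copies):
            # Offsetting by the copy index keeps primary and replica apart.
            node = nodes[(shard + copy) % len(nodes)]
            role = "primary" if copy == 0 else "replica"
            layout[node].append((shard, role))
    return dict(layout)


if __name__ == "__main__":
    for cluster in (["node1", "node2"], ["node1", "node2", "node3", "node4"]):
        print(f"{len(cluster)} nodes:")
        for node, shards in place_shards(cluster).items():
            print(f"  {node}: {len(shards)} shard copies {shards}")
```

In this model each index has 10 shard copies in total, so a 2 node cluster carries 5 copies per node while a 4 node cluster carries 2 or 3 per node.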

Re: Nagios user java command using over 200% CPU

Posted: Mon Apr 01, 2019 3:52 pm
by rferebee
Are you aware of or can you provide any technical documentation that describes the performance benefits we might see by expanding our existing cluster? I'm having a heck of a time finding anything online myself which says something to the effect of, "3+ nodes provide a more stable and efficient cluster for Nagios Log Server/ELK".

The issue being, I've spent a considerable amount of time trying to get this new cluster online and now we may switch directions. I'd love to have some concrete data as to why it's the best course of action.

Full disclosure, I'm all for one cluster as I'd rather manage one instead of two.