yum -y install tcpdump
tcpdump -s 0 -i any port 3515 or port 3333 or port 5544 -w output.pcap
Let both captures run for about 30 seconds, press CTRL+C to stop each tcpdump, and compare the sizes of the resulting output.pcap files. Note that the ports used in the command above assume most data is arriving on ports 3515, 3333, and 5544. You may need to adjust the command based on which ports are opened by the listeners found under the Configure section of the NLS web UI.
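As a rough way to compare the captures side by side, the per-node sizes can be pulled over SSH. This is just a sketch: the hostnames are the ones from this thread, and it assumes each capture was written to /root/output.pcap.

```shell
# Hostnames and capture path are assumptions -- substitute your own.
for host in nagioslscc1 nagioslscc2 nagioslscc3; do
  # stat -c %s prints the file size in bytes (GNU coreutils)
  size=$(ssh "$host" stat -c %s /root/output.pcap)
  printf '%-12s %15s bytes\n' "$host" "$size"
done
```

A byte count heavily skewed toward one node is a quick sign the load is not being spread evenly.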
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
LSCC2
[root@nagioslscc2 ~]# tcpdump -s 0 -i any port 3515 or port 3333 or port 5544 -w output.pcap
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
^C133659 packets captured
140684 packets received by filter
6977 packets dropped by kernel
LSCC1
root@nagioslscc1:/root>tcpdump -s 0 -i any port 3515 or port 3333 or port 5544 -w output.pcap
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
^C58378 packets captured
64406 packets received by filter
5869 packets dropped by kernel
LSCC3
root@nagioslscc3:/root>tcpdump -s 0 -i any port 3515 or port 3333 or port 5544 -w output.pcap
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
^C6876 packets captured
7443 packets received by filter
523 packets dropped by kernel
Assuming the packet sizes are roughly equal across all machines, the primary machine is definitely taking the bulk of the data. Just curious, what is the size of the files when you run "ll output.pcap"?
What do the configurations (nxlog, syslog, etc.) look like on the clients that are sending data to NLS? Are they pointing at one of the NLS machines, or are they using a load balancer? Reconfiguring some of the clients that currently point to nagioslscc2 to send to nagioslscc3 instead should take some load off nagioslscc2.
So, in the configuration file we point all our clients at a single hostname (nagioslogserver.domain) that resolves to the IP addresses of all 3 Log Servers in our cluster. We have an F5 load balancer in our environment, but only network devices point to it, and it just routes to that hostname.
I don't know if that's the best way to accomplish the task, but that's how it was set up when I joined the team.
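One quick sanity check from a client is to confirm that the shared hostname actually resolves to all three addresses. A sketch, using the nagioslogserver.domain name quoted above:

```shell
# List every unique address the shared name resolves to.
# (dig +short nagioslogserver.domain shows the same thing with
# answer ordering, which helps spot round-robin rotation.)
getent ahosts nagioslogserver.domain | awk '{print $1}' | sort -u
```

If only one address comes back, or clients cache the first answer they receive, most traffic will pile onto a single node.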
Alright, something is going on with your load balancing, but that part is out of our hands. Let's focus on the out-of-memory issue.
You've worked on Log Server quite a bit in the past. I'd like to get some information on the current running configuration. If you could PM me the latest System Profile for each of the log servers, that would be great. In addition to that, could you also PM me these files from LSCC2,
Do you have a best practice guide for load balancing Log Server? Something I can take to the appropriate groups in my organization to get the issue resolved?
Unfortunately we don't have a best-practices guide for load balancing; it really comes down to what works best for your environment. And your setup sounds reasonable. It just seems like something may not be working correctly if one server is getting the majority of the work all the time.
Your memory settings look good: Elasticsearch is using 50% of your system's memory, and Logstash is using 8GB. Going back to your post about not being able to restart Elasticsearch when it hangs, it makes sense that a new instance can't start up, because the old, hung instance of Elasticsearch is still in memory. If the hung instance is holding 32GB and Logstash is using another 8GB, there isn't 32GB free for a new instance to start with.
If this happens again, you can kill the old Elasticsearch PID, and then you should be able to start Elasticsearch normally. The bigger question is: Why is Elasticsearch hanging?
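A recovery sequence along those lines might look like the following. The service name and the Elasticsearch main class here are assumptions based on a stock install; on the CentOS 6 node, swap systemctl for the service command.

```shell
# 1. Find the hung Elasticsearch JVM (main class of a stock install)
pgrep -f org.elasticsearch.bootstrap.Elasticsearch

# 2. Attempt a clean stop first, then force-kill if it is truly hung
systemctl stop elasticsearch || true
pkill -9 -f org.elasticsearch.bootstrap.Elasticsearch || true

# 3. Start a fresh instance once the old heap has been released
systemctl start elasticsearch
systemctl status elasticsearch
```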
Nothing is jumping out of the log files right now, but I don't know how long it has been since Elasticsearch last hung. Do you happen to recall when that was? Also, how often does it hang? Is it a daily or weekly occurrence, or less common than that? What we really need is a system profile captured right after you recover from a hang; that should get us to the root of the problem.
Be sure to check out our Knowledgebase for helpful articles and solutions!
I can see that Elasticsearch stopped logging at about 3:18AM on the 19th, but I'm not seeing anything in any of the other logs that would point to why Elasticsearch was hung.
One thing I did notice is that LSCC2 is a CentOS 6 VM while the other two are CentOS 7 VMs. I wouldn't expect that alone to crash anything, but best practice is to keep all machines in a cluster as close to identical as possible, and there are a number of discrepancies between the software versions running on LSCC2 and those on the other two servers.
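A quick way to see those discrepancies in one place is to pull the OS release and key package versions from each node. A sketch using the hostnames from this thread; the package names in the rpm query are assumptions, so adjust them to your install:

```shell
for host in nagioslscc1 nagioslscc2 nagioslscc3; do
  echo "== $host =="
  # centos-release pins the OS version; rpm -q shows the stack
  # versions (package names are assumptions -- adjust as needed)
  ssh "$host" 'cat /etc/centos-release; rpm -q elasticsearch logstash'
done
```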
Other than that, I'm seeing a lot of errors stating that Elasticsearch can't parse a SerialNumber field, though I don't know which filter that comes from. In any case, it's nothing that should make Elasticsearch crash.
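If it's useful to correlate those parse errors with the hangs, they can be counted straight out of the Elasticsearch log. The log path below is an assumption; point it at wherever your install actually writes its logs.

```shell
# Path is an assumption -- adjust to your Elasticsearch log location.
ES_LOG=/var/log/elasticsearch/elasticsearch.log

# Count the SerialNumber parse failures, then show the most recent
# few with their surrounding text for context
grep 'SerialNumber' "$ES_LOG" | grep -c 'failed to parse'
grep 'failed to parse' "$ES_LOG" | tail -5
```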