yum -y install tcpdump
tcpdump -s 0 -i any port 3515 or port 3333 or port 5544 -w output.pcap
Let both captures run for about 30 seconds, press CTRL+C to stop each tcpdump, and compare the sizes of the resulting output.pcap files. Note that the ports used in the command above assume most data is arriving on ports 3515, 3333, and 5544. You may need to adjust the command based on which ports are opened by the listeners found under the Configure section of the NLS web UI.
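As a rough way to compare the captures side by side, the per-node sizes can be pulled over SSH. This is just a sketch: the hostnames are the ones from this thread, and it assumes each capture was written to /root/output.pcap.

```shell
# Hostnames and capture path are assumptions -- substitute your own.
for host in nagioslscc1 nagioslscc2 nagioslscc3; do
  # stat -c %s prints the file size in bytes (GNU coreutils)
  size=$(ssh "$host" stat -c %s /root/output.pcap)
  printf '%-12s %15s bytes\n' "$host" "$size"
done
```

A byte count heavily skewed toward one node is a quick sign the load is not being spread evenly.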
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
LSCC2
[root@nagioslscc2 ~]# tcpdump -s 0 -i any port 3515 or port 3333 or port 5544 -w output.pcap
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
^C133659 packets captured
140684 packets received by filter
6977 packets dropped by kernel
LSCC1
root@nagioslscc1:/root>tcpdump -s 0 -i any port 3515 or port 3333 or port 5544 -w output.pcap
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
^C58378 packets captured
64406 packets received by filter
5869 packets dropped by kernel
LSCC3
root@nagioslscc3:/root>tcpdump -s 0 -i any port 3515 or port 3333 or port 5544 -w output.pcap
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
^C6876 packets captured
7443 packets received by filter
523 packets dropped by kernel
Assuming the packet sizes are roughly equal across all machines, the primary machine is definitely taking the bulk of the data. Just curious, what is the size of the files when you run "ll output.pcap"?
What do the configurations (nxlog, syslog, etc.) look like on the clients that are sending data to NLS? Are they pointing at one of the NLS machines, or are they using a load balancer? Reconfiguring some of the clients that currently point to nagioslscc2 to send to nagioslscc3 instead should take some load off nagioslscc2.
So, in the configuration file we point all our clients at a single hostname (nagioslogserver.domain) that resolves to the IP addresses of all 3 Log Servers in our cluster. We have an F5 load balancer in our environment, but only network devices point to it, and it just routes to that hostname.
I don't know if that's the best way to accomplish the task, but that's how it was set up when I joined the team.
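One quick sanity check from a client is to confirm that the shared hostname actually resolves to all three addresses. A sketch, using the nagioslogserver.domain name quoted above:

```shell
# List every unique address the shared name resolves to.
# (dig +short nagioslogserver.domain shows the same thing with
# answer ordering, which helps spot round-robin rotation.)
getent ahosts nagioslogserver.domain | awk '{print $1}' | sort -u
```

If only one address comes back, or clients cache the first answer they receive, most traffic will pile onto a single node.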
Alright, something is going on with your load balancing, but that part is out of our hands. Let's focus on the out-of-memory issue.
You've worked on Log Server quite a bit in the past. I'd like to get some information on the current running configuration. If you could PM me the latest System Profile for each of the log servers, that would be great. In addition to that, could you also PM me these files from LSCC2,
Do you have a best practice guide for load balancing Log Server? Something I can take to the appropriate groups in my organization to get the issue resolved?
Unfortunately we don't have a best-practices guide for load balancing; it really comes down to what works best for your environment. And your setup sounds reasonable. It just seems like something may not be working correctly if one server is getting the majority of the work all the time.
Your memory settings look good: Elasticsearch is using 50% of your system's memory, and Logstash is using 8GB. Going back to your post about not being able to restart Elasticsearch when it hangs, it makes sense that a new instance can't start up, because the old, hung instance of Elasticsearch is still in memory. If the hung instance is holding 32GB and Logstash is using another 8GB, there isn't 32GB free for a new instance to start with.
If this happens again, you can kill the old Elasticsearch PID, and then you should be able to start Elasticsearch normally. The bigger question is: Why is Elasticsearch hanging?
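A recovery sequence along those lines might look like the following. The service name and the Elasticsearch main class here are assumptions based on a stock install; on the CentOS 6 node, swap systemctl for the service command.

```shell
# 1. Find the hung Elasticsearch JVM (main class of a stock install)
pgrep -f org.elasticsearch.bootstrap.Elasticsearch

# 2. Attempt a clean stop first, then force-kill if it is truly hung
systemctl stop elasticsearch || true
pkill -9 -f org.elasticsearch.bootstrap.Elasticsearch || true

# 3. Start a fresh instance once the old heap has been released
systemctl start elasticsearch
systemctl status elasticsearch
```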
Nothing is jumping out of the log files right now, but I don't know how long it has been since Elasticsearch last hung. Do you happen to recall when that was? Also, how often does it hang? Is it a daily or weekly occurrence, or less common than that? What we really need is a system profile captured right after you recover from a hang; that should get us to the root of the problem.
Be sure to check out our Knowledgebase for helpful articles and solutions!
I can see that Elasticsearch stopped logging at about 3:18AM on the 19th, but I'm not seeing anything in any of the other logs that would point to why Elasticsearch was hung.
One thing I did notice is that LSCC2 is a CentOS 6 VM while the other two are CentOS 7 VMs. I wouldn't expect that alone to crash anything, but best practice is to keep all machines in a cluster as close to identical as possible, and there are a number of discrepancies between the software versions running on LSCC2 and those on the other two servers.
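A quick way to see those discrepancies in one place is to pull the OS release and key package versions from each node. A sketch using the hostnames from this thread; the package names in the rpm query are assumptions, so adjust them to your install:

```shell
for host in nagioslscc1 nagioslscc2 nagioslscc3; do
  echo "== $host =="
  # centos-release pins the OS version; rpm -q shows the stack
  # versions (package names are assumptions -- adjust as needed)
  ssh "$host" 'cat /etc/centos-release; rpm -q elasticsearch logstash'
done
```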
Other than that, I'm seeing a lot of errors stating that Elasticsearch can't parse a SerialNumber field, though I don't know which filter that comes from. In any case, it's nothing that should make Elasticsearch crash.
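If it's useful to correlate those parse errors with the hangs, they can be counted straight out of the Elasticsearch log. The log path below is an assumption; point it at wherever your install actually writes its logs.

```shell
# Path is an assumption -- adjust to your Elasticsearch log location.
ES_LOG=/var/log/elasticsearch/elasticsearch.log

# Count the SerialNumber parse failures, then show the most recent
# few with their surrounding text for context
grep 'SerialNumber' "$ES_LOG" | grep -c 'failed to parse'
grep 'failed to parse' "$ES_LOG" | tail -5
```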