Elasticsearch tuning

This support forum board is for support questions relating to Nagios Log Server, our solution for managing and monitoring critical log data.
rferebee
Posts: 733
Joined: Wed Jul 11, 2018 11:37 am

Elasticsearch tuning

Post by rferebee »

Hello,

I was wondering if there is any tuning for Elasticsearch like there is for Logstash?

Specifically I'm referring to this article: https://support.nagios.com/kb/article/n ... g-576.html

Whenever my Log Server environment crashes and I have to restart the services, I always get a Java error when I attempt to restart Elasticsearch, but never for Logstash.

The error implies there isn't enough RAM available to support the Java process.

Thank you.
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: Elasticsearch tuning

Post by cdienger »

You're seeing memory errors when you try to restart? It sounds like Elasticsearch may need a moment to free the memory. The next time you restart it, run "service elasticsearch stop; ps aux | grep elasticsearch" to make sure Elasticsearch has actually stopped before starting it back up.
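That stop-and-verify sequence could be scripted along these lines. This is only a sketch: it assumes the SysV service name is "elasticsearch" (as in the output pasted later in this thread), and the poll is factored into a small function so the live commands can stay separate.

```shell
# Sketch: stop the service, then poll until the Elasticsearch process is gone
# before starting it again, so the JVM has released its memory.
wait_for_stop() {
    # $1... = a command that succeeds while the process is still running
    while "$@"; do
        sleep 2   # give the JVM a moment to exit and release its heap
    done
}

# Intended use on a live Log Server box (commented out here; service name and
# the pgrep pattern are assumptions, adjust to your install):
# service elasticsearch stop
# wait_for_stop pgrep -f org.elasticsearch.bootstrap.Elasticsearch
# service elasticsearch start
```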
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
rferebee
Posts: 733
Joined: Wed Jul 11, 2018 11:37 am

Re: Elasticsearch tuning

Post by rferebee »

Ok, I'll try to stop it first next time.
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: Elasticsearch tuning

Post by cdienger »

Sounds good. Keep us posted.
rferebee
Posts: 733
Joined: Wed Jul 11, 2018 11:37 am

Re: Elasticsearch tuning

Post by rferebee »

My Log Server environment was hung this morning. I attempted to stop Elasticsearch instead of restarting it, but the service failed to stop.

Stopping elasticsearch: [FAILED]

Then when I try to restart it, I get this:

Starting elasticsearch: [ OK ]
[root@nagioslscc2 ~]# OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00007f3dd1b30000, 33324597248, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 33324597248 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /tmp/hs_err_pid13033.log


This is what I've been getting every time on this particular box. It has the same amount of RAM as the other 2 nodes, but seems to run as the "primary" all the time.
rferebee
Posts: 733
Joined: Wed Jul 11, 2018 11:37 am

Re: Elasticsearch tuning

Post by rferebee »

Here's the log the error is referencing.
mbellerue
Posts: 1403
Joined: Fri Jul 12, 2019 11:10 am

Re: Elasticsearch tuning

Post by mbellerue »

Thanks for posting that log file!

It looks like Java is trying to commit more memory than is available on the system:

> Native memory allocation (mmap) failed to map 33324597248 bytes for committing reserved memory.
> ...
> Memory: 4k page, physical 66109536k(3027960k free), swap 262140k(230244k free)

Do the other servers in your environment have more swap space? I wonder if the ~30GB of memory used at this point is Elasticsearch still in memory.
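For scale, here is that log line's arithmetic in plain shell (the numbers are copied from the crash log quoted in this thread; nothing is queried live):

```shell
# Convert the failed mmap request and the log's physical/free figures to GiB.
bytes=33324597248        # size of the failed native memory allocation
echo "heap request:  $((bytes / 1024 / 1024 / 1024)) GiB"
echo "physical RAM:  $((66109536 / 1024 / 1024)) GiB"
echo "free at crash: $((3027960 / 1024 / 1024)) GiB"
```

In other words, the JVM asked for roughly half of physical RAM in a single mapping while only about 2 GiB was free, which is exactly the errno=12 failure in the paste.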
> but seems to run as the "primary" all the time.

Can you expand on this a little? What do you mean by it running as the primary? What are you looking at to determine this?

> Stopping elasticsearch: [FAILED]

When you get this message, can you run,

Code: Select all

journalctl -xe
and send the output to us? It may give us another hint as to why the service is hung.

Can you PM me a profile of the system the next time Elasticsearch is in a hung state?

Be sure to check out our Knowledgebase for helpful articles and solutions!
rferebee
Posts: 733
Joined: Wed Jul 11, 2018 11:37 am

Re: Elasticsearch tuning

Post by rferebee »

When I say "primary", I mean that one server always seems to be working harder than the others in our cluster.

For example, the server we're talking about now has 34GB of memory active, whereas the other two only have 17GB, and its CPU usage seems to run higher than the other servers'. I'm basing all of this on the statistics in VMware vCenter.

Here's the swap information (you can see the swap space is less on the server we're talking about):

NAGIOSLSCC2

                      total       used       free     shared    buffers     cached
Mem:               66109536   65742956     366580         96      30604   29592048
-/+ buffers/cache:           36120304   29989232
Swap:                262140     217192      44948

NAGIOSLSCC1

                      total       used       free     shared  buff/cache  available
Mem:               65789932   35339688     658528      20592    29791716   29719240
Swap:               4190204      26212    4163992

NAGIOSLSCC3

                      total       used       free     shared  buff/cache  available
Mem:               66821096   35768048     355428      20700    30697620   30319020
Swap:               4190204          0    4190204
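Putting those three outputs side by side, the swap usage works out as follows (a quick awk sketch that just recomputes percentages from the numbers pasted above; nothing is queried live):

```shell
# used/total swap figures copied from the free(1) output above
awk 'BEGIN {
    printf "NAGIOSLSCC2: %.0f%% of swap used (of ~256MB)\n", 217192 / 262140 * 100
    printf "NAGIOSLSCC1: %.0f%% of swap used (of ~4GB)\n",    26212 / 4190204 * 100
    printf "NAGIOSLSCC3: %.0f%% of swap used (of ~4GB)\n",        0 / 4190204 * 100
}'
```

So LSCC2 not only has a roughly 16x smaller swap partition than its peers, it has nearly exhausted it, which fits the allocation failure quoted earlier in the thread.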
mbellerue
Posts: 1403
Joined: Fri Jul 12, 2019 11:10 am

Re: Elasticsearch tuning

Post by mbellerue »

Excellent, thank you! Next set of questions. When you configured your devices, were they all configured to point at LSCC2, or another device, or maybe you spread them out across all 3 VMs?

When LSCC2 displays high CPU usage, could you run the top command and get us the output?

And finally, has LSCC2 been rebooted to try and give it a fresh start since the last hang? LSCC2 seems to be running double what the other Log Server VMs are running.
rferebee
Posts: 733
Joined: Wed Jul 11, 2018 11:37 am

Re: Elasticsearch tuning

Post by rferebee »

> When you configured your devices, were they all configured to point at LSCC2, or another device, or maybe you spread them out across all 3 VMs?

Honestly, I have no idea. These were set up years ago and I wouldn't even know how to check.

> When LSCC2 displays high CPU usage, could you run the top command and get us the output?

I've actually brought up this "issue" before; you can look at this thread: https://support.nagios.com/forum/viewto ... 38&t=52386

The screenshot on the first page is basically what the top output looks like all the time on LSCC2.

> And finally, has LSCC2 been rebooted to try and give it a fresh start since the last hang? LSCC2 seems to be running double what the other Log Server VMs are running.

I rebooted all three servers this morning after an SSH session kept disconnecting me on LSCC2 and I couldn't figure out what was going on. It looks like LSCC2 was hung up trying to complete a snapshot from Saturday night. When I finally got back in, it showed the snapshot still in progress.

But, to go back to the issue, LSCC2 is always working harder than the other two servers.