Elasticsearch tuning

This support forum board is for support questions relating to Nagios Log Server, our solution for managing and monitoring critical log data.
rferebee
Posts: 733
Joined: Wed Jul 11, 2018 11:37 am

Elasticsearch tuning

Post by rferebee »

Hello,

I was wondering if there is any tuning for Elasticsearch like there is for Logstash?

Specifically I'm referring to this article: https://support.nagios.com/kb/article/n ... g-576.html

Whenever my Log Server environment crashes and I have to restart the services, I always get a Java error when I attempt to restart Elasticsearch, but never for Logstash.

The error implies there isn't enough RAM available to support the Java process.

Thank you.
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: Elasticsearch tuning

Post by cdienger »

You're seeing memory errors when you try to restart? It sounds like Elasticsearch may need a moment to free the memory. The next time you restart it, run "service elasticsearch stop; ps aux | grep elasticsearch" to make sure Elasticsearch has actually stopped before starting it back up.
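That stop-and-verify sequence could be scripted along these lines. This is only a sketch: it assumes the SysV service name is "elasticsearch" (as in the output pasted later in this thread), and the poll is factored into a small function so the live commands can stay separate.

```shell
# Sketch: stop the service, then poll until the Elasticsearch process is gone
# before starting it again, so the JVM has released its memory.
wait_for_stop() {
    # $1... = a command that succeeds while the process is still running
    while "$@"; do
        sleep 2   # give the JVM a moment to exit and release its heap
    done
}

# Intended use on a live Log Server box (commented out here; service name and
# the pgrep pattern are assumptions, adjust to your install):
# service elasticsearch stop
# wait_for_stop pgrep -f org.elasticsearch.bootstrap.Elasticsearch
# service elasticsearch start
```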
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
rferebee
Posts: 733
Joined: Wed Jul 11, 2018 11:37 am

Re: Elasticsearch tuning

Post by rferebee »

Ok, I'll try to stop it first next time.
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: Elasticsearch tuning

Post by cdienger »

Sounds good. Keep us posted.
rferebee
Posts: 733
Joined: Wed Jul 11, 2018 11:37 am

Re: Elasticsearch tuning

Post by rferebee »

My Log Server environment was hung this morning. I attempted to stop Elasticsearch instead of restarting it, but the service failed to stop.

Stopping elasticsearch: [FAILED]

Then when I try to restart it, I get this:

Starting elasticsearch: [ OK ]
[root@nagioslscc2 ~]# OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00007f3dd1b30000, 33324597248, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 33324597248 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /tmp/hs_err_pid13033.log


This is what I've been getting every time on this particular box. It has the same amount of RAM as the other 2 nodes, but seems to run as the "primary" all the time.
rferebee
Posts: 733
Joined: Wed Jul 11, 2018 11:37 am

Re: Elasticsearch tuning

Post by rferebee »

Here's the log the error is referencing.
mbellerue
Posts: 1403
Joined: Fri Jul 12, 2019 11:10 am

Re: Elasticsearch tuning

Post by mbellerue »

Thanks for posting that log file!

It looks like Java is trying to commit more memory than is available on the system:

> Native memory allocation (mmap) failed to map 33324597248 bytes for committing reserved memory.
> ...
> Memory: 4k page, physical 66109536k(3027960k free), swap 262140k(230244k free)

Do the other servers in your environment have more swap space? I wonder if the ~30GB of memory used at this point is Elasticsearch still in memory.
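For scale, here is that log line's arithmetic in plain shell (the numbers are copied from the crash log quoted in this thread; nothing is queried live):

```shell
# Convert the failed mmap request and the log's physical/free figures to GiB.
bytes=33324597248        # size of the failed native memory allocation
echo "heap request:  $((bytes / 1024 / 1024 / 1024)) GiB"
echo "physical RAM:  $((66109536 / 1024 / 1024)) GiB"
echo "free at crash: $((3027960 / 1024 / 1024)) GiB"
```

In other words, the JVM asked for roughly half of physical RAM in a single mapping while only about 2 GiB was free, which is exactly the errno=12 failure in the paste.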
> but seems to run as the "primary" all the time.

Can you expand on this a little? What do you mean by it running as the primary? What are you looking at to determine this?

> Stopping elasticsearch: [FAILED]

When you get this message, can you run,

Code: Select all

journalctl -xe
and send the output to us? It may give us another hint as to why the service is hung.

Can you PM me a profile of the system the next time Elasticsearch is in a hung state?

Be sure to check out our Knowledgebase for helpful articles and solutions!
rferebee
Posts: 733
Joined: Wed Jul 11, 2018 11:37 am

Re: Elasticsearch tuning

Post by rferebee »

When I say "primary", I mean that one server always seems to be working harder than the others in our cluster.

For example, the server we're talking about now has 34GB of memory active, whereas the other two only have 17GB, and its CPU usage seems to run higher than the other servers'. I'm basing all of this on the statistics in VMware vCenter.

Here's the swap information (you can see the swap space is less on the server we're talking about):

NAGIOSLSCC2

                      total       used       free     shared    buffers     cached
Mem:               66109536   65742956     366580         96      30604   29592048
-/+ buffers/cache:           36120304   29989232
Swap:                262140     217192      44948

NAGIOSLSCC1

                      total       used       free     shared  buff/cache  available
Mem:               65789932   35339688     658528      20592    29791716   29719240
Swap:               4190204      26212    4163992

NAGIOSLSCC3

                      total       used       free     shared  buff/cache  available
Mem:               66821096   35768048     355428      20700    30697620   30319020
Swap:               4190204          0    4190204
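Putting those three outputs side by side, the swap usage works out as follows (a quick awk sketch that just recomputes percentages from the numbers pasted above; nothing is queried live):

```shell
# used/total swap figures copied from the free(1) output above
awk 'BEGIN {
    printf "NAGIOSLSCC2: %.0f%% of swap used (of ~256MB)\n", 217192 / 262140 * 100
    printf "NAGIOSLSCC1: %.0f%% of swap used (of ~4GB)\n",    26212 / 4190204 * 100
    printf "NAGIOSLSCC3: %.0f%% of swap used (of ~4GB)\n",        0 / 4190204 * 100
}'
```

So LSCC2 not only has a roughly 16x smaller swap partition than its peers, it has nearly exhausted it, which fits the allocation failure quoted earlier in the thread.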
mbellerue
Posts: 1403
Joined: Fri Jul 12, 2019 11:10 am

Re: Elasticsearch tuning

Post by mbellerue »

Excellent, thank you! Next set of questions. When you configured your devices, were they all configured to point at LSCC2, or another device, or maybe you spread them out across all 3 VMs?

When LSCC2 displays high CPU usage, could you run the top command and get us the output?

And finally, has LSCC2 been rebooted to try and give it a fresh start since the last hang? LSCC2 seems to be running double what the other Log Server VMs are running.
rferebee
Posts: 733
Joined: Wed Jul 11, 2018 11:37 am

Re: Elasticsearch tuning

Post by rferebee »

> When you configured your devices, were they all configured to point at LSCC2, or another device, or maybe you spread them out across all 3 VMs?

Honestly, I have no idea. These were set up years ago and I wouldn't even know how to check.

> When LSCC2 displays high CPU usage, could you run the top command and get us the output?

I've actually brought up this "issue" before; you can look at this thread: https://support.nagios.com/forum/viewto ... 38&t=52386

The screenshot on the first page is basically what the top output looks like all the time on LSCC2.

> And finally, has LSCC2 been rebooted to try and give it a fresh start since the last hang? LSCC2 seems to be running double what the other Log Server VMs are running.

I rebooted all three servers this morning after an SSH session kept disconnecting me on LSCC2 and I couldn't figure out what was going on. It looks like LSCC2 was hung up trying to complete a snapshot from Saturday night. When I finally got back in, it showed the snapshot still in progress.

But, to go back to the issue, LSCC2 is always working harder than the other two servers.