upgrade 2.1.4 => 2.1.6 performance issues

This support forum board is for support questions relating to Nagios Log Server, our solution for managing and monitoring critical log data.
CBoekhuis
Posts: 234
Joined: Tue Aug 16, 2011 4:55 am

upgrade 2.1.4 => 2.1.6 performance issues

Post by CBoekhuis »

Hello,

Today I upgraded the cluster from 2.1.4 to 2.1.6. While the shards were being assigned/allocated after the upgrade, the cluster generated an enormous load and the /var filesystem filled up in no time due to the Elasticsearch log file (> 1 GB, growing as fast as the disks could write).

So far I've come to the conclusion that if I disable the Nagios XI service checks (NLS queries), the cluster comes back up into a normal, functional state. As soon as I enable two generic queries, each covering roughly 90 hosts, the problem reappears. The common denominator in the Elasticsearch log file is the following:

Code: Select all

[2020-05-14 16:08:05,615][DEBUG][action.search.type       ] [bf328362-78a6-4a15-8f78-1878be74fe40] [logstash-2019.11.11][1], node[r-BKKVRzSP2uHSK70Sagjw], [R], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@62267c5] lastShard [true]
org.elasticsearch.transport.RemoteTransportException: [dd139ec4-41a3-4780-95ef-9a564fb414ef][inet[/192.168.16.24:9300]][indices:data/read/search[phase/query]]
Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution (queue capacity 1000) on org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler@72491ac4
        at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:62)
        at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
        at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:79)
        at org.elasticsearch.transport.netty.MessageChannelHandler.handleRequest(MessageChannelHandler.java:224)
        at org.elasticsearch.transport.netty.MessageChannelHandler.messageReceived(MessageChannelHandler.java:114)
        at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
        at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:296)
        at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462)
        at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443)
        at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
        at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
        at org.elasticsearch.common.netty.OpenChannelsHandler.handleUpstream(OpenChannelsHandler.java:74)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
        at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
        at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
        at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
        at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
        at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
        at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
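The EsRejectedExecutionException above means the search thread pool's queue (capacity 1000 here) overflowed, so incoming search requests were dropped. As a rough sketch, the per-node rejection counters from Elasticsearch's `_cat/thread_pool` output can be scanned like this (the sample data and host names below are made up for illustration):

```shell
# SAMPLE stands in for the real output of:
#   curl -s 'localhost:9200/_cat/thread_pool?v&h=host,search.rejected,search.queue'
# Host names and numbers are hypothetical.
SAMPLE='host search.rejected search.queue
nls1 0 0
nls2 4821 998'

# Print any node whose search.rejected counter is non-zero.
echo "$SAMPLE" | awk 'NR > 1 && $2 > 0 { print $1 " has " $2 " rejected searches" }'
```

A steadily climbing search.rejected value on one node while the queries run would confirm that the queries, not the upgrade itself, are saturating that node's search queue.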
For now I'm leaving the Nagios XI checks disabled, but I've never seen this in an NLS upgrade before. Have there been any changes that might cause this queue-size problem?

Kind Regards,
Hans Blom
User avatar
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: upgrade 2.1.4 => 2.1.6 performance issues

Post by cdienger »

Please PM a profile from the system. It can be gathered under Admin > System > System Status > Download System Profile or from the command line with:

Code: Select all

/usr/local/nagioslogserver/scripts/profile.sh
This will create /tmp/system-profile.tar.gz.

Note that this file can be very large and may be too large to upload through the ticketing system. This is usually due to the logs in the Logstash and/or Elasticsearch directories inside it. If it is too large, please open the profile, extract those directories/files, and send them separately.
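For illustration, splitting an oversized profile could look like the following sketch. A throwaway archive stands in for the real /tmp/system-profile.tar.gz, and the logstash/elasticsearch directory names inside it are assumptions:

```shell
# Build a throwaway archive that mimics the profile layout (assumed names).
WORK=$(mktemp -d)
mkdir -p "$WORK/profile/logstash" "$WORK/profile/elasticsearch"
echo "big log" > "$WORK/profile/logstash/logstash.log"
tar -czf "$WORK/system-profile.tar.gz" -C "$WORK" profile

# Extract it, see which directories are heavy, and re-pack the logs separately.
tar -xzf "$WORK/system-profile.tar.gz" -C "$WORK"
du -sh "$WORK"/profile/*
tar -czf "$WORK/profile-logs.tar.gz" -C "$WORK/profile" logstash elasticsearch
ls -lh "$WORK/profile-logs.tar.gz"
```

The smaller profile-logs.tar.gz (and the remaining profile, minus the logs) can then be sent as separate uploads.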

I'd also like to get a copy of the current settings index. This can be gathered by running:

Code: Select all

curl -XPOST http://localhost:9200/nagioslogserver/_export?path=/tmp/nagioslogserver.tar.gz
This creates /tmp/nagioslogserver.tar.gz, which is the file we'd like to see.
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
CBoekhuis
Posts: 234
Joined: Tue Aug 16, 2011 4:55 am

Re: upgrade 2.1.4 => 2.1.6 performance issues

Post by CBoekhuis »

PM has been sent. ;)
User avatar
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: upgrade 2.1.4 => 2.1.6 performance issues

Post by cdienger »

Everything looks good in the data provided. Can you enable the service checks just long enough to reproduce the message in the logs and then send a copy of the logs?

I'd also like to get a copy of the XI config. This can be done from the command line usually with:

Code: Select all

mysqldump -uroot -pnagiosxi nagiosql > nagiosql.sql
The above will work on most XI instances unless you've offloaded the database to another system, in which case you'd need to run something like:

Code: Select all

mysqldump -h database_host_ip -uusername -ppassword nagiosql > nagiosql.sql
If you're not sure about the credentials or the database server IP, you can review /usr/local/nagiosxi/html/config.inc.php, which contains the database connection information.
CBoekhuis
Posts: 234
Joined: Tue Aug 16, 2011 4:55 am

Re: upgrade 2.1.4 => 2.1.6 performance issues

Post by CBoekhuis »

Hi,

I sent you the files in a PM.
Unfortunately I also ran into another problem. Since the update the maintenance job fails and no snapshots are created; Log Server thinks that a snapshot is already in progress. I ran the maintenance job by hand while tailing jobs.log. The output of the tail is in the attached file, along with a listing of the snapshot indices, which shows that it stops after May 14th.

Here's also a screenshot of the snapshot & maintenance page:
snapshots.PNG
Greetings...Hans
User avatar
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: upgrade 2.1.4 => 2.1.6 performance issues

Post by cdienger »

Please provide a directory listing (ls -alh) of the snapshot directory, as well as the output of:

Code: Select all

curl -XGET 'http://localhost:9200/_snapshot/nls_prd1/_all?pretty'
CBoekhuis
Posts: 234
Joined: Tue Aug 16, 2011 4:55 am

Re: upgrade 2.1.4 => 2.1.6 performance issues

Post by CBoekhuis »

As requested.
User avatar
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: upgrade 2.1.4 => 2.1.6 performance issues

Post by cdienger »

Run:

Code: Select all

ps aux | grep curator
and, assuming you don't see any curator scripts running, delete the stuck job:

Code: Select all

curl -XDELETE 'http://localhost:9200/_snapshot/nls_prd1/curator-20200515112510'
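Before deleting, it can be worth confirming which snapshot is actually stuck, since only a snapshot left in state IN_PROGRESS blocks new ones. A minimal sketch — the sample lines stand in for the `_snapshot/nls_prd1/_all` output reduced to "name state" pairs, and the first entry is made up for illustration:

```shell
# SAMPLE stands in for (abbreviated) output of:
#   curl -s 'http://localhost:9200/_snapshot/nls_prd1/_all?pretty'
SAMPLE='curator-20200514112510 SUCCESS
curator-20200515112510 IN_PROGRESS'

# Print the name of any snapshot still marked IN_PROGRESS.
echo "$SAMPLE" | awk '$2 == "IN_PROGRESS" { print $1 }'
```

Whatever name that prints is the snapshot to target with the DELETE above.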
If there are still any issues, please try running the job from the command line:

Code: Select all

/usr/local/nagioslogserver/scripts/curator.sh snapshot --repository 'nls_prd1' --ignore_unavailable indices --older-than 1 --time-unit days --timestring %Y.%m.%d

and provide the output of:

Code: Select all

ls -alh /nfs_mounts/logserver/prd1/
CBoekhuis
Posts: 234
Joined: Tue Aug 16, 2011 4:55 am

Re: upgrade 2.1.4 => 2.1.6 performance issues

Post by CBoekhuis »

The delete got the snapshots back into a good state, and afterwards I ran a successful new snapshot.
I also re-enabled the Nagios XI services to see if that problem might be solved as well, but it still occurs.
User avatar
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: upgrade 2.1.4 => 2.1.6 performance issues

Post by cdienger »

Run the following:

Code: Select all

curl 'localhost:9200/_cat/thread_pool?v&h=id,host,search.active,search.rejected,search.completed,search.size,search.queue,search.type,search.queueSize,search.min,search.max,search.keepAlive,search.largest'
Run it on the machine once while the check is disabled, again while it is enabled, and a final time after you edit /usr/local/nagioslogserver/elasticsearch/config/elasticsearch.yml on each NLS machine in the cluster and add the following to the bottom:

Code: Select all

threadpool.search.queue_size: 2000
Restart the Elasticsearch service after making the change to elasticsearch.yml:

Code: Select all

service elasticsearch restart
Locked