
upgrade 2.1.4 => 2.1.6 performance issues

Posted: Thu May 14, 2020 9:30 am
by CBoekhuis
Hello,

Today I upgraded the cluster from 2.1.4 to 2.1.6. While the shards were being assigned/allocated after the upgrade, the cluster generated an enormous load and the /var filesystem filled up in no time because of the Elasticsearch log file (> 1 GB, growing as fast as the disks could write).

So far I've come to the conclusion that if I disable the Nagios XI service checks (NLS queries), the cluster can come up into a normal, functional state. As soon as I enable 2 generic queries, each covering roughly 90 hosts, the problem reappears. The common denominator in the Elasticsearch log file is the following:

Code: Select all

[2020-05-14 16:08:05,615][DEBUG][action.search.type       ] [bf328362-78a6-4a15-8f78-1878be74fe40] [logstash-2019.11.11][1], node[r-BKKVRzSP2uHSK70Sagjw], [R], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@62267c5] lastShard [true]
org.elasticsearch.transport.RemoteTransportException: [dd139ec4-41a3-4780-95ef-9a564fb414ef][inet[/192.168.16.24:9300]][indices:data/read/search[phase/query]]
Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution (queue capacity 1000) on org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler@72491ac4
        at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:62)
        at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
        at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:79)
        at org.elasticsearch.transport.netty.MessageChannelHandler.handleRequest(MessageChannelHandler.java:224)
        at org.elasticsearch.transport.netty.MessageChannelHandler.messageReceived(MessageChannelHandler.java:114)
        at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
        at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:296)
        at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462)
        at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443)
        at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
        at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
        at org.elasticsearch.common.netty.OpenChannelsHandler.handleUpstream(OpenChannelsHandler.java:74)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
        at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
        at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
        at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
        at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
        at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
        at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
For now I'm leaving the Nagios XI checks disabled, but I've never seen this in an NLS upgrade before. Have there been any changes that might cause this queue-size problem?

Kind Regards,
Hans Blom

Re: upgrade 2.1.4 => 2.1.6 performance issues

Posted: Thu May 14, 2020 4:41 pm
by cdienger
Please PM a profile from the system. It can be gathered under Admin > System > System Status > Download System Profile or from the command line with:

Code: Select all

/usr/local/nagioslogserver/scripts/profile.sh
This will create /tmp/system-profile.tar.gz.

Note that this file can be very large and may be too large to upload through the ticketing system. That is usually due to the logs in the Logstash and/or Elasticsearch directories inside it. If it is too large, please open the profile, extract those directories/files, and send them separately.
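If it helps, the trimming can be scripted. The snippet below is just a sketch of the unpack/strip/repack steps; it builds a small stand-in tarball so it can be shown end to end (the internal directory names are assumptions — match them to what you actually see inside your profile archive):

```shell
# Demo: build a stand-in profile tarball (in real use, start from the
# /tmp/system-profile.tar.gz produced by profile.sh instead).
WORK=$(mktemp -d)
mkdir -p "$WORK/profile/logstash" "$WORK/profile/elasticsearch" "$WORK/profile/system"
echo "bulky log data" > "$WORK/profile/logstash/logstash.log"
echo "bulky log data" > "$WORK/profile/elasticsearch/elasticsearch.log"
echo "os info" > "$WORK/profile/system/info.txt"
tar -czf "$WORK/system-profile.tar.gz" -C "$WORK" profile

# Unpack, drop the two bulky log directories, and repack.
mkdir "$WORK/unpacked"
tar -xzf "$WORK/system-profile.tar.gz" -C "$WORK/unpacked"
rm -rf "$WORK/unpacked/profile/logstash" "$WORK/unpacked/profile/elasticsearch"
tar -czf "$WORK/system-profile-trimmed.tar.gz" -C "$WORK/unpacked" profile

# List what remains in the trimmed archive.
tar -tzf "$WORK/system-profile-trimmed.tar.gz"
```

You would then attach the trimmed archive to the ticket and send the extracted log directories separately.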

I'd also like to get a copy of the current settings index. This can be gathered by running:

Code: Select all

curl -XPOST http://localhost:9200/nagioslogserver/_export?path=/tmp/nagioslogserver.tar.gz
The file it creates, and the one we'd like to see, is /tmp/nagioslogserver.tar.gz.

Re: upgrade 2.1.4 => 2.1.6 performance issues

Posted: Fri May 15, 2020 10:17 am
by CBoekhuis
PM has been sent. ;)

Re: upgrade 2.1.4 => 2.1.6 performance issues

Posted: Fri May 15, 2020 4:37 pm
by cdienger
Everything looks good in the data provided. Can you enable the service checks just long enough to reproduce the message in the logs and then send a copy of the logs?

I'd also like to get a copy of the XI config. This can usually be done from the command line with:

Code: Select all

mysqldump -uroot -pnagiosxi nagiosql > nagiosql.sql
The above will work on most XI instances unless you've offloaded the database to another system, in which case you'd need to run something like:

Code: Select all

mysqldump -h database_host_ip -uusername -ppassword nagiosql > nagiosql.sql
If you're not sure about the credentials or the database server IP, you can review /usr/local/nagiosxi/html/config.inc.php, which contains the database connection information.
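A quick grep can pull the connection-related lines out of that file. The fragment below is a made-up stand-in for config.inc.php (the key names such as dbserver/user/pwd are assumptions based on a typical XI install; check your own file for the exact layout):

```shell
# Stand-in fragment of config.inc.php (key names are assumptions).
CFG=$(mktemp)
cat > "$CFG" <<'EOF'
$cfg['db_info']['nagiosql'] = array(
    'dbtype' => 'mysql',
    'dbserver' => '192.168.1.50',
    'user' => 'nagiosql',
    'pwd' => 'supersecret',
);
EOF

# Pull out just the connection-related lines.
grep -E "dbserver|user|pwd" "$CFG"
```

Those values are what you would plug into the -h/-u/-p options of the mysqldump command above.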

Re: upgrade 2.1.4 => 2.1.6 performance issues

Posted: Mon May 18, 2020 8:17 am
by CBoekhuis
Hi,

I sent you the files in a PM.
Unfortunately I also ran into another problem. Since the update the maintenance job fails and no snapshots are created. NLS thinks that a snapshot is already in progress. I ran the maintenance job by hand with a tail on jobs.log. The output of the tail is in the attached file, along with a listing of the snapshot indices, which shows that it stops after May 14th.

Here's also a screenshot of the snapshot & maintenance page:
snapshots.PNG
Greetings...Hans

Re: upgrade 2.1.4 => 2.1.6 performance issues

Posted: Mon May 18, 2020 4:58 pm
by cdienger
Please provide a directory listing (ls -alh) of the snapshot directory, as well as the output of:

Code: Select all

curl -XGET 'http://localhost:9200/_snapshot/nls_prd1/_all?pretty'
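If the output is long, a quick way to spot a stuck snapshot is to count snapshots per state. The JSON below is an abbreviated, made-up sample of what the _all response looks like; in practice you would pipe the curl output through the same filter instead:

```shell
# Made-up sample of the _snapshot/.../_all response (abbreviated).
JSON='{"snapshots":[
  {"snapshot":"curator-20200514112510","state":"SUCCESS"},
  {"snapshot":"curator-20200515112510","state":"IN_PROGRESS"}]}'

# Count snapshots per state; anything stuck shows up as IN_PROGRESS.
echo "$JSON" | grep -o '"state":"[A-Z_]*"' | sort | uniq -c
```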

Re: upgrade 2.1.4 => 2.1.6 performance issues

Posted: Tue May 19, 2020 1:39 am
by CBoekhuis
As requested.

Re: upgrade 2.1.4 => 2.1.6 performance issues

Posted: Tue May 19, 2020 1:57 pm
by cdienger
Run:

Code: Select all

ps aux | grep curator
and, assuming you don't see any curator scripts running, delete the stuck job:

Code: Select all

curl -XDELETE 'http://localhost:9200/_snapshot/nls_prd1/curator-20200515112510'
If there are still any issues, please try running the job from the command line:

Code: Select all

/usr/local/nagioslogserver/scripts/curator.sh snapshot --repository 'nls_prd1' --ignore_unavailable indices --older-than 1 --time-unit days --timestring %Y.%m.%d

and provide the output of:

Code: Select all

ls -alh /nfs_mounts/logserver/prd1/

Re: upgrade 2.1.4 => 2.1.6 performance issues

Posted: Wed May 20, 2020 7:34 am
by CBoekhuis
With the delete I got the snapshots into a good state. Afterwards I ran a successful new snapshot.
I also enabled the Nagios XI services to see if that problem might be solved as well, but it is still occurring.

Re: upgrade 2.1.4 => 2.1.6 performance issues

Posted: Wed May 20, 2020 1:02 pm
by cdienger
Run the following:

Code: Select all

curl 'localhost:9200/_cat/thread_pool?v&h=id,host,search.active,search.rejected,search.completed,search.size,search.queue,search.type,search.queueSize,search.min,search.max,search.keepAlive,search.largest'
on the machine while the check is disabled, again while it is enabled, and a final time after you edit /usr/local/nagioslogserver/elasticsearch/config/elasticsearch.yml on each NLS machine in the cluster and add the following to the bottom:

Code: Select all

threadpool.search.queue_size: 2000
Restart the elasticsearch service after making the change to elasticsearch.yml:

Code: Select all

service elasticsearch restart
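After the restart, re-running the _cat/thread_pool command above should show the new queue size. The snippet below only sketches how to read that output; the table is a made-up sample with hypothetical node names and numbers (the real output has the columns requested in the curl command):

```shell
# Made-up sample of the _cat/thread_pool output (abbreviated columns).
CAT_OUT='id   host   search.queue search.rejected search.queueSize
abc1 node-1 0            0               2000
def2 node-2 3            0               2000'

# Print host and queue size per node; search.queueSize should now read 2000.
echo "$CAT_OUT" | awk 'NR > 1 { print $2, $5 }'
```

If search.rejected keeps climbing even with the larger queue, that is worth mentioning in your next reply.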