upgrade 2.1.4 => 2.1.6 performance issues

This support forum board is for support questions relating to Nagios Log Server, our solution for managing and monitoring critical log data.
CBoekhuis
Posts: 234
Joined: Tue Aug 16, 2011 4:55 am

upgrade 2.1.4 => 2.1.6 performance issues

Post by CBoekhuis »

Hello,

Today I upgraded the cluster from 2.1.4 to 2.1.6. While the shards were being assigned/allocated after the upgrade, the cluster generated an enormous load and the /var filesystem filled up in no time due to the Elasticsearch log file (> 1 GB, growing as fast as the disks could write).

So far I've come to the conclusion that if I disable the Nagios XI service checks (NLS queries), the cluster comes back up into a normal, functional state. As soon as I enable two generic queries, each covering roughly 90 hosts, the problem reappears. The common denominator in the Elasticsearch log file is the following:

Code: Select all

[2020-05-14 16:08:05,615][DEBUG][action.search.type       ] [bf328362-78a6-4a15-8f78-1878be74fe40] [logstash-2019.11.11][1], node[r-BKKVRzSP2uHSK70Sagjw], [R], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@62267c5] lastShard [true]
org.elasticsearch.transport.RemoteTransportException: [dd139ec4-41a3-4780-95ef-9a564fb414ef][inet[/192.168.16.24:9300]][indices:data/read/search[phase/query]]
Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution (queue capacity 1000) on org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler@72491ac4
        at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:62)
        at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
        at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:79)
        at org.elasticsearch.transport.netty.MessageChannelHandler.handleRequest(MessageChannelHandler.java:224)
        at org.elasticsearch.transport.netty.MessageChannelHandler.messageReceived(MessageChannelHandler.java:114)
        at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
        at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:296)
        at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462)
        at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443)
        at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
        at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
        at org.elasticsearch.common.netty.OpenChannelsHandler.handleUpstream(OpenChannelsHandler.java:74)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
        at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
        at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
        at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
        at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
        at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
        at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
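The EsRejectedExecutionException above means the search thread pool's queue (capacity 1000 here) overflowed, so incoming search requests were dropped. As a rough sketch, the per-node rejection counters from Elasticsearch's `_cat/thread_pool` output can be scanned like this (the sample data and host names below are made up for illustration):

```shell
# SAMPLE stands in for the real output of:
#   curl -s 'localhost:9200/_cat/thread_pool?v&h=host,search.rejected,search.queue'
# Host names and numbers are hypothetical.
SAMPLE='host search.rejected search.queue
nls1 0 0
nls2 4821 998'

# Print any node whose search.rejected counter is non-zero.
echo "$SAMPLE" | awk 'NR > 1 && $2 > 0 { print $1 " has " $2 " rejected searches" }'
```

A steadily climbing search.rejected value on one node while the queries run would confirm that the queries, not the upgrade itself, are saturating that node's search queue.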
For now I'm leaving the Nagios XI checks disabled, but I've never seen this in an NLS upgrade before. Have there been any changes that might cause this queue-size problem?

Kind Regards,
Hans Blom
User avatar
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: upgrade 2.1.4 => 2.1.6 performance issues

Post by cdienger »

Please PM a profile from the system. It can be gathered under Admin > System > System Status > Download System Profile or from the command line with:

Code: Select all

/usr/local/nagioslogserver/scripts/profile.sh
This will create /tmp/system-profile.tar.gz.

Note that this file can be very large and may be too large to upload through the ticketing system. This is usually due to the logs in the Logstash and/or Elasticsearch directories inside it. If it is too large, please open the profile, extract those directories/files, and send them separately.
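For illustration, splitting an oversized profile could look like the following sketch. A throwaway archive stands in for the real /tmp/system-profile.tar.gz, and the logstash/elasticsearch directory names inside it are assumptions:

```shell
# Build a throwaway archive that mimics the profile layout (assumed names).
WORK=$(mktemp -d)
mkdir -p "$WORK/profile/logstash" "$WORK/profile/elasticsearch"
echo "big log" > "$WORK/profile/logstash/logstash.log"
tar -czf "$WORK/system-profile.tar.gz" -C "$WORK" profile

# Extract it, see which directories are heavy, and re-pack the logs separately.
tar -xzf "$WORK/system-profile.tar.gz" -C "$WORK"
du -sh "$WORK"/profile/*
tar -czf "$WORK/profile-logs.tar.gz" -C "$WORK/profile" logstash elasticsearch
ls -lh "$WORK/profile-logs.tar.gz"
```

The smaller profile-logs.tar.gz (and the remaining profile, minus the logs) can then be sent as separate uploads.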

I'd also like to get a copy of the current settings index. This can be gathered by running:

Code: Select all

curl -XPOST http://localhost:9200/nagioslogserver/_export?path=/tmp/nagioslogserver.tar.gz
This creates /tmp/nagioslogserver.tar.gz, which is the file we'd like to see.
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
CBoekhuis
Posts: 234
Joined: Tue Aug 16, 2011 4:55 am

Re: upgrade 2.1.4 => 2.1.6 performance issues

Post by CBoekhuis »

PM has been sent. ;)
User avatar
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: upgrade 2.1.4 => 2.1.6 performance issues

Post by cdienger »

Everything looks good in the data provided. Can you enable the service checks just long enough to reproduce the message in the logs and then send a copy of the logs?

I'd also like to get a copy of the XI config. This can be done from the command line usually with:

Code: Select all

mysqldump -uroot -pnagiosxi nagiosql > nagiosql.sql
The above will work on most XI instances unless you've offloaded the database to another system, in which case you'd need to run something like:

Code: Select all

mysqldump -h database_host_ip -uusername -ppassword nagiosql > nagiosql.sql
If you're not sure about the credentials or the database server IP, you can review /usr/local/nagiosxi/html/config.inc.php, which contains the database connection information.
CBoekhuis
Posts: 234
Joined: Tue Aug 16, 2011 4:55 am

Re: upgrade 2.1.4 => 2.1.6 performance issues

Post by CBoekhuis »

Hi,

I sent you the files in a PM.
Unfortunately I also ran into another problem. Since the update the maintenance job fails and no snapshots are created; Log Server thinks that a snapshot is already in progress. I ran the maintenance job by hand while tailing jobs.log. The output of the tail is in the attached file, along with a listing of the snapshot indices, which shows that it stops after May 14th.

Here's also a screenshot of the snapshot & maintenance page:
snapshots.PNG
Greetings...Hans
User avatar
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: upgrade 2.1.4 => 2.1.6 performance issues

Post by cdienger »

Please provide a directory listing (ls -alh) of the snapshot directory, as well as the output of:

Code: Select all

curl -XGET 'http://localhost:9200/_snapshot/nls_prd1/_all?pretty'
CBoekhuis
Posts: 234
Joined: Tue Aug 16, 2011 4:55 am

Re: upgrade 2.1.4 => 2.1.6 performance issues

Post by CBoekhuis »

As requested.
User avatar
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: upgrade 2.1.4 => 2.1.6 performance issues

Post by cdienger »

Run:

Code: Select all

ps aux | grep curator
and, assuming you don't see any curator scripts running, delete the stuck job:

Code: Select all

curl -XDELETE 'http://localhost:9200/_snapshot/nls_prd1/curator-20200515112510'
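Before deleting, it can be worth confirming which snapshot is actually stuck, since only a snapshot left in state IN_PROGRESS blocks new ones. A minimal sketch — the sample lines stand in for the `_snapshot/nls_prd1/_all` output reduced to "name state" pairs, and the first entry is made up for illustration:

```shell
# SAMPLE stands in for (abbreviated) output of:
#   curl -s 'http://localhost:9200/_snapshot/nls_prd1/_all?pretty'
SAMPLE='curator-20200514112510 SUCCESS
curator-20200515112510 IN_PROGRESS'

# Print the name of any snapshot still marked IN_PROGRESS.
echo "$SAMPLE" | awk '$2 == "IN_PROGRESS" { print $1 }'
```

Whatever name that prints is the snapshot to target with the DELETE above.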
If there are still any issues, please try running the job from the command line:

Code: Select all

/usr/local/nagioslogserver/scripts/curator.sh snapshot --repository 'nls_prd1' --ignore_unavailable indices --older-than 1 --time-unit days --timestring %Y.%m.%d

and provide the output of:

Code: Select all

ls -alh /nfs_mounts/logserver/prd1/
CBoekhuis
Posts: 234
Joined: Tue Aug 16, 2011 4:55 am

Re: upgrade 2.1.4 => 2.1.6 performance issues

Post by CBoekhuis »

The delete got the snapshots back into a good state, and afterwards I ran a successful new snapshot.
I also re-enabled the Nagios XI services to see if that problem might be solved as well, but it still occurs.
User avatar
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: upgrade 2.1.4 => 2.1.6 performance issues

Post by cdienger »

Run the following:

Code: Select all

curl 'localhost:9200/_cat/thread_pool?v&h=id,host,search.active,search.rejected,search.completed,search.size,search.queue,search.type,search.queueSize,search.min,search.max,search.keepAlive,search.largest'
Run it on the machine once while the check is disabled, again while it is enabled, and a final time after you edit /usr/local/nagioslogserver/elasticsearch/config/elasticsearch.yml on each NLS machine in the cluster and add the following to the bottom:

Code: Select all

threadpool.search.queue_size: 2000
Restart the Elasticsearch service after making the change to elasticsearch.yml:

Code: Select all

service elasticsearch restart
Locked