Upgrade to 2.0.8 - Elasticsearch errors and high CPU

This support forum board is for support questions relating to Nagios Log Server, our solution for managing and monitoring critical log data.
Locked
Envera IT
Posts: 159
Joined: Wed Jun 19, 2013 10:21 am

Upgrade to 2.0.8 - Elasticsearch errors and high CPU

Post by Envera IT »

After upgrading from 2.0.2 to 2.0.8 we started getting the following entries in the main log file under /var/elasticsearch/:

It seems to be related to the alerts running; something with check_interval isn't parsing correctly. Any idea what could be causing this?
[2019-09-11 03:31:21,675][DEBUG][action.index ] [881dac7b-e349-4fa5-aaef-41294c3b66e0] [nagioslogserver_history][0], node[5KuNYs_iTe2XiiQ3j_egjg], [P], s[STARTED]: Failed to execute [index {[nagioslogserver_history][alert][AW0fO-PCsKxNbujnOADK], source[{"alert_id":"AWi-5NRdRoAfU4O0MhUT","name":"EMAIL-DHCP-SNOOPING-DROP-OFFER","check_interval":"5m","lookback_period":"5m","warning":"1","critical":"1","start":1568186781,"end":1568187081,"query":"{\"query\":{\"filtered\":{\"query\":{\"bool\":{\"should\":[{\"query_string\":{\"query\":\"\\\"Drop offer\\\"\"}}]}},\"filter\":{\"bool\":{\"must\":[{\"range\":{\"@timestamp\":{\"from\":1568186781000,\"to\":1568187081000}}},{\"fquery\":{\"query\":{\"query_string\":{\"query\":\"host:(\\\"10.0.0.10\\\")\"}},\"_cache\":true}}]}}}}}","indexes":"logstash-2019.09.11","ran":1568187081,"status":0,"output":"OK: 0 matching entries found |logs=0;1;1"}]}]
org.elasticsearch.index.mapper.MapperParsingException: failed to parse [check_interval]
at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:411)
at org.elasticsearch.index.mapper.object.ObjectMapper.serializeValue(ObjectMapper.java:706)
at org.elasticsearch.index.mapper.object.ObjectMapper.parse(ObjectMapper.java:497)
at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:544)
at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:493)
at org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:465)
at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:201)
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$PrimaryPhase.performOnPrimary(TransportShardReplicationOperationAction.java:574)
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$PrimaryPhase$1.doRun(TransportShardReplicationOperationAction.java:440)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NumberFormatException: For input string: "5m"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Long.parseLong(Long.java:589)
at java.lang.Long.parseLong(Long.java:631)
at org.elasticsearch.common.xcontent.support.AbstractXContentParser.longValue(AbstractXContentParser.java:145)
at org.elasticsearch.index.mapper.core.LongFieldMapper.innerParseCreateField(LongFieldMapper.java:288)
at org.elasticsearch.index.mapper.core.NumberFieldMapper.parseCreateField(NumberFieldMapper.java:239)
at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:401)
... 12 more
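The "Caused by" line shows Elasticsearch handing the duration string to Long.parseLong (via LongFieldMapper), which means check_interval is mapped as a numeric (long) field in nagioslogserver_history while the alert history document carries a duration string like "5m". A minimal Python sketch of the same failure mode (illustrative only, not Log Server code):

```python
# The stack trace shows LongFieldMapper calling Long.parseLong("5m").
# Python's int(value, 10) enforces the same digits-only rule, so it
# reproduces the rejection seen in the NumberFormatException above.
for value in ("300", "5m"):
    try:
        print(value, "->", int(value, 10))
    except ValueError as exc:
        print(value, "-> rejected:", exc)
```

Any document whose check_interval is not a plain integer will be rejected the same way, so the alert ran fine but its history entry was never indexed.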
Beyond that, CPU load on the cluster has spiked to 6-8 on average, where before we were sitting around 1.5-2. Can you give some advice on what I should look for to trace a CPU spike after an upgrade?
I like graphs...
mbellerue
Posts: 1403
Joined: Fri Jul 12, 2019 11:10 am

Re: Upgrade to 2.0.8 - Elasticsearch errors and high CPU

Post by mbellerue »

If you're getting that error regularly in your logs, it's worth looking at that particular alert. Maybe it didn't get imported properly. It may be worth rebuilding it from scratch, if that's not too much work.

Beyond that, I would say run top -cn2, copy/paste the output of that, and let's see what the top couple of processes are.
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.

Be sure to check out our Knowledgebase for helpful articles and solutions!
Envera IT
Posts: 159
Joined: Wed Jun 19, 2013 10:21 am

Re: Upgrade to 2.0.8 - Elasticsearch errors and high CPU

Post by Envera IT »

mbellerue wrote:If you're getting that error regularly in your logs, it may be worth looking at that particular alert. Maybe it didn't get imported properly. It might be worth building it again from the ground up, if it's not too much work.
Ok, I'll give that a shot. It seems to be happening with all the alerts, but we only have 16, so it's not too bad to rebuild them.
mbellerue wrote:Beyond that, I would say run top -cn2, copy/paste the output of that, and let's see what the top couple of processes are.

Code: Select all

[root@srq-nagios-ls3 ~]# top -cn2
top - 11:38:32 up 13:50,  1 user,  load average: 5.34, 5.33, 5.14
Tasks: 112 total,   2 running, 110 sleeping,   0 stopped,   0 zombie
Cpu(s): 56.3%us,  7.3%sy, 29.2%ni,  6.0%id,  0.0%wa,  0.2%hi,  1.0%si,  0.0%st
Mem:  33015156k total, 32786652k used,   228504k free,   149120k buffers
Swap:   262136k total,        0k used,   262136k free, 13852788k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1106 nagios    20   0 51.2g  16g 145m S 121.9 52.3 490:23.17 /usr/bin/java -Xms16120m -Xmx1612
 1197 root      39  19 3693m 745m  14m S 60.1  2.3 226:07.85 /usr/bin/java -XX:+UseParNewGC -XX
31616 apache    20   0  237m  11m 3320 S  2.3  0.0   0:28.88 /usr/sbin/httpd
31627 apache    20   0  237m  11m 3336 S  2.3  0.0   0:28.81 /usr/sbin/httpd
10258 root      20   0 98.2m 4888 3836 S  0.3  0.0   0:00.09 sshd: root@pts/0
10276 root      20   0 15024 1340 1008 R  0.3  0.0   0:00.01 top -cn2
    1 root      20   0 19360 1392 1100 S  0.0  0.0   0:01.67 /sbin/init
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 [kthreadd]
    3 root      RT   0     0    0    0 S  0.0  0.0   0:00.30 [migration/0]
    4 root      20   0     0    0    0 S  0.0  0.0   0:02.02 [ksoftirqd/0]
    5 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 [migration/0]
    6 root      RT   0     0    0    0 S  0.0  0.0   0:00.18 [watchdog/0]
    7 root      RT   0     0    0    0 S  0.0  0.0   0:00.57 [migration/1]
    8 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 [migration/1]
    9 root      20   0     0    0    0 S  0.0  0.0   0:01.52 [ksoftirqd/1]
   10 root      RT   0     0    0    0 S  0.0  0.0   0:00.15 [watchdog/1]
   11 root      20   0     0    0    0 S  0.0  0.0   0:03.50 [events/0]
   12 root      20   0     0    0    0 S  0.0  0.0   0:25.29 [events/1]
   13 root      20   0     0    0    0 S  0.0  0.0   0:00.00 [cgroup]
   14 root      20   0     0    0    0 S  0.0  0.0   0:00.00 [khelper]
   15 root      20   0     0    0    0 S  0.0  0.0   0:00.00 [netns]
   16 root      20   0     0    0    0 S  0.0  0.0   0:00.00 [async/mgr]
   17 root      20   0     0    0    0 S  0.0  0.0   0:00.00 [pm]
   18 root      20   0     0    0    0 S  0.0  0.0   0:00.24 [sync_supers]
   19 root      20   0     0    0    0 S  0.0  0.0   0:00.26 [bdi-default]
   20 root      20   0     0    0    0 S  0.0  0.0   0:00.00 [kintegrityd/0]
   21 root      20   0     0    0    0 S  0.0  0.0   0:00.00 [kintegrityd/1]
   22 root      20   0     0    0    0 S  0.0  0.0   0:01.96 [kblockd/0]
   23 root      20   0     0    0    0 S  0.0  0.0   0:02.17 [kblockd/1]
   24 root      20   0     0    0    0 S  0.0  0.0   0:00.00 [kacpid]
   25 root      20   0     0    0    0 S  0.0  0.0   0:00.00 [kacpi_notify]
   26 root      20   0     0    0    0 S  0.0  0.0   0:00.00 [kacpi_hotplug]
   27 root      20   0     0    0    0 S  0.0  0.0   0:00.00 [ata/0]
   28 root      20   0     0    0    0 S  0.0  0.0   0:00.00 [ata/1]
   29 root      20   0     0    0    0 S  0.0  0.0   0:00.00 [ata_aux]
   30 root      20   0     0    0    0 S  0.0  0.0   0:00.00 [ksuspend_usbd]
   31 root      20   0     0    0    0 S  0.0  0.0   0:00.00 [khubd]
   32 root      20   0     0    0    0 S  0.0  0.0   0:00.00 [kseriod]
   33 root      20   0     0    0    0 S  0.0  0.0   0:00.00 [md/0]
   34 root      20   0     0    0    0 S  0.0  0.0   0:00.00 [md/1]
   35 root      20   0     0    0    0 S  0.0  0.0   0:00.00 [md_misc/0]
   36 root      20   0     0    0    0 S  0.0  0.0   0:00.00 [md_misc/1]
   37 root      20   0     0    0    0 S  0.0  0.0   0:00.03 [khungtaskd]
   38 root      20   0     0    0    0 S  0.0  0.0   0:04.94 [kswapd0]
   39 root      25   5     0    0    0 S  0.0  0.0   0:00.00 [ksmd]
   40 root      39  19     0    0    0 S  0.0  0.0   0:00.38 [khugepaged]
   41 root      20   0     0    0    0 S  0.0  0.0   0:00.00 [aio/0]
   42 root      20   0     0    0    0 S  0.0  0.0   0:00.00 [aio/1]
   43 root      20   0     0    0    0 S  0.0  0.0   0:00.00 [crypto/0]
   44 root      20   0     0    0    0 S  0.0  0.0   0:00.00 [crypto/1]
   49 root      20   0     0    0    0 S  0.0  0.0   0:00.00 [kthrotld/0]
   50 root      20   0     0    0    0 S  0.0  0.0   0:00.00 [kthrotld/1]
   51 root      20   0     0    0    0 S  0.0  0.0   0:00.00 [pciehpd]
   53 root      20   0     0    0    0 S  0.0  0.0   0:00.00 [kpsmoused]
   54 root      20   0     0    0    0 S  0.0  0.0   0:00.00 [usbhid_resumer]
  206 root      20   0     0    0    0 S  0.0  0.0   0:00.01 [scsi_eh_0]
  216 root      20   0     0    0    0 S  0.0  0.0   0:00.00 [scsi_eh_1]
  228 root      20   0     0    0    0 S  0.0  0.0   0:01.40 [mpt_poll_0]
  229 root      20   0     0    0    0 S  0.0  0.0   0:00.00 [mpt/0]
  230 root      20   0     0    0    0 S  0.0  0.0   0:00.00 [scsi_eh_2]
  317 root      20   0     0    0    0 S  0.0  0.0   0:24.10 [jbd2/sda1-8]
  318 root      20   0     0    0    0 S  0.0  0.0   0:00.00 [ext4-dio-unwrit]
  319 root      20   0     0    0    0 S  0.0  0.0   0:00.00 [ext4-dio-unwrit]
  397 root      16  -4 11028 1104  340 S  0.0  0.0   0:00.04 /sbin/udevd -d
  483 root      20   0     0    0    0 S  0.0  0.0   0:30.68 [flush-8:0]
  546 root      20   0     0    0    0 S  0.0  0.0   0:01.00 [vmmemctl]
  709 root      20   0     0    0    0 S  0.0  0.0   0:00.00 [kstriped]
  756 root      20   0     0    0    0 S  0.0  0.0   0:01.06 [kauditd]
  970 root      16  -4 29768  848  596 S  0.0  0.0   0:02.29 auditd
  992 root      20   0  473m 5160 1180 S  0.0  0.0   0:02.18 /sbin/rsyslogd -i /var/run/syslogd
 1008 dbus      20   0 21436  812  568 S  0.0  0.0   0:00.00 dbus-daemon --system
 1047 root      20   0  196m 7424 2004 S  0.0  0.0   3:09.35 /usr/sbin/snmpd -LS0-6d -Lf /dev/n
 1064 root      20   0 66292 1236  500 S  0.0  0.0   0:00.01 /usr/sbin/sshd
 1075 ntp       20   0 26520 1868 1284 S  0.0  0.0   0:00.10 ntpd -u ntp:ntp -p /var/run/ntpd.p
 1127 root      20   0 84860 2408  696 S  0.0  0.0   0:01.31 sendmail: accepting connections
 1136 smmsp     20   0 80448 2128  652 S  0.0  0.0   0:00.00 sendmail: Queue runner@01:00:00 fo
 1153 root      20   0  233m 8320 4124 S  0.0  0.0   0:02.42 /usr/sbin/httpd
 1175 root      20   0  114m 1236  624 S  0.0  0.0   0:00.68 crond
 1191 root      39  19  127m 1424 1128 S  0.0  0.0   0:00.00 runuser -s /bin/sh -c exec /usr/lo
 1250 root      20   0  4064  588  500 S  0.0  0.0   0:00.00 /sbin/mingetty /dev/tty1
 1252 root      20   0  4064  580  500 S  0.0  0.0   0:00.00 /sbin/mingetty /dev/tty2
 1254 root      20   0  4064  580  500 S  0.0  0.0   0:00.00 /sbin/mingetty /dev/tty3
 1256 root      20   0  4064  588  500 S  0.0  0.0   0:00.00 /sbin/mingetty /dev/tty4
 1258 root      18  -2 11024 1084  324 S  0.0  0.0   0:00.00 /sbin/udevd -d
 1259 root      18  -2 11024 1072  312 S  0.0  0.0   0:00.00 /sbin/udevd -d
 1260 root      20   0  4064  588  500 S  0.0  0.0   0:00.00 /sbin/mingetty /dev/tty5
 1262 root      20   0  4064  588  500 S  0.0  0.0   0:00.00 /sbin/mingetty /dev/tty6
10198 root      20   0  132m 1704  992 S  0.0  0.0   0:00.00 CROND
10199 root      20   0  132m 1704  992 S  0.0  0.0   0:00.00 CROND
10200 nagios    20   0  103m 1292 1124 S  0.0  0.0   0:00.00 /bin/sh -c /usr/bin/php -q /var/ww
10201 nagios    20   0  103m 1296 1124 S  0.0  0.0   0:00.00 /bin/sh -c /usr/bin/php -q /var/ww
10202 nagios    20   0  216m  12m 5776 S  0.0  0.0   0:00.07 /usr/bin/php -q /var/www/html/nagi
10203 nagios    20   0  215m  13m 5828 S  0.0  0.0   0:00.13 /usr/bin/php -q /var/www/html/nagi
10260 root      20   0  105m 1896 1520 S  0.0  0.0   0:00.00 -bash
21374 apache    20   0  239m  13m 3372 S  0.0  0.0   2:17.61 /usr/sbin/httpd
31005 apache    20   0  238m  12m 3320 S  0.0  0.0   0:29.55 /usr/sbin/httpd
31010 apache    20   0  238m  12m 3352 S  0.0  0.0   0:29.45 /usr/sbin/httpd
31012 apache    20   0  238m  12m 3336 S  0.0  0.0   0:30.27 /usr/sbin/httpd
31017 apache    20   0  237m  11m 3320 S  0.0  0.0   0:29.66 /usr/sbin/httpd
31022 apache    20   0  237m  11m 3320 S  0.0  0.0   0:30.02 /usr/sbin/httpd
31110 apache    20   0  238m  12m 3320 S  0.0  0.0   0:29.10 /usr/sbin/httpd
mbellerue
Posts: 1403
Joined: Fri Jul 12, 2019 11:10 am

Re: Upgrade to 2.0.8 - Elasticsearch errors and high CPU

Post by mbellerue »

Okay. Your top output looks good, save for the load. So I'll wait to hear back on the alerts.
Envera IT
Posts: 159
Joined: Wed Jun 19, 2013 10:21 am

Re: Upgrade to 2.0.8 - Elasticsearch errors and high CPU

Post by Envera IT »

So I'm still seeing those Elasticsearch log entries even after rebuilding the alerts. I tried changing the check interval from 1m to 60s as a test, and it's still failing to parse check_interval.

Any ideas on what to check next?

Code: Select all

[2019-09-13 13:46:21,956][DEBUG][action.index             ] [ef70bad7-9d2f-44e1-af50-bec5f1c21973] [nagioslogserver_history][4], node[Ab0aTXO3T7iYADGjZKuxXQ], [P], s[STARTED]: Failed to execute [index {[nagioslogserver_history][alert][AW0ru6mD7SIN8iYUYnp9], source[{"alert_id":"AW0rpOUd7SIN8iYUX73j","name":"Fortigate - HQ WAN Recovery","check_interval":"60s","lookback_period":"60s","warning":"1","critical":"1","start":1568396721,"end":1568396781,"query":"{\"query\":{\"filtered\":{\"query\":{\"bool\":{\"should\":[{\"query_string\":{\"query\":\"\\\"Link Monitor changes state from failed to ok\\\" NOT interface:\\\"coco\\\"\"}}]}},\"filter\":{\"bool\":{\"must\":[{\"range\":{\"@timestamp\":{\"from\":1568396721000,\"to\":1568396781000}}}]}}}}}","indexes":"logstash-2019.09.13","ran":1568396781,"status":0,"output":"OK: 0 matching entries found |logs=0;1;1"}]}]
org.elasticsearch.index.mapper.MapperParsingException: failed to parse [check_interval]
        at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:411)
        at org.elasticsearch.index.mapper.object.ObjectMapper.serializeValue(ObjectMapper.java:706)
        at org.elasticsearch.index.mapper.object.ObjectMapper.parse(ObjectMapper.java:497)
        at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:544)
        at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:493)
        at org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:465)
        at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:201)
        at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$PrimaryPhase.performOnPrimary(TransportShardReplicationOperationAction.java:574)
        at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$PrimaryPhase$1.doRun(TransportShardReplicationOperationAction.java:440)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NumberFormatException: For input string: "60s"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
        at java.lang.Long.parseLong(Long.java:589)
        at java.lang.Long.parseLong(Long.java:631)
        at org.elasticsearch.common.xcontent.support.AbstractXContentParser.longValue(AbstractXContentParser.java:145)
        at org.elasticsearch.index.mapper.core.LongFieldMapper.innerParseCreateField(LongFieldMapper.java:288)
        at org.elasticsearch.index.mapper.core.NumberFieldMapper.parseCreateField(NumberFieldMapper.java:239)
        at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:401)
        ... 12 more
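Note that "60s" fails for the same reason "5m" did: the history index maps check_interval as a long, so any value with a unit suffix is rejected. One hypothetical workaround (purely illustrative; this helper is not part of Log Server) would be normalizing the duration string to plain seconds before it is indexed:

```python
import re

# Hypothetical helper: convert a duration string such as "5m" or "60s"
# into integer seconds, the only shape a long-mapped field will accept.
UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

def interval_to_seconds(value: str) -> int:
    match = re.fullmatch(r"(\d+)([smhd])", value)
    if not match:
        raise ValueError(f"unrecognized interval: {value!r}")
    return int(match.group(1)) * UNITS[match.group(2)]

print(interval_to_seconds("5m"))   # 300
print(interval_to_seconds("60s"))  # 60
```

Alternatively, the field could be reindexed with a string mapping, but either change would need to come from the product side rather than the alert configuration.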
mbellerue
Posts: 1403
Joined: Fri Jul 12, 2019 11:10 am

Re: Upgrade to 2.0.8 - Elasticsearch errors and high CPU

Post by mbellerue »

What are the stats on the system in question? Also, is it a physical machine, or a virtual machine?

Would it be possible to delete the alerts one by one to see if the load comes down? If we can narrow the performance issue down to a specific alert, we'll know what to look for. Or if the load stays high even with no alerts, we'll know the alerts are just a red herring.

Also, if you could PM me a system profile, I'll take a look at it on this side.
Envera IT
Posts: 159
Joined: Wed Jun 19, 2013 10:21 am

Re: Upgrade to 2.0.8 - Elasticsearch errors and high CPU

Post by Envera IT »

Sorry for the delay. The alerts are working, and the CPU load seems to vary with the time of day (volume of logs). We're planning to upgrade the cluster's resources in the next few weeks and will see if that has an impact. You can close this out; if we continue to have issues I'll open another ticket. Thanks.
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises
Contact:

Re: Upgrade to 2.0.8 - Elasticsearch errors and high CPU

Post by scottwilkerson »

Ehamby wrote:Sorry for the delay. So alerts are working and the CPU load seems to be variable with the time of day (amount of logs). We're planning on upgrading the resources on the cluster in the next few weeks and will see if that has an impact. You can close this out and if we continue to have issues I'll just open up another one. Thanks
Great. Locking.