Logstash process exited, but running?

polarbear1 · Post by **polarbear1** » Thu Mar 23, 2017 10:17 am

Greetings,

I recently migrated my NLS clusters from RHEL 6.8 to 7.3, updated to NLS 1.4.4 while I was at it, and I'm seeing a weird behavior on a couple of them. Long story short, the dashboard shows Logstash as running and NLS seems to be behaving correctly, however on the server itself there is no indication that the logstash service is running and there is no java process for it that is active (verified by looking at top -u nagios).

On my bad node I see this:

The dashboard is happy --

Code: Select all

[root@schpnag1 ~]# systemctl -l status logstash
● logstash.service - LSB: Logstash
   Loaded: loaded (/etc/rc.d/init.d/logstash; bad; vendor preset: disabled)
   Active: active (exited) since Thu 2017-03-23 09:58:31 CDT; 1min 49s ago
     Docs: man:systemd-sysv-generator(8)
  Process: 19321 ExecStop=/etc/rc.d/init.d/logstash stop (code=exited, status=0/SUCCESS)
  Process: 19330 ExecStart=/etc/rc.d/init.d/logstash start (code=exited, status=0/SUCCESS)

Mar 23 09:58:31 schpnag1 systemd[1]: Starting LSB: Logstash...
Mar 23 09:58:31 schpnag1 runuser[19336]: pam_unix(runuser:session): session opened for user nagios by (uid=0)
Mar 23 09:58:31 schpnag1 logstash[19330]: Starting Logstash Daemon: [  OK  ]
Mar 23 09:58:31 schpnag1 systemd[1]: Started LSB: Logstash.
Mar 23 09:58:47 schpnag1 runuser[19336]: pam_unix(runuser:session): session closed for user nagios

Elasticsearch is alright...

Code: Select all

[root@schpnag1 ~]# systemctl -l status elasticsearch
● elasticsearch.service - LSB: This service manages the elasticsearch daemon
   Loaded: loaded (/etc/rc.d/init.d/elasticsearch; bad; vendor preset: disabled)
   Active: active (running) since Tue 2017-03-07 09:42:52 CST; 2 weeks 1 days ago
     Docs: man:systemd-sysv-generator(8)
  Process: 15378 ExecStop=/etc/rc.d/init.d/elasticsearch stop (code=exited, status=0/SUCCESS)
  Process: 15944 ExecStart=/etc/rc.d/init.d/elasticsearch start (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/elasticsearch.service
           └─15964 java -Xms32125m -Xmx32125m -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:+DisableExplicitGC -Dfile.encoding=UTF-8 -Des.cluster.name=4f703585-84ab-40e0-9ff9-f72c904bdc38 -Des.node.name=dd72d9be-b8a7-484b-9f56-dcd2374c36e8 -Des.discovery.zen.ping.unicast.hosts=localhost,,schpnag2 -Des.path.repo=/ -Delasticsearch -Des.pidfile=/var/run/elasticsearch/elasticsearch.pid -Des.path.home=/usr/local/nagioslogserver/elasticsearch -cp :/usr/local/nagioslogserver/elasticsearch/lib/elasticsearch-1.6.0.jar:/usr/local/nagioslogserver/elasticsearch/lib/*:/usr/local/nagioslogserver/elasticsearch/lib/sigar/* -Des.default.path.home=/usr/local/nagioslogserver/elasticsearch -Des.default.path.logs=/var/log/elasticsearch -Des.default.path.data=/nagios/data -Des.default.path.work=/usr/local/nagioslogserver/tmp/elasticsearch -Des.default.path.conf=/usr/local/nagioslogserver/elasticsearch/config org.elasticsearch.bootstrap.Elasticsearch

Mar 07 09:42:52 schpnag1 systemd[1]: Starting LSB: This service manages the elasticsearch daemon...
Mar 07 09:42:52 schpnag1 runuser[15955]: pam_unix(runuser:session): session opened for user nagios by (uid=0)
Mar 07 09:42:52 schpnag1 runuser[15955]: pam_unix(runuser:session): session closed for user nagios
Mar 07 09:42:52 schpnag1 elasticsearch[15944]: Starting elasticsearch: [  OK  ]
Mar 07 09:42:52 schpnag1 systemd[1]: Started LSB: This service manages the elasticsearch daemon.

And it's not just the behavior of logstash because on another note everything is alright...

Code: Select all

[root@schpnag11 ~]# systemctl -l status logstash
● logstash.service - LSB: Logstash
   Loaded: loaded (/etc/rc.d/init.d/logstash; bad; vendor preset: disabled)
   Active: active (running) since Tue 2017-02-21 15:09:16 CST; 4 weeks 1 days ago
     Docs: man:systemd-sysv-generator(8)
  Process: 17753 ExecStop=/etc/rc.d/init.d/logstash stop (code=exited, status=0/SUCCESS)
  Process: 17762 ExecStart=/etc/rc.d/init.d/logstash start (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/logstash.service
           ├─17768 runuser -s /bin/sh -c exec /usr/local/nagioslogserver/logstash/bin/logstash agent -f /usr/local/nagioslogserver/logstash/etc/conf.d -l /var/log/logstash/logstash.log  -w 4 nagios
           └─17770 java -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -Djava.io.tmpdir=/usr/local/nagioslogserver/tmp -Xmx500m -Xss2048k -Djffi.boot.library.path=/usr/local/nagioslogserver/logstash/vendor/jruby/lib/jni -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -Djava.io.tmpdir=/usr/local/nagioslogserver/tmp -Xbootclasspath/a:/usr/local/nagioslogserver/logstash/vendor/jruby/lib/jruby.jar -classpath : -Djruby.home=/usr/local/nagioslogserver/logstash/vendor/jruby -Djruby.lib=/usr/local/nagioslogserver/logstash/vendor/jruby/lib -Djruby.script=jruby -Djruby.shell=/bin/sh org.jruby.Main --1.9 /usr/local/nagioslogserver/logstash/lib/bootstrap/environment.rb logstash/runner.rb agent -f /usr/local/nagioslogserver/logstash/etc/conf.d -l /var/log/logstash/logstash.log -w 4

As far as I can tell (and I should, since I built the damn things) - the configuration between the nodes is identical. As I stated, the cluster appears to be working fine, but I'm trying to set up my infrastructure monitoring to alert me when elasticsearch or logstash dies, and that' really hard to do if it can't tell they are alive in the first place.

avandemore · Post by **avandemore** » Thu Mar 23, 2017 12:41 pm

What does /var/log/logstash/logstash.log say?

polarbear1 · Post by **polarbear1** » Thu Mar 23, 2017 2:03 pm

Ahh, I remembered what I forgot. I had a custom grok pattern defined before the update that was lost in the updates, so grok failed on that note because in my global config it did preserve the custom filter. Well then. I cleared that up and now everything is as expected. This is what was different from the other node - that cluster did not have this custom filter in place.

Still begs the question though, if logstash crashed due to a crappy filter, why was the dashboard still reporting it as up?

avandemore · Post by **avandemore** » Thu Mar 23, 2017 2:39 pm

Do you know how to induce this behavior? I'm trying to replicate it.

polarbear1 · Post by **polarbear1** » Fri Mar 24, 2017 10:24 am

avandemore wrote:Do you know how to induce this behavior? I'm trying to replicate it.

Yes sir! Just was able to recreate it on the spot.

1. In NLS -- Administration > Global Configuration > +Add Filter
2. Make a filter with an undefined pattern.

(I just grabbed a part of the default Apache filter, changed MONTHDAY to MONTHDAY_TEST)

Code: Select all

if [program] == 'apache_error' {
    grok {
        match => [ 'message', '\[(?<timestamp>%{DAY:day} %{MONTH:month} %{MONTHDAY_TEST} %{TIME} %{YEAR})\] \[%{WORD:class}\] \[%{WORD:originator} %{IP:clientip}\] %{GREEDYDATA:errmsg}']
    }
    mutate {
        replace => [ 'type', 'apache_error' ]
    }
}

3. Apply Configuration.

Outcome on the server...

Code: Select all

# cat /var/log/logstash.log
{:timestamp=>"2017-03-24T10:19:46.934000-0500", :message=>"The error reported is: \n  pattern %{MONTHDAY_TEST} not defined"}

Code: Select all

# systemctl status logstash
● logstash.service - LSB: Logstash
   Loaded: loaded (/etc/rc.d/init.d/logstash; bad; vendor preset: disabled)
   Active: active (exited) since Fri 2017-03-24 10:19:54 CDT; 26s ago
     Docs: man:systemd-sysv-generator(8)
  Process: 9623 ExecStop=/etc/rc.d/init.d/logstash stop (code=exited, status=0/SUCCESS)
  Process: 9632 ExecStart=/etc/rc.d/init.d/logstash start (code=exited, status=0/SUCCESS)

Mar 24 10:19:53 schpnag1 systemd[1]: Starting LSB: Logstash...
Mar 24 10:19:53 schpnag1 runuser[9638]: pam_unix(runuser:session): session opened for user nagios...d=0)
Mar 24 10:19:54 schpnag1 logstash[9632]: Starting Logstash Daemon: [  OK  ]
Mar 24 10:19:54 schpnag1 systemd[1]: Started LSB: Logstash.
Mar 24 10:20:06 schpnag1 runuser[9638]: pam_unix(runuser:session): session closed for user nagios
Hint: Some lines were ellipsized, use -l to show in full.

Instance Status dashboard is still green lights.

avandemore · Post by **avandemore** » Fri Mar 24, 2017 12:15 pm

I just the filter on my 2 node cluster and I'm not able to reproduce.

Code: Select all

[root@nls1 ~]# service logstash status
Logstash Daemon● logstash.service - LSB: Logstash
   Loaded: loaded (/etc/rc.d/init.d/logstash; bad; vendor preset: disabled)
   Active: active (running) since Thu 2017-03-23 14:30:42 CDT; 21h ago
     Docs: man:systemd-sysv-generator(8)
  Process: 26981 ExecStop=/etc/rc.d/init.d/logstash stop (code=exited, status=0/SUCCESS)
  Process: 27035 ExecStart=/etc/rc.d/init.d/logstash start (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/logstash.service
           ├─27041 runuser -s /bin/sh -c exec /usr/local/nagioslogserver/logstash/bin/logstash agent -f /usr/local/nagioslogserver/logstash/etc/conf.d -l /var/log/logstash/logstash.log  -w 4 nagios
           └─27047 java -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -Djava.io.tmpdir=/usr/local/nagioslogserver/tmp -Xmx500m -Xss2048k -Djf...

Is there more in /var/log/logstash/logstash.log? I don't think that alone would be sufficient to cause it to die. Perhaps we need to increase logstash verbosity. Does /var/log/messages contain anything interesting?

polarbear1 · Post by **polarbear1** » Fri Mar 24, 2017 1:22 pm

Pretty sure there's nothing else. Just verified by doing the above steps on an entirely separate cluster (2 nodes) and the results where the same.

Code: Select all

[root@schpnag11 ~]# cat /var/log/logstash/logstash.log
{:timestamp=>"2017-03-24T13:21:26.543000-0500", :message=>"SIGTERM received. Shutting down the pipeline.", :level=>:warn}
{:timestamp=>"2017-03-24T13:21:38.057000-0500", :message=>"The error reported is: \n  pattern %{MONTHDAY_TEST} not defined"}
[root@schpnag11 ~]# systemctl status logstash
● logstash.service - LSB: Logstash
   Loaded: loaded (/etc/rc.d/init.d/logstash; bad; vendor preset: disabled)
   Active: active (exited) since Fri 2017-03-24 13:21:28 CDT; 36s ago
     Docs: man:systemd-sysv-generator(8)
  Process: 8876 ExecStop=/etc/rc.d/init.d/logstash stop (code=exited, status=0/SUCCESS)
  Process: 8896 ExecStart=/etc/rc.d/init.d/logstash start (code=exited, status=0/SUCCESS)

Mar 24 13:21:27 schpnag11 systemd[1]: Starting LSB: Logstash...
Mar 24 13:21:27 schpnag11 runuser[8902]: pam_unix(runuser:session): session opened for user nagio...d=0)
Mar 24 13:21:28 schpnag11 logstash[8896]: Starting Logstash Daemon: [  OK  ]
Mar 24 13:21:28 schpnag11 systemd[1]: Started LSB: Logstash.
Mar 24 13:21:38 schpnag11 runuser[8902]: pam_unix(runuser:session): session closed for user nagios
Hint: Some lines were ellipsized, use -l to show in full.

avandemore · Post by **avandemore** » Fri Mar 24, 2017 1:42 pm

Ok, so you can induce this by specificying a grok filter which isn't defined and then sending data which will match the filter but not grok.
BTW, you can update grok like this:

Code: Select all

/usr/local/nagioslogserver/logstash/bin/plugin update logstash-filter-grok

However this appears to be different than what you originally posted unless I'm missing something?

polarbear1 · Post by **polarbear1** » Fri Mar 24, 2017 2:55 pm

avandemore wrote: However this appears to be different than what you originally posted unless I'm missing something?

I think we're on the same page. Give it a crappy pattern, restart logstash (can't say for feeding it matching logs though), it crashes, but doesn't report as dead on the dashboard.

scottwilkerson · Post by **scottwilkerson** » Mon Mar 27, 2017 9:42 am

This is odd, can you run the following from the CLI to replicate what the GUI does:

Code: Select all

sudo /etc/init.d/logstash status
echo $?

Of course you would need to run this when logstash isn't running properly but the dashboard reports it is.

We may have found an exception we need to address using old style init scripts on systemd systems

Nagios Support Forum

Logstash process exited, but running?

Logstash process exited, but running?

Re: Logstash process exited, but running?

Re: Logstash process exited, but running?

Re: Logstash process exited, but running?

Re: Logstash process exited, but running?

Re: Logstash process exited, but running?

Re: Logstash process exited, but running?

Re: Logstash process exited, but running?

Re: Logstash process exited, but running?

Re: Logstash process exited, but running?