Logstash process exited, but running?

This support forum board is for support questions relating to Nagios Log Server, our solution for managing and monitoring critical log data.
Locked
polarbear1
Posts: 73
Joined: Mon Apr 13, 2015 4:26 pm

Logstash process exited, but running?

Post by polarbear1 »

Greetings,

I recently migrated my NLS clusters from RHEL 6.8 to 7.3, updated to NLS 1.4.4 while I was at it, and I'm seeing a weird behavior on a couple of them. Long story short, the dashboard shows Logstash as running and NLS seems to be behaving correctly, however on the server itself there is no indication that the logstash service is running and there is no java process for it that is active (verified by looking at top -u nagios).

On my bad node I see this:

The dashboard is happy --
Image

Code: Select all

[root@schpnag1 ~]# systemctl -l status logstash
● logstash.service - LSB: Logstash
   Loaded: loaded (/etc/rc.d/init.d/logstash; bad; vendor preset: disabled)
   Active: active (exited) since Thu 2017-03-23 09:58:31 CDT; 1min 49s ago
     Docs: man:systemd-sysv-generator(8)
  Process: 19321 ExecStop=/etc/rc.d/init.d/logstash stop (code=exited, status=0/SUCCESS)
  Process: 19330 ExecStart=/etc/rc.d/init.d/logstash start (code=exited, status=0/SUCCESS)

Mar 23 09:58:31 schpnag1 systemd[1]: Starting LSB: Logstash...
Mar 23 09:58:31 schpnag1 runuser[19336]: pam_unix(runuser:session): session opened for user nagios by (uid=0)
Mar 23 09:58:31 schpnag1 logstash[19330]: Starting Logstash Daemon: [  OK  ]
Mar 23 09:58:31 schpnag1 systemd[1]: Started LSB: Logstash.
Mar 23 09:58:47 schpnag1 runuser[19336]: pam_unix(runuser:session): session closed for user nagios
Elasticsearch is alright...

Code: Select all

[root@schpnag1 ~]# systemctl -l status elasticsearch
● elasticsearch.service - LSB: This service manages the elasticsearch daemon
   Loaded: loaded (/etc/rc.d/init.d/elasticsearch; bad; vendor preset: disabled)
   Active: active (running) since Tue 2017-03-07 09:42:52 CST; 2 weeks 1 days ago
     Docs: man:systemd-sysv-generator(8)
  Process: 15378 ExecStop=/etc/rc.d/init.d/elasticsearch stop (code=exited, status=0/SUCCESS)
  Process: 15944 ExecStart=/etc/rc.d/init.d/elasticsearch start (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/elasticsearch.service
           └─15964 java -Xms32125m -Xmx32125m -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:+DisableExplicitGC -Dfile.encoding=UTF-8 -Des.cluster.name=4f703585-84ab-40e0-9ff9-f72c904bdc38 -Des.node.name=dd72d9be-b8a7-484b-9f56-dcd2374c36e8 -Des.discovery.zen.ping.unicast.hosts=localhost,,schpnag2 -Des.path.repo=/ -Delasticsearch -Des.pidfile=/var/run/elasticsearch/elasticsearch.pid -Des.path.home=/usr/local/nagioslogserver/elasticsearch -cp :/usr/local/nagioslogserver/elasticsearch/lib/elasticsearch-1.6.0.jar:/usr/local/nagioslogserver/elasticsearch/lib/*:/usr/local/nagioslogserver/elasticsearch/lib/sigar/* -Des.default.path.home=/usr/local/nagioslogserver/elasticsearch -Des.default.path.logs=/var/log/elasticsearch -Des.default.path.data=/nagios/data -Des.default.path.work=/usr/local/nagioslogserver/tmp/elasticsearch -Des.default.path.conf=/usr/local/nagioslogserver/elasticsearch/config org.elasticsearch.bootstrap.Elasticsearch

Mar 07 09:42:52 schpnag1 systemd[1]: Starting LSB: This service manages the elasticsearch daemon...
Mar 07 09:42:52 schpnag1 runuser[15955]: pam_unix(runuser:session): session opened for user nagios by (uid=0)
Mar 07 09:42:52 schpnag1 runuser[15955]: pam_unix(runuser:session): session closed for user nagios
Mar 07 09:42:52 schpnag1 elasticsearch[15944]: Starting elasticsearch: [  OK  ]
Mar 07 09:42:52 schpnag1 systemd[1]: Started LSB: This service manages the elasticsearch daemon.
And it's not just the behavior of logstash because on another note everything is alright...

Code: Select all

[root@schpnag11 ~]# systemctl -l status logstash
● logstash.service - LSB: Logstash
   Loaded: loaded (/etc/rc.d/init.d/logstash; bad; vendor preset: disabled)
   Active: active (running) since Tue 2017-02-21 15:09:16 CST; 4 weeks 1 days ago
     Docs: man:systemd-sysv-generator(8)
  Process: 17753 ExecStop=/etc/rc.d/init.d/logstash stop (code=exited, status=0/SUCCESS)
  Process: 17762 ExecStart=/etc/rc.d/init.d/logstash start (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/logstash.service
           ├─17768 runuser -s /bin/sh -c exec /usr/local/nagioslogserver/logstash/bin/logstash agent -f /usr/local/nagioslogserver/logstash/etc/conf.d -l /var/log/logstash/logstash.log  -w 4 nagios
           └─17770 java -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -Djava.io.tmpdir=/usr/local/nagioslogserver/tmp -Xmx500m -Xss2048k -Djffi.boot.library.path=/usr/local/nagioslogserver/logstash/vendor/jruby/lib/jni -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -Djava.io.tmpdir=/usr/local/nagioslogserver/tmp -Xbootclasspath/a:/usr/local/nagioslogserver/logstash/vendor/jruby/lib/jruby.jar -classpath : -Djruby.home=/usr/local/nagioslogserver/logstash/vendor/jruby -Djruby.lib=/usr/local/nagioslogserver/logstash/vendor/jruby/lib -Djruby.script=jruby -Djruby.shell=/bin/sh org.jruby.Main --1.9 /usr/local/nagioslogserver/logstash/lib/bootstrap/environment.rb logstash/runner.rb agent -f /usr/local/nagioslogserver/logstash/etc/conf.d -l /var/log/logstash/logstash.log -w 4
As far as I can tell (and I should, since I built the damn things) - the configuration between the nodes is identical. As I stated, the cluster appears to be working fine, but I'm trying to set up my infrastructure monitoring to alert me when elasticsearch or logstash dies, and that' really hard to do if it can't tell they are alive in the first place.
avandemore
Posts: 1597
Joined: Tue Sep 27, 2016 4:57 pm

Re: Logstash process exited, but running?

Post by avandemore »

What does /var/log/logstash/logstash.log say?
Previous Nagios employee
polarbear1
Posts: 73
Joined: Mon Apr 13, 2015 4:26 pm

Re: Logstash process exited, but running?

Post by polarbear1 »

Ahh, I remembered what I forgot. I had a custom grok pattern defined before the update that was lost in the updates, so grok failed on that note because in my global config it did preserve the custom filter. Well then. I cleared that up and now everything is as expected. This is what was different from the other node - that cluster did not have this custom filter in place.

Still begs the question though, if logstash crashed due to a crappy filter, why was the dashboard still reporting it as up?
avandemore
Posts: 1597
Joined: Tue Sep 27, 2016 4:57 pm

Re: Logstash process exited, but running?

Post by avandemore »

Do you know how to induce this behavior? I'm trying to replicate it.
Previous Nagios employee
polarbear1
Posts: 73
Joined: Mon Apr 13, 2015 4:26 pm

Re: Logstash process exited, but running?

Post by polarbear1 »

avandemore wrote:Do you know how to induce this behavior? I'm trying to replicate it.
Yes sir! Just was able to recreate it on the spot.

1. In NLS -- Administration > Global Configuration > +Add Filter
2. Make a filter with an undefined pattern.

(I just grabbed a part of the default Apache filter, changed MONTHDAY to MONTHDAY_TEST)

Code: Select all

if [program] == 'apache_error' {
    grok {
        match => [ 'message', '\[(?<timestamp>%{DAY:day} %{MONTH:month} %{MONTHDAY_TEST} %{TIME} %{YEAR})\] \[%{WORD:class}\] \[%{WORD:originator} %{IP:clientip}\] %{GREEDYDATA:errmsg}']
    }
    mutate {
        replace => [ 'type', 'apache_error' ]
    }
}
3. Apply Configuration.


Outcome on the server...

Code: Select all

# cat /var/log/logstash.log
{:timestamp=>"2017-03-24T10:19:46.934000-0500", :message=>"The error reported is: \n  pattern %{MONTHDAY_TEST} not defined"}

Code: Select all

# systemctl status logstash
● logstash.service - LSB: Logstash
   Loaded: loaded (/etc/rc.d/init.d/logstash; bad; vendor preset: disabled)
   Active: active (exited) since Fri 2017-03-24 10:19:54 CDT; 26s ago
     Docs: man:systemd-sysv-generator(8)
  Process: 9623 ExecStop=/etc/rc.d/init.d/logstash stop (code=exited, status=0/SUCCESS)
  Process: 9632 ExecStart=/etc/rc.d/init.d/logstash start (code=exited, status=0/SUCCESS)

Mar 24 10:19:53 schpnag1 systemd[1]: Starting LSB: Logstash...
Mar 24 10:19:53 schpnag1 runuser[9638]: pam_unix(runuser:session): session opened for user nagios...d=0)
Mar 24 10:19:54 schpnag1 logstash[9632]: Starting Logstash Daemon: [  OK  ]
Mar 24 10:19:54 schpnag1 systemd[1]: Started LSB: Logstash.
Mar 24 10:20:06 schpnag1 runuser[9638]: pam_unix(runuser:session): session closed for user nagios
Hint: Some lines were ellipsized, use -l to show in full.
Instance Status dashboard is still green lights.
avandemore
Posts: 1597
Joined: Tue Sep 27, 2016 4:57 pm

Re: Logstash process exited, but running?

Post by avandemore »

I just the filter on my 2 node cluster and I'm not able to reproduce.

Code: Select all

[root@nls1 ~]# service logstash status
Logstash Daemon● logstash.service - LSB: Logstash
   Loaded: loaded (/etc/rc.d/init.d/logstash; bad; vendor preset: disabled)
   Active: active (running) since Thu 2017-03-23 14:30:42 CDT; 21h ago
     Docs: man:systemd-sysv-generator(8)
  Process: 26981 ExecStop=/etc/rc.d/init.d/logstash stop (code=exited, status=0/SUCCESS)
  Process: 27035 ExecStart=/etc/rc.d/init.d/logstash start (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/logstash.service
           ├─27041 runuser -s /bin/sh -c exec /usr/local/nagioslogserver/logstash/bin/logstash agent -f /usr/local/nagioslogserver/logstash/etc/conf.d -l /var/log/logstash/logstash.log  -w 4 nagios
           └─27047 java -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -Djava.io.tmpdir=/usr/local/nagioslogserver/tmp -Xmx500m -Xss2048k -Djf...
Is there more in /var/log/logstash/logstash.log? I don't think that alone would be sufficient to cause it to die. Perhaps we need to increase logstash verbosity. Does /var/log/messages contain anything interesting?
Previous Nagios employee
polarbear1
Posts: 73
Joined: Mon Apr 13, 2015 4:26 pm

Re: Logstash process exited, but running?

Post by polarbear1 »

Pretty sure there's nothing else. Just verified by doing the above steps on an entirely separate cluster (2 nodes) and the results where the same.

Code: Select all

[root@schpnag11 ~]# cat /var/log/logstash/logstash.log
{:timestamp=>"2017-03-24T13:21:26.543000-0500", :message=>"SIGTERM received. Shutting down the pipeline.", :level=>:warn}
{:timestamp=>"2017-03-24T13:21:38.057000-0500", :message=>"The error reported is: \n  pattern %{MONTHDAY_TEST} not defined"}
[root@schpnag11 ~]# systemctl status logstash
● logstash.service - LSB: Logstash
   Loaded: loaded (/etc/rc.d/init.d/logstash; bad; vendor preset: disabled)
   Active: active (exited) since Fri 2017-03-24 13:21:28 CDT; 36s ago
     Docs: man:systemd-sysv-generator(8)
  Process: 8876 ExecStop=/etc/rc.d/init.d/logstash stop (code=exited, status=0/SUCCESS)
  Process: 8896 ExecStart=/etc/rc.d/init.d/logstash start (code=exited, status=0/SUCCESS)

Mar 24 13:21:27 schpnag11 systemd[1]: Starting LSB: Logstash...
Mar 24 13:21:27 schpnag11 runuser[8902]: pam_unix(runuser:session): session opened for user nagio...d=0)
Mar 24 13:21:28 schpnag11 logstash[8896]: Starting Logstash Daemon: [  OK  ]
Mar 24 13:21:28 schpnag11 systemd[1]: Started LSB: Logstash.
Mar 24 13:21:38 schpnag11 runuser[8902]: pam_unix(runuser:session): session closed for user nagios
Hint: Some lines were ellipsized, use -l to show in full.
avandemore
Posts: 1597
Joined: Tue Sep 27, 2016 4:57 pm

Re: Logstash process exited, but running?

Post by avandemore »

Ok, so you can induce this by specificying a grok filter which isn't defined and then sending data which will match the filter but not grok.
BTW, you can update grok like this:

Code: Select all

/usr/local/nagioslogserver/logstash/bin/plugin update logstash-filter-grok
However this appears to be different than what you originally posted unless I'm missing something?
Previous Nagios employee
polarbear1
Posts: 73
Joined: Mon Apr 13, 2015 4:26 pm

Re: Logstash process exited, but running?

Post by polarbear1 »

avandemore wrote: However this appears to be different than what you originally posted unless I'm missing something?
I think we're on the same page. Give it a crappy pattern, restart logstash (can't say for feeding it matching logs though), it crashes, but doesn't report as dead on the dashboard.
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises
Contact:

Re: Logstash process exited, but running?

Post by scottwilkerson »

This is odd, can you run the following from the CLI to replicate what the GUI does:

Code: Select all

sudo /etc/init.d/logstash status
echo $?
Of course you would need to run this when logstash isn't running properly but the dashboard reports it is.

We may have found an exception we need to address using old style init scripts on systemd systems
Former Nagios employee
Creator:
Human Design Website
Get Your Human Design Chart
Locked