Logstash services stopped on all nodes
Posted: Fri Oct 02, 2015 1:04 pm
by CFT6Server
Just noticed that our cluster has stopped collecting logs and, after checking, it looks like logstash is not running on any of the nodes. When I start the service manually, it does not stay running on any node. Nothing appears in the logstash logs. Any ideas?
Re: Logstash services stopped on all nodes
Posted: Fri Oct 02, 2015 1:57 pm
by jolson
Anything in /var/log/messages? Is there any possibility that your disks are full?
My guesses are that either
A) Logstash can't start due to insufficient memory
-or-
B) Logstash can't start due to no disk space
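A quick way to check both guesses is sketched below. The function names are made up for illustration, and the search strings assume the usual Linux kernel OOM-killer wording, so treat it as a starting point rather than a definitive check.

```shell
# Hypothetical helper: print kernel OOM-killer lines from a syslog file.
# Usage: oom_events [LOGFILE]   (defaults to /var/log/messages)
oom_events() {
    grep -iE 'out of memory|oom-killer' "${1:-/var/log/messages}"
}

# Hypothetical helper: print mount points at or above a usage threshold.
# Reads `df -P` output on stdin (-P keeps each filesystem on one line).
# Usage: df -P | full_filesystems 90
full_filesystems() {
    awk -v limit="${1:-90}" 'NR > 1 {
        use = $5
        sub(/%/, "", use)            # strip the trailing percent sign
        if (use + 0 >= limit + 0) print $6
    }'
}
```

Running `oom_events` with no arguments and `df -P | full_filesystems 90` on each node should show whether either guess applies.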
Let me know!
Re: Logstash services stopped on all nodes
Posted: Fri Oct 02, 2015 3:45 pm
by CFT6Server
Space-wise, it is looking OK:
Code: Select all
# df -h
Filesystem      Size  Used Avail Use% Mounted on
rootfs           99G  3.2G   95G   4% /
devtmpfs        9.9G  148K  9.9G   1% /dev
tmpfs           9.9G     0  9.9G   0% /dev/shm
/dev/sda1        99G  3.2G   95G   4% /
10.242.145.237:/vol/v_kdcnagiosnfs1_kdcnagls1n1_logs
                2.5T  1.3T  1.3T  50% /nfs/logdata
10.242.145.250:/vol/v_kdcnagiosnfs2_repository
                8.8T  7.9T  985G  90% /nfs/repository
It could be memory, but the Log Server web interface shows that memory is fine:
Code: Select all
# free -m
             total       used       free     shared    buffers     cached
Mem:         20120      19544        576          0         65       4813
-/+ buffers/cache:      14664       5456
Swap:          255         33        222
[Attachment: resource.JPG]
I've restarted the node and the service is still dead:
Code: Select all
# service logstash status
Logstash Daemon dead but pid file exists
Re: Logstash services stopped on all nodes
Posted: Mon Oct 05, 2015 5:41 am
by jdalrymple
CFT6Server wrote:
Code: Select all
# free -m
             total       used       free     shared    buffers     cached
Mem:         20120      19544        576          0         65       4813
-/+ buffers/cache:      14664       5456
Swap:          255         33        222
I would trust `free` over the web interface, wouldn't you?
I'd have to confirm this with the developers, but I bet that the data in the UI comes from elasticsearch, which is likely out of date if Logstash isn't up. It is surprising, though, that you're not seeing out-of-memory errors in the logstash log; it usually doesn't try to hide memory problems.
Re: Logstash services stopped on all nodes
Posted: Tue Oct 06, 2015 12:04 am
by CFT6Server
It looks like this was caused by an error in the configuration, not a memory issue. I'm not sure why verification didn't pick it up, and there weren't any changes in the last 2 to 3 weeks, so the timing is a bit odd. It turns out that one of the filters was referring to a pattern that is not valid. Since verification did not catch this, I had to turn on verbose logging by editing /etc/init.d/logstash and adding -vvv to the arguments. The log then showed me the pattern with the error, which was what kept logstash from starting.
Re: Logstash services stopped on all nodes
Posted: Tue Oct 06, 2015 11:01 am
by jolson
That is a bit odd - I'm still working with the developers to get the configurations to verify before the Apply Configuration occurs. I have never encountered a logstash configuration that caused it to stop starting entirely (without logging). If you're willing to shed some more light before I close this thread, what exactly was the configuration mistake?
Re: Logstash services stopped on all nodes
Posted: Tue Oct 06, 2015 11:15 am
by CFT6Server
It was a filter with a grok pattern that did not exist.
So a snippet of the grok filter...
Code: Select all
grok {
match => [
'message', '%{CISCOFW106001_1}',
'message', '%{CISCOFW106001_2}',
'message', '%{CISCOFW106006_106007_1}',
'message', '%{CISCOFW106006_106007_2}',
'message', '%{CISCOFW106006_106007_106010}',
'message', '%{CISCOFW106015}',
'message', '%{CISCOFW106021}',
'message', '%{CISCOFW106023}',
'message', '%{CISCOFW106100}',
'message', '%{CISCOFW110002}',
'message', '%{CISCOFW302010}',
'message', '%{CISCOFW302013_302014_302015_302016_1}',
'message', '%{CISCOFW302013_302014_302015_302016_2}',
'message', '%{CISCOFW302020_302021_1}',
'message', '%{CISCOFW302020_302021_2}',
'message', '%{CISCOFW305011}',
'message', '%{CISCOFW313001_313004_313008}',
'message', '%{CISCOFW313005}',
'message', '%{CISCOFW402117}',
'message', '%{CISCOFW402119}',
'message', '%{CISCOFW419001}',
'message', '%{CISCOFW419002}',
'message', '%{CISCOFW500004}',
'message', '%{CISCOFW602303_602304_1}',
'message', '%{CISCOFW602303_602304_2}',
'message', '%{CISCOFW710001_710002_710003_710005_710006}',
'message', '%{CISCOFW713172}',
'message', '%{CISCOFW733100}',
'message', '%{CISCOFW106014}'
]
}
One of those was renamed in the pattern file in /usr/local/nagioslogserver/logstash/patterns/, so the filter was pointing to an unknown pattern.
This did finally show in the logs once I enabled verbose logging.
Code: Select all
{:timestamp=>"2015-10-01T21:02:04.867000-0700", :message=>"The error reported is: \n pattern %{CISCOFW106010} not defined"}
I noticed that there is minimal logging by default for the logstash service. I have set mine to LS_OPTS="-v" in the /etc/init.d/logstash file.
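For anyone who hits the same thing, here is a rough sketch of a check that could be run before applying a configuration. It lists grok pattern names referenced in a file that are not defined anywhere under a patterns directory. The function name is hypothetical, and it assumes pattern files use the usual one-definition-per-line layout (NAME, whitespace, regex) with # for comments.

```shell
# Hypothetical helper: list grok patterns referenced in a file that are
# not defined in any file under a patterns directory.
# Usage: undefined_grok_patterns CONFIG_FILE PATTERNS_DIR
undefined_grok_patterns() {
    conf=$1
    patterns_dir=$2

    # Collect defined names: first field of each non-comment definition line.
    defined=$(awk '$1 !~ /^#/ && NF >= 2 { print $1 }' "$patterns_dir"/* 2>/dev/null | sort -u)

    # References look like %{NAME} or %{NAME:field}; keep only NAME.
    grep -o '%{[A-Za-z0-9_][A-Za-z0-9_]*' "$conf" | sed 's/^%{//' | sort -u |
    while read -r name; do
        # Print the name only if no defined pattern matches it exactly.
        printf '%s\n' "$defined" | grep -qxF "$name" || printf '%s\n' "$name"
    done
}
```

An invocation would look like `undefined_grok_patterns /path/to/filter.conf /usr/local/nagioslogserver/logstash/patterns`, where the config path is a placeholder. Pattern files reference other patterns too, so it may be worth passing each pattern file as the first argument as well. The built-in patterns that ship with logstash live elsewhere, so names defined only there will be reported unless that directory is checked too.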
Re: Logstash services stopped on all nodes
Posted: Tue Oct 06, 2015 1:36 pm
by jolson
Thanks for the information. I attempted to replicate your problem, and logstash.log contained the following error:
Code: Select all
{:timestamp=>"2015-10-06T14:33:44.052000-0400", :message=>"The error reported is: \n pattern %{COMNEDAPACHELOG} not defined"}
This is a default install of Nagios Log Server, so I checked my init script, and I'm also at the lowest verbosity level. Any chance you're on an older version of NLS? I tested on 2.2. The issue might be resolved already, and I am hesitant to file a bug report unless I can confirm that the bug still exists in the latest version.