Nagios Support Forum

Posted: **Fri Oct 02, 2015 1:04 pm**

Just noticed that our cluster has stopped collecting logs and, after checking, it looks like logstash is not running on any of the nodes. When I start the service manually, it does not stay running on any node. Nothing appears in the logstash logs. Any ideas?

Posted: **Fri Oct 02, 2015 1:57 pm**

Anything in /var/log/messages? Is there any possibility that your disks are full?

My guesses are that either
A) Logstash can't start due to insufficient memory
-or-
B) Logstash can't start due to no disk space

Let me know!

Posted: **Fri Oct 02, 2015 3:45 pm**

Space wise it is looking OK

Code:

# df -h
Filesystem            Size  Used Avail Use% Mounted on
rootfs                 99G  3.2G   95G   4% /
devtmpfs              9.9G  148K  9.9G   1% /dev
tmpfs                 9.9G     0  9.9G   0% /dev/shm
/dev/sda1              99G  3.2G   95G   4% /
10.242.145.237:/vol/v_kdcnagiosnfs1_kdcnagls1n1_logs
                      2.5T  1.3T  1.3T  50% /nfs/logdata
10.242.145.250:/vol/v_kdcnagiosnfs2_repository
                      8.8T  7.9T  985G  90% /nfs/repository

Could be memory... but it shows that memory is fine within Log Server

Code:

# free -m
             total       used       free     shared    buffers     cached
Mem:         20120      19544        576          0         65       4813
-/+ buffers/cache:      14664       5456
Swap:          255         33        222

(Attachment: resource.JPG)

I've restarted the node and the service is still dead....

Code:

# service logstash status
Logstash Daemon dead but pid file exists
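
A status of "dead but pid file exists" usually means the init script found a pid file whose process is no longer alive. A minimal sketch for checking that by hand follows. The function name is made up for illustration, and the pid file path in the usage comment is an assumption, so substitute whatever your init script actually uses.

```shell
#!/bin/sh
# pid_file_state is a hypothetical helper, not part of Logstash or NLS.
# Prints "missing", "running" or "stale" for the given pid file.
pid_file_state() {
    pidfile=$1
    if [ ! -f "$pidfile" ]; then
        echo "missing"
        return
    fi
    pid=$(cat "$pidfile")
    # kill -0 sends no signal; it only tests whether the process exists.
    # It can also fail for lack of permission, so run this as root.
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
        echo "running"
    else
        echo "stale"
    fi
}

# Example (pid file path is an assumption):
# pid_file_state /var/run/logstash.pid
```

A "stale" result only confirms that the process exited after writing its pid; it says nothing about why, so the log is still the place to look.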

Posted: **Mon Oct 05, 2015 5:41 am**

CFT6Server wrote:

Code:

# free -m
             total       used       free     shared    buffers     cached
Mem:         20120      19544        576          0         65       4813
-/+ buffers/cache:      14664       5456
Swap:          255         33        222

I would trust `free` over the web interface, wouldn't you?

I'd have to confirm this with the developers, but I bet that the data in the UI comes from elasticsearch, which is likely out of date if Logstash isn't up. It is surprising, though, that you're not seeing out-of-memory errors in the logstash log; it usually doesn't try to hide memory problems.
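
For reading the `free -m` output quoted above: the "-/+ buffers/cache" free figure is roughly free + buffers + cached from the "Mem:" row, since the kernel can reclaim buffers and cache when processes need memory. A small sketch of that arithmetic, assuming the older `free` column layout shown in this thread (newer versions print different columns); the function name is made up for illustration.

```shell
#!/bin/sh
# reclaimable_mb is a hypothetical helper name.
# Reads `free -m` output on stdin and sums the free, buffers and cached
# columns of the "Mem:" row. Assumes the old layout:
#   Mem: total used free shared buffers cached
reclaimable_mb() {
    awk '/^Mem:/ { print $4 + $6 + $7 }'
}

# Example:
# free -m | reclaimable_mb
```

With the numbers quoted above this gives 576 + 65 + 4813 = 5454, within a couple of MB of the 5456 that `free` prints (presumably a rounding difference, since each column is rounded to whole MB).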

Posted: **Tue Oct 06, 2015 12:04 am**

Looks like this was caused by an error in the configuration, not a memory issue. Not sure why verification didn't pick this up, and there weren't any changes in the last 2 to 3 weeks, so the timing is a bit odd. It turns out that one of the filters was referring to a pattern that is not valid. Since verification did not pick this up, I had to turn on verbose logging by editing /etc/init.d/logstash and adding -vvv to the arguments. Then it showed me the undefined pattern that was preventing logstash from starting.

Posted: **Tue Oct 06, 2015 11:01 am**

That is a bit odd - I'm still working with the developers to have configurations verified before Apply Configuration runs. I have never encountered a logstash configuration that kept it from starting entirely (without logging). If you're willing to shed some more light before I close this thread, what exactly was the configuration mistake?

Posted: **Tue Oct 06, 2015 11:15 am**

It was a filter which had a grok pattern that did not exist.

So a snippet of the grok filter...

Code:

grok {
    match => [
        'message', '%{CISCOFW106001_1}',
        'message', '%{CISCOFW106001_2}',
        'message', '%{CISCOFW106006_106007_1}',
        'message', '%{CISCOFW106006_106007_2}',
        'message', '%{CISCOFW106006_106007_106010}',
        'message', '%{CISCOFW106015}',
        'message', '%{CISCOFW106021}',
        'message', '%{CISCOFW106023}',
        'message', '%{CISCOFW106100}',
        'message', '%{CISCOFW110002}',
        'message', '%{CISCOFW302010}',
        'message', '%{CISCOFW302013_302014_302015_302016_1}',
        'message', '%{CISCOFW302013_302014_302015_302016_2}',
        'message', '%{CISCOFW302020_302021_1}',
        'message', '%{CISCOFW302020_302021_2}',
        'message', '%{CISCOFW305011}',
        'message', '%{CISCOFW313001_313004_313008}',
        'message', '%{CISCOFW313005}',
        'message', '%{CISCOFW402117}',
        'message', '%{CISCOFW402119}',
        'message', '%{CISCOFW419001}',
        'message', '%{CISCOFW419002}',
        'message', '%{CISCOFW500004}',
        'message', '%{CISCOFW602303_602304_1}',
        'message', '%{CISCOFW602303_602304_2}',
        'message', '%{CISCOFW710001_710002_710003_710005_710006}',
        'message', '%{CISCOFW713172}',
        'message', '%{CISCOFW733100}',
        'message', '%{CISCOFW106014}'
    ]
}

So one of those was renamed in the pattern file in /usr/local/nagioslogserver/logstash/patterns/, so the filter was pointing to an unknown pattern.
This did finally show in the logs once I enabled verbose logging.

Code:

{:timestamp=>"2015-10-01T21:02:04.867000-0700", :message=>"The error reported is: \n  pattern %{CISCOFW106010} not defined"}

I noticed that there is minimal logging by default for the logstash service. I have set mine to LS_OPTS="-v" in the /etc/init.d/logstash file.
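
Since verification did not catch the renamed pattern, a rough cross-check can be done by hand: collect every `%{NAME}` referenced in the config and in the pattern files, and compare against the names the pattern files define. This is only a sketch. The function name is made up, the config path in the usage comment is hypothetical, and it assumes pattern files use the usual "NAME regex" one-definition-per-line layout.

```shell
#!/bin/sh
# missing_patterns is a hypothetical helper, not part of Logstash or NLS.
# Usage: missing_patterns CONFIG_FILE PATTERN_DIR [PATTERN_DIR...]
# Prints grok pattern names that are referenced but never defined.
missing_patterns() {
    conf=$1
    shift
    refs=$(mktemp) || return 1
    defs=$(mktemp) || return 1

    # Definitions: first word of each non-blank, non-comment line.
    for dir in "$@"; do
        cat "$dir"/*
    done | awk 'NF && $1 !~ /^#/ { print $1 }' | LC_ALL=C sort -u > "$defs"

    # References: NAME from %{NAME} or %{NAME:field}, in the config and
    # inside the pattern files themselves (patterns can nest).
    { cat "$conf"; for dir in "$@"; do cat "$dir"/*; done; } |
        grep -o '%{[A-Za-z0-9_]*' | cut -c3- | LC_ALL=C sort -u > "$refs"

    # Lines only in refs are referenced but undefined.
    LC_ALL=C comm -23 "$refs" "$defs"
    rm -f "$refs" "$defs"
}

# Example (config path is hypothetical):
# missing_patterns /path/to/filters.conf /usr/local/nagioslogserver/logstash/patterns
```

Pass every directory that patterns are loaded from, including the one holding the stock grok patterns. Otherwise built-in names such as IP or WORD will be reported as missing.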

Posted: **Tue Oct 06, 2015 1:36 pm**

Thanks for the information. I attempted to replicate your problem, and logstash.log contained the following error:

Code:

{:timestamp=>"2015-10-06T14:33:44.052000-0400", :message=>"The error reported is: \n  pattern %{COMNEDAPACHELOG} not defined"}

This is a default install of Nagios Log Server, so I checked my init script, and I'm also at the lowest verbosity level. Any chance you're on an older version of NLS? I tested on 2.2 - the issue might be resolved already, and I am hesitant to file a bug report unless I can confirm that the bug still exists in the latest version.
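
For anyone landing here with the same symptom, here is a minimal sketch for pulling these errors out of the log once they are being written. The function name is made up for illustration and the log path in the usage comment is an assumption, so point it at wherever your logstash.log lives.

```shell
#!/bin/sh
# undefined_patterns is a hypothetical helper, not part of Logstash or NLS.
# Prints each grok pattern name that the given log reports as "not defined".
undefined_patterns() {
    grep -o 'pattern %{[A-Za-z0-9_]*} not defined' "$1" |
        sed 's/^pattern %{//; s/} not defined$//' |
        sort -u
}

# Example (log path is an assumption):
# undefined_patterns /var/log/logstash/logstash.log
```

Run against the two log lines quoted in this thread, it would print CISCOFW106010 and COMNEDAPACHELOG.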

Logstash services stopped on all nodes