
increase performance of nxlog / logstash

Posted: Thu Nov 17, 2016 9:59 am
by _asp_
Hi,

one of our servers delivers most of our logfiles.
I wanted to add an additional high-load logfile (high event count and long lines).
When I added the logfile, I noticed that the throughput from this server dropped rapidly; the logs are no longer processed in realtime.
The server running the log server only went up to 15% CPU usage, so there should be room for more.

For log forwarding we are using nxlog.
I assume the bottleneck is either in nxlog, which cannot forward fast enough, or in logstash, which cannot accept fast enough.

How can I tune up the worker threads of nxlog and logstash? I suppose that would help.
This particular nxlog instance is running on Windows 2012, if it matters.

Thanks,
Andreas

Re: increase performance of nxlog / logstash

Posted: Thu Nov 17, 2016 10:13 am
by _asp_
A bit strange: I tried removing another high-load logfile, but the problem remains. No errors in the logs of nxlog or logstash.

Re: increase performance of nxlog / logstash

Posted: Thu Nov 17, 2016 10:47 am
by mcapra
Generally the recommendation would be to add additional NLS instances and configure some sort of load balancing.

If there's nothing crazy happening in /var/log/logstash/logstash.log, I'm pretty sure the bottleneck is in nxlog in this case. Logstash is pretty good about throwing exceptions when it's under extreme stress. If you can provide your logstash and nxlog configurations we *might* be able to offer suggestions on better optimization of those configurations.
_asp_ wrote:How can I tune up the worker threads of nxlog and logstash?
You could certainly tune up the Java system variables by modifying /etc/init.d/logstash, but that's not something we will provide much input/advice on as a matter of policy.
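For context: with the 2.x packages these JVM settings typically live in the service defaults file read by /etc/init.d/logstash rather than in the init script itself. A sketch, assuming Debian-style packaging; the file path and variable names are assumptions and may differ on your system:

```shell
# Hypothetical excerpt from /etc/default/logstash (read by the init script).
LS_HEAP_SIZE="2g"                      # JVM heap for the logstash process
LS_JAVA_OPTS="-XX:+UseConcMarkSweepGC" # extra JVM flags, if needed
```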

Re: increase performance of nxlog / logstash

Posted: Fri Nov 18, 2016 3:38 am
by _asp_
here is my nxlog configuration for the particular server:

Code:

## See the nxlog reference manual at 
## http://nxlog.org/nxlog-docs/en/nxlog-reference-manual.html
 
## Please set the ROOT to the folder your nxlog was installed into,
## otherwise it will not start.
#define ROOT C:\Program Files\nxlog
define ROOT C:\Program Files (x86)\nxlog
define CERT %ROOT%\cert
 
Moduledir %ROOT%\modules
#CacheDir %ROOT%\data
#Pidfile %ROOT%\data\nxlog.pid
#SpoolDir %ROOT%\data
#LogFile %ROOT%\data\nxlog.log

CacheDir e:\nxlog\data
Pidfile g:\nxlog\data\nxlog.pid
SpoolDir e:\nxlog\data
LogFile g:\nxlog\nxlog.log
 
# Include fileop while debugging, also enable in the output module below
<Extension fileop>
	Module xm_fileop
</Extension>
 
<Extension json>
	Module      xm_json
</Extension>
 
<Extension syslog>
	Module xm_syslog
</Extension>
 
<Input internal>
	Module im_internal
</Input>
 
# Watch your own files
<Input nxlog>
	Module   im_file
	#File     '%ROOT%\data\nxlog.log'
	File     'g:\nxlog\nxlog.log'
	SavePos  TRUE
	Exec     $Message = $raw_event;
</Input>


<Input tuxedo-ulog>
	Module im_file
	File "g:\\amest\\logfile\\ULOG\\ULOG." + strftime(now(), "%m%d%y")
	SavePos TRUE

	# build the filename
	Exec if (file_name() =~ /(\d\d)(\d\d)(\d\d)$/) \
	{\
		$logTime ="20"+$3 + "-" + $1 + "-" +$2; \
		if ($raw_event =~/^(\d\d)(\d\d)(\d\d)/ )\
		{\
			$logTime = $logTime + " " + $1+":" + $2 + ":" + $3;\
		}\
	};
	
	# drop trace lines
	Exec if ($raw_event =~/.*TRACE\:.*/) drop();

	Exec $Component = 'Tuxedo';
	Exec $LogFile = 'ULOG';
</Input>
 
<Input tuxedo-useractionlog>
	Module im_file
	File "g:\\amest\\logfile\\useractionlog\\useractions-" + strftime(now(), "%Y%m%d") + ".log"
	SavePos TRUE

	Exec $Component = 'Tuxedo';
	Exec $LogFile = 'useractionlog';
</Input>

 <Input tuxedo_tmadmin_usercount>
	Module im_file
	File "g:/Input_for_Kibana/prod/Tuxedo/tmadmin_usercount-" + strftime(now(), "%Y%m%d") + ".log"
	SavePos TRUE

	Exec $Component = 'Tuxedo';
	Exec $LogFile = 'tmadmin_usercount';
</Input> 
 
<Input rdautoorders>
	Module im_file
	File "g:/amest/logfile/rdAutoOrders/rdautoorders-" + strftime(now(), "%Y%m%d") + ".log"
	SavePos TRUE

	Exec $Component = 'Tuxedo';
	Exec $LogFile = 'rdautoorders';
</Input>
  
<Input tapdispatcher>
	Module im_file
	File "g:/amest/logfile/tapDispatcher/tap-"+ strftime(now(), "%Y%m%d") + ".log"
	SavePos true

	Exec $Component = 'Tuxedo';
	Exec $LogFile = 'tapdispatcher';
</Input>

<Input ttp>
	Module im_file
	File "g:/amest/logfile/ttp/ttp-"+ strftime(now(), "%Y%m%d") + ".log"
	SavePos true
	
	Exec $Component = 'Tuxedo';
	Exec $LogFile = 'ttp';
</Input>

<Input topbeat>
	Module im_file
	File "g:/Input_For_Kibana/topbeat/topbeat.log"
	SavePos true

	Exec $Component = 'SystemStats';
	Exec $LogFile = 'topbeat';
	Exec $osType = 'win';
</Input>

<Input smsc-http-recv>
	Module im_file
	File "g:/amest/logfile/smsc-http-recv/smsc-http-recv-"+ strftime(now(), "%Y%m%d") + ".log"
	SavePos true

	Exec $Component = 'Tuxedo';
	Exec $LogFile = 'smsc-http-recv';
</Input>

<Input smsc-http-send>
	Module im_file
	File "g:/amest/logfile/smsc-http-send/smsc-http-send-"+ strftime(now(), "%Y%m%d") + ".log"
	SavePos true
	
	Exec $Component = 'Tuxedo';
	Exec $LogFile = 'smsc-http-send';
</Input>
 
<Output out>
	# connection stuff
	Module om_tcp
	Host xxx.xxx.xxxx.xxxx
	Port xxxx
	
	# server stuff like name, stage, etc.
	Exec $HostName = 'xxx';
	Exec $Stage = 'Production';
	Exec $ServerType = 'yyyy';
	
	# write copy the raw event to message field
	Exec $message = $raw_event;
	
	# convert raw event including all fields to json
	Exec $raw_event = to_json();
	
	# Uncomment for debug output
	#Exec file_write('g:\nxlog\nxlog_debug_output.log', $raw_event + "\n");
</Output>
 
<Route 1>
	Path internal, tuxedo-ulog, tuxedo-useractionlog, tuxedo_tmadmin_usercount, topbeat, smsc-http-recv,smsc-http-send, rdautoorders, tapdispatcher  => out
</Route>
If I enable ttp (which is the 9th logfile), the throughput drops rapidly and processing falls far behind realtime.

Re: increase performance of nxlog / logstash

Posted: Fri Nov 18, 2016 4:40 am
by _asp_
When I add the ttp log in nxlog, the load in top goes up like this:

Code:

top - 10:37:57 up 4 days, 23:25,  3 users,  load average: 1.40, 0.90, 0.71
Tasks: 255 total,   1 running, 254 sleeping,   0 stopped,   0 zombie
Cpu(s):  7.4%us,  0.5%sy, 14.6%ni, 77.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  20462820k total, 19616600k used,   846220k free,   263096k buffers
Swap:  8388320k total,    56356k used,  8331964k free,  6540136k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
23398 nagios    39  19 5298m 575m  14m S 195.5  2.9   8:29.53 java
61938 nagios    20   0 25.5g  10g 242m S 91.4 54.5   1538:17 java
The 1-minute load average was at 0.5 before.
But I still don't see any bottleneck here on the NLS.


As I wrote before, I double-checked again:
everything from this server, where nxlog has to handle the ttp log, is delayed.
I don't see high disk queues or high I/O on the log drive of the server running nxlog.

All other servers with their own nxlog instances are running as smoothly as before.

Re: increase performance of nxlog / logstash

Posted: Fri Nov 18, 2016 5:08 am
by _asp_
I started a second nxlog instance on the server in foreground mode.
The problem stayed the same.

Re: increase performance of nxlog / logstash

Posted: Fri Nov 18, 2016 9:13 am
by _asp_
I "played" a bit more.

The problem seems to be logstash/filter specific.
I set up an additional logstash instance (2.4.1, pure ELK) on the NLS server, using an additional nxlog instance to ship the log.

I used the following config:

Code:

input
{
        tcp
        {
                type => 'import_json'
                tags => 'import_json'
                port => 5000
                codec => json
        }

}
filter {
    metrics {
        meter => "LogFile"
        add_tag => "metric"
        flush_interval => 10
    }
}
output {
    if "metric" in [tags] {
        stdout {
            #codec => line {
            #    format => "1m rate: %{LogFile.rate_1m} ( %{LogFile.count} )"
            #}
            codec => "rubydebug"
        }
    }
}
output
{
        null{}
}
Here I got a throughput of about 60 events per second. I don't know yet whether that is slower than realtime or whether only about 60 lines per second are being generated.
The interesting part is what happened when I enabled the filter:

the metrics went down, starting at 16 events per second, with throughput falling the longer it runs. Also interesting: the load of the server goes up to 22 when running the separate logstash instance (8 worker threads).

So I assume I need some advice on optimizing my filter code.

Code:

if ([LogFile] == 'ttp' or [type] == 'ttp')
	{
		# drop if it is a comment line
		if [message] =~ /^#.*/
		{
			drop
			{
			}
		}
		else
		{
		
			# parse the message
			grok
			{
				match => ['message','^(?<msisdn>\+\d+)\s*(-)?\s+0x(?<ttpSeqIdHex>[\d\w]{4})\s*\[.*\]\s*\w+\s*(?<logTime>\d{2}\.\d{2}\.\d{4} \d{2}\:\d{2}\:\d{2})\s*(?<sidHex>[\d\w]+)\s*TTP DATA\:(?<payload>(?<ttpProtocolVersionHex>[\d\w]{2}) [\w\d]{2} (?<ttpLengthHex>[\w\d]{2} [\w\d]{2}).*) \(.*']
			}
			# convert hex to integer
			ruby
			{
				code => 
				"
					#convert ttplength
					hexString = event['ttpLengthHex']
					number = hexString[0..1] +  hexString[3..4]
					event['ttpPacketLength'] = number.hex
					
					# convert seq id
					event['ttpSeqIdInt'] = event['ttpSeqIdHex'].to_s.hex
				
					# convert protocol version
					event['ttpProtocolVersionInt'] = event['ttpProtocolVersionHex'].to_s.hex
            
					# convert SID
					event['sidInt'] = event['sidHex'].to_s.hex
					event['sidHex'] = '0x' + event['sidHex'].to_s
					
				"
			}
			
			# check for sid = 0x82 (check as integer because it should be faster)
			if [sidInt] == 130
			{
				# parse payload details (tap protocol for tbm2 only)
				
				if [payload] =~ /.*02 07 02 .. .. .. .. 28 00 00.*/
				{
					# is position message
					mutate { add_field => { 'messageType' => 'Position' } }
				}
				else if [payload] =~ /.*02 07 02 .. .. .. .. 28 00 20.*/
				{
					# is maintenance data
					mutate { add_field => { 'messageType' => 'Maintenance' } }
				}
				else if [payload] =~ /.*02 07 02 .. .. .. .. 28 00 40.*/
				{
					# is driving time message
					mutate { add_field => { 'messageType' => 'DrivingTime' } }

					# process details of driving time messages
					grok
					{
						match => ['payload','.*02 07 02 .. .. .. .. 28 00 40 (?<dtcoDetails>..).*']
					}

					# parse the details of the dtco message
					if [dtcoDetails] == '01'
					{
						mutate { add_field => { 'messageTypeDetails' => 'tco_data_cyclic' } }
					}
					else if [dtcoDetails] == '02'
					{
						mutate { add_field => { 'messageTypeDetails' => 'tco_data_on_card_in' } }
					}
					else if [dtcoDetails] == '03'
					{
						mutate { add_field => { 'messageTypeDetails' => 'tco_data_on_card_out' } }
					}
					else if [dtcoDetails] == '04'
					{
						mutate { add_field => { 'messageTypeDetails' => 'tco_data_on_task_change' } }
					}
					else if [dtcoDetails] == '0d'
					{
						mutate { add_field => { 'messageTypeDetails' => 'tco_data_on_ignition_off' } }
					}
					else if [dtcoDetails] == '0e'
					{
						mutate { add_field => { 'messageTypeDetails' => 'tco_data_on_warn_driver_1' } }
					}
					else if [dtcoDetails] == '12'
					{
						mutate { add_field => { 'messageTypeDetails' => 'tco_data_on_change_followUpTime_1_to_2' } }
					}
					else if [dtcoDetails] == '13'
					{
						mutate { add_field => { 'messageTypeDetails' => 'tco_data_on_motor_on' } }
					}
				}
				else if [payload] =~ /.*02 07 02 .. .. .. .. 28 00 50.*/
				{
					# is tour history message
					mutate { add_field => { 'messageType' => 'TourHistory' } }
				}
				else if [payload] =~ /.*02 07 02 .. .. .. .. 28 00 60.*/
				{
					# is VOA message
					mutate { add_field => { 'messageType' => 'VOA' } }
				}
				else if [payload] =~ /.*02 07 02 .. .. .. .. 28 00 70.*/
				{
					# is geofencing message
					mutate { add_field => { 'messageType' => 'Geofencing' } }
				}
			}
			
			
			# convert extracted timestamp logTime to elasticSearch's timestamp field
			date 
			{
				match => ['logTime', 'dd.MM.YYYY HH:mm:ss']
				timezone => 'UTC'
				remove_field => ['logTime']
			}
			
			
			
			# mutate part
			mutate
			{
				# delete fields no longer used
				remove_field => ['dtcoDetails']
				remove_field => ['payload']
				
			}
		}
	}
I would be glad if you could point out some pitfalls.
I assume I am not allowed to post sample log lines here.
But each line is between 300 and 16k characters long.
The payload field alone can be up to 8k characters.

Re: increase performance of nxlog / logstash

Posted: Fri Nov 18, 2016 10:50 am
by mcapra
_asp_ wrote:I assume I am not allowed to post sample log lines here.
That would depend more on your organization's policy for that sort of thing. If you can provide some sample logs, I can take a closer look for sure. No promises on results, though ;)

I doubt that would fit in a forum post, but you could put it on a service like pastebin and share it that way.
_asp_ wrote: But each line is between 300 and 16k characters long.
The payload field alone can be up to 8k characters.
This information, combined with the logstash filter presented, definitely leads me to believe the bottleneck is logstash.



If payload has a maximum size of 8k characters, I can pretty much guarantee the regex matches against that field are a large part of what's causing the bottleneck. Stuff like if [payload] =~ /.*02 07 02 .. .. .. .. 28 00 60.*/ has an awful lot of overhead, since the message needs to be evaluated in its entirety with matches like this.
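For what it's worth, the leading and trailing .* in those conditionals buy nothing: Logstash's =~ (like Ruby's) already performs an unanchored search, so the pattern without the .* wrappers matches exactly the same payloads while giving the engine far less room to backtrack. A quick plain-Ruby sketch (Logstash 2.x filters run on JRuby, so the behavior carries over):

```ruby
require 'benchmark'

# Synthetic payload in the 300..16k character range described above.
payload = ('aa ' * 2000) + '02 07 02 11 22 33 44 28 00 60 ' + ('bb ' * 500)

wrapped = /.*02 07 02 .. .. .. .. 28 00 60.*/  # as used in the filter
lean    = /02 07 02 .. .. .. .. 28 00 60/      # same match, less backtracking

# Both classify the payload identically...
raise 'mismatch' unless (payload =~ wrapped).nil? == (payload =~ lean).nil?

# ...but the wrapped version does extra work. Rough timing (numbers vary by engine):
n = 2_000
puts format('wrapped: %.3fs', Benchmark.realtime { n.times { payload =~ wrapped } })
puts format('lean:    %.3fs', Benchmark.realtime { n.times { payload =~ lean } })
```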

If all you're examining is hexadecimal data, you could probably write a pretty lean Ruby filter to leverage some bitwise operators and tag the messages that way:
https://calleerlandsson.com/rubys-bitwise-operators/

The added benefit of using bitwise operators is that you could effectively knock the message's size down at each evaluation step, but that depends on how the data is structured.
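As a sketch of that idea: decode the space-separated hex bytes once, then classify with integer comparisons instead of repeated regex scans over the full payload. The byte layout below (a type byte following a 02 07 02 .. .. .. .. 28 00 header) is inferred from the patterns in the posted filter, not from the real protocol, so treat the offsets as assumptions:

```ruby
# Hypothetical classifier for a logstash ruby filter. The header layout
# and offsets are assumptions read off the regexes in the posted filter.
MESSAGE_TYPES = {
  0x00 => 'Position',
  0x20 => 'Maintenance',
  0x40 => 'DrivingTime',
  0x50 => 'TourHistory',
  0x60 => 'VOA',
  0x70 => 'Geofencing'
}.freeze

def classify(payload)
  bytes = payload.split.map { |b| b.to_i(16) }
  # Locate the 02 07 02 .. .. .. .. 28 00 header with plain comparisons.
  idx = (0..bytes.length - 10).find do |i|
    bytes[i] == 0x02 && bytes[i + 1] == 0x07 && bytes[i + 2] == 0x02 &&
      bytes[i + 7] == 0x28 && bytes[i + 8] == 0x00
  end
  return nil unless idx
  # If the low nibble ever carries sub-type flags, a mask (byte & 0xF0)
  # would group the variants; here we look the type byte up directly.
  MESSAGE_TYPES[bytes[idx + 9]]
end
```

Inside a ruby { code => ... } block you would feed it event['payload'] and set event['messageType'] from the return value.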

The grok filter applied to the message also has a lot of overhead:

Code:

grok
         {
            match => ['message','^(?<msisdn>\+\d+)\s*(-)?\s+0x(?<ttpSeqIdHex>[\d\w]{4})\s*\[.*\]\s*\w+\s*(?<logTime>\d{2}\.\d{2}\.\d{4} \d{2}\:\d{2}\:\d{2})\s*(?<sidHex>[\d\w]+)\s*TTP DATA\:(?<payload>(?<ttpProtocolVersionHex>[\d\w]{2}) [\w\d]{2} (?<ttpLengthHex>[\w\d]{2} [\w\d]{2}).*) \(.*']
         }
That particular step might be handled better at the nxlog level (or using some other sort of pre-processor). If payload makes up the bulk of the message, as your numbers indicate, I would suggest breaking it out into its own field in nxlog and dropping it from message. That would take an awful lot of load away from logstash and move it to nxlog.
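In nxlog that split could look roughly like this (a sketch only: the capture regex is illustrative and would have to be adapted to the real ttp line format):

```
# Hypothetical change to the ttp input: peel the payload off the raw
# line in nxlog so logstash receives it as a separate, smaller field.
<Input ttp>
	Module im_file
	File "g:/amest/logfile/ttp/ttp-" + strftime(now(), "%Y%m%d") + ".log"
	SavePos TRUE

	Exec if $raw_event =~ /^(.+?)TTP DATA:(.+)$/ \
	     { \
	         $message = $1; \
	         $payload = $2; \
	     }

	Exec $Component = 'Tuxedo';
	Exec $LogFile = 'ttp';
</Input>
```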

Re: increase performance of nxlog / logstash

Posted: Mon Nov 21, 2016 3:36 am
by _asp_
Thanks for the answer.

I will try improving the regexes first: I will get rid of some of the .* constructs and reduce the number of steps the regex engine needs. I assume that's where the biggest potential is.
Once that's done, I will keep you updated on the result.

Re: increase performance of nxlog / logstash

Posted: Mon Nov 21, 2016 4:36 am
by _asp_
The performance increased from 16 events/s to 160 events/s.

But what I don't understand is the following:
my CPU is only under 30% load (4 CPUs with hyperthreading), and logstash has 8 worker threads. Is there a way to increase multithreading per logfile? There is no need for logstash to keep this log serialized, since there is no multi-line content in this log and no multiline filter in logstash at all; multi-line handling I am currently doing in nxlog.

The disk (SSD) seems to be no bottleneck either (disk queue is 0.5 and active time at 25%).
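Regarding the worker-thread question: in Logstash 2.2+ the filter stage already runs in parallel across pipeline workers, even for a single input, so events from one file are not inherently serialized unless a filter forces it (the multiline filter does exactly that, but since multi-line handling happens in nxlog here, that shouldn't apply). If the workers look underused, the worker count and batch size can be raised at startup; a sketch, assuming the Logstash 2.4 package layout, where the file path and variable name are assumptions:

```shell
# Hypothetical excerpt from /etc/default/logstash (or /etc/sysconfig/logstash):
# --pipeline-workers defaults to the number of CPU cores.
LS_OPTS="--pipeline-workers 8 --pipeline-batch-size 250"
```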