Fluctuating Amount of Logs

Post by **tvoll** » Fri Aug 16, 2019 12:48 pm

I recently noticed a strange trend with my Nagios Log Server install. At random times and intervals, the log intake drops from hundreds of thousands of logs (usually 500k-700k) to thousands (usually around 1k-4k). The sources that produce these logs are consistent in their output, yet the Log Server shows huge dips in activity.

In the Logstash logs I keep seeing this message: "Received an event that has a different character encoding than you configured." along with expected_charset=>"UTF-8".
From what I have read elsewhere online, this could be caused by a configuration issue in the Log Server's inputs. However, the fixes I have found online do not resolve the issue when I apply them to our server.

The only input I currently have is as follows:

Code:

tcp {
  port => "5544"
  type => "syslog"
  codec => plain {
    charset => "ISO-8859-1"
  }
}
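For context on the warning: the plain codec's `charset` option tells Logstash which encoding the incoming bytes are in so that it can convert them to UTF-8, and UTF-8 is the default when `charset` is omitted. The short sketch below is plain Python, not Logstash code, and its sample text is made up. It illustrates the two failure modes in play: bytes that are valid ISO-8859-1 but not valid UTF-8, and UTF-8 bytes that turn into mojibake when they are decoded as ISO-8859-1.

```python
def check_utf8(payload: bytes) -> tuple:
    """Return (True, None) if payload is valid UTF-8, else (False, offset of first bad byte)."""
    try:
        payload.decode("utf-8")
        return True, None
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte that breaks UTF-8 decoding
        return False, err.start


def decode_as_latin1(payload: bytes) -> str:
    """Decode as ISO-8859-1, which maps every byte 0x00-0xFF to a character, so it never fails."""
    return payload.decode("iso-8859-1")


# Hypothetical sample: "é" written as a single ISO-8859-1 byte (0xE9)
latin1_bytes = b"user=Jos\xe9 logged in"
# The same text as a UTF-8 sender would encode it ("é" becomes 0xC3 0xA9)
utf8_bytes = "user=Jos\u00e9 logged in".encode("utf-8")

print(check_utf8(latin1_bytes))        # (False, 8)
print(check_utf8(utf8_bytes))          # (True, None)
print(decode_as_latin1(latin1_bytes))  # user=José logged in
print(decode_as_latin1(utf8_bytes))    # user=JosÃ© logged in
```

In other words, a charset setting that does not match what the senders actually emit either triggers the warning or silently garbles non-ASCII characters.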

Post by **ssax** » Fri Aug 16, 2019 2:43 pm

Please attach a copy of your profile from Admin > System Status > Download System Profile so that we can review the logs.

Do you see any drops/errors on the Log Server's NIC, or on any of the ports/firewalls/IPS devices in the network path?

Code:

ifconfig -a
ethtool -S INTERFACENAME

Post by **tvoll** » Tue Aug 20, 2019 10:16 am

I am not seeing any errors related to ports or firewalls.
After running ethtool -S on the interface our Log Server uses, I get this output:

Code:

NIC statistics:
     rx_packets: 32760508
     tx_packets: 10867519
     rx_bytes: 43505349930
     tx_bytes: 714017812
     rx_broadcast: 0
     tx_broadcast: 0
     rx_multicast: 0
     tx_multicast: 0
     rx_errors: 0
     tx_errors: 0
     tx_dropped: 0
     multicast: 0
     collisions: 0
     rx_length_errors: 0
     rx_over_errors: 0
     rx_crc_errors: 0
     rx_frame_errors: 0
     rx_no_buffer_count: 0
     rx_missed_errors: 0
     tx_aborted_errors: 0
     tx_carrier_errors: 0
     tx_fifo_errors: 0
     tx_heartbeat_errors: 0
     tx_window_errors: 0
     tx_abort_late_coll: 0
     tx_deferred_ok: 0
     tx_single_coll_ok: 0
     tx_multi_coll_ok: 0
     tx_timeout_count: 0
     tx_restart_queue: 0
     rx_long_length_errors: 0
     rx_short_length_errors: 0
     rx_align_errors: 0
     tx_tcp_seg_good: 4510
     tx_tcp_seg_failed: 0
     rx_flow_control_xon: 0
     rx_flow_control_xoff: 0
     tx_flow_control_xon: 0
     tx_flow_control_xoff: 0
     rx_long_byte_count: 43505349930
     rx_csum_offload_good: 32674732
     rx_csum_offload_errors: 17
     alloc_rx_buff_failed: 0
     tx_smbus: 0
     rx_smbus: 0
     dropped_smbus: 0

I've attached the System Profile as well.

Support edit: System profile downloaded and shared with team.
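Long `ethtool -S` listings are easy to misread. As a rough aid, here is a Python sketch that parses the `name: value` lines and reports non-zero counters whose names contain "error", "drop", "missed", or "fail". That keyword list is an assumption, since counter names vary by NIC driver. On the output above it would flag only `rx_csum_offload_errors: 17`. These counters are typically cumulative, so comparing snapshots taken before and after a dip says more than a single reading.

```python
import re

# Counter-name fragments treated as "problem" counters (an assumption;
# names differ between NIC drivers).
PROBLEM_KEYWORDS = ("error", "drop", "missed", "fail")

# Matches lines such as "     rx_errors: 0"; the "NIC statistics:" header has no number
LINE_RE = re.compile(r"^\s*([A-Za-z0-9_.\[\]-]+):\s*(\d+)\s*$")


def parse_ethtool_stats(text: str) -> dict:
    """Parse `ethtool -S` output into {counter_name: int}."""
    stats = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            stats[match.group(1)] = int(match.group(2))
    return stats


def problem_counters(stats: dict) -> dict:
    """Return non-zero counters whose names suggest errors or drops."""
    return {
        name: value
        for name, value in stats.items()
        if value and any(keyword in name for keyword in PROBLEM_KEYWORDS)
    }


# A few lines taken from the output above
sample = """NIC statistics:
     rx_packets: 32760508
     rx_errors: 0
     tx_dropped: 0
     rx_missed_errors: 0
     rx_csum_offload_errors: 17
     alloc_rx_buff_failed: 0
"""

print(problem_counters(parse_ethtool_stats(sample)))
# {'rx_csum_offload_errors': 17}
```

To use it on the Log Server, save the output of `ethtool -S INTERFACENAME` to a file and pass its contents to `parse_ethtool_stats`.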

Post by **cdienger** » Tue Aug 20, 2019 2:46 pm

It looks like the Log Server is having trouble with the characters in the SerialNo field being sent over. Are you able to view the logs on the device? I'd be curious what the logs look like there, as well as the raw data being sent. The raw data can be gathered with:

Code:

yum -y install tcpdump
tcpdump -s 0 -i any port 5544 and host w.x.y.z -w output.pcap

where w.x.y.z is the IP address of a device that is generating these errors in the logs. Let this run long enough to capture some of this traffic, use CTRL+C to stop it, and then PM us the output.pcap file it creates.
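The capture can also be checked locally before it is sent. Below is a minimal, standard-library-only Python sketch that reads a classic libpcap file (the format `tcpdump -w` writes by default), extracts IPv4 TCP payloads, and counts per source IP how many segments are not valid UTF-8. It assumes IPv4 only and handles just the Ethernet and Linux "cooked" link types (captures taken with `-i any` use the latter). It does not reassemble TCP streams, so a multi-byte character split across two segments could be miscounted. The function names are illustrative.

```python
import struct
from collections import Counter

LINKTYPE_ETHERNET = 1
LINKTYPE_LINUX_SLL = 113   # Linux "cooked" capture, used by tcpdump -i any
LINKTYPE_LINUX_SLL2 = 276  # newer variant of the cooked header
ETHERTYPE_IPV4 = 0x0800
IPPROTO_TCP = 6


def ipv4_offset(linktype, frame):
    """Return the offset of the IPv4 header inside a captured frame, or None."""
    if linktype == LINKTYPE_ETHERNET and len(frame) >= 14:
        proto, offset = struct.unpack("!H", frame[12:14])[0], 14
    elif linktype == LINKTYPE_LINUX_SLL and len(frame) >= 16:
        proto, offset = struct.unpack("!H", frame[14:16])[0], 16
    elif linktype == LINKTYPE_LINUX_SLL2 and len(frame) >= 20:
        proto, offset = struct.unpack("!H", frame[0:2])[0], 20
    else:
        return None
    return offset if proto == ETHERTYPE_IPV4 else None


def tcp_payloads(path):
    """Yield (source_ip, payload) for every IPv4 TCP segment that carries data."""
    with open(path, "rb") as f:
        header = f.read(24)
        magic = header[:4]
        if magic in (b"\xd4\xc3\xb2\xa1", b"\x4d\x3c\xb2\xa1"):
            endian = "<"  # little-endian file (microsecond or nanosecond timestamps)
        elif magic in (b"\xa1\xb2\xc3\xd4", b"\xa1\xb2\x3c\x4d"):
            endian = ">"  # big-endian file
        else:
            raise ValueError("not a classic pcap file (pcapng is not handled)")
        # The low 16 bits of the last global-header field hold the link type
        linktype = struct.unpack(endian + "I", header[20:24])[0] & 0xFFFF
        while True:
            record = f.read(16)
            if len(record) < 16:
                break  # end of file
            incl_len = struct.unpack(endian + "IIII", record)[2]
            frame = f.read(incl_len)
            ip = ipv4_offset(linktype, frame)
            if ip is None or len(frame) < ip + 20 or frame[ip + 9] != IPPROTO_TCP:
                continue
            ihl = (frame[ip] & 0x0F) * 4
            total_len = struct.unpack("!H", frame[ip + 2:ip + 4])[0]
            tcp = ip + ihl
            if len(frame) < tcp + 20:
                continue
            data_offset = (frame[tcp + 12] >> 4) * 4
            # Slice to the IP total length so Ethernet padding is excluded
            payload = frame[tcp + data_offset:ip + total_len]
            if payload:
                source = ".".join(str(b) for b in frame[ip + 12:ip + 16])
                yield source, payload


def non_utf8_by_source(path):
    """Count, per source IP, the TCP segments that are not valid UTF-8."""
    counts = Counter()
    for source, payload in tcp_payloads(path):
        try:
            payload.decode("utf-8")
        except UnicodeDecodeError:
            counts[source] += 1
    return counts
```

For example, `print(non_utf8_by_source("output.pcap"))` would print a `Counter` mapping each sending IP to its number of non-UTF-8 segments, which narrows down which devices to inspect.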

Post by **tvoll** » Thu Aug 22, 2019 9:28 am

Unfortunately, I am told "The extension pcap is not allowed." in both this post and in the PM.

Post by **mbellerue** » Thu Aug 22, 2019 3:32 pm

Oh, maybe try zipping the file, or just taking the extension off.

Support edit: output.zip shared with team

Post by **tvoll** » Tue Aug 27, 2019 1:09 pm

It has been a few days since the output was received.
Are there any updates?

Post by **mbellerue** » Tue Aug 27, 2019 4:54 pm

We found that a number of Windows clients are sending to the syslog port (5544) when they should be sending to the Windows Event Log port (3515). For example, 172.31.55.48 should be sending to 3515. Please double-check your Windows clients to make sure they are all sending to 3515.
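The two ports expect different message formats. Assuming the Windows clients are agents such as NXLog that forward each event as one JSON object per line (check your own agent configuration), a rough Python heuristic can label sample lines from a capture as syslog-style (starting with a `<PRI>` header) or JSON, which helps spot senders pointed at the wrong port. The heuristic, the function names, and the sample lines are all hypothetical; this is not something Nagios Log Server itself does.

```python
import json
import re

# RFC 3164 / RFC 5424 syslog messages begin with a priority value such as "<13>"
SYSLOG_PRI = re.compile(r"^<\d{1,3}>")


def classify_line(line: str) -> str:
    """Heuristically label one received line as 'syslog', 'json', or 'unknown'."""
    text = line.strip()
    if SYSLOG_PRI.match(text):
        return "syslog"
    try:
        if isinstance(json.loads(text), dict):
            return "json"  # e.g. a Windows event forwarded by an agent as JSON
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        pass
    return "unknown"


def misrouted_sources(lines_by_source: dict) -> list:
    """Return sources that sent JSON objects to the syslog port."""
    return sorted(
        source
        for source, lines in lines_by_source.items()
        if any(classify_line(text) == "json" for text in lines)
    )


# Hypothetical lines captured on the syslog input (port 5544 in the config posted earlier)
seen_on_5544 = {
    "10.0.0.5": ["<13>Aug 16 12:48:00 fw01 kernel: link up"],
    "172.31.55.48": ['{"EventID": 4624, "Hostname": "WIN-01"}'],
}
print(misrouted_sources(seen_on_5544))  # ['172.31.55.48']
```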

Post by **tvoll** » Wed Sep 04, 2019 9:47 am

Made the change.
Logs went back up to the hundreds of thousands, and then recently dipped back down to the thousands again.
Issue still persists.

Post by **cdienger** » Wed Sep 04, 2019 2:43 pm

Please gather a profile the next time you see a dip, as well as a screenshot highlighting the dip and the time it occurred. I'd like to check whether anything interesting is being logged during a dip. Please also verify the time on the NLS machine with the "date" command, as well as the time and timezone on the machine running the browser used to access the NLS web UI (the UI adjusts displayed times according to the browser's time).

That said, I would also suggest increasing the memory allocated to Logstash per https://support.nagios.com/kb/article/n ... g-576.html. I don't see Logstash crashing, but this may help its behavior.
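For the time check requested above: the web UI displays times in the browser's timezone, while `date` on the NLS machine reports the server's clock, so converting the time of a dip to UTC first makes the two easy to compare. A minimal Python 3.9+ sketch using the standard `zoneinfo` module; the timezone names and the dip time are placeholders to replace with your own.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+; needs system tz data (or the tzdata package)


def to_utc(local_time: str, tz_name: str) -> datetime:
    """Interpret 'YYYY-MM-DD HH:MM' as wall-clock time in tz_name and return it in UTC."""
    naive = datetime.strptime(local_time, "%Y-%m-%d %H:%M")
    return naive.replace(tzinfo=ZoneInfo(tz_name)).astimezone(timezone.utc)


def to_zone(utc_time: datetime, tz_name: str) -> datetime:
    """Express an aware UTC datetime in another timezone (e.g. the server's)."""
    return utc_time.astimezone(ZoneInfo(tz_name))


# Hypothetical: a dip seen at 14:43 in a browser set to US Central time
dip_utc = to_utc("2019-09-04 14:43", "America/Chicago")
print(dip_utc.isoformat())                               # 2019-09-04T19:43:00+00:00
print(to_zone(dip_utc, "America/New_York").isoformat())  # 2019-09-04T15:43:00-04:00
```

Using named zones rather than fixed offsets lets daylight saving time be applied correctly for the date in question.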

Nagios Support Forum

Fluctuating Amount of Logs

Fluctuating Amount of Logs

Re: Fluctuating Amount of Logs

Re: Fluctuating Amount of Logs

Re: Fluctuating Amount of Logs

Re: Fluctuating Amount of Logs

Re: Fluctuating Amount of Logs

Re: Fluctuating Amount of Logs

Re: Fluctuating Amount of Logs

Re: Fluctuating Amount of Logs

Re: Fluctuating Amount of Logs