
Log Collection randomly stops almost completely

Posted: Thu Mar 11, 2021 10:17 am
by rferebee
Good morning Nagios team,

Can you take a look at the attached screenshot? I'd like someone to help me understand why log collection randomly comes to an almost complete halt in my environment from time to time. What logs can I look at to figure out why this happens? I was working when it occurred, and both logstash and elasticsearch showed as running and active on all 3 nodes in my cluster. Apart from the console being extremely slow at the time log collection dipped, I could see no discernible cause. CPU and memory usage were fine (a bit low on the CPU side).

This happens completely at random 2-3 times per month, and it doesn't recover until scheduled maintenance and snapshots run.

Thank you!

Re: Log Collection randomly stops almost completely

Posted: Fri Mar 12, 2021 10:22 am
by jdunitz
You might start by looking at the auditlog to see if anything unusual shows up:

[root@jpd-nagiosls2 var]# egrep -v "Finished run|snapshot" /usr/local/nagioslogserver/var/auditlog.log | less

That will filter out a lot of the job-running and snapshot messages you don't need to look at right away.

You can also look in /usr/local/nagioslogserver/logstash for any hs_err logs and see if those look relevant.
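
If it helps, here is a rough sketch for listing only the non-empty crash logs in that directory, newest first. It assumes GNU find (for -printf) and the JVM's usual hs_err_pid*.log naming; list_hs_err is just a name I made up for the helper.

```shell
# List non-empty hs_err* files in a directory, newest first.
# Assumption: GNU find is available (-printf is a GNU extension).
# The default directory is the logstash directory mentioned above.
list_hs_err() {
    dir="${1:-/usr/local/nagioslogserver/logstash}"
    # Output columns: modification date, size in bytes, path
    find "$dir" -maxdepth 1 -type f -name 'hs_err*' ! -empty \
        -printf '%TY-%Tm-%Td %s %p\n' 2>/dev/null | sort -r
}

# Usage (run on each node):
#   list_hs_err
#   list_hs_err /some/other/dir
```

An empty result just means there are no crash logs with any content in that directory.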

Let's start with those.

--Jeffrey

Re: Log Collection randomly stops almost completely

Posted: Fri Mar 12, 2021 11:08 am
by rferebee
Good morning,

Looking at the auditlog on each node, I'm not seeing anything out of the ordinary besides the alerts I have set up running on their schedules.

I do see one hs_err log, but it's empty and 2 years old.

Re: Log Collection randomly stops almost completely

Posted: Mon Mar 15, 2021 10:32 am
by jdunitz
Can you PM me a system profile so we can look at your logs and configs?

Alternatively, you can open a ticket and attach your profile there, if you'd rather.

Thanks!

--Jeffrey

Re: Log Collection randomly stops almost completely

Posted: Mon Mar 15, 2021 10:44 am
by rferebee
PM sent.

Re: Log Collection randomly stops almost completely

Posted: Mon Mar 15, 2021 12:01 pm
by jdunitz
I discovered something looking at your logs:

Every line in the logstash log is like this:

{:timestamp=>"2021-03-15T06:12:13.731000-0700", :message=>"Received an event that has a different character encoding than you configured.", :text=>"{\\\"EventReceivedTime\\\":\\\"2021-03-15 06:12:10\\\",\\\"SourceModuleName\\\":

and most of the previous one...

jdunitz:~/.../system-profile/logstashlogs
$ wc -l logstash.log-*15
17620 logstash.log-20210315

jdunitz:~/.../system-profile/logstashlogs
$ grep "different chara" logstash.log*15 | wc -l
17585

And almost all of yesterday's:

jdunitz:~/.../system-profile/logstashlogs
$ wc -l logstash.log-*14
20869 logstash.log-20210314

jdunitz:~/.../system-profile/logstashlogs
$ grep "different chara" logstash.log*14 | wc -l
20833

Did something change on the sending side? Clearly, the logs that are being sent don't match the expected format.
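
If you want to repeat that comparison on your own nodes, the two commands above can be rolled into one small helper. This is only a sketch: encoding_warning_ratio is a made-up name, and you would need to point it at wherever your logstash logs actually live.

```shell
# For each file given, report how many lines contain the encoding warning,
# out of the total line count, plus the percentage.
encoding_warning_ratio() {
    for f in "$@"; do
        total=$(wc -l < "$f")
        # grep -c prints 0 and returns non-zero when nothing matches
        hits=$(grep -c "different character encoding" "$f" || true)
        awk -v f="$f" -v h="$hits" -v t="$total" \
            'BEGIN { printf "%s: %d of %d lines (%.1f%%)\n", f, h, t, (t ? 100 * h / t : 0) }'
    done
}

# Usage:
#   encoding_warning_ratio logstash.log-*
```

A percentage that jumps on the days collection stalls would point toward the sender; one that is always this high would suggest it is just background noise.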

--Jeffrey

Re: Log Collection randomly stops almost completely

Posted: Mon Mar 15, 2021 12:07 pm
by rferebee
Nothing that I'm aware of; however, it's certainly possible. We're in a bit of a siloed environment, so if another group makes a change and doesn't let my group know... we don't typically find out about it until something breaks.

How can I proceed to get this mismatch corrected?

Re: Log Collection randomly stops almost completely

Posted: Tue Mar 16, 2021 12:40 pm
by jdunitz
You'd have to look at the application that's sending the logs and see if the format has changed. Also, consider that these are only about 20k errors; if you have millions of events coming in, this may not be a big deal.

Another thing to explore is whether you're hitting any kind of kernel limits.

Check "ulimit -S -n" and "ulimit -H -n" for the nagios user and make sure they're not still just 4096.
You may need to edit /etc/security/limits.conf, specifying bigger numbers (or even unlimited) for the nagios user.
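
Note that ulimit in your own shell only shows the limits for that shell. As a rough sketch (Linux only, since it reads /proc), this shows the limits a running process actually has. nofile_limits is a made-up helper name, the pgrep pattern is a guess at how the logstash process appears on your system, and 65536 is only an illustrative value, not a recommendation from us.

```shell
# Print the soft and hard "Max open files" limits of a process.
# With no argument it reports on the calling process itself.
nofile_limits() {
    pid="${1:-self}"
    # /proc/<pid>/limits columns: name (3 words), soft, hard, units
    awk '/^Max open files/ { print "soft=" $4, "hard=" $5 }' "/proc/$pid/limits"
}

# Usage (the pgrep pattern is an assumption, adjust to match your process list):
#   for p in $(pgrep -u nagios -f logstash); do nofile_limits "$p"; done
#
# Example /etc/security/limits.conf entries (illustrative value only):
#   nagios soft nofile 65536
#   nagios hard nofile 65536
```

For nofile, a concrete number is generally the safer choice, and the services need a restart before a new limit applies to them.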

If that doesn't help, you could try tweaking
sysctl -n net.ipv4.tcp_rmem
and
sysctl -n net.ipv4.tcp_mem

and boost them up to 2x or 4x their current values (probably 4096 and 20796 are the defaults).
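
Both of those settings hold three numbers each, so all three need scaling together. A minimal sketch of doing that, assuming you run it as root; scale_sysctl_values is a made-up helper name, and the change made by sysctl -w does not survive a reboot unless you also add it to /etc/sysctl.conf.

```shell
# Multiply every whitespace-separated number by a factor.
# $1 is the factor, the remaining arguments are the current values.
scale_sysctl_values() {
    factor="$1"; shift
    echo "$*" | awk -v k="$factor" \
        '{ for (i = 1; i <= NF; i++) printf "%s%d", (i > 1 ? " " : ""), $i * k; print "" }'
}

# Usage (as root):
#   current=$(sysctl -n net.ipv4.tcp_rmem)
#   sysctl -w net.ipv4.tcp_rmem="$(scale_sysctl_values 2 $current)"
#   current=$(sysctl -n net.ipv4.tcp_mem)
#   sysctl -w net.ipv4.tcp_mem="$(scale_sysctl_values 2 $current)"
```

Note the original values somewhere before changing them so you can put them back if it makes no difference.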

Hopefully these are helpful. Let me know if anything new develops.


--Jeffrey

Re: Log Collection randomly stops almost completely

Posted: Wed Mar 17, 2021 10:44 am
by ssax
Locking thread; ticket received. We will continue support through the ticket.

Thank you!