Log Collection randomly stops almost completely

This support forum board is for support questions relating to Nagios Log Server, our solution for managing and monitoring critical log data.
Locked
rferebee
Posts: 733
Joined: Wed Jul 11, 2018 11:37 am

Log Collection randomly stops almost completely

Post by rferebee »

Good morning Nagios team,

Can you take a look at the attached screenshot? I'd like someone to help me understand why log collection randomly comes to an almost complete halt in my environment from time to time. What logs can I look at to figure out why this happens? I was at work when it occurred, and both logstash and elasticsearch showed as running and active on all 3 nodes in my cluster. Apart from the console being extremely slow at the time log collection dipped, I could see no discernible cause for what happened. CPU and memory usage were fine (a bit low on the CPU side).

This happens completely at random 2-3 times per month, and it doesn't recover until the scheduled maintenance and snapshot jobs run.

Thank you!
jdunitz
Posts: 235
Joined: Wed Feb 05, 2020 2:50 pm

Re: Log Collection randomly stops almost completely

Post by jdunitz »

You might start by looking at the auditlog to see if anything unusual shows up:

[root@jpd-nagiosls2 var]# egrep -v "Finished run|snapshot" /usr/local/nagioslogserver/var/auditlog.log | less

That will filter out a lot of the job-running and snapshot messages you don't need to look at right away.

You can also look in /usr/local/nagioslogserver/logstash for any hs_err logs and see if those look relevant.
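As a rough sketch of that check: JVM fatal error logs are normally named hs_err_pid<pid>.log, so something like the following lists only the non-empty ones modified recently. The function name `recent_hs_err` and the 7-day default are made up for illustration; the directory is the one mentioned above.

```shell
# recent_hs_err DIR [DAYS]
# List non-empty hs_err* files under DIR modified within the last DAYS days
# (default 7). Empty or stale crash logs are skipped.
recent_hs_err() {
    find "$1" -type f -name 'hs_err*' -size +0 -mtime "-${2:-7}"
}

# Example usage (path taken from the post above):
#   recent_hs_err /usr/local/nagioslogserver/logstash 30
```

No output means there are no recent JVM crash logs in that directory, and you can move on to other causes.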

Let's start with those.

--Jeffrey
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.

Be sure to check out our Knowledgebase for helpful articles and solutions!
rferebee
Posts: 733
Joined: Wed Jul 11, 2018 11:37 am

Re: Log Collection randomly stops almost completely

Post by rferebee »

Good morning,

Looking at the auditlog on each node, I'm not seeing anything out of the ordinary besides the alerts I have set up running on their schedules.

I do see one hs_err log, but it's empty and 2 years old.
jdunitz
Posts: 235
Joined: Wed Feb 05, 2020 2:50 pm

Re: Log Collection randomly stops almost completely

Post by jdunitz »

Can you PM me a system profile so we can look at your logs and configs?

Alternatively, you can open a ticket and attach your profile there, if you'd rather.

Thanks!

--Jeffrey
rferebee
Posts: 733
Joined: Wed Jul 11, 2018 11:37 am

Re: Log Collection randomly stops almost completely

Post by rferebee »

PM sent.
jdunitz
Posts: 235
Joined: Wed Feb 05, 2020 2:50 pm

Re: Log Collection randomly stops almost completely

Post by jdunitz »

I discovered something looking at your logs:

Every line in the logstash log is like this:

{:timestamp=>"2021-03-15T06:12:13.731000-0700", :message=>"Received an event that has a different character encoding than you configured.", :text=>"{\\\"EventReceivedTime\\\":\\\"2021-03-15 06:12:10\\\",\\\"SourceModuleName\\\":



and most of the previous one...

jdunitz:~/.../system-profile/logstashlogs
$ wc -l logstash.log-*15
17620 logstash.log-20210315
jdunitz:~/.../system-profile/logstashlogs
$ grep "different chara" logstash.log*15 | wc -l
17585

And almost all of yesterday's:

jdunitz:~/.../system-profile/logstashlogs
$ wc -l logstash.log-*14
20869 logstash.log-20210314
jdunitz:~/.../system-profile/logstashlogs
$ grep "different chara" logstash.log*14 | wc -l
20833

Did something change on the sending side? Clearly, the logs that are being sent don't match the expected format.
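If you want to repeat that count yourself across several rotated logs, here is a minimal sketch. The function name `encoding_warning_share` is hypothetical; the search string is taken from the message shown above.

```shell
# encoding_warning_share FILE...
# For each file, print how many lines contain the charset-mismatch message,
# the total line count, and the resulting percentage.
encoding_warning_share() {
    for file in "$@"; do
        awk '/different character encoding/ { hits++ }
             END {
                 pct = NR ? 100 * hits / NR : 0
                 printf "%s: %d of %d lines (%.1f%%)\n", FILENAME, hits, NR, pct
             }' "$file"
    done
}

# Example usage, run from the directory holding the rotated logs:
#   encoding_warning_share logstash.log-*
```

A share near 100% on some days and near 0% on others would suggest a specific sender or a change on a specific date.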

--Jeffrey
rferebee
Posts: 733
Joined: Wed Jul 11, 2018 11:37 am

Re: Log Collection randomly stops almost completely

Post by rferebee »

Nothing that I'm aware of, but it's certainly possible. We're in a bit of a siloed environment, so if another group makes a change and doesn't let my group know... we don't typically find out about it until something breaks.

How can I proceed to get this mismatch corrected?
jdunitz
Posts: 235
Joined: Wed Feb 05, 2020 2:50 pm

Re: Log Collection randomly stops almost completely

Post by jdunitz »

You'd have to look at the application that's sending the logs, and see if the format has changed. Also, consider that these are only about 20k errors; if you have millions of events coming in, this may not be a big deal.
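One way to check a captured sample from the sender, as a sketch only: this assumes the input on the Log Server side expects UTF-8, which this thread does not confirm, and that iconv is installed. The function name `check_utf8` and the sample file are hypothetical.

```shell
# check_utf8 FILE
# Report whether FILE decodes cleanly as UTF-8. iconv exits non-zero when it
# meets a byte sequence that is not valid in the source encoding.
check_utf8() {
    if iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1; then
        echo "valid UTF-8"
    else
        echo "not valid UTF-8"
    fi
}

# Example usage on a sample saved from the sending application:
#   check_utf8 /tmp/sample-events.log
```

If the sample is not valid UTF-8, either the sender's output encoding or the charset configured on the receiving input would need to change so the two match.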

Another thing to explore is whether you're hitting any kind of kernel limits.

Check "ulimit -S -n" and "ulimit -H -n" for the nagios user and make sure they're not still just 4096.
You may need to edit /etc/security/limits.conf, specifying bigger numbers (or even unlimited) for the nagios user.
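A quick sketch of that check, to be run in a shell owned by the nagios user (for example via `su -s /bin/sh nagios`). The helper name `nofile_status` and the 65536 figure in the comments are illustrative assumptions, not recommended values.

```shell
# nofile_status LIMIT [THRESHOLD]
# Classify an open-files limit as printed by `ulimit -n`: "ok" when it is
# "unlimited" or above THRESHOLD (default 4096), otherwise "low".
nofile_status() {
    limit=$1
    threshold=${2:-4096}
    if [ "$limit" = "unlimited" ] || [ "$limit" -gt "$threshold" ]; then
        echo "ok"
    else
        echo "low"
    fi
}

soft=$(ulimit -S -n)
hard=$(ulimit -H -n)
echo "soft nofile: $soft ($(nofile_status "$soft"))"
echo "hard nofile: $hard ($(nofile_status "$hard"))"

# Hypothetical /etc/security/limits.conf entries (65536 is only an example):
#   nagios  soft  nofile  65536
#   nagios  hard  nofile  65536
```

Keep in mind that limits.conf is applied through PAM at login, so an already-running service keeps its old limits until it is restarted, and services started by systemd may not pick it up at all. On Linux, /proc/<pid>/limits shows what a running process actually has.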

If that doesn't help, you could check the current values of
sysctl -n net.ipv4.tcp_rmem
and
sysctl -n net.ipv4.tcp_mem

and boost them up to 2x or 4x their current values (probably 4096 and 20796 are the defaults).
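To work out the scaled numbers, a small sketch like this may help. Note that `sysctl -n` only prints the current setting, and these settings hold several space-separated numbers each, so every number is multiplied. The helper name `scale_values` is made up for illustration.

```shell
# scale_values FACTOR VALUE...
# Multiply each numeric VALUE by FACTOR and print the results on one line.
scale_values() {
    factor=$1
    shift
    out=""
    for v in "$@"; do
        out="$out${out:+ }$((v * factor))"
    done
    echo "$out"
}

# Example usage (as root; unquoted $(...) splits the values into arguments):
#   new=$(scale_values 2 $(sysctl -n net.ipv4.tcp_rmem))
#   sysctl -w net.ipv4.tcp_rmem="$new"
# A change made with `sysctl -w` lasts until reboot; to keep it, add the
# setting to /etc/sysctl.conf.
```

Write down the original values before changing anything so you can put them back if there is no improvement.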

Hopefully these are helpful. Let me know if anything new develops.


--Jeffrey
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: Log Collection randomly stops almost completely

Post by ssax »

Locking thread; ticket received. We will continue support through the ticket.

Thank you!