Good morning Nagios team,
Can you take a look at the attached screenshot? I'd like someone to help me understand why log collection randomly comes to an almost complete halt in my environment from time to time. What logs can I look at to figure out why this happens? I was working when it occurred, and both logstash and elasticsearch showed they were running and active on all 3 nodes in my cluster. Apart from the console being extremely slow while log collection was dipping, I could see no discernible cause. CPU and memory usage were fine (a bit low on the CPU side).
This happens completely at random 2-3 times per month, and it doesn't recover until the scheduled maintenance and snapshot jobs run.
Thank you!
Log Collection randomly stops almost completely
Re: Log Collection randomly stops almost completely
You might start by looking at the auditlog to see if anything unusual shows up:
[root@jpd-nagiosls2 var]# egrep -v "Finished run|snapshot" /usr/local/nagioslogserver/var/auditlog.log | less
that will filter out a lot of the job-running and snapshot messages you don't need to look at right away.
You can also look in /usr/local/nagioslogserver/logstash for any hs_err logs and see if those look relevant.
Let's start with those.
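If you'd rather script the hs_err check, here's a minimal sketch; the default directory matches the path above, and the LOGSTASH_DIR override is just a convenience I've added:

```shell
#!/bin/sh
# Directory where logstash would leave JVM crash logs; default matches
# the path mentioned above, override with LOGSTASH_DIR if yours differs.
LOGSTASH_DIR="${LOGSTASH_DIR:-/usr/local/nagioslogserver/logstash}"

# List any hs_err crash logs, newest first; print a note if none exist.
ls -1t "$LOGSTASH_DIR"/hs_err_pid*.log 2>/dev/null || echo "no hs_err logs found"
```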
--Jeffrey
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
Be sure to check out our Knowledgebase for helpful articles and solutions!
Re: Log Collection randomly stops almost completely
Good morning,
Looking at the auditlog on each node, I'm not seeing anything out of the ordinary besides the alerts I have set up running on their schedules.
I do see one hs_err log, but it's empty and 2 years old.
Re: Log Collection randomly stops almost completely
Can you PM me a system profile so we can look at your logs and configs?
Alternatively, you can open a ticket and attach your profile there, if you'd rather.
Thanks!
--Jeffrey
Re: Log Collection randomly stops almost completely
I discovered something looking at your logs:
Every line in the logstash log is like this:
{:timestamp=>"2021-03-15T06:12:13.731000-0700", :message=>"Received an event that has a different character encoding than you configured.", :text=>"{\"EventReceivedTime\":\"2021-03-15 06:12:10\",\"SourceModuleName\":
and most of the previous one...
jdunitz:~/.../system-profile/logstashlogs
$ wc -l logstash.log-*15
17620 logstash.log-20210315
jdunitz:~/.../system-profile/logstashlogs
$ grep "different chara" logstash.log*15 | wc -l
17585
And almost all of yesterday's:
jdunitz:~/.../system-profile/logstashlogs
$ wc -l logstash.log-*14
20869 logstash.log-20210314
jdunitz:~/.../system-profile/logstashlogs
$ grep "different chara" logstash.log*14 | wc -l
20833
jdunitz:~/.../system-profile/logstashlogs
$
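To quantify this across all the rotated logs at once, a quick loop like the following works; it's a sketch that assumes you run it from the logstashlogs directory and that the filename pattern matches the listing above:

```shell
#!/bin/sh
# For each rotated logstash log, report how many of its lines are the
# "different character encoding" warning versus the total line count.
for f in logstash.log-*; do
  [ -f "$f" ] || continue                  # skip if the glob matched nothing
  total=$(wc -l < "$f" | tr -d ' ')        # total lines (strip wc padding)
  errs=$(grep -c "different character encoding" "$f")
  echo "$f: $errs of $total lines are encoding warnings"
done
```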
Did something change on the sending side? Clearly, the logs that are being sent don't match the expected format.
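If the sending side can't be fixed right away, the expected charset can also be declared on the input in the Logstash config. This is only a sketch, not your actual config: the input type, port number, and charset value here are all assumptions you'd need to match to your setup:

```
input {
  tcp {
    port  => 5544                   # whichever port this input listens on
    codec => plain {
      charset => "ISO-8859-1"       # the encoding the sender actually uses
    }
  }
}
```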
--Jeffrey
Re: Log Collection randomly stops almost completely
Nothing that I'm aware of; however, it's certainly possible. We're in a bit of a siloed environment, so if another group makes a change without letting my group know, we typically don't find out until something breaks.
How can I proceed to get this mismatch corrected?
Re: Log Collection randomly stops almost completely
You'd have to look at the application that's sending the logs and see if the format has changed. Also, consider that these are only about 20k errors; if you have millions of events coming in, this may not be a big deal.
Another thing to explore is whether you're hitting any kind of kernel limits.
Check "ulimit -S -n" and "ulimit -H -n" for the nagios user and make sure they're not still just 4096.
You may need to edit /etc/security/limits.conf, specifying bigger numbers (or even unlimited) for the nagios user.
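For reference, raising the open-file limit for the nagios user in /etc/security/limits.conf would look something like this; 65536 is just an illustrative value, not a tuned recommendation:

```
# /etc/security/limits.conf
nagios    soft    nofile    65536
nagios    hard    nofile    65536
```

Note that a fresh login session (or service restart) is needed before the new limits take effect.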
If that doesn't help, you could try tweaking
sysctl -n net.ipv4.tcp_rmem
and
sysctl -n net.ipv4.tcp_mem
and boost them up to 2x or 4x their current values (probably 4096 and 20796 are the defaults).
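If larger values do help, they can be made persistent in /etc/sysctl.conf and applied with "sysctl -p". The triples below are purely illustrative placeholders (min, default, max bytes for tcp_rmem; low, pressure, high pages for tcp_mem), not tested recommendations; start from the current values the commands above report:

```
# /etc/sysctl.conf -- illustrative values only; base yours on the
# current output of "sysctl -n net.ipv4.tcp_rmem" and tcp_mem
net.ipv4.tcp_rmem = 4096 349520 16777216
net.ipv4.tcp_mem = 190512 254016 381024
```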
Hopefully these are helpful. Let me know if anything new develops.
--Jeffrey
Re: Log Collection randomly stops almost completely
Locking this thread; ticket received. We will continue support through the ticket.
Thank you!