Nagios suddenly stopped sending logs

tcsdi · Post by **tcsdi** » Mon Mar 18, 2019 4:16 am

Our Nagios Log Server suddenly stopped sending logs to our ELK SIEM. When we checked the status, the drop appeared to have started around March 8. I have also attached a screenshot of the history of the logs sent.

How can I determine whether my Nagios Log Server is functioning properly? It is still sending logs, but at a much lower volume than it used to.

npolovenko · Post by **npolovenko** » Mon Mar 18, 2019 2:52 pm

Hello, @tcsdi. Please generate and send me a profile from each log server in the cluster. A profile can be generated under Admin > System > System Status or from the command line by running:

/usr/local/nagioslogserver/scripts/profile.sh

The profile can be found at /tmp/system-profile.tar.gz.

tcsdi · Post by **tcsdi** » Tue Mar 19, 2019 3:51 am

Hi @npolovenko,

Thanks for the reply. Please see the attached logs.

tcsdi · Post by **tcsdi** » Tue Mar 19, 2019 3:54 am

Hi @npolovenko,

Thanks for the reply. Here is the profile of the server.

npolovenko · Post by **npolovenko** » Tue Mar 19, 2019 4:27 pm

@tcsdi, thanks! I'm seeing that the Log Server indices have been critical since January. It seems that insufficient RAM was the main issue that caused Elasticsearch to fail:

java.lang.OutOfMemoryError: Java heap space

My recommendation would be to increase the RAM to at least 8 GB on the production system, then delete the critical indices and restore them from a backup:
https://support.nagios.com/kb/article.php?id=90
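
Before deleting anything, it can help to list exactly which indices are unhealthy. The following is a minimal sketch, not an official Nagios tool. It assumes that the Elasticsearch instance behind Log Server answers on http://localhost:9200 and that the indices shown as critical in the UI correspond to Elasticsearch's "red" status. The function names `red_indices` and `fetch_health` are made up for illustration.

```python
import json
import urllib.request


def red_indices(health):
    """Return the sorted names of indices whose status is 'red'.

    `health` is the parsed JSON body of the Elasticsearch cluster health
    API when it is called with level=indices.
    """
    indices = health.get("indices", {})
    return sorted(
        name for name, info in indices.items() if info.get("status") == "red"
    )


def fetch_health(base_url="http://localhost:9200"):
    """Fetch per-index cluster health.

    The base URL is an assumption; adjust it to wherever Elasticsearch
    listens on your Log Server node.
    """
    url = f"{base_url}/_cluster/health?level=indices"
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)
```

Calling `red_indices(fetch_health())` on a cluster node should give the list of index names to compare against your backups before restoring.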

tcsdi · Post by **tcsdi** » Wed Mar 20, 2019 9:13 pm

Hi @npolovenko,

We have now upgraded the RAM to 8 GB, but the server still stops after a few hours. I have attached another profile for you to look at.

npolovenko · Post by **npolovenko** » Thu Mar 21, 2019 3:19 pm

@tcsdi, I'm not seeing anything out of the ordinary in the profile so far. Can you take a new screenshot of the Data Source graph?
Could you also clarify which filters are shown on your graph? I see that the graph has many colors. Is it an all-inclusive graph for all outputs or just for one particular output?

tcsdi · Post by **tcsdi** » Fri Mar 22, 2019 2:31 am

Hi @npolovenko, please see the attached log for the list of all sources.

npolovenko · Post by **npolovenko** » Fri Mar 22, 2019 2:13 pm

@tcsdi, I noticed that you're sending some logs twice: the same type is matched by two outputs that go to the same host and port and differ only in sourcehost. For example:

Code:

if [type] =~ /(dnslog)/ {
    syslog {
        host => "172.31.108.236"
        port => 1523
        sourcehost => "10.5.115.106"
    }
}
if [type] =~ /(dnslog)/ {
    syslog {
        host => "172.31.108.236"
        port => 1523
        sourcehost => "10.5.115.107"
    }
}

Or:

Code:

if [type] =~ /(eventlog)/ {
    syslog {
        host => "172.31.108.236"
        port => 1522
        sourcehost => "10.5.115.106"
        codec => json {
            charset => 'CP1252'
        }
    }
}
if [type] =~ /(eventlog)/ {
    syslog {
        host => "172.31.108.236"
        port => 1522
        sourcehost => "10.5.115.107"
        codec => json {
            charset => 'CP1252'
        }
    }
}

I wonder if that could be causing issues. Are these two sourcehosts clustered? Could you disable the duplicate outputs for a while to see if performance improves?
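
If the output configuration is long, a quick script can point out every place where the same condition feeds the same host and port more than once. This is only a rough, line-based sketch and not a real Logstash parser: it assumes each conditional opens with `if ... {` on its own line and that `host`, `port`, and `sourcehost` each sit on their own line, as in the snippets above. The function name `find_duplicate_outputs` is made up for illustration.

```python
import re
from collections import defaultdict

IF_RE = re.compile(r'^\s*if\s+(.+?)\s*\{\s*$')
FIELD_RES = {
    "host": re.compile(r'^\s*host\s*=>\s*"([^"]+)"'),
    "port": re.compile(r'^\s*port\s*=>\s*(\d+)'),
    "sourcehost": re.compile(r'^\s*sourcehost\s*=>\s*"([^"]+)"'),
}


def find_duplicate_outputs(config_text):
    """Group outputs by (condition, host, port) and keep groups seen twice or more.

    Returns a dict that maps each duplicated (condition, host, port) tuple
    to the list of sourcehost values found for it. Port is kept as a string.
    """
    blocks = []
    current = None
    for line in config_text.splitlines():
        if_match = IF_RE.match(line)
        if if_match:
            # A new conditional starts; collect its fields until the next one.
            current = {"cond": if_match.group(1), "host": None,
                       "port": None, "sourcehost": None}
            blocks.append(current)
            continue
        if current is None:
            continue
        for key, pattern in FIELD_RES.items():
            field_match = pattern.match(line)
            if field_match:
                current[key] = field_match.group(1)

    groups = defaultdict(list)
    for block in blocks:
        groups[(block["cond"], block["host"], block["port"])].append(
            block["sourcehost"])
    return {key: hosts for key, hosts in groups.items() if len(hosts) > 1}
```

Passing the text of the outputs configuration to `find_duplicate_outputs` returns one entry per duplicated condition and destination, which makes it easier to decide which outputs to disable temporarily.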

tcsdi · Post by **tcsdi** » Mon Mar 25, 2019 10:15 am

Hi @npolovenko,

Yes, the two sourcehosts are clustered, but they have been configured that way since the beginning, even when the server was doing fine. Is there any other configuration that could cause the issue?

Nagios Support Forum
