Nagios Log Server listening port abruptly halts

Post by **james.liew** » Tue Apr 18, 2017 4:26 am

Hi all,

I've had three occurrences of a rather odd issue, spread across two Nagios Log Server clusters in two datacentres: NLS stops listening on the port we designate for Windows hosts (say, port 3500) and then refuses to receive any log traffic on that port. The Windows boxes run the nxlog agent.

I do not see any resource issues; RAM and CPU utilization on the Log Server look normal. However, I am not well-versed in Log Server and do not know under what conditions, if any, the cluster will stop listening on a port and fail over to the secondary node.

Where do I start looking in the logs to try and figure out this problem?

Current NLS version:
Nagios Log Server: 1.4.4
Elasticsearch: 1.6.0
Logstash: 1.5.1
Kibana: 3.1.1-nagios3

Post by **mcapra** » Tue Apr 18, 2017 11:43 am

The Logstash logs are a good place to start. Can you send them over? This command should put them all in the /tmp/43502_1.zip file:

zip -r /tmp/43502_1.zip /var/log/logstash/

If the file is too big to attach to a post, I'll settle for the latest logstash.log file in that same path.

Post by **james.liew** » Wed Apr 19, 2017 1:46 am

Hi,

I've attached 3 files for you.

Our Log Server stopped listening for logs around 3-5pm last Saturday.

Post by **mcapra** » Wed Apr 19, 2017 1:10 pm

This appears to be the initial problem:

{:timestamp=>"2017-04-16T11:08:34.784000+0200", :message=>"Got error to send bulk of actions: None of the configured nodes are available: []", :level=>:error}

Can you also share your Elasticsearch logs from the same day(s)? They're typically found in /var/log/elasticsearch.
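For anyone following along, here is a minimal sketch of how that error could be located in a Logstash log. It assumes GNU grep and the log path mentioned earlier in the thread (/var/log/logstash/logstash.log); the function names `count_bulk_errors` and `first_bulk_error` are made up for illustration and are not part of Nagios Log Server.

```shell
# Count lines containing the Logstash "no nodes available" bulk error.
# Usage: count_bulk_errors /var/log/logstash/logstash.log
count_bulk_errors() {
    # -F matches the text as a fixed string, -c prints the number of matching lines
    grep -c -F "None of the configured nodes are available" "$1"
}

# Print the first matching line; its :timestamp shows when the errors began.
# Usage: first_bulk_error /var/log/logstash/logstash.log
first_bulk_error() {
    # -m 1 stops after the first match
    grep -m 1 -F "None of the configured nodes are available" "$1"
}
```

The timestamp on the first match gives a window to compare against the Elasticsearch logs from the same period.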

Post by **james.liew** » Wed Apr 19, 2017 8:54 pm

Elasticsearch logs uploaded.

Post by **tacolover101** » Thu Apr 20, 2017 12:49 am

from what i can tell your nodes are disconnecting at some point.

[2017-04-16 11:05:02,252][DEBUG][action.admin.cluster.health] [791cc6c8-f646-495e-9e58-1ec21a24b61c] no known master node, scheduling a retry
[2017-04-16 11:05:02,337][DEBUG][action.index             ] [791cc6c8-f646-495e-9e58-1ec21a24b61c] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]

and then a halt -

[2017-04-16 11:05:02,252][DEBUG][action.admin.cluster.health] [791cc6c8-f646-495e-9e58-1ec21a24b61c] no known master node, scheduling a retry
[2017-04-16 11:05:02,337][DEBUG][action.index             ] [791cc6c8-f646-495e-9e58-1ec21a24b61c] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]

perhaps a network issue somewhere? i believe there are some ports that nodes need to communicate on, 9200/9300 i think. is there a firewall between them that would be blocking that communication?

Post by **dwhitfield** » Thu Apr 20, 2017 1:07 pm

Thanks @tacolover101.

OP, can you run a traceroute to the different nodes?

Post by **james.liew** » Mon Apr 24, 2017 9:56 pm

Hi all,

There are firewalls between the Log Server and some of the windows nodes.

I will advise once I do a traceroute to some of them.

Post by **tacolover101** » Mon Apr 24, 2017 10:16 pm

james.liew wrote: Hi all,

There are firewalls between the Log Server and some of the windows nodes.

I will advise once I do a traceroute to some of them.

the logs i mentioned above pertain to the NLS clusters - you'll want to make sure traffic on ports 9200/9300 can get through between the nodes. nmap might prove more useful than a traceroute, since a traceroute only shows the hops along the route.

the firewalls will definitely affect the setup, depending on where and how they're set up to filter.
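
A rough sketch of that port check, for cases where nmap is not installed. It assumes bash and the coreutils `timeout` command are present on the node; `check_port`, `check_es_ports`, and the hostname `node2.example.com` are placeholders for illustration, not anything shipped with Nagios Log Server.

```shell
# Report whether a TCP port on a host accepts connections.
# Usage: check_port HOST PORT   (prints "open" or "unreachable")
check_port() {
    # bash's /dev/tcp pseudo-device attempts a TCP connect;
    # timeout caps the wait at 3 seconds so filtered ports do not hang.
    if timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
        echo "open"
    else
        # covers refused, filtered, and timed-out connections alike
        echo "unreachable"
    fi
}

# Check both Elasticsearch ports mentioned in this thread on one node.
# Example (hypothetical host): check_es_ports node2.example.com
check_es_ports() {
    for port in 9200 9300; do
        echo "$1:$port $(check_port "$1" "$port")"
    done
}
```

Running `check_es_ports` from each cluster node against the others should show whether anything in between is dropping the traffic. With nmap installed, `nmap -p 9200,9300 HOST` gives a similar answer and also distinguishes closed from filtered ports.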

Post by **dwhitfield** » Tue Apr 25, 2017 9:25 am

Thanks @tacolover101!

What OS are the nodes running? That will help us determine the firewall command you need to use, assuming it turns out to be a firewall issue.
