
Nagios Log Server listening port abruptly halts

Posted: Tue Apr 18, 2017 4:26 am
by james.liew
Hi all,

I've had three occurrences of a rather odd issue, spread across two Nagios Log Server (NLS) clusters in two datacentres: NLS stops listening on the port we designate for Windows hosts (say, port 3500) and then refuses to receive any log traffic on that port. The Windows boxes run the nxlog agent.

I do not see any resource issues: RAM and CPU utilization on the Log Server look normal. However, I am not well-versed in Log Server and do not know under what conditions, if any, the cluster detects that a port is no longer listening and fails over to the secondary node.

Where do I start looking in the logs to try and figure out this problem?

Current NLS versions:
Nagios Log Server: 1.4.4
Elasticsearch: 1.6.0
Logstash: 1.5.1
Kibana: 3.1.1-nagios3

Re: Nagios Log Server listening port abruptly halts

Posted: Tue Apr 18, 2017 11:43 am
by mcapra
The Logstash logs are a good place to start. Can you send them over? This command should put them all in the /tmp/43502_1.zip file:

Code: Select all

zip -r /tmp/43502_1.zip /var/log/logstash/
If the file is too big to attach to a post, I'll settle for the latest logstash.log file in that same path.

Re: Nagios Log Server listening port abruptly halts

Posted: Wed Apr 19, 2017 1:46 am
by james.liew
Hi,

I've added 3 files for you :)

Our Log Server stopped listening for logs around 3-5pm last Saturday.

Re: Nagios Log Server listening port abruptly halts

Posted: Wed Apr 19, 2017 1:10 pm
by mcapra
This appears to be the initial problem:

Code: Select all

{:timestamp=>"2017-04-16T11:08:34.784000+0200", :message=>"Got error to send bulk of actions: None of the configured nodes are available: []", :level=>:error}
Can you also share your Elasticsearch logs from the same day(s)? They're typically found in /var/log/elasticsearch.
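
For anyone following along, here is a minimal sketch for pulling only the error-level entries out of a Logstash log, assuming the log uses the same `:level=>:error` format as the line quoted above. The function name `logstash_errors` is made up for illustration; the path in the usage comment is the one mentioned earlier in this thread.

```shell
#!/usr/bin/env bash
# Print only error-level entries from a Logstash log written in the
# {:timestamp=>..., :message=>..., :level=>:error} format shown above.
# Usage: logstash_errors /var/log/logstash/logstash.log
logstash_errors() {
    local logfile="$1"
    # -F matches the marker as a literal string rather than a regex;
    # "--" keeps an unusual file name from being read as an option.
    grep -F ':level=>:error' -- "$logfile"
}
```

Piping the result through `head -n 1` should show the earliest error in the file, which is usually the most useful one to compare against the time the port stopped listening.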

Re: Nagios Log Server listening port abruptly halts

Posted: Wed Apr 19, 2017 8:54 pm
by james.liew
Elasticsearch logs uploaded.

Re: Nagios Log Server listening port abruptly halts

Posted: Thu Apr 20, 2017 12:49 am
by tacolover101
from what i can tell your nodes are disconnecting at some point.

Code: Select all

[2017-04-16 11:05:02,252][DEBUG][action.admin.cluster.health] [791cc6c8-f646-495e-9e58-1ec21a24b61c] no known master node, scheduling a retry
[2017-04-16 11:05:02,337][DEBUG][action.index             ] [791cc6c8-f646-495e-9e58-1ec21a24b61c] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
and then a halt -

Code: Select all

[2017-04-16 11:05:02,252][DEBUG][action.admin.cluster.health] [791cc6c8-f646-495e-9e58-1ec21a24b61c] no known master node, scheduling a retry
[2017-04-16 11:05:02,337][DEBUG][action.index             ] [791cc6c8-f646-495e-9e58-1ec21a24b61c] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
perhaps a network issue somewhere? i believe the nodes need to reach each other on a couple of ports: 9200 (the HTTP API) and 9300 (node-to-node transport), i think. is there a firewall between them that would be blocking that communication?
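
To see how often and when the node lost its master, something like the sketch below can count the "no known master node" entries per minute. It assumes the Elasticsearch log lines start with a `[YYYY-MM-DD HH:MM:SS,mmm]` timestamp as in the excerpts above; the function name `master_loss_by_minute` is hypothetical, and the log file name in the usage comment is a placeholder, since the actual file name depends on the cluster.

```shell
#!/usr/bin/env bash
# Count "no known master node" entries per minute in an Elasticsearch log.
# Usage: master_loss_by_minute /var/log/elasticsearch/<your-cluster>.log
master_loss_by_minute() {
    local logfile="$1"
    # Keep only the matching lines, cut out "YYYY-MM-DD HH:MM"
    # (characters 2-17, skipping the leading "["), then tally per minute.
    grep -F 'no known master node' -- "$logfile" \
        | cut -c2-17 \
        | sort \
        | uniq -c
}
```

A burst of entries in a single minute, lined up against the time Logstash reported "None of the configured nodes are available", would support the idea that the nodes lost contact with each other first.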

Re: Nagios Log Server listening port abruptly halts

Posted: Thu Apr 20, 2017 1:07 pm
by dwhitfield
Thanks @tacolover101.

OP, can you run a traceroute to the different nodes?

Re: Nagios Log Server listening port abruptly halts

Posted: Mon Apr 24, 2017 9:56 pm
by james.liew
Hi all,

There are firewalls between the Log Server and some of the Windows nodes.

I will advise once I do a traceroute to some of them.

Re: Nagios Log Server listening port abruptly halts

Posted: Mon Apr 24, 2017 10:16 pm
by tacolover101
james.liew wrote: Hi all,

There are firewalls between the Log Server and some of the windows nodes.

I will advise once I do a traceroute to some of them.
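the logs i mentioned above pertain to the NLS cluster nodes themselves, not the windows hosts. you'll want to make sure traffic on ports 9200/9300 can get through between the nodes. nmap might prove more useful than a traceroute here, since a traceroute only shows the hops along the path, not whether a given port is reachable.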

the firewalls will definitely affect the setup, depending on where and how they're set up to filter.
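
if nmap isn't installed on the nodes, a rough substitute is sketched below. It is not nmap: it only tries to open a TCP connection using bash's built-in `/dev/tcp` redirection plus the coreutils `timeout` command, both of which are assumed to be available. The function name `check_port` and the 3-second timeout are arbitrary choices for illustration.

```shell
#!/usr/bin/env bash
# Report whether a TCP connection to host:port can be opened from this machine.
# Usage: check_port <other-node-address> 9300
check_port() {
    local host="$1" port="$2"
    # The child bash receives host and port as $0 and $1, so nothing is
    # interpolated into the command string itself. Connection errors are
    # sent to /dev/null; only the exit status is used.
    if timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null; then
        echo "$host:$port open"
    else
        echo "$host:$port closed or filtered"
    fi
}
```

Run it from each NLS node against the other node's address for both 9200 and 9300. A "closed or filtered" result in only one direction would point at a firewall rule rather than at Elasticsearch itself.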

Re: Nagios Log Server listening port abruptly halts

Posted: Tue Apr 25, 2017 9:25 am
by dwhitfield
Thanks @tacolover101!

What OS are the nodes running? That will help us determine the firewall command you need to use, assuming it turns out to be a firewall issue.