
Re: New NLS node errors out when trying to add to cluster

Posted: Thu Mar 26, 2015 9:02 am
by tgriep
Can you run the following on both servers?
The node_uuid should be unique to each server; if they are the same, that could be causing the issue you are having.

Code: Select all

cat /usr/local/nagioslogserver/var/node_uuid
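
For anyone comparing the two values by hand, the check can be sketched as a small shell function. This is only an illustration: the function name `node_uuids_differ` is made up here and is not part of Nagios Log Server.

```shell
# Hypothetical helper (not an NLS tool): exits 0 only when both UUID strings
# are non-empty and different from each other.
node_uuids_differ() {
    [ -n "$1" ] && [ -n "$2" ] && [ "$1" != "$2" ]
}

# Possible usage on one node, with the other node's UUID pasted in by hand:
#   local_uuid=$(cat /usr/local/nagioslogserver/var/node_uuid)
#   node_uuids_differ "$local_uuid" "<uuid from the other server>" \
#       && echo "UUIDs are unique" || echo "UUIDs match or one is empty"
```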

Re: New NLS node errors out when trying to add to cluster

Posted: Thu Mar 26, 2015 9:13 am
by jolson
Please get the node uuids that tgriep has recommended.

Also, please check your java version:

Code: Select all

java -version
There are many Java exceptions in your logs, so I want to make sure your Java version is compatible.
For your information, my java version is as follows:
java version "1.7.0_75"
OpenJDK Runtime Environment (rhel-2.5.4.0.el6_6-x86_64 u75-b13)
OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode)
The requirement for Elasticsearch is java version 1.7.0_55 or higher.
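
If it helps, the comparison against 1.7.0_55 can be sketched in shell. The function name `java_version_ok` is hypothetical, and the sketch assumes the version string contains only digits separated by dots and an underscore, as in the output above.

```shell
# Hypothetical helper: exits 0 when a version string such as "1.7.0_75"
# is at least 1.7.0_55 (the minimum quoted in this thread).
# Assumes purely numeric fields separated by "." and "_".
java_version_ok() {
    # Split the string into major, minor, patch and update fields.
    IFS='._' read -r maj min pat upd <<EOF
$1
EOF
    # Missing fields count as 0.
    set -- "${maj:-0}" "${min:-0}" "${pat:-0}" "${upd:-0}"
    # Compare field by field against 1.7.0_55.
    for want in 1 7 0 55; do
        if [ "$1" -gt "$want" ]; then return 0; fi
        if [ "$1" -lt "$want" ]; then return 1; fi
        shift
    done
    return 0
}

# Possible usage (java -version prints to stderr):
#   ver=$(java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')
#   java_version_ok "$ver" && echo "Java $ver is new enough"
```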

Re: New NLS node errors out when trying to add to cluster

Posted: Thu Mar 26, 2015 11:19 am
by 2evanowen
It looks like the UUIDs are different and Java is the correct version.

Code: Select all

[owen@barium ~]$ cat /usr/local/nagioslogserver/var/node_uuid
73416932-7e0c-49e9-a7b8-48601392639e
[owen@barium ~]$ java -version
java version "1.7.0_75"
OpenJDK Runtime Environment (rhel-2.5.4.2.el7_0-x86_64 u75-b13)
OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode)

Code: Select all

[owen@radium ~]$ cat /usr/local/nagioslogserver/var/node_uuid
91614dae-5637-43b7-8ef9-caafa2163f77
[owen@radium ~]$ java -version
java version "1.7.0_75"
OpenJDK Runtime Environment (rhel-2.5.4.2.el7_0-x86_64 u75-b13)
OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode)

Re: New NLS node errors out when trying to add to cluster

Posted: Thu Mar 26, 2015 1:07 pm
by jolson
My hunch at the moment is that your NLS client (Barium) cannot join Radium because the 9xxx ports are not up and listening properly as shown by the NMAP scan. Can you run a netstat on Barium and verify that these ports are not listening?

Code: Select all

netstat -na|grep LISTEN
You should see ports in the 9XXX range. Please post your results. If those ports are not showing, elasticsearch (and logstash) will not be able to communicate with the rest of your cluster. I am almost certain that these ports are not up, but this netstat will verify it.
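
To pick the 9XXX ports out of a long netstat listing, a filter along these lines may help. It is a rough sketch: the function name `listening_9xxx_ports` is invented for illustration, and it assumes the Linux `netstat -na` column layout with the local address in the fourth column.

```shell
# Hypothetical helper: reads `netstat -na` output on stdin and prints the
# TCP ports between 9000 and 9999 that are in LISTEN state, one per line.
listening_9xxx_ports() {
    awk '$1 ~ /^tcp/ && $NF == "LISTEN" {
        n = split($4, parts, ":")   # local address; the port is the last piece
        port = parts[n] + 0
        if (port >= 9000 && port <= 9999) print port
    }' | sort -n -u
}

# Possible usage:
#   netstat -na | listening_9xxx_ports
# No output means nothing is listening in that range.
```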

Could you also run a cat on your node hosts files please (on both Radium and Barium):

Code: Select all

cat /usr/local/nagioslogserver/var/cluster_hosts
It would be worth running our upgrade script on your Barium machine to see if that fixes anything. Please back up or snapshot your Barium machine before doing so:

Code: Select all

cd /tmp
wget http://assets.nagios.com/downloads/nagios-log-server/2015/nagioslogserver-2015r1.3.tar.gz
tar xfz nagioslogserver-2015r1.3.tar.gz
cd nagioslogserver
./upgrade
Let me know if there are any errors during the upgrade procedure.

After this upgrade, try the netstat command once more:

Code: Select all

netstat -na|grep LISTEN
Also, it may be worth seeing your /etc/passwd file entries:

Code: Select all

cat /etc/passwd | egrep 'apache|nagios'
The entries should look similar to this:

Code: Select all

apache:x:48:48:Apache:/var/www:/sbin/nologin
nagios:x:500:100::/home/nagios:/bin/bash
Elasticsearch logs aren't showing any bind errors. If you decided to reformat, you would only need to reformat Barium. Thanks, and let us know!

Re: New NLS node errors out when trying to add to cluster

Posted: Fri Mar 27, 2015 12:37 pm
by 2evanowen
I don't see any ports open on 9xxx

Code: Select all

[owen@barium ~]$ netstat -na|grep LISTEN
tcp        0      0 127.0.0.1:25            0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:56571           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:10080           0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:199           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
tcp6       0      0 :::2301                 :::*                    LISTEN
tcp6       0      0 :::5666                 :::*                    LISTEN
tcp6       0      0 :::37                   :::*                    LISTEN
tcp6       0      0 :::2381                 :::*                    LISTEN
tcp6       0      0 :::111                  :::*                    LISTEN
tcp6       0      0 :::80                   :::*                    LISTEN
tcp6       0      0 :::52113                :::*                    LISTEN
tcp6       0      0 :::22                   :::*                    LISTEN
unix  2      [ ACC ]     STREAM     LISTENING     25653    /var/run/avahi-daemon/socket
unix  2      [ ACC ]     STREAM     LISTENING     25656    /var/run/rpcbind.sock
unix  2      [ ACC ]     STREAM     LISTENING     25658    /var/run/dbus/system_bus_socket
unix  2      [ ACC ]     STREAM     LISTENING     21354    /run/lvm/lvmetad.socket
unix  2      [ ACC ]     STREAM     LISTENING     21612    /run/systemd/private
unix  2      [ ACC ]     SEQPACKET  LISTENING     21368    /run/udev/control
unix  2      [ ACC ]     STREAM     LISTENING     8849     /var/run/lsm/ipc/sim
unix  2      [ ACC ]     STREAM     LISTENING     1425     /run/systemd/journal/stdout
unix  2      [ ACC ]     STREAM     LISTENING     8851     /var/run/lsm/ipc/simc
unix  2      [ ACC ]     STREAM     LISTENING     29848    /var/run/abrt/abrt.socket
unix  2      [ ACC ]     STREAM     LISTENING     28884    /var/run/rpcbind.sock

It looks like an entry was actually added on every single attempt (that might be a bug?).

Code: Select all

[owen@barium ~]$     cat /usr/local/nagioslogserver/var/cluster_hosts
localhost
10.30.216.88
https://radium.colo.seagate.com
radium.colo.seagate.com
radium.colo.seagate.com
radium
10.30.216.88
radium
radium
radium.colo.seagate.com
10.30.216.88
10.30.216.88
10.30.216.88
barium.colo.seagate.com
barium.colo.seagate.com
10.30.216.88
10.30.216.88
radium.colo.seagate.com
radium.colo.seagate.com
radium.colo.seagate.com
radium.colo.seagate.com
radium.colo.seagate.com
radium.colo.seagate.com
radium.colo.seagate.com
radium.colo.seagate.com
radium.colo.seagate.com
radium.colo.seagate.com
radium.colo.seagate.com
radium.colo.seagate.com
radium.colo.seagate.com
radium.colo.seagate.com
radium.colo.seagate.com
10.30.216.88
radium.colo.seagate.com
10.30.216.88
10.30.216.88
barium.colo.seagate.com/
barium.colo.seagate.com
10.30.216.56
barium.colo.seagate.com
radium.colo.seagate.com
radium.colo.seagate.com
10.30.216.88

Code: Select all

[owen@radium ~]$  cat /usr/local/nagioslogserver/var/cluster_hosts
localhost
10.30.216.88
No errors during the upgrade.

Code: Select all

[owen@barium nagioslogserver]$ sudo ./upgrade
[sudo] password for owen:
Archive:  sourceguardian/ixed4.lin.x86-64.zip
  inflating: /usr/lib64/php/modules/ixed.5.4.lin
Sourceguardian extension found for PHP version 5.4
Sourceguardian extension already in php.ini
Redirecting to /bin/systemctl restart  httpd.service
Upgrading Kibana...
Kibana upgraded OK
Restarting elasticsearch (via systemctl):                  [  OK  ]

Nagios Log Server Upgrade Complete!

You can access the Nagios Log Server web interface by visiting:
    http:///nagioslogserver/
Still no port 9xxx open.

Code: Select all

[owen@barium nagioslogserver]$ netstat -na|grep LISTEN
tcp        0      0 127.0.0.1:25            0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:56571           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:10080           0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:199           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
tcp6       0      0 :::2301                 :::*                    LISTEN
tcp6       0      0 :::5666                 :::*                    LISTEN
tcp6       0      0 :::37                   :::*                    LISTEN
tcp6       0      0 :::2381                 :::*                    LISTEN
tcp6       0      0 :::111                  :::*                    LISTEN
tcp6       0      0 :::80                   :::*                    LISTEN
tcp6       0      0 :::52113                :::*                    LISTEN
tcp6       0      0 :::22                   :::*                    LISTEN
unix  2      [ ACC ]     STREAM     LISTENING     25653    /var/run/avahi-daemon/socket
unix  2      [ ACC ]     STREAM     LISTENING     25656    /var/run/rpcbind.sock
unix  2      [ ACC ]     STREAM     LISTENING     25658    /var/run/dbus/system_bus_socket
unix  2      [ ACC ]     STREAM     LISTENING     21354    /run/lvm/lvmetad.socket
unix  2      [ ACC ]     STREAM     LISTENING     21612    /run/systemd/private
unix  2      [ ACC ]     SEQPACKET  LISTENING     21368    /run/udev/control
unix  2      [ ACC ]     STREAM     LISTENING     8849     /var/run/lsm/ipc/sim
unix  2      [ ACC ]     STREAM     LISTENING     1425     /run/systemd/journal/stdout
unix  2      [ ACC ]     STREAM     LISTENING     8851     /var/run/lsm/ipc/simc
unix  2      [ ACC ]     STREAM     LISTENING     29848    /var/run/abrt/abrt.socket
unix  2      [ ACC ]     STREAM     LISTENING     28884    /var/run/rpcbind.sock

Mine looks a bit different.

Code: Select all

[owen@barium nagioslogserver]$ cat /etc/passwd | egrep 'apache|nagios'
nagios:x:129:129:Nagios:/var/log/nagios:/bin/bash
apache:x:48:48:Apache:/usr/share/httpd:/sbin/nologin

Re: New NLS node errors out when trying to add to cluster

Posted: Fri Mar 27, 2015 1:14 pm
by jolson
This explains why we were seeing a resolve error for https://radium.colo.seagate.com: you must have entered the address incorrectly at first, and when the list is read from top to bottom it fails at that entry. Could you please remove all entries from the Barium cluster_hosts file except for localhost, and try to re-add it to radium from the WebGUI?

After re-adding, please cat your cluster_hosts file once more and return the output here. Also, attempt another netstat for the 9XXX ports. This is interesting because the 9XXX ports will shut down for a short period when the node is initially joined to a cluster - I'm wondering if yours shut down permanently because of the incorrect entry?
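
The cleanup described above could be sketched in shell roughly as follows. Both function names are hypothetical, the "suspect" test is only a guess at what counts as a bad entry (anything containing ":" or "/", which would also flag IPv6 addresses), and editing the real file will likely need root privileges.

```shell
# Hypothetical helper: list entries (with line numbers) that do not look like
# a bare hostname or IPv4 address, e.g. a URL scheme or a trailing slash.
suspect_cluster_hosts() {
    grep -n -E '[:/]' "$1"
}

# Hypothetical helper: keep a copy as <file>.bak, then leave only "localhost".
reset_cluster_hosts() {
    cp -p "$1" "$1.bak" && printf 'localhost\n' > "$1"
}

# Possible usage on the node that fails to join:
#   f=/usr/local/nagioslogserver/var/cluster_hosts
#   suspect_cluster_hosts "$f"
#   reset_cluster_hosts "$f"
```

After the reset, the node is re-added from the WebGUI as described above.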

Re: New NLS node errors out when trying to add to cluster

Posted: Fri Mar 27, 2015 2:22 pm
by 2evanowen
THAT WORKED!


Deleted everything

Code: Select all

[owen@barium ~]$ cat /usr/local/nagioslogserver/var/cluster_hosts
localhost
[Attachment: Screen Shot 2015-03-27 at 1.19.01 PM.png]
looks like 9200 is open now

Code: Select all

[owen@barium ~]$ netstat -na|grep LISTEN
tcp        0      0 127.0.0.1:25            0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:56571           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:10080           0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:199           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
tcp6       0      0 :::3515                 :::*                    LISTEN
tcp6       0      0 :::2301                 :::*                    LISTEN
tcp6       0      0 :::5666                 :::*                    LISTEN
tcp6       0      0 :::37                   :::*                    LISTEN
tcp6       0      0 :::2056                 :::*                    LISTEN
tcp6       0      0 :::5544                 :::*                    LISTEN
tcp6       0      0 :::2057                 :::*                    LISTEN
tcp6       0      0 :::2381                 :::*                    LISTEN
tcp6       0      0 :::111                  :::*                    LISTEN
tcp6       0      0 127.0.0.1:9200          :::*                    LISTEN
tcp6       0      0 :::80                   :::*                    LISTEN
tcp6       0      0 :::52113                :::*                    LISTEN
tcp6       0      0 :::9300                 :::*                    LISTEN
tcp6       0      0 :::22                   :::*                    LISTEN
unix  2      [ ACC ]     STREAM     LISTENING     25653    /var/run/avahi-daemon/socket
unix  2      [ ACC ]     STREAM     LISTENING     25656    /var/run/rpcbind.sock
unix  2      [ ACC ]     STREAM     LISTENING     25658    /var/run/dbus/system_bus_socket
unix  2      [ ACC ]     STREAM     LISTENING     21354    /run/lvm/lvmetad.socket
unix  2      [ ACC ]     STREAM     LISTENING     21612    /run/systemd/private
unix  2      [ ACC ]     SEQPACKET  LISTENING     21368    /run/udev/control
unix  2      [ ACC ]     STREAM     LISTENING     8849     /var/run/lsm/ipc/sim
unix  2      [ ACC ]     STREAM     LISTENING     1425     /run/systemd/journal/stdout
unix  2      [ ACC ]     STREAM     LISTENING     8851     /var/run/lsm/ipc/simc
unix  2      [ ACC ]     STREAM     LISTENING     29848    /var/run/abrt/abrt.socket
unix  2      [ ACC ]     STREAM     LISTENING     28884    /var/run/rpcbind.sock
and it's added. :)

Code: Select all

[owen@barium ~]$ cat /usr/local/nagioslogserver/var/cluster_hosts
localhost

radium.colo.seagate.com
10.30.216.56
10.30.216.88

Thank you jolson and tgriep!

Re: New NLS node errors out when trying to add to cluster

Posted: Mon Mar 30, 2015 9:09 am
by ssax
I'm glad that worked for you. Can we mark this as resolved and lock the topic?

Re: New NLS node errors out when trying to add to cluster

Posted: Thu Jul 16, 2015 3:02 pm
by 2evanowen
Yes.

Re: New NLS node errors out when trying to add to cluster

Posted: Thu Jul 16, 2015 3:04 pm
by jolson
Thanks - will do!