NLS 210 new cluster issue

Post by **ssoliveira** » Thu Sep 26, 2019 5:39 pm

Hello, good evening.

We are planning to upgrade our NLS cluster to version 210.

Before performing the upgrade, we usually update the lab environment and install a new cluster there.

When we try to add a new server to the cluster, an error occurs. Both machines are in the same VLAN, with their local firewalls disabled.

We copied the data through Notepad to avoid picking up stray characters while copying, but that made no difference. We could see traffic arriving at the destination server with tcpdump, but we cannot identify what is happening.

In the release notes there are some notes about backend access restrictions. Could this be an unforeseen bug?

We also noticed that a portion of the home page is blank.

Post by **mbellerue** » Fri Sep 27, 2019 10:30 am

Regarding the error attaching to the cluster, what OS are the lab machines using?

The blank section of the page is the Total Disk Usage data. Is there anything special about the setup of the disks on these lab servers? If you could run a df -h on one of those servers where the disk usage dashlet isn't showing up, that would be good to see. Maybe try loading the page, and seeing if any logs show up in the Apache error logs.

Post by **ssoliveira** » Fri Sep 27, 2019 1:20 pm

Hi, good afternoon.

These are new CentOS 7.7 servers (fully updated), provisioned exclusively for testing with the new version of Nagios Log Server.

I monitored the communication between the two VMs while the second machine was joining the cluster. The only traffic is on port 9300.

I found no errors in the "/var/log/elasticsearch/74133b7f-483e-45db-b4fd-298d8f0792d7.log" file.

I provisioned 2 more new servers to test. When I try to add the second one, the same error occurs. Is there a log I can monitor that details the procedure of joining a server to the cluster?

Code: Select all

[root@centos702 ~]# cat /etc/redhat-release
CentOS Linux release 7.7.1908 (Core)

[root@centos702 ~]# df -h
Filesystem               Size  Used Avail Use% Mounted on
devtmpfs                 1.9G     0  1.9G   0% /dev
tmpfs                    1.9G     0  1.9G   0% /dev/shm
tmpfs                    1.9G   17M  1.9G   1% /run
tmpfs                    1.9G     0  1.9G   0% /sys/fs/cgroup
/dev/mapper/centos-root   82G  3.5G   79G   5% /
/dev/mapper/centos-home   10G   33M   10G   1% /home
/dev/sda1                497M  255M  243M  52% /boot
tmpfs                    379M     0  379M   0% /run/user/0

Post by **ssoliveira** » Fri Sep 27, 2019 2:02 pm

We identified the problem.

All our servers have 2 network cards (eth0: frontend-application and eth1: backend-backup).

For some reason, after the new server starts communicating with IP 10.144.142.12, it tries to continue the communication using the IP of interface eth1.

Code: Select all

[root@centos702 ~]# tail -f /var/log/elasticsearch/74133b7f-483e-45db-b4fd-298d8f0792d7.log

[2019-09-27 18:43:13,813][WARN ][discovery.zen            ] [8e6e3199-003d-4e60-ae42-e26278140ffa] failed to connect to master [[c174a8c9-5660-4200-81a8-4e7443c67e54][75oZ8uvnTJmoYSTnc-m0Kg][centos701.local][inet[/172.16.11.102:9300]]{max_local_storage_nodes=1}], retrying...
org.elasticsearch.transport.ConnectTransportException: [c174a8c9-5660-4200-81a8-4e7443c67e54][inet[/172.16.11.102:9300]] connect_timeout[30s]
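
A minimal sketch for pulling the advertised master address out of a log line like the one above, so it can be compared with the address used at install time. The function name `master_addr` is hypothetical, and the pattern assumes the `inet[/IP:PORT]` form that appears in this log.

```shell
# Print the IP from the first inet[/IP:PORT] token found on stdin.
# Assumes the log format shown in this thread; master_addr is a hypothetical helper.
master_addr() {
    grep -oE 'inet\[/[0-9.]+:[0-9]+\]' \
        | head -n 1 \
        | sed -E 's#inet\[/([0-9.]+):[0-9]+\]#\1#'
}

# Usage (log path taken from the post above):
#   grep 'failed to connect to master' \
#       /var/log/elasticsearch/74133b7f-483e-45db-b4fd-298d8f0792d7.log \
#       | tail -n 1 | master_addr
```

If the printed address belongs to a different interface than the one used to build the cluster, that would point to the same mismatch described here.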

This is strange, because we performed the installation using IP 10.144.142.12.

Code: Select all

[root@centos701 ~]# cat /tmp/nagioslogserver/install.log

...
Created symlink from /etc/systemd/system/multi-user.target.wants/httpd.service to /usr/lib/systemd/system/httpd.service.
daemons step completed OK
Running 'webroot'...
webroot step completed OK

Nagios Log Server Installation Success!

You can finish the final setup steps for Nagios Log Server by visiting:
    http://10.144.142.12/nagioslogserver/

Code: Select all

[root@centos701 ~]# ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.144.142.12  netmask 255.255.192.0  broadcast 10.144.191.255
        ether 00:50:56:88:71:5d  txqueuelen 1000  (Ethernet)
        RX packets 31863  bytes 2627232 (2.5 MiB)
        RX errors 0  dropped 559  overruns 0  frame 0
        TX packets 3434  bytes 1317220 (1.2 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.16.11.102  netmask 255.255.255.0  broadcast 172.16.11.255
        ether 00:50:56:88:77:d2  txqueuelen 1000  (Ethernet)
        RX packets 13  bytes 780 (780.0 B)
        RX errors 0  dropped 13  overruns 0  frame 0
        TX packets 6  bytes 360 (360.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 18802  bytes 4237891 (4.0 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 18802  bytes 4237891 (4.0 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

We will manually edit the cluster_hosts file and try to finish building the cluster. If necessary, we will remove the eth1 interface temporarily. This behavior is very strange, though: we have performed the installation several times following this same procedure, and it never occurred before.

Code: Select all

[root@centos701 ~]# cat /usr/local/nagioslogserver/var/cluster_hosts
localhost
172.16.11.102
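
The manual edit described above could look roughly like the sketch below, which swaps one address for another in a hosts file and keeps a backup copy. The function name `swap_host` is hypothetical, the choice of 10.144.142.12 as the replacement is an assumption based on the `ifconfig` output, and nothing in this thread confirms that Nagios Log Server leaves a hand-edited cluster_hosts file untouched.

```shell
# Replace a line that is exactly OLD with NEW in FILE, keeping a copy as FILE.bak.
# swap_host is a hypothetical helper, not part of Nagios Log Server.
swap_host() {
    file=$1
    old=$2
    new=$3
    # Escape the dots so the old address is matched literally.
    old_re=$(printf '%s' "$old" | sed 's/\./\\./g')
    sed -i.bak "s/^${old_re}\$/${new}/" "$file"
}

# Usage (path and old address from the output above; new address is an assumption):
#   swap_host /usr/local/nagioslogserver/var/cluster_hosts 172.16.11.102 10.144.142.12
```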

Post by **mbellerue** » Fri Sep 27, 2019 2:37 pm

Great catch! You might also check out the /usr/local/nagioslogserver/elasticsearch/config/elasticsearch.yml file. Specifically, this section:

Code: Select all

############################## Network And HTTP ###############################

# Elasticsearch, by default, binds itself to the 0.0.0.0 address, and listens
# on port [9200-9300] for HTTP traffic and on port [9300-9400] for node-to-node
# communication. (the range means that if the port is busy, it will automatically
# try the next port).

# Set the bind address specifically (IPv4 or IPv6):
#
# network.bind_host: 192.168.0.1

# Set the address other nodes will use to communicate with this node. If not
# set, it is automatically derived. It must point to an actual IP address.
#
# network.publish_host: 192.168.0.1

# Set both 'bind_host' and 'publish_host':
#
# network.host: 192.168.0.1
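
As a sketch of how that section might be applied to this case, the snippet below appends a `network.publish_host` line to a config file unless an uncommented one is already present. It assumes 10.144.142.12 (eth0 on centos701, per the `ifconfig` output earlier in the thread) is the address the cluster should use. The function name `set_publish_host` is hypothetical, and the thread does not confirm that this setting resolves the join error.

```shell
# Append network.publish_host to an elasticsearch.yml-style file,
# unless an active (uncommented) setting is already there.
# set_publish_host is a hypothetical helper.
set_publish_host() {
    file=$1
    addr=$2
    if grep -qE '^[[:space:]]*network\.publish_host:' "$file"; then
        echo "network.publish_host is already set in $file" >&2
        return 1
    fi
    # Leading newline guards against a file that lacks a trailing newline.
    printf '\nnetwork.publish_host: %s\n' "$addr" >> "$file"
}

# Usage (path from the post above; the address is an assumption):
#   set_publish_host /usr/local/nagioslogserver/elasticsearch/config/elasticsearch.yml 10.144.142.12
# Elasticsearch reads this file at startup, so the service has to be restarted afterwards.
```

Each node would need its own address here, so the value differs per server.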
