New instance failed to join cluster

Post by **rferebee** » Wed Aug 28, 2019 5:52 pm

Hello,

I'm trying to add my last CentOS 7 upgraded server to my Log Server cluster. The first time I added it, it hit some kind of issue, and now I can't get back into the GUI to try to re-add it.

I deleted it from the Instance Status page of my main cluster GUI, but the new server now tells me that the Elasticsearch service isn't running and won't let me proceed without starting it.

It looks like when I tried to join it the first time it didn't create all the necessary directories, so it won't even let me start Elasticsearch. The only directory inside of /usr/local/nagioslogserver is 'lost+found'.

Is there a way to force this box to "start over" and take me back to the New Instance/Connect screen?

Thank you.

Post by **rferebee** » Thu Aug 29, 2019 10:46 am

I attached a screenshot of what I see when I try to access the GUI of the new server.

Here's everything in the /usr/local/nagioslogserver directory:

Code:

root@nagioslscc2_temp:/root>cd /usr/local/nagioslogserver/
root@nagioslscc2_temp:/usr/local/nagioslogserver>ls
lost+found

Here's what it looks like when I check the status of elasticsearch and logstash:

Code:

root@nagioslscc2_temp:/usr/local/nagioslogserver>systemctl status elasticsearch
● elasticsearch.service - LSB: This service manages the elasticsearch daemon
   Loaded: loaded (/etc/rc.d/init.d/elasticsearch; bad; vendor preset: disabled)
   Active: failed (Result: exit-code) since Wed 2019-08-28 15:41:21 PDT; 17h ago
     Docs: man:systemd-sysv-generator(8)
  Process: 13702 ExecStart=/etc/rc.d/init.d/elasticsearch start (code=exited, status=1/FAILURE)

Aug 28 15:41:21 nagioslscc2_temp systemd[1]: Starting LSB: This service manages the elasticsearch daemon...
Aug 28 15:41:21 nagioslscc2_temp elasticsearch[13702]: Could not open input file: /usr/local/nagioslogserver/scripts/get_es_config.php
Aug 28 15:41:21 nagioslscc2_temp systemd[1]: elasticsearch.service: control process exited, code=exited status=1
Aug 28 15:41:21 nagioslscc2_temp systemd[1]: Failed to start LSB: This service manages the elasticsearch daemon.
Aug 28 15:41:21 nagioslscc2_temp systemd[1]: Unit elasticsearch.service entered failed state.
Aug 28 15:41:21 nagioslscc2_temp systemd[1]: elasticsearch.service failed.
root@nagioslscc2_temp:/usr/local/nagioslogserver>systemctl status logstash
● logstash.service - LSB: Logstash
   Loaded: loaded (/etc/rc.d/init.d/logstash; bad; vendor preset: disabled)
   Active: failed (Result: exit-code) since Wed 2019-08-28 15:39:07 PDT; 17h ago
     Docs: man:systemd-sysv-generator(8)
  Process: 3255 ExecStart=/etc/rc.d/init.d/logstash start (code=exited, status=1/FAILURE)

Aug 28 15:39:07 nagioslscc2_temp systemd[1]: Starting LSB: Logstash...
Aug 28 15:39:07 nagioslscc2_temp logstash[3255]: cat: /usr/local/nagioslogserver/var/cluster_uuid: No such file or directory
Aug 28 15:39:07 nagioslscc2_temp logstash[3255]: Could not open input file: /usr/local/nagioslogserver/scripts/get_logstash_config.php
Aug 28 15:39:07 nagioslscc2_temp systemd[1]: logstash.service: control process exited, code=exited status=1
Aug 28 15:39:07 nagioslscc2_temp systemd[1]: Failed to start LSB: Logstash.
Aug 28 15:39:07 nagioslscc2_temp systemd[1]: Unit logstash.service entered failed state.
Aug 28 15:39:07 nagioslscc2_temp systemd[1]: logstash.service failed.

So, for whatever reason, the server thinks it's still connected to the cluster and is trying to collect logs even though it isn't and can't?

It took 3 weeks to get this server up and running from my Unix team (don't ask)... I'm really hoping we can salvage this and it doesn't need to be rebuilt.

Post by **mbellerue** » Thu Aug 29, 2019 12:44 pm

You mentioned that there's a lost+found directory in /usr/local/nagioslogserver/. Is that a mount point for additional storage? Is it possible that the Log Server files got installed to /usr/local/nagioslogserver/, and then the additional storage was mounted on top of them? Maybe try umount /usr/local/nagioslogserver/ and then see what's in that directory. I'm asking just to be sure, because we can recover from that pretty easily.
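A quick way to answer the mount point question before unmounting anything is sketched below. It is only an illustration: the function name `check_mount` is made up for this example, and it assumes the util-linux `mountpoint` and `findmnt` tools are installed, as they normally are on CentOS 7.

```shell
#!/usr/bin/env bash
# Sketch: report whether a directory is currently a mount point.
# check_mount is a hypothetical helper name, not part of Log Server.

check_mount() {
    local dir="$1"
    if mountpoint -q "$dir"; then
        echo "$dir is a mount point"
        # Show which device and filesystem are mounted there.
        findmnt "$dir"
    else
        echo "$dir is not a mount point"
    fi
}

check_mount /usr/local/nagioslogserver
```

If the directory is reported as a mount point, anything installed there before the storage was mounted is hidden underneath it until the storage is unmounted.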

Post by **rferebee** » Thu Aug 29, 2019 12:55 pm

Yes, it is a mount point to a larger storage drive.

The instance was accidentally added to the cluster before the storage was mounted, and it looks like that's exactly what the problem is.

With the storage unmounted, all of these directories are present:

Code:

root@nagioslscc2_temp:/usr/local/nagioslogserver>ls
elasticsearch  etc  logstash  mibs  scripts  snapshots  tmp  var

If I just remove the /nagioslogserver directory will it "reset" everything?

Post by **mbellerue** » Thu Aug 29, 2019 1:20 pm

It won't reset everything, because there is some configuration in other places. The easiest way to deal with this is to create a temporary mount point, say at /mnt/temp/, mount your additional storage there, then run:

Code:

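# Run with the additional storage unmounted from /usr/local/nagioslogserver/
# and mounted at /mnt/temp/ instead.
cd /usr/local/nagioslogserver/
cp -a ./. /mnt/temp/   # -a keeps ownership and permissions; ./. also copies dotfiles
cd /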

That will get everything onto your additional storage in exactly the way that Log Server is expecting it. You then unmount the additional storage from /mnt/temp/ and re-mount it at /usr/local/nagioslogserver/. Then start your services.
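The unmount, re-mount, and service start steps could be sketched roughly as follows. This is a hedged outline, not a tested procedure: the device path `/dev/sdX1` is a placeholder because the thread never names the actual device, and `run`, `remount_and_start`, and the `DRY_RUN` switch are names invented for this example. By default the script only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch: move the storage from the temporary mount point back to the
# Log Server directory, then start the services.
set -euo pipefail

DEVICE="${DEVICE:-/dev/sdX1}"        # assumption: placeholder, the real device is not in the thread
TARGET="/usr/local/nagioslogserver"
TEMP="/mnt/temp"
DRY_RUN="${DRY_RUN:-1}"              # 1 = print commands only, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

remount_and_start() {
    run umount "$TEMP"
    run mount "$DEVICE" "$TARGET"
    run systemctl start elasticsearch
    run systemctl start logstash
}

remount_and_start
```

Once the printed commands look right for your system, running it as root with `DRY_RUN=0` and the real `DEVICE` value would execute them. If the storage is listed in /etc/fstab, a plain `mount /usr/local/nagioslogserver` would also work for the re-mount step.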

Post by **rferebee** » Mon Sep 16, 2019 5:50 pm

This issue was resolved internally. We had to rebuild the VM.

Please lock this thread. Thank you.

Post by **scottwilkerson** » Tue Sep 17, 2019 7:56 am

rferebee wrote: This issue was resolved internally. We had to rebuild the VM.

Please lock this thread. Thank you.

Great!

Locking
