Myriad log server troubles

Posted: Thu Mar 12, 2020 8:15 am
by jpconsilio
It seems like every time I connect to one or the other of my two-instance Nagios log server clusters, I find one of the following situations:
* Yellow health with half of the shards unassigned; other node unavailable
* Elasticsearch not running; please wait
* Page times out; login page fails to load
In every case, the servers are available for SSH and I am able to confirm that the logstash and elasticsearch services are running.
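
The first symptom (yellow health, half of the shards unassigned, partner unavailable) is consistent with a two-node cluster that currently sees only one node. A minimal sketch for capturing that state from the command line, assuming Elasticsearch answers on localhost:9200 (an assumption, not confirmed in this thread); the function names are hypothetical:

```shell
#!/bin/sh
# Sketch only: summarise the Elasticsearch cluster health JSON read from stdin.
# json_field and health_summary are hypothetical helper names.

# Print the value of one top-level string or numeric field.
json_field() {
    sed -n 's/.*"'"$1"'" *: *"\{0,1\}\([^",}]*\).*/\1/p'
}

# Print a one-line summary; return 3 if no status field was found.
health_summary() {
    json=$(cat)
    status=$(printf '%s\n' "$json" | json_field status)
    nodes=$(printf '%s\n' "$json" | json_field number_of_nodes)
    unassigned=$(printf '%s\n' "$json" | json_field unassigned_shards)
    if [ -z "$status" ]; then
        echo "no health response"
        return 3
    fi
    echo "status=$status nodes=$nodes unassigned_shards=$unassigned"
}

# Live usage (assumes the default HTTP port on the local node):
#   curl -s 'http://localhost:9200/_cluster/health' | health_summary
```

Running this on each node when the problem recurs would show whether each node believes it is alone in the cluster (nodes=1) or whether Elasticsearch is not answering at all.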
In another recent case where the servers could not "see" one another, the clusterhosts file no longer included the partner. Manually adding back the missing cluster instance and restarting both instances seemed to get things back on track.
In every one of these cases, rebooting both nodes seems to fix things, but only temporarily. Within a day or two I'll get a repeat.
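
Since the partner has gone missing from the cluster hosts file at least once, a small check could flag it before the nodes lose sight of each other again. This is a sketch under assumptions: the file path in the usage comment and the one-host-per-line layout are guesses to verify on your own install, the IP is a placeholder, and the function names are hypothetical.

```shell
#!/bin/sh
# Sketch only: verify that a cluster hosts file still lists the partner node.
# Assumes one host per line; adjust if your file is laid out differently.

# Return 0 if the file contains the peer as a whole line.
hosts_file_has_peer() {
    grep -qxF -- "$2" "$1"
}

# Print a Nagios-style status line and return the matching plugin exit code.
check_cluster_hosts() {
    if [ ! -r "$1" ]; then
        echo "UNKNOWN - cannot read $1"
        return 3
    fi
    if hosts_file_has_peer "$1" "$2"; then
        echo "OK - $2 listed in $1"
        return 0
    fi
    echo "CRITICAL - $2 missing from $1"
    return 2
}

# Live usage (path and IP are assumptions, not taken from this thread):
#   check_cluster_hosts /usr/local/nagioslogserver/var/cluster_hosts 192.0.2.11
```

Noting the time at which the check first goes CRITICAL would help correlate the file change with whatever rewrote it.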

I'm running 2.1.1 on both instances of cluster 1. Cluster 2 is mixed, with one instance on 2.1.1 and the other on 2.1.2. I expect all of them should be upgraded to 2.1.4.

Cluster 1's file systems look like this:
Filesystem               Size  Used Avail Use% Mounted on
/dev/mapper/centos-root 1005G  276G  688G  29% /
devtmpfs                 3.9G     0  3.9G   0% /dev
tmpfs                    3.9G     0  3.9G   0% /dev/shm
tmpfs                    3.9G   12M  3.9G   1% /run
tmpfs                    3.9G     0  3.9G   0% /sys/fs/cgroup
/dev/sda1                976M  197M  713M  22% /boot
tmpfs                    782M     0  782M   0% /run/user/1000
tmpfs                    782M     0  782M   0% /run/user/0

Cluster 2's file systems look like this:
Filesystem               Size  Used Avail Use% Mounted on
/dev/mapper/centos-root 1005G  276G  688G  29% /
devtmpfs                 3.9G     0  3.9G   0% /dev
tmpfs                    3.9G     0  3.9G   0% /dev/shm
tmpfs                    3.9G   12M  3.9G   1% /run
tmpfs                    3.9G     0  3.9G   0% /sys/fs/cgroup
/dev/sda1                976M  197M  713M  22% /boot
tmpfs                    782M     0  782M   0% /run/user/1000
tmpfs                    782M     0  782M   0% /run/user/0

Any assistance appreciated.

Re: Myriad log server troubles

Posted: Thu Mar 12, 2020 2:40 pm
by mbellerue
jpconsilio wrote: In another recent case where the servers could not "see" one another, the clusterhosts file no longer included the partner. Manually adding back the missing cluster instance and restarting both instances seemed to get things back on track.
In every one of these cases, rebooting both nodes seems to fix things, but only temporarily. Within a day or two I'll get a repeat.
This sounds suspiciously like a duplicate IP address. One way to keep an eye on this might be to spin up a passive service check using NCPA and have it run arp | grep <OtherNode'sIP> | awk '{print $3}'. If it comes back with a MAC address other than the partner node's own MAC, you've got something to work with.
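
The ARP check described above could be sketched as a small plugin-style script. This is only an illustration: the function names are hypothetical, the IP and MAC in the usage comment are placeholders, and it assumes the net-tools arp output layout in which the hardware address is the third column.

```shell
#!/bin/sh
# Sketch only: compare the MAC seen in the ARP cache for the partner node
# against the MAC that node is known to have.

# Read `arp -n` style output on stdin; print the MAC for one IP, lowercased.
# Lines without a MAC-looking third column (e.g. incomplete entries) are skipped.
mac_for_ip() {
    awk -v ip="$1" '$1 == ip && $3 ~ /^[0-9A-Fa-f]+(:[0-9A-Fa-f]+)+$/ { print tolower($3) }'
}

# Args: peer IP, expected MAC. Prints a status line and returns 0, 2, or 3.
check_peer_mac() {
    expected=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
    observed=$(mac_for_ip "$1")
    if [ -z "$observed" ]; then
        echo "UNKNOWN - no ARP entry for $1"
        return 3
    elif [ "$observed" = "$expected" ]; then
        echo "OK - $1 is at $observed"
        return 0
    fi
    echo "CRITICAL - $1 is at $observed, expected $expected"
    return 2
}

# Live usage (placeholder values):
#   arp -n | check_peer_mac 192.0.2.11 00:11:22:33:44:55
```

A CRITICAL result means some other device answered ARP for the partner's IP, which is the duplicate-address case this check is meant to catch.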

Otherwise, you've listed a lot of issues across multiple clusters and multiple versions of Log Server. We may have to break this out into multiple threads. For our troubleshooting time to be effective, we'll need to work on one specific issue on one specific instance of Log Server.

Re: Myriad log server troubles

Posted: Fri Mar 13, 2020 9:58 am
by jpconsilio
Hi,
I have confirmed that there is no IP conflict at present. The ARP entries on the cluster nodes and the switches are correct.

At present, cluster 1, node 1 is unreachable via HTTPS. Cluster 1, node 2 shows the "Waiting for Database Startup" error page for Elasticsearch.
Cluster 1, node 1:
[root@MTPVPANLM01 ~]# systemctl status elasticsearch.service
● elasticsearch.service - LSB: This service manages the elasticsearch daemon
Loaded: loaded (/etc/rc.d/init.d/elasticsearch; bad; vendor preset: disabled)
Active: active (running) since Wed 2020-03-11 00:55:20 UTC; 2 days ago
Docs: man:systemd-sysv-generator(8)
Process: 8570 ExecStart=/etc/rc.d/init.d/elasticsearch start (code=exited, status=0/SUCCESS)
CGroup: /system.slice/elasticsearch.service
└─8799 /bin/java -Xms3906m -Xmx3906m -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC ...

Cluster 1, node 2:
[root@MTPVPANLM02 ~]# systemctl status elasticsearch.service
● elasticsearch.service - LSB: This service manages the elasticsearch daemon
Loaded: loaded (/etc/rc.d/init.d/elasticsearch; bad; vendor preset: disabled)
Active: active (exited) since Wed 2020-03-11 02:02:46 UTC; 2 days ago
Docs: man:systemd-sysv-generator(8)
Process: 8559 ExecStart=/etc/rc.d/init.d/elasticsearch start (code=exited, status=0/SUCCESS)

Mar 11 02:02:46 MTPVPANLM02.consilio.com systemd[1]: Starting LSB: This service manages the elasticsearch daemon...
Mar 11 02:02:46 MTPVPANLM02.consilio.com runuser[8731]: pam_unix(runuser:session): session opened for user nag...d=0)
Mar 11 02:02:46 MTPVPANLM02.consilio.com runuser[8731]: pam_unix(runuser:session): session closed for user nagios
Mar 11 02:02:46 MTPVPANLM02.consilio.com elasticsearch[8559]: Starting elasticsearch: [ OK ]
Mar 11 02:02:46 MTPVPANLM02.consilio.com systemd[1]: Started LSB: This service manages the elasticsearch daemon.
Hint: Some lines were ellipsized, use -l to show in full.
[root@MTPVPANLM02 ~]#

Cluster 2 seems fine at the moment, although a bit slow.
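
The two status outputs above differ on the Active line: node 1 reports active (running) with a java process in the cgroup, while node 2 reports active (exited) with no process listed. For an init-script-based unit like this one, active (exited) typically means the start script returned successfully but nothing is running any more, so "active" alone is not proof that Elasticsearch is up. A minimal sketch for telling the two apart; the function names are hypothetical:

```shell
#!/bin/sh
# Sketch only: classify a unit from the "Active:" line of `systemctl status`
# output read on stdin.

# Print the sub-state in parentheses, e.g. "running", "exited", "dead".
active_substate() {
    sed -n 's/^ *Active: [a-z]* (\([a-z-]*\)).*/\1/p'
}

# Print a status line and return 0 (running), 2 (exited), or 3 (anything else).
check_unit_substate() {
    substate=$(active_substate)
    case "$substate" in
        running)
            echo "OK - unit has a running process"
            return 0 ;;
        exited)
            echo "CRITICAL - unit is active (exited): start script succeeded but no process remains"
            return 2 ;;
        *)
            echo "UNKNOWN - sub-state '${substate}'"
            return 3 ;;
    esac
}

# Live usage:
#   systemctl status elasticsearch.service | check_unit_substate
```

The same sub-state is also available directly from systemctl show -p SubState elasticsearch.service, which avoids parsing the human-readable output.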

Re: Myriad log server troubles

Posted: Fri Mar 13, 2020 5:28 pm
by mbellerue
Okay, from cluster 1, node 2, let's get a system profile. Since you can't get into the GUI, we'll grab it from the command line. Log in to the server as root, run /usr/local/nagioslogserver/scripts/profile.sh 20200312, and then retrieve the system profile from the /tmp directory. You can PM it to me if you don't want to post it directly on the forum. We'll take a look and see if we can find out why Elasticsearch crashed.