Myriad log server troubles
Posted: Thu Mar 12, 2020 8:15 am
It seems like every time I connect to one of my two Nagios Log Server clusters (each is a two-instance cluster), I find one of the following situations (the checks I use to confirm them are shown after the list):
* Yellow health with half of the shards unassigned; other node unavailable
* Elasticsearch not running; please wait
* Page times out; login page fails to load
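For reference, when I hit the first two of those, this is roughly what I run from the node itself to confirm (a minimal sketch, assuming Elasticsearch is listening on its default localhost:9200; adjust if your install binds elsewhere):

curl -s 'localhost:9200/_cluster/health?pretty'           # overall status plus the unassigned_shards count
curl -s 'localhost:9200/_cat/shards?v' | grep UNASSIGNED  # which indices the unassigned shards belong to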
In every case, the servers remain reachable over SSH and I can confirm that the logstash and elasticsearch services are running.
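(For what it's worth, this is how I check them; assuming CentOS 7 with systemd, which matches these boxes:)

systemctl is-active elasticsearch logstash   # prints active/inactive, one line per unit
systemctl status elasticsearch logstash      # adds the most recent journal lines for each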
In another recent case, where the servers could not "see" one another, the clusterhosts file no longer included the partner instance. Manually adding the missing instance back and restarting both nodes seemed to get things back on track.
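Since then, a quick way I've found to confirm whether the two instances actually see each other, independent of the UI (again assuming the default port 9200):

curl -s 'localhost:9200/_cat/nodes?v'   # should list both nodes; a single line means the cluster has split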
In every one of these cases, rebooting both nodes seems to fix things, but only temporarily. Within a day or two I'll get a repeat.
I'm running 2.1.1 on both instances of cluster 1. Cluster 2 is mixed, with one instance on 2.1.1 and the other on 2.1.2. I expect all four instances should be upgraded to 2.1.4.
Cluster 1's file systems look like this:
Filesystem               Size   Used  Avail  Use%  Mounted on
/dev/mapper/centos-root  1005G  276G  688G    29%  /
devtmpfs                  3.9G     0  3.9G     0%  /dev
tmpfs                     3.9G     0  3.9G     0%  /dev/shm
tmpfs                     3.9G   12M  3.9G     1%  /run
tmpfs                     3.9G     0  3.9G     0%  /sys/fs/cgroup
/dev/sda1                 976M  197M  713M    22%  /boot
tmpfs                     782M     0  782M     0%  /run/user/1000
tmpfs                     782M     0  782M     0%  /run/user/0
Cluster 2's file systems look like this:
Filesystem               Size   Used  Avail  Use%  Mounted on
/dev/mapper/centos-root  1005G  276G  688G    29%  /
devtmpfs                  3.9G     0  3.9G     0%  /dev
tmpfs                     3.9G     0  3.9G     0%  /dev/shm
tmpfs                     3.9G   12M  3.9G     1%  /run
tmpfs                     3.9G     0  3.9G     0%  /sys/fs/cgroup
/dev/sda1                 976M  197M  713M    22%  /boot
tmpfs                     782M     0  782M     0%  /run/user/1000
tmpfs                     782M     0  782M     0%  /run/user/0
Any assistance appreciated.