Nagios Failover Issue

rtsupport · Post by **rtsupport** » Tue Feb 27, 2018 10:46 am

Failover happened around 3:30 PM EST on 24th Jan. I suppose the DR server does not know the reason, because it just checks the condition and performs the failover. But if any logs are required from DR, I can send them.

scottwilkerson · Post by **scottwilkerson** » Tue Feb 27, 2018 10:51 am

We don't have any logs from either server for that date so I don't really have much to go on. The logs I have cover Feb 23 03:35:05 to Feb 24 03:16:02 (not sure what timezone).

I would suggest having the plugin that performs the check from the DR server output something, so you know which condition triggered the failover.
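As a rough illustration of what that output could look like, here is a minimal sketch. It assumes the DR check looks at two things, whether the Nagios processes exist on the master and whether nagios.log is still receiving data. The function names, the argument layout, and the age threshold are all hypothetical and not taken from the actual DR script.

```shell
#!/bin/sh
# Hypothetical helper: classify why the DR check would trigger a failover.
# Arguments: $1 = number of Nagios processes seen on the master,
#            $2 = age of the newest nagios.log entry, in seconds,
#            $3 = maximum tolerated age in seconds (assumed threshold).
failover_reason() {
    procs=$1
    log_age=$2
    max_age=$3
    if [ "$procs" -eq 0 ]; then
        echo "process_missing"
    elif [ "$log_age" -gt "$max_age" ]; then
        echo "log_stale"
    else
        echo "ok"
    fi
}

# Append a timestamped line so the reason can be matched against master logs.
# Arguments: $1 = reason string, $2 = log file to append to.
log_reason() {
    reason=$1
    logfile=$2
    printf '%s failover_check reason=%s\n' \
        "$(date '+%Y-%m-%d %H:%M:%S %Z')" "$reason" >> "$logfile"
}
```

The existing plugin would call `failover_reason` with whatever values it already collects, then pass the result to `log_reason` before acting on it. Writing the timezone into each line should also avoid the kind of date confusion seen with the logs so far.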

rtsupport · Post by **rtsupport** » Tue Feb 27, 2018 11:12 am

I have sent a link to all logs from both servers. The timezone is EST on both servers.
About the script that checks the condition, I will check whether it can be altered to log the activity. But that will only tell us which condition triggered the failover; the actual reason can only be confirmed from the master server logs, I guess.

scottwilkerson · Post by **scottwilkerson** » Tue Feb 27, 2018 1:57 pm

rtsupport wrote: Failover happened around 3:30 PM EST on 24th Jan.

Did it go down on Feb 24th? All the logs you sent were from February.

scottwilkerson · Post by **scottwilkerson** » Tue Feb 27, 2018 2:05 pm

I went with this assumption and I can see the DR machine picked up at ~15:42.

Looking just before that in the master logs, I start seeing these:

Feb 24 15:16:08 usa7061lv1367 ndo2db: Error: max retries exceeded sending message to queue. Kernel queue parameters may neeed to be tuned. See README.
Feb 24 15:16:09 usa7061lv1367 ndo2db: Warning: queue send error, retrying...

We have a document here that can assist with tuning these
https://support.nagios.com/kb/article.php?id=139
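When checking these parameters, keep in mind that /etc/sysctl.conf only shows what is configured. The running kernel can hold different values if the file was edited without reloading it. The sketch below compares the two. The function names are hypothetical, and it assumes a Linux system where the live values are exposed under /proc/sys/kernel.

```shell
#!/bin/sh
# Print the value configured for a kernel.<key> entry in a sysctl-style file.
# Commented-out lines are ignored; if the key appears twice, the last one wins.
conf_value() {
    key=$1
    file=$2
    sed -n "s/^[[:space:]]*kernel\.$key[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p" "$file" \
        | tail -n 1
}

# Print the value the running kernel is actually using.
live_value() {
    cat "/proc/sys/kernel/$1"
}

# Show configured and running values side by side for the three queue limits.
compare_msg_params() {
    file=${1:-/etc/sysctl.conf}
    for key in msgmni msgmnb msgmax; do
        printf 'kernel.%s configured=%s running=%s\n' \
            "$key" "$(conf_value "$key" "$file")" "$(live_value "$key")"
    done
}
```

Running `compare_msg_params` with no argument reads /etc/sysctl.conf. If a configured value differs from the running one, the setting has probably not been applied yet.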

rtsupport · Post by **rtsupport** » Fri Mar 16, 2018 4:17 am

Hi,

As per the link you provided, we have checked the parameters on our server and they all seem to be good. Here are the values set on our server.

$ cat /etc/sysctl.conf |grep kernel.msg
kernel.msgmni = 512000
kernel.msgmnb = 522288000
kernel.msgmax = 522288000

This week failover happened twice, at around the following times:

14 March 7:48AM IST
14 March 17:46 EST

I am sending the logs over PM. Please check whether you can find something new and advise us.

scottwilkerson · Post by **scottwilkerson** » Fri Mar 16, 2018 1:34 pm

Have we been able to determine if it is failing over because the processes don't exist or because of the nagios.log not having data?

Also, we have not received new logs. However, I am going on vacation, so please send them to @npolovenko.

rtsupport · Post by **rtsupport** » Mon Mar 19, 2018 5:17 am

I am sending the logs by PM to you and npolovenko. Let me know if you face any issue accessing them. By "processes don't exist", which process do you mean?

Post by **tgriep** » Mon Mar 19, 2018 4:57 pm

I think what @scottwilkerson was asking is whether your system fails over if a process stops running.
Around that time, I see that the ndo2db process was having issues with the kernel message queue, but no reason as to why.
It could be a MySQL issue, so can you post the mysqld.log file?
How many hosts and services is the server monitoring?
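To see whether the kernel message queue really is filling up around the failover times, the output of `ipcs -q` can be sampled on the master. The sketch below is only an illustration: the function name is hypothetical, and it assumes the usual Linux `ipcs -q` column layout (key, msqid, owner, perms, used-bytes, messages).

```shell
#!/bin/sh
# Read `ipcs -q` style output on stdin and print the queues that are at
# least $2 percent full relative to the per-queue byte limit $1 (msgmnb).
busy_queues() {
    limit=$1
    pct=$2
    awk -v limit="$limit" -v pct="$pct" '
        # Data rows start with a hex key such as 0x4e010001; skip headers.
        $1 ~ /^0x/ && NF >= 6 {
            if ($5 * 100 >= limit * pct)
                printf "msqid=%s owner=%s used_bytes=%s messages=%s\n", $2, $3, $5, $6
        }'
}

# Example usage on the master (threshold of 80 percent is an assumption):
#   ipcs -q | busy_queues "$(cat /proc/sys/kernel/msgmnb)" 80
```

If a queue owned by the nagios user shows up here shortly before ndo2db starts logging queue send errors, that would suggest the database side is not draining messages fast enough, which fits with looking at mysqld.log next.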

rtsupport · Post by **rtsupport** » Tue Mar 20, 2018 5:59 am

We have not yet been able to alter the DR script to log the failover condition.
I will PM you a link to mysqld.log. There are around 800 hosts and 6000 services being monitored.
