Nagios Failover Issue

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
rtsupport
Posts: 188
Joined: Mon Jan 08, 2018 4:30 am

Re: Nagios Failover Issue

Post by rtsupport »

Failover happened around 3:30 PM EST on 24th Jan. I suppose the DR server does not know the reason, because it just checks the condition and performs the failover. But if any logs are required from DR, I can send them.
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: Nagios Failover Issue

Post by scottwilkerson »

We don't have any logs from either server for that date, so I don't really have much to go on. The logs I have cover Feb 23 03:35:05 to Feb 24 03:16:02 (not sure what timezone).

I would suggest having the plugin that performs the check from the DR server output something that identifies the condition it hit, so you know which is the case.
Former Nagios employee
Creator:
Human Design Website
Get Your Human Design Chart
rtsupport
Posts: 188
Joined: Mon Jan 08, 2018 4:30 am

Re: Nagios Failover Issue

Post by rtsupport »

I have sent a link to all logs from both servers. The timezone is EST on both servers.
As for the script that checks the condition, I will check whether it can be altered to log its activity. But that would only tell us which condition triggered the failover; the actual reason can only be confirmed from the master server logs, I guess.
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: Nagios Failover Issue

Post by scottwilkerson »

rtsupport wrote: Failover happened around 3:30 PM EST on 24th Jan.
Did it go down on Feb 24th?
All the logs you sent were from February.
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: Nagios Failover Issue

Post by scottwilkerson »

I went with this assumption, and I can see the DR machine picked up at ~15:42.

Looking just before that in the Master logs, I start seeing these:

Feb 24 15:16:08 usa7061lv1367 ndo2db: Error: max retries exceeded sending message to queue. Kernel queue parameters may neeed to be tuned. See README.
Feb 24 15:16:09 usa7061lv1367 ndo2db: Warning: queue send error, retrying...

We have a document here that can assist with tuning these kernel queue parameters:
https://support.nagios.com/kb/article.php?id=139
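
For anyone following along, a rough way to see how often these messages appear around a failover is to bucket them per minute. The sketch below is only an illustration: it assumes syslog-style lines shaped like the two quoted above, and the function name `count_queue_errors` is made up for this example.

```python
import re
from collections import Counter

# The two sample lines are copied from the post above.
SAMPLE = [
    "Feb 24 15:16:08 usa7061lv1367 ndo2db: Error: max retries exceeded "
    "sending message to queue. Kernel queue parameters may neeed to be "
    "tuned. See README.",
    "Feb 24 15:16:09 usa7061lv1367 ndo2db: Warning: queue send error, "
    "retrying...",
]

# Assumed line shape: "Mon DD HH:MM:SS host ndo2db[optional pid]: Level: text"
LINE_RE = re.compile(
    r"^(?P<minute>\w{3}\s+\d{1,2} \d{2}:\d{2}):\d{2} \S+ "
    r"ndo2db(?:\[\d+\])?: (?P<level>Error|Warning): (?P<msg>.*)$"
)


def count_queue_errors(lines):
    """Count ndo2db messages that mention the queue, keyed by 'Mon DD HH:MM'."""
    per_minute = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and "queue" in match.group("msg"):
            per_minute[match.group("minute")] += 1
    return per_minute


if __name__ == "__main__":
    for minute, count in sorted(count_queue_errors(SAMPLE).items()):
        print(minute, count)
```

Feeding it the full Master log instead of `SAMPLE` would show whether the queue errors begin well before the ~15:42 pickup on the DR machine or only right before it.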
rtsupport
Posts: 188
Joined: Mon Jan 08, 2018 4:30 am

Re: Nagios Failover Issue

Post by rtsupport »

Hi,

As per the link you provided, we have checked the parameters on our server, and they all seem to be good. Here are the values set on our server:

$ cat /etc/sysctl.conf |grep kernel.msg
kernel.msgmni = 512000
kernel.msgmnb = 522288000
kernel.msgmax = 522288000
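
One thing worth double-checking: entries in /etc/sysctl.conf only take effect once they are loaded (at boot or with `sysctl -p`), so the file and the running kernel can disagree. Nothing in this thread says that is the case here; the sketch below is just one possible way to compare the two. The helper names are invented for the example, and the live values are read from the standard /proc/sys/kernel/ files.

```python
import re
from pathlib import Path


def parse_sysctl_conf(text):
    """Return {key: int} for kernel.msg* entries in sysctl.conf-style text."""
    values = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        match = re.match(r"^(kernel\.msg\w+)\s*=\s*(\d+)$", line)
        if match:
            values[match.group(1)] = int(match.group(2))
    return values


def read_live(key, proc_root="/proc/sys"):
    """Read the running kernel's value, e.g. kernel.msgmax -> kernel/msgmax."""
    path = Path(proc_root) / key.replace(".", "/")
    return int(path.read_text().strip())


def find_mismatches(configured, live):
    """Return {key: (configured, live)} for keys whose values differ."""
    return {
        key: (configured[key], live[key])
        for key in configured
        if key in live and configured[key] != live[key]
    }


if __name__ == "__main__":
    try:
        configured = parse_sysctl_conf(Path("/etc/sysctl.conf").read_text())
        live = {key: read_live(key) for key in configured}
    except (OSError, ValueError) as exc:
        print(f"could not read settings: {exc}")
    else:
        for key, (conf, running) in find_mismatches(configured, live).items():
            print(f"{key}: sysctl.conf={conf} running={running}")
```

No output means the running kernel matches the file for every kernel.msg* key that the file sets.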

This week failover happened twice, at around the following times:

14 March 7:48AM IST
14 March 17:46 EST
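
Since timezones have already caused confusion in this thread, it may help to put both times on one clock before searching the logs. The sketch below takes the two labels at face value and uses fixed offsets: IST as UTC+5:30 and EST as UTC-5. The year 2018 is an assumption taken from the dates on this thread. Note that US Eastern time was already on daylight time on 14 March 2018, so if "EST" here means local Eastern time, the real offset would be UTC-4.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets, taken at face value from the labels in the post (assumption).
IST = timezone(timedelta(hours=5, minutes=30), "IST")
EST = timezone(timedelta(hours=-5), "EST")


def to_utc(stamp, tz, year=2018):
    """Parse 'DD Month HH:MM' as local time in `tz` and convert it to UTC."""
    local = datetime.strptime(f"{year} {stamp}", "%Y %d %B %H:%M")
    return local.replace(tzinfo=tz).astimezone(timezone.utc)


if __name__ == "__main__":
    for stamp, tz in (("14 March 07:48", IST), ("14 March 17:46", EST)):
        print(stamp, tz.tzname(None), "->", to_utc(stamp, tz).isoformat())
```

With these offsets the two failovers come out roughly 20 hours apart in UTC, which is worth keeping in mind when matching them against server logs kept in a single timezone.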

I am sending logs over PM. Please check whether you can find anything new and advise us.
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: Nagios Failover Issue

Post by scottwilkerson »

Have we been able to determine if it is failing over because the processes don't exist or because of the nagios.log not having data?
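
To illustrate what that could look like, here is a minimal sketch of a check that reports which condition it hit. It is not the actual DR script, which has not been posted in this thread. The process names, the log path, the 300-second threshold, and the reading of "not having data" as "file not modified recently" are all assumptions made for the example.

```python
import os
import time


def is_running(name):
    """Return True if any /proc/<pid>/comm equals `name` (Linux only)."""
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/comm") as handle:
                if handle.read().strip() == name:
                    return True
        except OSError:
            continue  # the process exited while we were scanning
    return False


def failover_reasons(process_names, log_path, max_age_seconds,
                     running=is_running, now=None):
    """Return a list of human-readable reasons; an empty list means healthy."""
    reasons = []
    for name in process_names:
        if not running(name):
            reasons.append(f"process not running: {name}")
    current = time.time() if now is None else now
    try:
        age = current - os.path.getmtime(log_path)
    except OSError as exc:
        reasons.append(f"log unreadable: {log_path} ({exc.strerror})")
    else:
        if age > max_age_seconds:
            reasons.append(f"log stale: {log_path} unchanged for {int(age)}s")
    return reasons


if __name__ == "__main__":
    # Hypothetical values; adjust to whatever the real check looks at.
    try:
        found = failover_reasons(
            ["nagios", "ndo2db"], "/usr/local/nagios/var/nagios.log", 300
        )
    except OSError as exc:
        print(f"UNKNOWN - could not scan /proc: {exc}")
    else:
        if found:
            print("FAILOVER - " + "; ".join(found))
        else:
            print("OK - no failover condition met")
```

Writing that one line to a log on the DR server each time the check runs would answer the question directly the next time a failover happens.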

Also, we have not received the new logs. However, I am going on vacation, so please send them to @npolovenko.
rtsupport
Posts: 188
Joined: Mon Jan 08, 2018 4:30 am

Re: Nagios Failover Issue

Post by rtsupport »

I am sending the logs by PM to you and npolovenko. Let me know if you have any issue accessing them. By "processes don't exist", which process do you mean?
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: Nagios Failover Issue

Post by tgriep »

I think what @scottwilkerson was asking is whether your system fails over if a process stops running.
Around that time, I see that the ndo2db process was having issues with the Kernel Message Queue, but I see no indication as to why.
It could be a MySQL issue, so can you post the mysqld.log file?
How many hosts and services is the server monitoring?
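
If it helps while gathering that, one way to watch the kernel message queues directly is to read /proc/sysvipc/msg on the master and see how full each queue is. The sketch below is an illustration only: it assumes the usual column names in that file (`msqid`, `cbytes`, `qnum`), assumes each queue's byte limit equals kernel.msgmnb (the default when a queue is created), and uses made-up sample rows.

```python
# Made-up sample in the shape of /proc/sysvipc/msg (values are hypothetical).
SAMPLE = """\
key msqid perms cbytes qnum lspid lrpid uid gid cuid cgid stime rtime ctime
1 0 600 250 5 100 101 0 0 0 0 0 0 0
2 1 600 750 9 100 101 0 0 0 0 0 0 0
"""


def parse_sysvipc_msg(text):
    """Parse /proc/sysvipc/msg-style text into dicts keyed by column name."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if len(rows) < 2:
        return []
    header = rows[0]
    return [dict(zip(header, row)) for row in rows[1:]]


def queue_fill(queues, msgmnb):
    """Return (msqid, queued messages, fraction of msgmnb used), fullest first."""
    report = [
        (int(q["msqid"]), int(q["qnum"]), int(q["cbytes"]) / msgmnb)
        for q in queues
    ]
    return sorted(report, key=lambda item: item[2], reverse=True)


if __name__ == "__main__":
    try:
        with open("/proc/sysvipc/msg") as handle:
            queues = parse_sysvipc_msg(handle.read())
        with open("/proc/sys/kernel/msgmnb") as handle:
            limit = int(handle.read().strip())
    except (OSError, ValueError) as exc:
        print(f"could not read queue state: {exc}")
    else:
        for msqid, qnum, fraction in queue_fill(queues, limit):
            print(f"msqid={msqid} messages={qnum} used={fraction:.1%}")
```

Running this every minute or so and keeping the output would show whether a queue fills up gradually before the ndo2db errors start or jumps suddenly.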
Be sure to check out our Knowledgebase for helpful articles and solutions!
rtsupport
Posts: 188
Joined: Mon Jan 08, 2018 4:30 am

Re: Nagios Failover Issue

Post by rtsupport »

I have not yet been able to alter the DR script to log the failover condition.
I will PM you a link to mysqld.log. There are around 800 hosts and 6000 services being monitored.
Last edited by tgriep on Tue Mar 20, 2018 8:05 am, edited 1 time in total.
Reason: Downloaded file and shared with other techs.
Locked