Fusion stability issues.

This support forum board is for questions relating to Nagios Fusion.
martinnick
Posts: 11
Joined: Tue Apr 10, 2018 1:28 pm

Fusion stability issues.

Post by martinnick »

It seems that every weekend our Fusion instance becomes unstable and crashes. I would love to get help diagnosing and resolving this problem.
Currently I am getting the message below in the Fusion logs, but I don't think it would cause the application to crash entirely. I thought it would be best to get help before I started poking around myself.
poll_server() CHECK YOUR LIVE_DATA_TIMEOUT SETTINGS. IT MAY NEED INCREASED
cnorell
Developer
Posts: 65
Joined: Mon Nov 27, 2017 3:08 pm

Re: Fusion stability issues.

Post by cnorell »

Hey martinnick,

There are two likely causes for this error:

1. The timeout setting is limiting Fusion's ability to poll your fused servers. To potentially fix this issue, navigate to:

Admin > System Settings > Data & Polling > Live Data Timeout

For testing purposes, I would bump it up to something around 90-120 seconds. If that fixes the issue, you can drop it incrementally until you reach the lowest stable value.

2. The other likely cause is incorrect settings for your fused servers. Please go through your fused servers and verify that each one has the correct IP address, in the proper format, and the correct credentials.

The server URL format should look something like:

http://192.168.1.1/nagiosxi

Let me know if any of the above helps out.
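
The URL format check described above can be sketched in a few lines of Python. This is only an illustration, not part of Fusion: the function name `check_fused_server_url` is hypothetical, and the expected `/nagiosxi` path is an assumption taken from the example URL, so it may not apply to every type of fused server.

```python
from urllib.parse import urlparse


def check_fused_server_url(url):
    """Return a list of problems found in a fused-server URL (empty if none).

    Hypothetical helper: the /nagiosxi path is assumed from the example
    URL in this thread and may differ for other server types.
    """
    problems = []
    parsed = urlparse(url.strip())
    if parsed.scheme not in ("http", "https"):
        problems.append("URL should start with http:// or https://")
    if not parsed.hostname:
        problems.append("URL has no host or IP address")
    if parsed.path.rstrip("/") != "/nagiosxi":
        problems.append("path is not /nagiosxi")
    return problems
```

An empty list means the URL matches the expected shape. It says nothing about whether the address is reachable or the credentials are right, so those still need to be checked by hand.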

Best Regards,

Cory Norell
martinnick
Posts: 11
Joined: Tue Apr 10, 2018 1:28 pm

Re: Fusion stability issues.

Post by martinnick »

Cory,

It appears that I failed to give you the history of other issues we have had here in the past. I will post them below to give better context for our ongoing challenges.

https://support.nagios.com/forum/viewto ... 17&t=49945

Also, ticket #100103 illustrates issues with Fusion itself. In addition, I have learned that we already tried a few changes to Fusion and its database. Those changes are outlined below.

The Live Data Timeout was set to 600 as a result of the ticket. I moved it back to the default of 45. The problem could also be related to our changing Simultaneous Pollers from 1 to 6 and the Polling Subsystem Memory Limit from 0 to 1 GB. Lastly, we raised the MariaDB cache (/etc/my.cnf) from 1 GB to 6 GB. I tried reverting it to 1 GB and still saw the issue. I also tried 2 GB and 4 GB without any change to the problem.
martinnick
Posts: 11
Joined: Tue Apr 10, 2018 1:28 pm

Re: Fusion stability issues.

Post by martinnick »

Oddly enough, it appears that my last reply hasn't posted yet. I will repost it if it doesn't appear before this one does.
I just wanted to add that we are seeing a large number of instances of the process listed below running and not terminating. I am wondering whether it could also be impacting stability and/or performance.
/usr/local/nagiosfusion/cron/dbmaint_subsys.php
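
One way to put a number on "a large number of these processes" is to parse `ps` output and list the instances that have been running for a long time. The sketch below is an illustration only: the function names and the 600-second threshold are hypothetical, and it assumes a Linux `ps` from procps-ng, which provides the `etimes` (elapsed seconds) column.

```python
import subprocess

SCRIPT = "/usr/local/nagiosfusion/cron/dbmaint_subsys.php"


def long_running(ps_output, needle=SCRIPT, min_age_seconds=600):
    """Parse `ps -eo pid=,etimes=,args=` output.

    Returns (pid, age_in_seconds) pairs for processes whose command line
    contains `needle` and whose age is at least `min_age_seconds`.
    The 600-second default is an arbitrary, hypothetical threshold.
    """
    matches = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)  # pid, elapsed seconds, full command line
        if len(parts) < 3:
            continue
        pid, age, args = parts
        if not (pid.isdigit() and age.isdigit()):
            continue
        if needle in args and int(age) >= min_age_seconds:
            matches.append((int(pid), int(age)))
    return matches


def snapshot():
    """Run ps once and return its output (assumes procps-ng on Linux)."""
    result = subprocess.run(
        ["ps", "-eo", "pid=,etimes=,args="],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Calling `long_running(snapshot())` on the Fusion server would return the matching PIDs and their ages. Running it a few times over an hour would show whether the count keeps growing.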
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: Fusion stability issues.

Post by ssax »

I've reached out to the developer who worked with you on the remote session to see if he has any ideas given the changes that have been implemented. I will update you when I hear back.
martinnick
Posts: 11
Joined: Tue Apr 10, 2018 1:28 pm

Re: Fusion stability issues.

Post by martinnick »

Unfortunately, this is seriously impacting our conversion to Nagios as an enterprise monitoring solution. We are well underway with this conversion, which has a very strict time constraint. Quicker responses toward a resolution would be greatly appreciated. Thank you for any help you can provide. If it would be easier to work through these problems on another conference call, we are available at any time.
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: Fusion stability issues.

Post by ssax »

I talked with a developer and the other technicians, and we are wondering whether you have any backups (NetBackup or another backup product, or even a VM backup solution), vMotion, or other I/O-intensive external processes running during the weekends that may be impacting it. We can't really think of anything internal to Fusion that would cause it to fail on the weekends specifically. Does it fail at the same time every weekend? Do you see anything in /var/log/messages or any of your other logs during that time that may indicate anything further?
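
To check the logs around a failure, one option is to pull only the lines stamped close to the time of the crash. The sketch below is illustrative: the function name and the 15-minute window are hypothetical, and it assumes the traditional syslog prefix (`Mon DD HH:MM:SS`), which carries no year, so the year is taken from the crash time supplied by the caller.

```python
from datetime import datetime, timedelta


def lines_near(lines, crash_time, window_minutes=15):
    """Yield log lines stamped within `window_minutes` of `crash_time`.

    Assumes each line starts with a 15-character 'Mon DD HH:MM:SS'
    timestamp. Lines that do not parse are skipped.
    """
    window = timedelta(minutes=window_minutes)
    for line in lines:
        try:
            # The year is prepended because the syslog prefix omits it.
            stamp = datetime.strptime(
                f"{crash_time.year} {line[:15]}", "%Y %b %d %H:%M:%S"
            )
        except ValueError:
            continue
        if abs(stamp - crash_time) <= window:
            yield line.rstrip("\n")
```

It could be used by opening /var/log/messages and passing the file object as `lines`, together with a `datetime` for the moment the instance was noticed to be down.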

When you say that it becomes volatile and crashes, what specifically are you seeing that indicates this? And what specifically do you do at that point to get it working again?

Thank you
martinnick
Posts: 11
Joined: Tue Apr 10, 2018 1:28 pm

Re: Fusion stability issues.

Post by martinnick »

So I have been informed that this happens more often than just over the weekends. The lockups have also happened during the day throughout the week. There doesn't seem to be anything in /var/log/messages that would indicate a problem. This is a virtual instance; however, vMotion is not currently active. It also doesn't appear to happen at any specific time. A backup is performed on all VMs nightly, but there doesn't seem to be any correlation between the backups and when it crashes. What I can tell you is that the database's CPU usage spikes before it dies.
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: Fusion stability issues.

Post by ssax »

Hmm, how many of these were you seeing? Now I'm wondering if it's related.

/usr/local/nagiosfusion/cron/dbmaint_subsys.php

Can you create a ticket with a link back to this thread so that I can get a remote scheduled to dig into this?

https://support.nagios.com/tickets
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: Fusion stability issues.

Post by ssax »

Please also include this file in the ticket:

/usr/local/nagiosfusion/var/log/dbmaint_subsys.log