Limit the number of poll request from poll_subsys.php

This support forum board is for questions relating to Nagios Fusion.
chicjo01
Posts: 194
Joined: Tue Jul 28, 2015 2:52 pm

Limit the number of poll request from poll_subsys.php

Post by chicjo01 »

Nagios Fusion: 4.1.2

The server has 32 cores and 65 GB of memory.
I have 6 Nagios XI servers fused.

I have added about 60 users so far, and my server's memory is getting used up by the poll_subsys.php processes for each user. When I checked the process list, it had over 300 poll_subsys.php processes running. Is there a way to only allow 50 or 60 poll_subsys.php processes to run at a time? It looks like 4 to 6 requests are being made per user: 60 x 6 = 360 processes that are polling and using up resources. If I understand the "Simultaneous Pollers" setting, it allows multiple poll_subsys.php processes to run at the same time. Would this mean 5 out of 360 are getting polled and the rest have to wait until a spot opens up?

Does the poll_subsys.php script have enough validation checking to not kick off additional polls for a user waiting to be processed? Could there be a queue system like mod_gearman that would help ensure the server resources are not overused?

Code:

while true; do ps -ef | grep poll_subsys.php | wc -l; sleep 3; done
430
438
439
436
436
446
443
443
442
442
442
441
437
436
436
434
434
434
434
434
434
434
434
434
436
434
434
434
434
434
434
434
434
434
434
434
434
434
434
434
434
434
434
436
434
434
434
434
434
434
434
434
434
434
434
434
434
434
434
434
434
434
434
436
436
434
434
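A note on the counts above: `ps -ef | grep poll_subsys.php` usually matches the grep command itself, so each figure is likely one higher than the real number of poller processes. A minimal sketch of a counter that avoids this with the bracket trick (the function name `count_pollers` is just an illustration):

```shell
# Count poll_subsys.php processes from `ps -ef` style lines on stdin.
# The pattern [p]oll_subsys\.php matches "poll_subsys.php" but not the
# grep command line itself, which contains the literal text "[p]oll_...".
count_pollers() {
  grep -c '[p]oll_subsys\.php'
}

# Usage, mirroring the loop above:
# while true; do ps -ef | count_pollers; sleep 3; done
```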
Capture.PNG
Last edited by tmcdonald on Wed Mar 21, 2018 9:20 am, edited 1 time in total.
Reason: Please use [code][/code] tags around source code
bheden
Product Development Manager
Posts: 179
Joined: Thu Feb 13, 2014 9:50 am
Location: Nagios Enterprises

Re: Limit the number of poll request from poll_subsys.php

Post by bheden »

I was the developer that worked on the original rewrite of Fusion for 4.0, so I'll take a stab at what's going on here. In order:
> I have added about 60 users so far, and my server's memory is getting used up by the poll_subsys.php processes for each user.
That's a lot of users on a Fusion system. I think this is the largest system in terms of users I've ever heard about in the wild.
> When I checked the process list, it had over 300 poll_subsys.php processes running. Is there a way to only allow 50 or 60 poll_subsys.php processes to run at a time?
I would think most of them are spawning up and jumping to the next available server/user combo to process. There is currently no way to limit them, but that is a good idea.
> It looks like 4 to 6 requests are being made per user: 60 x 6 = 360 processes that are polling and using up resources.
Are you utilizing the mapped user functionality? If you aren't, it should only be 1 request per user; otherwise that sounds accurate.
> If I understand the "Simultaneous Pollers" setting, it allows multiple poll_subsys.php processes to run at the same time. Would this mean 5 out of 360 are getting polled and the rest have to wait until a spot opens up?
The poll_subsys is both the manager and the worker. The manager is the cron job, and it spins off workers as necessary. With that many servers and users, I can see why it thinks continuing to spawn workers is necessary.
> Does the poll_subsys.php script have enough validation checking to not kick off additional polls for a user waiting to be processed?
Yes. There are locks per server/user combo.
> Could there be a queue system like mod_gearman that would help ensure the server resources are not overused?
This is a very good idea.
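
To illustrate what a cap on concurrent workers could look like in general, here is a minimal shell sketch using `xargs -P`. This is not how poll_subsys.php works (as noted above, Fusion has no such limit at this point); the `server:user` pairs and the `run_bounded` name are hypothetical, and the `echo` stands in for a real poll.

```shell
# Run one job per input line, with at most $1 jobs in flight at once.
# Each input line stands for a hypothetical server:user pair to poll.
run_bounded() {
  xargs -n 1 -P "$1" sh -c 'echo "polled $0"'
}

# Four pairs, never more than two jobs running at the same time.
printf '%s\n' xi1:alice xi1:bob xi2:alice xi2:bob | run_bounded 2
```

The jobs finish in no particular order, so the output lines may be interleaved differently from run to run.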

My questions to you:

What are your global polling intervals? Do you have any per-server polling intervals set? If so, what are they?

Same questions, except replace 'polling interval' with authentication interval.

What are your Polling Configuration options set to? Polling Subsystem Memory Limit, Simultaneous Pollers, Live Data Timeout, Polling Lock Max Age (the server/user lock I mentioned above: how long until each one goes stale in the case of a borked process), and Poll Record Count.

Are you using mostly Fusekey authentication or Username/Password? Rough percentages of each if you could, please.

In the meantime, I'll add some of your ideas to some internal lists we have so we can review them and decide what to do. Thank you very much for your suggestions.
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.

Nagios Enterprises
Senior Developer

Re: Limit the number of poll request from poll_subsys.php

Post by chicjo01 »

We currently only have 6 Nagios XI servers fused with the fuse key. We are planning to add 8 Nagios Core servers later.

We are looking to add 145 users currently, with the expectation of 180 users by the end of the year. Each user is set up with a mapping to their login on each of the Nagios XI servers. The reason is to avoid the session conflict issue from the previous version of Nagios Fusion.
In case you are not aware of this issue: the Nagios Fusion servers use a specific username and password to gather the information. So when a user logs into Nagios Fusion as themselves and then switches over to a specific Nagios XI server, the user that gets logged into the Nagios XI server is not them, but the user assigned to gather the information. Even if the user logs out of Nagios XI and back in, it would more often than not cache the login session of the user that gathers the information, and would swap between the two sessions randomly.
Below is the requested information about the polling intervals.
Capture.PNG

Re: Limit the number of poll request from poll_subsys.php

Post by bheden »

Okay. Well feedback like this is extremely helpful for us to continue to improve the product - so I really appreciate it.

For your polling configuration, here's an ideal configuration in my mind:

Polling subsystem memory limit: 512M is probably good, but the unfortunate thing is that PHP is not very efficient with associative arrays - which is the reason this needs to be anywhere near this high to begin with. You're kind of restricted based on the expected output of your largest server. This option needs to be set high enough to handle that. The good news is that other ones don't necessarily use this much, it's just what is available to them.

Increase your simultaneous pollers to 20 and see how it performs. Scale it back by 2 or 3 at a time until you find an optimal setting. If you find it's performing better at 20, increase it until it stops doing that.

Live Data timeout - 20 is probably fine. These settings were all set very high as defaults just for general coverage. If a chunk of data is taking 20 seconds to come back, you've probably got other problems. It sounds like you've got your house in good order, so I would drop this to 2 or 3 seconds. You can test how long it actually takes by looking through the poll_subsys log with debugging enabled to watch the individual URLs that are called. Then you can use curl to test these. I suspect each chunk takes less than a second to respond.
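
One way to time a single request with curl is sketched below. The helper name `time_url` is made up for illustration, and the URL has to be copied from the debug log; none is assumed here. `-w '%{time_total}'` prints the total transfer time in seconds, and `--max-time` lets you try out a candidate timeout value.

```shell
# Print the total time (in seconds) curl needs to fetch one URL.
# $1: URL to test (copy one from the poll_subsys debug log)
# $2: optional timeout in seconds (defaults to 20, the value discussed above)
time_url() {
  curl -s -o /dev/null --max-time "${2:-20}" -w '%{time_total}\n' "$1"
}

# Example: try a 3 second timeout against a URL taken from the log.
# time_url "$URL_FROM_LOG" 3
```

If the XI servers use self-signed certificates, curl may need extra TLS options that are not shown here.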

Polling Lock max age - In your case I think this number seems fine, but it is definitely worth playing with. After making the other adjustments listed here first, see if dropping this makes a difference - I doubt it will, but you won't know until you try.

Poll Record count - This was added because grabbing entire swaths of "host detail" or "service detail" on large systems took far too long. If your XI servers have fast databases/disks, then by all means crank this up. For example, if you have an XI server with 6001 hosts and 6 mapped users to that server, it will take 7 calls per user (6001 / 1000 gives 6 full chunks, plus one more call for the last host) * 7 users (6 mapped users + general) = 49 calls in total. The caveat here is the memory limit - remember that PHP isn't very efficient with arrays, and the larger this number, the larger the array will be. But it sounds like your Fusion server has plenty of resources, so definitely tweak this.
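
The call count in that example can be worked out with a small helper. This is only a sketch of the arithmetic described above (one call per chunk of records, per user), not code from Fusion; `poll_calls` is a made-up name. With 6001 hosts, chunks of 1000, and 6 mapped users plus the general user, it comes to 7 x 7 = 49 calls.

```shell
# Estimate the number of calls needed to poll one server.
# $1: number of hosts, $2: poll record count, $3: number of mapped users
poll_calls() {
  chunks=$(( ($1 + $2 - 1) / $2 ))   # ceiling division: a partial chunk needs its own call
  users=$(( $3 + 1 ))                # mapped users plus the general user
  echo $(( chunks * users ))
}

poll_calls 6001 1000 6   # 7 chunks * 7 users = 49
poll_calls 6001 2000 6   # 4 chunks * 7 users = 28
```

Doubling the record count here drops the total from 49 to 28 calls, at the cost of larger arrays per call.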

And then the polling interval. Do you need the data every 5 minutes? Is a saner default every 10 minutes, and then some servers set to 5? I can't answer these questions, only you can.

I hope this helps. Please report back if what I've explained here has helped. This is an exciting infrastructure for me, actually, and I look forward to reading your findings very much :)

Re: Limit the number of poll request from poll_subsys.php

Post by chicjo01 »

I took your suggestions and I will continue to make adjustments to get them working. Something I have noticed while testing these adjustments is that it is taking a very long time for one complete poll cycle to finish.

I executed the date command before and after a single run of the poll_subsys.php script. It took only about 3 minutes to kick off all of the poll workers.
date; /usr/bin/php -q /usr/local/nagiosfusion/cron/poll_subsys.php --max-time=60 --master-poll >>/usr/local/nagiosfusion/var/log/poll_subsys.log; date
Thu Mar 22 08:35:24 EDT 2018
Thu Mar 22 08:38:20 EDT 2018
It took around 3 hours for all of the poll workers to complete from one poll_subsys.php request. Is this normal? Is there a way to speed this process up?
date; ps -ef | grep poll_subsys.php | wc -l
Thu Mar 22 10:05:49 EDT 2018
123
date; ps -ef | grep poll_subsys.php | wc -l
Thu Mar 22 10:13:34 EDT 2018
62
date; ps -ef | grep poll_subsys.php | wc -l
Thu Mar 22 10:31:11 EDT 2018
62
date; ps -ef | grep poll_subsys.php | wc -l
Thu Mar 22 11:24:37 EDT 2018
1
Capture.PNG

Re: Limit the number of poll request from poll_subsys.php

Post by bheden »

Is this normal?
Absolutely not.

Can you re-run the polling subsystem with the --debug flag and PM me the output? I may have to give you some additional commands as well to run.

Re: Limit the number of poll request from poll_subsys.php

Post by chicjo01 »

I ran the requested command and created a support ticket (137034) with the output. The file was 27 MB.
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: Limit the number of poll request from poll_subsys.php

Post by tmcdonald »

Going to lock this up and we will continue in the thread.
Former Nagios employee