 xen/common/sched/cpupool.c |   2 -
 xen/common/sched/credit2.c | 130 +++++++++++++++++++++++++++++++++++++++-----
 xen/common/sched/private.h |   2 +
 3 files changed, 119 insertions(+), 15 deletions(-)
In Credit2, the CPUs are assigned to runqueues according to the host topology. For instance, if we want per-socket runqueues (which is the default), the CPUs that are in the same socket will end up in the same runqueue.

This is generally good for scalability, at least as long as the number of CPUs that end up in the same runqueue is not too high. In fact, all these CPUs compete for the same spinlock when making scheduling decisions and manipulating the scheduler data structures. Therefore, if there are too many of them, that lock can become a bottleneck. This has not been an issue so far, but architectures with 128 CPUs per socket are now available, and it is certainly far from ideal to have so many CPUs in the same runqueue, competing for the same locks, etc.

Let's therefore set a cap on the total number of CPUs that can share a runqueue. The value is 16 by default, but it can be changed with a boot command line parameter.

Note that this is orthogonal to, and independent from, the activity of introducing new runqueue arrangement mechanisms and criteria. In fact, we may very well want to implement the capability of having, say, per-LLC runqueues, or per-die (where that is defined) runqueues. This would work well on current systems, but nothing guarantees that, in the future, there won't be an architecture with hundreds of CPUs per die or LLC... and we'd be back to where we are now. Therefore, even if we are actually planning to add some new runqueue organization criteria, fixing a limit on the number of CPUs that will ever share a runqueue, whatever the underlying topology is, is still useful.

Note also that, if the host has hyperthreading (and we are not using core scheduling), additional care is taken to avoid splitting CPUs that are hyperthread siblings among different runqueues.
---
Dario Faggioli (2):
      xen: credit2: factor cpu to runqueue matching in a function
      xen: credit2: limit the max number of CPUs in a runqueue

-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)
So,

I felt like providing some additional thoughts about this series, from a release point of view (adding Paul).

Timing is *beyond tight*, so if this series, entirely or partly, has any chance to go in, it would be through some form of exception, which of course comes with some risks, etc.

I did work hard to submit the full series, because I wanted people to be able to see the complete solution. However, I think the series itself can be logically split in two parts.

Basically, if we just consider patches 1 and 4, we will end up, right after boot, with a system that has smaller runqueues. They will most likely be balanced in terms of how many CPUs each one has, so a good setup. This will likely be an improvement for very large systems (the actual difference seems to depend *quite a bit* on the actual workload). This is a path that will get a decent amount of testing in OSSTest from now until the day of the release, I think, because booting with the default CPU configuration and setup is what most (all?) OSSTest jobs do.

If the user starts to create pools, we can get to a point where the different runqueues are unbalanced, i.e., each one has a different number of CPUs in it, wrt the others. This, however:

* can happen already, as of today, without these patches. Whether these patches make things "better" or "worse", from this point of view, is impossible to tell, because it actually depends on which CPUs the user moves among pools or puts offline, etc.

* means that the scheduler has to deal with unbalanced runqueues anyway; if it doesn't, that is a bug and, again, it seems to me that these patches don't make things particularly better or worse.

So, again, for patches 1 and 4 alone, it looks to me that we'd get improvements, at least in some cases, the codepath will get some testing, and they do not introduce additional or different issues beyond what we already have right now.
They are also at their second iteration, as the original patch series was comprised of exactly those two patches. So, I think it would be interesting if these two patches were given a chance, even just of getting some reviews... And I would be fine going through the formal process necessary to make that happen myself.

Then there's the rest, the runqueue rebalancing thing. As said above, I personally would love it if we'd release with it, but I see one rather big issue. In fact, such a mechanism is triggered and stressed by the dynamic creation and manipulation of cpupools (and CPU on/off-lining), and we don't have an OSSTest test for that. Therefore, we are not in the best position for catching issues it may have introduced. I can commit to doing some testing myself, but that's not the same thing as having them in our CI, I know that very well.

So, I'd be interested in hearing what others think about these other patches as well, and I am happy to do my best to make sure that they are working fine, if we decide to try to include them too, but I do see this as much more of a risk myself.

So, any thoughts? :-)

Thanks and Regards
On Fri, 2020-05-29 at 13:46 +0200, Dario Faggioli wrote:
> Basically, if we just consider patches 1 and 4 we will end up, right
> after boot, with a system that has smaller runqueues.
>
Actually, to be fully precise, given how I reorganized the series, it's not patches 1 and 4, it's patches 1, 3 and 4. Hopefully that is not a big deal, but if patch 3 really is a problem, I can re-arrange patch 4 to work without it.

Apart from this, and to add more information, on a system with 96 CPUs in 2 sockets, this is how the runqueues look with these patches:

(XEN) Online Cpus: 0-95
(XEN) Cpupool 0:
(XEN)   Cpus: 0-95
(XEN)   Scheduling granularity: cpu, 1 CPU per sched-resource
(XEN)   Scheduler: SMP Credit Scheduler rev2 (credit2)
(XEN) Active queues: 6
(XEN)   default-weight     = 256
(XEN) Runqueue 0:
(XEN)   ncpus              = 16
(XEN)   cpus               = 0-15
(XEN)   max_weight         = 256
(XEN)   pick_bias          = 0
(XEN)   instload           = 0
(XEN)   aveload            = 1223 (~0%)
(XEN)   idlers: 00000000,00000000,00000000,00000000,00000000,00000000,0000ffff
(XEN)   tickled: 00000000,00000000,00000000,00000000,00000000,00000000,00000000
(XEN)   fully idle cores: 00000000,00000000,00000000,00000000,00000000,00000000,0000ffff
(XEN) Runqueue 1:
(XEN)   ncpus              = 16
(XEN)   cpus               = 16-31
(XEN)   max_weight         = 256
(XEN)   pick_bias          = 16
(XEN)   instload           = 0
(XEN)   aveload            = 3324 (~1%)
(XEN)   idlers: 00000000,00000000,00000000,00000000,00000000,00000000,ffff0000
(XEN)   tickled: 00000000,00000000,00000000,00000000,00000000,00000000,00000000
(XEN)   fully idle cores: 00000000,00000000,00000000,00000000,00000000,00000000,ffff0000
(XEN) Runqueue 2:
(XEN)   ncpus              = 16
(XEN)   cpus               = 32-47
(XEN)   max_weight         = 256
(XEN)   pick_bias          = 32
(XEN)   instload           = 1
(XEN)   aveload            = 8996 (~3%)
(XEN)   idlers: 00000000,00000000,00000000,00000000,00000000,0000feff,00000000
(XEN)   tickled: 00000000,00000000,00000000,00000000,00000000,00000000,00000000
(XEN)   fully idle cores: 00000000,00000000,00000000,00000000,00000000,0000fcff,00000000
(XEN) Runqueue 3:
(XEN)   ncpus              = 16
(XEN)   cpus               = 48-63
(XEN)   max_weight         = 256
(XEN)   pick_bias          = 48
(XEN)   instload           = 0
(XEN)   aveload            = 2424 (~0%)
(XEN)   idlers: 00000000,00000000,00000000,00000000,00000000,ffff0000,00000000
(XEN)   tickled: 00000000,00000000,00000000,00000000,00000000,00000000,00000000
(XEN)   fully idle cores: 00000000,00000000,00000000,00000000,00000000,ffff0000,00000000
(XEN) Runqueue 4:
(XEN)   ncpus              = 16
(XEN)   cpus               = 64-79
(XEN)   max_weight         = 256
(XEN)   pick_bias          = 66
(XEN)   instload           = 0
(XEN)   aveload            = 1070 (~0%)
(XEN)   idlers: 00000000,00000000,00000000,00000000,0000ffff,00000000,00000000
(XEN)   tickled: 00000000,00000000,00000000,00000000,00000000,00000000,00000000
(XEN)   fully idle cores: 00000000,00000000,00000000,00000000,0000ffff,00000000,00000000
(XEN) Runqueue 5:
(XEN)   ncpus              = 16
(XEN)   cpus               = 80-95
(XEN)   max_weight         = 256
(XEN)   pick_bias          = 82
(XEN)   instload           = 0
(XEN)   aveload            = 425 (~0%)
(XEN)   idlers: 00000000,00000000,00000000,00000000,ffff0000,00000000,00000000
(XEN)   tickled: 00000000,00000000,00000000,00000000,00000000,00000000,00000000
(XEN)   fully idle cores: 00000000,00000000,00000000,00000000,ffff0000,00000000,00000000

Without the patches, there would be just 2 of them (one with CPUs 0-47 and another with CPUs 48-95).
On a system with "just" 16 CPUs, in 2 sockets, they look like this:

(XEN) Online Cpus: 0-15
(XEN) Cpupool 0:
(XEN)   Cpus: 0-15
(XEN)   Scheduling granularity: cpu, 1 CPU per sched-resource
(XEN)   Scheduler: SMP Credit Scheduler rev2 (credit2)
(XEN) Active queues: 2
(XEN)   default-weight     = 256
(XEN) Runqueue 0:
(XEN)   ncpus              = 8
(XEN)   cpus               = 0-7
(XEN)   max_weight         = 256
(XEN)   pick_bias          = 0
(XEN)   instload           = 0
(XEN)   aveload            = 7077 (~2%)
(XEN)   idlers: 00000000,000000ff
(XEN)   tickled: 00000000,00000000
(XEN)   fully idle cores: 00000000,000000ff
(XEN) Runqueue 1:
(XEN)   ncpus              = 8
(XEN)   cpus               = 8-15
(XEN)   max_weight         = 256
(XEN)   pick_bias          = 8
(XEN)   instload           = 1
(XEN)   aveload            = 11848 (~4%)
(XEN)   idlers: 00000000,0000fe00
(XEN)   tickled: 00000000,00000000
(XEN)   fully idle cores: 00000000,0000fc00

There are still 2, because there are 2 sockets, and we still honor the topology (and 8 CPUs in a runqueue is fine, because it is lower than 16).

I'll share the same output on a 256 CPU system, as soon as I finish installing it.

Regards
> On May 29, 2020, at 12:46 PM, Dario Faggioli <dfaggioli@suse.com> wrote:
> 
> So,
> 
> I felt like providing some additional thoughts about this series, from
> a release point of view (adding Paul).
> 
> Timing is *beyond tight* so if this series, entirely or partly, has any
> chance to go in, it would be through some form of exception, which of
> course comes with some risks, etc.
> 
> I did work hard to submit the full series, because I wanted people to
> be able to see the complete solution. However, I think the series
> itself can be logically split in two parts.
> 
> Basically, if we just consider patches 1 and 4 we will end up, right
> after boot, with a system that has smaller runqueues. They will most
> likely be balanced in terms of how many CPUs each one has, so a good
> setup. This will likely (actual differences seems to depend *quite a
> bit* on the actual workload) be an improvement for very large systems.

Fundamentally, I feel like the reason we have the feature freeze is exactly to avoid questions like this.

Something very much like patch 4 was posted before the last posting date; patches 1-4 received R-b's before the feature freeze. I think they should probably go in.

The rebalancing patches I'm inclined to say should wait until they've had a bit more time to be thought about.

 -George