Detailed problem statement and some of the implementation choices were
discussed earlier[1].

[1]: https://lore.kernel.org/all/20250910174210.1969750-1-sshegde@linux.ibm.com/

This is likely the version which would be used for the LPC2025 discussion on
this topic. Please provide your suggestions; hoping for a solution that works
for different architectures and their use cases.

All the existing alternatives, such as CPU hotplug or creating isolated
partitions, break the user's affinity. Since the number of CPUs to use changes
depending on the steal time, it is not driven by the user, hence it would be
wrong to break the affinity. With this series, if a task is pinned only to
paravirt CPUs, it will continue running there.

Changes compared to v3[1]:
- Introduced computation of steal time in powerpc code.
- Derive the number of CPUs to use and mark the remaining as paravirt based
  on steal values.
- Provide debugfs knobs to alter how steal time values are used.
- Removed static key check for paravirt CPUs (Yury)
- Removed preempt_disable/enable while calling stopper (Prateek)
- Made select_idle_sibling and friends aware of paravirt CPUs.
- Removed 3 unused schedstat fields and introduced 2 related to paravirt
  handling.
- Handled the nohz_full case by enabling the tick when there is CFS/RT on
  such a CPU.
- Updated the helper patch to override arch behaviour for easier debugging
  during development.
- Kept the method of pushing only the current task out instead of moving all
  tasks on the rq, given the complexity of the latter.

Changes compared to v4[2]:
- The last two patches were sent out separately instead of with the series.
  That created confusion. Those two are debug patches one can use to check
  functionality across architectures. Sorry about that.
- Use DEVICE_ATTR_RW instead (greg)
- Made it a PATCH since the arch-specific handling completes the
  functionality.

[2]: https://lore.kernel.org/all/20251119062100.1112520-1-sshegde@linux.ibm.com/

TODO:
- Get performance numbers on PowerPC, x86 and s390, hopefully by next week.
  Didn't want to hold the series till then.
- The way CPUs are marked as paravirt is very simple and doesn't work when
  vCPUs aren't spread out uniformly across NUMA nodes. Ideally the numbers
  would be split based on how many CPUs each NUMA node has. That is quite
  tricky to do, especially since the cpumask can be on the stack, given
  NR_CPUS can be 8192 and nr_possible_nodes 32. Haven't got my head around
  solving it yet; maybe there is an easier way.
- DLPAR add/remove needs to call the init of EC/VP cores (powerpc specific).
- Userspace tool awareness, such as irqbalance.
- Delve into the design of a hint from the hypervisor (HW hint), i.e. the
  host informs the guest which/how many CPUs it has to use at this moment.
  This interface should work across archs with each arch doing its specific
  handling.
- Determine the default values for the steal time related knobs empirically
  and document them.
- Need to check safety against CPU hotplug, especially in process_steal.

Applies cleanly on tip/master:
commit c2ef745151b21d4dcc4b29a1eabf1096f5ba544b

Thanks to Srikar for providing the initial code around the powerpc steal time
handling. Thanks to all who went through and provided reviews.

PS: I haven't found a better name. Please suggest if you have any.
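To illustrate the affinity rule described above, here is a minimal sketch of
how a CPU selection path could avoid paravirt CPUs while still honouring a
task pinned only to them. This is NOT the code in this series;
cpu_paravirt_mask is the mask the series introduces, while the function names
here are made up purely for illustration and only use cpumask helpers that
already exist upstream.

	/*
	 * Sketch only: prefer an allowed, online, non-paravirt CPU, but
	 * leave the task where it is if its affinity mask is entirely
	 * within cpu_paravirt_mask (user-pinned to paravirt CPUs).
	 */
	#include <linux/cpumask.h>
	#include <linux/sched.h>

	static int sketch_pick_cpu(struct task_struct *p, int prev_cpu)
	{
		int cpu;

		/* Task is pinned only to paravirt CPUs: honour the affinity. */
		if (cpumask_subset(p->cpus_ptr, cpu_paravirt_mask))
			return prev_cpu;

		/* Otherwise pick an allowed CPU that is not marked paravirt. */
		for_each_cpu_andnot(cpu, p->cpus_ptr, cpu_paravirt_mask) {
			if (cpu_online(cpu))
				return cpu;
		}

		return prev_cpu;
	}

The actual series does this inside the fair/rt wakeup and load-balance paths
(see the shortlog below); the sketch only captures the selection rule.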
Shrikanth Hegde (17):
  sched/docs: Document cpu_paravirt_mask and Paravirt CPU concept
  cpumask: Introduce cpu_paravirt_mask
  sched/core: Dont allow to use CPU marked as paravirt
  sched/debug: Remove unused schedstats
  sched/fair: Add paravirt movements for proc sched file
  sched/fair: Pass current cpu in select_idle_sibling
  sched/fair: Don't consider paravirt CPUs for wakeup and load balance
  sched/rt: Don't select paravirt CPU for wakeup and push/pull rt task
  sched/core: Add support for nohz_full CPUs
  sched/core: Push current task from paravirt CPU
  sysfs: Add paravirt CPU file
  powerpc: method to initialize ec and vp cores
  powerpc: enable/disable paravirt CPUs based on steal time
  powerpc: process steal values at fixed intervals
  powerpc: add debugfs file for controlling handling on steal values
  sysfs: Provide write method for paravirt
  sysfs: disable arch handling if paravirt file being written

 .../ABI/testing/sysfs-devices-system-cpu |   9 +
 Documentation/scheduler/sched-arch.rst   |  37 +++
 arch/powerpc/include/asm/smp.h           |   1 +
 arch/powerpc/kernel/smp.c                |   1 +
 arch/powerpc/platforms/pseries/lpar.c    | 223 ++++++++++++++++++
 arch/powerpc/platforms/pseries/pseries.h |   1 +
 drivers/base/cpu.c                       |  59 +++++
 include/linux/cpumask.h                  |  20 ++
 include/linux/sched.h                    |   9 +-
 kernel/sched/core.c                      | 106 ++++++++-
 kernel/sched/debug.c                     |   5 +-
 kernel/sched/fair.c                      |  42 +++-
 kernel/sched/rt.c                        |  11 +-
 kernel/sched/sched.h                     |   9 +
 14 files changed, 519 insertions(+), 14 deletions(-)

-- 
2.47.3
On Wed, Nov 19, 2025 at 06:14:32PM +0530, Shrikanth Hegde wrote:
> Detailed problem statement and some of the implementation choices were
> discussed earlier[1].
>
> [1]: https://lore.kernel.org/all/20250910174210.1969750-1-sshegde@linux.ibm.com/
>
> This is likely the version which would be used for LPC2025 discussion on
> this topic. Feel free to provide your suggestion and hoping for a solution
> that works for different architectures and it's use cases.
>
> All the existing alternatives such as cpu hotplug, creating isolated
> partitions etc break the user affinity. Since number of CPUs to use change
> depending on the steal time, it is not driven by User. Hence it would be
> wrong to break the affinity. This series allows if the task is pinned
> only paravirt CPUs, it will continue running there.
>
> Changes compared v3[1]:

There is no "v" for this series :(
Hi Greg.

On 11/24/25 10:35 PM, Greg KH wrote:
> On Wed, Nov 19, 2025 at 06:14:32PM +0530, Shrikanth Hegde wrote:
>> Detailed problem statement and some of the implementation choices were
>> discussed earlier[1].
>>
>> [1]: https://lore.kernel.org/all/20250910174210.1969750-1-sshegde@linux.ibm.com/
>>
>> This is likely the version which would be used for LPC2025 discussion on
>> this topic. Feel free to provide your suggestion and hoping for a solution
>> that works for different architectures and it's use cases.
>>
>> All the existing alternatives such as cpu hotplug, creating isolated
>> partitions etc break the user affinity. Since number of CPUs to use change
>> depending on the steal time, it is not driven by User. Hence it would be
>> wrong to break the affinity. This series allows if the task is pinned
>> only paravirt CPUs, it will continue running there.
>>
>> Changes compared v3[1]:
>
> There is no "v" for this series :(
>

I thought about adding v1.

I made it a PATCH instead of an RFC PATCH since it should be functionally
complete now with the arch bits. Since it is a v1, I remember people usually
send those out without adding "v1"; only later revisions carry tags such as
v2.

I will keep v2 for the next series.
Hi Shrikanth,

On 25/11/2025 at 03:39, Shrikanth Hegde wrote:
> Hi Greg.
>
> On 11/24/25 10:35 PM, Greg KH wrote:
>> On Wed, Nov 19, 2025 at 06:14:32PM +0530, Shrikanth Hegde wrote:
>>> Detailed problem statement and some of the implementation choices were
>>> discussed earlier[1].
>>>
>>> [1]: https://lore.kernel.org/all/20250910174210.1969750-1-sshegde@linux.ibm.com/
>>>
>>> This is likely the version which would be used for LPC2025 discussion on
>>> this topic. Feel free to provide your suggestion and hoping for a
>>> solution that works for different architectures and it's use cases.
>>>
>>> All the existing alternatives such as cpu hotplug, creating isolated
>>> partitions etc break the user affinity. Since number of CPUs to use
>>> change depending on the steal time, it is not driven by User. Hence it
>>> would be wrong to break the affinity. This series allows if the task is
>>> pinned only paravirt CPUs, it will continue running there.
>>>
>>> Changes compared v3[1]:
>>
>> There is no "v" for this series :(
>>
>
> I thought about adding v1.
>
> I made it as PATCH from RFC PATCH since functionally it should
> be complete now with arch bits. Since it is v1, I remember usually
> people send out without adding v1. after v1 had tags such as v2.
>
> I will keep v2 for the next series.
>

But you are listing changes compared to v3, how can it be a v1? Shouldn't it
be a v4? Or in reality a v5, as you already sent a v4 here [1].

[1] https://lore.kernel.org/all/20251119062100.1112520-1-sshegde@linux.ibm.com/

Christophe
Hi Christophe, Greg,

>>>
>>> There is no "v" for this series :(
>>>
>>
>> I thought about adding v1.
>>
>> I made it as PATCH from RFC PATCH since functionally it should
>> be complete now with arch bits. Since it is v1, I remember usually
>> people send out without adding v1. after v1 had tags such as v2.
>>
>> I will keep v2 for the next series.
>>
>
> But you are listing changes compared to v3, how can it be a v1 ?
> Shouldn't it be a v4 ? Or in reality a v5 as you already sent a v4 here
> [1].
>
> [1] https://lore.kernel.org/all/20251119062100.1112520-1-sshegde@linux.ibm.com/
>
> Christophe

Sorry about the confusion in numbers. Hopefully the below helps for
reviewing. If there are no objections, I will keep the next one as v2.
Please let me know.

Revision logs:

++++++++++++++++++++++++++++++++++++++
RFC PATCH v4 -> PATCH (This series)
++++++++++++++++++++++++++++++++++++++
- The last two patches were sent out separately instead of being with the
  series. Sent them as part of the series.
- Use DEVICE_ATTR_RW instead (greg)
- Made it a PATCH since the arch-specific handling completes the
  functionality.

+++++++++++++++++++++++++++++++++
RFC PATCH v3 -> RFC PATCH v4
+++++++++++++++++++++++++++++++++
- Introduced computation of steal time in powerpc code.
- Derive the number of CPUs to use and mark the remaining as paravirt based
  on steal values.
- Provide debugfs knobs to alter how steal time values are used.
- Removed static key check for paravirt CPUs (Yury)
- Removed preempt_disable/enable while calling stopper (Prateek)
- Made select_idle_sibling and friends aware of paravirt CPUs.
- Removed 3 unused schedstat fields and introduced 2 related to paravirt
  handling.
- Handled the nohz_full case by enabling the tick when there is CFS/RT on
  such a CPU.
- Updated the debug patch to override arch behavior for easier debugging
  during development.
- Kept the method of pushing only the current task out instead of moving all
  tasks on the rq, given the complexity of the latter.

+++++++++++++++++++++++++++++++++
RFC v2 -> RFC PATCH v3
+++++++++++++++++++++++++++++++++
- Renamed to paravirt_cpus_mask.
- Folded the changes under CONFIG_PARAVIRT.
- Fixed the crash due to work_buf corruption while using stop_one_cpu_nowait.
- Added sysfs documentation.
- Copied most of __balance_push_cpu_stop into a new function; this helps
  move the code out of CONFIG_HOTPLUG_CPU.
- Some of the suggested code movement.

+++++++++++++++++++++++++++++++++
RFC PATCH -> RFC v2
+++++++++++++++++++++++++++++++++
- Renamed to cpu_avoid_mask in place of cpu_parked_mask.
- Used a static key such that there is no impact in the regular case.
- Added a sysfs file to show avoid CPUs.
- Made RT understand avoid CPUs.
- Added a documentation patch.
- Took care of the reported compile error when NR_CPUS=1.

PATCH        : https://lore.kernel.org/all/20251119124449.1149616-1-sshegde@linux.ibm.com/
RFC PATCH v4 : https://lore.kernel.org/all/20251119062100.1112520-1-sshegde@linux.ibm.com/#r
RFC PATCH v3 : https://lore.kernel.org/all/20250910174210.1969750-1-sshegde@linux.ibm.com/#r
RFC v2       : https://lore.kernel.org/all/20250625191108.1646208-1-sshegde@linux.ibm.com/#r
RFC PATCH    : https://lore.kernel.org/all/20250523181448.3777233-1-sshegde@linux.ibm.com/
On 11/19/25 6:14 PM, Shrikanth Hegde wrote:
> Detailed problem statement and some of the implementation choices were
> discussed earlier[1].

Performance data on x86 and PowerPC:

++++++++++++++++++++++++++++++++++++++++++++++++
PowerPC: LPAR (VM) running on the PowerVM hypervisor
++++++++++++++++++++++++++++++++++++++++++++++++
Host: 126 cores available in the pool.
VM1: 96VP/64EC - 768 CPUs
VM2: 72VP/48EC - 576 CPUs
(VP - Virtual Processor cores), (EC - Entitled Cores)

steal_check_frequency:1
steal_ratio_high:400
steal_ratio_low:150
(a sketch of how such thresholds could be used follows the x86 results below)

Scenarios:

Scenario 1: (Major improvement)
VM1 is running daytrader[1] and VM2 is running stress-ng --cpu=$(nproc)

Note: High gains. With upstream the steal time was around 15%; with the
series it comes down to 3%. With further tuning it could be reduced more.

                         upstream    +series
daytrader throughput     1x          1.7x     <<- 70% gain

-----------
Scenario 2: (Improves when thread_count < num_cpus)
VM1 is running schbench and VM2 is running stress-ng --cpu=$(nproc)

Note: Values are the average of 5 runs and are wakeup latencies.

schbench -t 400     upstream     +series
50.0th:                18.00       16.60
90.0th:               174.00       46.80
99.0th:              3197.60      928.80
99.9th:              6203.20     4539.20
average rps:        39665.61    42334.65

schbench -t 600     upstream     +series
50.0th:                23.80       19.80
90.0th:               917.20      439.00
99.0th:              5582.40     3869.60
99.9th:              8982.40     6574.40
average rps:        39541.00    40018.11

-----------
Scenario 3: (Improves)
VM1 is running hackbench and VM2 is running stress-ng --cpu=$(nproc)

Note: Values are the average of 10 runs with 20000 loops (seconds).

                             upstream    +series
Process         10 groups      2.84        2.62
Process         20 groups      5.39        4.48
Process         30 groups      7.51        6.29
Process         40 groups      9.88        7.42
Process         50 groups     12.46        9.54
Process         60 groups     14.76       12.09
thread          10 groups      2.93        2.70
thread          20 groups      5.79        4.78
Process(Pipe)   10 groups      2.31        2.18
Process(Pipe)   20 groups      3.32        3.26
Process(Pipe)   30 groups      4.19        4.14
Process(Pipe)   40 groups      5.18        5.53
Process(Pipe)   50 groups      6.57        6.80
Process(Pipe)   60 groups      8.21        8.13
thread(Pipe)    10 groups      2.42        2.24
thread(Pipe)    20 groups      3.62        3.42

-----------
Notes:
Numbers might be very favorable since VM2 is constantly running and has some
CPUs marked as paravirt whenever there is steal time; the thresholds also
might have played a role. Will plan to run the same workloads, i.e. hackbench
and schbench, on both VMs and see the behavior.

VM1 has its CPUs distributed equally across NUMA nodes, while VM2 does not.
Since CPUs are marked paravirt based on core count, some nodes on VM2 would
have been left unused, and that could have added a boost to VM1 performance,
especially for daytrader.

[1]: Daytrader is a real-life benchmark which does stock trading simulation.
https://www.ibm.com/docs/en/linux-on-systems?topic=descriptions-daytrader-benchmark-application
https://cwiki.apache.org/confluence/display/GMOxDOC12/Daytrader

TODO: Get numbers with very high concurrency of hackbench/schbench.

+++++++++++++++++++++++++++++++
x86_64 (Laptop running KVMs)
+++++++++++++++++++++++++++++++
Host: 8 CPUs. Two VMs, each spawned with -smp 8.

-----------
Scenario 1:
Both VMs are running hackbench with 10 process groups and 10000 loops.
Values are the average of 3 runs.

Steal time close to 50% was seen when running upstream, so CPUs 4-7 were
marked as paravirt by writing to the sysfs file. Since the laptop has a lot
of host tasks running, there will still be some steal time.

hackbench 10 groups    upstream    +series (4-7 marked as paravirt)
(seconds)                 58         54.42

Note: Having 5 groups helps too. But when concurrency goes very high
(40 groups), it regresses.

-----------
Scenario 2:
Both VMs are running schbench. Values are the average of 2 runs.
"schbench -t 4 -r 30 -i 30" (latencies improve but rps is slightly less) wakeup latencies upstream +series(4-7 marked as paravirt) 50.0th 25.5 13.5 90.0th 70.0 30.0 99.0th 2588.0 1992.0 99.9th 3844.0 6032.0 average rps: 338 326 schbench -t 8 -r 30 -i 30 (Major degradation of rps) wakeup latencies upstream +series(4-7 marked as paravirt) 50.0th 15.0 11.5 90.0th 1630.0 2844.0 99.0th 4314.0 6624.0 99.9th 8572.0 10896.0 average rps: 393 240.5 Anything higher also regress. Need to see why it might be? Maybe too many context switches since number of threads are too high and CPUs available is less.