With guest hlt, pause and mwait pass through, the hypervisor loses
visibility on real guest cpu activity. From the point of view of the
host, such vcpus are always 100% active even when the guest is
completely halted.

Typically hlt, pause and mwait pass through is only implemented on
non-timeshared pcpus. However, there are cases where this assumption
cannot be strictly met as some occasional housekeeping work needs to be
scheduled on such cpus while we generally want to preserve the pass
through performance gains. This applies to systems which don't have
dedicated cpus for housekeeping purposes.

In such cases, the lack of visibility of the hypervisor is problematic
from a load balancing point of view. In the absence of a better signal,
it will preempt vcpus at random. For example it could decide to
interrupt a vcpu doing critical idle poll work while another vcpu sits
idle.

Another motivation for gaining visibility into real guest cpu activity
is to enable the hypervisor to vend metrics about it for external
consumption.

In this RFC we introduce the concept of guest halted time to address
these concerns. Guest halted time (gtime_halted) accounts for cycles
spent in guest mode while the cpu is halted. gtime_halted relies on
measuring the mperf msr register (x86) around VM enter/exits to compute
the number of unhalted cycles; halted cycles are then derived from the
tsc difference minus the mperf difference.

gtime_halted is exposed in /proc/<pid>/stat as a new entry, which
enables users to monitor real guest activity.

gtime_halted is also plumbed into the scheduler infrastructure to
discount halted cycles from fair load accounting. This enlightens the
load balancer to real guest activity for better task placement.

This initial RFC has a few limitations and open questions:
* only the x86 infrastructure is supported as it relies on architecture
  dependent registers. Future development will extend this to ARM.
* we assume that mperf accumulates at the same rate as tsc. While I am
  not certain whether this assumption is ever violated, the spec doesn't
  seem to offer this guarantee [1] so we may want to calibrate mperf.
* the sched enlightenment logic relies on periodic gtime_halted updates.
  As such, it is incompatible with nohz full because this could result
  in long periods of no update followed by a massive halted time update
  which doesn't play well with the existing PELT integration. It is
  possible to address this limitation with generalized, more complex
  accounting.

[1] https://cdrdv2.intel.com/v1/dl/getContent/671427
    "The TSC, IA32_MPERF, and IA32_FIXED_CTR2 operate at close to the
    maximum non-turbo frequency, which is equal to the product of
    scalable bus frequency and maximum non-turbo ratio."

Fernand Sieber (3):
  fs/proc: Add gtime halted to proc/<pid>/stat
  kvm/x86: Add support for gtime halted
  sched,x86: Make the scheduler guest unhalted aware

 Documentation/filesystems/proc.rst |  1 +
 arch/x86/include/asm/tsc.h         |  1 +
 arch/x86/kernel/tsc.c              | 13 +++++++++
 arch/x86/kvm/x86.c                 | 30 +++++++++++++++++++++
 fs/proc/array.c                    |  7 ++++-
 include/linux/sched.h              |  5 ++++
 include/linux/sched/signal.h       |  1 +
 kernel/exit.c                      |  1 +
 kernel/fork.c                      |  2 +-
 kernel/sched/core.c                |  1 +
 kernel/sched/fair.c                | 25 ++++++++++++++++++
 kernel/sched/pelt.c                | 42 +++++++++++++++++++++++++-----
 kernel/sched/sched.h               |  2 ++
 13 files changed, 122 insertions(+), 9 deletions(-)

=== TESTING ===

For testing I use a host running a VM via QEMU and I simulate host
interference via instances of stress.
The VM uses 16 vCPUs, which are pinned to pCPUs 0-15. Each vCPU is
pinned to a dedicated pCPU, which follows the 'mostly non-timeshared
CPU' model. We use the -overcommit cpu-pm=on QEMU flag to enable hlt,
mwait and pause pass through.

On the host, alongside QEMU, there are 8 stressors pinned to the same
CPUs (taskset -c 0-15 stress --cpu 8). The VM then runs rtla on 8 cores
to measure host interference. With the enlightenment in the patches we
expect the load balancer to move the stressors to the remaining 8 idle
cores and to mostly eliminate interference.

With enlightenment:

rtla hwnoise -c 0-7 -P f:50 -p 27000 -r 26000 -d 2m -T 1000 -q --warm-up 60

Hardware-related Noise
duration:   0 00:02:00 | time is in us
CPU Period      Runtime       Noise  % CPU Aval  Max Noise  Max Single    HW   NMI
  0 #4443     115518000           0   100.00000          0           0     0     0
  1 #4442     115512416      144178    99.87518       4006        4006    37     0
  2 #4443     115518000           0   100.00000          0           0     0     0
  3 #4443     115518000           0   100.00000          0           0     0     0
  4 #4443     115518000           0   100.00000          0           0     0     0
  5 #4443     115518000           0   100.00000          0           0     0     0
  6 #4444     115547479       11018    99.99046       4006        4006     3     0
  7 #4444     115544000       12015    99.98960       4005        4005     3     0

Baseline without patches:

rtla hwnoise -c 0-7 -P f:50 -p 27000 -r 26000 -d 2m -T 1000 -q --warm-up 60

Hardware-related Noise
duration:   0 00:02:00 | time is in us
CPU Period      Runtime       Noise  % CPU Aval  Max Noise  Max Single    HW   NMI
  0 #4171     112394904    36139505    67.84595      29015       13006  4533     0
  1 #4153     111960227    38277963    65.81110      29015       13006  4748     0
  2 #3882     108016483    73845612    31.63486      29017       16005  8628     0
  3 #3881     108088929    73946692    31.58717      30017       14006  8636     0
  4 #4177     112380299    36646487    67.39064      28018       14007  4551     0
  5 #4157     112059732    37863899    66.21096      28017       13005  4689     0
  6 #4166     112312643    37458217    66.64826      29016       14005  4653     0
  7 #4157     112034934    36922368    67.04387      29015       14006  4609     0

--
2.43.0
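To make the accounting described in the cover letter concrete, below is a
minimal userspace C sketch of the arithmetic only; it is not the kernel
code from the series, and the helper name and sample values are
hypothetical. MPERF advances only while the CPU is unhalted (C0) whereas
the TSC always advances, so the halted cycles in a VM enter/exit window
are the TSC delta minus the MPERF delta, assuming both counters tick at
the same reference rate.

#include <stdint.h>
#include <stdio.h>

struct cycle_sample {
	uint64_t tsc;    /* TSC: advances in all C-states */
	uint64_t mperf;  /* IA32_MPERF: advances only while unhalted (C0) */
};

/*
 * Halted cycles in a window: cycles where the TSC advanced but MPERF
 * did not, assuming both count at the same reference rate.
 */
static uint64_t halted_cycles(struct cycle_sample before,
			      struct cycle_sample after)
{
	uint64_t tsc_delta = after.tsc - before.tsc;
	uint64_t mperf_delta = after.mperf - before.mperf;

	return tsc_delta - mperf_delta;
}

int main(void)
{
	/* Hypothetical samples taken around one VM enter/exit window. */
	struct cycle_sample before = { .tsc = 1000000, .mperf =  500000 };
	struct cycle_sample after  = { .tsc = 4000000, .mperf = 1000000 };

	/* 3000000 TSC cycles elapsed, 500000 unhalted -> 2500000 halted. */
	printf("halted cycles: %llu\n",
	       (unsigned long long)halted_cycles(before, after));
	return 0;
}

In the series the equivalent per-window deltas would presumably be
accumulated into gtime_halted for the vCPU task; this sketch only shows
the window arithmetic.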
On Tue, Feb 18, 2025, Fernand Sieber wrote:
> With guest hlt, pause and mwait pass through, the hypervisor loses
> visibility on real guest cpu activity. From the point of view of the
> host, such vcpus are always 100% active even when the guest is
> completely halted.
>
> Typically hlt, pause and mwait pass through is only implemented on
> non-timeshared pcpus. However, there are cases where this assumption
> cannot be strictly met as some occasional housekeeping work needs to be

What housekeeping work?

> scheduled on such cpus while we generally want to preserve the pass
> through performance gains. This applies to systems which don't have
> dedicated cpus for housekeeping purposes.
>
> In such cases, the lack of visibility of the hypervisor is problematic
> from a load balancing point of view. In the absence of a better signal,
> it will preempt vcpus at random. For example it could decide to
> interrupt a vcpu doing critical idle poll work while another vcpu sits
> idle.
>
> Another motivation for gaining visibility into real guest cpu activity
> is to enable the hypervisor to vend metrics about it for external
> consumption.

Such as?

> In this RFC we introduce the concept of guest halted time to address
> these concerns. Guest halted time (gtime_halted) accounts for cycles
> spent in guest mode while the cpu is halted. gtime_halted relies on
> measuring the mperf msr register (x86) around VM enter/exits to compute
> the number of unhalted cycles; halted cycles are then derived from the
> tsc difference minus the mperf difference.

IMO, there are better ways to solve this than having KVM sample MPERF on
every entry and exit.

The kernel already samples APERF/MPERF on every tick and provides that
information via /proc/cpuinfo, just use that. If your userspace is unable
to use /proc/cpuinfo or similar, that needs to be explained.

And if you're running vCPUs on tickless CPUs, and you're doing HLT/MWAIT
passthrough, *and* you want to schedule other tasks on those CPUs, then
IMO you're abusing all of those things and it's not KVM's problem to
solve, especially now that sched_ext is a thing.

> gtime_halted is exposed in /proc/<pid>/stat as a new entry, which
> enables users to monitor real guest activity.
>
> gtime_halted is also plumbed into the scheduler infrastructure to
> discount halted cycles from fair load accounting. This enlightens the
> load balancer to real guest activity for better task placement.
>
> This initial RFC has a few limitations and open questions:
> * only the x86 infrastructure is supported as it relies on architecture
>   dependent registers. Future development will extend this to ARM.
> * we assume that mperf accumulates at the same rate as tsc. While I am
>   not certain whether this assumption is ever violated, the spec doesn't
>   seem to offer this guarantee [1] so we may want to calibrate mperf.
> * the sched enlightenment logic relies on periodic gtime_halted updates.
>   As such, it is incompatible with nohz full because this could result
>   in long periods of no update followed by a massive halted time update
>   which doesn't play well with the existing PELT integration. It is
>   possible to address this limitation with generalized, more complex
>   accounting.
On Tue, 2025-02-25 at 18:17 -0800, Sean Christopherson wrote:
> On Tue, Feb 18, 2025, Fernand Sieber wrote:
> > With guest hlt, pause and mwait pass through, the hypervisor loses
> > visibility on real guest cpu activity. From the point of view of the
> > host, such vcpus are always 100% active even when the guest is
> > completely halted.
> >
> > Typically hlt, pause and mwait pass through is only implemented on
> > non-timeshared pcpus. However, there are cases where this assumption
> > cannot be strictly met as some occasional housekeeping work needs to be
>
> What housekeeping work?

In the case that we want to solve, housekeeping work is mainly userspace
tasks implementing hypervisor functionality such as gathering metrics,
performing health checks, handling VM lifecycle, etc. The platforms don't
have dedicated cpus for housekeeping purposes and try as much as possible
to fully dedicate the cpus to VMs, hence HLT/MWAIT pass through. The
housekeeping work is low but can still interfere with guests that are
running very latency sensitive operations on a subset of vCPUs (e.g. idle
poll), which is what we want to detect and avoid.

> > scheduled on such cpus while we generally want to preserve the pass
> > through performance gains. This applies to systems which don't have
> > dedicated cpus for housekeeping purposes.
> >
> > In such cases, the lack of visibility of the hypervisor is problematic
> > from a load balancing point of view. In the absence of a better signal,
> > it will preempt vcpus at random. For example it could decide to
> > interrupt a vcpu doing critical idle poll work while another vcpu sits
> > idle.
> >
> > Another motivation for gaining visibility into real guest cpu activity
> > is to enable the hypervisor to vend metrics about it for external
> > consumption.
>
> Such as?

An example is feeding VM utilisation metrics to other systems like auto
scaling of guest applications. While it is possible to implement this
functionality purely on the guest side, having the hypervisor handle it
means that it's available out of the box for all VMs in a standard way
without relying on guest side configuration.

> > In this RFC we introduce the concept of guest halted time to address
> > these concerns. Guest halted time (gtime_halted) accounts for cycles
> > spent in guest mode while the cpu is halted. gtime_halted relies on
> > measuring the mperf msr register (x86) around VM enter/exits to compute
> > the number of unhalted cycles; halted cycles are then derived from the
> > tsc difference minus the mperf difference.
>
> IMO, there are better ways to solve this than having KVM sample MPERF on
> every entry and exit.
>
> The kernel already samples APERF/MPERF on every tick and provides that
> information via /proc/cpuinfo, just use that. If your userspace is unable
> to use /proc/cpuinfo or similar, that needs to be explained.

If I understand correctly, what you are suggesting is to have userspace
regularly sample these values to detect the most idle CPUs and then use
CPU affinity to repin housekeeping tasks to these. While it's possible,
this essentially requires implementing another scheduling layer in
userspace through constant re-pinning of tasks. This also requires
constantly identifying the full set of tasks that can induce undesirable
overhead so that they can be pinned accordingly. For these reasons we
would rather want the logic to be implemented directly in the scheduler.
> And if you're running vCPUs on tickless CPUs, and you're doing HLT/MWAIT
> passthrough, *and* you want to schedule other tasks on those CPUs, then
> IMO you're abusing all of those things and it's not KVM's problem to
> solve, especially now that sched_ext is a thing.

We are running vCPUs with ticks; the rest of your observations are
correct.

> > gtime_halted is exposed in /proc/<pid>/stat as a new entry, which
> > enables users to monitor real guest activity.
> >
> > gtime_halted is also plumbed into the scheduler infrastructure to
> > discount halted cycles from fair load accounting. This enlightens the
> > load balancer to real guest activity for better task placement.
> >
> > This initial RFC has a few limitations and open questions:
> > * only the x86 infrastructure is supported as it relies on architecture
> >   dependent registers. Future development will extend this to ARM.
> > * we assume that mperf accumulates at the same rate as tsc. While I am
> >   not certain whether this assumption is ever violated, the spec doesn't
> >   seem to offer this guarantee [1] so we may want to calibrate mperf.
> > * the sched enlightenment logic relies on periodic gtime_halted updates.
> >   As such, it is incompatible with nohz full because this could result
> >   in long periods of no update followed by a massive halted time update
> >   which doesn't play well with the existing PELT integration. It is
> >   possible to address this limitation with generalized, more complex
> >   accounting.
On Wed, Feb 26, 2025, Fernand Sieber wrote:
> On Tue, 2025-02-25 at 18:17 -0800, Sean Christopherson wrote:
> > > In this RFC we introduce the concept of guest halted time to address
> > > these concerns. Guest halted time (gtime_halted) accounts for cycles
> > > spent in guest mode while the cpu is halted. gtime_halted relies on
> > > measuring the mperf msr register (x86) around VM enter/exits to compute
> > > the number of unhalted cycles; halted cycles are then derived from the
> > > tsc difference minus the mperf difference.
> >
> > IMO, there are better ways to solve this than having KVM sample MPERF on
> > every entry and exit.
> >
> > The kernel already samples APERF/MPERF on every tick and provides that
> > information via /proc/cpuinfo, just use that. If your userspace is unable
> > to use /proc/cpuinfo or similar, that needs to be explained.
>
> If I understand correctly, what you are suggesting is to have userspace
> regularly sample these values to detect the most idle CPUs and then use
> CPU affinity to repin housekeeping tasks to these. While it's possible,
> this essentially requires implementing another scheduling layer in
> userspace through constant re-pinning of tasks. This also requires
> constantly identifying the full set of tasks that can induce undesirable
> overhead so that they can be pinned accordingly. For these reasons we
> would rather want the logic to be implemented directly in the scheduler.
>
> > And if you're running vCPUs on tickless CPUs, and you're doing HLT/MWAIT
> > passthrough, *and* you want to schedule other tasks on those CPUs, then
> > IMO you're abusing all of those things and it's not KVM's problem to
> > solve, especially now that sched_ext is a thing.
>
> We are running vCPUs with ticks; the rest of your observations are
> correct.

If there's a host tick, why do you need KVM's help to make scheduling
decisions? It sounds like what you want is a scheduler that is primarily
driven by MPERF (and APERF?), and sched_tick() => arch_scale_freq_tick()
already knows about MPERF.
On Wed, 2025-02-26 at 13:00 -0800, Sean Christopherson wrote:
> On Wed, Feb 26, 2025, Fernand Sieber wrote:
> > On Tue, 2025-02-25 at 18:17 -0800, Sean Christopherson wrote:
> > > > In this RFC we introduce the concept of guest halted time to address
> > > > these concerns. Guest halted time (gtime_halted) accounts for cycles
> > > > spent in guest mode while the cpu is halted. gtime_halted relies on
> > > > measuring the mperf msr register (x86) around VM enter/exits to compute
> > > > the number of unhalted cycles; halted cycles are then derived from the
> > > > tsc difference minus the mperf difference.
> > >
> > > IMO, there are better ways to solve this than having KVM sample MPERF on
> > > every entry and exit.
> > >
> > > The kernel already samples APERF/MPERF on every tick and provides that
> > > information via /proc/cpuinfo, just use that. If your userspace is unable
> > > to use /proc/cpuinfo or similar, that needs to be explained.
> >
> > If I understand correctly, what you are suggesting is to have userspace
> > regularly sample these values to detect the most idle CPUs and then use
> > CPU affinity to repin housekeeping tasks to these. While it's possible,
> > this essentially requires implementing another scheduling layer in
> > userspace through constant re-pinning of tasks. This also requires
> > constantly identifying the full set of tasks that can induce undesirable
> > overhead so that they can be pinned accordingly. For these reasons we
> > would rather want the logic to be implemented directly in the scheduler.
> >
> > > And if you're running vCPUs on tickless CPUs, and you're doing HLT/MWAIT
> > > passthrough, *and* you want to schedule other tasks on those CPUs, then
> > > IMO you're abusing all of those things and it's not KVM's problem to
> > > solve, especially now that sched_ext is a thing.
> >
> > We are running vCPUs with ticks; the rest of your observations are
> > correct.
>
> If there's a host tick, why do you need KVM's help to make scheduling
> decisions? It sounds like what you want is a scheduler that is primarily
> driven by MPERF (and APERF?), and sched_tick() => arch_scale_freq_tick()
> already knows about MPERF.

Having the measurement around VM enter/exit makes it easy to attribute
the unhalted cycles to a specific task (vCPU), which solves both our use
cases of VM metrics and scheduling. That said, we may be able to avoid it
and achieve the same results, i.e.:

* the VM metrics use case can be solved by using /proc/cpuinfo from
  userspace.
* for the scheduling use case, the tick based sampling of MPERF means we
  could potentially introduce a correcting factor on PELT accounting of
  pinned vCPU tasks based on its value (similar to what I do in the last
  patch of the series).

The combination of these would remove the requirement of adding any
logic around VM enter/exit to support our use cases.

I'm happy to prototype that if we think it's going in the right
direction?
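To illustrate the correcting factor floated in the second bullet above,
here is a rough, self-contained C model. It is purely illustrative: the
function name and numbers are hypothetical and this is not the code from
the series. The idea is that the runtime charged to a pinned vCPU task at
each tick would be scaled by the unhalted fraction, i.e. the MPERF delta
divided by the TSC delta observed over that tick.

#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical per-tick correction: scale the wall-clock runtime charged
 * to a pinned vCPU task by the fraction of reference cycles the CPU was
 * actually unhalted over the tick (MPERF only advances in C0, the TSC
 * always advances).
 */
static uint64_t discounted_runtime_ns(uint64_t runtime_ns,
				      uint64_t mperf_delta,
				      uint64_t tsc_delta)
{
	if (!tsc_delta)
		return runtime_ns;

	/* Clamp in case the two counters are sampled slightly out of sync. */
	if (mperf_delta > tsc_delta)
		mperf_delta = tsc_delta;

	return runtime_ns * mperf_delta / tsc_delta;
}

int main(void)
{
	/*
	 * Example: over a 1 ms tick only 250000 of 1000000 reference cycles
	 * were unhalted, so only 250000 ns of runtime would feed the vCPU's
	 * load accounting instead of the full 1000000 ns.
	 */
	printf("%llu ns\n", (unsigned long long)
	       discounted_runtime_ns(1000000, 250000, 1000000));
	return 0;
}

A real implementation would also have to handle multiplication overflow
and the long idle periods under nohz full that the cover letter calls
out.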
On Thu, Feb 27, 2025, Fernand Sieber wrote:
> On Wed, 2025-02-26 at 13:00 -0800, Sean Christopherson wrote:
> > On Wed, Feb 26, 2025, Fernand Sieber wrote:
> > > On Tue, 2025-02-25 at 18:17 -0800, Sean Christopherson wrote:
> > > > And if you're running vCPUs on tickless CPUs, and you're doing
> > > > HLT/MWAIT passthrough, *and* you want to schedule other tasks on those
> > > > CPUs, then IMO you're abusing all of those things and it's not KVM's
> > > > problem to solve, especially now that sched_ext is a thing.
> > >
> > > We are running vCPUs with ticks; the rest of your observations are
> > > correct.
> >
> > If there's a host tick, why do you need KVM's help to make scheduling
> > decisions? It sounds like what you want is a scheduler that is primarily
> > driven by MPERF (and APERF?), and sched_tick() => arch_scale_freq_tick()
> > already knows about MPERF.
>
> Having the measurement around VM enter/exit makes it easy to attribute
> the unhalted cycles to a specific task (vCPU), which solves both our use
> cases of VM metrics and scheduling. That said, we may be able to avoid
> it and achieve the same results, i.e.:
>
> * the VM metrics use case can be solved by using /proc/cpuinfo from
>   userspace.
> * for the scheduling use case, the tick based sampling of MPERF means
>   we could potentially introduce a correcting factor on PELT accounting
>   of pinned vCPU tasks based on its value (similar to what I do in the
>   last patch of the series).
>
> The combination of these would remove the requirement of adding any
> logic around VM enter/exit to support our use cases.
>
> I'm happy to prototype that if we think it's going in the right
> direction?

That's mostly a question for the scheduler folks. That said, from a KVM
perspective, sampling MPERF around entry/exit for scheduling purposes is
a non-starter.