[PATCH v4 7/8] kvm: i386: irqchip: take BQL only if there is an interrupt

Igor Mammedov posted 8 patches 2 months, 2 weeks ago
Maintainers: Richard Henderson <richard.henderson@linaro.org>, Paolo Bonzini <pbonzini@redhat.com>, Riku Voipio <riku.voipio@iki.fi>, "Michael S. Tsirkin" <mst@redhat.com>, Igor Mammedov <imammedo@redhat.com>, Ani Sinha <anisinha@redhat.com>, Thomas Huth <thuth@redhat.com>, Halil Pasic <pasic@linux.ibm.com>, Christian Borntraeger <borntraeger@linux.ibm.com>, David Hildenbrand <david@redhat.com>, Jason Herne <jjherne@linux.ibm.com>, Stafford Horne <shorne@gmail.com>, Eduardo Habkost <eduardo@habkost.net>, Marcel Apfelbaum <marcel.apfelbaum@gmail.com>, "Philippe Mathieu-Daudé" <philmd@linaro.org>, Yanan Wang <wangyanan55@huawei.com>, Zhao Liu <zhao1.liu@intel.com>, Peter Xu <peterx@redhat.com>, Peter Maydell <peter.maydell@linaro.org>, Alexander Graf <agraf@csgraf.de>, Mads Ynddal <mads@ynddal.dk>, Michael Rolnik <mrolnik@gmail.com>, Helge Deller <deller@gmx.de>, Cameron Esfahani <dirty@apple.com>, Roman Bolshakov <rbolshakov@ddn.com>, Phil Dennis-Jordan <phil@philjordan.eu>, Marcelo Tosatti <mtosatti@redhat.com>, Reinoud Zandijk <reinoud@netbsd.org>, Sunil Muthuswamy <sunilmut@microsoft.com>, Song Gao <gaosong@loongson.cn>, Laurent Vivier <laurent@vivier.eu>, "Edgar E. Iglesias" <edgar.iglesias@gmail.com>, Aurelien Jarno <aurelien@aurel32.net>, Jiaxun Yang <jiaxun.yang@flygoat.com>, Aleksandar Rikalo <arikalo@gmail.com>, Huacai Chen <chenhuacai@kernel.org>, Nicholas Piggin <npiggin@gmail.com>, Chinmay Rath <rathc@linux.ibm.com>, Harsh Prateek Bora <harshpb@linux.ibm.com>, Yoshinori Sato <yoshinori.sato@nifty.com>, Ilya Leoshkevich <iii@linux.ibm.com>, Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>, Artyom Tarasenko <atar4qemu@gmail.com>
Posted by Igor Mammedov 2 months, 2 weeks ago
When kernel-irqchip=split is used, QEMU still hits a BQL
contention issue when reading the ACPI PM/HPET timers
(despite timer access itself being lock-less).

As a result, Windows with more than 255 CPUs is still unable to
boot (since it requires an IOMMU, and hence split irqchip).

The problematic path is in kvm_arch_pre_run(), where the BQL is taken
unconditionally when split irqchip is in use.

There are a few things that the BQL protects there:
  1. interrupt check and injection

    However, we do not take the BQL when checking for a pending
    interrupt (even within the same function), so the patch
    takes the same approach for the cpu->interrupt_request checks
    and takes the BQL only if there is work to do.

  2. request_interrupt_window access
      CPUState::kvm_run::request_interrupt_window doesn't need the BQL
      as it's accessed only by its own vCPU thread.

  3. cr8/cpu_get_apic_tpr access
      The same (as #2) applies to CPUState::kvm_run::cr8,
      and APIC registers are also cached/synced (get/put) within
      the vCPU thread they belong to.

Taking the BQL only when necessary eliminates the BQL bottleneck on
the IO/MMIO-only exit path, improving latency by 80% on an HPET micro
benchmark.

This lets Windows boot successfully (in case hv-time isn't used)
when more than 255 vCPUs are in use.
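
For readers unfamiliar with the pattern, the idea can be sketched outside
of QEMU as follows. This is only an illustration under stated assumptions:
it uses pthreads and C11 atomics instead of bql_lock() and QEMU's own
helpers, and big_lock, raise_irq(), pre_run() and lock_count() are
hypothetical names that do not exist in QEMU.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for illustration; none of these exist in QEMU. */
#define IRQ_PENDING 0x1u

static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER; /* plays the BQL */
static atomic_uint interrupt_request;  /* bits are set by other threads */
static unsigned lock_acquisitions;     /* instrumentation, written under big_lock */

/* Called by a device thread to flag a pending interrupt. */
void raise_irq(void)
{
    atomic_fetch_or(&interrupt_request, IRQ_PENDING);
}

/*
 * Called by the vCPU thread before every guest entry.
 * Returns true if an interrupt was handled.
 */
bool pre_run(void)
{
    /* Fast path: lock-free check, so a plain IO/MMIO exit never locks. */
    if (!(atomic_load(&interrupt_request) & IRQ_PENDING)) {
        return false;
    }

    /* Slow path: hold the lock only around the injection work. */
    pthread_mutex_lock(&big_lock);
    lock_acquisitions++;
    atomic_fetch_and(&interrupt_request, ~IRQ_PENDING);
    /* ... fetch the vector from the interrupt controller and inject it ... */
    pthread_mutex_unlock(&big_lock);
    return true;
}

/* Number of times the slow path took the lock. */
unsigned lock_count(void)
{
    return lock_acquisitions;
}
```

Calling pre_run() repeatedly without a prior raise_irq() leaves
lock_count() at zero, which mirrors the contention-free IO/MMIO exit
path this patch is after.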

Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
---
v3:
  * drop unneeded pair of () in cpu->interrupt_request & CPU_INTERRUPT_HARD
    check
  * Paolo Bonzini <pbonzini@redhat.com>
     * don't take BQL when setting exit_request, use qatomic_set() instead
     * after above simplification take/release BQL unconditionally
     * drop smp_mb() after run->cr8/run->request_interrupt_window update
---
 target/i386/kvm/kvm.c | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index a7b5c8f81b..306430a052 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -5478,9 +5478,6 @@ void kvm_arch_pre_run(CPUState *cpu, struct kvm_run *run)
         }
     }
 
-    if (!kvm_pic_in_kernel()) {
-        bql_lock();
-    }
 
     /* Force the VCPU out of its inner loop to process any INIT requests
      * or (for userspace APIC, but it is cheap to combine the checks here)
@@ -5489,10 +5486,10 @@ void kvm_arch_pre_run(CPUState *cpu, struct kvm_run *run)
     if (cpu_test_interrupt(cpu, CPU_INTERRUPT_INIT | CPU_INTERRUPT_TPR)) {
         if (cpu_test_interrupt(cpu, CPU_INTERRUPT_INIT) &&
             !(env->hflags & HF_SMM_MASK)) {
-            cpu->exit_request = 1;
+            qatomic_set(&cpu->exit_request, 1);
         }
         if (cpu_test_interrupt(cpu, CPU_INTERRUPT_TPR)) {
-            cpu->exit_request = 1;
+            qatomic_set(&cpu->exit_request, 1);
         }
     }
 
@@ -5503,6 +5500,8 @@ void kvm_arch_pre_run(CPUState *cpu, struct kvm_run *run)
             (env->eflags & IF_MASK)) {
             int irq;
 
+            bql_lock();
+
             cpu->interrupt_request &= ~CPU_INTERRUPT_HARD;
             irq = cpu_get_pic_interrupt(env);
             if (irq >= 0) {
@@ -5517,6 +5516,7 @@ void kvm_arch_pre_run(CPUState *cpu, struct kvm_run *run)
                             strerror(-ret));
                 }
             }
+            bql_unlock();
         }
 
         /* If we have an interrupt but the guest is not ready to receive an
@@ -5531,8 +5531,6 @@ void kvm_arch_pre_run(CPUState *cpu, struct kvm_run *run)
 
         DPRINTF("setting tpr\n");
         run->cr8 = cpu_get_apic_tpr(x86_cpu->apic_state);
-
-        bql_unlock();
     }
 }
 
-- 
2.47.1
Re: [PATCH v4 7/8] kvm: i386: irqchip: take BQL only if there is an interrupt
Posted by Philippe Mathieu-Daudé 2 months ago
On 14/8/25 18:05, Igor Mammedov wrote:
>       /* Force the VCPU out of its inner loop to process any INIT requests
>        * or (for userspace APIC, but it is cheap to combine the checks here)
> @@ -5489,10 +5486,10 @@ void kvm_arch_pre_run(CPUState *cpu, struct kvm_run *run)
>       if (cpu_test_interrupt(cpu, CPU_INTERRUPT_INIT | CPU_INTERRUPT_TPR)) {
>           if (cpu_test_interrupt(cpu, CPU_INTERRUPT_INIT) &&
>               !(env->hflags & HF_SMM_MASK)) {
> -            cpu->exit_request = 1;
> +            qatomic_set(&cpu->exit_request, 1);
>           }
>           if (cpu_test_interrupt(cpu, CPU_INTERRUPT_TPR)) {
> -            cpu->exit_request = 1;
> +            qatomic_set(&cpu->exit_request, 1);
>           }
>       }

Interesting. IMHO to avoid future similar problems, we should declare
CPUState::exit_request a "private" field and access it via a helper,
where atomicity is enforced.
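
A rough sketch of what such a helper could look like is given below. It is
written with plain C11 atomics rather than QEMU's qatomic_*() macros, and
the VCpu type and both function names are hypothetical, chosen only for
illustration. The memory orderings are likewise an assumption that would
need review against the real callers.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical miniature of a CPU state struct; this is not QEMU's CPUState. */
typedef struct VCpu {
    /* "Private": only the helpers below are supposed to touch this field. */
    atomic_int exit_request_;
} VCpu;

/* Any thread may ask the vCPU to leave its run loop. */
static inline void vcpu_request_exit(VCpu *cpu)
{
    atomic_store_explicit(&cpu->exit_request_, 1, memory_order_relaxed);
}

/*
 * The vCPU thread consumes the request: read and clear in one atomic step,
 * so a request raised concurrently is never lost.
 */
static inline bool vcpu_take_exit_request(VCpu *cpu)
{
    return atomic_exchange(&cpu->exit_request_, 0) != 0;
}
```

With the field hidden behind helpers like these, a plain
`cs->exit_request = 1;` store would stand out in review.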

The only non-accelerator use is in PPC sPAPR:

hw/ppc/spapr_hcall.c:512:        cs->exit_request = 1;
hw/ppc/spapr_hcall.c:534:    cs->exit_request = 1;
hw/ppc/spapr_hcall.c:627:    cs->exit_request = 1;

FYI, last week we noticed a similar problem with CPUState::thread_kicked
when using HVF, and in a few days I'll post a series containing this
change:

-- >8 --
diff --git a/system/cpus.c b/system/cpus.c
index 26fb3bd69c3..39daf85bae7 100644
--- a/system/cpus.c
+++ b/system/cpus.c
@@ -464,10 +464,10 @@ void qemu_wait_io_event(CPUState *cpu)

  void cpus_kick_thread(CPUState *cpu)
  {
-    if (cpu->thread_kicked) {
+    if (qatomic_read(&cpu->thread_kicked)) {
          return;
      }
-    cpu->thread_kicked = true;
+    qatomic_set(&cpu->thread_kicked, true);

---

It only affects HVF because HVF is the only accelerator that accesses
thread_kicked from its own code:

$ git grep -w thread_kicked
include/hw/core/cpu.h:484:    bool thread_kicked;
system/cpus.c:449:    qatomic_set_mb(&cpu->thread_kicked, false);
system/cpus.c:476:    if (cpu->thread_kicked) {
system/cpus.c:479:    cpu->thread_kicked = true;
target/arm/hvf/hvf.c:1825:    qatomic_set_mb(&cpu->thread_kicked, false);

(Call introduced in commit 219c101fa7f "arm/hvf: Add a WFI handler").
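
The flag handling in the diff above can be modelled in isolation roughly as
follows. This is a sketch under assumptions, not the QEMU code: KickState,
kick_thread(), kick_handled() and kicks_sent() are hypothetical names, C11
atomics stand in for qatomic_read()/qatomic_set()/qatomic_set_mb(), and a
counter stands in for the actual kick.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of the kicked flag; not QEMU's CPUState. */
typedef struct KickState {
    atomic_bool thread_kicked;
    atomic_uint kicks;          /* stand-in for actually signalling the thread */
} KickState;

/* Called from any thread that wants the vCPU thread to wake up. */
void kick_thread(KickState *ks)
{
    /* A kick is already in flight, so there is nothing more to do. */
    if (atomic_load_explicit(&ks->thread_kicked, memory_order_relaxed)) {
        return;
    }
    /*
     * Check and set are two separate atomic accesses, so two racing callers
     * may both get here and both kick; this sketch treats that as harmless.
     */
    atomic_store_explicit(&ks->thread_kicked, true, memory_order_relaxed);
    atomic_fetch_add(&ks->kicks, 1);
}

/* Called by the vCPU thread once it has noticed the kick. */
void kick_handled(KickState *ks)
{
    /* Sequentially consistent store, approximating a store plus full barrier. */
    atomic_store(&ks->thread_kicked, false);
}

/* Number of kicks issued so far. */
unsigned kicks_sent(KickState *ks)
{
    return atomic_load(&ks->kicks);
}
```

Two back-to-back kick_thread() calls issue a single kick; a further kick
goes out only after kick_handled() has cleared the flag.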
Re: [PATCH v4 7/8] kvm: i386: irqchip: take BQL only if there is an interrupt
Posted by Igor Mammedov 2 months ago
On Mon, 25 Aug 2025 12:46:07 +0200
Philippe Mathieu-Daudé <philmd@linaro.org> wrote:

> On 14/8/25 18:05, Igor Mammedov wrote:
> >       /* Force the VCPU out of its inner loop to process any INIT requests
> >        * or (for userspace APIC, but it is cheap to combine the checks here)
> > @@ -5489,10 +5486,10 @@ void kvm_arch_pre_run(CPUState *cpu, struct kvm_run *run)
> >       if (cpu_test_interrupt(cpu, CPU_INTERRUPT_INIT | CPU_INTERRUPT_TPR)) {
> >           if (cpu_test_interrupt(cpu, CPU_INTERRUPT_INIT) &&
> >               !(env->hflags & HF_SMM_MASK)) {
> > -            cpu->exit_request = 1;
> > +            qatomic_set(&cpu->exit_request, 1);
> >           }
> >           if (cpu_test_interrupt(cpu, CPU_INTERRUPT_TPR)) {
> > -            cpu->exit_request = 1;
> > +            qatomic_set(&cpu->exit_request, 1);
> >           }
> >       }  
> 
> Interesting. IMHO to avoid future similar problems, we should declare
> CPUState::exit_request a "private" field and access it via a helper,
> where atomicity is enforced.

I did only the bare minimum here, while
Paolo took over sanitizing/fixing the exit_request part; see
https://patchew.org/QEMU/20250808185905.62776-1-pbonzini@redhat.com/
so this suggestion fits better there.
 