[PATCH v2 6/6] kvm: i386: irqchip: take BQL only if there is an interrupt

Igor Mammedov posted 6 patches 3 months, 2 weeks ago
Maintainers: "Michael S. Tsirkin" <mst@redhat.com>, Igor Mammedov <imammedo@redhat.com>, Ani Sinha <anisinha@redhat.com>, Paolo Bonzini <pbonzini@redhat.com>, Peter Xu <peterx@redhat.com>, David Hildenbrand <david@redhat.com>, "Philippe Mathieu-Daudé" <philmd@linaro.org>, Marcelo Tosatti <mtosatti@redhat.com>
There is a newer version of this series
Posted by Igor Mammedov 3 months, 2 weeks ago
When kernel-irqchip=split is used, QEMU still hits BQL
contention when reading the ACPI PM/HPET timers
(despite the timer[s] access itself being lock-less).

As a result, Windows with more than 255 CPUs is still unable
to boot (since it requires an IOMMU -> split irqchip).

The problematic path is kvm_arch_pre_run(), where the BQL is
taken unconditionally when the split irqchip is in use.

There are a few parts that the BQL protects there:
  1. interrupt check and injection

    However, we do not take the BQL when checking for a pending
    interrupt (even within the same function), so the patch
    takes the same approach for the cpu->interrupt_request checks
    and takes the BQL only if there is a job to do.

  2. request_interrupt_window access
      CPUState::kvm_run::request_interrupt_window doesn't need the
      BQL, as on the QEMU side it is accessed only by its own vCPU
      thread.  The only thing the BQL provides there is an implicit
      barrier, which can be replaced with a cheaper explicit barrier.

  3. cr8/cpu_get_apic_tpr access
      The same (as #2) applies to the CPUState::kvm_run::cr8 write,
      and the APIC registers are also cached/synced (get/put) within
      the vCPU thread they belong to.

Taking the BQL only when necessary eliminates the BQL bottleneck
on the IO/MMIO-only exit path, improving latency by 80% on an
HPET micro-benchmark.

This lets Windows boot successfully (when hv-time isn't used)
with more than 255 vCPUs.

Signed-off-by: Igor Mammedov <imammedo@redhat.com>
---
 target/i386/kvm/kvm.c | 58 +++++++++++++++++++++++++++----------------
 1 file changed, 37 insertions(+), 21 deletions(-)

diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 369626f8c8..32024d50f5 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -5450,6 +5450,7 @@ void kvm_arch_pre_run(CPUState *cpu, struct kvm_run *run)
 {
     X86CPU *x86_cpu = X86_CPU(cpu);
     CPUX86State *env = &x86_cpu->env;
+    bool release_bql = 0;
     int ret;
 
     /* Inject NMI */
@@ -5478,15 +5479,16 @@ void kvm_arch_pre_run(CPUState *cpu, struct kvm_run *run)
         }
     }
 
-    if (!kvm_pic_in_kernel()) {
-        bql_lock();
-    }
 
     /* Force the VCPU out of its inner loop to process any INIT requests
      * or (for userspace APIC, but it is cheap to combine the checks here)
      * pending TPR access reports.
      */
     if (cpu->interrupt_request & (CPU_INTERRUPT_INIT | CPU_INTERRUPT_TPR)) {
+        if (!kvm_pic_in_kernel()) {
+            bql_lock();
+            release_bql = true;
+        }
         if ((cpu->interrupt_request & CPU_INTERRUPT_INIT) &&
             !(env->hflags & HF_SMM_MASK)) {
             cpu->exit_request = 1;
@@ -5497,24 +5499,31 @@ void kvm_arch_pre_run(CPUState *cpu, struct kvm_run *run)
     }
 
     if (!kvm_pic_in_kernel()) {
-        /* Try to inject an interrupt if the guest can accept it */
-        if (run->ready_for_interrupt_injection &&
-            (cpu->interrupt_request & CPU_INTERRUPT_HARD) &&
-            (env->eflags & IF_MASK)) {
-            int irq;
-
-            cpu->interrupt_request &= ~CPU_INTERRUPT_HARD;
-            irq = cpu_get_pic_interrupt(env);
-            if (irq >= 0) {
-                struct kvm_interrupt intr;
-
-                intr.irq = irq;
-                DPRINTF("injected interrupt %d\n", irq);
-                ret = kvm_vcpu_ioctl(cpu, KVM_INTERRUPT, &intr);
-                if (ret < 0) {
-                    fprintf(stderr,
-                            "KVM: injection failed, interrupt lost (%s)\n",
-                            strerror(-ret));
+        if (cpu->interrupt_request & CPU_INTERRUPT_HARD) {
+            if (!release_bql) {
+                bql_lock();
+                release_bql = true;
+            }
+
+            /* Try to inject an interrupt if the guest can accept it */
+            if (run->ready_for_interrupt_injection &&
+                (cpu->interrupt_request & CPU_INTERRUPT_HARD) &&
+                (env->eflags & IF_MASK)) {
+                int irq;
+
+                cpu->interrupt_request &= ~CPU_INTERRUPT_HARD;
+                irq = cpu_get_pic_interrupt(env);
+                if (irq >= 0) {
+                    struct kvm_interrupt intr;
+
+                    intr.irq = irq;
+                    DPRINTF("injected interrupt %d\n", irq);
+                    ret = kvm_vcpu_ioctl(cpu, KVM_INTERRUPT, &intr);
+                    if (ret < 0) {
+                        fprintf(stderr,
+                                "KVM: injection failed, interrupt lost (%s)\n",
+                                strerror(-ret));
+                    }
                 }
             }
         }
@@ -5531,7 +5540,14 @@ void kvm_arch_pre_run(CPUState *cpu, struct kvm_run *run)
 
         DPRINTF("setting tpr\n");
         run->cr8 = cpu_get_apic_tpr(x86_cpu->apic_state);
+        /*
+         * make sure that request_interrupt_window/cr8 are set
+         * before KVM_RUN might read them
+         */
+        smp_mb();
+    }
 
+    if (release_bql) {
         bql_unlock();
     }
 }
-- 
2.47.1
Re: [PATCH v2 6/6] kvm: i386: irqchip: take BQL only if there is an interrupt
Posted by Paolo Bonzini 3 months, 2 weeks ago
The patch is not wrong but complicates things more than it should.

Also, as we do more of these tricks it may be worth adding wrapper APIs 
for interrupt_request access, but that needs to be done tree-wide so you 
can do it separately.

On 7/30/25 14:39, Igor Mammedov wrote:
>      if (cpu->interrupt_request & (CPU_INTERRUPT_INIT | CPU_INTERRUPT_TPR)) {
> +        if (!kvm_pic_in_kernel()) {
> +            bql_lock();
> +            release_bql = true;
> +        }

This bql_lock() is not needed, all the writes in the "if" are local to 
the current CPU.

When the outer bql_lock() was added, cpu_interrupt() was not thread-safe 
at all, and taking the lock was needed in order to read 
cpu->interrupt_request.  But now it is ok to read outside the lock, 
which you can use to simplify this patch a lot.

>          if ((cpu->interrupt_request & CPU_INTERRUPT_INIT) &&
>              !(env->hflags & HF_SMM_MASK)) {
>              cpu->exit_request = 1;

A patch that changes all these accesses to 
qatomic_set(&cpu->exit_request, 1), tree-wide, would be nice.

> +        if (cpu->interrupt_request & CPU_INTERRUPT_HARD) {

This should be qatomic_read(&cpu->interrupt_request).  Not a blocker for 
now, but this is where I would suggest adding a wrapper like 
cpu_test_interrupt(cpu, CPU_INTERRUPT_HARD).

> +            if (!release_bql) {
> +                bql_lock();
> +                release_bql = true;
> +            }

With the above simplification, this can be done unconditionally.

> +            /* Try to inject an interrupt if the guest can accept it */
> +            if (run->ready_for_interrupt_injection &&
> +                (cpu->interrupt_request & CPU_INTERRUPT_HARD) &&
> +                (env->eflags & IF_MASK)) {
> +                int irq;
> +
> +                cpu->interrupt_request &= ~CPU_INTERRUPT_HARD;

Reads and writes to cpu->interrupt_request still take the BQL, which is 
consistent with include/hw/core/cpu.h, so yeah here the bql_lock() is 
needed.

As above, writing it is a data race with readers outside the BQL, so 
qatomic_read()/qatomic_set() would be needed to respect the C standard. 
Even better would be to add a function cpu_reset_interrupt_locked() that 
does

    assert(bql_locked());
    qatomic_set(&cpu->interrupt_request, cpu->interrupt_request & ~mask);

But neither of these wrappers (which should be applied tree-wide) are an 
absolute necessity for this series.

> @@ -5531,7 +5540,14 @@ void kvm_arch_pre_run(CPUState *cpu, struct kvm_run *run)
>   
>           DPRINTF("setting tpr\n");
>           run->cr8 = cpu_get_apic_tpr(x86_cpu->apic_state);
> +        /*
> +         * make sure that request_interrupt_window/cr8 are set
> +         * before KVM_RUN might read them
> +         */
> +        smp_mb();

This is not needed, ->cr8 is only read for the same CPU (in 
kvm_arch_vcpu_ioctl_run).

> +    }
>   
> +    if (release_bql) {
>           bql_unlock();
>       }

And since release_bql is not needed anymore, the bql_unlock() can be 
left where it was.

Paolo

>   }
Re: [PATCH v2 6/6] kvm: i386: irqchip: take BQL only if there is an interrupt
Posted by Igor Mammedov 3 months, 2 weeks ago
On Fri, 1 Aug 2025 12:26:26 +0200
Paolo Bonzini <pbonzini@redhat.com> wrote:

> The patch is not wrong but complicates things more than it should.
> 
> Also, as we do more of these tricks it may be worth adding wrapper APIs 
> for interrupt_request access, but that needs to be done tree-wide so you 
> can do it separately.

Thanks,
I'll respin this with minimal changes for this series
and post another one on top with tree wide refactoring as suggested

> 
> On 7/30/25 14:39, Igor Mammedov wrote:
> >      if (cpu->interrupt_request & (CPU_INTERRUPT_INIT | CPU_INTERRUPT_TPR)) {
> > +        if (!kvm_pic_in_kernel()) {
> > +            bql_lock();
> > +            release_bql = true;
> > +        }  
> 
> This bql_lock() is not needed, all the writes in the "if" are local to 
> the current CPU.
> 
> When the outer bql_lock() was added, cpu_interrupt() was not thread-safe 
> at all, and taking the lock was needed in order to read 
> cpu->interrupt_request.  But now it is ok to read outside the lock, 
> which you can use to simplify this patch a lot.
> 
> >          if ((cpu->interrupt_request & CPU_INTERRUPT_INIT) &&
> >              !(env->hflags & HF_SMM_MASK)) {
> >              cpu->exit_request = 1;  
> 
> A patch that changes all these accesses to 
> qatomic_set(&cpu->exit_request, 1), tree-wide, would be nice.
> 
> > +        if (cpu->interrupt_request & CPU_INTERRUPT_HARD) {  
> 
> This should be qatomic_read(&cpu->interrupt_request).  Not a blocker for 
> now, but this is where I would suggest adding a wrapper like 
> cpu_test_interrupt(cpu, CPU_INTERRUPT_HARD).
> 
> > +            if (!release_bql) {
> > +                bql_lock();
> > +                release_bql = true;
> > +            }  
> 
> With the above simplification, this can be done unconditionally.
> 
> > +            /* Try to inject an interrupt if the guest can accept it */
> > +            if (run->ready_for_interrupt_injection &&
> > +                (cpu->interrupt_request & CPU_INTERRUPT_HARD) &&
> > +                (env->eflags & IF_MASK)) {
> > +                int irq;
> > +
> > +                cpu->interrupt_request &= ~CPU_INTERRUPT_HARD;  
> 
> Reads and writes to cpu->interrupt_request still take the BQL, which is 
> consistent with include/hw/core/cpu.h, so yeah here the bql_lock() is 
> needed.
> 
> Like above, writing it's a data race with readers outside the BQL, so 
> qatomic_read()/qatomic_set() would be needed to respect the C standard. 
> Even better could be to add a function cpu_reset_interrupt_locked() that 
> does
> 
>     assert(bql_locked());
>     qatomic_set(&cpu->interrupt_request, cpu->interrupt_request & ~mask);
> 
> But neither of these wrappers (which should be applied tree-wide) are an 
> absolute necessity for this series.
> 
> > @@ -5531,7 +5540,14 @@ void kvm_arch_pre_run(CPUState *cpu, struct kvm_run *run)
> >   
> >           DPRINTF("setting tpr\n");
> >           run->cr8 = cpu_get_apic_tpr(x86_cpu->apic_state);
> > +        /*
> > +         * make sure that request_interrupt_window/cr8 are set
> > +         * before KVM_RUN might read them
> > +         */
> > +        smp_mb();  
> 
> This is not needed, ->cr8 is only read for the same CPU (in 
> kvm_arch_vcpu_ioctl_run).
> 
> > +    }
> >   
> > +    if (release_bql) {
> >           bql_unlock();
> >       }  
> 
> And since release_bql is not needed anymore, the bql_unlock() can be 
> left where it was.
> 
> Paolo
> 
> >   }  
>
Re: [PATCH v2 6/6] kvm: i386: irqchip: take BQL only if there is an interrupt
Posted by Peter Xu 3 months, 2 weeks ago
On Wed, Jul 30, 2025 at 02:39:34PM +0200, Igor Mammedov wrote:
> When kernel-irqchip=split is used, QEMU still hits BQL
> contention when reading the ACPI PM/HPET timers
> (despite the timer[s] access itself being lock-less).
> 
> As a result, Windows with more than 255 CPUs is still unable
> to boot (since it requires an IOMMU -> split irqchip).
> 
> The problematic path is kvm_arch_pre_run(), where the BQL is
> taken unconditionally when the split irqchip is in use.
> 
> There are a few parts that the BQL protects there:
>   1. interrupt check and injection
> 
>     However, we do not take the BQL when checking for a pending
>     interrupt (even within the same function), so the patch
>     takes the same approach for the cpu->interrupt_request checks
>     and takes the BQL only if there is a job to do.
> 
>   2. request_interrupt_window access
>       CPUState::kvm_run::request_interrupt_window doesn't need the
>       BQL, as on the QEMU side it is accessed only by its own vCPU
>       thread.  The only thing the BQL provides there is an implicit
>       barrier, which can be replaced with a cheaper explicit barrier.
> 
>   3. cr8/cpu_get_apic_tpr access
>       The same (as #2) applies to the CPUState::kvm_run::cr8 write,
>       and the APIC registers are also cached/synced (get/put) within
>       the vCPU thread they belong to.
> 
> Taking the BQL only when necessary eliminates the BQL bottleneck
> on the IO/MMIO-only exit path, improving latency by 80% on an
> HPET micro-benchmark.
> 
> This lets Windows boot successfully (when hv-time isn't used)
> with more than 255 vCPUs.

Not familiar with this path, but the change looks reasonable, a few pure
questions inline.

> 
> Signed-off-by: Igor Mammedov <imammedo@redhat.com>
> ---
>  target/i386/kvm/kvm.c | 58 +++++++++++++++++++++++++++----------------
>  1 file changed, 37 insertions(+), 21 deletions(-)
> 
> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
> index 369626f8c8..32024d50f5 100644
> --- a/target/i386/kvm/kvm.c
> +++ b/target/i386/kvm/kvm.c
> @@ -5450,6 +5450,7 @@ void kvm_arch_pre_run(CPUState *cpu, struct kvm_run *run)
>  {
>      X86CPU *x86_cpu = X86_CPU(cpu);
>      CPUX86State *env = &x86_cpu->env;
> +    bool release_bql = 0;
>      int ret;
>  
>      /* Inject NMI */
> @@ -5478,15 +5479,16 @@ void kvm_arch_pre_run(CPUState *cpu, struct kvm_run *run)
>          }
>      }
>  
> -    if (!kvm_pic_in_kernel()) {
> -        bql_lock();
> -    }
>  
>      /* Force the VCPU out of its inner loop to process any INIT requests
>       * or (for userspace APIC, but it is cheap to combine the checks here)
>       * pending TPR access reports.
>       */
>      if (cpu->interrupt_request & (CPU_INTERRUPT_INIT | CPU_INTERRUPT_TPR)) {
> +        if (!kvm_pic_in_kernel()) {
> +            bql_lock();
> +            release_bql = true;
> +        }

Does updating exit_request need the BQL at all?  The pattern I see is this:

        kvm_arch_pre_run(cpu, run);
        if (qatomic_read(&cpu->exit_request)) {
            trace_kvm_interrupt_exit_request();
            /*
             * KVM requires us to reenter the kernel after IO exits to complete
             * instruction emulation. This self-signal will ensure that we
             * leave ASAP again.
             */
            kvm_cpu_kick_self();
        }

So the exit_request=1 set here will likely be read very soon afterwards; in
that case it seems the lock isn't needed.

>          if ((cpu->interrupt_request & CPU_INTERRUPT_INIT) &&
>              !(env->hflags & HF_SMM_MASK)) {
>              cpu->exit_request = 1;
> @@ -5497,24 +5499,31 @@ void kvm_arch_pre_run(CPUState *cpu, struct kvm_run *run)
>      }
>  
>      if (!kvm_pic_in_kernel()) {
> -        /* Try to inject an interrupt if the guest can accept it */
> -        if (run->ready_for_interrupt_injection &&
> -            (cpu->interrupt_request & CPU_INTERRUPT_HARD) &&
> -            (env->eflags & IF_MASK)) {
> -            int irq;
> -
> -            cpu->interrupt_request &= ~CPU_INTERRUPT_HARD;
> -            irq = cpu_get_pic_interrupt(env);
> -            if (irq >= 0) {
> -                struct kvm_interrupt intr;
> -
> -                intr.irq = irq;
> -                DPRINTF("injected interrupt %d\n", irq);
> -                ret = kvm_vcpu_ioctl(cpu, KVM_INTERRUPT, &intr);
> -                if (ret < 0) {
> -                    fprintf(stderr,
> -                            "KVM: injection failed, interrupt lost (%s)\n",
> -                            strerror(-ret));
> +        if (cpu->interrupt_request & CPU_INTERRUPT_HARD) {
> +            if (!release_bql) {
> +                bql_lock();
> +                release_bql = true;
> +            }
> +
> +            /* Try to inject an interrupt if the guest can accept it */
> +            if (run->ready_for_interrupt_injection &&
> +                (cpu->interrupt_request & CPU_INTERRUPT_HARD) &&
> +                (env->eflags & IF_MASK)) {
> +                int irq;
> +
> +                cpu->interrupt_request &= ~CPU_INTERRUPT_HARD;
> +                irq = cpu_get_pic_interrupt(env);
> +                if (irq >= 0) {
> +                    struct kvm_interrupt intr;
> +
> +                    intr.irq = irq;
> +                    DPRINTF("injected interrupt %d\n", irq);
> +                    ret = kvm_vcpu_ioctl(cpu, KVM_INTERRUPT, &intr);
> +                    if (ret < 0) {
> +                        fprintf(stderr,
> +                                "KVM: injection failed, interrupt lost (%s)\n",
> +                                strerror(-ret));
> +                    }
>                  }
>              }
>          }
> @@ -5531,7 +5540,14 @@ void kvm_arch_pre_run(CPUState *cpu, struct kvm_run *run)
>  
>          DPRINTF("setting tpr\n");
>          run->cr8 = cpu_get_apic_tpr(x86_cpu->apic_state);
> +        /*
> +         * make sure that request_interrupt_window/cr8 are set
> +         * before KVM_RUN might read them
> +         */
> +        smp_mb();

Is this mb() needed if KVM_RUN will always happen in the same thread anyway?

Thanks,

> +    }
>  
> +    if (release_bql) {
>          bql_unlock();
>      }
>  }
> -- 
> 2.47.1
> 

-- 
Peter Xu
Re: [PATCH v2 6/6] kvm: i386: irqchip: take BQL only if there is an interrupt
Posted by Igor Mammedov 3 months, 2 weeks ago
On Thu, 31 Jul 2025 15:24:59 -0400
Peter Xu <peterx@redhat.com> wrote:

> On Wed, Jul 30, 2025 at 02:39:34PM +0200, Igor Mammedov wrote:
> > When kernel-irqchip=split is used, QEMU still hits BQL
> > contention when reading the ACPI PM/HPET timers
> > (despite the timer[s] access itself being lock-less).
> > 
> > As a result, Windows with more than 255 CPUs is still unable
> > to boot (since it requires an IOMMU -> split irqchip).
> > 
> > The problematic path is kvm_arch_pre_run(), where the BQL is
> > taken unconditionally when the split irqchip is in use.
> > 
> > There are a few parts that the BQL protects there:
> >   1. interrupt check and injection
> > 
> >     However, we do not take the BQL when checking for a pending
> >     interrupt (even within the same function), so the patch
> >     takes the same approach for the cpu->interrupt_request checks
> >     and takes the BQL only if there is a job to do.
> > 
> >   2. request_interrupt_window access
> >       CPUState::kvm_run::request_interrupt_window doesn't need the
> >       BQL, as on the QEMU side it is accessed only by its own vCPU
> >       thread.  The only thing the BQL provides there is an implicit
> >       barrier, which can be replaced with a cheaper explicit barrier.
> > 
> >   3. cr8/cpu_get_apic_tpr access
> >       The same (as #2) applies to the CPUState::kvm_run::cr8 write,
> >       and the APIC registers are also cached/synced (get/put) within
> >       the vCPU thread they belong to.
> > 
> > Taking the BQL only when necessary eliminates the BQL bottleneck
> > on the IO/MMIO-only exit path, improving latency by 80% on an
> > HPET micro-benchmark.
> > 
> > This lets Windows boot successfully (when hv-time isn't used)
> > with more than 255 vCPUs.
> 
> Not familiar with this path, but the change looks reasonable, a few pure
> questions inline.

perhaps Paolo can give an expert opinion.

> 
> > 
> > Signed-off-by: Igor Mammedov <imammedo@redhat.com>
> > ---
> >  target/i386/kvm/kvm.c | 58 +++++++++++++++++++++++++++----------------
> >  1 file changed, 37 insertions(+), 21 deletions(-)
> > 
> > diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
> > index 369626f8c8..32024d50f5 100644
> > --- a/target/i386/kvm/kvm.c
> > +++ b/target/i386/kvm/kvm.c
> > @@ -5450,6 +5450,7 @@ void kvm_arch_pre_run(CPUState *cpu, struct kvm_run *run)
> >  {
> >      X86CPU *x86_cpu = X86_CPU(cpu);
> >      CPUX86State *env = &x86_cpu->env;
> > +    bool release_bql = 0;
> >      int ret;
> >  
> >      /* Inject NMI */
> > @@ -5478,15 +5479,16 @@ void kvm_arch_pre_run(CPUState *cpu, struct kvm_run *run)
> >          }
> >      }
> >  
> > -    if (!kvm_pic_in_kernel()) {
> > -        bql_lock();
> > -    }
> >  
> >      /* Force the VCPU out of its inner loop to process any INIT requests
> >       * or (for userspace APIC, but it is cheap to combine the checks here)
> >       * pending TPR access reports.
> >       */
> >      if (cpu->interrupt_request & (CPU_INTERRUPT_INIT | CPU_INTERRUPT_TPR)) {
> > +        if (!kvm_pic_in_kernel()) {
> > +            bql_lock();
> > +            release_bql = true;
> > +        }  
> 
> Does updating exit_request need bql at all?  I saw the pattern is this:
> 
>         kvm_arch_pre_run(cpu, run);
>         if (qatomic_read(&cpu->exit_request)) {
>             trace_kvm_interrupt_exit_request();
>             /*
>              * KVM requires us to reenter the kernel after IO exits to complete
>              * instruction emulation. This self-signal will ensure that we
>              * leave ASAP again.
>              */
>             kvm_cpu_kick_self();
>         }
> 
> So setting exit_request=1 here will likely be read very soon later, in this
> case it seems the lock isn't needed.

Given I'm not familiar with the code, I'm changing the locking logic only in
places where I have to, and preserving the current behavior elsewhere.

Looking at the code, it seems we always hold the BQL when setting
exit_request.

> 
> >          if ((cpu->interrupt_request & CPU_INTERRUPT_INIT) &&
> >              !(env->hflags & HF_SMM_MASK)) {
> >              cpu->exit_request = 1;
> > @@ -5497,24 +5499,31 @@ void kvm_arch_pre_run(CPUState *cpu, struct kvm_run *run)
> >      }
> >  
> >      if (!kvm_pic_in_kernel()) {
> > -        /* Try to inject an interrupt if the guest can accept it */
> > -        if (run->ready_for_interrupt_injection &&
> > -            (cpu->interrupt_request & CPU_INTERRUPT_HARD) &&
> > -            (env->eflags & IF_MASK)) {
> > -            int irq;
> > -
> > -            cpu->interrupt_request &= ~CPU_INTERRUPT_HARD;
> > -            irq = cpu_get_pic_interrupt(env);
> > -            if (irq >= 0) {
> > -                struct kvm_interrupt intr;
> > -
> > -                intr.irq = irq;
> > -                DPRINTF("injected interrupt %d\n", irq);
> > -                ret = kvm_vcpu_ioctl(cpu, KVM_INTERRUPT, &intr);
> > -                if (ret < 0) {
> > -                    fprintf(stderr,
> > -                            "KVM: injection failed, interrupt lost (%s)\n",
> > -                            strerror(-ret));
> > +        if (cpu->interrupt_request & CPU_INTERRUPT_HARD) {
> > +            if (!release_bql) {
> > +                bql_lock();
> > +                release_bql = true;
> > +            }
> > +
> > +            /* Try to inject an interrupt if the guest can accept it */
> > +            if (run->ready_for_interrupt_injection &&
> > +                (cpu->interrupt_request & CPU_INTERRUPT_HARD) &&
> > +                (env->eflags & IF_MASK)) {
> > +                int irq;
> > +
> > +                cpu->interrupt_request &= ~CPU_INTERRUPT_HARD;
> > +                irq = cpu_get_pic_interrupt(env);
> > +                if (irq >= 0) {
> > +                    struct kvm_interrupt intr;
> > +
> > +                    intr.irq = irq;
> > +                    DPRINTF("injected interrupt %d\n", irq);
> > +                    ret = kvm_vcpu_ioctl(cpu, KVM_INTERRUPT, &intr);
> > +                    if (ret < 0) {
> > +                        fprintf(stderr,
> > +                                "KVM: injection failed, interrupt lost (%s)\n",
> > +                                strerror(-ret));
> > +                    }
> >                  }
> >              }
> >          }
> > @@ -5531,7 +5540,14 @@ void kvm_arch_pre_run(CPUState *cpu, struct kvm_run *run)
> >  
> >          DPRINTF("setting tpr\n");
> >          run->cr8 = cpu_get_apic_tpr(x86_cpu->apic_state);
> > +        /*
> > +         * make sure that request_interrupt_window/cr8 are set
> > +         * before KVM_RUN might read them
> > +         */
> > +        smp_mb();  
> 
> Is this mb() needed if KVM_RUN will always happen in the same thread anyway?


It matches a similar pattern:

        /* Read cpu->exit_request before KVM_RUN reads run->immediate_exit.      
         * Matching barrier in kvm_eat_signals.                                  
         */                                                                                                     
        smp_rmb();                                                               
                                                                                 
        run_ret = kvm_vcpu_ioctl(cpu, KVM_RUN, 0); 

To be on the safe side, this preserves the barrier that the BQL provided before.
I can drop it if it's not really needed.

> 
> Thanks,
> 
> > +    }
> >  
> > +    if (release_bql) {
> >          bql_unlock();
> >      }
> >  }
> > -- 
> > 2.47.1
> >   
>
Re: [PATCH v2 6/6] kvm: i386: irqchip: take BQL only if there is an interrupt
Posted by Paolo Bonzini 3 months, 2 weeks ago
On 8/1/25 10:42, Igor Mammedov wrote:
> looking at the code, it seems we are always hold BQL when setting
> exit_request.

While the BQL is taken, it is for other reasons (e.g. because of 
cpu->halt_cond).

In this case it's set and read from within the same thread so it's okay.

> it matches with similar pattern:
> 
>          /* Read cpu->exit_request before KVM_RUN reads run->immediate_exit.
>           * Matching barrier in kvm_eat_signals.
>           */
>          smp_rmb();
>                                                                                   
>          run_ret = kvm_vcpu_ioctl(cpu, KVM_RUN, 0);
> 
> to be on the safe side, this is preserving barrier that BQL has provided before
> I can drop it if it's not really needed.

That comment is wrong...  The actual pairing here is with cpu_exit(), 
though the logic of cpu_exit() is messed up and only fully works for 
TCG, and immediate_exit does not matter at all.  I'll clean it up and 
write a comment.

A correct ordering would be:

(a) store other flags that will be checked if cpu->exit_request is 1

(b) cpu_exit(): store-release cpu->exit_request

(c) cpu_interrupt(): store-release cpu->interrupt_request

- broadcast cpu->halt_cond if needed; right now it's always done in 
qemu_cpu_kick()

 >>> now you can release the BQL

(d) do the accelerator-specific kick (e.g. write icount_decr for TCG, 
pthread_kill for KVM, etc.)


The other side then does the checks in the opposite direction:

(d) the accelerator's execution loop exits thanks to the kick

(c) then check cpu->interrupt_request - any work that's needed here may 
take the BQL or not, and may set cpu->exit_request

(b) then check cpu->exit_request to see if it should do slow path work

(a) then (under the BQL) it possibly goes to sleep, waiting on 
cpu->halt_cond.

cpu->exit_request and cpu->interrupt_request are not a 
load-acquire/store-release pair right now, but they should be.  Probably 
everything is protected one way or the other by the BQL, but it's not clear.

I'll handle cpu->exit_request and leave cpu->interrupt_request to you. 
For the sake of this series, please do the following:

- contrary to what I said in my earlier review, do introduce a 
cpu_test_interrupt() now and make it use a load-acquire.  There aren't 
many occurrences:

accel/tcg/cpu-exec.c:    if (unlikely(qatomic_read(&cpu->interrupt_request))) {
target/alpha/cpu.c:    return cs->interrupt_request & (CPU_INTERRUPT_HARD
target/arm/cpu.c:        && cs->interrupt_request &
target/arm/hvf/hvf.c:    if (cpu->interrupt_request & (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ)) {
target/avr/cpu.c:    return (cs->interrupt_request & (CPU_INTERRUPT_HARD | CPU_INTERRUPT_RESET))
target/hppa/cpu.c:    return cs->interrupt_request & (CPU_INTERRUPT_HARD | CPU_INTERRUPT_NMI);
target/i386/kvm/kvm.c:    if (cpu->interrupt_request & (CPU_INTERRUPT_NMI | CPU_INTERRUPT_SMI)) {
target/i386/kvm/kvm.c:    if (cpu->interrupt_request & (CPU_INTERRUPT_INIT | CPU_INTERRUPT_TPR)) {
target/i386/nvmm/nvmm-all.c:    if (cpu->interrupt_request & (CPU_INTERRUPT_INIT | CPU_INTERRUPT_TPR)) {
target/i386/whpx/whpx-all.c:        cpu->interrupt_request & (CPU_INTERRUPT_NMI | CPU_INTERRUPT_SMI)) {
target/i386/whpx/whpx-all.c:    if (cpu->interrupt_request & (CPU_INTERRUPT_INIT | CPU_INTERRUPT_TPR)) {
target/microblaze/cpu.c:    return cs->interrupt_request & (CPU_INTERRUPT_HARD | CPU_INTERRUPT_NMI);
target/openrisc/cpu.c:    return cs->interrupt_request & (CPU_INTERRUPT_HARD |
target/rx/cpu.c:    return cs->interrupt_request &

- in tcg_handle_interrupt and generic_handle_interrupt, change like this

     /* Pairs with load_acquire in cpu_test_interrupt().  */
     qatomic_store_release(&cpu->interrupt_request,
         cpu->interrupt_request | mask);

I'll take care of properly adding the store-release/load-acquire
for exit_request and removing the unnecessary memory barriers in kvm-all.c.

Paolo


>>
>> Thanks,
>>
>>> +    }
>>>   
>>> +    if (release_bql) {
>>>           bql_unlock();
>>>       }
>>>   }
>>> -- 
>>> 2.47.1
>>>    
>>
> 
> 
>