[PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support

Sean Christopherson posted 67 patches 8 months, 2 weeks ago
There is a newer version of this series
arch/x86/include/asm/irq_remapping.h          |  17 +-
arch/x86/include/asm/kvm-x86-ops.h            |   2 +-
arch/x86/include/asm/kvm_host.h               |  20 +-
arch/x86/include/asm/svm.h                    |  13 +-
arch/x86/kvm/svm/avic.c                       | 707 ++++++++----------
arch/x86/kvm/svm/svm.c                        |   6 +
arch/x86/kvm/svm/svm.h                        |  24 +-
arch/x86/kvm/trace.h                          |  19 +-
arch/x86/kvm/vmx/capabilities.h               |   1 -
arch/x86/kvm/vmx/main.c                       |   2 +-
arch/x86/kvm/vmx/posted_intr.c                | 150 ++--
arch/x86/kvm/vmx/posted_intr.h                |  11 +-
arch/x86/kvm/vmx/vmx.c                        |   2 -
arch/x86/kvm/x86.c                            | 124 ++-
drivers/iommu/amd/amd_iommu_types.h           |   1 -
drivers/iommu/amd/init.c                      |   8 +-
drivers/iommu/amd/iommu.c                     | 171 +++--
drivers/iommu/intel/irq_remapping.c           |  10 +-
include/linux/amd-iommu.h                     |  25 +-
include/linux/kvm_host.h                      |   9 +-
include/linux/kvm_irqfd.h                     |   4 +
tools/testing/selftests/kvm/Makefile.kvm      |   2 +
.../selftests/kvm/include/vfio_pci_util.h     | 149 ++++
.../selftests/kvm/include/x86/processor.h     |  21 +
.../testing/selftests/kvm/lib/vfio_pci_util.c | 201 +++++
tools/testing/selftests/kvm/mercury_device.h  | 118 +++
tools/testing/selftests/kvm/vfio_irq_test.c   | 429 +++++++++++
virt/kvm/eventfd.c                            |  22 +-
28 files changed, 1610 insertions(+), 658 deletions(-)
create mode 100644 tools/testing/selftests/kvm/include/vfio_pci_util.h
create mode 100644 tools/testing/selftests/kvm/lib/vfio_pci_util.c
create mode 100644 tools/testing/selftests/kvm/mercury_device.h
create mode 100644 tools/testing/selftests/kvm/vfio_irq_test.c
[PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support
Posted by Sean Christopherson 8 months, 2 weeks ago
TL;DR: Overhaul device posted interrupts in KVM and IOMMU, and AVIC in
       general.  This needs more testing on AMD with device posted IRQs.

This applies on the small series that adds a enable_device_posted_irqs
module param (the prep work for that is also prep work for this):

   https://lore.kernel.org/all/20250401161804.842968-1-seanjc@google.com

Fix a variety of bugs related to device posted IRQs, especially on the
AMD side, and clean up KVM's implementation, which IMO is in the running
for Most Convoluted Code in KVM.

Stating the obvious, this series is comically large.  I'm posting it as a
single series, at least for the first round of reviews, to build the
(mostly) full picture of the end goal (it's not the true end goal; there's
still more cleanups that can be done).  And because properly testing most
of the code would be futile until almost the end of the series (so. many.
bugs.).

Batch #1 (patches 1-10) fixes bugs of varying severity.

Batch #2 is mostly SVM specific:

 - Cleans up various warts and bugs in the IRTE tracking
 - Fixes AVIC to not reject large VMs (honor KVM's ABI)
 - Wire up AVIC to enable_ipiv to support disabling IPI virtualization while
   still utilizing device posted interrupts, and to workaround erratum #1235.

Batch #3 overhauls the guts of IRQ bypass in KVM, and moves the vast majority
of the logic to common x86; only the code that needs to communicate with the
IOMMU is truly vendor specific.

Batch #4 is more SVM/AVIC cleanups that are made possible by batch #3.

Batch #5 adds WARNs and drops dead code after all the previous cleanups and
fixes (I don't want to add the WARNs earlier; I don't any point in adding
WARNs in code that's known to be broken).

Batch #6 is yet more SVM/AVIC cleanups, with the specific goal of configuring
IRTEs to generate GA log interrupts if and only if KVM actually needs a wake
event.

This series is well tested except for one notable gap: I was not able to
fully test the AMD IOMMU changes.  Long story short, getting upstream
kernels into our full test environments is practically infeasible.  And
exposing a device or VF on systems that are available to developers is a
bit of a mess.

The device the selftest (see the last patch) uses is an internel test VF
that's hosted on a smart NIC using non-production (test-only) firmware.
Unfortunately, only some of our developer systems have the right NIC, and
for unknown reasons I couldn't get the test firmware to install cleanly on
Rome systems.  I was able to get it functional on Milan (and Intel CPUs),
but APIC virtualization is disabled on Milan.  Thanks to KVM's force_avic
I could test the KVM flows, but the IOMMU was having none of my attempts
to force enable APIC virtualization against its will.

Through hackery (see the penultimate patch), I was able to gain a decent
amount of confidence in the IOMMU changes (and the interface between KVM
and the IOMMU).

For initial development of the series, I also cobbled together a "mock"
IRQ bypass device, to allow testing in a VM.

  https://github.com/sean-jc/linux.git x86/mock_irqbypass_producer

Note, the diffstat is misleading due to the last two DO NOT MERGE patches
adding 1k+ LoC.  Without those, this series removes ~80 LoC (substantially
more if comments are ignored).

  21 files changed, 577 insertions(+), 655 deletions(-)

Maxim Levitsky (2):
  KVM: SVM: Add enable_ipiv param, never set IsRunning if disabled
  KVM: SVM: Disable (x2)AVIC IPI virtualization if CPU has erratum #1235

Sean Christopherson (65):
  KVM: SVM: Allocate IR data using atomic allocation
  KVM: x86: Reset IRTE to host control if *new* route isn't postable
  KVM: x86: Explicitly treat routing entry type changes as changes
  KVM: x86: Take irqfds.lock when adding/deleting IRQ bypass producer
  iommu/amd: Return an error if vCPU affinity is set for non-vCPU IRTE
  iommu/amd: WARN if KVM attempts to set vCPU affinity without posted
    intrrupts
  KVM: SVM: WARN if an invalid posted interrupt IRTE entry is added
  KVM: x86: Pass new routing entries and irqfd when updating IRTEs
  KVM: SVM: Track per-vCPU IRTEs using kvm_kernel_irqfd structure
  KVM: SVM: Delete IRTE link from previous vCPU before setting new IRTE
  KVM: SVM: Delete IRTE link from previous vCPU irrespective of new
    routing
  KVM: SVM: Drop pointless masking of default APIC base when setting
    V_APIC_BAR
  KVM: SVM: Drop pointless masking of kernel page pa's with AVIC HPA
    masks
  KVM: SVM: Add helper to deduplicate code for getting AVIC backing page
  KVM: SVM: Drop vcpu_svm's pointless avic_backing_page field
  KVM: SVM: Inhibit AVIC if ID is too big instead of rejecting vCPU
    creation
  KVM: SVM: Drop redundant check in AVIC code on ID during vCPU creation
  KVM: SVM: Track AVIC tables as natively sized pointers, not "struct
    pages"
  KVM: SVM: Drop superfluous "cache" of AVIC Physical ID entry pointer
  KVM: VMX: Move enable_ipiv knob to common x86
  KVM: VMX: Suppress PI notifications whenever the vCPU is put
  KVM: SVM: Add a comment to explain why avic_vcpu_blocking() ignores
    IRQ blocking
  iommu/amd: KVM: SVM: Use pi_desc_addr to derive ga_root_ptr
  iommu/amd: KVM: SVM: Delete now-unused cached/previous GA tag fields
  iommu/amd: KVM: SVM: Pass NULL @vcpu_info to indicate "not guest mode"
  KVM: SVM: Get vCPU info for IRTE using new routing entry
  KVM: SVM: Stop walking list of routing table entries when updating
    IRTE
  KVM: VMX: Stop walking list of routing table entries when updating
    IRTE
  KVM: SVM: Extract SVM specific code out of get_pi_vcpu_info()
  KVM: x86: Nullify irqfd->producer after updating IRTEs
  KVM: x86: Dedup AVIC vs. PI code for identifying target vCPU
  KVM: x86: Move posted interrupt tracepoint to common code
  KVM: SVM: Clean up return handling in avic_pi_update_irte()
  iommu: KVM: Split "struct vcpu_data" into separate AMD vs. Intel
    structs
  KVM: Don't WARN if updating IRQ bypass route fails
  KVM: Fold kvm_arch_irqfd_route_changed() into
    kvm_arch_update_irqfd_routing()
  KVM: x86: Track irq_bypass_vcpu in common x86 code
  KVM: x86: Skip IOMMU IRTE updates if there's no old or new vCPU being
    targeted
  KVM: x86: Don't update IRTE entries when old and new routes were !MSI
  KVM: SVM: Revert IRTE to legacy mode if IOMMU doesn't provide IR
    metadata
  KVM: SVM: Take and hold ir_list_lock across IRTE updates in IOMMU
  iommu/amd: KVM: SVM: Infer IsRun from validity of pCPU destination
  iommu/amd: Factor out helper for manipulating IRTE GA/CPU info
  iommu/amd: KVM: SVM: Set pCPU info in IRTE when setting vCPU affinity
  iommu/amd: KVM: SVM: Add IRTE metadata to affined vCPU's list if AVIC
    is inhibited
  KVM: SVM: Don't check for assigned device(s) when updating affinity
  KVM: SVM: Don't check for assigned device(s) when activating AVIC
  KVM: SVM: WARN if (de)activating guest mode in IOMMU fails
  KVM: SVM: Process all IRTEs on affinity change even if one update
    fails
  KVM: SVM: WARN if updating IRTE GA fields in IOMMU fails
  KVM: x86: Drop superfluous "has assigned device" check in
    kvm_pi_update_irte()
  KVM: x86: WARN if IRQ bypass isn't supported in kvm_pi_update_irte()
  KVM: x86: WARN if IRQ bypass routing is updated without in-kernel
    local APIC
  KVM: SVM: WARN if ir_list is non-empty at vCPU free
  KVM: x86: Decouple device assignment from IRQ bypass
  KVM: VMX: WARN if VT-d Posted IRQs aren't possible when starting IRQ
    bypass
  KVM: SVM: Use vcpu_idx, not vcpu_id, for GA log tag/metadata
  iommu/amd: WARN if KVM calls GA IRTE helpers without virtual APIC
    support
  KVM: SVM: Fold avic_set_pi_irte_mode() into its sole caller
  KVM: SVM: Don't check vCPU's blocking status when toggling AVIC on/off
  KVM: SVM: Consolidate IRTE update when toggling AVIC on/off
  iommu/amd: KVM: SVM: Allow KVM to control need for GA log interrupts
  KVM: SVM: Generate GA log IRQs only if the associated vCPUs is
    blocking
  *** DO NOT MERGE *** iommu/amd: Hack to fake IRQ posting support
  *** DO NOT MERGE *** KVM: selftests: WIP posted interrupts test

 arch/x86/include/asm/irq_remapping.h          |  17 +-
 arch/x86/include/asm/kvm-x86-ops.h            |   2 +-
 arch/x86/include/asm/kvm_host.h               |  20 +-
 arch/x86/include/asm/svm.h                    |  13 +-
 arch/x86/kvm/svm/avic.c                       | 707 ++++++++----------
 arch/x86/kvm/svm/svm.c                        |   6 +
 arch/x86/kvm/svm/svm.h                        |  24 +-
 arch/x86/kvm/trace.h                          |  19 +-
 arch/x86/kvm/vmx/capabilities.h               |   1 -
 arch/x86/kvm/vmx/main.c                       |   2 +-
 arch/x86/kvm/vmx/posted_intr.c                | 150 ++--
 arch/x86/kvm/vmx/posted_intr.h                |  11 +-
 arch/x86/kvm/vmx/vmx.c                        |   2 -
 arch/x86/kvm/x86.c                            | 124 ++-
 drivers/iommu/amd/amd_iommu_types.h           |   1 -
 drivers/iommu/amd/init.c                      |   8 +-
 drivers/iommu/amd/iommu.c                     | 171 +++--
 drivers/iommu/intel/irq_remapping.c           |  10 +-
 include/linux/amd-iommu.h                     |  25 +-
 include/linux/kvm_host.h                      |   9 +-
 include/linux/kvm_irqfd.h                     |   4 +
 tools/testing/selftests/kvm/Makefile.kvm      |   2 +
 .../selftests/kvm/include/vfio_pci_util.h     | 149 ++++
 .../selftests/kvm/include/x86/processor.h     |  21 +
 .../testing/selftests/kvm/lib/vfio_pci_util.c | 201 +++++
 tools/testing/selftests/kvm/mercury_device.h  | 118 +++
 tools/testing/selftests/kvm/vfio_irq_test.c   | 429 +++++++++++
 virt/kvm/eventfd.c                            |  22 +-
 28 files changed, 1610 insertions(+), 658 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/include/vfio_pci_util.h
 create mode 100644 tools/testing/selftests/kvm/lib/vfio_pci_util.c
 create mode 100644 tools/testing/selftests/kvm/mercury_device.h
 create mode 100644 tools/testing/selftests/kvm/vfio_irq_test.c


base-commit: 5f9f498ea14ffe15390aa46fb85375e7c901bce3
-- 
2.49.0.504.g3bcea36a83-goog
Re: [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support
Posted by Sairaj Kodilkar 7 months ago
Hi Sean,

We ran few tests with following setup

* Turin system with 2P, 192 cores each (SMT enabled, Total 768)
* 4 NVMEs of size 1.7 attached to a single IOMMU
* Total RAM 247 GiB
* Qemu version : 9.1.93
* Guest kernel : 6.14-rc7
* FIO random reads with 4K blocksize and libai

With above setup we measured the Guest nvme interrupts, IOPS, GALOG interrupts
and GALOG entries for 60 seconds with and without your changes.

Here are the results,

                          VCPUS = 32, Jobs per NVME = 8
==============================================================================================
                             w/o Sean's patches           w/ Sean's patches     Percent change
----------------------------------------------------------------------------------------------
Guest Nvme interrupts               123,922,860                 124,559,110              0.51%
IOPS (in kilo)                            4,795                       4,796              0.04%
GALOG Interrupts                         40,245                         164            -99.59%
GALOG entries                            42,040                         169            -99.60%
----------------------------------------------------------------------------------------------


                VCPUS = 64, Jobs per NVME = 16
==============================================================================================
                             w/o Sean's patches           w/ Sean's patches     Percent change
----------------------------------------------------------------------------------------------
Guest Nvme interrupts               99,483,339                   99,800,056             0.32% 
IOPS (in kilo)                           4,791                        4,798             0.15% 
GALOG Interrupts                        47,599                       11,634           -75.56% 
GALOG entries                           48,899                       11,923           -75.62%
----------------------------------------------------------------------------------------------


                VCPUS = 192, Jobs per NVME = 48
==============================================================================================
                             w/o Sean's patches          w/ Sean's patches      Percent change
----------------------------------------------------------------------------------------------
Guest Nvme interrupts               76,750,310                  78,066,512               1.71%
IOPS (in kilo)                           4,751                       4,749              -0.04%
GALOG Interrupts                        56,621                      54,732              -3.34%
GALOG entries                           59,579                      56,215              -5.65%
----------------------------------------------------------------------------------------------
 

The results show that patches have significant impact on the number of posted
interrupts at lower vCPU count (32 and 64) while providing similar IOPS and
Guest NVME interrupt rate (i.e. patches do not regress).

Along with the performance evaluation, we did sanity tests such with AVIC,
x2AVIC and kernel selftest.  All tests look good.

For AVIC related patches:
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>

Regards
Sairaj Kodilkar
Re: [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support
Posted by Sean Christopherson 7 months ago
On Thu, May 15, 2025, Sairaj Kodilkar wrote:
> Hi Sean,
> 
> We ran few tests with following setup

A few!!?!?  This is awesome!  Thank you, I greatly appreciate the testing!

> * Turin system with 2P, 192 cores each (SMT enabled, Total 768)
> * 4 NVMEs of size 1.7 attached to a single IOMMU
> * Total RAM 247 GiB
> * Qemu version : 9.1.93
> * Guest kernel : 6.14-rc7
> * FIO random reads with 4K blocksize and libai
> 
> With above setup we measured the Guest nvme interrupts, IOPS, GALOG interrupts
> and GALOG entries for 60 seconds with and without your changes.
> 
> Here are the results,
> 
>                           VCPUS = 32, Jobs per NVME = 8
> ==============================================================================================
>                              w/o Sean's patches           w/ Sean's patches     Percent change
> ----------------------------------------------------------------------------------------------
> Guest Nvme interrupts               123,922,860                 124,559,110              0.51%
> IOPS (in kilo)                            4,795                       4,796              0.04%
> GALOG Interrupts                         40,245                         164            -99.59%
> GALOG entries                            42,040                         169            -99.60%
> ----------------------------------------------------------------------------------------------
> 
> 
>                 VCPUS = 64, Jobs per NVME = 16
> ==============================================================================================
>                              w/o Sean's patches           w/ Sean's patches     Percent change
> ----------------------------------------------------------------------------------------------
> Guest Nvme interrupts               99,483,339                   99,800,056             0.32% 
> IOPS (in kilo)                           4,791                        4,798             0.15% 
> GALOG Interrupts                        47,599                       11,634           -75.56% 
> GALOG entries                           48,899                       11,923           -75.62%
> ----------------------------------------------------------------------------------------------
> 
> 
>                 VCPUS = 192, Jobs per NVME = 48
> ==============================================================================================
>                              w/o Sean's patches          w/ Sean's patches      Percent change
> ----------------------------------------------------------------------------------------------
> Guest Nvme interrupts               76,750,310                  78,066,512               1.71%
> IOPS (in kilo)                           4,751                       4,749              -0.04%
> GALOG Interrupts                        56,621                      54,732              -3.34%
> GALOG entries                           59,579                      56,215              -5.65%
> ----------------------------------------------------------------------------------------------
>  
> 
> The results show that patches have significant impact on the number of posted
> interrupts at lower vCPU count (32 and 64) while providing similar IOPS and
> Guest NVME interrupt rate (i.e. patches do not regress).
> 
> Along with the performance evaluation, we did sanity tests such with AVIC,
> x2AVIC and kernel selftest.  All tests look good.
> 
> For AVIC related patches:
> Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
> 
> Regards
> Sairaj Kodilkar
>
Re: [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support
Posted by David Woodhouse 8 months ago
On Fri, 2025-04-04 at 12:38 -0700, Sean Christopherson wrote:
> 
> This series is well tested except for one notable gap: I was not able to
> fully test the AMD IOMMU changes.  Long story short, getting upstream
> kernels into our full test environments is practically infeasible.  And
> exposing a device or VF on systems that are available to developers is a
> bit of a mess.

If I can make AMD bare-metal "instances" available to you, would that help?

Separately, I'd quite like to see the eventfd→MSI delivery linkage not
use the IRQ routing table at all, and not need a GSI# assigned. Doing
it that way is just a scaling and performance issue.

I recently looked through the irqfd code and came to the conclusion
that it wouldn't be hard to add a new user API which allows us to
simply configure the kvm_irq_routing_entry to be delivered when a given
eventfd fires, without using the table.

I haven't had a chance to look hard, hopefully your rework doesn't make
that any less feasible...
Re: [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support
Posted by Sean Christopherson 8 months ago
On Fri, Apr 18, 2025, David Woodhouse wrote:
> On Fri, 2025-04-04 at 12:38 -0700, Sean Christopherson wrote:
> > 
> > This series is well tested except for one notable gap: I was not able to
> > fully test the AMD IOMMU changes.  Long story short, getting upstream
> > kernels into our full test environments is practically infeasible.  And
> > exposing a device or VF on systems that are available to developers is a
> > bit of a mess.
> 
> If I can make AMD bare-metal "instances" available to you, would that help?

Probably not, my main limitation is time, not lack of hardware.

I'm confident I can get a functional AMD test setup internally, it'll just take
a bit more time/effort (there are other people working on the testing front; I'm
hoping if I wait a bit, someone will solve the hiccups for me).

I'd been holding this series since ~October of last year, precisely due to lack
of bandwidth to configure a working test environment.  I felt that I got far
enough in testing that the odds of something being _really_ broken are small,
and didn't want to delay posting for potentially multiple more months as I assume
other folks in the community already have readily available test setups.

And no matter what, I want to get more thorough testing on a broader range of
hardware, e.g. from Intel and AMD in particular, before this gets merged, so in
the end I don't think me getting access to different hardware would move the
needle much.

Though I appreciate the offer :-)

> Separately, I'd quite like to see the eventfd→MSI delivery linkage not
> use the IRQ routing table at all, and not need a GSI# assigned. Doing
> it that way is just a scaling and performance issue.
> 
> I recently looked through the irqfd code and came to the conclusion
> that it wouldn't be hard to add a new user API which allows us to
> simply configure the kvm_irq_routing_entry to be delivered when a given
> eventfd fires, without using the table.

Yeah, especially if we gated the functionality on a per-VM capability.  That way
kvm_irq_routing_update() could completely skip processing irqfds.  At that point,
other than the new uAPI, I think it's just irqfd_inject() and the resample code
that needs to be modified.

> I haven't had a chance to look hard, hopefully your rework doesn't make
> that any less feasible...

Quite the opposite, it should make it much, much easier.  Currently, both
vmx_pi_update_irte() and avic_pi_update_irte() pull the GSI's routing entry from
kvm->irq_routing.

After this rework, irqfd->irq_entry is explicitly passed into 
kvm_arch_update_irqfd_routing(), i.e. it removes two of the gnarliest paths that
expect irqfd to go through the standard routing table.
Re: [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support
Posted by David Matlack 8 months, 1 week ago
On Fri, Apr 4, 2025 at 12:39 PM Sean Christopherson <seanjc@google.com> wrote:
>
> This series is well tested except for one notable gap: I was not able to
> fully test the AMD IOMMU changes.  Long story short, getting upstream
> kernels into our full test environments is practically infeasible.  And
> exposing a device or VF on systems that are available to developers is a
> bit of a mess.
>
> The device the selftest (see the last patch) uses is an internel test VF
> that's hosted on a smart NIC using non-production (test-only) firmware.
> Unfortunately, only some of our developer systems have the right NIC, and
> for unknown reasons I couldn't get the test firmware to install cleanly on
> Rome systems.  I was able to get it functional on Milan (and Intel CPUs),
> but APIC virtualization is disabled on Milan.  Thanks to KVM's force_avic
> I could test the KVM flows, but the IOMMU was having none of my attempts
> to force enable APIC virtualization against its will.

(Sean already knows this but just sharing for the broader visibility.)

I am working on a VFIO selftests framework and helper library that we
can link into the KVM selftests to make this kind of testing much
easier. It will support a driver framework so we can support testing
against different devices in a common way. Developers/companies can
carry their own out-of-tree drivers for non-standard/custom test
devices, e.g. the "Mercury device" used in this series.

I will send an RFC in the coming weeks. If/when my proposal is merged,
then I think we'll have a clean way to get the vfio_irq_test merged
upstream as well.
Re: [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support
Posted by David Matlack 6 months, 3 weeks ago
On Tue, Apr 8, 2025 at 10:13 AM David Matlack <dmatlack@google.com> wrote:
>
> On Fri, Apr 4, 2025 at 12:39 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > This series is well tested except for one notable gap: I was not able to
> > fully test the AMD IOMMU changes.  Long story short, getting upstream
> > kernels into our full test environments is practically infeasible.  And
> > exposing a device or VF on systems that are available to developers is a
> > bit of a mess.
> >
> > The device the selftest (see the last patch) uses is an internel test VF
> > that's hosted on a smart NIC using non-production (test-only) firmware.
> > Unfortunately, only some of our developer systems have the right NIC, and
> > for unknown reasons I couldn't get the test firmware to install cleanly on
> > Rome systems.  I was able to get it functional on Milan (and Intel CPUs),
> > but APIC virtualization is disabled on Milan.  Thanks to KVM's force_avic
> > I could test the KVM flows, but the IOMMU was having none of my attempts
> > to force enable APIC virtualization against its will.
>
> (Sean already knows this but just sharing for the broader visibility.)
>
> I am working on a VFIO selftests framework and helper library that we
> can link into the KVM selftests to make this kind of testing much
> easier. It will support a driver framework so we can support testing
> against different devices in a common way. Developers/companies can
> carry their own out-of-tree drivers for non-standard/custom test
> devices, e.g. the "Mercury device" used in this series.
>
> I will send an RFC in the coming weeks. If/when my proposal is merged,
> then I think we'll have a clean way to get the vfio_irq_test merged
> upstream as well.

This RFC can be found here:
https://lore.kernel.org/kvm/20250523233018.1702151-1-dmatlack@google.com/
Re: [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support
Posted by Paolo Bonzini 8 months, 1 week ago
On 4/4/25 21:38, Sean Christopherson wrote:
> TL;DR: Overhaul device posted interrupts in KVM and IOMMU, and AVIC in
>         general.  This needs more testing on AMD with device posted IRQs.
> 
> This applies on the small series that adds a enable_device_posted_irqs
> module param (the prep work for that is also prep work for this):
> 
>     https://lore.kernel.org/all/20250401161804.842968-1-seanjc@google.com
> 
> Fix a variety of bugs related to device posted IRQs, especially on the
> AMD side, and clean up KVM's implementation, which IMO is in the running
> for Most Convoluted Code in KVM.
> 
> Stating the obvious, this series is comically large.  I'm posting it as a
> single series, at least for the first round of reviews, to build the
> (mostly) full picture of the end goal (it's not the true end goal; there's
> still more cleanups that can be done).  And because properly testing most
> of the code would be futile until almost the end of the series (so. many.
> bugs.).
> 
> Batch #1 (patches 1-10) fixes bugs of varying severity.

I started reviewing these, I guess patches 1-7 could be queued for 6.15? 
  And maybe also patch 2 from 
https://lore.kernel.org/all/20250401161804.842968-1-seanjc@google.com/.

Paolo

> Batch #2 is mostly SVM specific:
> 
>   - Cleans up various warts and bugs in the IRTE tracking
>   - Fixes AVIC to not reject large VMs (honor KVM's ABI)
>   - Wire up AVIC to enable_ipiv to support disabling IPI virtualization while
>     still utilizing device posted interrupts, and to workaround erratum #1235.
> 
> Batch #3 overhauls the guts of IRQ bypass in KVM, and moves the vast majority
> of the logic to common x86; only the code that needs to communicate with the
> IOMMU is truly vendor specific.
> 
> Batch #4 is more SVM/AVIC cleanups that are made possible by batch #3.
> 
> Batch #5 adds WARNs and drops dead code after all the previous cleanups and
> fixes (I don't want to add the WARNs earlier; I don't any point in adding
> WARNs in code that's known to be broken).
> 
> Batch #6 is yet more SVM/AVIC cleanups, with the specific goal of configuring
> IRTEs to generate GA log interrupts if and only if KVM actually needs a wake
> event.
> 
> This series is well tested except for one notable gap: I was not able to
> fully test the AMD IOMMU changes.  Long story short, getting upstream
> kernels into our full test environments is practically infeasible.  And
> exposing a device or VF on systems that are available to developers is a
> bit of a mess.
> 
> The device the selftest (see the last patch) uses is an internel test VF
> that's hosted on a smart NIC using non-production (test-only) firmware.
> Unfortunately, only some of our developer systems have the right NIC, and
> for unknown reasons I couldn't get the test firmware to install cleanly on
> Rome systems.  I was able to get it functional on Milan (and Intel CPUs),
> but APIC virtualization is disabled on Milan.  Thanks to KVM's force_avic
> I could test the KVM flows, but the IOMMU was having none of my attempts
> to force enable APIC virtualization against its will.
> 
> Through hackery (see the penultimate patch), I was able to gain a decent
> amount of confidence in the IOMMU changes (and the interface between KVM
> and the IOMMU).
> 
> For initial development of the series, I also cobbled together a "mock"
> IRQ bypass device, to allow testing in a VM.
> 
>    https://github.com/sean-jc/linux.git x86/mock_irqbypass_producer
> 
> Note, the diffstat is misleading due to the last two DO NOT MERGE patches
> adding 1k+ LoC.  Without those, this series removes ~80 LoC (substantially
> more if comments are ignored).
> 
>    21 files changed, 577 insertions(+), 655 deletions(-)
> 
> Maxim Levitsky (2):
>    KVM: SVM: Add enable_ipiv param, never set IsRunning if disabled
>    KVM: SVM: Disable (x2)AVIC IPI virtualization if CPU has erratum #1235
> 
> Sean Christopherson (65):
>    KVM: SVM: Allocate IR data using atomic allocation
>    KVM: x86: Reset IRTE to host control if *new* route isn't postable
>    KVM: x86: Explicitly treat routing entry type changes as changes
>    KVM: x86: Take irqfds.lock when adding/deleting IRQ bypass producer
>    iommu/amd: Return an error if vCPU affinity is set for non-vCPU IRTE
>    iommu/amd: WARN if KVM attempts to set vCPU affinity without posted
>      intrrupts
>    KVM: SVM: WARN if an invalid posted interrupt IRTE entry is added
>    KVM: x86: Pass new routing entries and irqfd when updating IRTEs
>    KVM: SVM: Track per-vCPU IRTEs using kvm_kernel_irqfd structure
>    KVM: SVM: Delete IRTE link from previous vCPU before setting new IRTE
>    KVM: SVM: Delete IRTE link from previous vCPU irrespective of new
>      routing
>    KVM: SVM: Drop pointless masking of default APIC base when setting
>      V_APIC_BAR
>    KVM: SVM: Drop pointless masking of kernel page pa's with AVIC HPA
>      masks
>    KVM: SVM: Add helper to deduplicate code for getting AVIC backing page
>    KVM: SVM: Drop vcpu_svm's pointless avic_backing_page field
>    KVM: SVM: Inhibit AVIC if ID is too big instead of rejecting vCPU
>      creation
>    KVM: SVM: Drop redundant check in AVIC code on ID during vCPU creation
>    KVM: SVM: Track AVIC tables as natively sized pointers, not "struct
>      pages"
>    KVM: SVM: Drop superfluous "cache" of AVIC Physical ID entry pointer
>    KVM: VMX: Move enable_ipiv knob to common x86
>    KVM: VMX: Suppress PI notifications whenever the vCPU is put
>    KVM: SVM: Add a comment to explain why avic_vcpu_blocking() ignores
>      IRQ blocking
>    iommu/amd: KVM: SVM: Use pi_desc_addr to derive ga_root_ptr
>    iommu/amd: KVM: SVM: Delete now-unused cached/previous GA tag fields
>    iommu/amd: KVM: SVM: Pass NULL @vcpu_info to indicate "not guest mode"
>    KVM: SVM: Get vCPU info for IRTE using new routing entry
>    KVM: SVM: Stop walking list of routing table entries when updating
>      IRTE
>    KVM: VMX: Stop walking list of routing table entries when updating
>      IRTE
>    KVM: SVM: Extract SVM specific code out of get_pi_vcpu_info()
>    KVM: x86: Nullify irqfd->producer after updating IRTEs
>    KVM: x86: Dedup AVIC vs. PI code for identifying target vCPU
>    KVM: x86: Move posted interrupt tracepoint to common code
>    KVM: SVM: Clean up return handling in avic_pi_update_irte()
>    iommu: KVM: Split "struct vcpu_data" into separate AMD vs. Intel
>      structs
>    KVM: Don't WARN if updating IRQ bypass route fails
>    KVM: Fold kvm_arch_irqfd_route_changed() into
>      kvm_arch_update_irqfd_routing()
>    KVM: x86: Track irq_bypass_vcpu in common x86 code
>    KVM: x86: Skip IOMMU IRTE updates if there's no old or new vCPU being
>      targeted
>    KVM: x86: Don't update IRTE entries when old and new routes were !MSI
>    KVM: SVM: Revert IRTE to legacy mode if IOMMU doesn't provide IR
>      metadata
>    KVM: SVM: Take and hold ir_list_lock across IRTE updates in IOMMU
>    iommu/amd: KVM: SVM: Infer IsRun from validity of pCPU destination
>    iommu/amd: Factor out helper for manipulating IRTE GA/CPU info
>    iommu/amd: KVM: SVM: Set pCPU info in IRTE when setting vCPU affinity
>    iommu/amd: KVM: SVM: Add IRTE metadata to affined vCPU's list if AVIC
>      is inhibited
>    KVM: SVM: Don't check for assigned device(s) when updating affinity
>    KVM: SVM: Don't check for assigned device(s) when activating AVIC
>    KVM: SVM: WARN if (de)activating guest mode in IOMMU fails
>    KVM: SVM: Process all IRTEs on affinity change even if one update
>      fails
>    KVM: SVM: WARN if updating IRTE GA fields in IOMMU fails
>    KVM: x86: Drop superfluous "has assigned device" check in
>      kvm_pi_update_irte()
>    KVM: x86: WARN if IRQ bypass isn't supported in kvm_pi_update_irte()
>    KVM: x86: WARN if IRQ bypass routing is updated without in-kernel
>      local APIC
>    KVM: SVM: WARN if ir_list is non-empty at vCPU free
>    KVM: x86: Decouple device assignment from IRQ bypass
>    KVM: VMX: WARN if VT-d Posted IRQs aren't possible when starting IRQ
>      bypass
>    KVM: SVM: Use vcpu_idx, not vcpu_id, for GA log tag/metadata
>    iommu/amd: WARN if KVM calls GA IRTE helpers without virtual APIC
>      support
>    KVM: SVM: Fold avic_set_pi_irte_mode() into its sole caller
>    KVM: SVM: Don't check vCPU's blocking status when toggling AVIC on/off
>    KVM: SVM: Consolidate IRTE update when toggling AVIC on/off
>    iommu/amd: KVM: SVM: Allow KVM to control need for GA log interrupts
>    KVM: SVM: Generate GA log IRQs only if the associated vCPUs is
>      blocking
>    *** DO NOT MERGE *** iommu/amd: Hack to fake IRQ posting support
>    *** DO NOT MERGE *** KVM: selftests: WIP posted interrupts test
> 
>   arch/x86/include/asm/irq_remapping.h          |  17 +-
>   arch/x86/include/asm/kvm-x86-ops.h            |   2 +-
>   arch/x86/include/asm/kvm_host.h               |  20 +-
>   arch/x86/include/asm/svm.h                    |  13 +-
>   arch/x86/kvm/svm/avic.c                       | 707 ++++++++----------
>   arch/x86/kvm/svm/svm.c                        |   6 +
>   arch/x86/kvm/svm/svm.h                        |  24 +-
>   arch/x86/kvm/trace.h                          |  19 +-
>   arch/x86/kvm/vmx/capabilities.h               |   1 -
>   arch/x86/kvm/vmx/main.c                       |   2 +-
>   arch/x86/kvm/vmx/posted_intr.c                | 150 ++--
>   arch/x86/kvm/vmx/posted_intr.h                |  11 +-
>   arch/x86/kvm/vmx/vmx.c                        |   2 -
>   arch/x86/kvm/x86.c                            | 124 ++-
>   drivers/iommu/amd/amd_iommu_types.h           |   1 -
>   drivers/iommu/amd/init.c                      |   8 +-
>   drivers/iommu/amd/iommu.c                     | 171 +++--
>   drivers/iommu/intel/irq_remapping.c           |  10 +-
>   include/linux/amd-iommu.h                     |  25 +-
>   include/linux/kvm_host.h                      |   9 +-
>   include/linux/kvm_irqfd.h                     |   4 +
>   tools/testing/selftests/kvm/Makefile.kvm      |   2 +
>   .../selftests/kvm/include/vfio_pci_util.h     | 149 ++++
>   .../selftests/kvm/include/x86/processor.h     |  21 +
>   .../testing/selftests/kvm/lib/vfio_pci_util.c | 201 +++++
>   tools/testing/selftests/kvm/mercury_device.h  | 118 +++
>   tools/testing/selftests/kvm/vfio_irq_test.c   | 429 +++++++++++
>   virt/kvm/eventfd.c                            |  22 +-
>   28 files changed, 1610 insertions(+), 658 deletions(-)
>   create mode 100644 tools/testing/selftests/kvm/include/vfio_pci_util.h
>   create mode 100644 tools/testing/selftests/kvm/lib/vfio_pci_util.c
>   create mode 100644 tools/testing/selftests/kvm/mercury_device.h
>   create mode 100644 tools/testing/selftests/kvm/vfio_irq_test.c
> 
> 
> base-commit: 5f9f498ea14ffe15390aa46fb85375e7c901bce3
Re: [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support
Posted by Joerg Roedel 8 months, 1 week ago
Hey Sean,

On Fri, Apr 04, 2025 at 12:38:15PM -0700, Sean Christopherson wrote:
> TL;DR: Overhaul device posted interrupts in KVM and IOMMU, and AVIC in
>        general.  This needs more testing on AMD with device posted IRQs.

Thanks for posting this, it fixes quite some issues in the posted IRQ
implemention. I skimmed through the AMD IOMMU changes and besides some
small things didn't spot anything worrisome.

Adding Suravee and Vasant from AMD to this thread for deeper review
(also of the KVM parts) and testing.

Thanks,

	Joerg
Re: [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support
Posted by Vasant Hegde 8 months, 1 week ago
Joerg,


On 4/8/2025 6:14 PM, Joerg Roedel wrote:
> Hey Sean,
> 
> On Fri, Apr 04, 2025 at 12:38:15PM -0700, Sean Christopherson wrote:
>> TL;DR: Overhaul device posted interrupts in KVM and IOMMU, and AVIC in
>>        general.  This needs more testing on AMD with device posted IRQs.
> 
> Thanks for posting this, it fixes quite some issues in the posted IRQ
> implemention. I skimmed through the AMD IOMMU changes and besides some
> small things didn't spot anything worrisome.
> 
> Adding Suravee and Vasant from AMD to this thread for deeper review
> (also of the KVM parts) and testing.

Sure. We will try to review it soon and will run some tests.

-Vasant