This series adds support for guests using the AMD vIOMMU to enable DMA
remapping for VFIO devices. In addition to the currently supported
passthrough (PT) mode, guest kernels are now able to provide DMA
address translation and access permission checking to VFs attached to
paging domains, using the AMD v1 I/O page table format.
Please see v1[0] cover letter for additional details such as example
QEMU command line parameters used in testing.
Changes since v1[0]:
- Added documentation entry for '-device amd-iommu'
- Code movement with no functional changes to avoid use of forward
declarations in later patches [Sairaj, mst]
- Moved addr_translation and dma-remap property to separate commits.
The dma-remap feature is only available for users to enable after
all required functionality is implemented [Sairaj]
- Explicit initialization of significant fields like addr_translation
and notifier_flags [Sairaj]
- Fixed bug in decoding of invalidation size [Sairaj]
- Changed fetch_pte() to use an out parameter for pte, and be able to
check for error conditions via negative return value [Clement]
- Removed UNMAP-only notifier optimization, leaving vhost support for
later series [Sairaj]
- Fixed ordering between address space unmap and memory region activation
on devtab invalidation [Sairaj]
- Fixed commit message with "V=1, TV=0" [Sairaj]
- Dropped patch removing the page_fault event. That area is better
addressed in separate series.
- Independent testing by Sairaj (thank you!)
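As a hedged illustration of the fetch_pte() convention mentioned in the changelog (out parameter for the PTE, negative return value on error), here is a minimal sketch; the names and the backing array are hypothetical, and in QEMU the read would go through the guest-memory/DMA APIs rather than a plain array:

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of the calling convention: write the PTE through an out
 * parameter, and signal "unable to read PTE from guest memory" with
 * a negative return value so callers can distinguish errors from
 * legitimate zero-valued entries. */
static int fetch_pte(const uint64_t *guest_table, size_t n_entries,
                     size_t index, uint64_t *pte)
{
    if (guest_table == NULL || index >= n_entries) {
        return -1; /* read from guest memory failed */
    }
    *pte = guest_table[index];
    return 0;
}
```

The point of the out-parameter shape is that a PTE value of 0 is valid data, so the return value alone must carry the error status.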
Thank you,
Alejandro
[0] https://lore.kernel.org/all/20250414020253.443831-1-alejandro.j.jimenez@oracle.com/
Alejandro Jimenez (20):
memory: Adjust event ranges to fit within notifier boundaries
amd_iommu: Document '-device amd-iommu' common options
amd_iommu: Reorder device and page table helpers
amd_iommu: Helper to decode size of page invalidation command
amd_iommu: Add helper function to extract the DTE
amd_iommu: Return an error when unable to read PTE from guest memory
amd_iommu: Add helpers to walk AMD v1 Page Table format
amd_iommu: Add a page walker to sync shadow page tables on
invalidation
amd_iommu: Add basic structure to support IOMMU notifier updates
amd_iommu: Sync shadow page tables on page invalidation
amd_iommu: Use iova_tree records to determine large page size on UNMAP
amd_iommu: Unmap all address spaces under the AMD IOMMU on reset
amd_iommu: Add replay callback
amd_iommu: Invalidate address translations on INVALIDATE_IOMMU_ALL
amd_iommu: Toggle memory regions based on address translation mode
amd_iommu: Set all address spaces to default translation mode on reset
amd_iommu: Add dma-remap property to AMD vIOMMU device
amd_iommu: Toggle address translation mode on devtab entry
invalidation
amd_iommu: Do not assume passthrough translation when DTE[TV]=0
amd_iommu: Refactor amdvi_page_walk() to use common code for page walk
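As background for the invalidation-size patches above, a minimal sketch of decoding an INVALIDATE_IOMMU_PAGES range. This follows the AMD IOMMU specification's encoding (with S=0 a single 4KiB page is invalidated; with S=1 the run of 1 bits starting at address bit 12 encodes the range), but the function name and shape are illustrative, not the actual QEMU helper:

```c
#include <stdint.h>

/* Decode the size of a page invalidation. With s_bit == 0 the command
 * covers one 4KiB page. With s_bit == 1, the first zero bit at or above
 * bit 12 of the address ends a run of ones, and the invalidated range is
 * 2^(position + 1) bytes (so bit 12 clear -> 8KiB, bits 12 set / 13
 * clear -> 16KiB, and so on). The all-ones "invalidate everything"
 * pattern is not handled in this sketch. */
static uint64_t invalidation_size(uint64_t addr, int s_bit)
{
    if (!s_bit) {
        return 4096;
    }
    int pos = 12;
    while ((addr >> pos) & 1) {
        pos++;
    }
    return 1ULL << (pos + 1);
}
```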
hw/i386/amd_iommu.c | 1005 ++++++++++++++++++++++++++++++++++++-------
hw/i386/amd_iommu.h | 52 +++
qemu-options.hx | 23 +
system/memory.c | 10 +-
4 files changed, 934 insertions(+), 156 deletions(-)
base-commit: 5134cf9b5d3aee4475fe7e1c1c11b093731073cf
--
2.43.5
On Fri, May 02, 2025 at 02:15:45AM +0000, Alejandro Jimenez wrote:
> This series adds support for guests using the AMD vIOMMU to enable DMA
> remapping for VFIO devices. In addition to the currently supported
> passthrough (PT) mode, guest kernels are now able to provide DMA
> address translation and access permission checking to VFs attached to
> paging domains, using the AMD v1 I/O page table format.
>
> Please see v1[0] cover letter for additional details such as example
> QEMU command line parameters used in testing.

are you working on v3?
there was a bug you wanted to fix.

> [...]
On 5/30/25 7:41 AM, Michael S. Tsirkin wrote:
> On Fri, May 02, 2025 at 02:15:45AM +0000, Alejandro Jimenez wrote:
>> This series adds support for guests using the AMD vIOMMU to enable DMA
>> remapping for VFIO devices. [...]
>
> are you working on v3?

Yes, there are suggestions from Sairaj that I will address on v3. I am
also planning to include two small patches from Joao Martins that add
support for the HATDis feature (this is something that Sairaj suggested
earlier). The Linux changes are being reviewed here:
https://lore.kernel.org/all/cover.1746613368.git.Ankit.Soni@amd.com/

I will be offline from 6/2 to 6/6, so I didn't want to send a new
revision and disappear. In general, the changes from v2->v3 are minor and
well contained, so any reviews I receive for v2 will be valid. That being
said, I can send v3 today if you'd prefer that. Please let me know.

> there was a bug you wanted to fix.

I assume the bug is Sairaj's report of a dmesg warning with an NVMe
passthrough on a 4.15 kernel, but unfortunately I have not been able to
reproduce that problem. We agreed that given the age of the kernel (and
reports of the same warning on NVMe devices in unrelated scenarios), this
is likely a guest driver issue, and should not be a blocker.

More details: I have tested an Ubuntu image with a 4.15 kernel, but I
cannot hit any issues when I pass through a CX-6 VF (I don't have access
to an NVMe VF). The kernel is old enough that I have to force bind the
mlx5_core driver to the VF on the guest, but once I do, the VF comes up
with no errors and I can see DMA map/unmap activity in the traces.
Sairaj: Are you passing a full NVMe device to the guest (i.e. a PF)? I
ask because the BDF in '-device vfio-pci,host=0000:44:00.0' doesn't look
like a typical VF...

Thank you,
Alejandro

>> [...]
> Sairaj: Are you passing a full NVMe device to the guest (i.e. a PF)? I
> ask because the BDF in '-device vfio-pci,host=0000:44:00.0' doesn't look
> like a typical VF...

Hey Alejandro,

I am passing the full NVMe device (a PF) to the guest, not just a VF.

Thanks
Sairaj
On 5/2/2025 7:45 AM, Alejandro Jimenez wrote:
> This series adds support for guests using the AMD vIOMMU to enable DMA
> remapping for VFIO devices. [...]

Hi Alejandro,

Tested the v2, everything looks good when I boot the guest with an
upstream kernel. But I observed that the NVMe driver fails to load with
guest kernel version 4.15.0-213-generic. This is the default kernel that
comes with the Ubuntu image.
This is what I see in the dmesg:

[   26.702381] nvme nvme0: pci function 0000:00:04.0
[   26.817847] nvme nvme0: missing or invalid SUBNQN field.

I am using the following qemu command line:

-enable-kvm -m 10G -smp cpus=$NUM_VCPUS \
-device amd-iommu,dma-remap=on \
-netdev user,id=USER0,hostfwd=tcp::3333-:22 \
-device virtio-net-pci,id=vnet0,iommu_platform=on,disable-legacy=on,romfile=,netdev=USER0 \
-cpu EPYC-Genoa,x2apic=on,kvm-msi-ext-dest-id=on,+kvm-pv-unhalt,kvm-pv-tlb-flush,kvm-pv-ipi,kvm-pv-sched-yield \
-name guest=my-vm,debug-threads=on \
-machine q35,kernel_irqchip=split \
-global kvm-pit.lost_tick_policy=discard \
-nographic -vga none -chardev stdio,id=STDIO0,signal=off,mux=on \
-device isa-serial,id=isa-serial0,chardev=STDIO0 \
-smbios type=0,version=2.8 \
-blockdev node-name=drive0,driver=qcow2,file.driver=file,file.filename=$IMG \
-device virtio-blk-pci,num-queues=8,drive=drive0 \
-chardev socket,id=SOCKET1,server=on,wait=off,path=qemu.mon.user3333 \
-mon chardev=SOCKET1,mode=control \
-device vfio-pci,host=0000:44:00.0

Do you have any idea what might trigger this? I see the error only when I
am using the emulated AMD IOMMU with a passthrough device. Regular
passthrough works fine.

Regards
Sairaj Kodilkar

P.S. I know that the guest kernel is quite old but still wanted to make
you aware.
Hi Sairaj

On 5/16/25 4:07 AM, Sairaj Kodilkar wrote:
> Tested the v2, everything looks good when I boot the guest with an
> upstream kernel. But I observed that the NVMe driver fails to load with
> guest kernel version 4.15.0-213-generic. This is the default kernel
> that comes with the Ubuntu image.

Thank you for the additional testing and for the report. I wanted to
investigate and if possible solve the issue before replying, but since it
is taking me some time I wanted to ACK your message. Minor comments
below...

> This is what I see in the dmesg:
>
> [   26.702381] nvme nvme0: pci function 0000:00:04.0
> [   26.817847] nvme nvme0: missing or invalid SUBNQN field.

There are multiple reports of that warning, which would indicate that it
is not caused by an issue with the IOMMU emulation, but it is interesting
that you don't see it with "regular passthrough" (I assume that means
with the guest kernel in pt mode).

> I am using the following qemu command line:
>
> [...]
>
> Do you have any idea what might trigger this?

There are some parameters above that are unnecessary and perhaps
conflicting, e.g. we don't need kvm-msi-ext-dest-id=on since the vIOMMU
provides interrupt remapping (plus you are likely not using more than 255
vCPUs). We also don't need kvm-pit.lost_tick_policy when using a split
irqchip, since the PIT is not emulated by KVM. But to be fair, I don't
believe those are likely to be causing the problem...

My main suspicion is the guest IOMMU driver being too old and missing
lots of fixes, so it could be missing some essential operations that the
emulation requires to work. e.g. if the guest driver does not comply with
the spec and fails to issue a DEVTAB_INVALIDATE after changing the DTE,
the vIOMMU code never gets the chance to enable the IOMMU memory region,
and it all goes wrong from that point on.

But I need to reproduce the problem and figure out where/when the
emulation is failing. I've tested as far back as 5.15 based kernels.

I would argue that while it is something that I am definitely going to
address if possible, this issue should not be a blocker. I'll update as
soon as I have more data on the cause.

Thank you,
Alejandro
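To make the suspected failure mode concrete, here is a hedged model of the flow described above; all names are hypothetical (not QEMU's actual code), and the DTE bits are simplified. The emulation cannot trap the guest's in-memory DTE write, so it only re-evaluates the entry, and toggles the device's IOMMU memory region, when a DEVTAB_INVALIDATE arrives:

```c
#include <stdbool.h>
#include <stdint.h>

#define DTE_V  (1ULL << 0)  /* entry valid */
#define DTE_TV (1ULL << 1)  /* translation information valid */

struct vdev {
    uint64_t dte;                    /* guest-written device table entry */
    bool translation_region_enabled; /* IOMMU memory region state */
};

/* The guest updates the DTE in its own memory; no trap occurs here. */
static void guest_write_dte(struct vdev *d, uint64_t dte)
{
    d->dte = dte;
}

/* Only on DEVTAB_INVALIDATE does the emulation re-read the DTE and
 * switch the device between passthrough and translated mode. */
static void handle_devtab_invalidate(struct vdev *d)
{
    d->translation_region_enabled = (d->dte & DTE_V) && (d->dte & DTE_TV);
}
```

A driver that writes V=1, TV=1 but never issues the invalidation leaves the translation region disabled, which matches the "it all goes wrong from that point on" scenario for a non-compliant guest driver.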
On 5/21/2025 8:05 AM, Alejandro Jimenez wrote:
> There are multiple reports of that warning, which would indicate that
> it is not caused by an issue with the IOMMU emulation, but it is
> interesting that you don't see it with "regular passthrough" (I assume
> that means with the guest kernel in pt mode).

Yep. The "regular passthrough" is a guest without amd-iommu, or with
pt=on.

> There are some parameters above that are unnecessary and perhaps
> conflicting [...]

Thanks for letting me know, I'll update the script.

> I would argue that while it is something that I am definitely going to
> address if possible, this issue should not be a blocker. I'll update as
> soon as I have more data on the cause.

I also think the same. This may be some old driver issue and we should
not block on it.

Tested-by: Sairaj Kodilkar <sarunkod@amd.com>

Regards
Sairaj Kodilkar