[PATCH v4 00/15] vfio: VFIO migration support with vIOMMU

Joao Martins posted 15 patches 2 years, 7 months ago
Maintainers: "Michael S. Tsirkin" <mst@redhat.com>, Peter Xu <peterx@redhat.com>, Jason Wang <jasowang@redhat.com>, Marcel Apfelbaum <marcel.apfelbaum@gmail.com>, Paolo Bonzini <pbonzini@redhat.com>, Richard Henderson <richard.henderson@linaro.org>, Eduardo Habkost <eduardo@habkost.net>, Alex Williamson <alex.williamson@redhat.com>, "Cédric Le Goater" <clg@redhat.com>, David Hildenbrand <david@redhat.com>, "Philippe Mathieu-Daudé" <philmd@linaro.org>
[PATCH v4 00/15] vfio: VFIO migration support with vIOMMU
Posted by Joao Martins 2 years, 7 months ago
Hey,

This series introduces support for vIOMMU with VFIO device migration,
particularly related to how we do the dirty page tracking.

Today vIOMMUs serve two purposes: 1) enable interrupt remapping, 2)
provide DMA translation services for guests, to support some form of
guest-kernel-managed DMA, e.g. for nested-virt-based usage; (1) is
especially required for big VMs with VFs and more than 255 vCPUs. We
tackle both and remove the migration blocker when a vIOMMU is present,
provided the conditions are met. I have both use cases here in one
series, but I am happy to tackle them in separate series.

As I found out, we don't necessarily need to expose the whole vIOMMU
functionality in order to just support interrupt remapping. x86 IOMMUs
on Windows Server 2018[2] and Linux >= 5.10, with qemu 7.1+ (or really
Linux guests with commit c40aaaac10 and since qemu commit 8646d9c773d8)
can instantiate an IOMMU just for interrupt remapping, without needing
to advertise/support DMA translation. The AMD IOMMU can in theory
provide the same, but Linux doesn't quite support the IR-only part
there yet; only intel-iommu does.
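
For reference, an IR-only guest along these lines could be configured
more or less as below (a sketch, not a complete command line; it assumes
a QEMU new enough to have the intel-iommu 'dma-translation' property,
and split irqchip, which intremap requires with KVM):

```
qemu-system-x86_64 \
    -machine q35,kernel-irqchip=split \
    -device intel-iommu,intremap=on,dma-translation=off \
    ...
```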

The series is organized as follows:

Patches 1-5: Today we can't gather vIOMMU details before the guest
establishes its first DMA mapping via the vIOMMU. So these first
patches add a way for vIOMMUs to be queried for their properties at
start of day. I chose the least-churn way for now (as opposed to a
treewide conversion) and allow easy conversion a posteriori. As
suggested by Peter Xu[7], I have resurrected Yi's patches[5][6], which
allow us to fetch the PCI device's backing vIOMMU attributes without
necessarily tying the caller (VFIO or anyone else) to an IOMMU MR like I
was doing in v3.
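
To make the shape of this concrete, here is a minimal self-contained
sketch, not the actual QEMU code: PCIBusStub, toy_viommu_get_attr and
the _stub suffixes are illustrative stand-ins for the real
PCIIOMMUOps/pci_device_iommu_get_attr plumbing. It shows the idea of
optional per-bus iommu_ops with a get_attr hook that callers can query
at start of day:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-ins for the real types in include/hw/pci/pci.h */
typedef enum {
    IOMMU_ATTR_DMA_TRANSLATION,
    IOMMU_ATTR_MAX_IOVA,
} IOMMUAttr;

typedef struct PCIIOMMUOps {
    /* Returns 0 on success, -1 if the attribute is unsupported. */
    int (*get_attr)(void *opaque, int devfn, IOMMUAttr attr, void *data);
} PCIIOMMUOps;

typedef struct PCIBusStub {
    const PCIIOMMUOps *iommu_ops;   /* NULL when no vIOMMU is wired up */
    void *iommu_opaque;
} PCIBusStub;

/* The series' idea: callers ask the PCI device's bus, not an IOMMU MR. */
static int pci_device_iommu_get_attr_stub(PCIBusStub *bus, int devfn,
                                          IOMMUAttr attr, void *data)
{
    if (!bus->iommu_ops || !bus->iommu_ops->get_attr) {
        return -1;  /* ops are optional: absence keeps today's defaults */
    }
    return bus->iommu_ops->get_attr(bus->iommu_opaque, devfn, attr, data);
}

/* A toy vIOMMU: interrupt remapping only, 39-bit IOVA space. */
static int toy_viommu_get_attr(void *opaque, int devfn, IOMMUAttr attr,
                               void *data)
{
    (void)opaque; (void)devfn;
    switch (attr) {
    case IOMMU_ATTR_DMA_TRANSLATION:
        *(bool *)data = false;
        return 0;
    case IOMMU_ATTR_MAX_IOVA:
        *(uint64_t *)data = (1ULL << 39) - 1;
        return 0;
    }
    return -1;
}

static uint64_t toy_query_max_iova(void)
{
    static const PCIIOMMUOps ops = { .get_attr = toy_viommu_get_attr };
    PCIBusStub bus = { .iommu_ops = &ops, .iommu_opaque = NULL };
    uint64_t max_iova = 0;
    int ret = pci_device_iommu_get_attr_stub(&bus, 0, IOMMU_ATTR_MAX_IOVA,
                                             &max_iova);
    assert(ret == 0);
    return max_iova;
}

static int toy_query_no_iommu(void)
{
    PCIBusStub bus = { 0 };     /* no vIOMMU behind this bus */
    uint64_t v = 0;
    return pci_device_iommu_get_attr_stub(&bus, 0, IOMMU_ATTR_MAX_IOVA, &v);
}
```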

Patches 6-8: Handle configs with vIOMMU interrupt remapping but without
DMA translation allowed. Today the 'dma-translation' attribute is
x86-iommu only, but the way this series is structured nothing stops
other vIOMMUs from supporting it too, as long as they use
pci_setup_iommu_ops() and the necessary IOMMU MR get_attr attributes
are handled. The blocker is thus relaxed when vIOMMUs are able to
toggle/report the DMA_TRANSLATION attribute. With the patches up to this
point, we've then tackled item (1) of the second paragraph.
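
The resulting blocker decision can be sketched as below (illustrative
names and a deliberate simplification of what lives in hw/vfio/common.c;
the max_iova condition anticipates patches 9-15, with 0 standing for
"limit not advertised"):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative decision: with a vIOMMU present, migration is blocked
 * only if the vIOMMU does DMA translation but does not advertise an
 * IOVA addressing limit.
 */
static bool viommu_migration_blocked(bool viommu_present,
                                     bool dma_translation,
                                     uint64_t max_iova)
{
    if (!viommu_present) {
        return false;           /* no vIOMMU: nothing to block on */
    }
    if (!dma_translation) {
        return false;           /* IR-only vIOMMU: safe to migrate */
    }
    return max_iova == 0;       /* need a limit to size dirty tracking */
}
```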

Patches 9-15: Simplified a lot from v2 (patch 9); we now only track the
complete IOVA address space, leveraging the logic we use to compose the
dirty ranges. The blocker is once again relaxed for vIOMMUs that
advertise their IOVA addressing limits. This tackles item (2). So far I
mainly use it with intel-iommu, although I have a small set of patches
for virtio-iommu per Alex's suggestion in v2.
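
The "track the complete IOVA space" idea reduces to a simple range
check, sketched here after the cover letter's description of
vfio_iommu_range_is_device_tracked() and its 2^64 wrap-around check
(names and exact semantics are illustrative):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* With a vIOMMU we track the whole IOVA space [0, max_iova], so a
 * section is covered iff it fits and does not wrap around 2^64. */
static bool range_is_device_tracked(uint64_t iova, uint64_t size,
                                    uint64_t max_iova)
{
    uint64_t end = iova + size - 1;

    if (size == 0 || end < iova) {
        return false;           /* empty, or wraps around 2^64 */
    }
    return end <= max_iova;
}
```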

Comments, suggestions welcome. Thanks for the review!

Regards,
	Joao

Changes since v3[8]:
* Pick up Yi's patches[5][6], and rework the first four patches.
  These are a bit better split compared to the originals, and make the
  new iommu_ops *optional* as opposed to a treewide conversion. Rather
  than returning an IOMMU MR and letting VFIO operate on it to fetch
  attributes, we instead let the underlying IOMMU driver fetch the
  desired IOMMU MR and ask for the desired IOMMU attribute. Callers only
  care about the PCI device's backing vIOMMU attributes, regardless of
  its topology/association. (Peter Xu)
  I've kept all the same authorship and note the changes from the
  originals where applicable.
* Because of the rework of the first four patches, switch to
  individual attributes in the VFIOSpace that track dma_translation
  and the max_iova. All are expected to be unused when zero to retain
  the defaults of today in common code.
* Improve the migration blocker message of the last patch to be
  more obvious that vIOMMU migration blocker is added when no vIOMMU
  address space limits are advertised. (Patch 15)
* Cast to uintptr_t in IOMMUAttr data in intel-iommu (Philippe).
* Switch to MAKE_64BIT_MASK() instead of plain left shift (Philippe).
* Change diffstat of patches with scripts/git.orderfile (Philippe).
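
For reference, MAKE_64BIT_MASK() lives in QEMU's include/qemu/bitops.h
and builds a contiguous mask of 'length' ones starting at bit 'shift';
the macro body below mirrors my reading of the QEMU one, so treat it as
a sketch rather than a verbatim copy:

```c
#include <assert.h>
#include <stdint.h>

/* Equivalent of QEMU's MAKE_64BIT_MASK(shift, length). */
#define MAKE_64BIT_MASK(shift, length) \
    (((~0ULL) >> (64 - (length))) << (shift))
```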

Changes since v2[3]:
* New patches 1-9 to be able to handle vIOMMUs without DMA translation, and
introduce ways to know various IOMMU model attributes via the IOMMU MR. This
is partly meant to address a comment in previous versions that we can't
access the IOMMU MR prior to the DMA mapping happening. Before this series,
vfio's giommu_list only tracks 'mapped GIOVA', and that is controlled by the
guest. This also better tackles IOMMU usage for interrupt-remapping-only
purposes.
* Dropped Peter Xu's ack on patch 9, given that the code changed a bit.
* Adjusted patch 14 for the VFIO bitmaps no longer being pointers.
* The patches that existed in v2 of vIOMMU dirty tracking are mostly
  untouched, except patch 12, which was greatly simplified.

Changes since v1[4]:
- Rebased on latest master branch. As part of it, made some changes in
  pre-copy to adjust it to Juan's new patches:
  1. Added a new patch that passes threshold_size parameter to
     .state_pending_{estimate,exact}() handlers.
  2. Added a new patch that refactors vfio_save_block().
  3. Changed the pre-copy patch to cache and report pending pre-copy
     size in the .state_pending_estimate() handler.
- Removed unnecessary P2P code. This should be added later on when P2P
  support is added. (Alex)
- Moved the dirty sync to be after the DMA unmap in vfio_dma_unmap()
  (patch #11). (Alex)
- Stored vfio_devices_all_device_dirty_tracking()'s value in a local
  variable in vfio_get_dirty_bitmap() so it can be re-used (patch #11).
- Refactored the viommu device dirty tracking ranges creation code to
  make it clearer (patch #15).
- Changed overflow check in vfio_iommu_range_is_device_tracked() to
  emphasize that we specifically check for 2^64 wrap around (patch #15).
- Added R-bs / Acks.

[0] https://lore.kernel.org/qemu-devel/20230222174915.5647-1-avihaih@nvidia.com/
[1] https://lore.kernel.org/qemu-devel/c66d2d8e-f042-964a-a797-a3d07c260a3b@oracle.com/
[2] https://learn.microsoft.com/en-us/windows-hardware/design/device-experiences/oem-kernel-dma-protection
[3] https://lore.kernel.org/qemu-devel/20230222174915.5647-1-avihaih@nvidia.com/
[4] https://lore.kernel.org/qemu-devel/20230126184948.10478-1-avihaih@nvidia.com/
[5] https://lore.kernel.org/all/20210302203827.437645-5-yi.l.liu@intel.com/
[6] https://lore.kernel.org/all/20210302203827.437645-6-yi.l.liu@intel.com/
[7] https://lore.kernel.org/qemu-devel/ZH9Kr6mrKNqUgcYs@x1n/
[8] https://lore.kernel.org/qemu-devel/20230530175937.24202-1-joao.m.martins@oracle.com/

Avihai Horon (4):
  memory/iommu: Add IOMMU_ATTR_MAX_IOVA attribute
  intel-iommu: Implement IOMMU_ATTR_MAX_IOVA get_attr() attribute
  vfio/common: Extract vIOMMU code from vfio_sync_dirty_bitmap()
  vfio/common: Optimize device dirty page tracking with vIOMMU

Joao Martins (7):
  memory/iommu: Add IOMMU_ATTR_DMA_TRANSLATION attribute
  intel-iommu: Implement get_attr() method
  vfio/common: Track whether DMA Translation is enabled on the vIOMMU
  vfio/common: Relax vIOMMU detection when DMA translation is off
  vfio/common: Move dirty tracking ranges update to helper
  vfio/common: Support device dirty page tracking with vIOMMU
  vfio/common: Block migration with vIOMMUs without address width limits

Yi Liu (4):
  hw/pci: Add a pci_setup_iommu_ops() helper
  hw/pci: Refactor pci_device_iommu_address_space()
  hw/pci: Introduce pci_device_iommu_get_attr()
  intel-iommu: Switch to pci_setup_iommu_ops()

 include/exec/memory.h         |   4 +-
 include/hw/pci/pci.h          |  11 ++
 include/hw/pci/pci_bus.h      |   1 +
 include/hw/vfio/vfio-common.h |   2 +
 hw/i386/intel_iommu.c         |  53 +++++++-
 hw/pci/pci.c                  |  58 +++++++-
 hw/vfio/common.c              | 241 ++++++++++++++++++++++++++--------
 hw/vfio/pci.c                 |  22 +++-
 8 files changed, 329 insertions(+), 63 deletions(-)

-- 
2.17.2
Re: [PATCH v4 00/15] vfio: VFIO migration support with vIOMMU
Posted by Zhangfei Gao 1 year, 1 month ago
Hi, Joao

On Fri, Jun 23, 2023 at 5:51 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> Hey,
>
> [...]
>
> Patches 6-8: Handle configs with vIOMMU interrupt remapping but without
> DMA translation allowed. Today the 'dma-translation' attribute is
> x86-iommu only, but the way this series is structured nothing stops from
> other vIOMMUs supporting it too as long as they use
> pci_setup_iommu_ops() and the necessary IOMMU MR get_attr attributes
> are handled. The blocker is thus relaxed when vIOMMUs are able to toggle
> the toggle/report DMA_TRANSLATION attribute. With the patches up to this set,
> we've then tackled item (1) of the second paragraph.

I'm not understanding how the device page table is handled.

Does this mean that, after live migration, the page table built by the
vIOMMU will be rebuilt in the target guest via pci_setup_iommu_ops?
Or is it done by page faults again?

Thanks
Re: [PATCH v4 00/15] vfio: VFIO migration support with vIOMMU
Posted by Joao Martins 1 year ago
On 07/01/2025 06:55, Zhangfei Gao wrote:
> Hi, Joao
> 
> On Fri, Jun 23, 2023 at 5:51 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>> [...]
> 
> Not understanding how to handle the device page table.
> 
> Does this mean after live-migration, the page table built by vIOMMU
> will be re-build in the target guest via pci_setup_iommu_ops?

AFAIU, it is supposed to be done after loading the vIOMMU vmstate, when
enabling the vIOMMU-related MRs. When walking the different 'emulated'
address spaces, it will replay all mappings (and skip non-present parts
of the address space).

The trick in making this work largely depends on the individual vIOMMU
implementation (and this emulated vIOMMU stuff shouldn't be confused
with IOMMU nesting, btw!). In the Intel case (and AMD will be similar),
the root table pointer that's part of the vmstate has all the device
pagetables, which is just guest memory that gets migrated over and is
enough to resolve VT-d/IVRS page walks.

The somewhat hard-to-follow part is that when it replays, it walks the
whole DMAR memory region and only notifies IOMMU MR listeners if there's
a present PTE, or skips it otherwise. So at the end of enabling the MRs,
the IOTLB gets reconstructed. Though you would have to try to understand
the flow with the vIOMMU you are using.

The replay in intel-iommu is triggered by more or less this stack trace
for a present PTE:

vfio_iommu_map_notify
memory_region_notify_iommu_one
vtd_replay_hook
vtd_page_walk_one
vtd_page_walk_level
vtd_page_walk_level
vtd_page_walk_level
vtd_page_walk
vtd_iommu_replay
memory_region_iommu_replay
vfio_listener_region_add
address_space_update_topology_pass
address_space_set_flatview
memory_region_transaction_commit
vtd_switch_address_space
vtd_switch_address_space_all
vtd_post_load
vmstate_load_state
vmstate_load
qemu_loadvm_section_start_full
qemu_loadvm_state_main
qemu_loadvm_state
process_incoming_migration_co
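
The gist of that replay can be reduced to a small sketch (toy types and
names; the real walk is vtd_page_walk() firing vfio_iommu_map_notify()
for each present PTE via the IOMMU MR notifiers):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t iova;
    uint64_t len;
    bool present;               /* non-present entries are skipped */
} ToyMapping;

/* Fire a MAP notification for each present entry, as the vIOMMU replay
 * does for IOMMU MR listeners (e.g. VFIO) after vmstate load. */
static int toy_replay(const ToyMapping *maps, int n,
                      void (*notify_map)(uint64_t iova, uint64_t len))
{
    int notified = 0;

    for (int i = 0; i < n; i++) {
        if (!maps[i].present) {
            continue;           /* hole in the IOVA space */
        }
        notify_map(maps[i].iova, maps[i].len);
        notified++;
    }
    return notified;
}

static uint64_t total_mapped;

static void count_map(uint64_t iova, uint64_t len)
{
    (void)iova;
    total_mapped += len;        /* stand-in for re-mapping into the host */
}

static int demo_replay(void)
{
    const ToyMapping maps[] = {
        { 0x0000, 0x1000, true  },
        { 0x1000, 0x1000, false },   /* skipped: not present */
        { 0x2000, 0x1000, true  },
    };
    total_mapped = 0;
    return toy_replay(maps, 3, count_map);
}
```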

Re: [PATCH v4 00/15] vfio: VFIO migration support with vIOMMU
Posted by Zhangfei Gao 1 year ago
On Wed, Jan 22, 2025 at 12:43 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> On 07/01/2025 06:55, Zhangfei Gao wrote:
> > Hi, Joao
> >
> > On Fri, Jun 23, 2023 at 5:51 AM Joao Martins <joao.m.martins@oracle.com> wrote:
> >> [...]
> >
> > Not understanding how to handle the device page table.
> >
> > Does this mean after live-migration, the page table built by vIOMMU
> > will be re-build in the target guest via pci_setup_iommu_ops?
>
> AFAIU It is supposed to be done post loading the vIOMMU vmstate when enabling
> the vIOMMU related MRs. And when walking the different 'emulated' address spaces
>  it will replay all mappings (and skip non-present parts of the address space).
>
> The trick in making this work largelly depends on individual vIOMMU
> implementation (and this emulated vIOMMU stuff shouldn't be confused with IOMMU
> nesting btw!). In intel case (and AMD will be similar) the root table pointer
> that's part of the vmstate has all the device pagetables, which is just guest
> memory that gets migrated over and enough to resolve VT-d/IVRS page walks.
>
> The somewhat hard to follow part is that when it replays it walks all the whole
> DMAR memory region and only notifies IOMMU MR listeners if there's a present PTE
> or skip it. So at the end of the enabling of MRs the IOTLB gets reconstructed.
> Though you would have to try to understand the flow with the vIOMMU you are using.
>
> The replay in intel-iommu is triggered more or less this stack trace for a
> present PTE:
>
> [...]

Thanks Joao for the info

Sorry, some more questions,

When src boots up, the guest kernel will send commands to qemu.
qemu will consume these commands, and trigger

smmuv3_cmdq_consume
smmu_realloc_veventq
smmuv3_cmdq_consume
smmuv3_cmdq_consume SMMU_CMD_CFGI_STE
smmuv3_install_nested_ste
iommufd_backend_alloc_hwpt
host_iommu_device_iommufd_attach_hwpt

After live migration, the dst does not get these commands, so it does
not call smmuv3_install_nested_ste etc.; the DMA page table is thus not
set up, and the kernel reports errors.

I'm not sure whether we need to replay these commands on the dst, or
directly copy the existing page table from the src to the dst.

Thanks
Re: [PATCH v4 00/15] vfio: VFIO migration support with vIOMMU
Posted by Joao Martins 11 months, 1 week ago
On 08/02/2025 02:07, Zhangfei Gao wrote:
> On Wed, Jan 22, 2025 at 12:43 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> On 07/01/2025 06:55, Zhangfei Gao wrote:
>>> Hi, Joao
>>>
>>> On Fri, Jun 23, 2023 at 5:51 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>> [...]
>>>
>>> Not understanding how to handle the device page table.
>>>
>>> Does this mean after live-migration, the page table built by vIOMMU
>>> will be re-build in the target guest via pci_setup_iommu_ops?
>>
>> AFAIU It is supposed to be done post loading the vIOMMU vmstate when enabling
>> the vIOMMU related MRs. And when walking the different 'emulated' address spaces
>>  it will replay all mappings (and skip non-present parts of the address space).
>>
>> The trick in making this work largelly depends on individual vIOMMU
>> implementation (and this emulated vIOMMU stuff shouldn't be confused with IOMMU
>> nesting btw!). In intel case (and AMD will be similar) the root table pointer
>> that's part of the vmstate has all the device pagetables, which is just guest
>> memory that gets migrated over and enough to resolve VT-d/IVRS page walks.
>>
>> The somewhat hard to follow part is that when it replays it walks all the whole
>> DMAR memory region and only notifies IOMMU MR listeners if there's a present PTE
>> or skip it. So at the end of the enabling of MRs the IOTLB gets reconstructed.
>> Though you would have to try to understand the flow with the vIOMMU you are using.
>>
>> The replay in intel-iommu is triggered more or less this stack trace for a
>> present PTE:
>>
>> [...]
> 
> Thanks Joao for the info
> 
> Sorry, some more questions,
> 
> When src boots up, the guest kernel will send commands to qemu.
> qemu will consume these commands, and trigger
> 
> smmuv3_cmdq_consume
> smmu_realloc_veventq
> smmuv3_cmdq_consume
> smmuv3_cmdq_consume SMMU_CMD_CFGI_STE
> smmuv3_install_nested_ste
> iommufd_backend_alloc_hwpt
> host_iommu_device_iommufd_attach_hwpt
> 
> After live-migration, the dst does not get these commands, so it does
> not call smmuv3_install_nested_ste etc.
> so the dma page table is not set up and the kernel reports errors.
> 
> Not sure if we need to set up these commands in the dst, or directly
> copy the existing page table from src to the dst.

Whatever constructs the 'root' of the device page table (the STE entry?)
that allows page table walks to be resolved needs to be sent as part of
the device state.

And then the post-load callback is what mirrors the 'present PTEs' into
the host IOMMU pagetable(s). At a quick glance, this guest information
looks to be migrated via ::strtab_base and ::strtab_base_cfg of the
smmuv3 vmstate.

Your problem seems to be that nothing is loading it (in the post-load),
and thus nothing walks the whole (migrated) thing to reconstruct what
you had on the source.

Re: [PATCH v4 00/15] vfio: VFIO migration support with vIOMMU
Posted by Zhangfei Gao 1 year, 2 months ago
Hi, Joao

On Fri, Jun 23, 2023 at 5:51 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> Hey,
>
> This series introduces support for vIOMMU with VFIO device migration,
> particurlarly related to how we do the dirty page tracking.
>
> Today vIOMMUs serve two purposes: 1) enable interrupt remaping 2)
> provide dma translation services for guests to provide some form of
> guest kernel managed DMA e.g. for nested virt based usage; (1) is specially
> required for big VMs with VFs with more than 255 vcpus. We tackle both
> and remove the migration blocker when vIOMMU is present provided the
> conditions are met. I have both use-cases here in one series, but I am happy
> to tackle them in separate series.
>
> As I found out we don't necessarily need to expose the whole vIOMMU
> functionality in order to just support interrupt remapping. x86 IOMMUs
> on Windows Server 2018[2] and Linux >=5.10, with qemu 7.1+ (or really
> Linux guests with commit c40aaaac10 and since qemu commit 8646d9c773d8)
> can instantiate a IOMMU just for interrupt remapping without needing to
> be advertised/support DMA translation. AMD IOMMU in theory can provide
> the same, but Linux doesn't quite support the IR-only part there yet,
> only intel-iommu.
>
> The series is organized as following:
>
> Patches 1-5: Today we can't gather vIOMMU details before the guest
> establishes their first DMA mapping via the vIOMMU. So these first four
> patches add a way for vIOMMUs to be asked of their properties at start
> of day. I choose the least churn possible way for now (as opposed to a
> treewide conversion) and allow easy conversion a posteriori. As
> suggested by Peter Xu[7], I have ressurected Yi's patches[5][6] which
> allows us to fetch PCI backing vIOMMU attributes, without necessarily
> tieing the caller (VFIO or anyone else) to an IOMMU MR like I
> was doing in v3.
>
> Patches 6-8: Handle configs with vIOMMU interrupt remapping but without
> DMA translation allowed. Today the 'dma-translation' attribute is
> x86-iommu only, but the way this series is structured nothing stops from
> other vIOMMUs supporting it too as long as they use
> pci_setup_iommu_ops() and the necessary IOMMU MR get_attr attributes
> are handled. The blocker is thus relaxed when vIOMMUs are able to toggle
> the toggle/report DMA_TRANSLATION attribute. With the patches up to this set,
> we've then tackled item (1) of the second paragraph.
>
> Patches 9-15: Simplified a lot from v2 (patch 9) to only track the complete
> IOVA address space, leveraging the logic we use to compose the dirty ranges.
> The blocker is once again relaxed for vIOMMUs that advertise their IOVA
> addressing limits. This tackles item (2). So far I mainly use it with
> intel-iommu, although I have a small set of patches for virtio-iommu per
> Alex's suggestion in v2.
>
> Comments, suggestions welcome. Thanks for the review!
>
> Regards,
>         Joao
>
> Changes since v3[8]:
> * Pick up Yi's patches[5][6], and rework the first four patches.
>   These are a bit better splitted, and make the new iommu_ops *optional*
>   as opposed to a treewide conversion. Rather than returning an IOMMU MR
>   and let VFIO operate on it to fetch attributes, we instead let the
>   underlying IOMMU driver fetch the desired IOMMU MR and ask for the
>   desired IOMMU attribute. Callers only care about PCI Device backing
>   vIOMMU attributes regardless of its topology/association. (Peter Xu)
>   These patches are a bit better splitted compared to original ones,
>   and I've kept all the same authorship and note the changes from
>   original where applicable.
> * Because of the rework of the first four patches, switch to
>   individual attributes in the VFIOSpace that track dma_translation
>   and the max_iova. All are expected to be unused when zero to retain
>   the defaults of today in common code.
> * Improve the migration blocker message of the last patch to be
>   more obvious that vIOMMU migration blocker is added when no vIOMMU
>   address space limits are advertised. (Patch 15)
> * Cast to uintptr_t in IOMMUAttr data in intel-iommu (Philippe).
> * Switch to MAKE_64BIT_MASK() instead of plain left shift (Philippe).
> * Change diffstat of patches with scripts/git.orderfile (Philippe).
>
> Changes since v2[3]:
> * New patches 1-9 to handle vIOMMUs without DMA translation, and to
> introduce ways to query various IOMMU model attributes via the IOMMU MR.
> This partly addresses a comment on previous versions that we can't
> access the IOMMU MR before the DMA mapping happens: before this series,
> the vfio giommu_list only tracks 'mapped GIOVA', which is controlled by
> the guest. This version also better handles IOMMU usage for
> interrupt-remapping-only purposes.
> * Dropped Peter Xu ack on patch 9 given that the code changed a bit.
> * Adjust patch 14 to account for the VFIO bitmaps no longer being pointers.
> * The patches that existed in v2 of vIOMMU dirty tracking are mostly
> untouched, except patch 12, which was greatly simplified.
>
> Changes since v1[4]:
> - Rebased on latest master branch. As part of it, made some changes in
>   pre-copy to adjust it to Juan's new patches:
>   1. Added a new patch that passes threshold_size parameter to
>      .state_pending_{estimate,exact}() handlers.
>   2. Added a new patch that refactors vfio_save_block().
>   3. Changed the pre-copy patch to cache and report pending pre-copy
>      size in the .state_pending_estimate() handler.
> - Removed unnecessary P2P code. This should be added later on when P2P
>   support is added. (Alex)
> - Moved the dirty sync to be after the DMA unmap in vfio_dma_unmap()
>   (patch #11). (Alex)
> - Stored vfio_devices_all_device_dirty_tracking()'s value in a local
>   variable in vfio_get_dirty_bitmap() so it can be re-used (patch #11).
> - Refactored the viommu device dirty tracking ranges creation code to
>   make it clearer (patch #15).
> - Changed overflow check in vfio_iommu_range_is_device_tracked() to
>   emphasize that we specifically check for 2^64 wrap around (patch #15).
> - Added R-bs / Acks.
>
> [0] https://lore.kernel.org/qemu-devel/20230222174915.5647-1-avihaih@nvidia.com/
> [1] https://lore.kernel.org/qemu-devel/c66d2d8e-f042-964a-a797-a3d07c260a3b@oracle.com/
> [2] https://learn.microsoft.com/en-us/windows-hardware/design/device-experiences/oem-kernel-dma-protection
> [3] https://lore.kernel.org/qemu-devel/20230222174915.5647-1-avihaih@nvidia.com/
> [4] https://lore.kernel.org/qemu-devel/20230126184948.10478-1-avihaih@nvidia.com/
> [5] https://lore.kernel.org/all/20210302203827.437645-5-yi.l.liu@intel.com/
> [6] https://lore.kernel.org/all/20210302203827.437645-6-yi.l.liu@intel.com/
> [7] https://lore.kernel.org/qemu-devel/ZH9Kr6mrKNqUgcYs@x1n/
> [8] https://lore.kernel.org/qemu-devel/20230530175937.24202-1-joao.m.martins@oracle.com/
>
> Avihai Horon (4):
>   memory/iommu: Add IOMMU_ATTR_MAX_IOVA attribute
>   intel-iommu: Implement IOMMU_ATTR_MAX_IOVA get_attr() attribute
>   vfio/common: Extract vIOMMU code from vfio_sync_dirty_bitmap()
>   vfio/common: Optimize device dirty page tracking with vIOMMU
>
> Joao Martins (7):
>   memory/iommu: Add IOMMU_ATTR_DMA_TRANSLATION attribute
>   intel-iommu: Implement get_attr() method
>   vfio/common: Track whether DMA Translation is enabled on the vIOMMU
>   vfio/common: Relax vIOMMU detection when DMA translation is off
>   vfio/common: Move dirty tracking ranges update to helper
>   vfio/common: Support device dirty page tracking with vIOMMU
>   vfio/common: Block migration with vIOMMUs without address width limits
>
> Yi Liu (4):
>   hw/pci: Add a pci_setup_iommu_ops() helper
>   hw/pci: Refactor pci_device_iommu_address_space()
>   hw/pci: Introduce pci_device_iommu_get_attr()
>   intel-iommu: Switch to pci_setup_iommu_ops()
>

Would you mind pointing to the github address?
I have some conflicts, and the github tree would be very helpful.


Thanks
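[Editorial aside: the policy the cover letter describes -- relax the migration blocker for IR-only vIOMMUs (item 1), or for vIOMMUs that advertise their IOVA limits (item 2) -- can be modelled in a small standalone sketch. The enum values mirror the series' IOMMU_ATTR_* attributes, but the function names and callback shape here are illustrative, not the actual QEMU code.]

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative model, not the actual QEMU code: a vIOMMU optionally
 * implements get_attr(), and VFIO queries per-device attributes to
 * decide whether to install the migration blocker. */
typedef enum {
    IOMMU_ATTR_DMA_TRANSLATION, /* 0 = IR-only, guest does no DMA remap */
    IOMMU_ATTR_MAX_IOVA,        /* highest guest IOVA the vIOMMU can map */
} IOMMUAttr;

typedef struct {
    bool (*get_attr)(IOMMUAttr attr, uint64_t *out); /* false = unsupported */
} IOMMUOps;

/* intel-iommu-like backend: dma-translation=off, 48-bit IOVA space. */
static bool model_get_attr(IOMMUAttr attr, uint64_t *out)
{
    switch (attr) {
    case IOMMU_ATTR_DMA_TRANSLATION:
        *out = 0;
        return true;
    case IOMMU_ATTR_MAX_IOVA:
        *out = (1ULL << 48) - 1;
        return true;
    }
    return false;
}

/* Blocker policy from the cover letter: item (1) relaxes it when DMA
 * translation is off; item (2) when an IOVA limit is advertised, so
 * the whole IOVA space can be dirty-tracked. Otherwise keep blocking. */
static bool migration_blocked(const IOMMUOps *ops)
{
    uint64_t val;

    if (!ops->get_attr) {
        return true;  /* vIOMMU advertises nothing: keep the blocker */
    }
    if (ops->get_attr(IOMMU_ATTR_DMA_TRANSLATION, &val) && !val) {
        return false; /* IR-only vIOMMU */
    }
    if (ops->get_attr(IOMMU_ATTR_MAX_IOVA, &val) && val) {
        return false; /* bounded IOVA space: trackable */
    }
    return true;
}
```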
Re: [PATCH v4 00/15] vfio: VFIO migration support with vIOMMU
Posted by Joao Martins 1 year, 2 months ago
On 28/11/2024 03:19, Zhangfei Gao wrote:
> Hi, Joao
> 
> On Fri, Jun 23, 2023 at 5:51 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> [v4 cover letter quoted in full; snipped -- see above]
>>
> 
>> Would you mind pointing to the github address?
>> I have some conflicts, and the github tree would be very helpful.

Yeap, I have a series -- picking up from Cédric's rebase since the 9.1 soft
freeze -- but testing is still in progress.

Give me a couple of days and I'll respond here, as there are a few more
changes on top (now that we have IOMMUFD support) that will go into v5.

Re: [PATCH v4 00/15] vfio: VFIO migration support with vIOMMU
Posted by Joao Martins 1 year ago
On 28/11/2024 18:29, Joao Martins wrote:
> On 28/11/2024 03:19, Zhangfei Gao wrote:
>> Hi, Joao
>>
>> On Fri, Jun 23, 2023 at 5:51 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>
>>> [v4 cover letter quoted in full; snipped -- see above]
>>>
>>
>> Would you mind pointing to the github address?
>> I have some conflicts, and the github tree would be very helpful.
> 
> Yeap, I have a series -- picking up from Cédric's rebase since the 9.1 soft
> freeze -- but testing is still in progress.
> 
> Give me a couple of days and I'll respond here, as there are a few more
> changes on top (now that we have IOMMUFD support) that will go into v5.

Here is the WIP (there are still two wrinkles left):

	https://github.com/jpemartins/qemu/commits/vfio-migration-viommu/

The first four patches relax the live-migration blocker for vIOMMU when
IOMMUFD dirty tracking is in use. The rest is roughly this series, which
optimizes things a bit, though it's mostly useful for VF dirty tracking.
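[Editorial aside: on the dirty-tracking side, the overflow check called out in the v1 changelog -- vfio_iommu_range_is_device_tracked() emphasizing the 2^64 wrap-around -- boils down to the following. The function name and shape here are an illustrative stand-in, not the actual hw/vfio/common.c helper.]

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the check described in the changelog: a guest IOVA range
 * [iova, iova + size) is only coverable by device dirty tracking if
 * computing its last address does not wrap around 2^64 and that last
 * address stays within the limit the vIOMMU advertised. */
static bool range_is_device_tracked(uint64_t iova, uint64_t size,
                                    uint64_t max_iova)
{
    if (size == 0) {
        return false;
    }
    /* iova + size - 1 wraps past 2^64 iff size - 1 > UINT64_MAX - iova */
    if (size - 1 > UINT64_MAX - iova) {
        return false;
    }
    return iova + size - 1 <= max_iova;
}
```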

Re: [PATCH v4 00/15] vfio: VFIO migration support with vIOMMU
Posted by Cédric Le Goater 1 year, 8 months ago
Hello Joao,

On 6/22/23 23:48, Joao Martins wrote:
> [v4 cover letter quoted in full; snipped -- see above]


I spent some time refreshing your series on upstream QEMU (see [1]) and
gave migration a try with a CX-7 VF. LGTM. It doesn't seem we are far
from acceptance in QEMU 9.1. Are we?

First, I will resend these with the changes I made:

   vfio/common: Extract vIOMMU code from vfio_sync_dirty_bitmap()
   vfio/common: Move dirty tracking ranges update to helper()

I guess the PCIIOMMUOps::get_iommu_attr needs a close review. Is
IOMMU_ATTR_DMA_TRANSLATION a must-have?

The rest is mostly VFIO internals for dirty tracking.

Thanks,

C.

[1] https://github.com/legoater/qemu/commits/vfio-9.1


Re: [PATCH v4 00/15] vfio: VFIO migration support with vIOMMU
Posted by Joao Martins 1 year, 8 months ago
On 06/06/2024 16:43, Cédric Le Goater wrote:
> Hello Joao,
> 
> On 6/22/23 23:48, Joao Martins wrote:
>> [v4 cover letter quoted in full; snipped -- see above]
> 
> 
> I spent some time refreshing your series on upstream QEMU (See [1]) and
> gave migration a try with a CX-7 VF. LGTM. It doesn't seem we are far
> from acceptance in QEMU 9.1. Are we ?
> 
Yeah.

There was a comment from Zhenzhong on vfio_viommu_preset() here[0]. I went back
to remind myself what it was that we had to change, but even after re-reading
the thread I can't spot any flaw that needs changing.

[0]
https://lore.kernel.org/qemu-devel/de2b72d2-f56b-9350-ce0f-70edfb58eff5@intel.com/#r

> First, I will resend these with the changes I made :
> 
>   vfio/common: Extract vIOMMU code from vfio_sync_dirty_bitmap()
>   vfio/common: Move dirty tracking ranges update to helper()
> 
> I guess the PCIIOMMUOps::get_iommu_attr needs a close review. Is
> IOMMU_ATTR_DMA_TRANSLATION a must have ?
> 
It's sort of the 'correct' way of relaxing the vIOMMU checks, because you are 100%
guaranteed that the guest won't do DMA. The other outstanding thing related to
that, for older kernels, is to use the direct map for dirty page tracking; but
the moment a mapping is attempted, the migration doesn't start, or if it's in
progress it gets aborted[*]:

https://lore.kernel.org/qemu-devel/20230908120521.50903-1-joao.m.martins@oracle.com/

The above link and DMA_TRANSLATION are mostly for the use-case where we care
about the vIOMMU for interrupt remapping only and need no DMA translation services.
But we can't just disable dma-translation in qemu, because that may crash older
kernels, so this way it supports both old and new.

[*] Recently I noticed you improved error reporting, so
vfio_set_migration_error(-EOPNOTSUPP) probably has a better way of getting there.
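To illustrate the idea, here is a minimal sketch of when the vIOMMU migration blocker can be relaxed, per the discussion above. The types and helper below are hypothetical stand-ins, not QEMU's actual code:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the vIOMMU facts VFIO can query at start of day. */
typedef struct {
    bool present;          /* a vIOMMU is configured for the VM */
    bool dma_translation;  /* DMA_TRANSLATION attribute reports "on" */
    uint64_t max_iova;     /* advertised IOVA limit, 0 = not advertised */
} VIOMMUInfo;

static bool viommu_blocks_migration(const VIOMMUInfo *v)
{
    if (!v->present) {
        return false;  /* no vIOMMU at all: nothing to block on */
    }
    if (!v->dma_translation) {
        return false;  /* IR-only vIOMMU: guest is guaranteed not to do DMA */
    }
    /* DMA translation is on: dirty tracking can still cover the whole
     * IOVA space, but only if the vIOMMU advertises its address limits. */
    return v->max_iova == 0;
}
```

The sketch mirrors the two relaxations the series introduces: IR-only configurations migrate unconditionally, and translating vIOMMUs migrate only when their IOVA limits are known.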

> The rest is mostly VFIO internals for dirty tracking.
> 
Right.

I got derailed by other work, and by stuff required for iommu dirty tracking,
and forgot about these patches, sorry.

> Thanks,
> 
> C.
> 
> [1] https://github.com/legoater/qemu/commits/vfio-9.1
> 
> 
>>
>> Regards,
>>     Joao
>>
>> Changes since v3[8]:
>> * Pick up Yi's patches[5][6], and rework the first four patches.
>>    These are a bit better split, and make the new iommu_ops *optional*
>>    as opposed to a treewide conversion. Rather than returning an IOMMU MR
>>    and letting VFIO operate on it to fetch attributes, we instead let the
>>    underlying IOMMU driver fetch the desired IOMMU MR and ask for the
>>    desired IOMMU attribute. Callers only care about the PCI device's backing
>>    vIOMMU attributes regardless of its topology/association. (Peter Xu)
>>    These patches are better split compared to the original ones,
>>    and I've kept the same authorship and noted the changes from the
>>    original where applicable.
>> * Because of the rework of the first four patches, switch to
>>    individual attributes in the VFIOSpace that track dma_translation
>>    and the max_iova. All are expected to be unused when zero to retain
>>    the defaults of today in common code.
>> * Improve the migration blocker message of the last patch to be
>>    more obvious that vIOMMU migration blocker is added when no vIOMMU
>>    address space limits are advertised. (Patch 15)
>> * Cast to uintptr_t in IOMMUAttr data in intel-iommu (Philippe).
>> * Switch to MAKE_64BIT_MASK() instead of plain left shift (Philippe).
>> * Change diffstat of patches with scripts/git.orderfile (Philippe).
>>
>> Changes since v2[3]:
>> * New patches 1-9 to be able to handle vIOMMUs without DMA translation, and
>> to introduce ways to query various IOMMU model attributes via the IOMMU MR. This
>> is partly meant to address a comment on previous versions, namely that we can't
>> access the IOMMU MR prior to the DMA mapping happening: before this series,
>> vfio's giommu_list only tracks 'mapped GIOVA', and that is controlled by the
>> guest. It also better tackles IOMMU usage for interrupt-remapping-only
>> purposes.
>> * Dropped Peter Xu ack on patch 9 given that the code changed a bit.
>> * Adjust patch 14 to adjust for the VFIO bitmaps no longer being pointers.
>> * The patches that existed in v2 of vIOMMU dirty tracking are mostly
>> untouched, except patch 12, which was greatly simplified.
>>
>> Changes since v1[4]:
>> - Rebased on latest master branch. As part of it, made some changes in
>>    pre-copy to adjust it to Juan's new patches:
>>    1. Added a new patch that passes threshold_size parameter to
>>       .state_pending_{estimate,exact}() handlers.
>>    2. Added a new patch that refactors vfio_save_block().
>>    3. Changed the pre-copy patch to cache and report pending pre-copy
>>       size in the .state_pending_estimate() handler.
>> - Removed unnecessary P2P code. This should be added later on when P2P
>>    support is added. (Alex)
>> - Moved the dirty sync to be after the DMA unmap in vfio_dma_unmap()
>>    (patch #11). (Alex)
>> - Stored vfio_devices_all_device_dirty_tracking()'s value in a local
>>    variable in vfio_get_dirty_bitmap() so it can be re-used (patch #11).
>> - Refactored the viommu device dirty tracking ranges creation code to
>>    make it clearer (patch #15).
>> - Changed overflow check in vfio_iommu_range_is_device_tracked() to
>>    emphasize that we specifically check for 2^64 wrap around (patch #15).
>> - Added R-bs / Acks.
>>
>> [0] https://lore.kernel.org/qemu-devel/20230222174915.5647-1-avihaih@nvidia.com/
>> [1]
>> https://lore.kernel.org/qemu-devel/c66d2d8e-f042-964a-a797-a3d07c260a3b@oracle.com/
>> [2]
>> https://learn.microsoft.com/en-us/windows-hardware/design/device-experiences/oem-kernel-dma-protection
>> [3] https://lore.kernel.org/qemu-devel/20230222174915.5647-1-avihaih@nvidia.com/
>> [4] https://lore.kernel.org/qemu-devel/20230126184948.10478-1-avihaih@nvidia.com/
>> [5] https://lore.kernel.org/all/20210302203827.437645-5-yi.l.liu@intel.com/
>> [6] https://lore.kernel.org/all/20210302203827.437645-6-yi.l.liu@intel.com/
>> [7] https://lore.kernel.org/qemu-devel/ZH9Kr6mrKNqUgcYs@x1n/
>> [8]
>> https://lore.kernel.org/qemu-devel/20230530175937.24202-1-joao.m.martins@oracle.com/
>>
>> Avihai Horon (4):
>>    memory/iommu: Add IOMMU_ATTR_MAX_IOVA attribute
>>    intel-iommu: Implement IOMMU_ATTR_MAX_IOVA get_attr() attribute
>>    vfio/common: Extract vIOMMU code from vfio_sync_dirty_bitmap()
>>    vfio/common: Optimize device dirty page tracking with vIOMMU
>>
>> Joao Martins (7):
>>    memory/iommu: Add IOMMU_ATTR_DMA_TRANSLATION attribute
>>    intel-iommu: Implement get_attr() method
>>    vfio/common: Track whether DMA Translation is enabled on the vIOMMU
>>    vfio/common: Relax vIOMMU detection when DMA translation is off
>>    vfio/common: Move dirty tracking ranges update to helper
>>    vfio/common: Support device dirty page tracking with vIOMMU
>>    vfio/common: Block migration with vIOMMUs without address width limits
>>
>> Yi Liu (4):
>>    hw/pci: Add a pci_setup_iommu_ops() helper
>>    hw/pci: Refactor pci_device_iommu_address_space()
>>    hw/pci: Introduce pci_device_iommu_get_attr()
>>    intel-iommu: Switch to pci_setup_iommu_ops()
>>
>>   include/exec/memory.h         |   4 +-
>>   include/hw/pci/pci.h          |  11 ++
>>   include/hw/pci/pci_bus.h      |   1 +
>>   include/hw/vfio/vfio-common.h |   2 +
>>   hw/i386/intel_iommu.c         |  53 +++++++-
>>   hw/pci/pci.c                  |  58 +++++++-
>>   hw/vfio/common.c              | 241 ++++++++++++++++++++++++++--------
>>   hw/vfio/pci.c                 |  22 +++-
>>   8 files changed, 329 insertions(+), 63 deletions(-)
>>
> 


Re: [PATCH v4 00/15] vfio: VFIO migration support with vIOMMU
Posted by Cédric Le Goater 1 year, 8 months ago
On 6/7/24 5:10 PM, Joao Martins wrote:
> On 06/06/2024 16:43, Cédric Le Goater wrote:
>> Hello Joao,
>>
>> On 6/22/23 23:48, Joao Martins wrote:
>>> [ ... ]
>>
>>
>> I spent some time refreshing your series on upstream QEMU (See [1]) and
>> gave migration a try with a CX-7 VF. LGTM. It doesn't seem we are far
>> from acceptance in QEMU 9.1. Are we ?
>>
> Yeah.
> 
> There was a comment from Zhenzhong on vfio_viommu_preset() here[0]. I went back
> to remind myself what it was that we had to change, but even after re-reading
> the thread I can't spot any flaw that needs changing.
> 
> [0]
> https://lore.kernel.org/qemu-devel/de2b72d2-f56b-9350-ce0f-70edfb58eff5@intel.com/#r

I introduced a vfio_devices_all_viommu_preset() routine to check all devices
in a container, and a simplified version of vfio_viommu_get_max_iova()
returning the address space's max_iova.
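A sketch of what such an all-devices helper might look like; the types below are illustrative stand-ins, not QEMU's actual VFIODevice/container structures:

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-in for a container's singly-linked device list. */
typedef struct Device {
    bool viommu_preset;   /* device sits behind a vIOMMU */
    struct Device *next;
} Device;

/* True only when every device in the container has a vIOMMU preset,
 * mirroring the all-devices check described above. */
static bool devices_all_viommu_preset(const Device *head)
{
    for (const Device *d = head; d; d = d->next) {
        if (!d->viommu_preset) {
            return false;
        }
    }
    return true;  /* vacuously true for an empty list */
}
```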


>> First, I will resend these with the changes I made :
>>
>>    vfio/common: Extract vIOMMU code from vfio_sync_dirty_bitmap()
>>    vfio/common: Move dirty tracking ranges update to helper()
>>
>> I guess the PCIIOMMUOps::get_iommu_attr needs a close review. Is
>> IOMMU_ATTR_DMA_TRANSLATION a must have ?
>>
> It's sort of the 'correct' way of relaxing the vIOMMU checks, because you are 100%
> guaranteed that the guest won't do DMA. The other outstanding thing related to
> that, for older kernels, is to use the direct map for dirty page tracking; but
> the moment a mapping is attempted, the migration doesn't start, or if it's in
> progress it gets aborted[*]:
> 
> https://lore.kernel.org/qemu-devel/20230908120521.50903-1-joao.m.martins@oracle.com/
> 
> The above link and DMA_TRANSLATION are mostly for the use-case where we care
> about the vIOMMU for interrupt remapping only and need no DMA translation services.
> But we can't just disable dma-translation in qemu, because that may crash older
> kernels, so this way it supports both old and new.
> 
> [*] Recently I noticed you improved error reporting, so
> vfio_set_migration_error(-EOPNOTSUPP) probably has a better way of getting there.

Yes. So I made a few more changes to improve vfio_dirty_tracking_init().

>> The rest is mostly VFIO internals for dirty tracking.
>>
> Right.
> 
> I got derailed by other work, and by stuff required for iommu dirty tracking,
> and forgot about these patches, sorry.

That's fine.

I am trying to sort out which patches could be merged in advance for QEMU 9.1,
and your series has shrunk a lot since it was last sent. I might resend the
whole series to cherry-pick the simple changes and get some R-b tags.


For the record, here is my watch list:

QEMU series under review:

* [v7] Add a host IOMMU device abstraction to check with vIOMMU
   https://lore.kernel.org/all/20240605083043.317831-1-zhenzhong.duan@intel.com

   Needs feedback on the PCI IOMMU ops. vIOMMU "iommufd" property ?
   Pushed on vfio-9.1 branch.
   
* [RFC v2] VIRTIO-IOMMU/VFIO: Fix host iommu geometry
   https://lore.kernel.org/all/20240607143905.765133-1-eric.auger@redhat.com
   
   Pushed on vfio-9.1 branch.

Need a resend :

* [v4] vfio: VFIO migration support with vIOMMU
   https://lore.kernel.org/qemu-devel/20230622214845.3980-1-joao.m.martins@oracle.com/
   
   Refreshed the patchset on upstream and pushed on vfio-9.1 branch.
   
* [RFCv2] vfio/iommufd: IOMMUFD Dirty Tracking
   https://lore.kernel.org/qemu-devel/20240212135643.5858-1-joao.m.martins@oracle.com/

* [v1] vfio: container: Fix missing allocation of VFIOSpaprContainer
   https://lore.kernel.org/qemu-devel/171528203026.8420.10620440513237875837.stgit@ltcd48-lp2.aus.stglabs.ibm.com/
   

Other interesting series (IOMMU related):

* [rfcv2] intel_iommu: Enable stage-1 translation for emulated device
   https://lore.kernel.org/all/20240522062313.453317-1-zhenzhong.duan@intel.com/

* [PATCH ats_vtd v2 00/25] ATS support for VT-d
   https://lore.kernel.org/all/20240515071057.33990-1-clement.mathieu--drif@eviden.com/

* [RFC v3] SMMUv3 nested translation support
   https://lore.kernel.org/qemu-devel/20240429032403.74910-1-smostafa@google.com/

* [PATCH RFCv1 00/14] Add Tegra241 (Grace) CMDQV Support (part 2/2)
   https://lore.kernel.org/all/cover.1712978212.git.nicolinc@nvidia.com/

   Yet to be sent,
   https://github.com/nicolinc/qemu/commits/wip/iommufd_vcmdq/

* [RFC] Multifd device state transfer support with VFIO consumer
   https://lore.kernel.org/all/cover.1713269378.git.maciej.szmigiero@oracle.com/


Thanks,

C.

  



Re: [PATCH v4 00/15] vfio: VFIO migration support with vIOMMU
Posted by Joao Martins 1 year, 7 months ago
[sorry for the delay, I was out last week]

On 10/06/2024 17:53, Cédric Le Goater wrote:
> On 6/7/24 5:10 PM, Joao Martins wrote:
>> On 06/06/2024 16:43, Cédric Le Goater wrote:
>>> Hello Joao,
>>>
>>> On 6/22/23 23:48, Joao Martins wrote:
>>>> [ ... ]
>>>
>>>
>>> I spent some time refreshing your series on upstream QEMU (See [1]) and
>>> gave migration a try with a CX-7 VF. LGTM. It doesn't seem we are far
>>> from acceptance in QEMU 9.1. Are we ?
>>>
>> Yeah.
>>
>> There was a comment from Zhenzhong on vfio_viommu_preset() here[0]. I went back
>> to remind myself what it was that we had to change, but even after re-reading
>> the thread I can't spot any flaw that needs changing.
>>
>> [0]
>> https://lore.kernel.org/qemu-devel/de2b72d2-f56b-9350-ce0f-70edfb58eff5@intel.com/#r
> 
> I introduced a vfio_devices_all_viommu_preset() routine to check all devices
> in a container and a simplified version of vfio_viommu_get_max_iova()
> returning the space max_iova.
>

/me nods

> 
>>> First, I will resend these with the changes I made :
>>>
>>>    vfio/common: Extract vIOMMU code from vfio_sync_dirty_bitmap()
>>>    vfio/common: Move dirty tracking ranges update to helper()
>>>
>>> I guess the PCIIOMMUOps::get_iommu_attr needs a close review. Is
>>> IOMMU_ATTR_DMA_TRANSLATION a must have ?
>>>
>> It's sort of the 'correct' way of relaxing the vIOMMU checks, because you are 100%
>> guaranteed that the guest won't do DMA. The other outstanding thing related to
>> that, for older kernels, is to use the direct map for dirty page tracking; but
>> the moment a mapping is attempted, the migration doesn't start, or if it's in
>> progress it gets aborted[*]:
>>
>> https://lore.kernel.org/qemu-devel/20230908120521.50903-1-joao.m.martins@oracle.com/
>>
>> The above link and DMA_TRANSLATION are mostly for the use-case where we care
>> about the vIOMMU for interrupt remapping only and need no DMA translation services.
>> But we can't just disable dma-translation in qemu, because that may crash older
>> kernels, so this way it supports both old and new.
>>
>> [*] Recently I noticed you improved error reporting, so
>> vfio_set_migration_error(-EOPNOTSUPP) probably has a better way of getting there.
> 
> Yes. So I made a few more changes to improve vfio_dirty_tracking_init().
> 
/me nods

>>> The rest is mostly VFIO internals for dirty tracking.
>>>
>> Right.
>>
>> I got derailed by other work, and by stuff required for iommu dirty tracking,
>> and forgot about these patches, sorry.
> 
> That's fine.
> 
> I am trying to sort out which patches could be merged in advance for QEMU 9.1
> and your series has shrunk a lot since it was last sent. I might resend the
> whole to cherry pick the simple changes and get some R-b tags.
> 

OK, sounds good.

> 
> For the record, here is my watch list:
> 
> QEMU series under review:
> 
> * [v7] Add a host IOMMU device abstraction to check with vIOMMU
>   https://lore.kernel.org/all/20240605083043.317831-1-zhenzhong.duan@intel.com
> 
>   Needs feedback on the PCI IOMMU ops. vIOMMU "iommufd" property ?

Part of the reason I suggested splitting it was to allow other things to
progress, as the IOMMU ops part is related to the nesting.

>   Pushed on vfio-9.1 branch.
>
> * [RFC v2] VIRTIO-IOMMU/VFIO: Fix host iommu geometry
>   https://lore.kernel.org/all/20240607143905.765133-1-eric.auger@redhat.com
>
>   Pushed on vfio-9.1 branch.
> 
> Need a resend :
> 
> * [v4] vfio: VFIO migration support with vIOMMU
>   https://lore.kernel.org/qemu-devel/20230622214845.3980-1-joao.m.martins@oracle.com/
>
>   Refreshed the patchset on upstream and pushed on vfio-9.1 branch.

/me nods

Probably deserves an item on the list too, related to this subject of
vIOMMU and migration, after the vIOMMU series is done:

* https://lore.kernel.org/qemu-devel/20230908120521.50903-1-joao.m.martins@oracle.com/

> * [RFCv2] vfio/iommufd: IOMMUFD Dirty Tracking
>   https://lore.kernel.org/qemu-devel/20240212135643.5858-1-joao.m.martins@oracle.com/
> 
I still plan on submitting a follow-up targeting 9.1, likely next week, with
Avihai's comments addressed, on top of the vfio-9.1 branch, after I send some
dirty tracking fixes on the kernel side. Though it is mostly to progress the
review, as I think I am still dependent on Zhenzhong's prep series for merging
because of this patch:
https://lore.kernel.org/all/20240605083043.317831-8-zhenzhong.duan@intel.com/

> * [v1] vfio: container: Fix missing allocation of VFIOSpaprContainer
>   https://lore.kernel.org/qemu-devel/171528203026.8420.10620440513237875837.stgit@ltcd48-lp2.aus.stglabs.ibm.com/
>
> Other interesting series (IOMMU related):
> 
> * [rfcv2] intel_iommu: Enable stage-1 translation for emulated device
>   https://lore.kernel.org/all/20240522062313.453317-1-zhenzhong.duan@intel.com/
> 
> * [PATCH ats_vtd v2 00/25] ATS support for VT-d
>  
> https://lore.kernel.org/all/20240515071057.33990-1-clement.mathieu--drif@eviden.com/
> 
> * [RFC v3] SMMUv3 nested translation support
>   https://lore.kernel.org/qemu-devel/20240429032403.74910-1-smostafa@google.com/
> 
> * [PATCH RFCv1 00/14] Add Tegra241 (Grace) CMDQV Support (part 2/2)
>   https://lore.kernel.org/all/cover.1712978212.git.nicolinc@nvidia.com/
> 
>   Yet to be sent,
>   https://github.com/nicolinc/qemu/commits/wip/iommufd_vcmdq/
> 
> * [RFC] Multifd device state transfer support with VFIO consumer
>   https://lore.kernel.org/all/cover.1713269378.git.maciej.szmigiero@oracle.com/
> 

I think Maciej (CC'ed) plans on submitting a (simplified, I think?) v2 of
this one shortly, still targeting 9.1. But it is largely migration-core changes,
with only the last two patches touching vfio.

	Joao

Re: [PATCH v4 00/15] vfio: VFIO migration support with vIOMMU
Posted by Cédric Le Goater 1 year, 7 months ago
[ ... ]

>> * [v4] vfio: VFIO migration support with vIOMMU
>>   
>> https://lore.kernel.org/qemu-devel/20230622214845.3980-1-joao.m.martins@oracle.com/
>>      Refreshed the patchset on upstream and pushed on vfio-9.1 branch.
> 
> /me nods Probably deserves an item on the list too related to this subject of
> vIOMMU and migration after the vIOMMU series is done:
> 
> *
> https://lore.kernel.org/qemu-devel/20230908120521.50903-1-joao.m.martins@oracle.com/
> 
>>    * [RFCv2] vfio/iommufd: IOMMUFD Dirty Tracking
>>   
>> https://lore.kernel.org/qemu-devel/20240212135643.5858-1-joao.m.martins@oracle.com/
>>
> I still plan on submitting a follow-up targeting 9.1, likely next week, with
> Avihai's comments addressed, on top of the vfio-9.1 branch, after I send some
> dirty tracking fixes on the kernel side. Though it is mostly to progress the
> review, as I think I am still dependent on Zhenzhong's prep series for merging
> because of this patch:
> https://lore.kernel.org/all/20240605083043.317831-8-zhenzhong.duan@intel.com/

This is ready to be pushed.

As soon as I get an ack, a nod, a smoke signal from the PCI maintainers
regarding the new PCIIOMMUOps callbacks, I will send a PR for:

   https://lore.kernel.org/all/20240522170107.289532-1-clg@redhat.com
   https://lore.kernel.org/all/20240605083043.317831-1-zhenzhong.duan@intel.com
   https://lore.kernel.org/all/20240614095402.904691-1-eric.auger@redhat.com
   https://lore.kernel.org/all/20240617063409.34393-1-clg@redhat.com

Thanks,

C.





Re: [PATCH v4 00/15] vfio: VFIO migration support with vIOMMU
Posted by Joao Martins 2 years, 5 months ago
On 22/06/2023 22:48, Joao Martins wrote:
> [ ... ]
> 
Cedric, you mentioned that you would take a look at this after you came back; not
sure if that's still the plan. But it's been a while since the last version, so
would you have me repost/rebase on the latest (post your PR)?

Additionally, I should say that I have an alternative (as a single small patch),
where vIOMMU usage is allowed ... but behind a VFIO command line option, and as
soon as *any* vIOMMU mapping is attempted we fail-to-start/block the migration. I
haven't posted that alternative because, early in the dirty tracking work, the
idea was to avoid making migration depend on guest vIOMMU usage (which made this
patchset the way it is). But I thought it was OK to bring it up again, given it
would only be allowed when the admin explicitly states such intent behind an x-
command line option.
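For reference, the IR-only configuration the cover letter describes can be instantiated along these lines (a sketch; `dma-translation` is an intel-iommu property available since QEMU 7.1, the VFIO host address is a placeholder, and the full VM options will differ):

```shell
# Interrupt-remapping-only vIOMMU: intremap stays on (needs split irqchip),
# while guest-visible DMA translation is turned off, so VFIO dirty page
# tracking is unaffected by guest IOVA mappings.
qemu-system-x86_64 \
    -machine q35,kernel-irqchip=split \
    -device intel-iommu,intremap=on,dma-translation=off \
    -device vfio-pci,host=0000:3b:00.1 \
    ...
```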


> Regards,
> 	Joao
> 
> Changes since v3[8]:
> * Pick up Yi's patches[5][6], and rework the first four patches.
>   These are a bit better splitted, and make the new iommu_ops *optional*
>   as opposed to a treewide conversion. Rather than returning an IOMMU MR
>   and let VFIO operate on it to fetch attributes, we instead let the
>   underlying IOMMU driver fetch the desired IOMMU MR and ask for the
>   desired IOMMU attribute. Callers only care about PCI Device backing
>   vIOMMU attributes regardless of its topology/association. (Peter Xu)
>   These patches are a bit better splitted compared to original ones,
>   and I've kept all the same authorship and note the changes from
>   original where applicable.
> * Because of the rework of the first four patches, switch to
>   individual attributes in the VFIOSpace that track dma_translation
>   and the max_iova. All are expected to be unused when zero to retain
>   the defaults of today in common code.
> * Improve the migration blocker message of the last patch to be
>   more obvious that vIOMMU migration blocker is added when no vIOMMU
>   address space limits are advertised. (Patch 15)
> * Cast to uintptr_t in IOMMUAttr data in intel-iommu (Philippe).
> * Switch to MAKE_64BIT_MASK() instead of plain left shift (Philippe).
> * Change diffstat of patches with scripts/git.orderfile (Philippe).
> 
> Changes since v2[3]:
> * New patches 1-9 to be able to handle vIOMMUs without DMA translation, and
> introduce ways to know various IOMMU model attributes via the IOMMU MR. This
> is partly meant to address a comment in previous versions where we can't
> access the IOMMU MR prior to the DMA mapping happening. Before this series
> vfio giommu_list is only tracking 'mapped GIOVA' and that controlled by the
> guest. As well as better tackling of the IOMMU usage for interrupt-remapping
> only purposes. 
> * Dropped Peter Xu ack on patch 9 given that the code changed a bit.
> * Adjust patch 14 to adjust for the VFIO bitmaps no longer being pointers.
> * The patches that existed in v2 of vIOMMU dirty tracking are mostly
> untouched, except patch 12, which was greatly simplified.
> 
> Changes since v1[4]:
> - Rebased on latest master branch. As part of it, made some changes in
>   pre-copy to adjust it to Juan's new patches:
>   1. Added a new patch that passes threshold_size parameter to
>      .state_pending_{estimate,exact}() handlers.
>   2. Added a new patch that refactors vfio_save_block().
>   3. Changed the pre-copy patch to cache and report pending pre-copy
>      size in the .state_pending_estimate() handler.
> - Removed unnecessary P2P code. This should be added later on when P2P
>   support is added. (Alex)
> - Moved the dirty sync to be after the DMA unmap in vfio_dma_unmap()
>   (patch #11). (Alex)
> - Stored vfio_devices_all_device_dirty_tracking()'s value in a local
>   variable in vfio_get_dirty_bitmap() so it can be re-used (patch #11).
> - Refactored the viommu device dirty tracking ranges creation code to
>   make it clearer (patch #15).
> - Changed overflow check in vfio_iommu_range_is_device_tracked() to
>   emphasize that we specifically check for 2^64 wrap around (patch #15).
> - Added R-bs / Acks.
> 
> [0] https://lore.kernel.org/qemu-devel/20230222174915.5647-1-avihaih@nvidia.com/
> [1] https://lore.kernel.org/qemu-devel/c66d2d8e-f042-964a-a797-a3d07c260a3b@oracle.com/
> [2] https://learn.microsoft.com/en-us/windows-hardware/design/device-experiences/oem-kernel-dma-protection
> [3] https://lore.kernel.org/qemu-devel/20230222174915.5647-1-avihaih@nvidia.com/
> [4] https://lore.kernel.org/qemu-devel/20230126184948.10478-1-avihaih@nvidia.com/
> [5] https://lore.kernel.org/all/20210302203827.437645-5-yi.l.liu@intel.com/
> [6] https://lore.kernel.org/all/20210302203827.437645-6-yi.l.liu@intel.com/
> [7] https://lore.kernel.org/qemu-devel/ZH9Kr6mrKNqUgcYs@x1n/
> [8] https://lore.kernel.org/qemu-devel/20230530175937.24202-1-joao.m.martins@oracle.com/
> 
> Avihai Horon (4):
>   memory/iommu: Add IOMMU_ATTR_MAX_IOVA attribute
>   intel-iommu: Implement IOMMU_ATTR_MAX_IOVA get_attr() attribute
>   vfio/common: Extract vIOMMU code from vfio_sync_dirty_bitmap()
>   vfio/common: Optimize device dirty page tracking with vIOMMU
> 
> Joao Martins (7):
>   memory/iommu: Add IOMMU_ATTR_DMA_TRANSLATION attribute
>   intel-iommu: Implement get_attr() method
>   vfio/common: Track whether DMA Translation is enabled on the vIOMMU
>   vfio/common: Relax vIOMMU detection when DMA translation is off
>   vfio/common: Move dirty tracking ranges update to helper
>   vfio/common: Support device dirty page tracking with vIOMMU
>   vfio/common: Block migration with vIOMMUs without address width limits
> 
> Yi Liu (4):
>   hw/pci: Add a pci_setup_iommu_ops() helper
>   hw/pci: Refactor pci_device_iommu_address_space()
>   hw/pci: Introduce pci_device_iommu_get_attr()
>   intel-iommu: Switch to pci_setup_iommu_ops()
> 
>  include/exec/memory.h         |   4 +-
>  include/hw/pci/pci.h          |  11 ++
>  include/hw/pci/pci_bus.h      |   1 +
>  include/hw/vfio/vfio-common.h |   2 +
>  hw/i386/intel_iommu.c         |  53 +++++++-
>  hw/pci/pci.c                  |  58 +++++++-
>  hw/vfio/common.c              | 241 ++++++++++++++++++++++++++--------
>  hw/vfio/pci.c                 |  22 +++-
>  8 files changed, 329 insertions(+), 63 deletions(-)
>
Re: [PATCH v4 00/15] vfio: VFIO migration support with vIOMMU
Posted by Cédric Le Goater 2 years, 5 months ago
Hello Joao,

> Cedric, you mentioned that you'd take a look at this after you came back; not
> sure if that's still the plan. But it's been a while since the last version,
> so would you have me repost/rebase on the latest (post your PR)?

Yes please. That's next on the TODO list (after some downstream work
regarding the postcopy crash). My rough plan for 8.2 is:

  * P2P
  * VIOMMU
  * dynamic MSI-X
  * fixes

I think it is a bit early for iommufd, and I will probably lack time.
The recent migration additions require attention in many areas.

> Additionally, I should say that I have an alternative (as a single small
> patch), where vIOMMU usage is allowed ... but behind a VFIO command line
> option, and as soon as the guest attempts *any* vIOMMU mapping we
> fail-to-start/block the migration. I hadn't posted that alternative because,
> early in the dirty tracking work, the idea was to avoid a guest vIOMMU usage
> dependency to allow migration (which made this patchset the way it is). But I
> thought it was OK to mention it again, if it would only be allowed when the
> admin explicitly states such intent behind an x- command line option.

I don't remember seeing it. It is worth resending as an RFC so that
people can comment.

Thanks,

C.
Re: [PATCH v4 00/15] vfio: VFIO migration support with vIOMMU
Posted by Joao Martins 2 years, 5 months ago
On 07/09/2023 13:40, Cédric Le Goater wrote:
> Hello Joao,
> 
>> Cedric, you mentioned that you'd take a look at this after you came back; not
>> sure if that's still the plan. But it's been a while since the last version,
>> so would you have me repost/rebase on the latest (post your PR)?
> 
> Yes please. That's next on the TODO list (after some downstream work
> regarding the postcopy crash). My rough plan for 8.2 is :
> 
>  * P2P
>  * VIOMMU
>  * dynamic MSI-X
>  * fixes
> 

Thanks for sharing

> I think it is a bit early for iommufd and I will probably lack time.
> The recent migration addition is requiring some attention in many
> areas.
> 
>> Additionally, I should say that I have an alternative (as a single small
>> patch), where vIOMMU usage is allowed ... but behind a VFIO command line
>> option, and as soon as the guest attempts *any* vIOMMU mapping we
>> fail-to-start/block the migration. I hadn't posted that alternative because,
>> early in the dirty tracking work, the idea was to avoid a guest vIOMMU usage
>> dependency to allow migration (which made this patchset the way it is). But I
>> thought it was OK to mention it again, if it would only be allowed when the
>> admin explicitly states such intent behind an x- command line option.
> 
> I don't remember seeing it. It is worth resending as an RFC so that
> people can comment.

I haven't sent it, largely because in the first versions of dirty tracking the
discussion revolved around whether or not we depend on guest vIOMMU usage
(passthrough or not) vs tracking something agnostic to the guest (raw IOVA
ranges).

In any case, I can send out the patch and move the discussion there on whether
it's a good idea or not (it's a simple patch).

	Joao

Re: [PATCH v4 00/15] vfio: VFIO migration support with vIOMMU
Posted by Joao Martins 2 years, 7 months ago
On 22/06/2023 22:48, Joao Martins wrote:
> Hey,
> 
> This series introduces support for vIOMMU with VFIO device migration,
> particularly related to how we do the dirty page tracking.
> 
> Today vIOMMUs serve two purposes: 1) enable interrupt remapping, and 2)
> provide DMA translation services for guests, giving some form of guest
> kernel managed DMA, e.g. for nested virt based usage; (1) is especially
> required for big VMs with VFs with more than 255 vcpus. We tackle both
> and remove the migration blocker when vIOMMU is present, provided the
> conditions are met. I have both use cases here in one series, but I am
> happy to tackle them in separate series.
> 
> As I found out, we don't necessarily need to expose the whole vIOMMU
> functionality in order to just support interrupt remapping. x86 IOMMUs
> on Windows Server 2018[2] and Linux >=5.10, with qemu 7.1+ (or really
> Linux guests with commit c40aaaac10 and since qemu commit 8646d9c773d8),
> can instantiate an IOMMU just for interrupt remapping, without needing
> to advertise/support DMA translation. The AMD IOMMU can in theory
> provide the same, but Linux doesn't quite support the IR-only part
> there yet; only intel-iommu does.
> 
> The series is organized as follows:
> 
> Patches 1-5: Today we can't gather vIOMMU details before the guest
> establishes its first DMA mapping via the vIOMMU. So these patches add
> a way for vIOMMUs to be asked about their properties at start of day. I
> chose the least-churn way for now (as opposed to a treewide conversion)
> and allow easy conversion a posteriori. As suggested by Peter Xu[7], I
> have resurrected Yi's patches[5][6], which allow us to fetch the PCI
> device's backing vIOMMU attributes without necessarily tying the caller
> (VFIO or anyone else) to an IOMMU MR like I was doing in v3.
> 
> Patches 6-8: Handle configs with vIOMMU interrupt remapping but without
> DMA translation allowed. Today the 'dma-translation' attribute is
> x86-iommu only, but the way this series is structured, nothing stops
> other vIOMMUs from supporting it too, as long as they use
> pci_setup_iommu_ops() and the necessary IOMMU MR get_attr attributes
> are handled. The blocker is thus relaxed when vIOMMUs are able to
> toggle/report the DMA_TRANSLATION attribute. With the patches up to
> this point, we've tackled item (1) of the second paragraph.
> 
> Patches 9-15: Simplified a lot from v2 (patch 9) to only track the complete
> IOVA address space, leveraging the logic we use to compose the dirty ranges.
> The blocker is once again relaxed for vIOMMUs that advertise their IOVA
> addressing limits. This tackles item (2). So far I mainly use it with
> intel-iommu, although I have a small set of patches for virtio-iommu per
> Alex's suggestion in v2.
> 
> Comments, suggestions welcome. Thanks for the review!
> 

By mistake, I spuriously sent this a little too early. There are some style
errors in patches 1, 6 and 10. I've fixed the problems already, but I won't
respin the series as I don't want to patch-bomb folks again. I will give it at
least a week or two before I do that. My apologies :/

Meanwhile, here's the diff of those fixes:

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 989993e303a6..7fad59126215 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3880,7 +3880,7 @@ static int vtd_iommu_get_attr(IOMMUMemoryRegion *iommu_mr,
     {
         hwaddr *max_iova = (hwaddr *)(uintptr_t) data;

-        *max_iova = MAKE_64BIT_MASK(0, s->aw_bits);;
+        *max_iova = MAKE_64BIT_MASK(0, s->aw_bits);
         break;
     }
     default:
@@ -4071,8 +4071,9 @@ static int vtd_get_iommu_attr(PCIBus *bus, void *opaque,
int32_t devfn,
     assert(0 <= devfn && devfn < PCI_DEVFN_MAX);

     vtd_as = vtd_find_add_as(s, bus, devfn, PCI_NO_PASID);
-    if (!vtd_as)
-       return -EINVAL;
+    if (!vtd_as) {
+        return -EINVAL;
+    }

     return memory_region_iommu_get_attr(&vtd_as->iommu, attr, data);
 }
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 91ba6f0927a4..0cf000a9c1ff 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2700,10 +2700,10 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
     pci_device_get_iommu_bus_devfn(dev, &bus, &iommu_bus, &devfn);
     if (!pci_bus_bypass_iommu(bus) && iommu_bus) {
         if (iommu_bus->iommu_fn) {
-           return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
+            return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
         } else if (iommu_bus->iommu_ops &&
                    iommu_bus->iommu_ops->get_address_space) {
-           return iommu_bus->iommu_ops->get_address_space(bus,
+            return iommu_bus->iommu_ops->get_address_space(bus,
                                            iommu_bus->iommu_opaque, devfn);
         }
     }