[PATCH v3 00/10] hw/vfio: IOMMUFD Dirty Tracking

Joao Martins posted 10 patches 4 months, 2 weeks ago
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/20240708143420.16953-1-joao.m.martins@oracle.com
Maintainers: Yi Liu <yi.l.liu@intel.com>, Eric Auger <eric.auger@redhat.com>, Zhenzhong Duan <zhenzhong.duan@intel.com>, Alex Williamson <alex.williamson@redhat.com>, "Cédric Le Goater" <clg@redhat.com>
[PATCH v3 00/10] hw/vfio: IOMMUFD Dirty Tracking
Posted by Joao Martins 4 months, 2 weeks ago
This small series adds support for IOMMU dirty tracking via the
IOMMUFD backend. The hardware capability is available on most recent x86
hardware. The series is organized as follows:

* Patch 1: Fixes a regression in mdev support with IOMMUFD. This
           one is independent of the series but surfaced while
           testing mdev with this series

* Patch 2: Adds capability reporting to iommufd_backend_get_device_info()

* Patches 3 - 7: IOMMUFD backend support for dirty tracking;

Introduce auto domains -- Patch 3 goes into more detail, but the gist is that
we will find and attach a device to a compatible IOMMU domain, or allocate a new
hardware pagetable *or* rely on kernel IOAS attach (for mdevs). Afterwards the
workflow is relatively simple:

1) Probe the device and request dirty tracking in the HWPT
2) Toggle dirty tracking on/off
3) Read and clear dirty IOVAs

The heuristic selected for (1) is to always request dirty tracking in the
HWPT if supported, and otherwise rely on device dirty page tracking. This
is a little simplistic, and we aren't necessarily using IOMMU dirty
tracking even if we ask for it during hwpt allocation.

The unmap case is deferred until further vIOMMU support with migration
is added[3], which will then introduce the use of
IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR in the GET_DIRTY_BITMAP ioctl in the
DMA unmap bitmap flow.
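Likewise, steps (2) and (3) map onto the UAPI below (again just a sketch;
in the series these sit behind VFIOIOMMUClass::set_dirty_tracking and
VFIOIOMMUClass::query_dirty_bitmap):

    #include <errno.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/iommufd.h>

    /* Step (2): toggle dirty tracking on the HWPT. */
    static int hwpt_set_dirty(int iommufd, uint32_t hwpt_id, bool start)
    {
        struct iommu_hwpt_set_dirty_tracking set = {
            .size = sizeof(set),
            .flags = start ? IOMMU_HWPT_DIRTY_TRACKING_ENABLE : 0,
            .hwpt_id = hwpt_id,
        };

        return ioctl(iommufd, IOMMU_HWPT_SET_DIRTY_TRACKING, &set) ? -errno : 0;
    }

    /* Step (3): read-and-clear dirty bits of [iova, iova + length). */
    static int hwpt_get_dirty_bitmap(int iommufd, uint32_t hwpt_id, uint64_t iova,
                                     uint64_t length, uint64_t page_size,
                                     uint64_t *bitmap)
    {
        struct iommu_hwpt_get_dirty_bitmap get = {
            .size = sizeof(get),
            .hwpt_id = hwpt_id,
            /*
             * IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR goes here to read without
             * clearing, as the dma unmap flow mentioned above will do.
             */
            .flags = 0,
            .iova = iova,
            .length = length,
            .page_size = page_size,
            .data = (uintptr_t)bitmap,
        };

        return ioctl(iommufd, IOMMU_HWPT_GET_DIRTY_BITMAP, &get) ? -errno : 0;
    }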

* Patches 8-10: Don't block live migration when there's no VF dirty
tracker, provided that IOMMU dirty tracking is available.
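In rough terms the policy is the following (sketch only; the
iommu_dirty_tracking field name is a placeholder rather than the literal
code, while dirty_pages_supported is the existing VF dirty tracking flag):

    /*
     * Block migration only when there is no dirty tracker at all: neither
     * the VF (VFIO_DEVICE_FEATURE_DMA_LOGGING_*) nor the IOMMU via IOMMUFD.
     */
    static bool vfio_dirty_tracking_available(VFIODevice *vbasedev)
    {
        return vbasedev->dirty_pages_supported ||   /* VF dirty tracker */
               vbasedev->iommu_dirty_tracking;      /* added by this series */
    }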

Comments and feedback appreciated.

Cheers,
    Joao

P.S. I suggest linux-next (or the future v6.11) as the hypervisor kernel, as
some bugs related to IOMMU hugepage dirty tracking have been fixed there.

Changes since RFCv2[4]:
* Always allocate hwpt with IOMMU_HWPT_ALLOC_DIRTY_TRACKING even if
we end up not actually toggling dirty tracking. (Avihai)
* Fix error handling widely in auto domains logic and all patches (Avihai)
* Reuse iommufd_backend_get_device_info() for capabilities (Zhenzhong)
* New patches 1 and 2 taking into consideration previous comments.
* Store hwpt::flags to know if we have dirty tracking (Avihai)
* New patch 8, which allows querying dirty tracking support after
provisioning. This is a cleaner way to check IOMMU dirty tracking support
when vfio::migration is initialized, as opposed to RFCv2's use of device
caps. The device caps path is still used because at vfio attach time the
migration state is not yet fully initialized.
* Adopt error propagation in query/set dirty tracking
* Misc improvements overall, broadly following Avihai's comments
* Drop the hugepages patch as it's a bit unrelated; I can pursue it
separately. The main motivation is to provide a way to test
without hugepages, similar to what vfio_type1_iommu.disable_hugepages=1
does.

Changes since RFCv1[2]:
* Remove intel/amd dirty tracking emulation enabling
* Remove the dirtyrate improvement for VF/IOMMU dirty tracking
[Will pursue these two in separate series]
* Introduce auto domains support
* Enforce dirty tracking following the IOMMUFD UAPI for this
* Add support for toggling hugepages in IOMMUFD
* Automatically use IOMMU dirty tracking when the VF supports migration
but doesn't have its own dirty tracking
* Add a parameter to toggle VF dirty tracking

[0] https://lore.kernel.org/qemu-devel/20240201072818.327930-1-zhenzhong.duan@intel.com/
[1] https://lore.kernel.org/qemu-devel/20240201072818.327930-10-zhenzhong.duan@intel.com/
[2] https://lore.kernel.org/qemu-devel/20220428211351.3897-1-joao.m.martins@oracle.com/
[3] https://lore.kernel.org/qemu-devel/20230622214845.3980-1-joao.m.martins@oracle.com/
[4] https://lore.kernel.org/qemu-devel/20240212135643.5858-1-joao.m.martins@oracle.com/

Joao Martins (10):
  vfio/iommufd: don't fail to realize on IOMMU_GET_HW_INFO failure
  backends/iommufd: Extend iommufd_backend_get_device_info() to fetch HW capabilities
  vfio/iommufd: Return errno in iommufd_cdev_attach_ioas_hwpt()
  vfio/iommufd: Introduce auto domain creation
  vfio/iommufd: Probe and request hwpt dirty tracking capability
  vfio/iommufd: Implement VFIOIOMMUClass::set_dirty_tracking support
  vfio/iommufd: Implement VFIOIOMMUClass::query_dirty_bitmap support
  vfio/iommufd: Parse hw_caps and store dirty tracking support
  vfio/migration: Don't block migration if device dirty tracking is unsupported
  vfio/common: Allow disabling device dirty page tracking

 include/hw/vfio/vfio-common.h      |  11 ++
 include/sysemu/host_iommu_device.h |   2 +
 include/sysemu/iommufd.h           |  12 +-
 backends/iommufd.c                 |  81 ++++++++++-
 hw/vfio/common.c                   |   3 +
 hw/vfio/iommufd.c                  | 217 +++++++++++++++++++++++++++--
 hw/vfio/migration.c                |   7 +-
 hw/vfio/pci.c                      |   3 +
 backends/trace-events              |   3 +
 9 files changed, 325 insertions(+), 14 deletions(-)

-- 
2.17.2
Re: [PATCH v3 00/10] hw/vfio: IOMMUFD Dirty Tracking
Posted by Cédric Le Goater 4 months, 2 weeks ago
Hello Joao,

On 7/8/24 4:34 PM, Joao Martins wrote:
> [...]


I am a bit confused with all the inline proposals. Would you mind
resending a v4 please?

Regarding my comments on error handling,

The error should be set in case of failure, which means a routine
cannot return 'false' or '-errno' without also setting the 'Error **'
parameter.

If the returned value needs to be interpreted in some ways, for a
retry or any reason, then it makes sense to use an int, else please
use a bool. This is to avoid random negative values being interpreted
as an errno when they are not.
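In code, the rule looks roughly like this (a generic illustration of the
QEMU convention, not a function from this series; FooDevice is a made-up
type):

    #include "qemu/osdep.h"
    #include "qapi/error.h"

    typedef struct FooDevice { bool ready; } FooDevice;

    /* Preferred shape: boolean success/failure, details go through @errp. */
    static bool foo_enable(FooDevice *dev, Error **errp)
    {
        if (!dev->ready) {
            error_setg(errp, "device is not ready");
            return false;
        }
        return true;
    }

    /*
     * Only when the caller must interpret the failure (retry, fallback, ...)
     * return an int -- and still set @errp on every error path.
     */
    static int foo_attach(FooDevice *dev, Error **errp)
    {
        if (!dev->ready) {
            error_setg_errno(errp, ENODEV, "cannot attach device");
            return -ENODEV;
        }
        return 0;
    }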

With VFIO migration support, low level errors (from the adapter FW
through the VFIO PCI variant driver) now reach the core migration
subsystem. It is preferable to propagate this error, possibly verbatim,
to the VMM, monitor or libvirt. It's not fully symmetric today because
the log_global_stop handler for dirty tracking enablement is not
addressed. Anyhow, an effort on error reporting needs to be made, and
any use of error_report() in a low level function is a sign that there
is room for improvement.

I think it would have value to probe early the host IOMMU device for
its HW features. If the results were cached in the HostIOMMUDevice
struct, it would then remove unnecessary and redundant calls to the
host kernel and avoid error handling in complex code paths. I hope
this is feasible. I haven't looked closely tbh.

We are reaching soft freeze, in ~10 days. There is a chance this
proposal could make it for 9.1.

Thanks,

C.
Re: [PATCH v3 00/10] hw/vfio: IOMMUFD Dirty Tracking
Posted by Joao Martins 4 months, 2 weeks ago
On 11/07/2024 08:41, Cédric Le Goater wrote:
> Hello Joao,
> 
> On 7/8/24 4:34 PM, Joao Martins wrote:
>> [...]
> 
> 
> I am a bit confused with all the inline proposals. Would you mind
> resending a v4 please?
> 

Yeap, I'll send it out today, or worst case tomorrow morning.

> Regarding my comments on error handling,
> 
> The error should be set in case of failure, which means a routine
> can not return 'false' or '-errno' and not setting 'Error **'
> parameter at the same time.
> 
> If the returned value needs to be interpreted in some ways, for a
> retry or any reason, then it makes sense to use an int, else please
> use a bool. This is to avoid random negative values being interpreted
> as an errno when they are not.
> 
OK, I'll retain the Error* creation even when expecting to test the errno.

> With VFIO migration support, low level errors (from the adapter FW
> through the VFIO PCI variant driver) now reach the core migration
> subsystem. It is preferable to propagate this error, possibly verbatim,
> to the VMM, monitor or libvirt. It's not fully symmetric today because
> the log_global_stop handler for dirty tracking enablement is not
> addressed. Anyhow, an effort on error reporting needs to be made, and
> any use of error_report() in a low level function is a sign that there
> is room for improvement.
> 
Gotcha. My earlier comment was mostly that it sounded like there was no place
for returning -errno, but it seems it's not that binary and the Error* is the
thing that really matters here.

> I think it would have value to probe early the host IOMMU device for
> its HW features. If the results were cached in the HostIOMMUDevice
> struct, it would then remove unnecessary and redundant calls to the
> host kernel and avoid error handling in complex code paths. I hope
> this is feasible. I haven't looked closely tbh.
> 
OK, I'll post in this series what I had inline[0], as that's what I did.

[0]
https://lore.kernel.org/qemu-devel/4e85db04-fbaa-4a6b-b133-59170c471e24@oracle.com/

The gotcha in my opinion is that I cache the IOMMUFD-specific data returned by
the GET_HW_INFO ioctl inside a new HostIOMMUDeviceCaps::iommufd. The reason is
that vfio_device_get_aw_bits() has a hidden assumption that the container is
already populated with the list of allowed IOVA ranges, which is not true for
the first device. So rather than have a partial set of caps initialized, I
essentially ended up fetching the raw caps and storing them, and serializing
the caps into named features (e.g. caps::aw_bits) in HostIOMMUDevice::realize().
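Roughly, what I ended up with looks like the sketch below (field names are
simplified and illustrative; the real definitions live in
include/sysemu/host_iommu_device.h and include/sysemu/iommufd.h, and
vfio_device_get_aw_bits() comes from hw/vfio):

    /* Illustrative layout only. */
    typedef struct HostIOMMUDeviceCaps {
        uint32_t type;
        uint8_t aw_bits;        /* named feature, derived at realize() time */
        struct {
            uint64_t hw_caps;   /* raw IOMMU_GET_HW_INFO out_capabilities */
        } iommufd;
    } HostIOMMUDeviceCaps;

    /* At attach time: only stash the raw data the kernel returned. */
    static void hiod_cache_raw_caps(HostIOMMUDeviceCaps *caps, uint64_t hw_caps)
    {
        caps->iommufd.hw_caps = hw_caps;
    }

    /*
     * At HostIOMMUDevice::realize() time the container already has its list
     * of IOVA ranges, so deriving named features such as aw_bits is safe.
     */
    static void hiod_serialize_caps(HostIOMMUDeviceCaps *caps,
                                    VFIODevice *vbasedev)
    {
        caps->aw_bits = vfio_device_get_aw_bits(vbasedev);
    }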

> We are reaching soft freeze, in ~10 days. There is a chance this
> proposal could make it for 9.1.
> 
> Thanks,
> 
> C.
> 


RE: [PATCH v3 00/10] hw/vfio: IOMMUFD Dirty Tracking
Posted by Duan, Zhenzhong 4 months, 2 weeks ago

>-----Original Message-----
>From: Joao Martins <joao.m.martins@oracle.com>
>Subject: Re: [PATCH v3 00/10] hw/vfio: IOMMUFD Dirty Tracking
>
>On 11/07/2024 08:41, Cédric Le Goater wrote:
>> Hello Joao,
>>
>> [...]
>> I think it would have value to probe early the host IOMMU device for
>> its HW features. If the results were cached in the HostIOMMUDevice
>> struct, it would then remove unnecessary and redundant calls to the
>> host kernel and avoid error handling in complex code paths. I hope
>> this is feasible. I haven't looked closely tbh.
>>
>OK, I'll post in this series what I had inline[0], as that's what I did.
>
>[0]
>https://lore.kernel.org/qemu-devel/4e85db04-fbaa-4a6b-b133-59170c471e24@oracle.com/
>
>The gotcha in my opinion is that I cache the IOMMUFD-specific data returned
>by the GET_HW_INFO ioctl inside a new HostIOMMUDeviceCaps::iommufd. The
>reason is that vfio_device_get_aw_bits() has a hidden assumption that the
>container is already populated with the list of allowed IOVA ranges, which
>is not true for the first device. So rather than have a partial set of caps
>initialized, I essentially ended up fetching the raw caps and storing them,
>and serializing the caps into named features (e.g. caps::aw_bits) in
>HostIOMMUDevice::realize().

Another way is to call vfio_device_get_aw_bits() and return its result
directly in get_cap(); then there is no need to initialize caps::aw_bits.
This way the host IOMMU device creation can be moved earlier, as Cédric
suggested.
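Something like below (the ::get_cap() signature and the cap constant are
what I assume from the HostIOMMUDevice interface, not a literal patch):

    static int hiod_vfio_get_cap(HostIOMMUDevice *hiod, int cap, Error **errp)
    {
        VFIODevice *vdev = hiod->agent;

        switch (cap) {
        case HOST_IOMMU_DEVICE_CAP_AW_BITS:
            /* Computed on demand, so caps::aw_bits never needs initializing. */
            return vfio_device_get_aw_bits(vdev);
        default:
            error_setg(errp, "%s: unsupported capability %x", hiod->name, cap);
            return -EINVAL;
        }
    }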

Thanks
Zhenzhong

>
>> We are reaching soft freeze, in ~10 days. There is a chance this
>> proposal could make it for 9.1.
>>
>> Thanks,
>>
>> C.
>>

Re: [PATCH v3 00/10] hw/vfio: IOMMUFD Dirty Tracking
Posted by Joao Martins 4 months, 2 weeks ago
On 11/07/2024 11:22, Duan, Zhenzhong wrote:
> 
> 
>> -----Original Message-----
>> From: Joao Martins <joao.m.martins@oracle.com>
>> Subject: Re: [PATCH v3 00/10] hw/vfio: IOMMUFD Dirty Tracking
>>
>> On 11/07/2024 08:41, Cédric Le Goater wrote:
>>> Hello Joao,
>>>
>>> [...]
>>> I think it would have value to probe early the host IOMMU device for
>>> its HW features. If the results were cached in the HostIOMMUDevice
>>> struct, it would then remove unnecessary and redundant calls to the
>>> host kernel and avoid error handling in complex code paths. I hope
>>> this is feasible. I haven't looked closely tbh.
>>>
>> OK, I'll post in this series what I had inline[0], as that's what I did.
>>
>> [0]
>> https://lore.kernel.org/qemu-devel/4e85db04-fbaa-4a6b-b133-59170c471e24@oracle.com/
>>
>> The gotcha in my opinion is that I cache the IOMMUFD-specific data returned
>> by the GET_HW_INFO ioctl inside a new HostIOMMUDeviceCaps::iommufd. The
>> reason is that vfio_device_get_aw_bits() has a hidden assumption that the
>> container is already populated with the list of allowed IOVA ranges, which
>> is not true for the first device. So rather than have a partial set of caps
>> initialized, I essentially ended up fetching the raw caps and storing them,
>> and serializing the caps into named features (e.g. caps::aw_bits) in
>> HostIOMMUDevice::realize().
> 
> Another way is to call vfio_device_get_aw_bits() and return its result
> directly in get_cap(); then there is no need to initialize caps::aw_bits.
> This way the host IOMMU device creation can be moved earlier, as Cédric
> suggested.

Oh, yes, that's a great alternative. Let me adopt that instead, so we don't
need to make such huge structural changes.