Currently the only efficient way to map a complex memory description through
the DMA API is by using the scatterlist APIs. The SG APIs are unique in that
they efficiently combine the two fundamental operations of sizing and allocating
a large IOVA window from the IOMMU and processing all the per-address
swiotlb/flushing/p2p/map details.
This uniqueness has been a long-standing pain point as the scatterlist API
is mandatory, but expensive to use. It prevents any kind of optimization or
feature improvement (such as avoiding struct page for P2P) because the
scatterlist itself cannot be improved.
Several approaches have been explored to expand the DMA API with additional
scatterlist-like structures (BIO[1], rlist[2]). Instead, this series splits up
the DMA API to allow callers to bring their own data structure.
The API is split up into two parts:
 - Allocate IOVA space:
    Do any pre-allocation required. This is done based on the caller
    supplying some details about how much IOMMU address space it would need
    in the worst case.
 - Map and unmap relevant structures to pre-allocated IOVA space:
    Perform the actual mapping into the pre-allocated IOVA. This is very
    similar to dma_map_page().
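
The two-step flow can be illustrated with a small user-space model. This is
only a sketch under stated assumptions: every toy_* name below is hypothetical
and is NOT the kernel interface added by this series, the "page table" is a
plain array, and a physical address of 0 is used to mean "unmapped". It shows
the window being reserved once and pages then being linked at offsets into it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u
#define TOY_MAX_PAGES 64u

/* Hypothetical caller-owned state; not a kernel structure. */
struct toy_iova_state {
	uint64_t iova_base;            /* start of the reserved window */
	size_t size;                   /* window size in bytes */
	uint64_t slot[TOY_MAX_PAGES];  /* toy page table: phys per page, 0 = unmapped */
};

/* Step 1: reserve one contiguous window sized for the worst case. */
static int toy_iova_alloc(struct toy_iova_state *st, uint64_t base, size_t size)
{
	if (size == 0 || size % TOY_PAGE_SIZE != 0 ||
	    size / TOY_PAGE_SIZE > TOY_MAX_PAGES)
		return -1;
	st->iova_base = base;
	st->size = size;
	for (size_t i = 0; i < TOY_MAX_PAGES; i++)
		st->slot[i] = 0;
	return 0;
}

/* Step 2: link one physical page at a page-aligned offset in the window. */
static int toy_iova_link(struct toy_iova_state *st, size_t offset,
			 uint64_t phys, uint64_t *dma_addr)
{
	if (offset % TOY_PAGE_SIZE != 0 || offset >= st->size || phys == 0)
		return -1;
	st->slot[offset / TOY_PAGE_SIZE] = phys;
	*dma_addr = st->iova_base + offset;  /* derived, no allocation here */
	return 0;
}

/* Undo a link; the window itself stays reserved for reuse. */
static void toy_iova_unlink(struct toy_iova_state *st, size_t offset)
{
	if (offset % TOY_PAGE_SIZE == 0 && offset < st->size)
		st->slot[offset / TOY_PAGE_SIZE] = 0;
}

/* What the device would see: window address -> physical address, 0 on fault. */
static uint64_t toy_iova_translate(const struct toy_iova_state *st,
				   uint64_t dma_addr)
{
	if (dma_addr < st->iova_base || dma_addr - st->iova_base >= st->size)
		return 0;
	uint64_t off = dma_addr - st->iova_base;
	uint64_t phys = st->slot[off / TOY_PAGE_SIZE];
	return phys ? phys + off % TOY_PAGE_SIZE : 0;
}

/* Usage: one allocation, then independent link/unlink of single pages. */
static int toy_demo(void)
{
	struct toy_iova_state st;
	uint64_t a0, a2;

	if (toy_iova_alloc(&st, 0x10000000u, 4 * TOY_PAGE_SIZE))
		return -1;
	if (toy_iova_link(&st, 0, 0xabc000u, &a0) ||
	    toy_iova_link(&st, 2 * TOY_PAGE_SIZE, 0x123000u, &a2))
		return -1;
	assert(a2 == a0 + 2 * TOY_PAGE_SIZE);
	assert(toy_iova_translate(&st, a2 + 8) == 0x123008u);
	toy_iova_unlink(&st, 2 * TOY_PAGE_SIZE);
	assert(toy_iova_translate(&st, a2) == 0);
	return 0;
}
```

In this model the caller decides where each page lands inside the window, so
the data structure describing the memory stays entirely under its control.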
In this and the next series [1], three different users are converted to the
new API to show its benefits and versatility. Each user has a unique flow:
 1. RDMA ODP is an example of "SVA mirroring" using HMM that needs to
    dynamically map/unmap large numbers of single pages. This becomes
    significantly faster in the IOMMU case as the map/unmap is now just
    a page table walk and the IOVA allocation is pre-computed once.
    Significant amounts of memory are saved as there is no longer a need
    to store the dma_addr_t of each page.
 2. VFIO PCI live migration code builds a very large "page list"
    for the device. Instead of allocating a scatterlist entry per allocated
    page, it can just allocate an array of 'struct page *', saving a large
    amount of memory.
 3. NVMe PCI demonstrates how a BIO can be converted to a HW scatter
    list without having to allocate and then populate an intermediate SG
    table.
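
The memory saving described for the ODP case follows from the window being
contiguous: the device address of page i can be derived from the window base
instead of being stored. The sketch below is illustrative only; the sketch_*
names are hypothetical, a 4 KiB page size is assumed, and uint64_t stands in
for dma_addr_t on the assumption of a 64-bit configuration.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u

/* Bytes needed when one device address is stored per page. */
static size_t sketch_per_page_bytes(size_t npages)
{
	return npages * sizeof(uint64_t);
}

/* With a contiguous window only the base is kept; page i is derived. */
static uint64_t sketch_page_dma_addr(uint64_t iova_base, size_t idx)
{
	return iova_base + (uint64_t)idx * SKETCH_PAGE_SIZE;
}
```

Under these assumptions, tracking 1 GiB of 4 KiB pages (262144 pages) costs
2 MiB of stored addresses, versus a single base address per window.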
To make the new API easier to use, the HMM and block subsystems are extended
to hide the optimization details from the caller. These optimizations include:
 * Memory reduction, as in most real use cases there is no need to store
   mapped DMA addresses and unmap them.
 * Reduced function call overhead, by using direct calls instead of calls
   through function pointers.
This is the first step along a path to provide alternatives to scatterlist and
solve some of the abuses and design mistakes, for instance in DMABUF's P2P
support.
Thanks
[1] https://lore.kernel.org/all/cover.1730037261.git.leon@kernel.org
Christoph Hellwig (6):
PCI/P2PDMA: refactor the p2pdma mapping helpers
dma-mapping: move the PCI P2PDMA mapping helpers to pci-p2pdma.h
iommu: generalize the batched sync after map interface
iommu/dma: Factor out a iommu_dma_map_swiotlb helper
dma-mapping: add a dma_need_unmap helper
docs: core-api: document the IOVA-based API
Leon Romanovsky (12):
dma-mapping: Add check if IOVA can be used
dma: Provide an interface to allow allocate IOVA
dma-mapping: Implement link/unlink ranges API
mm/hmm: let users to tag specific PFN with DMA mapped bit
mm/hmm: provide generic DMA managing logic
RDMA/umem: Store ODP access mask information in PFN
RDMA/core: Convert UMEM ODP DMA mapping to caching IOVA and page
linkage
RDMA/umem: Separate implicit ODP initialization from explicit ODP
vfio/mlx5: Explicitly use number of pages instead of allocated length
vfio/mlx5: Rewrite create mkey flow to allow better code reuse
vfio/mlx5: Explicitly store page list
vfio/mlx5: Convert vfio to use DMA link API
Documentation/core-api/dma-api.rst | 70 +++++
drivers/infiniband/core/umem_odp.c | 250 +++++----------
drivers/infiniband/hw/mlx5/mlx5_ib.h | 12 +-
drivers/infiniband/hw/mlx5/odp.c | 65 ++--
drivers/infiniband/hw/mlx5/umr.c | 12 +-
drivers/iommu/dma-iommu.c | 455 +++++++++++++++++++++++----
drivers/iommu/iommu.c | 65 ++--
drivers/pci/p2pdma.c | 38 +--
drivers/vfio/pci/mlx5/cmd.c | 312 +++++++++---------
drivers/vfio/pci/mlx5/cmd.h | 24 +-
drivers/vfio/pci/mlx5/main.c | 87 +++--
include/linux/dma-map-ops.h | 54 ----
include/linux/dma-mapping.h | 84 +++++
include/linux/hmm-dma.h | 32 ++
include/linux/hmm.h | 16 +
include/linux/iommu.h | 4 +
include/linux/pci-p2pdma.h | 84 +++++
include/rdma/ib_umem_odp.h | 25 +-
kernel/dma/direct.c | 43 ++-
kernel/dma/mapping.c | 20 ++
mm/hmm.c | 229 +++++++++++++-
21 files changed, 1345 insertions(+), 636 deletions(-)
create mode 100644 include/linux/hmm-dma.h
--
2.46.2