Changelog:
v4:
 * Fixed kbuild error with mismatch in kmsan function declaration due to
   rebase error.
v3: https://lore.kernel.org/all/cover.1755193625.git.leon@kernel.org
 * Fixed typo in "cacheable" word
 * Simplified kmsan patch a lot to be simple argument refactoring
v2: https://lore.kernel.org/all/cover.1755153054.git.leon@kernel.org
 * Used commit messages and cover letter from Jason
 * Moved setting IOMMU_MMIO flag to dma_info_to_prot function
 * Micro-optimized the code
 * Rebased code on v6.17-rc1
v1: https://lore.kernel.org/all/cover.1754292567.git.leon@kernel.org
 * Added new DMA_ATTR_MMIO attribute to indicate
   PCI_P2PDMA_MAP_THRU_HOST_BRIDGE path.
 * Rewrote dma_map_* functions to use this new attribute
v0: https://lore.kernel.org/all/cover.1750854543.git.leon@kernel.org/
------------------------------------------------------------------------

This series refactors the DMA mapping to use physical addresses as the primary interface instead of page+offset parameters. This change aligns the DMA API with the underlying hardware reality where DMA operations work with physical addresses, not page structures.

The series maintains export symbol backward compatibility by keeping the old page-based API as wrapper functions around the new physical address-based implementations.

This series refactors the DMA mapping API to provide a phys_addr_t based, and struct-page free, external API that can handle all the mapping cases we want in modern systems:

 - struct page based cachable DRAM
 - struct page MEMORY_DEVICE_PCI_P2PDMA PCI peer to peer non-cachable MMIO
 - struct page-less PCI peer to peer non-cachable MMIO
 - struct page-less "resource" MMIO

Overall this gets much closer to Matthew's long term wish for struct-pageless IO to cachable DRAM. The remaining primary work would be in the mm side to allow kmap_local_pfn()/phys_to_virt() to work on phys_addr_t without a struct page.

The general design is to remove struct page usage entirely from the DMA API inner layers. For flows that need to have a KVA for the physical address they can use kmap_local_pfn() or phys_to_virt(). This isolates the struct page requirements to MM code only. Long term all removals of struct page usage are supporting Matthew's memdesc project which seeks to substantially transform how struct page works.

Instead make the DMA API internals work on phys_addr_t. Internally there are still dedicated 'page' and 'resource' flows, except they are now distinguished by a new DMA_ATTR_MMIO instead of by callchain. Both flows use the same phys_addr_t.

When DMA_ATTR_MMIO is specified things work similar to the existing 'resource' flow. kmap_local_pfn(), phys_to_virt(), phys_to_page(), pfn_valid(), etc are never called on the phys_addr_t. This requires rejecting any configuration that would need swiotlb. CPU cache flushing is not required, and avoided, as ATTR_MMIO also indicates the address has no cachable mappings. This effectively removes any DMA API side requirement to have struct page when DMA_ATTR_MMIO is used.

In the !DMA_ATTR_MMIO mode things work similarly to the 'page' flow, except on the common path of no cache flush and no swiotlb it never touches a struct page. When cache flushing or swiotlb copying is needed, kmap_local_pfn()/phys_to_virt() are used to get a KVA for CPU usage. This was already the case on the unmap side, now the map side is symmetric.

Callers are adjusted to set DMA_ATTR_MMIO. Existing 'resource' users must set it. The existing struct page based MEMORY_DEVICE_PCI_P2PDMA path must also set it.
This corrects some existing bugs where iommu mappings for P2P MMIO were improperly marked IOMMU_CACHE.

Since ATTR_MMIO is made to work with all the existing DMA map entry points, particularly dma_iova_link(), this finally allows a way to use the new DMA API to map PCI P2P MMIO without creating struct page. The VFIO DMABUF series demonstrates how this works. This is intended to replace the incorrect driver use of dma_map_resource() on PCI BAR addresses.

This series does the core code and modern flows. A followup series will give the same treatment to the legacy dma_ops implementation.

Thanks

Leon Romanovsky (16):
  dma-mapping: introduce new DMA attribute to indicate MMIO memory
  iommu/dma: implement DMA_ATTR_MMIO for dma_iova_link().
  dma-debug: refactor to use physical addresses for page mapping
  dma-mapping: rename trace_dma_*map_page to trace_dma_*map_phys
  iommu/dma: rename iommu_dma_*map_page to iommu_dma_*map_phys
  iommu/dma: extend iommu_dma_*map_phys API to handle MMIO memory
  dma-mapping: convert dma_direct_*map_page to be phys_addr_t based
  kmsan: convert kmsan_handle_dma to use physical addresses
  dma-mapping: handle MMIO flow in dma_map|unmap_page
  xen: swiotlb: Open code map_resource callback
  dma-mapping: export new dma_*map_phys() interface
  mm/hmm: migrate to physical address-based DMA mapping API
  mm/hmm: properly take MMIO path
  block-dma: migrate to dma_map_phys instead of map_page
  block-dma: properly take MMIO path
  nvme-pci: unmap MMIO pages with appropriate interface

 Documentation/core-api/dma-api.rst        |   4 +-
 Documentation/core-api/dma-attributes.rst |  18 ++++
 arch/powerpc/kernel/dma-iommu.c           |   4 +-
 block/blk-mq-dma.c                        |  15 ++-
 drivers/iommu/dma-iommu.c                 |  61 +++++------
 drivers/nvme/host/pci.c                   |  18 +++-
 drivers/virtio/virtio_ring.c              |   4 +-
 drivers/xen/swiotlb-xen.c                 |  21 +++-
 include/linux/blk-mq-dma.h                |   6 +-
 include/linux/blk_types.h                 |   2 +
 include/linux/dma-direct.h                |   2 -
 include/linux/dma-map-ops.h               |   8 +-
 include/linux/dma-mapping.h               |  33 ++++++
 include/linux/iommu-dma.h                 |  11 +-
 include/linux/kmsan.h                     |   9 +-
 include/trace/events/dma.h                |   9 +-
 kernel/dma/debug.c                        |  71 ++++---------
 kernel/dma/debug.h                        |  37 ++-----
 kernel/dma/direct.c                       |  22 +---
 kernel/dma/direct.h                       |  52 ++++++----
 kernel/dma/mapping.c                      | 117 +++++++++++++---------
 kernel/dma/ops_helpers.c                  |   6 +-
 mm/hmm.c                                  |  19 ++--
 mm/kmsan/hooks.c                          |   5 +-
 rust/kernel/dma.rs                        |   3 +
 tools/virtio/linux/kmsan.h                |   2 +-
 26 files changed, 305 insertions(+), 254 deletions(-)

-- 
2.50.1
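[For readers new to the proposed interface, here is a minimal usage sketch, assuming dma_map_phys()/dma_unmap_phys() take (dev, phys/dma_addr, size, dir, attrs) the same way their page-based counterparts do, as the "dma-mapping: export new dma_*map_phys() interface" patch proposes. Treat it as an illustration, not the final API.]

	#include <linux/dma-mapping.h>

	/*
	 * Sketch only: DMA_ATTR_MMIO marks an address that has no struct page
	 * and no cachable mapping (e.g. a PCI BAR), so the core must not use
	 * swiotlb, cache maintenance or phys_to_page()/pfn_valid() on it.
	 */
	static int example_map(struct device *dev, phys_addr_t phys, size_t len,
			       bool is_mmio, dma_addr_t *out)
	{
		unsigned long attrs = is_mmio ? DMA_ATTR_MMIO : 0;
		dma_addr_t dma;

		dma = dma_map_phys(dev, phys, len, DMA_TO_DEVICE, attrs);
		if (dma_mapping_error(dev, dma))
			return -ENOMEM;

		*out = dma;
		return 0;
	}

	static void example_unmap(struct device *dev, dma_addr_t dma, size_t len,
				  bool is_mmio)
	{
		/* the same attrs must be passed again at unmap time */
		dma_unmap_phys(dev, dma, len, DMA_TO_DEVICE,
			       is_mmio ? DMA_ATTR_MMIO : 0);
	}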
On Tue, Aug 19, 2025 at 08:36:44PM +0300, Leon Romanovsky wrote:
> This series does the core code and modern flows. A followup series
> will give the same treatment to the legacy dma_ops implementation.

I took a quick check over this to see that it is sane. I think using
phys is an improvement for most of the dma_ops implementations.

arch/sparc/kernel/pci_sun4v.c
arch/sparc/kernel/iommu.c
  Uses __pa to get phys from the page, never touches page

arch/alpha/kernel/pci_iommu.c
arch/sparc/mm/io-unit.c
drivers/parisc/ccio-dma.c
drivers/parisc/sba_iommu.c
  Does page_address() and later does __pa on it. Doesn't touch struct page

arch/x86/kernel/amd_gart_64.c
drivers/xen/swiotlb-xen.c
arch/mips/jazz/jazzdma.c
  Immediately does page_to_phys(), never touches struct page

drivers/vdpa/vdpa_user/vduse_dev.c
  Does page_to_phys() to call iommu_map()

drivers/xen/grant-dma-ops.c
  Does page_to_pfn() and nothing else

arch/powerpc/platforms/ps3/system-bus.c
  This is a maze but I think it wants only phys and the virt is only
  used for debug prints.

The above all never touch a KVA and just want a phys_addr_t.

The below are touching the KVA somehow:

arch/sparc/mm/iommu.c
arch/arm/mm/dma-mapping.c
  Uses page_address to cache flush, would be happy with phys_to_virt()
  and a PhysHighMem()

arch/powerpc/kernel/dma-iommu.c
arch/powerpc/platforms/pseries/vio.c
  Uses iommu_map_page() which wants phys_to_virt(), doesn't touch
  struct page

arch/powerpc/platforms/pseries/ibmebus.c
  Returns phys_to_virt() as dma_addr_t.

The two PPC ones are weird, I didn't figure out how that was working..

It would be easy to make map_phys patches for about half of these, in
the first grouping. Doing so would also grant those arches
map_resource capability.

Overall I didn't think there was any reduction in maintainability in
these places. Most are improvements eliminating code, and some are
just switching to phys_to_virt() from page_address(), which we could
further guard with DMA_ATTR_MMIO and a check for highmem.

Jason
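[To illustrate the pattern described for the first grouping above, a purely illustrative sketch follows; it is not taken from any of the listed files, and the example_* names are hypothetical. It shows a legacy dma_map_ops .map_page callback that converts to a physical address on its first line and never touches the struct page again, so a phys_addr_t based callback would simply drop that conversion.]

	/*
	 * Illustrative only: the struct page is consumed immediately by
	 * page_to_phys() and never used afterwards, so a future .map_phys
	 * style callback would only remove the first line.
	 */
	static dma_addr_t example_iommu_map_page(struct device *dev,
						 struct page *page,
						 unsigned long offset, size_t size,
						 enum dma_data_direction dir,
						 unsigned long attrs)
	{
		phys_addr_t phys = page_to_phys(page) + offset;

		/* everything from here on works on phys only */
		return example_iommu_insert_mapping(dev, phys, size, dir);
	}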
On 29.08.2025 15:16, Jason Gunthorpe wrote:
> On Tue, Aug 19, 2025 at 08:36:44PM +0300, Leon Romanovsky wrote:
>
>> This series does the core code and modern flows. A followup series
>> will give the same treatment to the legacy dma_ops implementation.
> I took a quick check over this to see that it is sane. I think using
> phys is an improvement for most of the dma_ops implementations.
>
> arch/sparc/kernel/pci_sun4v.c
> arch/sparc/kernel/iommu.c
>   Uses __pa to get phys from the page, never touches page
>
> arch/alpha/kernel/pci_iommu.c
> arch/sparc/mm/io-unit.c
> drivers/parisc/ccio-dma.c
> drivers/parisc/sba_iommu.c
>   Does page_address() and later does __pa on it. Doesn't touch struct page
>
> arch/x86/kernel/amd_gart_64.c
> drivers/xen/swiotlb-xen.c
> arch/mips/jazz/jazzdma.c
>   Immediately does page_to_phys(), never touches struct page
>
> drivers/vdpa/vdpa_user/vduse_dev.c
>   Does page_to_phys() to call iommu_map()
>
> drivers/xen/grant-dma-ops.c
>   Does page_to_pfn() and nothing else
>
> arch/powerpc/platforms/ps3/system-bus.c
>   This is a maze but I think it wants only phys and the virt is only
>   used for debug prints.
>
> The above all never touch a KVA and just want a phys_addr_t.
>
> The below are touching the KVA somehow:
>
> arch/sparc/mm/iommu.c
> arch/arm/mm/dma-mapping.c
>   Uses page_address to cache flush, would be happy with phys_to_virt()
>   and a PhysHighMem()
>
> arch/powerpc/kernel/dma-iommu.c
> arch/powerpc/platforms/pseries/vio.c
>   Uses iommu_map_page() which wants phys_to_virt(), doesn't touch
>   struct page
>
> arch/powerpc/platforms/pseries/ibmebus.c
>   Returns phys_to_virt() as dma_addr_t.
>
> The two PPC ones are weird, I didn't figure out how that was working..
>
> It would be easy to make map_phys patches for about half of these, in
> the first grouping. Doing so would also grant those arches
> map_resource capability.
>
> Overall I didn't think there was any reduction in maintainability in
> these places. Most are improvements eliminating code, and some are
> just switching to phys_to_virt() from page_address(), which we could
> further guard with DMA_ATTR_MMIO and a check for highmem.

Thanks for this summary.

However I would still like to get an answer for the simple question -
why all this work cannot be replaced by a simple use of dma_map_resource()?

I've checked the most advertised use case in
https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/log/?h=dmabuf-vfio
and I still don't see the reason why it cannot be based on the
dma_map_resource() API? I'm aware of the little asymmetry of the client
calls in such a case, indeed it is not pretty, but this should work even now:

	phys = phys_vec[i].paddr;

	if (is_mmio)
		dma_map_resource(phys, len, ...);
	else
		dma_map_page(phys_to_page(phys), offset_in_page(phys), ...);

What did I miss?

I'm not against this rework, but I would really like to know the
rationale. I know that the 2-step dma-mapping API also uses phys
addresses and this is the same direction.

This patchset focuses only on the dma_map_page -> dma_map_phys rework.
There are also other interfaces, like dma_alloc_pages(), and nothing has
been proposed for them so far.

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland
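[For reference, a minimal sketch of the asymmetric pattern suggested above, fleshed out with today's dma_map_page()/dma_map_resource() entry points; the phys/len/is_mmio inputs mirror the snippet in the mail, map_one_range is a hypothetical name, and error handling is elided.]

	/*
	 * Sketch of the workaround with the existing API: MMIO ranges go
	 * through dma_map_resource(), everything else through dma_map_page().
	 * Assumes the non-MMIO range does not cross a page boundary; real
	 * code would have to split it or use the scatterlist/link paths.
	 */
	static dma_addr_t map_one_range(struct device *dev, phys_addr_t phys,
					size_t len, bool is_mmio,
					enum dma_data_direction dir)
	{
		if (is_mmio)
			return dma_map_resource(dev, phys, len, dir, 0);

		return dma_map_page(dev, phys_to_page(phys),
				    offset_in_page(phys), len, dir);
	}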
On Fri, Sep 05, 2025 at 06:20:51PM +0200, Marek Szyprowski wrote:
> I've checked the most advertised use case in
> https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/log/?h=dmabuf-vfio
> and I still don't see the reason why it cannot be based on the
> dma_map_resource() API? I'm aware of the little asymmetry of the client
> calls in such a case, indeed it is not pretty, but this should work even now:
>
> 	phys = phys_vec[i].paddr;
>
> 	if (is_mmio)
> 		dma_map_resource(phys, len, ...);
> 	else
> 		dma_map_page(phys_to_page(phys), offset_in_page(phys), ...);
>
> What did I miss?

I have a somewhat different answer than Leon..

The link path would need a resource variation too:

+		ret = dma_iova_link(attachment->dev, state,
+				    phys_vec[i].paddr, 0,
+				    phys_vec[i].len, dir, attrs);
+		if (ret)
+			goto err_unmap_dma;
+
+		mapped_len += phys_vec[i].len;

It is an existing bug that we don't properly handle all details of MMIO
for link. Since this is already a phys_addr_t I wouldn't strongly argue
that should be done by adding ATTR_MMIO to dma_iova_link().

If you did that, then you'd still want a dma_(un)map_phys() helper that
handled ATTR_MMIO too. It could be an inline "if () resource else page"
wrapper like you say.

So API wise I think we have the right design here. I think the question
you are asking is how much changing to the internals of the DMA API do
you want to do to make ATTR_MMIO. It is not zero, but there is some
minimum that is less than this.

So, reason #1: much of this ATTR_MMIO is needed anyhow. Being consistent
and unifying the dma_map_resource path with ATTR_MMIO should improve the
long term maintainability of the code. We already uncovered paths where
map_resource is not behaving consistently with map_page and it is
unclear if these are bugs or deliberate.

Reason #2: we do actually want to get rid of struct page usage to help
advance Matthew's work. This means we want to build a clean struct page
less path for IO. Meaning we can do phys to virt, or kmap phys, but none
of: phys to page, page to virt, page to phys.

Stopping at a phys based public API and then leaving all the phys to
page/etc conversions hidden inside is not enough. This is why I was
looking at the dma_ops path, to see just how much page usage there is,
and I found very little. So this dream is achievable and with this
series we are there for ARM64 and x86 environments.

> This patchset focuses only on the dma_map_page -> dma_map_phys rework.
> There are also other interfaces, like dma_alloc_pages(), and nothing has
> been proposed for them so far.

That's because they already have non-page alternatives. Almost all
places call dma_alloc_noncoherent():

static inline void *dma_alloc_noncoherent(struct device *dev, size_t size,
		dma_addr_t *dma_handle, enum dma_data_direction dir, gfp_t gfp)
{
	struct page *page = dma_alloc_pages(dev, size, dma_handle, dir, gfp);

	return page ? page_address(page) : NULL;

Which is KVA based.

There is only one user I found of alloc_pages:

drivers/firewire/ohci.c:	ctx->pages[i] = dma_alloc_pages(dev, PAGE_SIZE, &dma_addr,

And it deliberately uses page->private:

	set_page_private(ctx->pages[i], dma_addr);

So it is correct to use the struct page API.

Some usages of dma_alloc_noncontiguous() can be implemented using the
dma_iova_link() flow like drivers/vfio/pci/mlx5/cmd.c shows by using
alloc_pages_bulk() for the allocator. We don't yet have a 'dma alloc
link' operation though, and there are only 4 users of
dma_alloc_noncontiguous()..

Jason
Hi,

I'm the present maintainer of the Linux FireWire subsystem, and in recent
years I have been working to modernize the subsystem.

On Fri, Sep 05, 2025 at 14:43:24PM -0300, Jason Gunthorpe wrote:
> There is only one user I found of alloc_pages:
>
> drivers/firewire/ohci.c:	ctx->pages[i] = dma_alloc_pages(dev, PAGE_SIZE, &dma_addr,
>
> And it deliberately uses page->private:
>
> 	set_page_private(ctx->pages[i], dma_addr);
>
> So it is correct to use the struct page API.

I've already realized it, and it is on my TODO list to use modern
alternative APIs to replace it (but not yet). If you know some candidates
for this purpose, it would really help to accomplish it.


Regards

Takashi Sakamoto
On Sun, Sep 07, 2025 at 11:25:09PM +0900, Takashi Sakamoto wrote:
> Hi,
>
> I'm the present maintainer of the Linux FireWire subsystem, and in recent
> years I have been working to modernize the subsystem.
>
> On Fri, Sep 05, 2025 at 14:43:24PM -0300, Jason Gunthorpe wrote:
> > There is only one user I found of alloc_pages:
> >
> > drivers/firewire/ohci.c:	ctx->pages[i] = dma_alloc_pages(dev, PAGE_SIZE, &dma_addr,
> >
> > And it deliberately uses page->private:
> >
> > 	set_page_private(ctx->pages[i], dma_addr);
> >
> > So it is correct to use the struct page API.
>
> I've already realized it, and it is on my TODO list to use modern
> alternative APIs to replace it (but not yet). If you know some candidates
> for this purpose, it would really help to accomplish it.

I think for now it is probably OKish, but in the medium/longer term this
probably wants to have its own memdesc like other cases. Ie instead of
using page->private you'd have a

struct ohci_desc {
	unsigned long __page_flags;

	dma_addr_t dma_addr;
	[..]
};

And instead of using page->private you'd use ohci_desc::dma_addr.

This would require changing dma_alloc_pages() to be able to allocate the
frozen memdescs.. Which we are not quite there yet, but maybe come back
to this in 2026?

Jason
On Fri, Sep 05, 2025 at 06:20:51PM +0200, Marek Szyprowski wrote:
> On 29.08.2025 15:16, Jason Gunthorpe wrote:
> > On Tue, Aug 19, 2025 at 08:36:44PM +0300, Leon Romanovsky wrote:
> >
> >> This series does the core code and modern flows. A followup series
> >> will give the same treatment to the legacy dma_ops implementation.
> > I took a quick check over this to see that it is sane. I think using
> > phys is an improvement for most of the dma_ops implementations.
> >
> > arch/sparc/kernel/pci_sun4v.c
> > arch/sparc/kernel/iommu.c
> >   Uses __pa to get phys from the page, never touches page
> >
> > arch/alpha/kernel/pci_iommu.c
> > arch/sparc/mm/io-unit.c
> > drivers/parisc/ccio-dma.c
> > drivers/parisc/sba_iommu.c
> >   Does page_address() and later does __pa on it. Doesn't touch struct page
> >
> > arch/x86/kernel/amd_gart_64.c
> > drivers/xen/swiotlb-xen.c
> > arch/mips/jazz/jazzdma.c
> >   Immediately does page_to_phys(), never touches struct page
> >
> > drivers/vdpa/vdpa_user/vduse_dev.c
> >   Does page_to_phys() to call iommu_map()
> >
> > drivers/xen/grant-dma-ops.c
> >   Does page_to_pfn() and nothing else
> >
> > arch/powerpc/platforms/ps3/system-bus.c
> >   This is a maze but I think it wants only phys and the virt is only
> >   used for debug prints.
> >
> > The above all never touch a KVA and just want a phys_addr_t.
> >
> > The below are touching the KVA somehow:
> >
> > arch/sparc/mm/iommu.c
> > arch/arm/mm/dma-mapping.c
> >   Uses page_address to cache flush, would be happy with phys_to_virt()
> >   and a PhysHighMem()
> >
> > arch/powerpc/kernel/dma-iommu.c
> > arch/powerpc/platforms/pseries/vio.c
> >   Uses iommu_map_page() which wants phys_to_virt(), doesn't touch
> >   struct page
> >
> > arch/powerpc/platforms/pseries/ibmebus.c
> >   Returns phys_to_virt() as dma_addr_t.
> >
> > The two PPC ones are weird, I didn't figure out how that was working..
> >
> > It would be easy to make map_phys patches for about half of these, in
> > the first grouping. Doing so would also grant those arches
> > map_resource capability.
> >
> > Overall I didn't think there was any reduction in maintainability in
> > these places. Most are improvements eliminating code, and some are
> > just switching to phys_to_virt() from page_address(), which we could
> > further guard with DMA_ATTR_MMIO and a check for highmem.
>
> Thanks for this summary.
>
> However I would still like to get an answer for the simple question -
> why all this work cannot be replaced by a simple use of dma_map_resource()?
>
> I've checked the most advertised use case in
> https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/log/?h=dmabuf-vfio
> and I still don't see the reason why it cannot be based on the
> dma_map_resource() API? I'm aware of the little asymmetry of the client
> calls in such a case, indeed it is not pretty, but this should work even now:
>
> 	phys = phys_vec[i].paddr;
>
> 	if (is_mmio)
> 		dma_map_resource(phys, len, ...);
> 	else
> 		dma_map_page(phys_to_page(phys), offset_in_page(phys), ...);
>
> What did I miss?

"Even now" can't work mainly because both of these interfaces don't
support the p2p case (PCI_P2PDMA_MAP_BUS_ADDR). It is unclear how to
extend them without introducing new functions and/or changing the whole
kernel.

In the PCI_P2PDMA_MAP_BUS_ADDR case, there is no struct page, so
dma_map_page() is unlikely to be possible to extend, and
dma_map_resource() has no direct way to access the PCI bus_offset. In
theory it is doable, but it would be a layering violation as DMA would
need to rely on the PCI layer for address calculations.
If we don't extend, in the general case (for HMM, RDMA and NVMe) the end
result will be something like that:

	if (...PCI_P2PDMA_MAP_BUS_ADDR)
		pci_p2pdma_bus_addr_map
	else if (mmio)
		dma_map_resource
	else     <- this case is not applicable to VFIO-DMABUF
		dma_map_page

In case we somehow extend these functions to support it, we will lose a
very important optimization where we perform one IOTLB sync for the
whole DMABUF region == dma_iova_state, and I was told that it is a very
large region.

   103         for (i = 0; i < priv->nr_ranges; i++) {
<...>
   107                 } else if (dma_use_iova(state)) {
   108                         ret = dma_iova_link(attachment->dev, state,
   109                                             phys_vec[i].paddr, 0,
   110                                             phys_vec[i].len, dir, attrs);
   111                         if (ret)
   112                                 goto err_unmap_dma;
   113
   114                         mapped_len += phys_vec[i].len;
<...>
   132         }
   133
   134         if (state && dma_use_iova(state)) {
   135                 WARN_ON_ONCE(mapped_len != priv->size);
   136                 ret = dma_iova_sync(attachment->dev, state, 0, mapped_len);

>
> I'm not against this rework, but I would really like to know the
> rationale. I know that the 2-step dma-mapping API also uses phys
> addresses and this is the same direction.

This series is a continuation of the 2-step dma-mapping API. The plan to
provide dma_map_phys() was there from the beginning.

Thanks
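[For illustration, a hedged sketch of the dispatch outlined above as it could look with the new API from this series: dma_map_phys() and DMA_ATTR_MMIO come from this patchset, pci_p2pdma_bus_addr_map() is the existing helper named in the pseudo-code, map_one_phys_range is a hypothetical name, and the dma_iova_link() fast path plus error handling are omitted.]

	static dma_addr_t map_one_phys_range(struct device *dev,
					     struct pci_p2pdma_map_state *p2pdma,
					     enum pci_p2pdma_map_type type,
					     phys_addr_t phys, size_t len,
					     enum dma_data_direction dir)
	{
		unsigned long attrs = 0;

		switch (type) {
		case PCI_P2PDMA_MAP_BUS_ADDR:
			/* p2p below a PCI switch: only the bus offset applies */
			return pci_p2pdma_bus_addr_map(p2pdma, phys);
		case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
			/* MMIO via the host bridge: no KVA, swiotlb or cache sync */
			attrs |= DMA_ATTR_MMIO;
			fallthrough;
		case PCI_P2PDMA_MAP_NONE:
			/* ordinary host memory, or MMIO with DMA_ATTR_MMIO set */
			return dma_map_phys(dev, phys, len, dir, attrs);
		default:
			return DMA_MAPPING_ERROR;
		}
	}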
On Tue, Aug 19, 2025 at 08:36:44PM +0300, Leon Romanovsky wrote: > Changelog: > v4: > * Fixed kbuild error with mismatch in kmsan function declaration due to > rebase error. > v3: https://lore.kernel.org/all/cover.1755193625.git.leon@kernel.org > * Fixed typo in "cacheable" word > * Simplified kmsan patch a lot to be simple argument refactoring > v2: https://lore.kernel.org/all/cover.1755153054.git.leon@kernel.org > * Used commit messages and cover letter from Jason > * Moved setting IOMMU_MMIO flag to dma_info_to_prot function > * Micro-optimized the code > * Rebased code on v6.17-rc1 > v1: https://lore.kernel.org/all/cover.1754292567.git.leon@kernel.org > * Added new DMA_ATTR_MMIO attribute to indicate > PCI_P2PDMA_MAP_THRU_HOST_BRIDGE path. > * Rewrote dma_map_* functions to use thus new attribute > v0: https://lore.kernel.org/all/cover.1750854543.git.leon@kernel.org/ > ------------------------------------------------------------------------ > > This series refactors the DMA mapping to use physical addresses > as the primary interface instead of page+offset parameters. This > change aligns the DMA API with the underlying hardware reality where > DMA operations work with physical addresses, not page structures. > > The series maintains export symbol backward compatibility by keeping > the old page-based API as wrapper functions around the new physical > address-based implementations. > > This series refactors the DMA mapping API to provide a phys_addr_t > based, and struct-page free, external API that can handle all the > mapping cases we want in modern systems: > > - struct page based cachable DRAM > - struct page MEMORY_DEVICE_PCI_P2PDMA PCI peer to peer non-cachable > MMIO > - struct page-less PCI peer to peer non-cachable MMIO > - struct page-less "resource" MMIO > > Overall this gets much closer to Matthew's long term wish for > struct-pageless IO to cachable DRAM. The remaining primary work would > be in the mm side to allow kmap_local_pfn()/phys_to_virt() to work on > phys_addr_t without a struct page. > > The general design is to remove struct page usage entirely from the > DMA API inner layers. For flows that need to have a KVA for the > physical address they can use kmap_local_pfn() or phys_to_virt(). This > isolates the struct page requirements to MM code only. Long term all > removals of struct page usage are supporting Matthew's memdesc > project which seeks to substantially transform how struct page works. > > Instead make the DMA API internals work on phys_addr_t. Internally > there are still dedicated 'page' and 'resource' flows, except they are > now distinguished by a new DMA_ATTR_MMIO instead of by callchain. Both > flows use the same phys_addr_t. > > When DMA_ATTR_MMIO is specified things work similar to the existing > 'resource' flow. kmap_local_pfn(), phys_to_virt(), phys_to_page(), > pfn_valid(), etc are never called on the phys_addr_t. This requires > rejecting any configuration that would need swiotlb. CPU cache > flushing is not required, and avoided, as ATTR_MMIO also indicates the > address have no cachable mappings. This effectively removes any > DMA API side requirement to have struct page when DMA_ATTR_MMIO is > used. > > In the !DMA_ATTR_MMIO mode things work similarly to the 'page' flow, > except on the common path of no cache flush, no swiotlb it never > touches a struct page. When cache flushing or swiotlb copying > kmap_local_pfn()/phys_to_virt() are used to get a KVA for CPU > usage. 
This was already the case on the unmap side, now the map side > is symmetric. > > Callers are adjusted to set DMA_ATTR_MMIO. Existing 'resource' users > must set it. The existing struct page based MEMORY_DEVICE_PCI_P2PDMA > path must also set it. This corrects some existing bugs where iommu > mappings for P2P MMIO were improperly marked IOMMU_CACHE. > > Since ATTR_MMIO is made to work with all the existing DMA map entry > points, particularly dma_iova_link(), this finally allows a way to use > the new DMA API to map PCI P2P MMIO without creating struct page. The > VFIO DMABUF series demonstrates how this works. This is intended to > replace the incorrect driver use of dma_map_resource() on PCI BAR > addresses. > > This series does the core code and modern flows. A followup series > will give the same treatment to the legacy dma_ops implementation. > > Thanks > > Leon Romanovsky (16): > dma-mapping: introduce new DMA attribute to indicate MMIO memory > iommu/dma: implement DMA_ATTR_MMIO for dma_iova_link(). > dma-debug: refactor to use physical addresses for page mapping > dma-mapping: rename trace_dma_*map_page to trace_dma_*map_phys > iommu/dma: rename iommu_dma_*map_page to iommu_dma_*map_phys > iommu/dma: extend iommu_dma_*map_phys API to handle MMIO memory > dma-mapping: convert dma_direct_*map_page to be phys_addr_t based > kmsan: convert kmsan_handle_dma to use physical addresses > dma-mapping: handle MMIO flow in dma_map|unmap_page > xen: swiotlb: Open code map_resource callback > dma-mapping: export new dma_*map_phys() interface > mm/hmm: migrate to physical address-based DMA mapping API > mm/hmm: properly take MMIO path > block-dma: migrate to dma_map_phys instead of map_page > block-dma: properly take MMIO path > nvme-pci: unmap MMIO pages with appropriate interface > > Documentation/core-api/dma-api.rst | 4 +- > Documentation/core-api/dma-attributes.rst | 18 ++++ > arch/powerpc/kernel/dma-iommu.c | 4 +- > block/blk-mq-dma.c | 15 ++- > drivers/iommu/dma-iommu.c | 61 +++++------ > drivers/nvme/host/pci.c | 18 +++- > drivers/virtio/virtio_ring.c | 4 +- > drivers/xen/swiotlb-xen.c | 21 +++- > include/linux/blk-mq-dma.h | 6 +- > include/linux/blk_types.h | 2 + > include/linux/dma-direct.h | 2 - > include/linux/dma-map-ops.h | 8 +- > include/linux/dma-mapping.h | 33 ++++++ > include/linux/iommu-dma.h | 11 +- > include/linux/kmsan.h | 9 +- > include/trace/events/dma.h | 9 +- > kernel/dma/debug.c | 71 ++++--------- > kernel/dma/debug.h | 37 ++----- > kernel/dma/direct.c | 22 +--- > kernel/dma/direct.h | 52 ++++++---- > kernel/dma/mapping.c | 117 +++++++++++++--------- > kernel/dma/ops_helpers.c | 6 +- > mm/hmm.c | 19 ++-- > mm/kmsan/hooks.c | 5 +- > rust/kernel/dma.rs | 3 + > tools/virtio/linux/kmsan.h | 2 +- > 26 files changed, 305 insertions(+), 254 deletions(-) Marek, So what are the next steps here? This series is pre-requirement for the VFIO MMIO patches. Thanks > > -- > 2.50.1 > >
On 28.08.2025 13:57, Leon Romanovsky wrote: > On Tue, Aug 19, 2025 at 08:36:44PM +0300, Leon Romanovsky wrote: >> Changelog: >> v4: >> * Fixed kbuild error with mismatch in kmsan function declaration due to >> rebase error. >> v3: https://lore.kernel.org/all/cover.1755193625.git.leon@kernel.org >> * Fixed typo in "cacheable" word >> * Simplified kmsan patch a lot to be simple argument refactoring >> v2: https://lore.kernel.org/all/cover.1755153054.git.leon@kernel.org >> * Used commit messages and cover letter from Jason >> * Moved setting IOMMU_MMIO flag to dma_info_to_prot function >> * Micro-optimized the code >> * Rebased code on v6.17-rc1 >> v1: https://lore.kernel.org/all/cover.1754292567.git.leon@kernel.org >> * Added new DMA_ATTR_MMIO attribute to indicate >> PCI_P2PDMA_MAP_THRU_HOST_BRIDGE path. >> * Rewrote dma_map_* functions to use thus new attribute >> v0: https://lore.kernel.org/all/cover.1750854543.git.leon@kernel.org/ >> ------------------------------------------------------------------------ >> >> This series refactors the DMA mapping to use physical addresses >> as the primary interface instead of page+offset parameters. This >> change aligns the DMA API with the underlying hardware reality where >> DMA operations work with physical addresses, not page structures. >> >> The series maintains export symbol backward compatibility by keeping >> the old page-based API as wrapper functions around the new physical >> address-based implementations. >> >> This series refactors the DMA mapping API to provide a phys_addr_t >> based, and struct-page free, external API that can handle all the >> mapping cases we want in modern systems: >> >> - struct page based cachable DRAM >> - struct page MEMORY_DEVICE_PCI_P2PDMA PCI peer to peer non-cachable >> MMIO >> - struct page-less PCI peer to peer non-cachable MMIO >> - struct page-less "resource" MMIO >> >> Overall this gets much closer to Matthew's long term wish for >> struct-pageless IO to cachable DRAM. The remaining primary work would >> be in the mm side to allow kmap_local_pfn()/phys_to_virt() to work on >> phys_addr_t without a struct page. >> >> The general design is to remove struct page usage entirely from the >> DMA API inner layers. For flows that need to have a KVA for the >> physical address they can use kmap_local_pfn() or phys_to_virt(). This >> isolates the struct page requirements to MM code only. Long term all >> removals of struct page usage are supporting Matthew's memdesc >> project which seeks to substantially transform how struct page works. >> >> Instead make the DMA API internals work on phys_addr_t. Internally >> there are still dedicated 'page' and 'resource' flows, except they are >> now distinguished by a new DMA_ATTR_MMIO instead of by callchain. Both >> flows use the same phys_addr_t. >> >> When DMA_ATTR_MMIO is specified things work similar to the existing >> 'resource' flow. kmap_local_pfn(), phys_to_virt(), phys_to_page(), >> pfn_valid(), etc are never called on the phys_addr_t. This requires >> rejecting any configuration that would need swiotlb. CPU cache >> flushing is not required, and avoided, as ATTR_MMIO also indicates the >> address have no cachable mappings. This effectively removes any >> DMA API side requirement to have struct page when DMA_ATTR_MMIO is >> used. >> >> In the !DMA_ATTR_MMIO mode things work similarly to the 'page' flow, >> except on the common path of no cache flush, no swiotlb it never >> touches a struct page. 
When cache flushing or swiotlb copying >> kmap_local_pfn()/phys_to_virt() are used to get a KVA for CPU >> usage. This was already the case on the unmap side, now the map side >> is symmetric. >> >> Callers are adjusted to set DMA_ATTR_MMIO. Existing 'resource' users >> must set it. The existing struct page based MEMORY_DEVICE_PCI_P2PDMA >> path must also set it. This corrects some existing bugs where iommu >> mappings for P2P MMIO were improperly marked IOMMU_CACHE. >> >> Since ATTR_MMIO is made to work with all the existing DMA map entry >> points, particularly dma_iova_link(), this finally allows a way to use >> the new DMA API to map PCI P2P MMIO without creating struct page. The >> VFIO DMABUF series demonstrates how this works. This is intended to >> replace the incorrect driver use of dma_map_resource() on PCI BAR >> addresses. >> >> This series does the core code and modern flows. A followup series >> will give the same treatment to the legacy dma_ops implementation. >> >> Thanks >> >> Leon Romanovsky (16): >> dma-mapping: introduce new DMA attribute to indicate MMIO memory >> iommu/dma: implement DMA_ATTR_MMIO for dma_iova_link(). >> dma-debug: refactor to use physical addresses for page mapping >> dma-mapping: rename trace_dma_*map_page to trace_dma_*map_phys >> iommu/dma: rename iommu_dma_*map_page to iommu_dma_*map_phys >> iommu/dma: extend iommu_dma_*map_phys API to handle MMIO memory >> dma-mapping: convert dma_direct_*map_page to be phys_addr_t based >> kmsan: convert kmsan_handle_dma to use physical addresses >> dma-mapping: handle MMIO flow in dma_map|unmap_page >> xen: swiotlb: Open code map_resource callback >> dma-mapping: export new dma_*map_phys() interface >> mm/hmm: migrate to physical address-based DMA mapping API >> mm/hmm: properly take MMIO path >> block-dma: migrate to dma_map_phys instead of map_page >> block-dma: properly take MMIO path >> nvme-pci: unmap MMIO pages with appropriate interface >> >> Documentation/core-api/dma-api.rst | 4 +- >> Documentation/core-api/dma-attributes.rst | 18 ++++ >> arch/powerpc/kernel/dma-iommu.c | 4 +- >> block/blk-mq-dma.c | 15 ++- >> drivers/iommu/dma-iommu.c | 61 +++++------ >> drivers/nvme/host/pci.c | 18 +++- >> drivers/virtio/virtio_ring.c | 4 +- >> drivers/xen/swiotlb-xen.c | 21 +++- >> include/linux/blk-mq-dma.h | 6 +- >> include/linux/blk_types.h | 2 + >> include/linux/dma-direct.h | 2 - >> include/linux/dma-map-ops.h | 8 +- >> include/linux/dma-mapping.h | 33 ++++++ >> include/linux/iommu-dma.h | 11 +- >> include/linux/kmsan.h | 9 +- >> include/trace/events/dma.h | 9 +- >> kernel/dma/debug.c | 71 ++++--------- >> kernel/dma/debug.h | 37 ++----- >> kernel/dma/direct.c | 22 +--- >> kernel/dma/direct.h | 52 ++++++---- >> kernel/dma/mapping.c | 117 +++++++++++++--------- >> kernel/dma/ops_helpers.c | 6 +- >> mm/hmm.c | 19 ++-- >> mm/kmsan/hooks.c | 5 +- >> rust/kernel/dma.rs | 3 + >> tools/virtio/linux/kmsan.h | 2 +- >> 26 files changed, 305 insertions(+), 254 deletions(-) > Marek, > > So what are the next steps here? This series is pre-requirement for the > VFIO MMIO patches. I waited a bit with a hope to get a comment from Robin. It looks that there is no other alternative for the phys addr in the struct page removal process. I would like to give those patches a try in linux-next, but in meantime I tested it on my test farm and found a regression in dma_map_resource() handling. 
Namely, dma_map_resource() is no longer possible with a size not aligned
to a kmalloc()'ed buffer, as dma_direct_map_phys() calls
dma_kmalloc_needs_bounce(), which in turn calls dma_kmalloc_size_aligned().
It looks like the check for !(attrs & DMA_ATTR_MMIO) should be moved one
level up in dma_direct_map_phys().

Here is the log:

------------[ cut here ]------------
dma-pl330 fe550000.dma-controller: DMA addr 0x00000000fe410024+4 overflow (mask ffffffff, bus limit 0).
WARNING: kernel/dma/direct.h:116 at dma_map_phys+0x3a4/0x3ec, CPU#1: speaker-test/405
Modules linked in: ...
CPU: 1 UID: 0 PID: 405 Comm: speaker-test Not tainted 6.17.0-rc4-next-20250901+ #10958 PREEMPT
Hardware name: Hardkernel ODROID-M1 (DT)
pstate: 604000c9 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : dma_map_phys+0x3a4/0x3ec
lr : dma_map_phys+0x3a4/0x3ec
...
Call trace:
 dma_map_phys+0x3a4/0x3ec (P)
 dma_map_resource+0x14/0x20
 pl330_prep_slave_fifo+0x78/0xd0
 pl330_prep_dma_cyclic+0x70/0x2b0
 snd_dmaengine_pcm_trigger+0xec/0x8bc [snd_pcm_dmaengine]
 dmaengine_pcm_trigger+0x18/0x24 [snd_soc_core]
 snd_soc_pcm_component_trigger+0x164/0x208 [snd_soc_core]
 soc_pcm_trigger+0xe4/0x1ec [snd_soc_core]
 snd_pcm_do_start+0x44/0x70 [snd_pcm]
 snd_pcm_action_single+0x48/0xa4 [snd_pcm]
 snd_pcm_action+0x7c/0x98 [snd_pcm]
 snd_pcm_action_lock_irq+0x48/0xb4 [snd_pcm]
 snd_pcm_common_ioctl+0xf00/0x1f1c [snd_pcm]
 snd_pcm_ioctl+0x30/0x48 [snd_pcm]
 __arm64_sys_ioctl+0xac/0x104
 invoke_syscall+0x48/0x110
 el0_svc_common.constprop.0+0x40/0xe8
 do_el0_svc+0x20/0x2c
 el0_svc+0x4c/0x160
 el0t_64_sync_handler+0xa0/0xe4
 el0t_64_sync+0x198/0x19c
irq event stamp: 6596
hardirqs last enabled at (6595): [<ffff800081344624>] _raw_spin_unlock_irqrestore+0x74/0x78
hardirqs last disabled at (6596): [<ffff8000813439b0>] _raw_spin_lock_irq+0x78/0x7c
softirqs last enabled at (6076): [<ffff8000800c2294>] handle_softirqs+0x4c4/0x4dc
softirqs last disabled at (6071): [<ffff800080010690>] __do_softirq+0x14/0x20
---[ end trace 0000000000000000 ]---
rockchip-i2s-tdm fe410000.i2s: ASoC error (-12): at soc_component_trigger() on fe410000.i2s

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland
On Mon, Sep 01, 2025 at 11:47:59PM +0200, Marek Szyprowski wrote:
> I would like to give those patches a try in linux-next, but in meantime
> I tested it on my test farm and found a regression in dma_map_resource()
> handling. Namely, dma_map_resource() is no longer possible with a size
> not aligned to a kmalloc()'ed buffer, as dma_direct_map_phys() calls
> dma_kmalloc_needs_bounce(),

Hmm, it's this bit:

	capable = dma_capable(dev, dma_addr, size, !(attrs & DMA_ATTR_MMIO));
	if (unlikely(!capable) || dma_kmalloc_needs_bounce(dev, size, dir)) {
		if (is_swiotlb_active(dev) && !(attrs & DMA_ATTR_MMIO))
			return swiotlb_map(dev, phys, size, dir, attrs);

		goto err_overflow;
	}

We shouldn't be checking dma_kmalloc_needs_bounce() on mmio as there
is no cache flushing so the "dma safe alignment" for non-coherent DMA
does not apply.

Like you say looks good to me, and more of the surrounding code can be
pulled in too, no sense in repeating the boolean logic:

	if (attrs & DMA_ATTR_MMIO) {
		dma_addr = phys;
		if (unlikely(!dma_capable(dev, dma_addr, size, false)))
			goto err_overflow;
	} else {
		dma_addr = phys_to_dma(dev, phys);
		if (unlikely(!dma_capable(dev, dma_addr, size, true)) ||
		    dma_kmalloc_needs_bounce(dev, size, dir)) {
			if (is_swiotlb_active(dev))
				return swiotlb_map(dev, phys, size, dir, attrs);

			goto err_overflow;
		}
		if (!dev_is_dma_coherent(dev) &&
		    !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
			arch_sync_dma_for_device(phys, size, dir);
	}

Jason
On Mon, Sep 01, 2025 at 07:23:02PM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 01, 2025 at 11:47:59PM +0200, Marek Szyprowski wrote:
> > I would like to give those patches a try in linux-next, but in meantime
> > I tested it on my test farm and found a regression in dma_map_resource()
> > handling. Namely, dma_map_resource() is no longer possible with a size
> > not aligned to a kmalloc()'ed buffer, as dma_direct_map_phys() calls
> > dma_kmalloc_needs_bounce(),
>
> Hmm, it's this bit:
>
> 	capable = dma_capable(dev, dma_addr, size, !(attrs & DMA_ATTR_MMIO));
> 	if (unlikely(!capable) || dma_kmalloc_needs_bounce(dev, size, dir)) {
> 		if (is_swiotlb_active(dev) && !(attrs & DMA_ATTR_MMIO))
> 			return swiotlb_map(dev, phys, size, dir, attrs);
>
> 		goto err_overflow;
> 	}
>
> We shouldn't be checking dma_kmalloc_needs_bounce() on mmio as there
> is no cache flushing so the "dma safe alignment" for non-coherent DMA
> does not apply.
>
> Like you say looks good to me, and more of the surrounding code can be
> pulled in too, no sense in repeating the boolean logic:
>
> 	if (attrs & DMA_ATTR_MMIO) {
> 		dma_addr = phys;
> 		if (unlikely(!dma_capable(dev, dma_addr, size, false)))
> 			goto err_overflow;
> 	} else {
> 		dma_addr = phys_to_dma(dev, phys);
> 		if (unlikely(!dma_capable(dev, dma_addr, size, true)) ||

I tried to reuse the same code as much as possible :(

> 		    dma_kmalloc_needs_bounce(dev, size, dir)) {
> 			if (is_swiotlb_active(dev))
> 				return swiotlb_map(dev, phys, size, dir, attrs);
>
> 			goto err_overflow;
> 		}
> 		if (!dev_is_dma_coherent(dev) &&
> 		    !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
> 			arch_sync_dma_for_device(phys, size, dir);
> 	}

Like Jason wrote, but in diff format:

diff --git a/kernel/dma/direct.h b/kernel/dma/direct.h
index 92dbadcd3b2f..3f4792910604 100644
--- a/kernel/dma/direct.h
+++ b/kernel/dma/direct.h
@@ -85,7 +85,6 @@ static inline dma_addr_t dma_direct_map_phys(struct device *dev,
 		unsigned long attrs)
 {
 	dma_addr_t dma_addr;
-	bool capable;
 
 	if (is_swiotlb_force_bounce(dev)) {
 		if (attrs & DMA_ATTR_MMIO)
@@ -94,17 +93,19 @@ static inline dma_addr_t dma_direct_map_phys(struct device *dev,
 		return swiotlb_map(dev, phys, size, dir, attrs);
 	}
 
-	if (attrs & DMA_ATTR_MMIO)
+	if (attrs & DMA_ATTR_MMIO) {
 		dma_addr = phys;
-	else
+		if (unlikely(!dma_capable(dev, dma_addr, size, false)))
+			goto err_overflow;
+	} else {
 		dma_addr = phys_to_dma(dev, phys);
+		if (unlikely(!dma_capable(dev, dma_addr, size, true)) ||
+		    dma_kmalloc_needs_bounce(dev, size, dir)) {
+			if (is_swiotlb_active(dev))
+				return swiotlb_map(dev, phys, size, dir, attrs);
 
-	capable = dma_capable(dev, dma_addr, size, !(attrs & DMA_ATTR_MMIO));
-	if (unlikely(!capable) || dma_kmalloc_needs_bounce(dev, size, dir)) {
-		if (is_swiotlb_active(dev) && !(attrs & DMA_ATTR_MMIO))
-			return swiotlb_map(dev, phys, size, dir, attrs);
-
-		goto err_overflow;
+			goto err_overflow;
+		}
 	}
 
 	if (!dev_is_dma_coherent(dev) &&

I created a new tag with the fixed code:
https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/tag/?h=dma-phys-Sep-2

Thanks

>
> Jason