Changelog:
v4:
 * Fixed kbuild error with mismatch in kmsan function declaration due to
   rebase error.
v3: https://lore.kernel.org/all/cover.1755193625.git.leon@kernel.org
 * Fixed typo in "cacheable" word
 * Simplified kmsan patch a lot to be simple argument refactoring
v2: https://lore.kernel.org/all/cover.1755153054.git.leon@kernel.org
 * Used commit messages and cover letter from Jason
 * Moved setting IOMMU_MMIO flag to dma_info_to_prot function
 * Micro-optimized the code
 * Rebased code on v6.17-rc1
v1: https://lore.kernel.org/all/cover.1754292567.git.leon@kernel.org
 * Added new DMA_ATTR_MMIO attribute to indicate
   PCI_P2PDMA_MAP_THRU_HOST_BRIDGE path.
 * Rewrote dma_map_* functions to use this new attribute
v0: https://lore.kernel.org/all/cover.1750854543.git.leon@kernel.org/
------------------------------------------------------------------------

This series refactors the DMA mapping to use physical addresses as the primary interface instead of page+offset parameters. This change aligns the DMA API with the underlying hardware reality where DMA operations work with physical addresses, not page structures.

The series maintains export symbol backward compatibility by keeping the old page-based API as wrapper functions around the new physical address-based implementations.

This series refactors the DMA mapping API to provide a phys_addr_t based, and struct-page free, external API that can handle all the mapping cases we want in modern systems:

 - struct page based cachable DRAM
 - struct page MEMORY_DEVICE_PCI_P2PDMA PCI peer to peer non-cachable MMIO
 - struct page-less PCI peer to peer non-cachable MMIO
 - struct page-less "resource" MMIO

Overall this gets much closer to Matthew's long term wish for struct-pageless IO to cachable DRAM. The remaining primary work would be in the mm side to allow kmap_local_pfn()/phys_to_virt() to work on phys_addr_t without a struct page.

The general design is to remove struct page usage entirely from the DMA API inner layers. For flows that need to have a KVA for the physical address they can use kmap_local_pfn() or phys_to_virt(). This isolates the struct page requirements to MM code only. Long term all removals of struct page usage are supporting Matthew's memdesc project which seeks to substantially transform how struct page works.

Instead make the DMA API internals work on phys_addr_t. Internally there are still dedicated 'page' and 'resource' flows, except they are now distinguished by a new DMA_ATTR_MMIO instead of by callchain. Both flows use the same phys_addr_t.

When DMA_ATTR_MMIO is specified things work similar to the existing 'resource' flow. kmap_local_pfn(), phys_to_virt(), phys_to_page(), pfn_valid(), etc are never called on the phys_addr_t. This requires rejecting any configuration that would need swiotlb. CPU cache flushing is not required, and avoided, as ATTR_MMIO also indicates the address has no cachable mappings. This effectively removes any DMA API side requirement to have struct page when DMA_ATTR_MMIO is used.

In the !DMA_ATTR_MMIO mode things work similarly to the 'page' flow, except on the common path of no cache flush and no swiotlb it never touches a struct page. When cache flushing or swiotlb copying is needed, kmap_local_pfn()/phys_to_virt() are used to get a KVA for CPU usage. This was already the case on the unmap side, now the map side is symmetric.

Callers are adjusted to set DMA_ATTR_MMIO. Existing 'resource' users must set it. The existing struct page based MEMORY_DEVICE_PCI_P2PDMA path must also set it.
This corrects some existing bugs where iommu mappings for P2P MMIO were improperly marked IOMMU_CACHE.

Since ATTR_MMIO is made to work with all the existing DMA map entry points, particularly dma_iova_link(), this finally allows a way to use the new DMA API to map PCI P2P MMIO without creating struct page. The VFIO DMABUF series demonstrates how this works. This is intended to replace the incorrect driver use of dma_map_resource() on PCI BAR addresses.

This series does the core code and modern flows. A followup series will give the same treatment to the legacy dma_ops implementation.

Thanks

Leon Romanovsky (16):
  dma-mapping: introduce new DMA attribute to indicate MMIO memory
  iommu/dma: implement DMA_ATTR_MMIO for dma_iova_link().
  dma-debug: refactor to use physical addresses for page mapping
  dma-mapping: rename trace_dma_*map_page to trace_dma_*map_phys
  iommu/dma: rename iommu_dma_*map_page to iommu_dma_*map_phys
  iommu/dma: extend iommu_dma_*map_phys API to handle MMIO memory
  dma-mapping: convert dma_direct_*map_page to be phys_addr_t based
  kmsan: convert kmsan_handle_dma to use physical addresses
  dma-mapping: handle MMIO flow in dma_map|unmap_page
  xen: swiotlb: Open code map_resource callback
  dma-mapping: export new dma_*map_phys() interface
  mm/hmm: migrate to physical address-based DMA mapping API
  mm/hmm: properly take MMIO path
  block-dma: migrate to dma_map_phys instead of map_page
  block-dma: properly take MMIO path
  nvme-pci: unmap MMIO pages with appropriate interface

 Documentation/core-api/dma-api.rst        |   4 +-
 Documentation/core-api/dma-attributes.rst |  18 ++++
 arch/powerpc/kernel/dma-iommu.c           |   4 +-
 block/blk-mq-dma.c                        |  15 ++-
 drivers/iommu/dma-iommu.c                 |  61 +++++------
 drivers/nvme/host/pci.c                   |  18 +++-
 drivers/virtio/virtio_ring.c              |   4 +-
 drivers/xen/swiotlb-xen.c                 |  21 +++-
 include/linux/blk-mq-dma.h                |   6 +-
 include/linux/blk_types.h                 |   2 +
 include/linux/dma-direct.h                |   2 -
 include/linux/dma-map-ops.h               |   8 +-
 include/linux/dma-mapping.h               |  33 ++++++
 include/linux/iommu-dma.h                 |  11 +-
 include/linux/kmsan.h                     |   9 +-
 include/trace/events/dma.h                |   9 +-
 kernel/dma/debug.c                        |  71 ++++---------
 kernel/dma/debug.h                        |  37 ++-----
 kernel/dma/direct.c                       |  22 +---
 kernel/dma/direct.h                       |  52 ++++++----
 kernel/dma/mapping.c                      | 117 +++++++++++++---------
 kernel/dma/ops_helpers.c                  |   6 +-
 mm/hmm.c                                  |  19 ++--
 mm/kmsan/hooks.c                          |   5 +-
 rust/kernel/dma.rs                        |   3 +
 tools/virtio/linux/kmsan.h                |   2 +-
 26 files changed, 305 insertions(+), 254 deletions(-)

-- 
2.50.1
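[For readers new to the proposed interface, here is a minimal usage sketch, assuming dma_map_phys()/dma_unmap_phys() take (dev, phys/dma_addr, size, dir, attrs) the same way their page-based counterparts do, as the "dma-mapping: export new dma_*map_phys() interface" patch proposes. Treat it as an illustration, not the final API.]

	#include <linux/dma-mapping.h>

	/*
	 * Sketch only: DMA_ATTR_MMIO marks an address that has no struct page
	 * and no cachable mapping (e.g. a PCI BAR), so the core must not use
	 * swiotlb, cache maintenance or phys_to_page()/pfn_valid() on it.
	 */
	static int example_map(struct device *dev, phys_addr_t phys, size_t len,
			       bool is_mmio, dma_addr_t *out)
	{
		unsigned long attrs = is_mmio ? DMA_ATTR_MMIO : 0;
		dma_addr_t dma;

		dma = dma_map_phys(dev, phys, len, DMA_TO_DEVICE, attrs);
		if (dma_mapping_error(dev, dma))
			return -ENOMEM;

		*out = dma;
		return 0;
	}

	static void example_unmap(struct device *dev, dma_addr_t dma, size_t len,
				  bool is_mmio)
	{
		/* the same attrs must be passed again at unmap time */
		dma_unmap_phys(dev, dma, len, DMA_TO_DEVICE,
			       is_mmio ? DMA_ATTR_MMIO : 0);
	}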
On Tue, Aug 19, 2025 at 08:36:44PM +0300, Leon Romanovsky wrote:
> This series does the core code and modern flows. A followup series
> will give the same treatment to the legacy dma_ops implementation.

I took a quick check over this to see that it is sane. I think using
phys is an improvement for most of the dma_ops implementations.

arch/sparc/kernel/pci_sun4v.c
arch/sparc/kernel/iommu.c
  Uses __pa to get phys from the page, never touches page

arch/alpha/kernel/pci_iommu.c
arch/sparc/mm/io-unit.c
drivers/parisc/ccio-dma.c
drivers/parisc/sba_iommu.c
  Does page_address() and later does __pa on it. Doesn't touch struct page

arch/x86/kernel/amd_gart_64.c
drivers/xen/swiotlb-xen.c
arch/mips/jazz/jazzdma.c
  Immediately does page_to_phys(), never touches struct page

drivers/vdpa/vdpa_user/vduse_dev.c
  Does page_to_phys() to call iommu_map()

drivers/xen/grant-dma-ops.c
  Does page_to_pfn() and nothing else

arch/powerpc/platforms/ps3/system-bus.c
  This is a maze but I think it wants only phys and the virt is only
  used for debug prints.

The above all never touch a KVA and just want a phys_addr_t.

The below are touching the KVA somehow:

arch/sparc/mm/iommu.c
arch/arm/mm/dma-mapping.c
  Uses page_address to cache flush, would be happy with phys_to_virt()
  and a PhysHighMem()

arch/powerpc/kernel/dma-iommu.c
arch/powerpc/platforms/pseries/vio.c
  Uses iommu_map_page() which wants phys_to_virt(), doesn't touch
  struct page

arch/powerpc/platforms/pseries/ibmebus.c
  Returns phys_to_virt() as dma_addr_t.

The two PPC ones are weird, I didn't figure out how that was working..

It would be easy to make map_phys patches for about half of these, in
the first grouping. Doing so would also grant those arches
map_resource capability.

Overall I didn't think there was any reduction in maintainability in
these places. Most are improvements eliminating code, and some are
just switching to phys_to_virt() from page_address(), which we could
further guard with DMA_ATTR_MMIO and a check for highmem.

Jason
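[To illustrate the pattern described for the first grouping above, a purely illustrative sketch follows; it is not taken from any of the listed files, and the example_* names are hypothetical. It shows a legacy dma_map_ops .map_page callback that converts to a physical address on its first line and never touches the struct page again, so a phys_addr_t based callback would simply drop that conversion.]

	/*
	 * Illustrative only: the struct page is consumed immediately by
	 * page_to_phys() and never used afterwards, so a future .map_phys
	 * style callback would only remove the first line.
	 */
	static dma_addr_t example_iommu_map_page(struct device *dev,
						 struct page *page,
						 unsigned long offset, size_t size,
						 enum dma_data_direction dir,
						 unsigned long attrs)
	{
		phys_addr_t phys = page_to_phys(page) + offset;

		/* everything from here on works on phys only */
		return example_iommu_insert_mapping(dev, phys, size, dir);
	}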
On 29.08.2025 15:16, Jason Gunthorpe wrote:
> On Tue, Aug 19, 2025 at 08:36:44PM +0300, Leon Romanovsky wrote:
>
>> This series does the core code and modern flows. A followup series
>> will give the same treatment to the legacy dma_ops implementation.
> I took a quick check over this to see that it is sane. I think using
> phys is an improvement for most of the dma_ops implementations.
>
> arch/sparc/kernel/pci_sun4v.c
> arch/sparc/kernel/iommu.c
>   Uses __pa to get phys from the page, never touches page
>
> arch/alpha/kernel/pci_iommu.c
> arch/sparc/mm/io-unit.c
> drivers/parisc/ccio-dma.c
> drivers/parisc/sba_iommu.c
>   Does page_address() and later does __pa on it. Doesn't touch struct page
>
> arch/x86/kernel/amd_gart_64.c
> drivers/xen/swiotlb-xen.c
> arch/mips/jazz/jazzdma.c
>   Immediately does page_to_phys(), never touches struct page
>
> drivers/vdpa/vdpa_user/vduse_dev.c
>   Does page_to_phys() to call iommu_map()
>
> drivers/xen/grant-dma-ops.c
>   Does page_to_pfn() and nothing else
>
> arch/powerpc/platforms/ps3/system-bus.c
>   This is a maze but I think it wants only phys and the virt is only
>   used for debug prints.
>
> The above all never touch a KVA and just want a phys_addr_t.
>
> The below are touching the KVA somehow:
>
> arch/sparc/mm/iommu.c
> arch/arm/mm/dma-mapping.c
>   Uses page_address to cache flush, would be happy with phys_to_virt()
>   and a PhysHighMem()
>
> arch/powerpc/kernel/dma-iommu.c
> arch/powerpc/platforms/pseries/vio.c
>   Uses iommu_map_page() which wants phys_to_virt(), doesn't touch
>   struct page
>
> arch/powerpc/platforms/pseries/ibmebus.c
>   Returns phys_to_virt() as dma_addr_t.
>
> The two PPC ones are weird, I didn't figure out how that was working..
>
> It would be easy to make map_phys patches for about half of these, in
> the first grouping. Doing so would also grant those arches
> map_resource capability.
>
> Overall I didn't think there was any reduction in maintainability in
> these places. Most are improvements eliminating code, and some are
> just switching to phys_to_virt() from page_address(), which we could
> further guard with DMA_ATTR_MMIO and a check for highmem.

Thanks for this summary.

However I would still like to get an answer for the simple question -
why all this work cannot be replaced by a simple use of dma_map_resource()?

I've checked the most advertised use case in
https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/log/?h=dmabuf-vfio
and I still don't see the reason why it cannot be based on the
dma_map_resource() API? I'm aware of the little asymmetry of the client
calls in such a case, indeed it is not pretty, but this should work even now:

	phys = phys_vec[i].paddr;

	if (is_mmio)
		dma_map_resource(phys, len, ...);
	else
		dma_map_page(phys_to_page(phys), offset_in_page(phys), ...);

What did I miss?

I'm not against this rework, but I would really like to know the
rationale. I know that the 2-step dma-mapping API also uses phys
addresses and this is the same direction.

This patchset focuses only on the dma_map_page -> dma_map_phys rework.
There are also other interfaces, like dma_alloc_pages(), and nothing has
been proposed for them so far.

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland
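[For reference, a minimal sketch of the asymmetric pattern suggested above, fleshed out with today's dma_map_page()/dma_map_resource() entry points; the phys/len/is_mmio inputs mirror the snippet in the mail, map_one_range is a hypothetical name, and error handling is elided.]

	/*
	 * Sketch of the workaround with the existing API: MMIO ranges go
	 * through dma_map_resource(), everything else through dma_map_page().
	 * Assumes the non-MMIO range does not cross a page boundary; real
	 * code would have to split it or use the scatterlist/link paths.
	 */
	static dma_addr_t map_one_range(struct device *dev, phys_addr_t phys,
					size_t len, bool is_mmio,
					enum dma_data_direction dir)
	{
		if (is_mmio)
			return dma_map_resource(dev, phys, len, dir, 0);

		return dma_map_page(dev, phys_to_page(phys),
				    offset_in_page(phys), len, dir);
	}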
On Fri, Sep 05, 2025 at 06:20:51PM +0200, Marek Szyprowski wrote:
> I've checked the most advertised use case in
> https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/log/?h=dmabuf-vfio
> and I still don't see the reason why it cannot be based on the
> dma_map_resource() API? I'm aware of the little asymmetry of the client
> calls in such a case, indeed it is not pretty, but this should work even now:
>
> 	phys = phys_vec[i].paddr;
>
> 	if (is_mmio)
> 		dma_map_resource(phys, len, ...);
> 	else
> 		dma_map_page(phys_to_page(phys), offset_in_page(phys), ...);
>
> What did I miss?

I have a somewhat different answer than Leon..

The link path would need a resource variation too:

+		ret = dma_iova_link(attachment->dev, state,
+				    phys_vec[i].paddr, 0,
+				    phys_vec[i].len, dir, attrs);
+		if (ret)
+			goto err_unmap_dma;
+
+		mapped_len += phys_vec[i].len;

It is an existing bug that we don't properly handle all details of MMIO
for link. Since this is already a phys_addr_t I wouldn't strongly argue
that should be done by adding ATTR_MMIO to dma_iova_link().

If you did that, then you'd still want a dma_(un)map_phys() helper that
handled ATTR_MMIO too. It could be an inline "if () resource else page"
wrapper like you say.

So API wise I think we have the right design here. I think the question
you are asking is how much changing to the internals of the DMA API do
you want to do to make ATTR_MMIO. It is not zero, but there is some
minimum that is less than this.

So, reason #1: much of this ATTR_MMIO is needed anyhow. Being consistent
and unifying the dma_map_resource path with ATTR_MMIO should improve the
long term maintainability of the code. We already uncovered paths where
map_resource is not behaving consistently with map_page and it is
unclear if these are bugs or deliberate.

Reason #2: we do actually want to get rid of struct page usage to help
advance Matthew's work. This means we want to build a clean struct page
less path for IO. Meaning we can do phys to virt, or kmap phys, but none
of: phys to page, page to virt, page to phys.

Stopping at a phys based public API and then leaving all the phys to
page/etc conversions hidden inside is not enough. This is why I was
looking at the dma_ops path, to see just how much page usage there is,
and I found very little. So this dream is achievable and with this
series we are there for ARM64 and x86 environments.

> This patchset focuses only on the dma_map_page -> dma_map_phys rework.
> There are also other interfaces, like dma_alloc_pages(), and nothing has
> been proposed for them so far.

That's because they already have non-page alternatives. Almost all
places call dma_alloc_noncoherent():

static inline void *dma_alloc_noncoherent(struct device *dev, size_t size,
		dma_addr_t *dma_handle, enum dma_data_direction dir, gfp_t gfp)
{
	struct page *page = dma_alloc_pages(dev, size, dma_handle, dir, gfp);

	return page ? page_address(page) : NULL;

Which is KVA based.

There is only one user I found of alloc_pages:

drivers/firewire/ohci.c:	ctx->pages[i] = dma_alloc_pages(dev, PAGE_SIZE, &dma_addr,

And it deliberately uses page->private:

	set_page_private(ctx->pages[i], dma_addr);

So it is correct to use the struct page API.

Some usages of dma_alloc_noncontiguous() can be implemented using the
dma_iova_link() flow like drivers/vfio/pci/mlx5/cmd.c shows by using
alloc_pages_bulk() for the allocator. We don't yet have a 'dma alloc
link' operation though, and there are only 4 users of
dma_alloc_noncontiguous()..

Jason
Hi,

I'm the present maintainer of the Linux FireWire subsystem, and in recent
years I have been working to modernize the subsystem.

On Fri, Sep 05, 2025 at 14:43:24PM -0300, Jason Gunthorpe wrote:
> There is only one user I found of alloc_pages:
>
> drivers/firewire/ohci.c:	ctx->pages[i] = dma_alloc_pages(dev, PAGE_SIZE, &dma_addr,
>
> And it deliberately uses page->private:
>
> 	set_page_private(ctx->pages[i], dma_addr);
>
> So it is correct to use the struct page API.

I've already realized it, and it is on my TODO list to use modern
alternative APIs to replace it (but not yet). If you know some candidates
for this purpose, it would really help to accomplish it.


Regards

Takashi Sakamoto
On Sun, Sep 07, 2025 at 11:25:09PM +0900, Takashi Sakamoto wrote:
> Hi,
>
> I'm the present maintainer of the Linux FireWire subsystem, and in recent
> years I have been working to modernize the subsystem.
>
> On Fri, Sep 05, 2025 at 14:43:24PM -0300, Jason Gunthorpe wrote:
> > There is only one user I found of alloc_pages:
> >
> > drivers/firewire/ohci.c:	ctx->pages[i] = dma_alloc_pages(dev, PAGE_SIZE, &dma_addr,
> >
> > And it deliberately uses page->private:
> >
> > 	set_page_private(ctx->pages[i], dma_addr);
> >
> > So it is correct to use the struct page API.
>
> I've already realized it, and it is on my TODO list to use modern
> alternative APIs to replace it (but not yet). If you know some candidates
> for this purpose, it would really help to accomplish it.

I think for now it is probably OKish, but in the medium/longer term this
probably wants to have its own memdesc like other cases. Ie instead of
using page->private you'd have a

struct ohci_desc {
	unsigned long __page_flags;

	dma_addr_t dma_addr;
	[..]
};

And instead of using page->private you'd use ohci_desc::dma_addr.

This would require changing dma_alloc_pages() to be able to allocate the
frozen memdescs.. Which we are not quite there yet, but maybe come back
to this in 2026?

Jason
On Fri, Sep 05, 2025 at 06:20:51PM +0200, Marek Szyprowski wrote:
> On 29.08.2025 15:16, Jason Gunthorpe wrote:
> > On Tue, Aug 19, 2025 at 08:36:44PM +0300, Leon Romanovsky wrote:
> >
> >> This series does the core code and modern flows. A followup series
> >> will give the same treatment to the legacy dma_ops implementation.
> > I took a quick check over this to see that it is sane. I think using
> > phys is an improvement for most of the dma_ops implementations.
> >
> > arch/sparc/kernel/pci_sun4v.c
> > arch/sparc/kernel/iommu.c
> >   Uses __pa to get phys from the page, never touches page
> >
> > arch/alpha/kernel/pci_iommu.c
> > arch/sparc/mm/io-unit.c
> > drivers/parisc/ccio-dma.c
> > drivers/parisc/sba_iommu.c
> >   Does page_address() and later does __pa on it. Doesn't touch struct page
> >
> > arch/x86/kernel/amd_gart_64.c
> > drivers/xen/swiotlb-xen.c
> > arch/mips/jazz/jazzdma.c
> >   Immediately does page_to_phys(), never touches struct page
> >
> > drivers/vdpa/vdpa_user/vduse_dev.c
> >   Does page_to_phys() to call iommu_map()
> >
> > drivers/xen/grant-dma-ops.c
> >   Does page_to_pfn() and nothing else
> >
> > arch/powerpc/platforms/ps3/system-bus.c
> >   This is a maze but I think it wants only phys and the virt is only
> >   used for debug prints.
> >
> > The above all never touch a KVA and just want a phys_addr_t.
> >
> > The below are touching the KVA somehow:
> >
> > arch/sparc/mm/iommu.c
> > arch/arm/mm/dma-mapping.c
> >   Uses page_address to cache flush, would be happy with phys_to_virt()
> >   and a PhysHighMem()
> >
> > arch/powerpc/kernel/dma-iommu.c
> > arch/powerpc/platforms/pseries/vio.c
> >   Uses iommu_map_page() which wants phys_to_virt(), doesn't touch
> >   struct page
> >
> > arch/powerpc/platforms/pseries/ibmebus.c
> >   Returns phys_to_virt() as dma_addr_t.
> >
> > The two PPC ones are weird, I didn't figure out how that was working..
> >
> > It would be easy to make map_phys patches for about half of these, in
> > the first grouping. Doing so would also grant those arches
> > map_resource capability.
> >
> > Overall I didn't think there was any reduction in maintainability in
> > these places. Most are improvements eliminating code, and some are
> > just switching to phys_to_virt() from page_address(), which we could
> > further guard with DMA_ATTR_MMIO and a check for highmem.
>
> Thanks for this summary.
>
> However I would still like to get an answer for the simple question -
> why all this work cannot be replaced by a simple use of dma_map_resource()?
>
> I've checked the most advertised use case in
> https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/log/?h=dmabuf-vfio
> and I still don't see the reason why it cannot be based on the
> dma_map_resource() API? I'm aware of the little asymmetry of the client
> calls in such a case, indeed it is not pretty, but this should work even now:
>
> 	phys = phys_vec[i].paddr;
>
> 	if (is_mmio)
> 		dma_map_resource(phys, len, ...);
> 	else
> 		dma_map_page(phys_to_page(phys), offset_in_page(phys), ...);
>
> What did I miss?

"Even now" can't work mainly because both of these interfaces don't
support the p2p case (PCI_P2PDMA_MAP_BUS_ADDR). It is unclear how to
extend them without introducing new functions and/or changing the whole
kernel.

In the PCI_P2PDMA_MAP_BUS_ADDR case, there is no struct page, so
dma_map_page() is unlikely to be possible to extend, and
dma_map_resource() has no direct way to access the PCI bus_offset. In
theory it is doable, but it would be a layering violation as DMA would
need to rely on the PCI layer for address calculations.
If we don't extend, in the general case (for HMM, RDMA and NVMe) the end
result will be something like that:

	if (...PCI_P2PDMA_MAP_BUS_ADDR)
		pci_p2pdma_bus_addr_map
	else if (mmio)
		dma_map_resource
	else     <- this case is not applicable to VFIO-DMABUF
		dma_map_page

In case we somehow extend these functions to support it, we will lose a
very important optimization where we perform one IOTLB sync for the
whole DMABUF region == dma_iova_state, and I was told that it is a very
large region.

   103         for (i = 0; i < priv->nr_ranges; i++) {
<...>
   107                 } else if (dma_use_iova(state)) {
   108                         ret = dma_iova_link(attachment->dev, state,
   109                                             phys_vec[i].paddr, 0,
   110                                             phys_vec[i].len, dir, attrs);
   111                         if (ret)
   112                                 goto err_unmap_dma;
   113
   114                         mapped_len += phys_vec[i].len;
<...>
   132         }
   133
   134         if (state && dma_use_iova(state)) {
   135                 WARN_ON_ONCE(mapped_len != priv->size);
   136                 ret = dma_iova_sync(attachment->dev, state, 0, mapped_len);

>
> I'm not against this rework, but I would really like to know the
> rationale. I know that the 2-step dma-mapping API also uses phys
> addresses and this is the same direction.

This series is a continuation of the 2-step dma-mapping API. The plan to
provide dma_map_phys() was there from the beginning.

Thanks
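[For illustration, a hedged sketch of the dispatch outlined above as it could look with the new API from this series: dma_map_phys() and DMA_ATTR_MMIO come from this patchset, pci_p2pdma_bus_addr_map() is the existing helper named in the pseudo-code, map_one_phys_range is a hypothetical name, and the dma_iova_link() fast path plus error handling are omitted.]

	static dma_addr_t map_one_phys_range(struct device *dev,
					     struct pci_p2pdma_map_state *p2pdma,
					     enum pci_p2pdma_map_type type,
					     phys_addr_t phys, size_t len,
					     enum dma_data_direction dir)
	{
		unsigned long attrs = 0;

		switch (type) {
		case PCI_P2PDMA_MAP_BUS_ADDR:
			/* p2p below a PCI switch: only the bus offset applies */
			return pci_p2pdma_bus_addr_map(p2pdma, phys);
		case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
			/* MMIO via the host bridge: no KVA, swiotlb or cache sync */
			attrs |= DMA_ATTR_MMIO;
			fallthrough;
		case PCI_P2PDMA_MAP_NONE:
			/* ordinary host memory, or MMIO with DMA_ATTR_MMIO set */
			return dma_map_phys(dev, phys, len, dir, attrs);
		default:
			return DMA_MAPPING_ERROR;
		}
	}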
On Tue, Aug 19, 2025 at 08:36:44PM +0300, Leon Romanovsky wrote: > Changelog: > v4: > * Fixed kbuild error with mismatch in kmsan function declaration due to > rebase error. > v3: https://lore.kernel.org/all/cover.1755193625.git.leon@kernel.org > * Fixed typo in "cacheable" word > * Simplified kmsan patch a lot to be simple argument refactoring > v2: https://lore.kernel.org/all/cover.1755153054.git.leon@kernel.org > * Used commit messages and cover letter from Jason > * Moved setting IOMMU_MMIO flag to dma_info_to_prot function > * Micro-optimized the code > * Rebased code on v6.17-rc1 > v1: https://lore.kernel.org/all/cover.1754292567.git.leon@kernel.org > * Added new DMA_ATTR_MMIO attribute to indicate > PCI_P2PDMA_MAP_THRU_HOST_BRIDGE path. > * Rewrote dma_map_* functions to use thus new attribute > v0: https://lore.kernel.org/all/cover.1750854543.git.leon@kernel.org/ > ------------------------------------------------------------------------ > > This series refactors the DMA mapping to use physical addresses > as the primary interface instead of page+offset parameters. This > change aligns the DMA API with the underlying hardware reality where > DMA operations work with physical addresses, not page structures. > > The series maintains export symbol backward compatibility by keeping > the old page-based API as wrapper functions around the new physical > address-based implementations. > > This series refactors the DMA mapping API to provide a phys_addr_t > based, and struct-page free, external API that can handle all the > mapping cases we want in modern systems: > > - struct page based cachable DRAM > - struct page MEMORY_DEVICE_PCI_P2PDMA PCI peer to peer non-cachable > MMIO > - struct page-less PCI peer to peer non-cachable MMIO > - struct page-less "resource" MMIO > > Overall this gets much closer to Matthew's long term wish for > struct-pageless IO to cachable DRAM. The remaining primary work would > be in the mm side to allow kmap_local_pfn()/phys_to_virt() to work on > phys_addr_t without a struct page. > > The general design is to remove struct page usage entirely from the > DMA API inner layers. For flows that need to have a KVA for the > physical address they can use kmap_local_pfn() or phys_to_virt(). This > isolates the struct page requirements to MM code only. Long term all > removals of struct page usage are supporting Matthew's memdesc > project which seeks to substantially transform how struct page works. > > Instead make the DMA API internals work on phys_addr_t. Internally > there are still dedicated 'page' and 'resource' flows, except they are > now distinguished by a new DMA_ATTR_MMIO instead of by callchain. Both > flows use the same phys_addr_t. > > When DMA_ATTR_MMIO is specified things work similar to the existing > 'resource' flow. kmap_local_pfn(), phys_to_virt(), phys_to_page(), > pfn_valid(), etc are never called on the phys_addr_t. This requires > rejecting any configuration that would need swiotlb. CPU cache > flushing is not required, and avoided, as ATTR_MMIO also indicates the > address have no cachable mappings. This effectively removes any > DMA API side requirement to have struct page when DMA_ATTR_MMIO is > used. > > In the !DMA_ATTR_MMIO mode things work similarly to the 'page' flow, > except on the common path of no cache flush, no swiotlb it never > touches a struct page. When cache flushing or swiotlb copying > kmap_local_pfn()/phys_to_virt() are used to get a KVA for CPU > usage. 
This was already the case on the unmap side, now the map side > is symmetric. > > Callers are adjusted to set DMA_ATTR_MMIO. Existing 'resource' users > must set it. The existing struct page based MEMORY_DEVICE_PCI_P2PDMA > path must also set it. This corrects some existing bugs where iommu > mappings for P2P MMIO were improperly marked IOMMU_CACHE. > > Since ATTR_MMIO is made to work with all the existing DMA map entry > points, particularly dma_iova_link(), this finally allows a way to use > the new DMA API to map PCI P2P MMIO without creating struct page. The > VFIO DMABUF series demonstrates how this works. This is intended to > replace the incorrect driver use of dma_map_resource() on PCI BAR > addresses. > > This series does the core code and modern flows. A followup series > will give the same treatment to the legacy dma_ops implementation. > > Thanks > > Leon Romanovsky (16): > dma-mapping: introduce new DMA attribute to indicate MMIO memory > iommu/dma: implement DMA_ATTR_MMIO for dma_iova_link(). > dma-debug: refactor to use physical addresses for page mapping > dma-mapping: rename trace_dma_*map_page to trace_dma_*map_phys > iommu/dma: rename iommu_dma_*map_page to iommu_dma_*map_phys > iommu/dma: extend iommu_dma_*map_phys API to handle MMIO memory > dma-mapping: convert dma_direct_*map_page to be phys_addr_t based > kmsan: convert kmsan_handle_dma to use physical addresses > dma-mapping: handle MMIO flow in dma_map|unmap_page > xen: swiotlb: Open code map_resource callback > dma-mapping: export new dma_*map_phys() interface > mm/hmm: migrate to physical address-based DMA mapping API > mm/hmm: properly take MMIO path > block-dma: migrate to dma_map_phys instead of map_page > block-dma: properly take MMIO path > nvme-pci: unmap MMIO pages with appropriate interface > > Documentation/core-api/dma-api.rst | 4 +- > Documentation/core-api/dma-attributes.rst | 18 ++++ > arch/powerpc/kernel/dma-iommu.c | 4 +- > block/blk-mq-dma.c | 15 ++- > drivers/iommu/dma-iommu.c | 61 +++++------ > drivers/nvme/host/pci.c | 18 +++- > drivers/virtio/virtio_ring.c | 4 +- > drivers/xen/swiotlb-xen.c | 21 +++- > include/linux/blk-mq-dma.h | 6 +- > include/linux/blk_types.h | 2 + > include/linux/dma-direct.h | 2 - > include/linux/dma-map-ops.h | 8 +- > include/linux/dma-mapping.h | 33 ++++++ > include/linux/iommu-dma.h | 11 +- > include/linux/kmsan.h | 9 +- > include/trace/events/dma.h | 9 +- > kernel/dma/debug.c | 71 ++++--------- > kernel/dma/debug.h | 37 ++----- > kernel/dma/direct.c | 22 +--- > kernel/dma/direct.h | 52 ++++++---- > kernel/dma/mapping.c | 117 +++++++++++++--------- > kernel/dma/ops_helpers.c | 6 +- > mm/hmm.c | 19 ++-- > mm/kmsan/hooks.c | 5 +- > rust/kernel/dma.rs | 3 + > tools/virtio/linux/kmsan.h | 2 +- > 26 files changed, 305 insertions(+), 254 deletions(-) Marek, So what are the next steps here? This series is pre-requirement for the VFIO MMIO patches. Thanks > > -- > 2.50.1 > >
On 28.08.2025 13:57, Leon Romanovsky wrote: > On Tue, Aug 19, 2025 at 08:36:44PM +0300, Leon Romanovsky wrote: >> Changelog: >> v4: >> * Fixed kbuild error with mismatch in kmsan function declaration due to >> rebase error. >> v3: https://lore.kernel.org/all/cover.1755193625.git.leon@kernel.org >> * Fixed typo in "cacheable" word >> * Simplified kmsan patch a lot to be simple argument refactoring >> v2: https://lore.kernel.org/all/cover.1755153054.git.leon@kernel.org >> * Used commit messages and cover letter from Jason >> * Moved setting IOMMU_MMIO flag to dma_info_to_prot function >> * Micro-optimized the code >> * Rebased code on v6.17-rc1 >> v1: https://lore.kernel.org/all/cover.1754292567.git.leon@kernel.org >> * Added new DMA_ATTR_MMIO attribute to indicate >> PCI_P2PDMA_MAP_THRU_HOST_BRIDGE path. >> * Rewrote dma_map_* functions to use thus new attribute >> v0: https://lore.kernel.org/all/cover.1750854543.git.leon@kernel.org/ >> ------------------------------------------------------------------------ >> >> This series refactors the DMA mapping to use physical addresses >> as the primary interface instead of page+offset parameters. This >> change aligns the DMA API with the underlying hardware reality where >> DMA operations work with physical addresses, not page structures. >> >> The series maintains export symbol backward compatibility by keeping >> the old page-based API as wrapper functions around the new physical >> address-based implementations. >> >> This series refactors the DMA mapping API to provide a phys_addr_t >> based, and struct-page free, external API that can handle all the >> mapping cases we want in modern systems: >> >> - struct page based cachable DRAM >> - struct page MEMORY_DEVICE_PCI_P2PDMA PCI peer to peer non-cachable >> MMIO >> - struct page-less PCI peer to peer non-cachable MMIO >> - struct page-less "resource" MMIO >> >> Overall this gets much closer to Matthew's long term wish for >> struct-pageless IO to cachable DRAM. The remaining primary work would >> be in the mm side to allow kmap_local_pfn()/phys_to_virt() to work on >> phys_addr_t without a struct page. >> >> The general design is to remove struct page usage entirely from the >> DMA API inner layers. For flows that need to have a KVA for the >> physical address they can use kmap_local_pfn() or phys_to_virt(). This >> isolates the struct page requirements to MM code only. Long term all >> removals of struct page usage are supporting Matthew's memdesc >> project which seeks to substantially transform how struct page works. >> >> Instead make the DMA API internals work on phys_addr_t. Internally >> there are still dedicated 'page' and 'resource' flows, except they are >> now distinguished by a new DMA_ATTR_MMIO instead of by callchain. Both >> flows use the same phys_addr_t. >> >> When DMA_ATTR_MMIO is specified things work similar to the existing >> 'resource' flow. kmap_local_pfn(), phys_to_virt(), phys_to_page(), >> pfn_valid(), etc are never called on the phys_addr_t. This requires >> rejecting any configuration that would need swiotlb. CPU cache >> flushing is not required, and avoided, as ATTR_MMIO also indicates the >> address have no cachable mappings. This effectively removes any >> DMA API side requirement to have struct page when DMA_ATTR_MMIO is >> used. >> >> In the !DMA_ATTR_MMIO mode things work similarly to the 'page' flow, >> except on the common path of no cache flush, no swiotlb it never >> touches a struct page. 
When cache flushing or swiotlb copying >> kmap_local_pfn()/phys_to_virt() are used to get a KVA for CPU >> usage. This was already the case on the unmap side, now the map side >> is symmetric. >> >> Callers are adjusted to set DMA_ATTR_MMIO. Existing 'resource' users >> must set it. The existing struct page based MEMORY_DEVICE_PCI_P2PDMA >> path must also set it. This corrects some existing bugs where iommu >> mappings for P2P MMIO were improperly marked IOMMU_CACHE. >> >> Since ATTR_MMIO is made to work with all the existing DMA map entry >> points, particularly dma_iova_link(), this finally allows a way to use >> the new DMA API to map PCI P2P MMIO without creating struct page. The >> VFIO DMABUF series demonstrates how this works. This is intended to >> replace the incorrect driver use of dma_map_resource() on PCI BAR >> addresses. >> >> This series does the core code and modern flows. A followup series >> will give the same treatment to the legacy dma_ops implementation. >> >> Thanks >> >> Leon Romanovsky (16): >> dma-mapping: introduce new DMA attribute to indicate MMIO memory >> iommu/dma: implement DMA_ATTR_MMIO for dma_iova_link(). >> dma-debug: refactor to use physical addresses for page mapping >> dma-mapping: rename trace_dma_*map_page to trace_dma_*map_phys >> iommu/dma: rename iommu_dma_*map_page to iommu_dma_*map_phys >> iommu/dma: extend iommu_dma_*map_phys API to handle MMIO memory >> dma-mapping: convert dma_direct_*map_page to be phys_addr_t based >> kmsan: convert kmsan_handle_dma to use physical addresses >> dma-mapping: handle MMIO flow in dma_map|unmap_page >> xen: swiotlb: Open code map_resource callback >> dma-mapping: export new dma_*map_phys() interface >> mm/hmm: migrate to physical address-based DMA mapping API >> mm/hmm: properly take MMIO path >> block-dma: migrate to dma_map_phys instead of map_page >> block-dma: properly take MMIO path >> nvme-pci: unmap MMIO pages with appropriate interface >> >> Documentation/core-api/dma-api.rst | 4 +- >> Documentation/core-api/dma-attributes.rst | 18 ++++ >> arch/powerpc/kernel/dma-iommu.c | 4 +- >> block/blk-mq-dma.c | 15 ++- >> drivers/iommu/dma-iommu.c | 61 +++++------ >> drivers/nvme/host/pci.c | 18 +++- >> drivers/virtio/virtio_ring.c | 4 +- >> drivers/xen/swiotlb-xen.c | 21 +++- >> include/linux/blk-mq-dma.h | 6 +- >> include/linux/blk_types.h | 2 + >> include/linux/dma-direct.h | 2 - >> include/linux/dma-map-ops.h | 8 +- >> include/linux/dma-mapping.h | 33 ++++++ >> include/linux/iommu-dma.h | 11 +- >> include/linux/kmsan.h | 9 +- >> include/trace/events/dma.h | 9 +- >> kernel/dma/debug.c | 71 ++++--------- >> kernel/dma/debug.h | 37 ++----- >> kernel/dma/direct.c | 22 +--- >> kernel/dma/direct.h | 52 ++++++---- >> kernel/dma/mapping.c | 117 +++++++++++++--------- >> kernel/dma/ops_helpers.c | 6 +- >> mm/hmm.c | 19 ++-- >> mm/kmsan/hooks.c | 5 +- >> rust/kernel/dma.rs | 3 + >> tools/virtio/linux/kmsan.h | 2 +- >> 26 files changed, 305 insertions(+), 254 deletions(-) > Marek, > > So what are the next steps here? This series is pre-requirement for the > VFIO MMIO patches. I waited a bit with a hope to get a comment from Robin. It looks that there is no other alternative for the phys addr in the struct page removal process. I would like to give those patches a try in linux-next, but in meantime I tested it on my test farm and found a regression in dma_map_resource() handling. 
Namely, dma_map_resource() is no longer possible with a size not aligned
to a kmalloc()'ed buffer, as dma_direct_map_phys() calls
dma_kmalloc_needs_bounce(), which in turn calls dma_kmalloc_size_aligned().
It looks like the check for !(attrs & DMA_ATTR_MMIO) should be moved one
level up in dma_direct_map_phys().

Here is the log:

------------[ cut here ]------------
dma-pl330 fe550000.dma-controller: DMA addr 0x00000000fe410024+4 overflow (mask ffffffff, bus limit 0).
WARNING: kernel/dma/direct.h:116 at dma_map_phys+0x3a4/0x3ec, CPU#1: speaker-test/405
Modules linked in: ...
CPU: 1 UID: 0 PID: 405 Comm: speaker-test Not tainted 6.17.0-rc4-next-20250901+ #10958 PREEMPT
Hardware name: Hardkernel ODROID-M1 (DT)
pstate: 604000c9 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : dma_map_phys+0x3a4/0x3ec
lr : dma_map_phys+0x3a4/0x3ec
...
Call trace:
 dma_map_phys+0x3a4/0x3ec (P)
 dma_map_resource+0x14/0x20
 pl330_prep_slave_fifo+0x78/0xd0
 pl330_prep_dma_cyclic+0x70/0x2b0
 snd_dmaengine_pcm_trigger+0xec/0x8bc [snd_pcm_dmaengine]
 dmaengine_pcm_trigger+0x18/0x24 [snd_soc_core]
 snd_soc_pcm_component_trigger+0x164/0x208 [snd_soc_core]
 soc_pcm_trigger+0xe4/0x1ec [snd_soc_core]
 snd_pcm_do_start+0x44/0x70 [snd_pcm]
 snd_pcm_action_single+0x48/0xa4 [snd_pcm]
 snd_pcm_action+0x7c/0x98 [snd_pcm]
 snd_pcm_action_lock_irq+0x48/0xb4 [snd_pcm]
 snd_pcm_common_ioctl+0xf00/0x1f1c [snd_pcm]
 snd_pcm_ioctl+0x30/0x48 [snd_pcm]
 __arm64_sys_ioctl+0xac/0x104
 invoke_syscall+0x48/0x110
 el0_svc_common.constprop.0+0x40/0xe8
 do_el0_svc+0x20/0x2c
 el0_svc+0x4c/0x160
 el0t_64_sync_handler+0xa0/0xe4
 el0t_64_sync+0x198/0x19c
irq event stamp: 6596
hardirqs last enabled at (6595): [<ffff800081344624>] _raw_spin_unlock_irqrestore+0x74/0x78
hardirqs last disabled at (6596): [<ffff8000813439b0>] _raw_spin_lock_irq+0x78/0x7c
softirqs last enabled at (6076): [<ffff8000800c2294>] handle_softirqs+0x4c4/0x4dc
softirqs last disabled at (6071): [<ffff800080010690>] __do_softirq+0x14/0x20
---[ end trace 0000000000000000 ]---
rockchip-i2s-tdm fe410000.i2s: ASoC error (-12): at soc_component_trigger() on fe410000.i2s

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland
On Mon, Sep 01, 2025 at 11:47:59PM +0200, Marek Szyprowski wrote:
> I would like to give those patches a try in linux-next, but in meantime
> I tested it on my test farm and found a regression in dma_map_resource()
> handling. Namely, dma_map_resource() is no longer possible with a size
> not aligned to a kmalloc()'ed buffer, as dma_direct_map_phys() calls
> dma_kmalloc_needs_bounce(),

Hmm, it's this bit:

	capable = dma_capable(dev, dma_addr, size, !(attrs & DMA_ATTR_MMIO));
	if (unlikely(!capable) || dma_kmalloc_needs_bounce(dev, size, dir)) {
		if (is_swiotlb_active(dev) && !(attrs & DMA_ATTR_MMIO))
			return swiotlb_map(dev, phys, size, dir, attrs);

		goto err_overflow;
	}

We shouldn't be checking dma_kmalloc_needs_bounce() on mmio as there
is no cache flushing so the "dma safe alignment" for non-coherent DMA
does not apply.

Like you say looks good to me, and more of the surrounding code can be
pulled in too, no sense in repeating the boolean logic:

	if (attrs & DMA_ATTR_MMIO) {
		dma_addr = phys;
		if (unlikely(!dma_capable(dev, dma_addr, size, false)))
			goto err_overflow;
	} else {
		dma_addr = phys_to_dma(dev, phys);
		if (unlikely(!dma_capable(dev, dma_addr, size, true)) ||
		    dma_kmalloc_needs_bounce(dev, size, dir)) {
			if (is_swiotlb_active(dev))
				return swiotlb_map(dev, phys, size, dir, attrs);

			goto err_overflow;
		}
		if (!dev_is_dma_coherent(dev) &&
		    !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
			arch_sync_dma_for_device(phys, size, dir);
	}

Jason
On Mon, Sep 01, 2025 at 07:23:02PM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 01, 2025 at 11:47:59PM +0200, Marek Szyprowski wrote:
> > I would like to give those patches a try in linux-next, but in meantime
> > I tested it on my test farm and found a regression in dma_map_resource()
> > handling. Namely, dma_map_resource() is no longer possible with a size
> > not aligned to a kmalloc()'ed buffer, as dma_direct_map_phys() calls
> > dma_kmalloc_needs_bounce(),
>
> Hmm, it's this bit:
>
> 	capable = dma_capable(dev, dma_addr, size, !(attrs & DMA_ATTR_MMIO));
> 	if (unlikely(!capable) || dma_kmalloc_needs_bounce(dev, size, dir)) {
> 		if (is_swiotlb_active(dev) && !(attrs & DMA_ATTR_MMIO))
> 			return swiotlb_map(dev, phys, size, dir, attrs);
>
> 		goto err_overflow;
> 	}
>
> We shouldn't be checking dma_kmalloc_needs_bounce() on mmio as there
> is no cache flushing so the "dma safe alignment" for non-coherent DMA
> does not apply.
>
> Like you say looks good to me, and more of the surrounding code can be
> pulled in too, no sense in repeating the boolean logic:
>
> 	if (attrs & DMA_ATTR_MMIO) {
> 		dma_addr = phys;
> 		if (unlikely(!dma_capable(dev, dma_addr, size, false)))
> 			goto err_overflow;
> 	} else {
> 		dma_addr = phys_to_dma(dev, phys);
> 		if (unlikely(!dma_capable(dev, dma_addr, size, true)) ||

I tried to reuse the same code as much as possible :(

> 		    dma_kmalloc_needs_bounce(dev, size, dir)) {
> 			if (is_swiotlb_active(dev))
> 				return swiotlb_map(dev, phys, size, dir, attrs);
>
> 			goto err_overflow;
> 		}
> 		if (!dev_is_dma_coherent(dev) &&
> 		    !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
> 			arch_sync_dma_for_device(phys, size, dir);
> 	}

Like Jason wrote, but in diff format:

diff --git a/kernel/dma/direct.h b/kernel/dma/direct.h
index 92dbadcd3b2f..3f4792910604 100644
--- a/kernel/dma/direct.h
+++ b/kernel/dma/direct.h
@@ -85,7 +85,6 @@ static inline dma_addr_t dma_direct_map_phys(struct device *dev,
 		unsigned long attrs)
 {
 	dma_addr_t dma_addr;
-	bool capable;
 
 	if (is_swiotlb_force_bounce(dev)) {
 		if (attrs & DMA_ATTR_MMIO)
@@ -94,17 +93,19 @@ static inline dma_addr_t dma_direct_map_phys(struct device *dev,
 		return swiotlb_map(dev, phys, size, dir, attrs);
 	}
 
-	if (attrs & DMA_ATTR_MMIO)
+	if (attrs & DMA_ATTR_MMIO) {
 		dma_addr = phys;
-	else
+		if (unlikely(!dma_capable(dev, dma_addr, size, false)))
+			goto err_overflow;
+	} else {
 		dma_addr = phys_to_dma(dev, phys);
+		if (unlikely(!dma_capable(dev, dma_addr, size, true)) ||
+		    dma_kmalloc_needs_bounce(dev, size, dir)) {
+			if (is_swiotlb_active(dev))
+				return swiotlb_map(dev, phys, size, dir, attrs);
 
-	capable = dma_capable(dev, dma_addr, size, !(attrs & DMA_ATTR_MMIO));
-	if (unlikely(!capable) || dma_kmalloc_needs_bounce(dev, size, dir)) {
-		if (is_swiotlb_active(dev) && !(attrs & DMA_ATTR_MMIO))
-			return swiotlb_map(dev, phys, size, dir, attrs);
-
-		goto err_overflow;
+			goto err_overflow;
+		}
 	}
 
 	if (!dev_is_dma_coherent(dev) &&

I created a new tag with the fixed code:
https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/tag/?h=dma-phys-Sep-2

Thanks

>
> Jason