From: Yonatan Maman <Ymaman@Nvidia.com>
hmm_range_fault() by default triggers a page fault on device private
pages when the HMM_PFN_REQ_FAULT flag is set, migrating them to RAM. In
some cases, such as with RDMA devices, the migration overhead between
the device (e.g., GPU) and the CPU, and vice versa, significantly
degrades performance. Thus, enabling Peer-to-Peer (P2P) DMA access for
device private pages might be crucial for minimizing data transfer
overhead.
This patch introduces an API to support P2P DMA for device private
pages, which includes:
 - Leveraging struct dev_pagemap_ops for a P2P page callback. This
   callback maps the page for P2P DMA and returns the corresponding
   PCI P2P page.
 - Utilizing hmm_range_fault() for initializing P2P DMA. The API adds
   the HMM_PFN_ALLOW_P2P flag so that a hmm_range_fault() caller can
   opt in to P2P. If set, and if the owner device supports P2P,
   hmm_range_fault() first attempts to establish the P2P connection
   via the get_dma_pfn_for_device() callback. On failure or lack of
   support, hmm_range_fault() continues with the regular flow of
   migrating the page to RAM.
This change does not affect existing users of hmm_range_fault(): both
the caller and the page owner must explicitly request and support P2P
for the connection to be initialized.
Signed-off-by: Yonatan Maman <Ymaman@Nvidia.com>
Signed-off-by: Gal Shalom <GalShalom@Nvidia.com>
---
include/linux/hmm.h | 2 ++
include/linux/memremap.h | 8 ++++++
mm/hmm.c | 57 +++++++++++++++++++++++++++++++---------
3 files changed, 55 insertions(+), 12 deletions(-)
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index db75ffc949a7..988c98c0edcc 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -27,6 +27,7 @@ struct mmu_interval_notifier;
* HMM_PFN_P2PDMA_BUS - Bus mapped P2P transfer
* HMM_PFN_DMA_MAPPED - Flag preserved on input-to-output transformation
* to mark that page is already DMA mapped
+ * HMM_PFN_ALLOW_P2P - Allow returning PCI P2PDMA page
*
* On input:
* 0 - Return the current state of the page, do not fault it.
@@ -47,6 +48,7 @@ enum hmm_pfn_flags {
HMM_PFN_DMA_MAPPED = 1UL << (BITS_PER_LONG - 4),
HMM_PFN_P2PDMA = 1UL << (BITS_PER_LONG - 5),
HMM_PFN_P2PDMA_BUS = 1UL << (BITS_PER_LONG - 6),
+ HMM_PFN_ALLOW_P2P = 1UL << (BITS_PER_LONG - 7),
HMM_PFN_ORDER_SHIFT = (BITS_PER_LONG - 11),
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 4aa151914eab..79becc37df00 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -89,6 +89,14 @@ struct dev_pagemap_ops {
*/
vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
+ /*
+ * Used for private (un-addressable) device memory only. Return a
+ * corresponding PFN for a page that can be mapped to device
+ * (e.g using dma_map_page)
+ */
+ int (*get_dma_pfn_for_device)(struct page *private_page,
+ unsigned long *dma_pfn);
+
/*
* Handle the memory failure happens on a range of pfns. Notify the
* processes who are using these pfns, and try to recover the data on
diff --git a/mm/hmm.c b/mm/hmm.c
index feac86196a65..089e522b346b 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -232,6 +232,49 @@ static inline unsigned long pte_to_hmm_pfn_flags(struct hmm_range *range,
return pte_write(pte) ? (HMM_PFN_VALID | HMM_PFN_WRITE) : HMM_PFN_VALID;
}
+static bool hmm_handle_device_private(struct hmm_range *range,
+ unsigned long pfn_req_flags,
+ swp_entry_t entry,
+ unsigned long *hmm_pfn)
+{
+ struct page *page = pfn_swap_entry_to_page(entry);
+ struct dev_pagemap *pgmap = page_pgmap(page);
+ int ret;
+
+ pfn_req_flags &= range->pfn_flags_mask;
+ pfn_req_flags |= range->default_flags;
+
+ /*
+ * Don't fault in device private pages owned by the caller,
+ * just report the PFN.
+ */
+ if (pgmap->owner == range->dev_private_owner) {
+ *hmm_pfn = swp_offset_pfn(entry);
+ goto found;
+ }
+
+	/*
+	 * For supported pages, and if the caller requested P2P,
+	 * translate the private page to the matching P2P page; if that
+	 * fails, continue with the regular flow.
+	 */
+ if (pfn_req_flags & HMM_PFN_ALLOW_P2P &&
+ pgmap->ops->get_dma_pfn_for_device) {
+ ret = pgmap->ops->get_dma_pfn_for_device(page, hmm_pfn);
+ if (!ret)
+ goto found;
+
+ }
+
+ return false;
+
+found:
+ *hmm_pfn |= HMM_PFN_VALID;
+ if (is_writable_device_private_entry(entry))
+ *hmm_pfn |= HMM_PFN_WRITE;
+ return true;
+}
+
static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
unsigned long end, pmd_t *pmdp, pte_t *ptep,
unsigned long *hmm_pfn)
@@ -255,19 +298,9 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
if (!pte_present(pte)) {
swp_entry_t entry = pte_to_swp_entry(pte);
- /*
- * Don't fault in device private pages owned by the caller,
- * just report the PFN.
- */
if (is_device_private_entry(entry) &&
- page_pgmap(pfn_swap_entry_to_page(entry))->owner ==
- range->dev_private_owner) {
- cpu_flags = HMM_PFN_VALID;
- if (is_writable_device_private_entry(entry))
- cpu_flags |= HMM_PFN_WRITE;
- new_pfn_flags = swp_offset_pfn(entry) | cpu_flags;
- goto out;
- }
+ hmm_handle_device_private(range, pfn_req_flags, entry, hmm_pfn))
+ return 0;
required_fault =
hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0);
--
2.34.1
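
To make the proposed hook concrete, here is a minimal driver-side
sketch of how a driver might implement the new callback. Everything
prefixed my_* is hypothetical (the struct layout, the offset
translation, and the assumption that the driver keeps its state in
pgmap->owner are all invented for illustration); only dev_pagemap_ops
and get_dma_pfn_for_device come from the patch itself.

#include <linux/memremap.h>
#include <linux/errno.h>

/* Hypothetical driver state; layout invented for this sketch. */
struct my_devmem_priv {
	bool p2p_enabled;		/* device exposes a P2PDMA BAR alias */
	unsigned long bar_base_pfn;	/* first PFN of the P2PDMA pages */
	unsigned long priv_base_pfn;	/* first fake device-private PFN */
};

/* Existing fault handler, defined elsewhere in the driver. */
static vm_fault_t my_devmem_migrate_to_ram(struct vm_fault *vmf);

static int my_get_dma_pfn_for_device(struct page *private_page,
				     unsigned long *dma_pfn)
{
	/* Assumes the driver registered its state as pgmap->owner. */
	struct my_devmem_priv *priv = page_pgmap(private_page)->owner;

	/* No P2P-capable alias: hmm_range_fault() falls back to migration. */
	if (!priv->p2p_enabled)
		return -EOPNOTSUPP;

	/*
	 * The fake device-private PFN and the PCI P2PDMA page alias the
	 * same device memory, so translate by offset within the pgmap.
	 */
	*dma_pfn = priv->bar_base_pfn +
		   (page_to_pfn(private_page) - priv->priv_base_pfn);
	return 0;
}

static const struct dev_pagemap_ops my_devmem_pagemap_ops = {
	.migrate_to_ram		= my_devmem_migrate_to_ram,
	.get_dma_pfn_for_device	= my_get_dma_pfn_for_device,
};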
On Fri, Jul 18, 2025 at 02:51:08PM +0300, Yonatan Maman wrote:
> From: Yonatan Maman <Ymaman@Nvidia.com>
>
> hmm_range_fault() by default triggers a page fault on device private
> pages when the HMM_PFN_REQ_FAULT flag is set, migrating them to RAM. In
> some cases, such as with RDMA devices, the migration overhead between
> the device (e.g., GPU) and the CPU, and vice versa, significantly
> degrades performance. Thus, enabling Peer-to-Peer (P2P) DMA access for
> device private pages might be crucial for minimizing data transfer
> overhead.

You don't enable DMA for device private pages. You allow discovering
a DMAable alias for device private pages.

Also absolutely nothing GPU specific here.

> +	/*
> +	 * Don't fault in device private pages owned by the caller,
> +	 * just report the PFN.
> +	 */
> +	if (pgmap->owner == range->dev_private_owner) {
> +		*hmm_pfn = swp_offset_pfn(entry);
> +		goto found;

This is dangerous because it mixes actual DMAable alias PFNs with the
device private fake PFNs. Maybe your hardware / driver can handle
it, but just leaking this out is not a good idea.

> +		    hmm_handle_device_private(range, pfn_req_flags, entry, hmm_pfn))

Overly long line here.
On Sun, Jul 20, 2025 at 11:59:10PM -0700, Christoph Hellwig wrote:
> > +	/*
> > +	 * Don't fault in device private pages owned by the caller,
> > +	 * just report the PFN.
> > +	 */
> > +	if (pgmap->owner == range->dev_private_owner) {
> > +		*hmm_pfn = swp_offset_pfn(entry);
> > +		goto found;
>
> This is dangerous because it mixes actual DMAable alias PFNs with the
> device private fake PFNs. Maybe your hardware / driver can handle
> it, but just leaking this out is not a good idea.

For better or worse that is how the hmm API works today. Recall the
result is an array of unsigned long with a pfn and flags:

enum hmm_pfn_flags {
	/* Output fields and flags */
	HMM_PFN_VALID = 1UL << (BITS_PER_LONG - 1),
	HMM_PFN_WRITE = 1UL << (BITS_PER_LONG - 2),
	HMM_PFN_ERROR = 1UL << (BITS_PER_LONG - 3),

The only promise is that every pfn has a struct page behind it. If the
caller specifies dev_private_owner then it must also look into the
struct page of every returned pfn to see if it is device private or
not.

hmm_dma_map_pfn() already unconditionally calls pci_p2pdma_state()
which checks for P2P struct pages.

It does sound like a good improvement to return the type of the pfn
(normal, p2p, private) in the flags bits as well to optimize away these
extra struct page lookups. But this is a different project..

Jason
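
To illustrate the check Jason describes, here is a sketch of the
caller-side classification loop after a successful hmm_range_fault();
handle_private() and handle_ram() are invented placeholders for
driver-specific logic, everything else is the existing hmm API.

#include <linux/hmm.h>
#include <linux/memremap.h>

/* Placeholders for driver-specific handling; names invented. */
static void handle_private(struct page *page) { /* driver logic */ }
static void handle_ram(struct page *page) { /* driver logic */ }

/* Walk the output array of a successful hmm_range_fault() call. */
static void classify_hmm_pfns(struct hmm_range *range)
{
	unsigned long npages = (range->end - range->start) >> PAGE_SHIFT;
	unsigned long i;

	for (i = 0; i < npages; i++) {
		unsigned long hmm_pfn = range->hmm_pfns[i];
		struct page *page;

		if (!(hmm_pfn & HMM_PFN_VALID))
			continue;

		/* Every valid pfn has a struct page behind it ... */
		page = hmm_pfn_to_page(hmm_pfn);

		/* ... but only the struct page says whether it is private. */
		if (is_device_private_page(page))
			handle_private(page);	/* fake pfn, not DMA'able */
		else
			handle_ram(page);	/* normal, DMA-mappable */
	}
}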
On 21/07/2025 9:59, Christoph Hellwig wrote:
> On Fri, Jul 18, 2025 at 02:51:08PM +0300, Yonatan Maman wrote:
>> From: Yonatan Maman <Ymaman@Nvidia.com>
>>
>> hmm_range_fault() by default triggers a page fault on device private
>> pages when the HMM_PFN_REQ_FAULT flag is set, migrating them to RAM. In
>> some cases, such as with RDMA devices, the migration overhead between
>> the device (e.g., GPU) and the CPU, and vice versa, significantly
>> degrades performance. Thus, enabling Peer-to-Peer (P2P) DMA access for
>> device private pages might be crucial for minimizing data transfer
>> overhead.
>
> You don't enable DMA for device private pages. You allow discovering
> a DMAable alias for device private pages.
>
> Also absolutely nothing GPU specific here.
>

Ok, understood, I will change it (v3).

>> +	/*
>> +	 * Don't fault in device private pages owned by the caller,
>> +	 * just report the PFN.
>> +	 */
>> +	if (pgmap->owner == range->dev_private_owner) {
>> +		*hmm_pfn = swp_offset_pfn(entry);
>> +		goto found;
>
> This is dangerous because it mixes actual DMAable alias PFNs with the
> device private fake PFNs. Maybe your hardware / driver can handle
> it, but just leaking this out is not a good idea.
>

In the current implementation, regular pci_p2p pages are returned as-is
from hmm_range_fault() - for a virtual address backed by a pci_p2p
page, it will return the corresponding PFN. That said, we can mark
these via the hmm_pfn output flags so the caller can handle them
appropriately.

>> +		    hmm_handle_device_private(range, pfn_req_flags, entry, hmm_pfn))
>
> Overly long line here.
>

will be fixed (v3)
On Fri, Jul 18, 2025 at 02:51:08PM +0300, Yonatan Maman wrote:
> +++ b/include/linux/memremap.h
> @@ -89,6 +89,14 @@ struct dev_pagemap_ops {
>  	 */
>  	vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
>
> +	/*
> +	 * Used for private (un-addressable) device memory only. Return a
> +	 * corresponding PFN for a page that can be mapped to device
> +	 * (e.g using dma_map_page)
> +	 */
> +	int (*get_dma_pfn_for_device)(struct page *private_page,
> +				      unsigned long *dma_pfn);

This makes no sense. If a page is addressable then it has a PFN.
If a page is not addressable then it doesn't have a PFN.
On Fri, Jul 18, 2025 at 03:17:00PM +0100, Matthew Wilcox wrote:
> On Fri, Jul 18, 2025 at 02:51:08PM +0300, Yonatan Maman wrote:
> > +++ b/include/linux/memremap.h
> > @@ -89,6 +89,14 @@ struct dev_pagemap_ops {
> >  	 */
> >  	vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
> >
> > +	/*
> > +	 * Used for private (un-addressable) device memory only. Return a
> > +	 * corresponding PFN for a page that can be mapped to device
> > +	 * (e.g using dma_map_page)
> > +	 */
> > +	int (*get_dma_pfn_for_device)(struct page *private_page,
> > +				      unsigned long *dma_pfn);
>
> This makes no sense. If a page is addressable then it has a PFN.
> If a page is not addressable then it doesn't have a PFN.

The DEVICE_PRIVATE pages have a PFN, but it is not usable for
anything.

This is effectively converting from a DEVICE_PRIVATE page to an actual
DMA'able address of some kind. The DEVICE_PRIVATE is just a non-usable
proxy, like a swap entry, for where the real data is sitting.

Jason
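
The "non-usable proxy, like a swap entry" relationship can be made
concrete with helpers that already exist in <linux/swapops.h>; the
sketch below (peek_device_private() is an invented name) shows the only
legitimate thing the fake pfn is used for.

#include <linux/swapops.h>

/*
 * Sketch: a device private PTE is a swap-style entry. Its pfn is never
 * dereferenced as memory; it exists solely to get back to the
 * ZONE_DEVICE struct page that proxies the on-device data.
 */
static struct page *peek_device_private(pte_t pte)
{
	swp_entry_t entry = pte_to_swp_entry(pte);

	if (!is_device_private_entry(entry))
		return NULL;

	/* pfn -> struct page is the only valid use of this fake pfn. */
	return pfn_swap_entry_to_page(entry);
}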
On Fri, Jul 18, 2025 at 11:44:42AM -0300, Jason Gunthorpe wrote:
> On Fri, Jul 18, 2025 at 03:17:00PM +0100, Matthew Wilcox wrote:
> > On Fri, Jul 18, 2025 at 02:51:08PM +0300, Yonatan Maman wrote:
> > > +++ b/include/linux/memremap.h
> > > @@ -89,6 +89,14 @@ struct dev_pagemap_ops {
> > >  	 */
> > >  	vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
> > >
> > > +	/*
> > > +	 * Used for private (un-addressable) device memory only. Return a
> > > +	 * corresponding PFN for a page that can be mapped to device
> > > +	 * (e.g using dma_map_page)
> > > +	 */
> > > +	int (*get_dma_pfn_for_device)(struct page *private_page,
> > > +				      unsigned long *dma_pfn);
> >
> > This makes no sense. If a page is addressable then it has a PFN.
> > If a page is not addressable then it doesn't have a PFN.
>
> The DEVICE_PRIVATE pages have a PFN, but it is not usable for
> anything.

OK, then I don't understand what DEVICE PRIVATE means.

I thought it was for memory on a PCIe device that isn't even visible
through a BAR and so the CPU has no way of addressing it directly.

But now you say that it has a PFN, which means it has a physical
address, which means it's accessible to the CPU.

So what is it?

> This is effectively converting from a DEVICE_PRIVATE page to an actual
> DMA'able address of some kind. The DEVICE_PRIVATE is just a non-usable
> proxy, like a swap entry, for where the real data is sitting.
>
> Jason
On Mon, Jul 21, 2025 at 02:23:13PM +0100, Matthew Wilcox wrote:
> On Fri, Jul 18, 2025 at 11:44:42AM -0300, Jason Gunthorpe wrote:
> > On Fri, Jul 18, 2025 at 03:17:00PM +0100, Matthew Wilcox wrote:
> > > On Fri, Jul 18, 2025 at 02:51:08PM +0300, Yonatan Maman wrote:
> > > > +++ b/include/linux/memremap.h
> > > > @@ -89,6 +89,14 @@ struct dev_pagemap_ops {
> > > >  	 */
> > > >  	vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
> > > >
> > > > +	/*
> > > > +	 * Used for private (un-addressable) device memory only. Return a
> > > > +	 * corresponding PFN for a page that can be mapped to device
> > > > +	 * (e.g using dma_map_page)
> > > > +	 */
> > > > +	int (*get_dma_pfn_for_device)(struct page *private_page,
> > > > +				      unsigned long *dma_pfn);
> > >
> > > This makes no sense. If a page is addressable then it has a PFN.
> > > If a page is not addressable then it doesn't have a PFN.
> >
> > The DEVICE_PRIVATE pages have a PFN, but it is not usable for
> > anything.
>
> OK, then I don't understand what DEVICE PRIVATE means.
>
> I thought it was for memory on a PCIe device that isn't even visible
> through a BAR and so the CPU has no way of addressing it directly.

Correct.

> But now you say that it has a PFN, which means it has a physical
> address, which means it's accessible to the CPU.

Having a PFN doesn't mean it's actually accessible to the CPU. It is a
real physical address in the CPU address space, but it is a completely
bogus/invalid address - if the CPU actually tries to access it will
cause a machine check or whatever other exception gets generated when
accessing an invalid physical address. Obviously we're careful to avoid
that. The PFN is used solely to get to/from a struct page (via
pfn_to_page() or page_to_pfn()).

> So what is it?

IMHO a hack, because obviously we shouldn't require real physical
addresses for something the CPU can't actually address anyway and this
causes real problems (eg. it doesn't actually work on anything other
than x86_64). There's no reason the "PFN" we store in device-private
entries couldn't instead just be an index into some data structure
holding pointers to the struct pages. So instead of using
pfn_to_page()/page_to_pfn() we would use device_private_index_to_page()
and page_to_device_private_index().

We discussed this briefly at LSFMM, I think your suggestion for a data
structure was to use a maple tree. I'm yet to look at this more deeply
but I'd like to figure out where memdescs fit in this picture too.

 - Alistair

> > This is effectively converting from a DEVICE_PRIVATE page to an actual
> > DMA'able address of some kind. The DEVICE_PRIVATE is just a non-usable
> > proxy, like a swap entry, for where the real data is sitting.
> >
> > Jason
On Tue, Jul 22, 2025 at 10:49:10AM +1000, Alistair Popple wrote:
> > So what is it?
>
> IMHO a hack, because obviously we shouldn't require real physical addresses for
> something the CPU can't actually address anyway and this causes real
> problems

IMHO what DEVICE PRIVATE really boils down to is a way to have swap
entries that point to some kind of opaque driver managed memory.

We have alot of assumptions all over about pfn/phys to page
relationships so anything that has a struct page also has to come with
a fake PFN today..

> (eg. it doesn't actually work on anything other than x86_64). There's no reason
> the "PFN" we store in device-private entries couldn't instead just be an index
> into some data structure holding pointers to the struct pages. So instead of
> using pfn_to_page()/page_to_pfn() we would use device_private_index_to_page()
> and page_to_device_private_index().

It could work, but any of the pfn conversions would have to be tracked
down.. Could be troublesome.

Jason
On Wed, Jul 23, 2025 at 12:51:42AM -0300, Jason Gunthorpe wrote:
> On Tue, Jul 22, 2025 at 10:49:10AM +1000, Alistair Popple wrote:
> > > So what is it?
> >
> > IMHO a hack, because obviously we shouldn't require real physical addresses for
> > something the CPU can't actually address anyway and this causes real
> > problems
>
> IMHO what DEVICE PRIVATE really boils down to is a way to have swap
> entries that point to some kind of opaque driver managed memory.
>
> We have alot of assumptions all over about pfn/phys to page
> relationships so anything that has a struct page also has to come with
> a fake PFN today..

Hmm ... maybe. To get that PFN though we have to come from either a special
swap entry which we already have special cases for, or a struct page (which is
a device private page) which we mostly have to handle specially anyway. I'm not
sure there's too many places that can sensibly handle a fake PFN without somehow
already knowing it is device-private PFN.

> > (eg. it doesn't actually work on anything other than x86_64). There's no reason
> > the "PFN" we store in device-private entries couldn't instead just be an index
> > into some data structure holding pointers to the struct pages. So instead of
> > using pfn_to_page()/page_to_pfn() we would use device_private_index_to_page()
> > and page_to_device_private_index().
>
> It could work, but any of the pfn conversions would have to be tracked
> down.. Could be troublesome.

I looked at this a while back and I'm reasonably optimistic that this is doable
because we already have to treat these specially everywhere anyway. The proof
will be writing the patches of course.

 - Alistair

> Jason
On 23.07.25 06:10, Alistair Popple wrote:
> On Wed, Jul 23, 2025 at 12:51:42AM -0300, Jason Gunthorpe wrote:
>> On Tue, Jul 22, 2025 at 10:49:10AM +1000, Alistair Popple wrote:
>>>> So what is it?
>>>
>>> IMHO a hack, because obviously we shouldn't require real physical addresses for
>>> something the CPU can't actually address anyway and this causes real
>>> problems
>>
>> IMHO what DEVICE PRIVATE really boils down to is a way to have swap
>> entries that point to some kind of opaque driver managed memory.
>>
>> We have alot of assumptions all over about pfn/phys to page
>> relationships so anything that has a struct page also has to come with
>> a fake PFN today..
>
> Hmm ... maybe. To get that PFN though we have to come from either a special
> swap entry which we already have special cases for, or a struct page (which is
> a device private page) which we mostly have to handle specially anyway. I'm not
> sure there's too many places that can sensibly handle a fake PFN without somehow
> already knowing it is device-private PFN.
>
>>> (eg. it doesn't actually work on anything other than x86_64). There's no reason
>>> the "PFN" we store in device-private entries couldn't instead just be an index
>>> into some data structure holding pointers to the struct pages. So instead of
>>> using pfn_to_page()/page_to_pfn() we would use device_private_index_to_page()
>>> and page_to_device_private_index().
>>
>> It could work, but any of the pfn conversions would have to be tracked
>> down.. Could be troublesome.
>
> I looked at this a while back and I'm reasonably optimistic that this is doable
> because we already have to treat these specially everywhere anyway.

How would that look like?

E.g., we have code like

	if (is_device_private_entry(entry)) {
		page = pfn_swap_entry_to_page(entry);
		folio = page_folio(page);

		...
		folio_get(folio);
		...
	}

We could easily stop allowing pfn_swap_entry_to_page(), turning these
into non-pfn swap entries.

Would it then be something like

	if (is_device_private_entry(entry)) {
		page = device_private_entry_to_page(entry);

		...
	}

Whereby device_private_entry_to_page() obtains the "struct page" not
via the PFN but some other magical (index) value?

-- 
Cheers,

David / dhildenb
On Thu, Jul 24, 2025 at 10:52:54AM +0200, David Hildenbrand wrote:
> On 23.07.25 06:10, Alistair Popple wrote:
> > On Wed, Jul 23, 2025 at 12:51:42AM -0300, Jason Gunthorpe wrote:
> > > On Tue, Jul 22, 2025 at 10:49:10AM +1000, Alistair Popple wrote:
> > > > > So what is it?
> > > >
> > > > IMHO a hack, because obviously we shouldn't require real physical addresses for
> > > > something the CPU can't actually address anyway and this causes real
> > > > problems
> > >
> > > IMHO what DEVICE PRIVATE really boils down to is a way to have swap
> > > entries that point to some kind of opaque driver managed memory.
> > >
> > > We have alot of assumptions all over about pfn/phys to page
> > > relationships so anything that has a struct page also has to come with
> > > a fake PFN today..
> >
> > Hmm ... maybe. To get that PFN though we have to come from either a special
> > swap entry which we already have special cases for, or a struct page (which is
> > a device private page) which we mostly have to handle specially anyway. I'm not
> > sure there's too many places that can sensibly handle a fake PFN without somehow
> > already knowing it is device-private PFN.
> >
> > > > (eg. it doesn't actually work on anything other than x86_64). There's no reason
> > > > the "PFN" we store in device-private entries couldn't instead just be an index
> > > > into some data structure holding pointers to the struct pages. So instead of
> > > > using pfn_to_page()/page_to_pfn() we would use device_private_index_to_page()
> > > > and page_to_device_private_index().
> > >
> > > It could work, but any of the pfn conversions would have to be tracked
> > > down.. Could be troublesome.
> >
> > I looked at this a while back and I'm reasonably optimistic that this is doable
> > because we already have to treat these specially everywhere anyway.
>
> How would that look like?
>
> E.g., we have code like
>
> 	if (is_device_private_entry(entry)) {
> 		page = pfn_swap_entry_to_page(entry);
> 		folio = page_folio(page);
>
> 		...
> 		folio_get(folio);
> 		...
> 	}
>
> We could easily stop allowing pfn_swap_entry_to_page(), turning these into
> non-pfn swap entries.
>
> Would it then be something like
>
> 	if (is_device_private_entry(entry)) {
> 		page = device_private_entry_to_page(entry);
>
> 		...
> 	}
>
> Whereby device_private_entry_to_page() obtains the "struct page" not via the
> PFN but some other magical (index) value?

Exactly. The observation being that when you convert a PTE from a swap entry
to a page we already know it is a device private entry, so can go look up the
struct page with special magic (eg. an index into some other array or data
structure).

And if you have a struct page you already know it's a device private page so if
you need to create the swap entry you can look up the magic index using some
alternate function.

The only issue would be if there were generic code paths that somehow have a
raw pfn obtained from neither a page-table walk or struct page. My assumption
(yet to be proven/tested) is that these paths don't exist.

 - Alistair

> 
> -- 
> Cheers,
> 
> David / dhildenb
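
One possible shape of the index scheme being discussed, sketched with
an xarray rather than the maple tree mentioned earlier; every name
below is invented for illustration and nothing here exists in the
kernel today.

#include <linux/xarray.h>

/* Hypothetical global table mapping indices to device private pages. */
static DEFINE_XARRAY_ALLOC(device_private_pages);

/* Called when a driver registers a device private page. */
static int device_private_page_register(struct page *page,
					unsigned long *index)
{
	u32 id;
	int ret;

	ret = xa_alloc(&device_private_pages, &id, page, xa_limit_31b,
		       GFP_KERNEL);
	if (ret)
		return ret;
	/*
	 * The driver would also stash 'id' somewhere reachable from the
	 * page (e.g. zone_device_data or a pgmap-relative offset) so
	 * that page_to_device_private_index() can find it later.
	 */
	*index = id;
	return 0;
}

/* Replacement for pfn_swap_entry_to_page() on device private entries. */
static struct page *device_private_index_to_page(unsigned long index)
{
	return xa_load(&device_private_pages, index);
}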
On Fri, Jul 25, 2025 at 10:31:25AM +1000, Alistair Popple wrote:
> The only issue would be if there were generic code paths that somehow have a
> raw pfn obtained from neither a page-table walk or struct page. My assumption
> (yet to be proven/tested) is that these paths don't exist.

hmm does it, it encodes the device private into a pfn and expects the
caller to do pfn to page.

This isn't set in stone and could be changed..

But broadly, you'd want to entirely eliminate the ability to go from
pfn to device private or from device private to pfn.

Instead you'd want to work on some (space #, space index) tuple, maybe
encoded in a pfn_t, but absolutely and typesafely distinct. Each
driver gets its own 0 based space for device private information, the
space is effectively the pgmap.

And if you do this, maybe we don't need struct page (I mean the type!)
backing device memory at all.... Which would be a very worthwhile
project.

Do we ever even use anything in the device private struct page? Do we
refcount it?

Jason
On 01.08.25 18:40, Jason Gunthorpe wrote:
> On Fri, Jul 25, 2025 at 10:31:25AM +1000, Alistair Popple wrote:
>
>> The only issue would be if there were generic code paths that somehow have a
>> raw pfn obtained from neither a page-table walk or struct page. My assumption
>> (yet to be proven/tested) is that these paths don't exist.
>
> hmm does it, it encodes the device private into a pfn and expects the
> caller to do pfn to page.
>
> This isn't set in stone and could be changed..
>
> But broadly, you'd want to entirely eliminate the ability to go from
> pfn to device private or from device private to pfn.
>
> Instead you'd want to work on some (space #, space index) tuple, maybe
> encoded in a pfn_t, but absolutely and typesafely distinct. Each
> driver gets its own 0 based space for device private information, the
> space is effectively the pgmap.
>
> And if you do this, maybe we don't need struct page (I mean the type!)
> backing device memory at all.... Which would be a very worthwhile
> project.
>
> Do we ever even use anything in the device private struct page? Do we
> refcount it?

ref-counted and map-counted ...

-- 
Cheers,

David / dhildenb
On Fri, Aug 01, 2025 at 06:50:18PM +0200, David Hildenbrand wrote:
> On 01.08.25 18:40, Jason Gunthorpe wrote:
> > On Fri, Jul 25, 2025 at 10:31:25AM +1000, Alistair Popple wrote:
> >
> > > The only issue would be if there were generic code paths that somehow have a
> > > raw pfn obtained from neither a page-table walk or struct page. My assumption
> > > (yet to be proven/tested) is that these paths don't exist.
> >
> > hmm does it, it encodes the device private into a pfn and expects the
> > caller to do pfn to page.
> >
> > This isn't set in stone and could be changed..
> >
> > But broadly, you'd want to entirely eliminate the ability to go from
> > pfn to device private or from device private to pfn.
> >
> > Instead you'd want to work on some (space #, space index) tuple, maybe
> > encoded in a pfn_t, but absolutely and typesafely distinct. Each
> > driver gets its own 0 based space for device private information, the
> > space is effectively the pgmap.
> >
> > And if you do this, maybe we don't need struct page (I mean the type!)
> > backing device memory at all.... Which would be a very worthwhile
> > project.
> >
> > Do we ever even use anything in the device private struct page? Do we
> > refcount it?
>
> ref-counted and map-counted ...

Hm, so it would turn into another struct page split up where we get
ourselves a struct device_private and change all the places touching
its refcount and mapcount to use the new type.

If we could use some index scheme we could then divorce from struct
page and shrink the struct size sooner.

Jason
On Fri, Aug 01, 2025 at 01:57:49PM -0300, Jason Gunthorpe wrote:
> On Fri, Aug 01, 2025 at 06:50:18PM +0200, David Hildenbrand wrote:
> > On 01.08.25 18:40, Jason Gunthorpe wrote:
> > > On Fri, Jul 25, 2025 at 10:31:25AM +1000, Alistair Popple wrote:
> > >
> > > > The only issue would be if there were generic code paths that somehow have a
> > > > raw pfn obtained from neither a page-table walk or struct page. My assumption
> > > > (yet to be proven/tested) is that these paths don't exist.
> > >
> > > hmm does it, it encodes the device private into a pfn and expects the
> > > caller to do pfn to page.

What callers need to do pfn to page when finding a device private pfn via
hmm_range_fault()? GPU drivers don't, they tend just to use the pfn as an offset
from the start of the pgmap to find whatever data structure they are using to
track device memory allocations.

The migrate_vma_*() calls do, but they could easily be changed to whatever
index scheme we use so long as we can encode that this is a device entry in
the MIGRATE_PFN flags.

So other than adding a HMM_PFN flag to say this is really a device index I don't
see too many issues here.

> > > This isn't set in stone and could be changed..
> > >
> > > But broadly, you'd want to entirely eliminate the ability to go from
> > > pfn to device private or from device private to pfn.
> > >
> > > Instead you'd want to work on some (space #, space index) tuple, maybe
> > > encoded in a pfn_t, but absolutely and typesafely distinct. Each
> > > driver gets its own 0 based space for device private information, the
> > > space is effectively the pgmap.
> > >
> > > And if you do this, maybe we don't need struct page (I mean the type!)
> > > backing device memory at all.... Which would be a very worthwhile
> > > project.

Exactly! Although we still need enough of a struct page or something else to
still be able to map them in normal anonymous VMAs. Short term the motivation
for this project is that the current scheme of "stealing" pfns for the device
doesn't actually work in a lot of cases.

> > > Do we ever even use anything in the device private struct page? Do we
> > > refcount it?
> >
> > ref-counted and map-counted ...
>
> Hm, so it would turn into another struct page split up where we get
> ourselves a struct device_private and change all the places touching
> its refcount and mapcount to use the new type.
>
> If we could use some index scheme we could then divorce from struct
> page and shrink the struct size sooner.

Right, that is roughly along the lines of what I was thinking.

> Jason
On Mon, Aug 04, 2025 at 11:51:38AM +1000, Alistair Popple wrote:
> On Fri, Aug 01, 2025 at 01:57:49PM -0300, Jason Gunthorpe wrote:
> > On Fri, Aug 01, 2025 at 06:50:18PM +0200, David Hildenbrand wrote:
> > > On 01.08.25 18:40, Jason Gunthorpe wrote:
> > > > On Fri, Jul 25, 2025 at 10:31:25AM +1000, Alistair Popple wrote:
> > > >
> > > > > The only issue would be if there were generic code paths that somehow have a
> > > > > raw pfn obtained from neither a page-table walk or struct page. My assumption
> > > > > (yet to be proven/tested) is that these paths don't exist.
> > > >
> > > > hmm does it, it encodes the device private into a pfn and expects the
> > > > caller to do pfn to page.
>
> What callers need to do pfn to page when finding a device private pfn via
> hmm_range_fault()? GPU drivers don't, they tend just to use the pfn as an offset
> from the start of the pgmap to find whatever data structure they are using to
> track device memory allocations.

All drivers today must. You have no idea if the PFN returned is a
private or CPU page. The only way to know is to check the struct page
type, by looking inside the struct page.

> So other than adding a HMM_PFN flag to say this is really a device index I don't
> see too many issues here.

Christoph suggested exactly this, and it would solve the issue. Seems
quite easy too. Let's do it.

Jason
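
For reference, the flag being converged on might look something like
the sketch below; the name HMM_PFN_DEVICE_PRIVATE and its bit position
are invented here and are not part of any posted patch.

/* Hypothetical new output flag; exact bit position TBD. */
#define HMM_PFN_DEVICE_PRIVATE	(1UL << (BITS_PER_LONG - 12))

/*
 * With such a flag set by hmm_range_fault(), callers could classify a
 * returned value without dereferencing the struct page at all.
 */
static inline bool hmm_pfn_is_device_private(unsigned long hmm_pfn)
{
	return hmm_pfn & HMM_PFN_DEVICE_PRIVATE;
}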
On 01.08.25 18:57, Jason Gunthorpe wrote:
> On Fri, Aug 01, 2025 at 06:50:18PM +0200, David Hildenbrand wrote:
>> On 01.08.25 18:40, Jason Gunthorpe wrote:
>>> On Fri, Jul 25, 2025 at 10:31:25AM +1000, Alistair Popple wrote:
>>>
>>>> The only issue would be if there were generic code paths that somehow have a
>>>> raw pfn obtained from neither a page-table walk or struct page. My assumption
>>>> (yet to be proven/tested) is that these paths don't exist.
>>>
>>> hmm does it, it encodes the device private into a pfn and expects the
>>> caller to do pfn to page.
>>>
>>> This isn't set in stone and could be changed..
>>>
>>> But broadly, you'd want to entirely eliminate the ability to go from
>>> pfn to device private or from device private to pfn.
>>>
>>> Instead you'd want to work on some (space #, space index) tuple, maybe
>>> encoded in a pfn_t, but absolutely and typesafely distinct. Each
>>> driver gets its own 0 based space for device private information, the
>>> space is effectively the pgmap.
>>>
>>> And if you do this, maybe we don't need struct page (I mean the type!)
>>> backing device memory at all.... Which would be a very worthwhile
>>> project.
>>>
>>> Do we ever even use anything in the device private struct page? Do we
>>> refcount it?
>>
>> ref-counted and map-counted ...
>
> Hm, so it would turn into another struct page split up where we get
> ourselves a struct device_private and change all the places touching
> its refcount and mapcount to use the new type.

We're already working with folios in all cases where we modify either
refcount or mapcount IIUC.

The rmap handling (try to migrate, soon folio splitting) currently
depends on the mapcount.

Not sure how that will all look like without a ... struct folio /
struct page.

-- 
Cheers,

David / dhildenb
On 25.07.25 02:31, Alistair Popple wrote:
> On Thu, Jul 24, 2025 at 10:52:54AM +0200, David Hildenbrand wrote:
>> On 23.07.25 06:10, Alistair Popple wrote:
>>> On Wed, Jul 23, 2025 at 12:51:42AM -0300, Jason Gunthorpe wrote:
>>>> On Tue, Jul 22, 2025 at 10:49:10AM +1000, Alistair Popple wrote:
>>>>>> So what is it?
>>>>>
>>>>> IMHO a hack, because obviously we shouldn't require real physical addresses for
>>>>> something the CPU can't actually address anyway and this causes real
>>>>> problems
>>>>
>>>> IMHO what DEVICE PRIVATE really boils down to is a way to have swap
>>>> entries that point to some kind of opaque driver managed memory.
>>>>
>>>> We have alot of assumptions all over about pfn/phys to page
>>>> relationships so anything that has a struct page also has to come with
>>>> a fake PFN today..
>>>
>>> Hmm ... maybe. To get that PFN though we have to come from either a special
>>> swap entry which we already have special cases for, or a struct page (which is
>>> a device private page) which we mostly have to handle specially anyway. I'm not
>>> sure there's too many places that can sensibly handle a fake PFN without somehow
>>> already knowing it is device-private PFN.
>>>
>>>>> (eg. it doesn't actually work on anything other than x86_64). There's no reason
>>>>> the "PFN" we store in device-private entries couldn't instead just be an index
>>>>> into some data structure holding pointers to the struct pages. So instead of
>>>>> using pfn_to_page()/page_to_pfn() we would use device_private_index_to_page()
>>>>> and page_to_device_private_index().
>>>>
>>>> It could work, but any of the pfn conversions would have to be tracked
>>>> down.. Could be troublesome.
>>>
>>> I looked at this a while back and I'm reasonably optimistic that this is doable
>>> because we already have to treat these specially everywhere anyway.
>> How would that look like?
>>
>> E.g., we have code like
>>
>> 	if (is_device_private_entry(entry)) {
>> 		page = pfn_swap_entry_to_page(entry);
>> 		folio = page_folio(page);
>>
>> 		...
>> 		folio_get(folio);
>> 		...
>> 	}
>>
>> We could easily stop allowing pfn_swap_entry_to_page(), turning these into
>> non-pfn swap entries.
>>
>> Would it then be something like
>>
>> 	if (is_device_private_entry(entry)) {
>> 		page = device_private_entry_to_page(entry);
>>
>> 		...
>> 	}
>>
>> Whereby device_private_entry_to_page() obtains the "struct page" not via the
>> PFN but some other magical (index) value?
>
> Exactly. The observation being that when you convert a PTE from a swap entry
> to a page we already know it is a device private entry, so can go look up the
> struct page with special magic (eg. an index into some other array or data
> structure).
>
> And if you have a struct page you already know it's a device private page so if
> you need to create the swap entry you can look up the magic index using some
> alternate function.
>
> The only issue would be if there were generic code paths that somehow have a
> raw pfn obtained from neither a page-table walk or struct page. My assumption
> (yet to be proven/tested) is that these paths don't exist.

I guess memory compaction and friends don't apply to ZONE_DEVICE, and
even memory_failure() handling goes a separate path.

-- 
Cheers,

David / dhildenb
On Fri, Jul 18, 2025 at 11:44:42AM -0300, Jason Gunthorpe wrote:
> On Fri, Jul 18, 2025 at 03:17:00PM +0100, Matthew Wilcox wrote:
> > On Fri, Jul 18, 2025 at 02:51:08PM +0300, Yonatan Maman wrote:
> > > +++ b/include/linux/memremap.h
> > > @@ -89,6 +89,14 @@ struct dev_pagemap_ops {
> > >  	 */
> > >  	vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
> > >
> > > +	/*
> > > +	 * Used for private (un-addressable) device memory only. Return a
> > > +	 * corresponding PFN for a page that can be mapped to device
> > > +	 * (e.g using dma_map_page)
> > > +	 */
> > > +	int (*get_dma_pfn_for_device)(struct page *private_page,
> > > +				      unsigned long *dma_pfn);
> >
> > This makes no sense. If a page is addressable then it has a PFN.
> > If a page is not addressable then it doesn't have a PFN.
>
> The DEVICE_PRIVATE pages have a PFN, but it is not usable for
> anything.
>
> This is effectively converting from a DEVICE_PRIVATE page to an actual
> DMA'able address of some kind. The DEVICE_PRIVATE is just a non-usable
> proxy, like a swap entry, for where the real data is sitting.

Yes, it's on my backlog to start looking at using something other than a real
PFN for this proxy. Because having it as an actual PFN has caused us all sorts
of random issues as it still needs to reserve a real physical address range
which may or may not be available on a given machine.

> 
> Jason