From: Yonatan Maman <Ymaman@Nvidia.com>
hmm_range_fault() by default triggers a page fault on device private
pages when the HMM_PFN_REQ_FAULT flag is set, migrating them to RAM. In
some cases, such as with RDMA devices, the migration overhead between
the device (e.g., GPU) and the CPU, and vice versa, significantly
degrades performance. Thus, enabling Peer-to-Peer (P2P) DMA access for
device private pages might be crucial for minimizing data transfer
overhead.
This patch introduces an API to support P2P DMA for device private
pages, which includes:
 - Leveraging struct dev_pagemap_ops for a P2P page callback. This
   callback maps the page for P2P DMA and returns the corresponding
   PCI P2P page.
 - Utilizing hmm_range_fault() for initializing P2P DMA. The API adds
   the HMM_PFN_ALLOW_P2P flag so that a hmm_range_fault() caller can
   opt in to P2P. If set, and if the owner device supports P2P,
   hmm_range_fault() first attempts to establish the P2P connection
   via the get_dma_pfn_for_device() callback. On failure or lack of
   support, hmm_range_fault() continues with the regular flow of
   migrating the page to RAM.
This change does not affect existing users of hmm_range_fault(): both
the caller and the page owner must explicitly request and support P2P
for the connection to be initialized.
Signed-off-by: Yonatan Maman <Ymaman@Nvidia.com>
Signed-off-by: Gal Shalom <GalShalom@Nvidia.com>
---
include/linux/hmm.h | 2 ++
include/linux/memremap.h | 8 ++++++
mm/hmm.c | 57 +++++++++++++++++++++++++++++++---------
3 files changed, 55 insertions(+), 12 deletions(-)
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index db75ffc949a7..988c98c0edcc 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -27,6 +27,7 @@ struct mmu_interval_notifier;
* HMM_PFN_P2PDMA_BUS - Bus mapped P2P transfer
* HMM_PFN_DMA_MAPPED - Flag preserved on input-to-output transformation
* to mark that page is already DMA mapped
+ * HMM_PFN_ALLOW_P2P - Allow returning PCI P2PDMA page
*
* On input:
* 0 - Return the current state of the page, do not fault it.
@@ -47,6 +48,7 @@ enum hmm_pfn_flags {
HMM_PFN_DMA_MAPPED = 1UL << (BITS_PER_LONG - 4),
HMM_PFN_P2PDMA = 1UL << (BITS_PER_LONG - 5),
HMM_PFN_P2PDMA_BUS = 1UL << (BITS_PER_LONG - 6),
+ HMM_PFN_ALLOW_P2P = 1UL << (BITS_PER_LONG - 7),
HMM_PFN_ORDER_SHIFT = (BITS_PER_LONG - 11),
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 4aa151914eab..79becc37df00 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -89,6 +89,14 @@ struct dev_pagemap_ops {
*/
vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
+ /*
+ * Used for private (un-addressable) device memory only. Return a
+ * corresponding PFN for a page that can be mapped to device
+ * (e.g using dma_map_page)
+ */
+ int (*get_dma_pfn_for_device)(struct page *private_page,
+ unsigned long *dma_pfn);
+
/*
* Handle the memory failure happens on a range of pfns. Notify the
* processes who are using these pfns, and try to recover the data on
diff --git a/mm/hmm.c b/mm/hmm.c
index feac86196a65..089e522b346b 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -232,6 +232,49 @@ static inline unsigned long pte_to_hmm_pfn_flags(struct hmm_range *range,
return pte_write(pte) ? (HMM_PFN_VALID | HMM_PFN_WRITE) : HMM_PFN_VALID;
}
+static bool hmm_handle_device_private(struct hmm_range *range,
+ unsigned long pfn_req_flags,
+ swp_entry_t entry,
+ unsigned long *hmm_pfn)
+{
+ struct page *page = pfn_swap_entry_to_page(entry);
+ struct dev_pagemap *pgmap = page_pgmap(page);
+ int ret;
+
+ pfn_req_flags &= range->pfn_flags_mask;
+ pfn_req_flags |= range->default_flags;
+
+ /*
+ * Don't fault in device private pages owned by the caller,
+ * just report the PFN.
+ */
+ if (pgmap->owner == range->dev_private_owner) {
+ *hmm_pfn = swp_offset_pfn(entry);
+ goto found;
+ }
+
+	/*
+	 * For supported pages, and if the caller requested P2P,
+	 * translate the private page to the matching P2P page; if that
+	 * fails, continue with the regular flow.
+	 */
+ if (pfn_req_flags & HMM_PFN_ALLOW_P2P &&
+ pgmap->ops->get_dma_pfn_for_device) {
+ ret = pgmap->ops->get_dma_pfn_for_device(page, hmm_pfn);
+ if (!ret)
+ goto found;
+
+ }
+
+ return false;
+
+found:
+ *hmm_pfn |= HMM_PFN_VALID;
+ if (is_writable_device_private_entry(entry))
+ *hmm_pfn |= HMM_PFN_WRITE;
+ return true;
+}
+
static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
unsigned long end, pmd_t *pmdp, pte_t *ptep,
unsigned long *hmm_pfn)
@@ -255,19 +298,9 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
if (!pte_present(pte)) {
swp_entry_t entry = pte_to_swp_entry(pte);
- /*
- * Don't fault in device private pages owned by the caller,
- * just report the PFN.
- */
if (is_device_private_entry(entry) &&
- page_pgmap(pfn_swap_entry_to_page(entry))->owner ==
- range->dev_private_owner) {
- cpu_flags = HMM_PFN_VALID;
- if (is_writable_device_private_entry(entry))
- cpu_flags |= HMM_PFN_WRITE;
- new_pfn_flags = swp_offset_pfn(entry) | cpu_flags;
- goto out;
- }
+ hmm_handle_device_private(range, pfn_req_flags, entry, hmm_pfn))
+ return 0;
required_fault =
hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0);
--
2.34.1
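
To make the proposed hook concrete, here is a minimal driver-side
sketch of how a driver might implement the new callback. Everything
prefixed my_* is hypothetical (the struct layout, the offset
translation, and the assumption that the driver keeps its state in
pgmap->owner are all invented for illustration); only dev_pagemap_ops
and get_dma_pfn_for_device come from the patch itself.

#include <linux/memremap.h>
#include <linux/errno.h>

/* Hypothetical driver state; layout invented for this sketch. */
struct my_devmem_priv {
	bool p2p_enabled;		/* device exposes a P2PDMA BAR alias */
	unsigned long bar_base_pfn;	/* first PFN of the P2PDMA pages */
	unsigned long priv_base_pfn;	/* first fake device-private PFN */
};

/* Existing fault handler, defined elsewhere in the driver. */
static vm_fault_t my_devmem_migrate_to_ram(struct vm_fault *vmf);

static int my_get_dma_pfn_for_device(struct page *private_page,
				     unsigned long *dma_pfn)
{
	/* Assumes the driver registered its state as pgmap->owner. */
	struct my_devmem_priv *priv = page_pgmap(private_page)->owner;

	/* No P2P-capable alias: hmm_range_fault() falls back to migration. */
	if (!priv->p2p_enabled)
		return -EOPNOTSUPP;

	/*
	 * The fake device-private PFN and the PCI P2PDMA page alias the
	 * same device memory, so translate by offset within the pgmap.
	 */
	*dma_pfn = priv->bar_base_pfn +
		   (page_to_pfn(private_page) - priv->priv_base_pfn);
	return 0;
}

static const struct dev_pagemap_ops my_devmem_pagemap_ops = {
	.migrate_to_ram		= my_devmem_migrate_to_ram,
	.get_dma_pfn_for_device	= my_get_dma_pfn_for_device,
};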
On Fri, Jul 18, 2025 at 02:51:08PM +0300, Yonatan Maman wrote:
> From: Yonatan Maman <Ymaman@Nvidia.com>
>
> hmm_range_fault() by default triggers a page fault on device private
> pages when the HMM_PFN_REQ_FAULT flag is set, migrating them to RAM. In
> some cases, such as with RDMA devices, the migration overhead between
> the device (e.g., GPU) and the CPU, and vice versa, significantly
> degrades performance. Thus, enabling Peer-to-Peer (P2P) DMA access for
> device private pages might be crucial for minimizing data transfer
> overhead.

You don't enable DMA for device private pages. You allow discovering
a DMAable alias for device private pages.

Also absolutely nothing GPU specific here.

> +	/*
> +	 * Don't fault in device private pages owned by the caller,
> +	 * just report the PFN.
> +	 */
> +	if (pgmap->owner == range->dev_private_owner) {
> +		*hmm_pfn = swp_offset_pfn(entry);
> +		goto found;

This is dangerous because it mixes actual DMAable alias PFNs with the
device private fake PFNs. Maybe your hardware / driver can handle
it, but just leaking this out is not a good idea.

> +		    hmm_handle_device_private(range, pfn_req_flags, entry, hmm_pfn))

Overly long line here.
On Sun, Jul 20, 2025 at 11:59:10PM -0700, Christoph Hellwig wrote:
> > +	/*
> > +	 * Don't fault in device private pages owned by the caller,
> > +	 * just report the PFN.
> > +	 */
> > +	if (pgmap->owner == range->dev_private_owner) {
> > +		*hmm_pfn = swp_offset_pfn(entry);
> > +		goto found;
>
> This is dangerous because it mixes actual DMAable alias PFNs with the
> device private fake PFNs. Maybe your hardware / driver can handle
> it, but just leaking this out is not a good idea.

For better or worse that is how the hmm API works today. Recall the
result is an array of unsigned long with a pfn and flags:

enum hmm_pfn_flags {
	/* Output fields and flags */
	HMM_PFN_VALID = 1UL << (BITS_PER_LONG - 1),
	HMM_PFN_WRITE = 1UL << (BITS_PER_LONG - 2),
	HMM_PFN_ERROR = 1UL << (BITS_PER_LONG - 3),

The only promise is that every pfn has a struct page behind it. If the
caller specifies dev_private_owner then it must also look into the
struct page of every returned pfn to see if it is device private or
not.

hmm_dma_map_pfn() already unconditionally calls pci_p2pdma_state()
which checks for P2P struct pages.

It does sound like a good improvement to return the type of the pfn
(normal, p2p, private) in the flags bits as well to optimize away these
extra struct page lookups. But this is a different project..

Jason
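
To illustrate the check Jason describes, here is a sketch of the
caller-side classification loop after a successful hmm_range_fault();
handle_private() and handle_ram() are invented placeholders for
driver-specific logic, everything else is the existing hmm API.

#include <linux/hmm.h>
#include <linux/memremap.h>

/* Placeholders for driver-specific handling; names invented. */
static void handle_private(struct page *page) { /* driver logic */ }
static void handle_ram(struct page *page) { /* driver logic */ }

/* Walk the output array of a successful hmm_range_fault() call. */
static void classify_hmm_pfns(struct hmm_range *range)
{
	unsigned long npages = (range->end - range->start) >> PAGE_SHIFT;
	unsigned long i;

	for (i = 0; i < npages; i++) {
		unsigned long hmm_pfn = range->hmm_pfns[i];
		struct page *page;

		if (!(hmm_pfn & HMM_PFN_VALID))
			continue;

		/* Every valid pfn has a struct page behind it ... */
		page = hmm_pfn_to_page(hmm_pfn);

		/* ... but only the struct page says whether it is private. */
		if (is_device_private_page(page))
			handle_private(page);	/* fake pfn, not DMA'able */
		else
			handle_ram(page);	/* normal, DMA-mappable */
	}
}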
On 21/07/2025 9:59, Christoph Hellwig wrote:
> On Fri, Jul 18, 2025 at 02:51:08PM +0300, Yonatan Maman wrote:
>> From: Yonatan Maman <Ymaman@Nvidia.com>
>>
>> hmm_range_fault() by default triggers a page fault on device private
>> pages when the HMM_PFN_REQ_FAULT flag is set, migrating them to RAM. In
>> some cases, such as with RDMA devices, the migration overhead between
>> the device (e.g., GPU) and the CPU, and vice versa, significantly
>> degrades performance. Thus, enabling Peer-to-Peer (P2P) DMA access for
>> device private pages might be crucial for minimizing data transfer
>> overhead.
>
> You don't enable DMA for device private pages. You allow discovering
> a DMAable alias for device private pages.
>
> Also absolutely nothing GPU specific here.
>

Ok, understood, I will change it (v3).

>> +	/*
>> +	 * Don't fault in device private pages owned by the caller,
>> +	 * just report the PFN.
>> +	 */
>> +	if (pgmap->owner == range->dev_private_owner) {
>> +		*hmm_pfn = swp_offset_pfn(entry);
>> +		goto found;
>
> This is dangerous because it mixes actual DMAable alias PFNs with the
> device private fake PFNs. Maybe your hardware / driver can handle
> it, but just leaking this out is not a good idea.
>

In the current implementation, regular pci_p2p pages are returned as-is
from hmm_range_fault() - for a virtual address backed by a pci_p2p
page, it will return the corresponding PFN. That said, we can mark
these via the hmm_pfn output flags so the caller can handle them
appropriately.

>> +		    hmm_handle_device_private(range, pfn_req_flags, entry, hmm_pfn))
>
> Overly long line here.
>

will be fixed (v3)
On Fri, Jul 18, 2025 at 02:51:08PM +0300, Yonatan Maman wrote:
> +++ b/include/linux/memremap.h
> @@ -89,6 +89,14 @@ struct dev_pagemap_ops {
>  	 */
>  	vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
>
> +	/*
> +	 * Used for private (un-addressable) device memory only. Return a
> +	 * corresponding PFN for a page that can be mapped to device
> +	 * (e.g using dma_map_page)
> +	 */
> +	int (*get_dma_pfn_for_device)(struct page *private_page,
> +				      unsigned long *dma_pfn);

This makes no sense. If a page is addressable then it has a PFN.
If a page is not addressable then it doesn't have a PFN.
On Fri, Jul 18, 2025 at 03:17:00PM +0100, Matthew Wilcox wrote:
> On Fri, Jul 18, 2025 at 02:51:08PM +0300, Yonatan Maman wrote:
> > +++ b/include/linux/memremap.h
> > @@ -89,6 +89,14 @@ struct dev_pagemap_ops {
> >  	 */
> >  	vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
> >
> > +	/*
> > +	 * Used for private (un-addressable) device memory only. Return a
> > +	 * corresponding PFN for a page that can be mapped to device
> > +	 * (e.g using dma_map_page)
> > +	 */
> > +	int (*get_dma_pfn_for_device)(struct page *private_page,
> > +				      unsigned long *dma_pfn);
>
> This makes no sense. If a page is addressable then it has a PFN.
> If a page is not addressable then it doesn't have a PFN.

The DEVICE_PRIVATE pages have a PFN, but it is not usable for
anything.

This is effectively converting from a DEVICE_PRIVATE page to an actual
DMA'able address of some kind. The DEVICE_PRIVATE is just a non-usable
proxy, like a swap entry, for where the real data is sitting.

Jason
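
The "non-usable proxy, like a swap entry" relationship can be made
concrete with helpers that already exist in <linux/swapops.h>; the
sketch below (peek_device_private() is an invented name) shows the only
legitimate thing the fake pfn is used for.

#include <linux/swapops.h>

/*
 * Sketch: a device private PTE is a swap-style entry. Its pfn is never
 * dereferenced as memory; it exists solely to get back to the
 * ZONE_DEVICE struct page that proxies the on-device data.
 */
static struct page *peek_device_private(pte_t pte)
{
	swp_entry_t entry = pte_to_swp_entry(pte);

	if (!is_device_private_entry(entry))
		return NULL;

	/* pfn -> struct page is the only valid use of this fake pfn. */
	return pfn_swap_entry_to_page(entry);
}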
On Fri, Jul 18, 2025 at 11:44:42AM -0300, Jason Gunthorpe wrote:
> On Fri, Jul 18, 2025 at 03:17:00PM +0100, Matthew Wilcox wrote:
> > On Fri, Jul 18, 2025 at 02:51:08PM +0300, Yonatan Maman wrote:
> > > +++ b/include/linux/memremap.h
> > > @@ -89,6 +89,14 @@ struct dev_pagemap_ops {
> > >  	 */
> > >  	vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
> > >
> > > +	/*
> > > +	 * Used for private (un-addressable) device memory only. Return a
> > > +	 * corresponding PFN for a page that can be mapped to device
> > > +	 * (e.g using dma_map_page)
> > > +	 */
> > > +	int (*get_dma_pfn_for_device)(struct page *private_page,
> > > +				      unsigned long *dma_pfn);
> >
> > This makes no sense. If a page is addressable then it has a PFN.
> > If a page is not addressable then it doesn't have a PFN.
>
> The DEVICE_PRIVATE pages have a PFN, but it is not usable for
> anything.

OK, then I don't understand what DEVICE PRIVATE means.

I thought it was for memory on a PCIe device that isn't even visible
through a BAR and so the CPU has no way of addressing it directly.

But now you say that it has a PFN, which means it has a physical
address, which means it's accessible to the CPU.

So what is it?

> This is effectively converting from a DEVICE_PRIVATE page to an actual
> DMA'able address of some kind. The DEVICE_PRIVATE is just a non-usable
> proxy, like a swap entry, for where the real data is sitting.
>
> Jason
On Mon, Jul 21, 2025 at 02:23:13PM +0100, Matthew Wilcox wrote:
> On Fri, Jul 18, 2025 at 11:44:42AM -0300, Jason Gunthorpe wrote:
> > On Fri, Jul 18, 2025 at 03:17:00PM +0100, Matthew Wilcox wrote:
> > > On Fri, Jul 18, 2025 at 02:51:08PM +0300, Yonatan Maman wrote:
> > > > +++ b/include/linux/memremap.h
> > > > @@ -89,6 +89,14 @@ struct dev_pagemap_ops {
> > > >  	 */
> > > >  	vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
> > > >
> > > > +	/*
> > > > +	 * Used for private (un-addressable) device memory only. Return a
> > > > +	 * corresponding PFN for a page that can be mapped to device
> > > > +	 * (e.g using dma_map_page)
> > > > +	 */
> > > > +	int (*get_dma_pfn_for_device)(struct page *private_page,
> > > > +				      unsigned long *dma_pfn);
> > >
> > > This makes no sense. If a page is addressable then it has a PFN.
> > > If a page is not addressable then it doesn't have a PFN.
> >
> > The DEVICE_PRIVATE pages have a PFN, but it is not usable for
> > anything.
>
> OK, then I don't understand what DEVICE PRIVATE means.
>
> I thought it was for memory on a PCIe device that isn't even visible
> through a BAR and so the CPU has no way of addressing it directly.

Correct.

> But now you say that it has a PFN, which means it has a physical
> address, which means it's accessible to the CPU.

Having a PFN doesn't mean it's actually accessible to the CPU. It is a
real physical address in the CPU address space, but it is a completely
bogus/invalid address - if the CPU actually tries to access it will
cause a machine check or whatever other exception gets generated when
accessing an invalid physical address. Obviously we're careful to avoid
that. The PFN is used solely to get to/from a struct page (via
pfn_to_page() or page_to_pfn()).

> So what is it?

IMHO a hack, because obviously we shouldn't require real physical
addresses for something the CPU can't actually address anyway and this
causes real problems (eg. it doesn't actually work on anything other
than x86_64). There's no reason the "PFN" we store in device-private
entries couldn't instead just be an index into some data structure
holding pointers to the struct pages. So instead of using
pfn_to_page()/page_to_pfn() we would use device_private_index_to_page()
and page_to_device_private_index().

We discussed this briefly at LSFMM, I think your suggestion for a data
structure was to use a maple tree. I'm yet to look at this more deeply
but I'd like to figure out where memdescs fit in this picture too.

 - Alistair

> > This is effectively converting from a DEVICE_PRIVATE page to an actual
> > DMA'able address of some kind. The DEVICE_PRIVATE is just a non-usable
> > proxy, like a swap entry, for where the real data is sitting.
> >
> > Jason
On Tue, Jul 22, 2025 at 10:49:10AM +1000, Alistair Popple wrote:
> > So what is it?
>
> IMHO a hack, because obviously we shouldn't require real physical addresses for
> something the CPU can't actually address anyway and this causes real
> problems

IMHO what DEVICE PRIVATE really boils down to is a way to have swap
entries that point to some kind of opaque driver managed memory.

We have alot of assumptions all over about pfn/phys to page
relationships so anything that has a struct page also has to come with
a fake PFN today..

> (eg. it doesn't actually work on anything other than x86_64). There's no reason
> the "PFN" we store in device-private entries couldn't instead just be an index
> into some data structure holding pointers to the struct pages. So instead of
> using pfn_to_page()/page_to_pfn() we would use device_private_index_to_page()
> and page_to_device_private_index().

It could work, but any of the pfn conversions would have to be tracked
down.. Could be troublesome.

Jason
On Wed, Jul 23, 2025 at 12:51:42AM -0300, Jason Gunthorpe wrote:
> On Tue, Jul 22, 2025 at 10:49:10AM +1000, Alistair Popple wrote:
> > > So what is it?
> >
> > IMHO a hack, because obviously we shouldn't require real physical addresses for
> > something the CPU can't actually address anyway and this causes real
> > problems
>
> IMHO what DEVICE PRIVATE really boils down to is a way to have swap
> entries that point to some kind of opaque driver managed memory.
>
> We have alot of assumptions all over about pfn/phys to page
> relationships so anything that has a struct page also has to come with
> a fake PFN today..

Hmm ... maybe. To get that PFN though we have to come from either a special
swap entry which we already have special cases for, or a struct page (which is
a device private page) which we mostly have to handle specially anyway. I'm not
sure there's too many places that can sensibly handle a fake PFN without somehow
already knowing it is device-private PFN.

> > (eg. it doesn't actually work on anything other than x86_64). There's no reason
> > the "PFN" we store in device-private entries couldn't instead just be an index
> > into some data structure holding pointers to the struct pages. So instead of
> > using pfn_to_page()/page_to_pfn() we would use device_private_index_to_page()
> > and page_to_device_private_index().
>
> It could work, but any of the pfn conversions would have to be tracked
> down.. Could be troublesome.

I looked at this a while back and I'm reasonably optimistic that this is doable
because we already have to treat these specially everywhere anyway. The proof
will be writing the patches of course.

 - Alistair

> Jason
On 23.07.25 06:10, Alistair Popple wrote:
> On Wed, Jul 23, 2025 at 12:51:42AM -0300, Jason Gunthorpe wrote:
>> On Tue, Jul 22, 2025 at 10:49:10AM +1000, Alistair Popple wrote:
>>>> So what is it?
>>>
>>> IMHO a hack, because obviously we shouldn't require real physical addresses for
>>> something the CPU can't actually address anyway and this causes real
>>> problems
>>
>> IMHO what DEVICE PRIVATE really boils down to is a way to have swap
>> entries that point to some kind of opaque driver managed memory.
>>
>> We have alot of assumptions all over about pfn/phys to page
>> relationships so anything that has a struct page also has to come with
>> a fake PFN today..
>
> Hmm ... maybe. To get that PFN though we have to come from either a special
> swap entry which we already have special cases for, or a struct page (which is
> a device private page) which we mostly have to handle specially anyway. I'm not
> sure there's too many places that can sensibly handle a fake PFN without somehow
> already knowing it is device-private PFN.
>
>>> (eg. it doesn't actually work on anything other than x86_64). There's no reason
>>> the "PFN" we store in device-private entries couldn't instead just be an index
>>> into some data structure holding pointers to the struct pages. So instead of
>>> using pfn_to_page()/page_to_pfn() we would use device_private_index_to_page()
>>> and page_to_device_private_index().
>>
>> It could work, but any of the pfn conversions would have to be tracked
>> down.. Could be troublesome.
>
> I looked at this a while back and I'm reasonably optimistic that this is doable
> because we already have to treat these specially everywhere anyway.

How would that look like?

E.g., we have code like

	if (is_device_private_entry(entry)) {
		page = pfn_swap_entry_to_page(entry);
		folio = page_folio(page);

		...
		folio_get(folio);
		...
	}

We could easily stop allowing pfn_swap_entry_to_page(), turning these
into non-pfn swap entries.

Would it then be something like

	if (is_device_private_entry(entry)) {
		page = device_private_entry_to_page(entry);

		...
	}

Whereby device_private_entry_to_page() obtains the "struct page" not
via the PFN but some other magical (index) value?

-- 
Cheers,

David / dhildenb
On Thu, Jul 24, 2025 at 10:52:54AM +0200, David Hildenbrand wrote:
> On 23.07.25 06:10, Alistair Popple wrote:
> > On Wed, Jul 23, 2025 at 12:51:42AM -0300, Jason Gunthorpe wrote:
> > > On Tue, Jul 22, 2025 at 10:49:10AM +1000, Alistair Popple wrote:
> > > > > So what is it?
> > > >
> > > > IMHO a hack, because obviously we shouldn't require real physical addresses for
> > > > something the CPU can't actually address anyway and this causes real
> > > > problems
> > >
> > > IMHO what DEVICE PRIVATE really boils down to is a way to have swap
> > > entries that point to some kind of opaque driver managed memory.
> > >
> > > We have alot of assumptions all over about pfn/phys to page
> > > relationships so anything that has a struct page also has to come with
> > > a fake PFN today..
> >
> > Hmm ... maybe. To get that PFN though we have to come from either a special
> > swap entry which we already have special cases for, or a struct page (which is
> > a device private page) which we mostly have to handle specially anyway. I'm not
> > sure there's too many places that can sensibly handle a fake PFN without somehow
> > already knowing it is device-private PFN.
> >
> > > > (eg. it doesn't actually work on anything other than x86_64). There's no reason
> > > > the "PFN" we store in device-private entries couldn't instead just be an index
> > > > into some data structure holding pointers to the struct pages. So instead of
> > > > using pfn_to_page()/page_to_pfn() we would use device_private_index_to_page()
> > > > and page_to_device_private_index().
> > >
> > > It could work, but any of the pfn conversions would have to be tracked
> > > down.. Could be troublesome.
> >
> > I looked at this a while back and I'm reasonably optimistic that this is doable
> > because we already have to treat these specially everywhere anyway.
>
> How would that look like?
>
> E.g., we have code like
>
> 	if (is_device_private_entry(entry)) {
> 		page = pfn_swap_entry_to_page(entry);
> 		folio = page_folio(page);
>
> 		...
> 		folio_get(folio);
> 		...
> 	}
>
> We could easily stop allowing pfn_swap_entry_to_page(), turning these into
> non-pfn swap entries.
>
> Would it then be something like
>
> 	if (is_device_private_entry(entry)) {
> 		page = device_private_entry_to_page(entry);
>
> 		...
> 	}
>
> Whereby device_private_entry_to_page() obtains the "struct page" not via the
> PFN but some other magical (index) value?

Exactly. The observation being that when you convert a PTE from a swap entry
to a page we already know it is a device private entry, so can go look up the
struct page with special magic (eg. an index into some other array or data
structure).

And if you have a struct page you already know it's a device private page so if
you need to create the swap entry you can look up the magic index using some
alternate function.

The only issue would be if there were generic code paths that somehow have a
raw pfn obtained from neither a page-table walk or struct page. My assumption
(yet to be proven/tested) is that these paths don't exist.

 - Alistair

> 
> -- 
> Cheers,
> 
> David / dhildenb
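
One possible shape of the index scheme being discussed, sketched with
an xarray rather than the maple tree mentioned earlier; every name
below is invented for illustration and nothing here exists in the
kernel today.

#include <linux/xarray.h>

/* Hypothetical global table mapping indices to device private pages. */
static DEFINE_XARRAY_ALLOC(device_private_pages);

/* Called when a driver registers a device private page. */
static int device_private_page_register(struct page *page,
					unsigned long *index)
{
	u32 id;
	int ret;

	ret = xa_alloc(&device_private_pages, &id, page, xa_limit_31b,
		       GFP_KERNEL);
	if (ret)
		return ret;
	/*
	 * The driver would also stash 'id' somewhere reachable from the
	 * page (e.g. zone_device_data or a pgmap-relative offset) so
	 * that page_to_device_private_index() can find it later.
	 */
	*index = id;
	return 0;
}

/* Replacement for pfn_swap_entry_to_page() on device private entries. */
static struct page *device_private_index_to_page(unsigned long index)
{
	return xa_load(&device_private_pages, index);
}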
On Fri, Jul 25, 2025 at 10:31:25AM +1000, Alistair Popple wrote:
> The only issue would be if there were generic code paths that somehow have a
> raw pfn obtained from neither a page-table walk or struct page. My assumption
> (yet to be proven/tested) is that these paths don't exist.

hmm does it, it encodes the device private into a pfn and expects the
caller to do pfn to page.

This isn't set in stone and could be changed..

But broadly, you'd want to entirely eliminate the ability to go from
pfn to device private or from device private to pfn.

Instead you'd want to work on some (space #, space index) tuple, maybe
encoded in a pfn_t, but absolutely and typesafely distinct. Each
driver gets its own 0 based space for device private information, the
space is effectively the pgmap.

And if you do this, maybe we don't need struct page (I mean the type!)
backing device memory at all.... Which would be a very worthwhile
project.

Do we ever even use anything in the device private struct page? Do we
refcount it?

Jason
On 01.08.25 18:40, Jason Gunthorpe wrote:
> On Fri, Jul 25, 2025 at 10:31:25AM +1000, Alistair Popple wrote:
>
>> The only issue would be if there were generic code paths that somehow have a
>> raw pfn obtained from neither a page-table walk or struct page. My assumption
>> (yet to be proven/tested) is that these paths don't exist.
>
> hmm does it, it encodes the device private into a pfn and expects the
> caller to do pfn to page.
>
> This isn't set in stone and could be changed..
>
> But broadly, you'd want to entirely eliminate the ability to go from
> pfn to device private or from device private to pfn.
>
> Instead you'd want to work on some (space #, space index) tuple, maybe
> encoded in a pfn_t, but absolutely and typesafely distinct. Each
> driver gets its own 0 based space for device private information, the
> space is effectively the pgmap.
>
> And if you do this, maybe we don't need struct page (I mean the type!)
> backing device memory at all.... Which would be a very worthwhile
> project.
>
> Do we ever even use anything in the device private struct page? Do we
> refcount it?

ref-counted and map-counted ...

-- 
Cheers,

David / dhildenb
On Fri, Aug 01, 2025 at 06:50:18PM +0200, David Hildenbrand wrote:
> On 01.08.25 18:40, Jason Gunthorpe wrote:
> > On Fri, Jul 25, 2025 at 10:31:25AM +1000, Alistair Popple wrote:
> >
> > > The only issue would be if there were generic code paths that somehow have a
> > > raw pfn obtained from neither a page-table walk or struct page. My assumption
> > > (yet to be proven/tested) is that these paths don't exist.
> >
> > hmm does it, it encodes the device private into a pfn and expects the
> > caller to do pfn to page.
> >
> > This isn't set in stone and could be changed..
> >
> > But broadly, you'd want to entirely eliminate the ability to go from
> > pfn to device private or from device private to pfn.
> >
> > Instead you'd want to work on some (space #, space index) tuple, maybe
> > encoded in a pfn_t, but absolutely and typesafely distinct. Each
> > driver gets its own 0 based space for device private information, the
> > space is effectively the pgmap.
> >
> > And if you do this, maybe we don't need struct page (I mean the type!)
> > backing device memory at all.... Which would be a very worthwhile
> > project.
> >
> > Do we ever even use anything in the device private struct page? Do we
> > refcount it?
>
> ref-counted and map-counted ...

Hm, so it would turn into another struct page split up where we get
ourselves a struct device_private and change all the places touching
its refcount and mapcount to use the new type.

If we could use some index scheme we could then divorce from struct
page and shrink the struct size sooner.

Jason
On Fri, Aug 01, 2025 at 01:57:49PM -0300, Jason Gunthorpe wrote:
> On Fri, Aug 01, 2025 at 06:50:18PM +0200, David Hildenbrand wrote:
> > On 01.08.25 18:40, Jason Gunthorpe wrote:
> > > On Fri, Jul 25, 2025 at 10:31:25AM +1000, Alistair Popple wrote:
> > >
> > > > The only issue would be if there were generic code paths that somehow have a
> > > > raw pfn obtained from neither a page-table walk or struct page. My assumption
> > > > (yet to be proven/tested) is that these paths don't exist.
> > >
> > > hmm does it, it encodes the device private into a pfn and expects the
> > > caller to do pfn to page.

What callers need to do pfn to page when finding a device private pfn via
hmm_range_fault()? GPU drivers don't, they tend just to use the pfn as an offset
from the start of the pgmap to find whatever data structure they are using to
track device memory allocations.

The migrate_vma_*() calls do, but they could easily be changed to whatever
index scheme we use so long as we can encode that this is a device entry in
the MIGRATE_PFN flags.

So other than adding a HMM_PFN flag to say this is really a device index I don't
see too many issues here.

> > > This isn't set in stone and could be changed..
> > >
> > > But broadly, you'd want to entirely eliminate the ability to go from
> > > pfn to device private or from device private to pfn.
> > >
> > > Instead you'd want to work on some (space #, space index) tuple, maybe
> > > encoded in a pfn_t, but absolutely and typesafely distinct. Each
> > > driver gets its own 0 based space for device private information, the
> > > space is effectively the pgmap.
> > >
> > > And if you do this, maybe we don't need struct page (I mean the type!)
> > > backing device memory at all.... Which would be a very worthwhile
> > > project.

Exactly! Although we still need enough of a struct page or something else to
still be able to map them in normal anonymous VMAs. Short term the motivation
for this project is that the current scheme of "stealing" pfns for the device
doesn't actually work in a lot of cases.

> > > Do we ever even use anything in the device private struct page? Do we
> > > refcount it?
> >
> > ref-counted and map-counted ...
>
> Hm, so it would turn into another struct page split up where we get
> ourselves a struct device_private and change all the places touching
> its refcount and mapcount to use the new type.
>
> If we could use some index scheme we could then divorce from struct
> page and shrink the struct size sooner.

Right, that is roughly along the lines of what I was thinking.

> Jason
On Mon, Aug 04, 2025 at 11:51:38AM +1000, Alistair Popple wrote:
> On Fri, Aug 01, 2025 at 01:57:49PM -0300, Jason Gunthorpe wrote:
> > On Fri, Aug 01, 2025 at 06:50:18PM +0200, David Hildenbrand wrote:
> > > On 01.08.25 18:40, Jason Gunthorpe wrote:
> > > > On Fri, Jul 25, 2025 at 10:31:25AM +1000, Alistair Popple wrote:
> > > >
> > > > > The only issue would be if there were generic code paths that somehow have a
> > > > > raw pfn obtained from neither a page-table walk or struct page. My assumption
> > > > > (yet to be proven/tested) is that these paths don't exist.
> > > >
> > > > hmm does it, it encodes the device private into a pfn and expects the
> > > > caller to do pfn to page.
>
> What callers need to do pfn to page when finding a device private pfn via
> hmm_range_fault()? GPU drivers don't, they tend just to use the pfn as an offset
> from the start of the pgmap to find whatever data structure they are using to
> track device memory allocations.

All drivers today must. You have no idea if the PFN returned is a
private or CPU page. The only way to know is to check the struct page
type, by looking inside the struct page.

> So other than adding a HMM_PFN flag to say this is really a device index I don't
> see too many issues here.

Christoph suggested exactly this, and it would solve the issue. Seems
quite easy too. Let's do it.

Jason
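
For reference, the flag being converged on might look something like
the sketch below; the name HMM_PFN_DEVICE_PRIVATE and its bit position
are invented here and are not part of any posted patch.

/* Hypothetical new output flag; exact bit position TBD. */
#define HMM_PFN_DEVICE_PRIVATE	(1UL << (BITS_PER_LONG - 12))

/*
 * With such a flag set by hmm_range_fault(), callers could classify a
 * returned value without dereferencing the struct page at all.
 */
static inline bool hmm_pfn_is_device_private(unsigned long hmm_pfn)
{
	return hmm_pfn & HMM_PFN_DEVICE_PRIVATE;
}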
On 01.08.25 18:57, Jason Gunthorpe wrote:
> On Fri, Aug 01, 2025 at 06:50:18PM +0200, David Hildenbrand wrote:
>> On 01.08.25 18:40, Jason Gunthorpe wrote:
>>> On Fri, Jul 25, 2025 at 10:31:25AM +1000, Alistair Popple wrote:
>>>
>>>> The only issue would be if there were generic code paths that somehow have a
>>>> raw pfn obtained from neither a page-table walk or struct page. My assumption
>>>> (yet to be proven/tested) is that these paths don't exist.
>>>
>>> hmm does it, it encodes the device private into a pfn and expects the
>>> caller to do pfn to page.
>>>
>>> This isn't set in stone and could be changed..
>>>
>>> But broadly, you'd want to entirely eliminate the ability to go from
>>> pfn to device private or from device private to pfn.
>>>
>>> Instead you'd want to work on some (space #, space index) tuple, maybe
>>> encoded in a pfn_t, but absolutely and typesafely distinct. Each
>>> driver gets its own 0 based space for device private information, the
>>> space is effectively the pgmap.
>>>
>>> And if you do this, maybe we don't need struct page (I mean the type!)
>>> backing device memory at all.... Which would be a very worthwhile
>>> project.
>>>
>>> Do we ever even use anything in the device private struct page? Do we
>>> refcount it?
>>
>> ref-counted and map-counted ...
>
> Hm, so it would turn into another struct page split up where we get
> ourselves a struct device_private and change all the places touching
> its refcount and mapcount to use the new type.

We're already working with folios in all cases where we modify either
refcount or mapcount IIUC.

The rmap handling (try to migrate, soon folio splitting) currently
depends on the mapcount.

Not sure how that will all look like without a ... struct folio /
struct page.

-- 
Cheers,

David / dhildenb
On 25.07.25 02:31, Alistair Popple wrote:
> On Thu, Jul 24, 2025 at 10:52:54AM +0200, David Hildenbrand wrote:
>> On 23.07.25 06:10, Alistair Popple wrote:
>>> On Wed, Jul 23, 2025 at 12:51:42AM -0300, Jason Gunthorpe wrote:
>>>> On Tue, Jul 22, 2025 at 10:49:10AM +1000, Alistair Popple wrote:
>>>>>> So what is it?
>>>>>
>>>>> IMHO a hack, because obviously we shouldn't require real physical addresses for
>>>>> something the CPU can't actually address anyway and this causes real
>>>>> problems
>>>>
>>>> IMHO what DEVICE PRIVATE really boils down to is a way to have swap
>>>> entries that point to some kind of opaque driver managed memory.
>>>>
>>>> We have alot of assumptions all over about pfn/phys to page
>>>> relationships so anything that has a struct page also has to come with
>>>> a fake PFN today..
>>>
>>> Hmm ... maybe. To get that PFN though we have to come from either a special
>>> swap entry which we already have special cases for, or a struct page (which is
>>> a device private page) which we mostly have to handle specially anyway. I'm not
>>> sure there's too many places that can sensibly handle a fake PFN without somehow
>>> already knowing it is device-private PFN.
>>>
>>>>> (eg. it doesn't actually work on anything other than x86_64). There's no reason
>>>>> the "PFN" we store in device-private entries couldn't instead just be an index
>>>>> into some data structure holding pointers to the struct pages. So instead of
>>>>> using pfn_to_page()/page_to_pfn() we would use device_private_index_to_page()
>>>>> and page_to_device_private_index().
>>>>
>>>> It could work, but any of the pfn conversions would have to be tracked
>>>> down.. Could be troublesome.
>>>
>>> I looked at this a while back and I'm reasonably optimistic that this is doable
>>> because we already have to treat these specially everywhere anyway.
>> How would that look like?
>>
>> E.g., we have code like
>>
>> 	if (is_device_private_entry(entry)) {
>> 		page = pfn_swap_entry_to_page(entry);
>> 		folio = page_folio(page);
>>
>> 		...
>> 		folio_get(folio);
>> 		...
>> 	}
>>
>> We could easily stop allowing pfn_swap_entry_to_page(), turning these into
>> non-pfn swap entries.
>>
>> Would it then be something like
>>
>> 	if (is_device_private_entry(entry)) {
>> 		page = device_private_entry_to_page(entry);
>>
>> 		...
>> 	}
>>
>> Whereby device_private_entry_to_page() obtains the "struct page" not via the
>> PFN but some other magical (index) value?
>
> Exactly. The observation being that when you convert a PTE from a swap entry
> to a page we already know it is a device private entry, so can go look up the
> struct page with special magic (eg. an index into some other array or data
> structure).
>
> And if you have a struct page you already know it's a device private page so if
> you need to create the swap entry you can look up the magic index using some
> alternate function.
>
> The only issue would be if there were generic code paths that somehow have a
> raw pfn obtained from neither a page-table walk or struct page. My assumption
> (yet to be proven/tested) is that these paths don't exist.

I guess memory compaction and friends don't apply to ZONE_DEVICE, and
even memory_failure() handling goes a separate path.

-- 
Cheers,

David / dhildenb
On Fri, Jul 18, 2025 at 11:44:42AM -0300, Jason Gunthorpe wrote:
> On Fri, Jul 18, 2025 at 03:17:00PM +0100, Matthew Wilcox wrote:
> > On Fri, Jul 18, 2025 at 02:51:08PM +0300, Yonatan Maman wrote:
> > > +++ b/include/linux/memremap.h
> > > @@ -89,6 +89,14 @@ struct dev_pagemap_ops {
> > >  	 */
> > >  	vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
> > >
> > > +	/*
> > > +	 * Used for private (un-addressable) device memory only. Return a
> > > +	 * corresponding PFN for a page that can be mapped to device
> > > +	 * (e.g using dma_map_page)
> > > +	 */
> > > +	int (*get_dma_pfn_for_device)(struct page *private_page,
> > > +				      unsigned long *dma_pfn);
> >
> > This makes no sense. If a page is addressable then it has a PFN.
> > If a page is not addressable then it doesn't have a PFN.
>
> The DEVICE_PRIVATE pages have a PFN, but it is not usable for
> anything.
>
> This is effectively converting from a DEVICE_PRIVATE page to an actual
> DMA'able address of some kind. The DEVICE_PRIVATE is just a non-usable
> proxy, like a swap entry, for where the real data is sitting.

Yes, it's on my backlog to start looking at using something other than a real
PFN for this proxy. Because having it as an actual PFN has caused us all sorts
of random issues as it still needs to reserve a real physical address range
which may or may not be available on a given machine.

> 
> Jason