[PATCH v2 0/5] *** GPU Direct RDMA (P2P DMA) for Device Private Pages ***

Yonatan Maman posted 5 patches 2 months, 2 weeks ago
From: Yonatan Maman <Ymaman@Nvidia.com>

This patch series aims to enable Peer-to-Peer (P2P) DMA access in
GPU-centric applications that utilize RDMA and device private pages. This
reduces data transfer overhead by allowing the GPU to expose device
private page data directly to devices such as NICs, eliminating the need
to bounce the data through system RAM, which is the default path for
exposing device private page data.

To fully support Peer-to-Peer for device private pages, the following
changes are proposed:

`Memory Management (MM)`
 * Leverage struct dev_pagemap_ops to support P2P page operations: this
change lets the GPU driver directly expose device private pages for
P2P DMA.
 * Extend hmm_range_fault to support P2P connections for device private
pages (instead of faulting the data back to system RAM)

`IB Drivers`
Add a TRY_P2P_REQ flag to the hmm_range_fault call: this flag requests
P2P mapping, enabling IB drivers to handle P2P DMA requests efficiently.

`Nouveau driver`
Add support for the Nouveau p2p_page callback function: This update
integrates P2P DMA support into the Nouveau driver, allowing it to handle
P2P page operations seamlessly.

`MLX5 Driver`
Utilize the NIC's Address Translation Service (ATS) for ODP memory to
optimize P2P DMA for device private pages. Additionally, when P2P DMA
mapping fails due to an inaccessible bridge, fall back to standard DMA
through host memory for the affected PFNs.
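The per-PFN fallback described for the MLX5 driver can be sketched as
follows. This is a user-space model, not code from the series: the names
(p2p_reachable, map_one_pfn, map_range) and the even/odd reachability rule
are illustrative stand-ins for the real bridge check and mapping paths.

```c
#include <stdbool.h>
#include <stddef.h>

enum map_type { MAP_P2P, MAP_HOST };

/* Stand-in for a bridge/IOMMU reachability check, analogous to what
 * pci_p2pdma_map_type() computes in the kernel. */
static bool p2p_reachable(unsigned long pfn)
{
    /* Pretend even PFNs sit behind an accessible bridge. */
    return (pfn & 1) == 0;
}

/* Decide, per PFN, whether to keep the P2P mapping or fall back to
 * a standard DMA mapping through host memory. */
static enum map_type map_one_pfn(unsigned long pfn)
{
    if (p2p_reachable(pfn))
        return MAP_P2P;
    return MAP_HOST;    /* copy through system RAM instead */
}

/* Map a range, returning how many PFNs needed the host fallback. */
static size_t map_range(const unsigned long *pfns, size_t n,
                        enum map_type *out)
{
    size_t fallbacks = 0;

    for (size_t i = 0; i < n; i++) {
        out[i] = map_one_pfn(pfns[i]);
        if (out[i] == MAP_HOST)
            fallbacks++;
    }
    return fallbacks;
}
```

The key property is that the fallback is per PFN: one unreachable page
does not force the whole range onto the host path.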

Previous version:
https://lore.kernel.org/linux-mm/20241201103659.420677-1-ymaman@nvidia.com/
https://lore.kernel.org/linux-mm/20241015152348.3055360-1-ymaman@nvidia.com/

Yonatan Maman (5):
  mm/hmm: HMM API to enable P2P DMA for device private pages
  nouveau/dmem: HMM P2P DMA for private dev pages
  IB/core: P2P DMA for device private pages
  RDMA/mlx5: Enable P2P DMA with fallback mechanism
  RDMA/mlx5: Enabling ATS for ODP memory

 drivers/gpu/drm/nouveau/nouveau_dmem.c | 110 +++++++++++++++++++++++++
 drivers/infiniband/core/umem_odp.c     |   4 +
 drivers/infiniband/hw/mlx5/mlx5_ib.h   |   6 +-
 drivers/infiniband/hw/mlx5/odp.c       |  24 +++++-
 include/linux/hmm.h                    |   3 +-
 include/linux/memremap.h               |   8 ++
 mm/hmm.c                               |  57 ++++++++++---
 7 files changed, 195 insertions(+), 17 deletions(-)

-- 
2.34.1
Re: [PATCH v2 0/5] *** GPU Direct RDMA (P2P DMA) for Device Private Pages ***
Posted by Christoph Hellwig 2 months, 2 weeks ago
Please use a more suitable name for your series.  There's absolutely
nothing GPU-specific here, and reusing the name from a complete
trainwreck that your company pushed over the last few years doesn't
help either.
Re: [PATCH v2 0/5] *** GPU Direct RDMA (P2P DMA) for Device Private Pages ***
Posted by Leon Romanovsky 2 months, 2 weeks ago
On Fri, Jul 18, 2025 at 02:51:07PM +0300, Yonatan Maman wrote:
> From: Yonatan Maman <Ymaman@Nvidia.com>
> 
> This patch series aims to enable Peer-to-Peer (P2P) DMA access in
> GPU-centric applications that utilize RDMA and private device pages. This
> enhancement reduces data transfer overhead by allowing the GPU to directly
> expose device private page data to devices such as NICs, eliminating the
> need to traverse system RAM, which is the native method for exposing
> device private page data.
> 
> To fully support Peer-to-Peer for device private pages, the following
> changes are proposed:
> 
> `Memory Management (MM)`
>  * Leverage struct pagemap_ops to support P2P page operations: This
> modification ensures that the GPU can directly map device private pages
> for P2P DMA.
>  * Utilize hmm_range_fault to support P2P connections for device private
> pages (instead of Page fault)
> 
> `IB Drivers`
> Add TRY_P2P_REQ flag for the hmm_range_fault call: This flag indicates the
> need for P2P mapping, enabling IB drivers to efficiently handle P2P DMA
> requests.
> 
> `Nouveau driver`
> Add support for the Nouveau p2p_page callback function: This update
> integrates P2P DMA support into the Nouveau driver, allowing it to handle
> P2P page operations seamlessly.
> 
> `MLX5 Driver`
> Utilize NIC Address Translation Service (ATS) for ODP memory, to optimize
> DMA P2P for private device pages. Also, when P2P DMA mapping fails due to
> inaccessible bridges, the system falls back to standard DMA, which uses host
> memory, for the affected PFNs

I'm probably missing something very important, but why can't you always
perform p2p if two devices support it? It is strange that IB and not HMM
has a fallback mode.

Thanks

> 
> Previous version:
> https://lore.kernel.org/linux-mm/20241201103659.420677-1-ymaman@nvidia.com/
> https://lore.kernel.org/linux-mm/20241015152348.3055360-1-ymaman@nvidia.com/
> 
> Yonatan Maman (5):
>   mm/hmm: HMM API to enable P2P DMA for device private pages
>   nouveau/dmem: HMM P2P DMA for private dev pages
>   IB/core: P2P DMA for device private pages
>   RDMA/mlx5: Enable P2P DMA with fallback mechanism
>   RDMA/mlx5: Enabling ATS for ODP memory
> 
>  drivers/gpu/drm/nouveau/nouveau_dmem.c | 110 +++++++++++++++++++++++++
>  drivers/infiniband/core/umem_odp.c     |   4 +
>  drivers/infiniband/hw/mlx5/mlx5_ib.h   |   6 +-
>  drivers/infiniband/hw/mlx5/odp.c       |  24 +++++-
>  include/linux/hmm.h                    |   3 +-
>  include/linux/memremap.h               |   8 ++
>  mm/hmm.c                               |  57 ++++++++++---
>  7 files changed, 195 insertions(+), 17 deletions(-)
> 
> -- 
> 2.34.1
>
Re: [PATCH v2 0/5] *** GPU Direct RDMA (P2P DMA) for Device Private Pages ***
Posted by Yonatan Maman 2 months, 2 weeks ago

On 20/07/2025 13:30, Leon Romanovsky wrote:
> On Fri, Jul 18, 2025 at 02:51:07PM +0300, Yonatan Maman wrote:
>> From: Yonatan Maman <Ymaman@Nvidia.com>
>>
>> This patch series aims to enable Peer-to-Peer (P2P) DMA access in
>> GPU-centric applications that utilize RDMA and private device pages. This
>> enhancement reduces data transfer overhead by allowing the GPU to directly
>> expose device private page data to devices such as NICs, eliminating the
>> need to traverse system RAM, which is the native method for exposing
>> device private page data.
>>
>> To fully support Peer-to-Peer for device private pages, the following
>> changes are proposed:
>>
>> `Memory Management (MM)`
>>   * Leverage struct pagemap_ops to support P2P page operations: This
>> modification ensures that the GPU can directly map device private pages
>> for P2P DMA.
>>   * Utilize hmm_range_fault to support P2P connections for device private
>> pages (instead of Page fault)
>>
>> `IB Drivers`
>> Add TRY_P2P_REQ flag for the hmm_range_fault call: This flag indicates the
>> need for P2P mapping, enabling IB drivers to efficiently handle P2P DMA
>> requests.
>>
>> `Nouveau driver`
>> Add support for the Nouveau p2p_page callback function: This update
>> integrates P2P DMA support into the Nouveau driver, allowing it to handle
>> P2P page operations seamlessly.
>>
>> `MLX5 Driver`
>> Utilize NIC Address Translation Service (ATS) for ODP memory, to optimize
>> DMA P2P for private device pages. Also, when P2P DMA mapping fails due to
>> inaccessible bridges, the system falls back to standard DMA, which uses host
>> memory, for the affected PFNs
> 
> I'm probably missing something very important, but why can't you always
> perform p2p if two devices support it? It is strange that IB and not HMM
> has a fallback mode.
> 
> Thanks
>

P2P mapping can fail even when both devices support it, due to PCIe 
bridge limitations or IOMMU restrictions that block direct P2P access. 
The fallback is in IB rather than HMM because HMM only manages memory 
pages - it doesn't do DMA mapping. The IB driver does the actual DMA 
operations, so it knows when P2P mapping fails and can fall back to 
copying through system memory.
In fact, hmm_range_fault doesn't have information about the destination 
device that will perform the DMA mapping.
>>
>> Previous version:
>> https://lore.kernel.org/linux-mm/20241201103659.420677-1-ymaman@nvidia.com/
>> https://lore.kernel.org/linux-mm/20241015152348.3055360-1-ymaman@nvidia.com/
>>
>> Yonatan Maman (5):
>>    mm/hmm: HMM API to enable P2P DMA for device private pages
>>    nouveau/dmem: HMM P2P DMA for private dev pages
>>    IB/core: P2P DMA for device private pages
>>    RDMA/mlx5: Enable P2P DMA with fallback mechanism
>>    RDMA/mlx5: Enabling ATS for ODP memory
>>
>>   drivers/gpu/drm/nouveau/nouveau_dmem.c | 110 +++++++++++++++++++++++++
>>   drivers/infiniband/core/umem_odp.c     |   4 +
>>   drivers/infiniband/hw/mlx5/mlx5_ib.h   |   6 +-
>>   drivers/infiniband/hw/mlx5/odp.c       |  24 +++++-
>>   include/linux/hmm.h                    |   3 +-
>>   include/linux/memremap.h               |   8 ++
>>   mm/hmm.c                               |  57 ++++++++++---
>>   7 files changed, 195 insertions(+), 17 deletions(-)
>>
>> --
>> 2.34.1
>>
Re: [PATCH v2 0/5] *** GPU Direct RDMA (P2P DMA) for Device Private Pages ***
Posted by Leon Romanovsky 2 months, 2 weeks ago
On Mon, Jul 21, 2025 at 12:03:51AM +0300, Yonatan Maman wrote:
> 
> 
> On 20/07/2025 13:30, Leon Romanovsky wrote:
> > On Fri, Jul 18, 2025 at 02:51:07PM +0300, Yonatan Maman wrote:
> > > From: Yonatan Maman <Ymaman@Nvidia.com>
> > > 
> > > This patch series aims to enable Peer-to-Peer (P2P) DMA access in
> > > GPU-centric applications that utilize RDMA and private device pages. This
> > > enhancement reduces data transfer overhead by allowing the GPU to directly
> > > expose device private page data to devices such as NICs, eliminating the
> > > need to traverse system RAM, which is the native method for exposing
> > > device private page data.
> > > 
> > > To fully support Peer-to-Peer for device private pages, the following
> > > changes are proposed:
> > > 
> > > `Memory Management (MM)`
> > >   * Leverage struct pagemap_ops to support P2P page operations: This
> > > modification ensures that the GPU can directly map device private pages
> > > for P2P DMA.
> > >   * Utilize hmm_range_fault to support P2P connections for device private
> > > pages (instead of Page fault)
> > > 
> > > `IB Drivers`
> > > Add TRY_P2P_REQ flag for the hmm_range_fault call: This flag indicates the
> > > need for P2P mapping, enabling IB drivers to efficiently handle P2P DMA
> > > requests.
> > > 
> > > `Nouveau driver`
> > > Add support for the Nouveau p2p_page callback function: This update
> > > integrates P2P DMA support into the Nouveau driver, allowing it to handle
> > > P2P page operations seamlessly.
> > > 
> > > `MLX5 Driver`
> > > Utilize NIC Address Translation Service (ATS) for ODP memory, to optimize
> > > DMA P2P for private device pages. Also, when P2P DMA mapping fails due to
> > > inaccessible bridges, the system falls back to standard DMA, which uses host
> > > memory, for the affected PFNs
> > 
> > I'm probably missing something very important, but why can't you always
> > perform p2p if two devices support it? It is strange that IB and not HMM
> > has a fallback mode.
> > 
> > Thanks
> > 
> 
> P2P mapping can fail even when both devices support it, due to PCIe bridge
> limitations or IOMMU restrictions that block direct P2P access.

Yes, that is how p2p works. The decision "is p2p supported or not" is
calculated by pci_p2pdma_map_type(). That function needs to know which
two devices will be connected.

With the proposed HMM_PFN_ALLOW_P2P flag, you don't provide any device
information, so on a system with more than two p2p devices you will get
a completely random result.
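Leon's point is that the p2p verdict is a property of the device *pair*,
not of the page alone. A toy model (the topology, struct fields, and
function names below are invented for illustration, not kernel APIs):

```c
#include <stdbool.h>

struct toy_dev {
    int root_port;          /* which root port the device hangs off */
    bool acs_blocks_p2p;    /* ACS redirects peer traffic upstream  */
};

/* In this model, P2P works only when provider and client share a
 * root port and ACS doesn't block the path -- loosely mirroring the
 * pairwise check pci_p2pdma_map_type() performs in the kernel. */
static bool toy_p2p_possible(const struct toy_dev *provider,
                             const struct toy_dev *client)
{
    return provider->root_port == client->root_port &&
           !provider->acs_blocks_p2p && !client->acs_blocks_p2p;
}
```

With one GPU and two NICs on different root ports, the same page is
P2P-reachable from one NIC and not the other, so a flag that names no
client device cannot give a stable answer.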


> The fallback is in IB rather than HMM because HMM only manages memory pages - it doesn't
> do DMA mapping. The IB driver does the actual DMA operations, so it knows
> when P2P mapping fails and can fall back to copying through system memory.

The thing is that in the proposed patch, IB doesn't check that p2p is
established with the right device.
https://lore.kernel.org/all/20250718115112.3881129-5-ymaman@nvidia.com/

> In fact, hmm_range_fault doesn't have information about the destination
> device that will perform the DMA mapping.

So probably you need to teach HMM to perform page faults for a specific device.

Thanks

> > > 
> > > Previous version:
> > > https://lore.kernel.org/linux-mm/20241201103659.420677-1-ymaman@nvidia.com/
> > > https://lore.kernel.org/linux-mm/20241015152348.3055360-1-ymaman@nvidia.com/
> > > 
> > > Yonatan Maman (5):
> > >    mm/hmm: HMM API to enable P2P DMA for device private pages
> > >    nouveau/dmem: HMM P2P DMA for private dev pages
> > >    IB/core: P2P DMA for device private pages
> > >    RDMA/mlx5: Enable P2P DMA with fallback mechanism
> > >    RDMA/mlx5: Enabling ATS for ODP memory
> > > 
> > >   drivers/gpu/drm/nouveau/nouveau_dmem.c | 110 +++++++++++++++++++++++++
> > >   drivers/infiniband/core/umem_odp.c     |   4 +
> > >   drivers/infiniband/hw/mlx5/mlx5_ib.h   |   6 +-
> > >   drivers/infiniband/hw/mlx5/odp.c       |  24 +++++-
> > >   include/linux/hmm.h                    |   3 +-
> > >   include/linux/memremap.h               |   8 ++
> > >   mm/hmm.c                               |  57 ++++++++++---
> > >   7 files changed, 195 insertions(+), 17 deletions(-)
> > > 
> > > --
> > > 2.34.1
> > > 
> 
> 
>
Re: [PATCH v2 0/5] *** GPU Direct RDMA (P2P DMA) for Device Private Pages ***
Posted by Jason Gunthorpe 2 months, 2 weeks ago
On Mon, Jul 21, 2025 at 09:49:04AM +0300, Leon Romanovsky wrote:
> > In fact, hmm_range_fault doesn't have information about the destination
> > device that will perform the DMA mapping.
> 
> So probably you need to teach HMM to perform page_faults on specific device.

That isn't how the HMM side is supposed to work; this API is just
giving the one and only P2P page that is backing the device private
page.

The providing driver shouldn't be doing any p2pdma operations to check
feasibility.

Otherwise we would be doing p2p operations twice on every page, which
doesn't make sense.

We've consistently been saying that P2P is done on the DMA mapping
side only, and I think we should stick with that. Failing P2P is an
exception case, and the fix is to trigger page migration, which the
generic hmm code knows how to do. So calling hmm_range_fault again
makes sense to me. I wouldn't want drivers open-coding the migration
logic in the new callback.
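The retry flow Jason describes can be modeled as below. All functions
here are illustrative stubs (fake_range_fault, fake_dma_map), not the
kernel APIs; the point is only the control flow: attempt P2P at DMA-map
time, and on failure re-fault without P2P so the core code migrates the
data to system RAM.

```c
#include <stdbool.h>

enum fault_result { PAGE_P2P, PAGE_SYSTEM };

/* Stub: with P2P allowed, the fault hands back the device-private
 * backing P2P page; without it, migration yields a system page. */
static enum fault_result fake_range_fault(bool allow_p2p)
{
    return allow_p2p ? PAGE_P2P : PAGE_SYSTEM;
}

/* Stub: pretend the bridge rejects the P2P mapping, so only the
 * host path maps successfully. */
static bool fake_dma_map(enum fault_result page)
{
    return page == PAGE_SYSTEM;
}

/* Try P2P first; on mapping failure, fault again without P2P so the
 * generic code migrates the page, then map the system page. */
static enum fault_result map_with_fallback(void)
{
    enum fault_result page = fake_range_fault(true);

    if (fake_dma_map(page))
        return page;
    page = fake_range_fault(false);  /* triggers migration */
    fake_dma_map(page);              /* host path always maps */
    return page;
}
```

This keeps the migration logic in the core fault path rather than
open-coded in each driver's callback, which is the design point above.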

Jason
Re: [PATCH v2 0/5] *** GPU Direct RDMA (P2P DMA) for Device Private Pages ***
Posted by Leon Romanovsky 2 months, 2 weeks ago
On Wed, Jul 23, 2025 at 01:03:47AM -0300, Jason Gunthorpe wrote:
> On Mon, Jul 21, 2025 at 09:49:04AM +0300, Leon Romanovsky wrote:
> > > In fact, hmm_range_fault doesn't have information about the destination
> > > device that will perform the DMA mapping.
> > 
> > So probably you need to teach HMM to perform page_faults on specific device.
> 
> That isn't how the HMM side is supposed to work, this API is just
> giving the one and only P2P page that is backing the device private.

I know, but somehow you need to say: "please give me p2p pages for a
specific device, and not a random device in the system as it is now".
This is what is missing from my PoV.

Thanks