From: Yonatan Maman <Ymaman@Nvidia.com>

This patch series aims to enable Peer-to-Peer (P2P) DMA access in
GPU-centric applications that utilize RDMA and private device pages. This
enhancement is crucial for minimizing data transfer overhead by allowing
the GPU to directly expose device private page data to devices such as
NICs, eliminating the need to traverse system RAM, which is the native
method for exposing device private page data.

To fully support Peer-to-Peer for device private pages, the following
changes are proposed:

`Memory Management (MM)`
 * Leverage struct pagemap_ops to support P2P page operations: This
   modification ensures that the GPU can directly map device private pages
   for P2P DMA.
 * Utilize hmm_range_fault to support P2P connections for device private
   pages (instead of Page fault)

`IB Drivers`
Add TRY_P2P_REQ flag for the hmm_range_fault call: This flag indicates the
need for P2P mapping, enabling IB drivers to efficiently handle P2P DMA
requests.

`Nouveau driver`
Add support for the Nouveau p2p_page callback function: This update
integrates P2P DMA support into the Nouveau driver, allowing it to handle
P2P page operations seamlessly.

`MLX5 Driver`
Optimize PCI Peer-to-Peer for private device pages, by enabling Address
Translation Service (ATS) for ODP memory.

Yonatan Maman (4):
  mm/hmm: HMM API for P2P DMA to device zone pages
  nouveau/dmem: HMM P2P DMA for private dev pages
  IB/core: P2P DMA for device private pages
  RDMA/mlx5: Enabling ATS for ODP memory

 drivers/gpu/drm/nouveau/nouveau_dmem.c | 117 ++++++++++++++++++++++++-
 drivers/infiniband/core/umem_odp.c     |   2 +-
 drivers/infiniband/hw/mlx5/mlx5_ib.h   |   6 +-
 include/linux/hmm.h                    |   2 +
 include/linux/memremap.h               |   7 ++
 mm/hmm.c                               |  28 ++++++
 6 files changed, 156 insertions(+), 6 deletions(-)

-- 
2.34.1

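The flow described in the cover letter (a requester sets a "try P2P" flag, and the owner of a device private page may supply a peer-accessible page through a callback) can be sketched roughly as below. This is a user-space mock, not kernel code and not the code from this series: every `mock_` / `MOCK_` name is hypothetical, the structures only loosely mirror the pagemap ops table and the p2p_page callback named above, and the fallback to system RAM when the callback is missing or fails is an assumption drawn from the "TRY" in the flag name.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical request flag, modelled on the TRY_P2P_REQ flag in the cover letter. */
#define MOCK_REQ_TRY_P2P (1u << 0)

enum mock_page_kind {
	MOCK_PAGE_SYSTEM,          /* ordinary system RAM page */
	MOCK_PAGE_DEVICE_PRIVATE,  /* device memory not directly reachable by peers */
	MOCK_PAGE_P2P,             /* device memory exposed for peer DMA */
};

struct mock_page;

/* Hypothetical stand-in for the pagemap ops table; only the P2P callback is modelled. */
struct mock_pagemap_ops {
	/* Returns a peer-accessible page for the same data, or NULL if P2P is not possible. */
	struct mock_page *(*p2p_page)(struct mock_page *private_page);
};

struct mock_page {
	enum mock_page_kind kind;
	const struct mock_pagemap_ops *ops; /* NULL when the owner provides no ops */
	struct mock_page *peer;             /* what the mock driver callback hands back */
};

enum mock_result {
	MOCK_USE_AS_IS,       /* page can be mapped directly */
	MOCK_USE_P2P,         /* *out holds the peer-accessible page */
	MOCK_MIGRATE_TO_RAM,  /* fall back to the path through system RAM */
};

/* Decide how a requester (for example a NIC driver) gets at one page. */
enum mock_result mock_resolve(struct mock_page *page, unsigned int req_flags,
			      struct mock_page **out)
{
	assert(page != NULL && out != NULL);

	*out = page;
	if (page->kind != MOCK_PAGE_DEVICE_PRIVATE)
		return MOCK_USE_AS_IS;

	/* Only ask the owner for a P2P page when the requester opted in. */
	if ((req_flags & MOCK_REQ_TRY_P2P) && page->ops && page->ops->p2p_page) {
		struct mock_page *p2p = page->ops->p2p_page(page);

		if (p2p) {
			*out = p2p;
			return MOCK_USE_P2P;
		}
	}

	/* No flag, no callback, or the callback declined: assumed fallback path. */
	*out = NULL;
	return MOCK_MIGRATE_TO_RAM;
}

/* Mock driver callback: returns whatever peer page was attached to the private page. */
struct mock_page *mock_driver_p2p_page(struct mock_page *private_page)
{
	return private_page->peer;
}
```

A caller would fill in a `struct mock_pagemap_ops` with `mock_driver_p2p_page`, attach it to a device private page, and call `mock_resolve()` with or without `MOCK_REQ_TRY_P2P`. Without the flag, or when the callback returns NULL, the sketch reports `MOCK_MIGRATE_TO_RAM`, which corresponds to the path through system RAM that the cover letter calls the native method.
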
On Tue, Oct 15, 2024 at 06:23:44PM +0300, Yonatan Maman wrote:
> From: Yonatan Maman <Ymaman@Nvidia.com>
>
> This patch series aims to enable Peer-to-Peer (P2P) DMA access in
> GPU-centric applications that utilize RDMA and private device pages. This
> enhancement is crucial for minimizing data transfer overhead by allowing
> the GPU to directly expose device private page data to devices such as
> NICs, eliminating the need to traverse system RAM, which is the native
> method for exposing device private page data.

Please tone down your marketing language and explain your factual
changes. If you make performance claims back them by numbers.

On 16/10/2024 7:23, Christoph Hellwig wrote:
> On Tue, Oct 15, 2024 at 06:23:44PM +0300, Yonatan Maman wrote:
>> From: Yonatan Maman <Ymaman@Nvidia.com>
>>
>> This patch series aims to enable Peer-to-Peer (P2P) DMA access in
>> GPU-centric applications that utilize RDMA and private device pages. This
>> enhancement is crucial for minimizing data transfer overhead by allowing
>> the GPU to directly expose device private page data to devices such as
>> NICs, eliminating the need to traverse system RAM, which is the native
>> method for exposing device private page data.
>
> Please tone down your marketing language and explain your factual
> changes. If you make performance claims back them by numbers.
>

Got it, thanks! I'll fix that. Regarding performance, we’re achieving
over 10x higher bandwidth and 10x lower latency using perftest-rdma,
especially (with a high rate of GPU memory access).

On 2024/10/16 17:16, Yonatan Maman wrote:
>
>
> On 16/10/2024 7:23, Christoph Hellwig wrote:
>> On Tue, Oct 15, 2024 at 06:23:44PM +0300, Yonatan Maman wrote:
>>> From: Yonatan Maman <Ymaman@Nvidia.com>
>>>
>>> This patch series aims to enable Peer-to-Peer (P2P) DMA access in
>>> GPU-centric applications that utilize RDMA and private device pages.
>>> This enhancement is crucial for minimizing data transfer overhead by
>>> allowing the GPU to directly expose device private page data to devices
>>> such as NICs, eliminating the need to traverse system RAM, which is the
>>> native method for exposing device private page data.
>>
>> Please tone down your marketing language and explain your factual
>> changes. If you make performance claims back them by numbers.
>>
>
> Got it, thanks! I'll fix that. Regarding performance, we’re achieving
> over 10x higher bandwidth and 10x lower latency using perftest-rdma,
> especially (with a high rate of GPU memory access).

If I got this patch series correctly, this is based on ODP (On Demand
Paging). And a way also exists which is based on non-ODP. From the
following links, this way is implemented on efa, irdma and mlx5.

1. iRDMA
https://lore.kernel.org/all/20230217011425.498847-1-yanjun.zhu@intel.com/

2. efa
https://lore.kernel.org/lkml/20211007114018.GD2688930@ziepe.ca/t/

3. mlx5
https://lore.kernel.org/all/1608067636-98073-5-git-send-email-jianxin.xiong@intel.com/

Because these 2 methods are both implemented on mlx5, have you compared
the test results with the 2 methods on mlx5?

The most important results should be latency and bandwidth. Please let
us know the test results.

Thanks a lot.
Zhu Yanjun

On 18/10/2024 10:26, Zhu Yanjun wrote:
> On 2024/10/16 17:16, Yonatan Maman wrote:
>>
>>
>> On 16/10/2024 7:23, Christoph Hellwig wrote:
>>> On Tue, Oct 15, 2024 at 06:23:44PM +0300, Yonatan Maman wrote:
>>>> From: Yonatan Maman <Ymaman@Nvidia.com>
>>>>
>>>> This patch series aims to enable Peer-to-Peer (P2P) DMA access in
>>>> GPU-centric applications that utilize RDMA and private device pages.
>>>> This enhancement is crucial for minimizing data transfer overhead by
>>>> allowing the GPU to directly expose device private page data to
>>>> devices such as NICs, eliminating the need to traverse system RAM,
>>>> which is the native method for exposing device private page data.
>>>
>>> Please tone down your marketing language and explain your factual
>>> changes. If you make performance claims back them by numbers.
>>>
>>
>> Got it, thanks! I'll fix that. Regarding performance, we’re achieving
>> over 10x higher bandwidth and 10x lower latency using perftest-rdma,
>> especially (with a high rate of GPU memory access).
>
> If I got this patch series correctly, this is based on ODP (On Demand
> Paging). And a way also exists which is based on non-ODP. From the
> following links, this way is implemented on efa, irdma and mlx5.
>
> 1. iRDMA
> https://lore.kernel.org/all/20230217011425.498847-1-yanjun.zhu@intel.com/
>
> 2. efa
> https://lore.kernel.org/lkml/20211007114018.GD2688930@ziepe.ca/t/
>
> 3. mlx5
> https://lore.kernel.org/all/1608067636-98073-5-git-send-email-jianxin.xiong@intel.com/
>
> Because these 2 methods are both implemented on mlx5, have you compared
> the test results with the 2 methods on mlx5?
>
> The most important results should be latency and bandwidth. Please let
> us know the test results.
>
> Thanks a lot.
> Zhu Yanjun
>

This patch-set aims to support GPU Direct RDMA for HMM ODP memory.
Compared to the dma-buf method, we achieve the same performance (BW and
latency) for GPU-intensive test cases (no CPU accesses during the test).

Yonatan Maman <ymaman@nvidia.com> writes:

> On 16/10/2024 7:23, Christoph Hellwig wrote:
>> On Tue, Oct 15, 2024 at 06:23:44PM +0300, Yonatan Maman wrote:
>>> From: Yonatan Maman <Ymaman@Nvidia.com>
>>>
>>> This patch series aims to enable Peer-to-Peer (P2P) DMA access in
>>> GPU-centric applications that utilize RDMA and private device pages. This
>>> enhancement is crucial for minimizing data transfer overhead by allowing
>>> the GPU to directly expose device private page data to devices such as
>>> NICs, eliminating the need to traverse system RAM, which is the native
>>> method for exposing device private page data.
>> Please tone down your marketing language and explain your factual
>> changes. If you make performance claims back them by numbers.
>>
>
> Got it, thanks! I'll fix that. Regarding performance, we’re achieving
> over 10x higher bandwidth and 10x lower latency using perftest-rdma,
> especially (with a high rate of GPU memory access).

The performance claims still sound a bit vague. Please make sure you
include actual perftest-rdma performance numbers from before and after
applying this series when you repost.