[PATCH 00/13] Enable compound page for p2pdma memory

Hou Tao posted 13 patches 1 month, 2 weeks ago
Posted by Hou Tao 1 month, 2 weeks ago
From: Hou Tao <houtao1@huawei.com>

Hi,

device-dax already supports compound pages. This not only reduces the
cost of struct page significantly, it also improves the performance of
get_user_pages when a 2MB or 1GB page size is used. We are experimenting
with using p2p dma to transfer the contents of an NVMe SSD directly into
an NPU. The size of the NPU HBM is 32GB or larger and there are at most 8
NPUs in the host. When using base pages, the memory overhead is about 4GB
for 128GB of HBM, and mapping 32GB of HBM into userspace takes about 0.8
seconds. Since the ZONE_DEVICE memory type already supports compound
pages, this series enables compound page support for p2pdma memory as
well. After applying the patch set, when using 1GB pages, the memory
overhead is about 2MB and the mmap takes about 0.04 ms.
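
As a rough cross-check of the numbers above, here is a minimal sketch of
the arithmetic. It is not taken from the patches. It assumes 4KB base
pages and a 64-byte struct page, which are common on 64-bit kernels but
depend on the configuration, and the helper names are made up for
illustration. Under these assumptions 128GB of HBM needs 2GB of struct
pages and 256GB (8 NPUs with 32GB each) needs 4GB, so the figure quoted
above may correspond to a different configuration than the one assumed
here.

```c
#include <stdint.h>

/* Assumptions (not from the posting): 4 KiB base pages, 64-byte struct page. */
#define BASE_PAGE_SIZE   4096ULL
#define STRUCT_PAGE_SIZE 64ULL
#define GIB              (1ULL << 30)

/* Bytes of struct page metadata needed to describe mem_bytes with base pages. */
static uint64_t base_page_overhead(uint64_t mem_bytes)
{
	return mem_bytes / BASE_PAGE_SIZE * STRUCT_PAGE_SIZE;
}

/* Number of page-table entries to populate when mapping mem_bytes with page_size pages. */
static uint64_t map_entries(uint64_t mem_bytes, uint64_t page_size)
{
	return mem_bytes / page_size;
}
```

With these helpers, mapping 32GB takes about 8.4 million entries with 4KB
pages but only 32 entries with 1GB pages, which is in line with the large
drop in mmap time reported above.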

The main difference between the compound page support of device-dax and
that of p2pdma is that p2pdma inserts the pages into the user vma during
mmap instead of on page fault. The main reason is simplicity. The patch
set is structured as follows:

Patch #1~#2: tiny bug fixes for p2pdma.
Patch #3~#5: add callback support in kernfs and sysfs, including
pagesize, may_split and get_unmapped_area. These callbacks are necessary
to support compound pages when mmapping a sysfs binary file.
Patch #6~#7: create compound pages for p2pdma memory in the kernel.
Patch #8~#10: support mapping compound pages into userspace.
Patch #11~#12: support compound pages for NVMe CMB.
Patch #13: enable compound page support for p2pdma memory.
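
To illustrate why an aligned virtual address matters for the userspace
mapping, here is a minimal sketch of the alignment checks involved. It
is not the code from the patches. align_up() and can_map_huge() are
hypothetical helpers, and the rule that the virtual address, the
physical address and the length must all be aligned to the huge page
size is the general requirement for a huge mapping, stated here as an
assumption about what the series has to enforce.

```c
#include <stdbool.h>
#include <stdint.h>

/* Round addr up to the next multiple of align (align must be a power of two). */
static uint64_t align_up(uint64_t addr, uint64_t align)
{
	return (addr + align - 1) & ~(align - 1);
}

/*
 * A huge mapping of size align is only possible when the virtual address,
 * the physical address and the length are all multiples of align.
 */
static bool can_map_huge(uint64_t vaddr, uint64_t paddr, uint64_t len,
			 uint64_t align)
{
	return ((vaddr | paddr | len) & (align - 1)) == 0;
}
```

An address returned by a plain mmap() search is usually only aligned to
the base page size, so a get_unmapped_area callback would have to round
the candidate address up, for example with align_up(), before a check
like can_map_huge() can succeed.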

Please see individual patches for more details. Comments and
suggestions are always welcome.

Hou Tao (13):
  PCI/P2PDMA: Release the per-cpu ref of pgmap when vm_insert_page()
    fails
  PCI/P2PDMA: Fix the warning condition in p2pmem_alloc_mmap()
  kernfs: add support for get_unmapped_area callback
  kernfs: add support for may_split and pagesize callbacks
  sysfs: support get_unmapped_area callback for binary file
  PCI/P2PDMA: add align parameter for pci_p2pdma_add_resource()
  PCI/P2PDMA: create compound page for aligned p2pdma memory
  mm/huge_memory: add helpers to insert huge page during mmap
  PCI/P2PDMA: support get_unmapped_area to return aligned vaddr
  PCI/P2PDMA: support compound page in p2pmem_alloc_mmap()
  PCI/P2PDMA: add helper pci_p2pdma_max_pagemap_align()
  nvme-pci: introduce cmb_devmap_align module parameter
  PCI/P2PDMA: enable compound page support for p2pdma memory

 drivers/accel/habanalabs/common/hldio.c |   3 +-
 drivers/nvme/host/pci.c                 |  10 +-
 drivers/pci/p2pdma.c                    | 140 ++++++++++++++++++++++--
 fs/kernfs/file.c                        |  79 +++++++++++++
 fs/sysfs/file.c                         |  15 +++
 include/linux/huge_mm.h                 |   4 +
 include/linux/kernfs.h                  |   3 +
 include/linux/pci-p2pdma.h              |  30 ++++-
 include/linux/sysfs.h                   |   4 +
 mm/huge_memory.c                        |  66 +++++++++++
 10 files changed, 339 insertions(+), 15 deletions(-)

-- 
2.29.2
Re: [PATCH 00/13] Enable compound page for p2pdma memory
Posted by Leon Romanovsky 1 month, 2 weeks ago
On Sat, Dec 20, 2025 at 12:04:33PM +0800, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
> 
> Hi,
> 
> device-dax already supports compound pages. This not only reduces the
> cost of struct page significantly, it also improves the performance of
> get_user_pages when a 2MB or 1GB page size is used. We are experimenting
> with using p2p dma to transfer the contents of an NVMe SSD directly into
> an NPU.

I'll admit my understanding here is limited, and lately everything tends
to look like a DMABUF problem to me. Could you explain why DMABUF support
is not being used for this use case?

Thanks