[PATCH v6 0/3] nvme/pci: PRP list DMA pool partitioning

Posted by Caleb Sander Mateos 7 months, 3 weeks ago
NVMe commands with over 8 KB of discontiguous data allocate PRP list
pages from the per-nvme_device dma_pool prp_page_pool or prp_small_pool.
Each call to dma_pool_alloc() and dma_pool_free() takes the per-dma_pool
spinlock. These device-global spinlocks are a significant source of
contention when many CPUs are submitting to the same NVMe devices. On a
workload issuing 32 KB reads from 16 CPUs (8 hypertwin pairs) across 2
NUMA nodes to 23 NVMe devices, we observed 2.4% of CPU time spent in
_raw_spin_lock_irqsave called from dma_pool_alloc and dma_pool_free.
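
For reference, a minimal sketch of the allocation path described above
(simplified, with a hypothetical nvme_alloc_prp_list() helper; see
drivers/nvme/host/pci.c for the real code):

#include <linux/dmapool.h>
#include <linux/gfp.h>

/*
 * Pre-series layout (simplified): both PRP list pools are per-nvme_dev,
 * so every CPU submitting to the device serializes on the same pool
 * spinlock inside dma_pool_alloc()/dma_pool_free().
 */
struct nvme_dev {
	struct dma_pool *prp_page_pool;   /* full PRP list pages */
	struct dma_pool *prp_small_pool;  /* small pages for short lists */
	/* ... */
};

static void *nvme_alloc_prp_list(struct nvme_dev *dev, bool small,
				 dma_addr_t *dma)
{
	struct dma_pool *pool = small ? dev->prp_small_pool
				      : dev->prp_page_pool;

	/* dma_pool_alloc() takes the pool's lock with irqsave */
	return dma_pool_alloc(pool, GFP_ATOMIC, dma);
}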

Ideally, the dma_pools would be per-hctx to minimize contention. But
that could impose considerable resource costs in a system with many NVMe
devices and CPUs.

As a compromise, allocate per-NUMA-node PRP list DMA pools. Map each
nvme_queue to the set of DMA pools corresponding to its device and its
hctx's NUMA node. This reduces the _raw_spin_lock_irqsave overhead by
about half, to 1.2%. Preventing the sharing of PRP list pages across
NUMA nodes also makes them cheaper to initialize.
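
A rough sketch of that mapping (struct, field, and helper names here
are hypothetical, not the exact patch):

#include <linux/blk-mq.h>
#include <linux/dmapool.h>

/*
 * Per-NUMA-node pool pairs, one array entry per possible node. Each
 * nvme_queue caches the pair for its hctx's NUMA node, so queues on
 * different nodes no longer contend on one device-global pool lock.
 */
struct nvme_prp_dma_pools {
	struct dma_pool *large;	/* full PRP list pages */
	struct dma_pool *small;	/* short PRP lists */
};

struct nvme_dev {
	struct nvme_prp_dma_pools *prp_pools;	/* array of nr_node_ids */
	/* ... */
};

struct nvme_queue {
	struct nvme_prp_dma_pools prp_pools;	/* this queue's node's pools */
	/* ... */
};

static void nvme_map_queue_pools(struct nvme_dev *dev,
				 struct nvme_queue *nvmeq,
				 struct blk_mq_hw_ctx *hctx)
{
	nvmeq->prp_pools = dev->prp_pools[hctx->numa_node];
}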

Allocating the dmapool structs on the desired NUMA node further reduces
the time spent in dma_pool_alloc from 0.87% to 0.50%.
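
The dmapool change this relies on is, roughly, a node-aware variant of
dma_pool_create(). The sketch below assumes a dma_pool_create_node()
taking the usual arguments plus a NUMA node; check the "dmapool: add
NUMA affinity support" patch for the exact signature:

/*
 * Assumed API: like dma_pool_create(), plus the node on which to
 * allocate the struct dma_pool itself. Existing callers would keep
 * their current behavior by passing NUMA_NO_NODE.
 */
struct dma_pool *pool = dma_pool_create_node("prp list page",
					     dev->dev /* struct device */,
					     NVME_CTRL_PAGE_SIZE /* size */,
					     NVME_CTRL_PAGE_SIZE /* align */,
					     0 /* boundary */,
					     numa_node);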

Caleb Sander Mateos (2):
  nvme/pci: factor out nvme_init_hctx() helper
  nvme/pci: make PRP list DMA pools per-NUMA-node

Keith Busch (1):
  dmapool: add NUMA affinity support

 drivers/nvme/host/pci.c | 171 +++++++++++++++++++++++-----------------
 include/linux/dmapool.h |  17 +++-
 mm/dmapool.c            |  16 ++--
 3 files changed, 121 insertions(+), 83 deletions(-)

v6:
- Clarify description of when PRP list pages are allocated (Christoph)
- Add Reviewed-by tags

v5:
- Allocate dmapool structs on desired NUMA node (Keith)
- Add Reviewed-by tags

v4:
- Drop the numa_node < nr_node_ids check (Kanchan)
- Add Reviewed-by tags

v3: simplify nvme_release_prp_pools() (Keith)

v2:
- Initialize admin nvme_queue's nvme_prp_dma_pools (Kanchan)
- Shrink nvme_dev's prp_pools array from MAX_NUMNODES to nr_node_ids (Kanchan)

-- 
2.45.2
Re: [PATCH v6 0/3] nvme/pci: PRP list DMA pool partitioning
Posted by Christoph Hellwig 7 months ago
It's been impossible to get any MM feedback even with multiple pings,
so I've tentatively queued it up in nvme-6.16.  If anyone in MM land
doesn't like this, please scream now and I'll drop it ASAP.
Re: [PATCH v6 0/3] nvme/pci: PRP list DMA pool partitioning
Posted by Caleb Sander Mateos 7 months, 2 weeks ago
Hi all,
It seems like there is consensus on this series and all patches have
multiple reviews. Would it be possible to queue it up for 6.16? The
NVMe tree seems like it would make sense, though maybe the dmapool
patch needs to go through the mm tree?

Thanks,
Caleb

On Fri, Apr 25, 2025 at 7:07 PM Caleb Sander Mateos
<csander@purestorage.com> wrote:
>
> [full cover letter quoted above; trimmed]
>
Re: [PATCH v6 0/3] nvme/pci: PRP list DMA pool partitioning
Posted by Keith Busch 7 months, 1 week ago
On Fri, May 02, 2025 at 09:48:17AM -0700, Caleb Sander Mateos wrote:
> It seems like there is consensus on this series and all patches have
> multiple reviews. Would it be possible to queue it up for 6.16? The
> NVMe tree seems like it would make sense, though maybe the dmapool
> patch needs to go through the mm tree?

The subsequent nvme patches depend on the dmapool patch, so they all
need to go through the same tree. It's okay with me if nvme picks these
up, but I think we need an Ack from one of the mm reviewers/maintainers.