From: Keith Busch <kbusch@kernel.org>

Allocating and freeing blocks from the dmapool iterates a list of all
allocated pages. We can save time by replacing the per-alloc/free list
traversal with a constant-time lookup, so this series does that.

Compared to the current kernel, perf record from running io_uring
benchmarks on nvme reports the dma_pool_alloc() cost cut in half, from
0.81% to 0.41%.

Keith Busch (2):
  mm/dmapool: replace linked list with xarray
  mm/dmapool: link blocks across pages

 mm/dmapool.c | 107 +++++++++++++++++++++++++++------------------------
 1 file changed, 56 insertions(+), 51 deletions(-)

-- 
2.30.2
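To make the cost being removed concrete, here is a minimal userspace sketch of a per-free page lookup by list traversal. It is an illustration only, not the code in mm/dmapool.c: `struct toy_page` and `toy_find_page()` are hypothetical names, and a plain 64-bit integer stands in for a DMA address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the pool's per-page bookkeeping. */
struct toy_page {
	struct toy_page *next;	/* singly linked list of all pages in the pool */
	uint64_t dma;		/* address of the start of this page */
};

/*
 * Walk the page list until a page is found whose range
 * [dma, dma + page_size) contains addr. The cost grows linearly with
 * the number of pages in the pool, and it is paid on every lookup.
 */
static struct toy_page *toy_find_page(struct toy_page *head, size_t page_size,
				      uint64_t addr)
{
	assert(page_size > 0);

	for (struct toy_page *p = head; p; p = p->next) {
		if (addr < p->dma)
			continue;
		if (addr - p->dma < page_size)
			return p;
	}
	return NULL;	/* addr does not belong to this pool */
}
```

With only a handful of pages this walk is cheap, but a pool that has grown to many pages pays for the whole walk on each call.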
I posted a similar patch series back in 2018:
https://lore.kernel.org/linux-mm/73ec1f52-d758-05df-fb6a-41d269e910d0@cybernetics.com/
https://lore.kernel.org/linux-mm/15ff502d-d840-1003-6c45-bc17f0d81262@cybernetics.com/
https://lore.kernel.org/linux-mm/1288e597-a67a-25b3-b7c6-db883ca67a25@cybernetics.com/
I initially used a red-black tree keyed by the DMA address, but then for
v2 of the patchset I put the dma pool info directly into struct page and
used virt_to_page() to get at it. But it turned out that was a bad idea
because not all architectures have struct page backing
dma_alloc_coherent():
https://lore.kernel.org/linux-kernel/20181206013054.GI6707@atomide.com/
I intended to go back and resubmit the red-black tree version, but I was
too busy at the time and forgot about it. A few days ago I finally
decided to update the patches and submit them upstream. I found your
recent dmapool xarray patches by searching the mailing list archive to
see if anyone else was working on something similar.
Using the following as a benchmark:
modprobe mpt3sas
	drivers/scsi/mpt3sas/mpt3sas_base.c
	_base_allocate_chain_dma_pool
	loop dma_pool_alloc(ioc->chain_dma_pool)

rmmod mpt3sas
	drivers/scsi/mpt3sas/mpt3sas_base.c
	_base_release_memory_pools()
	loop dma_pool_free(ioc->chain_dma_pool)
Here are the benchmark results showing the speedup from the patchsets:
          modprobe  rmmod
orig          1x       1x
xarray      5.2x     186x
rbtree      9.3x     269x
It looks like my red-black tree version is faster than the v1 of the
xarray patch on this benchmark at least, although the mpt3sas usage of
dmapool is hardly typical. I will try to get some testing done on my
patchset and post it next week.
Tony Battersby
Cybernetics
On Fri, May 27, 2022 at 03:35:47PM -0400, Tony Battersby wrote:
> [...]
> Here are the benchmark results showing the speedup from the patchsets:
>
>           modprobe  rmmod
> orig          1x       1x
> xarray      5.2x     186x
> rbtree      9.3x     269x
>
> It looks like my red-black tree version is faster than the v1 of the
> xarray patch on this benchmark at least, although the mpt3sas usage of
> dmapool is hardly typical. I will try to get some testing done on my
> patchset and post it next week.

Thanks for the info.

Just comparing with xarray, I actually found that the list was still
faster until you get >100 pages in the pool, after which xarray becomes
the clear winner.

But it turns out I don't often see that many pages allocated for a lot
of real use cases, so I'm trying to take this in a different direction
by replacing the lookup structures with an intrusive stack. That is safe
to do since pages are never freed for the lifetime of the pool, and it's
by far faster than anything else. The downside is that I'd need to
increase the size of the smallest allowable pool block, but I think
that's okay.

Anyway I was planning to post this new idea soon. My reasons for wanting
a faster dma pool are still in the works, though, so I'm just trying to
sort out those patches before returning to this one.
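For readers unfamiliar with the term, the following is a rough userspace sketch of an intrusive stack of free blocks. It is an assumption-laden illustration, not the planned patch: `struct toy_pool`, `struct free_block` and the `toy_pool_*()` functions are hypothetical names, malloc() stands in for the coherent DMA allocation, and DMA addresses are ignored entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* The link lives inside the free block itself, so no side table is needed. */
struct free_block {
	struct free_block *next;
};

/* Hypothetical toy pool; not the kernel's struct dma_pool. */
struct toy_pool {
	size_t block_size;		/* must be able to hold a struct free_block */
	size_t blocks_per_page;
	struct free_block *top;		/* top of the stack of free blocks */
};

static void toy_pool_init(struct toy_pool *pool, size_t block_size,
			  size_t page_size)
{
	/* A free block has to be large enough, and aligned, to hold the link. */
	assert(block_size >= sizeof(struct free_block));
	assert(block_size % sizeof(void *) == 0);
	assert(page_size >= block_size);

	pool->block_size = block_size;
	pool->blocks_per_page = page_size / block_size;
	pool->top = NULL;
}

/* Carve a fresh page into blocks and push every block onto the stack. */
static int toy_pool_add_page(struct toy_pool *pool)
{
	char *page = malloc(pool->block_size * pool->blocks_per_page);

	if (!page)
		return -1;
	for (size_t i = 0; i < pool->blocks_per_page; i++) {
		struct free_block *b =
			(struct free_block *)(page + i * pool->block_size);

		b->next = pool->top;
		pool->top = b;
	}
	return 0;	/* the page is never freed in this toy */
}

/* Pop the top free block; grow the pool by one page if the stack is empty. */
static void *toy_pool_alloc(struct toy_pool *pool)
{
	struct free_block *b;

	if (!pool->top && toy_pool_add_page(pool) != 0)
		return NULL;
	b = pool->top;
	pool->top = b->next;
	return b;
}

/* Push the block back; no search for its owning page is required. */
static void toy_pool_free(struct toy_pool *pool, void *ptr)
{
	struct free_block *b = ptr;

	b->next = pool->top;
	pool->top = b;
}
```

Both alloc and free touch only the top of the stack, regardless of how many pages the pool holds. The first assert in toy_pool_init() shows where a minimum block size comes from in this sketch: a free block must be big enough to store the link.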