drivers/md/dm-crypt.c | 95 ++++++++++++++++++++++++++++++--------------------- include/linux/gfp.h | 21 +++++++++--- include/linux/mempool.h | 21 ++++++++++++ mm/mempolicy.c | 12 ++++--- mm/mempool.c | 248 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----------------------- mm/page_alloc.c | 21 ++++++++---- 6 files changed, 323 insertions(+), 95 deletions(-)
Changelog: RFC -> v2: * Added callback variant for page bulk allocator and mempool bulk allocator per Mel Gorman. * Used the callback version in dm-crypt driver. * Some code cleanup and refactor to reduce duplicate code. rfc: https://lore.kernel.org/linux-mm/20221005180341.1738796-1-shy828301@gmail.com/ We have full disk encryption enabled, profiling shows page allocations may incur a noticeable overhead when writing. The dm-crypt creates an "out" bio for writing. And fill the "out" bio with the same amount of pages as "in" bio. But the driver allocates one page at a time in a loop. For 1M bio it means the driver has to call page allocator 256 times. It seems not that efficient. Since v5.13 we have page bulk allocator supported, so dm-crypt could use it to do page allocations more efficiently. I could just call the page bulk allocator in dm-crypt driver before the mempool allocator, but it seems ad-hoc and the quick search shows some others do the similar thing, for example, f2fs compress, block bounce, g2fs, ufs, etc. So it seems more neat to implement a general bulk allocation API for mempool. Currently the bulk allocator just supported list and array to consume the pages. But neither is the best fit to dm-crypt ussecase. So introduce a new bulk allocator API, callback, per the suggestion from Mel Gorman. It consumes the pages by calling a callback with a parameter. So introduce the mempool page bulk allocator. The below APIs are introduced: - mempool_init_pages_bulk() - mempool_create_pages_bulk() They initialize the mempool for page bulk allocator. The pool is filled by alloc_page() in a loop. - mempool_alloc_pages_bulk_cb() - mempool_alloc_pages_bulk_array() They do bulk allocation from mempool. The list version is not implemented since there is no user for list version bulk allocator so far and it may be gong soon. They do the below conceptually: 1. Call bulk page allocator 2. If the allocation is fulfilled then return otherwise try to allocate the remaining pages from the mempool 3. If it is fulfilled then return otherwise retry from #1 with sleepable gfp 4. If it is still failed, sleep for a while to wait for the mempool is refilled, then retry from #1 The populated pages will stay on array until the callers consume them or free them, or will be consumed by the callback. Since mempool allocator is guaranteed to success in the sleepable context, so the two APIs return true for success or false for fail. It is the caller's responsibility to handle failure case (partial allocation), just like the page bulk allocator. The mempool typically is an object agnostic allocator, but bulk allocation is only supported by pages, so the mempool bulk allocator is for page allocation only as well. With the mempool bulk allocator the IOPS of dm-crypt with 1M I/O would get improved by approxiamately 6%. The test is done on a machine with 80 CPU and 128GB memory with an encrypted ram device (the impact from storage hardware could be minimized so that we could benchmark the dm-crypt layer more accurately). Before the patch: Jobs: 1 (f=1): [w(1)][100.0%][w=1301MiB/s][w=1301 IOPS][eta 00m:00s] crypt: (groupid=0, jobs=1): err= 0: pid=48512: Wed Feb 1 18:11:30 2023 write: IOPS=1300, BW=1301MiB/s (1364MB/s)(76.2GiB/60001msec); 0 zone resets slat (usec): min=724, max=867, avg=765.71, stdev=19.27 clat (usec): min=4, max=196297, avg=195688.86, stdev=6450.50 lat (usec): min=801, max=197064, avg=196454.90, stdev=6450.35 clat percentiles (msec): | 1.00th=[ 197], 5.00th=[ 197], 10.00th=[ 197], 20.00th=[ 197], | 30.00th=[ 197], 40.00th=[ 197], 50.00th=[ 197], 60.00th=[ 197], | 70.00th=[ 197], 80.00th=[ 197], 90.00th=[ 197], 95.00th=[ 197], | 99.00th=[ 197], 99.50th=[ 197], 99.90th=[ 197], 99.95th=[ 197], | 99.99th=[ 197] bw ( MiB/s): min= 800, max= 1308, per=99.69%, avg=1296.94, stdev=46.02, samples=119 iops : min= 800, max= 1308, avg=1296.94, stdev=46.02, samples=119 lat (usec) : 10=0.01%, 1000=0.01% lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.02%, 50=0.05% lat (msec) : 100=0.08%, 250=99.83% cpu : usr=3.88%, sys=96.02%, ctx=69, majf=1, minf=9 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1% issued rwts: total=0,78060,0,0 short=0,0,0,0 dropped=0,0,0,0 latency : target=0, window=0, percentile=100.00%, depth=256 Run status group 0 (all jobs): WRITE: bw=1301MiB/s (1364MB/s), 1301MiB/s-1301MiB/s (1364MB/s-1364MB/s), io=76.2GiB (81.9GB), run=60001-60001msec After the patch: Jobs: 1 (f=1): [w(1)][100.0%][w=1401MiB/s][w=1401 IOPS][eta 00m:00s] crypt: (groupid=0, jobs=1): err= 0: pid=2171: Wed Feb 1 21:08:16 2023 write: IOPS=1401, BW=1402MiB/s (1470MB/s)(82.1GiB/60001msec); 0 zone resets slat (usec): min=685, max=815, avg=710.77, stdev=13.24 clat (usec): min=4, max=182206, avg=181658.31, stdev=5810.58 lat (usec): min=709, max=182913, avg=182369.36, stdev=5810.67 clat percentiles (msec): | 1.00th=[ 182], 5.00th=[ 182], 10.00th=[ 182], 20.00th=[ 182], | 30.00th=[ 182], 40.00th=[ 182], 50.00th=[ 182], 60.00th=[ 182], | 70.00th=[ 182], 80.00th=[ 182], 90.00th=[ 182], 95.00th=[ 182], | 99.00th=[ 182], 99.50th=[ 182], 99.90th=[ 182], 99.95th=[ 182], | 99.99th=[ 182] bw ( MiB/s): min= 900, max= 1408, per=99.71%, avg=1397.60, stdev=46.04, samples=119 iops : min= 900, max= 1408, avg=1397.60, stdev=46.04, samples=119 lat (usec) : 10=0.01%, 750=0.01% lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.02%, 50=0.05% lat (msec) : 100=0.08%, 250=99.83% cpu : usr=3.66%, sys=96.23%, ctx=76, majf=1, minf=9 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1% issued rwts: total=0,84098,0,0 short=0,0,0,0 dropped=0,0,0,0 latency : target=0, window=0, percentile=100.00%, depth=256 Run status group 0 (all jobs): WRITE: bw=1402MiB/s (1470MB/s), 1402MiB/s-1402MiB/s (1470MB/s-1470MB/s), io=82.1GiB (88.2GB), run=60001-60001msec And the benchmark with 4K size I/O doesn't show measurable regression. Yang Shi (5): mm: page_alloc: add API for bulk allocator with callback mm: mempool: extract the common initialization and alloc code mm: mempool: introduce page bulk allocator md: dm-crypt: move crypt_free_buffer_pages ahead md: dm-crypt: use mempool page bulk allocator drivers/md/dm-crypt.c | 95 ++++++++++++++++++++++++++++++--------------------- include/linux/gfp.h | 21 +++++++++--- include/linux/mempool.h | 21 ++++++++++++ mm/mempolicy.c | 12 ++++--- mm/mempool.c | 248 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----------------------- mm/page_alloc.c | 21 ++++++++---- 6 files changed, 323 insertions(+), 95 deletions(-)
On Tue, 14 Feb 2023, Yang Shi wrote: > > Changelog: > RFC -> v2: > * Added callback variant for page bulk allocator and mempool bulk allocator > per Mel Gorman. > * Used the callback version in dm-crypt driver. > * Some code cleanup and refactor to reduce duplicate code. > > rfc: https://lore.kernel.org/linux-mm/20221005180341.1738796-1-shy828301@gmail.com/ Hi This seems like unneeded complication to me. We have alloc_pages(), it can allocate multiple pages efficiently, so why not use it? I suggest to modify crypt_alloc_buffer() to use alloc_pages() and if alloc_pages() fails (either because the system is low on memory or because memory is too fragmented), fall back to the existing code that does mempool_alloc(). Mikulas
On Wed, Feb 15, 2023 at 4:23 AM Mikulas Patocka <mpatocka@redhat.com> wrote: > > > > On Tue, 14 Feb 2023, Yang Shi wrote: > > > > > Changelog: > > RFC -> v2: > > * Added callback variant for page bulk allocator and mempool bulk allocator > > per Mel Gorman. > > * Used the callback version in dm-crypt driver. > > * Some code cleanup and refactor to reduce duplicate code. > > > > rfc: https://lore.kernel.org/linux-mm/20221005180341.1738796-1-shy828301@gmail.com/ > > Hi > > This seems like unneeded complication to me. We have alloc_pages(), it can > allocate multiple pages efficiently, so why not use it? The alloc_pages() allocates *contiguous* pages, but dm-crypt doesn't need contiguous pages at all. This may incur unnecessary compaction overhead to the dm-crypt layer when memory is fragmented. The bulk allocator is a good fit to this usecase, which allocates multiple order-0 pages. In addition, filesystem writeback doesn't guarantee power-of-2 pages every time IIUC. But alloc_pages() just can allocate power-of-2 pages. > > I suggest to modify crypt_alloc_buffer() to use alloc_pages() and if > alloc_pages() fails (either because the system is low on memory or because > memory is too fragmented), fall back to the existing code that does > mempool_alloc(). My PoC patches just did this way, but called bulk allocator. There may be other potential mepool users as I listed in this cover letter, which may get benefits from bulk allocator. So introducing a new bulk mempool API seems better for long run although we just have one user for now. And it makes other uses easier to gain the benefit by just calling the new API. > > Mikulas >
On Wed, 15 Feb 2023, Yang Shi wrote: > On Wed, Feb 15, 2023 at 4:23 AM Mikulas Patocka <mpatocka@redhat.com> wrote: > > > > > > > > On Tue, 14 Feb 2023, Yang Shi wrote: > > > > > > > > Changelog: > > > RFC -> v2: > > > * Added callback variant for page bulk allocator and mempool bulk allocator > > > per Mel Gorman. > > > * Used the callback version in dm-crypt driver. > > > * Some code cleanup and refactor to reduce duplicate code. > > > > > > rfc: https://lore.kernel.org/linux-mm/20221005180341.1738796-1-shy828301@gmail.com/ > > > > Hi > > > > This seems like unneeded complication to me. We have alloc_pages(), it can > > allocate multiple pages efficiently, so why not use it? > > The alloc_pages() allocates *contiguous* pages, but dm-crypt doesn't > need contiguous pages at all. This may incur unnecessary compaction It doesn't hurt that the pages are contiguous - and allocating and freeing a few compound pages is even faster than allocating and freeing many 0-order pages. > overhead to the dm-crypt layer when memory is fragmented. The compaction overhead may be suppressed by the GFP flags (i.e. don't use __GFP_DIRECT_RECLAIM). > The bulk allocator is a good fit to this usecase, which allocates > multiple order-0 pages. > > In addition, filesystem writeback doesn't guarantee power-of-2 pages > every time IIUC. But alloc_pages() just can allocate power-of-2 pages. So, we can allocate more compound pages for the non-power-of-2 case - see the next patch that I'm sending. > > > > I suggest to modify crypt_alloc_buffer() to use alloc_pages() and if > > alloc_pages() fails (either because the system is low on memory or because > > memory is too fragmented), fall back to the existing code that does > > mempool_alloc(). > > My PoC patches just did this way, but called bulk allocator. There may > be other potential mepool users as I listed in this cover letter, > which may get benefits from bulk allocator. So introducing a new bulk > mempool API seems better for long run although we just have one user > for now. And it makes other uses easier to gain the benefit by just > calling the new API. This mempool bulk refactoring just makes the code bigger. And it is not needed - dm-crypt can fall-back to non-bulk mempool allocations. In the next email, I'm sending a patch that is noticeably smaller and that uses alloc_pages()/__free_pages(). Mikulas
On Thu, Feb 16, 2023 at 9:45 AM Mikulas Patocka <mpatocka@redhat.com> wrote: > > > > On Wed, 15 Feb 2023, Yang Shi wrote: > > > On Wed, Feb 15, 2023 at 4:23 AM Mikulas Patocka <mpatocka@redhat.com> wrote: > > > > > > > > > > > > On Tue, 14 Feb 2023, Yang Shi wrote: > > > > > > > > > > > Changelog: > > > > RFC -> v2: > > > > * Added callback variant for page bulk allocator and mempool bulk allocator > > > > per Mel Gorman. > > > > * Used the callback version in dm-crypt driver. > > > > * Some code cleanup and refactor to reduce duplicate code. > > > > > > > > rfc: https://lore.kernel.org/linux-mm/20221005180341.1738796-1-shy828301@gmail.com/ > > > > > > Hi > > > > > > This seems like unneeded complication to me. We have alloc_pages(), it can > > > allocate multiple pages efficiently, so why not use it? > > > > The alloc_pages() allocates *contiguous* pages, but dm-crypt doesn't > > need contiguous pages at all. This may incur unnecessary compaction > > It doesn't hurt that the pages are contiguous - and allocating and freeing > a few compound pages is even faster than allocating and freeing many > 0-order pages. If "allocating many order-0 pages" means calling alloc_page() multiple times, just like what the dm-crypt code does before this patchset, yeah, allocating a compound page may be faster. But it may be not true with bulk allocator. And it also depends on how bad the fragmentation is and how contended the zone lock is. When allocating order-0 page, the bulk allocator just could take the pages from pcp list within one call. And the pcp list could hold a lot pages actually, on my test machine per pcp list could have more than 1000 pages. > > > overhead to the dm-crypt layer when memory is fragmented. > > The compaction overhead may be suppressed by the GFP flags (i.e. don't use > __GFP_DIRECT_RECLAIM). You could, but you may pressure the mempool quite more often when light memory pressure and fragmentation exist. The bulk allocator may just succeed with light reclamation without allocating from mempool. > > > The bulk allocator is a good fit to this usecase, which allocates > > multiple order-0 pages. > > > > In addition, filesystem writeback doesn't guarantee power-of-2 pages > > every time IIUC. But alloc_pages() just can allocate power-of-2 pages. > > So, we can allocate more compound pages for the non-power-of-2 case - see > the next patch that I'm sending. Thanks for the patch. If the callers are willing to handle the complexity (calculating the proper orders, dealing with the compound pages, etc), it is definitely an option for them. > > > > > > > I suggest to modify crypt_alloc_buffer() to use alloc_pages() and if > > > alloc_pages() fails (either because the system is low on memory or because > > > memory is too fragmented), fall back to the existing code that does > > > mempool_alloc(). > > > > My PoC patches just did this way, but called bulk allocator. There may > > be other potential mepool users as I listed in this cover letter, > > which may get benefits from bulk allocator. So introducing a new bulk > > mempool API seems better for long run although we just have one user > > for now. And it makes other uses easier to gain the benefit by just > > calling the new API. > > This mempool bulk refactoring just makes the code bigger. And it is not > needed - dm-crypt can fall-back to non-bulk mempool allocations. Do you mean the mempool code? It may be inevitable by adding a new API. But it is not significantly bigger. And the API hides all the details and complexity from the callers, as well as handle all the allocation corner cases in mm layer. It would make the users life much easier. Of course if the callers are happy to handle all the complexity by themselves, they don't have to call the API. > > In the next email, I'm sending a patch that is noticeably smaller and that > uses alloc_pages()/__free_pages(). Thanks for the patch. But if other potential users would like to do the same optimization, the code have to be duplicated everywhere. Maybe not every one is happy to handle this by themselves. > > Mikulas >
© 2016 - 2025 Red Hat, Inc.