[PATCH v2 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock

Posted by Breno Leitao 2 days, 1 hour ago

While profiling Meta's caching code[1], I found pipe->mutex contention
on the hot path. anon_pipe_write() currently calls alloc_page() once
per page while holding pipe->mutex. The allocation can sleep in
direct reclaim and also runs memcg charging, both of which lengthen
the critical section and stall any concurrent reader on the same mutex.

This series pre-allocates pages outside pipe->mutex in
anon_pipe_write(): for writes that span more than one full page, up
to PIPE_PREALLOC_MAX (8) pages are allocated via a per-page
alloc_page() loop before the mutex is taken. anon_pipe_get_page()
then drains the prealloc array first, falls back to the per-pipe
tmp_page[] cache, and only enters the allocator under the mutex for
the leftover pages (writes larger than PIPE_PREALLOC_MAX, single-page
writes that skip prealloc, or shortfalls when the prealloc loop
fails). Leftover prealloc pages are recycled into tmp_page[] before
unlock and any remainder is put_page()'d after unlock, keeping the
allocator out of the critical section on both sides.
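The ordering described above can be sketched in userspace. This is only an analogy under stated assumptions, not the code from the patch: struct toy_pipe, get_buf(), toy_write(), CACHE_SLOTS and BUF_SIZE are hypothetical names, malloc() stands in for alloc_page(), and a pthread mutex stands in for pipe->mutex. The `room` argument models a pipe that fills up early, which is what leaves prealloc buffers over.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

#define PREALLOC_MAX 8    /* mirrors PIPE_PREALLOC_MAX (8) */
#define CACHE_SLOTS  2    /* assumed size of the tmp_page[]-like cache */
#define BUF_SIZE     4096 /* stand-in for one page */

/* Hypothetical userspace stand-in for a pipe, not the kernel struct. */
struct toy_pipe {
	pthread_mutex_t lock;
	void *cache[CACHE_SLOTS];
	size_t under_lock_allocs; /* allocations made while holding the lock */
};

struct prealloc {
	void *bufs[PREALLOC_MAX];
	size_t count;
};

static void toy_pipe_init(struct toy_pipe *p)
{
	pthread_mutex_init(&p->lock, NULL);
	for (size_t i = 0; i < CACHE_SLOTS; i++)
		p->cache[i] = NULL;
	p->under_lock_allocs = 0;
}

/* Lock held: drain prealloc first, then the cache, then allocate. */
static void *get_buf(struct toy_pipe *p, struct prealloc *pre)
{
	assert(pre->count <= PREALLOC_MAX);
	if (pre->count)
		return pre->bufs[--pre->count];
	for (size_t i = 0; i < CACHE_SLOTS; i++) {
		if (p->cache[i]) {
			void *b = p->cache[i];

			p->cache[i] = NULL;
			return b;
		}
	}
	p->under_lock_allocs++;
	return malloc(BUF_SIZE);
}

/*
 * Try to write nbufs buffers into a pipe that has room for `room` of
 * them. Returns the number of buffers consumed.
 */
static size_t toy_write(struct toy_pipe *p, size_t nbufs, size_t room)
{
	struct prealloc pre = { .count = 0 };
	size_t done = 0;

	/* Allocate before taking the lock; single-buffer writes skip this. */
	if (nbufs > 1) {
		size_t want = nbufs < PREALLOC_MAX ? nbufs : PREALLOC_MAX;

		while (pre.count < want) {
			void *b = malloc(BUF_SIZE);

			if (!b)
				break; /* shortfall is covered under the lock */
			pre.bufs[pre.count++] = b;
		}
	}

	pthread_mutex_lock(&p->lock);
	while (done < nbufs && done < room) {
		void *b = get_buf(p, &pre);

		if (!b)
			break;
		free(b); /* a real pipe would queue the page for a reader */
		done++;
	}
	/* Recycle leftovers into free cache slots before unlocking. */
	for (size_t i = 0; i < CACHE_SLOTS && pre.count; i++)
		if (!p->cache[i])
			p->cache[i] = pre.bufs[--pre.count];
	pthread_mutex_unlock(&p->lock);

	/* Free any remainder only after the lock is dropped. */
	while (pre.count)
		free(pre.bufs[--pre.count]);
	return done;
}
```

In this toy, an 8-buffer write never allocates under the lock, a 12-buffer write allocates the last 4 under it, and a write cut short by `room` leaves buffers that refill the cache for a later single-buffer write.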

alloc_pages_bulk_mempolicy() looked tempting but the bulk allocator
refuses __GFP_ACCOUNT under memcg -- it returns at most one page
when memcg_kmem_online() && (gfp & __GFP_ACCOUNT), see commit
8dcb3060d81d ("memcg: page_alloc: skip bulk allocator for
__GFP_ACCOUNT"). A per-page loop keeps memcg accounting and the
task NUMA mempolicy honoured uniformly without open-coding the
charge.

I also vibe-coded a microbenchmark to validate the change. It sweeps
writers x readers over {1,2,5} x {1,5,10} with 64KB writes against a
1 MB pipe and prints throughput + latency percentiles per config.
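For readers who want to reproduce the pipe geometry by hand, the following is a minimal sketch of one 64KB write against a pipe resized towards 1 MB. It is not taken from pipe_bench.c; pipe_roundtrip() and the buffer names are hypothetical, the resize is treated as best effort, and the pipe is non-blocking so a single thread can both write and read without stalling.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define WRITE_SIZE (64 * 1024)   /* 64KB writes, as in the sweep */
#define PIPE_SIZE  (1024 * 1024) /* 1 MB pipe, as in the sweep */

static char out_buf[WRITE_SIZE], in_buf[WRITE_SIZE];

/*
 * Push one WRITE_SIZE chunk through a resized pipe from a single thread.
 * Stores the resulting pipe capacity in *capacity and returns the number
 * of bytes read back, or -1 on error.
 */
static long pipe_roundtrip(int *capacity)
{
	size_t sent = 0, received = 0;
	int fds[2];

	assert(capacity);
	if (pipe2(fds, O_NONBLOCK) < 0)
		return -1;

	/* Best effort: fails if /proc/sys/fs/pipe-max-size is lower. */
	(void)fcntl(fds[1], F_SETPIPE_SZ, PIPE_SIZE);
	*capacity = fcntl(fds[1], F_GETPIPE_SZ);

	memset(out_buf, 0xab, sizeof(out_buf));
	while (received < WRITE_SIZE) {
		ssize_t n;

		if (sent < WRITE_SIZE) {
			n = write(fds[1], out_buf + sent, WRITE_SIZE - sent);
			if (n > 0)
				sent += (size_t)n;
			else if (n < 0 && errno != EAGAIN)
				goto fail;
		}
		n = read(fds[0], in_buf + received, WRITE_SIZE - received);
		if (n > 0)
			received += (size_t)n;
		else if (n < 0 && errno != EAGAIN)
			goto fail;
	}
	close(fds[0]);
	close(fds[1]);
	return memcmp(in_buf, out_buf, WRITE_SIZE) == 0 ? (long)received : -1;
fail:
	close(fds[0]);
	close(fds[1]);
	return -1;
}
```

A real sweep would run the write and read sides in separate threads or processes and time each write(); this sketch only shows the pipe setup and the unit of I/O.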

Measured on arm64 and also on x86 using virtme-ng (16 vCPUs, 64KB
writes, 1 MB pipe). The numbers below were collected on v1
(alloc_pages_bulk()); v2's per-page loop preserves the dominant
"allocation outside the mutex" win and is expected to land in the same
range.

== No memory pressure (10s per config) ==

  Throughput in MB/s (baseline -> patched, delta):
    writers   readers=1              readers=5              readers=10
          1   1119 -> 1354  (+21%)   1132 -> 1195   (+6%)   1060 -> 1240  (+17%)
          2   1162 -> 1487  (+28%)   1034 -> 1285  (+24%)   1069 -> 1213  (+14%)
          5   1152 -> 1357  (+18%)   1021 -> 1164  (+14%)    997 -> 1239  (+24%)

  Avg write latency in ns (baseline -> patched, delta):
    writers   readers=1                readers=5                readers=10
          1    55786 ->  46103 (-17%)   55164 ->  52260  (-5%)   58906 ->  50370 (-14%)
          2   107546 ->  84011 (-22%)  120837 ->  97206 (-20%)  116860 -> 103036 (-12%)
          5   271293 -> 230170 (-15%)  306089 -> 268429 (-12%)  313300 -> 252232 (-19%)

Throughput improves by 6% to 28% and average write latency drops by
5% to 22% across every configuration.
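The delta columns can be sanity-checked from the printed values. A minimal sketch, assuming the delta is the relative change rounded to the nearest percent (pct_delta() is a hypothetical helper, not part of the series):

```c
#include <assert.h>

/* Relative change from baseline to patched, rounded to a whole percent. */
static int pct_delta(double baseline, double patched)
{
	double d;

	assert(baseline > 0.0);
	d = (patched - baseline) * 100.0 / baseline;
	/* Round half away from zero without pulling in libm. */
	return (int)(d >= 0.0 ? d + 0.5 : d - 0.5);
}
```

For example, pct_delta(1119, 1354) gives 21 and pct_delta(55786, 46103) gives -17, matching the first cells of the two tables. The reported deltas were presumably computed from unrounded measurements, so a cell can differ by a point when recomputed from the rounded figures: 1069 -> 1213 yields 13 here against the reported +14%.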

== Under memory pressure (--memory-pressure, 6s per config) ==

stress-ng --vm 2 --vm-bytes 50% --vm-keep is forked alongside the
sweep so the alloc_page() calls inside anon_pipe_write() routinely
hit direct reclaim -- exactly the regime the patch targets.

  Throughput in MB/s (baseline -> patched, delta):
    writers   readers=1              readers=5              readers=10
          1   1088 -> 1438  (+32%)    996 -> 1477  (+48%)    989 -> 1194  (+21%)
          2   1076 -> 1378  (+28%)   1007 -> 1269  (+26%)   1018 -> 1234  (+21%)
          5   1052 -> 1311  (+25%)    986 -> 1225  (+24%)    972 -> 1249  (+29%)

  Avg write latency in ns (baseline -> patched, delta):
    writers   readers=1                readers=5                readers=10
          1    57397 ->  43406 (-24%)   62690 ->  42272 (-33%)   63136 ->  52272 (-17%)
          2   116121 ->  90700 (-22%)  124098 ->  98481 (-21%)  122754 -> 101217 (-18%)
          5   297122 -> 238322 (-20%)  316836 -> 255095 (-19%)  321496 -> 250189 (-22%)

Throughput improves by 21% to 48% and average write latency drops by
17% to 33% -- a noticeably bigger win than the no-pressure run.

That tracks: when alloc_page() has to dip into reclaim, the cost
of holding pipe->mutex across it is highest, and pulling the
allocation out of the critical section pays the most.

Link: https://www.usenix.org/system/files/conference/atc13/atc13-bronson.pdf [1]

Signed-off-by: Breno Leitao <leitao@debian.org>
---
Changes in v2:
- Switch the prealloc path from alloc_pages_bulk_mempolicy() to a
  per-page alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT) loop.
- Split the prealloc work out of anon_pipe_write() into dedicated
  helpers (anon_pipe_get_page_prealloc / anon_pipe_prealloc_pop /
  anon_pipe_refill_tmp_pages / anon_pipe_free_pages) gathered in
  struct anon_pipe_prealloc, so the write path stays readable.
- Recycle leftover prealloc pages into pipe->tmp_page[] before
  unlocking.
- Link to v1: https://patch.msgid.link/20260515-fix_pipe-v1-0-b14c840c7555@debian.org

To: Alexander Viro <viro@zeniv.linux.org.uk>
To: Christian Brauner <brauner@kernel.org>
To: Jan Kara <jack@suse.cz>
To: Shuah Khan <shuah@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org

---
Breno Leitao (2):
      fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write
      selftests/pipe: add pipe_bench microbenchmark

 fs/pipe.c                                 | 105 ++++-
 tools/testing/selftests/Makefile          |   1 +
 tools/testing/selftests/pipe/.gitignore   |   1 +
 tools/testing/selftests/pipe/Makefile     |   9 +
 tools/testing/selftests/pipe/pipe_bench.c | 616 ++++++++++++++++++++++++++++++
 5 files changed, 729 insertions(+), 3 deletions(-)
---
base-commit: e98d21c170b01ddef366f023bbfcf6b31509fa83
change-id: 20260515-fix_pipe-c91677c187e7

Best regards,
--  
Breno Leitao <leitao@debian.org>

Re: [PATCH v2 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock

Posted by Jeff Layton 1 day, 22 hours ago

On Fri, 2026-05-22 at 09:44 -0700, Breno Leitao wrote:
> [...]

Pity this can't use the bulk page allocator, but looks good otherwise.

Reviewed-by: Jeff Layton <jlayton@kernel.org>