While profiling Meta's caching code[1], I found pipe->mutex contention
on the hot path. anon_pipe_write() currently calls alloc_page() once
per page while holding pipe->mutex. The allocation can sleep in
direct reclaim and also performs memcg charging, both of which lengthen
the critical section and stall any concurrent reader waiting on the
same mutex.
This series pre-allocates pages outside pipe->mutex in
anon_pipe_write(): for writes that span more than one full page, up
to PIPE_PREALLOC_MAX (8) pages are allocated via a per-page
alloc_page() loop before the mutex is taken. anon_pipe_get_page()
then drains the prealloc array first, falls back to the per-pipe
tmp_page[] cache, and only enters the allocator under the mutex for
the leftover pages (writes larger than PIPE_PREALLOC_MAX, single-page
writes that skip prealloc, or shortfalls when the prealloc loop
fails). Leftover prealloc pages are recycled into tmp_page[] before
unlock and any remainder is put_page()'d after unlock, keeping the
allocator out of the critical section on both sides.
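The ordering described above can be illustrated with a small userspace analogue. This is only a sketch of the locking pattern, not the code in fs/pipe.c: malloc() stands in for alloc_page(), a pthread mutex for pipe->mutex, and every name below (struct demo_pipe, prealloc_fill(), get_buf(), refill_tmp(), prealloc_free(), demo_write(), TMP_CACHE) is hypothetical. Only the limit of 8 is taken from PIPE_PREALLOC_MAX.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

#define PREALLOC_MAX 8     /* mirrors PIPE_PREALLOC_MAX from the text */
#define TMP_CACHE    2     /* hypothetical size of the spare-buffer cache */
#define BUF_SIZE     4096

struct demo_pipe {             /* hypothetical stand-in for the pipe */
	pthread_mutex_t lock;
	void *tmp[TMP_CACHE];      /* spare buffers, guarded by lock */
	size_t locked_allocs;      /* allocations made while holding lock */
};

struct prealloc {
	void *buf[PREALLOC_MAX];
	size_t count;
};

/* Step 1 (no lock held): allocate up front; a shortfall is tolerated. */
static void prealloc_fill(struct prealloc *pa, size_t want)
{
	pa->count = 0;
	if (want > PREALLOC_MAX)
		want = PREALLOC_MAX;
	while (pa->count < want) {
		void *b = malloc(BUF_SIZE);
		if (!b)
			break;
		pa->buf[pa->count++] = b;
	}
}

/* Step 2 (lock held): prealloc first, then the cache, then the allocator. */
static void *get_buf(struct demo_pipe *p, struct prealloc *pa)
{
	if (pa->count)
		return pa->buf[--pa->count];
	for (size_t i = 0; i < TMP_CACHE; i++) {
		if (p->tmp[i]) {
			void *b = p->tmp[i];
			p->tmp[i] = NULL;
			return b;
		}
	}
	p->locked_allocs++;
	return malloc(BUF_SIZE);
}

/* Step 3 (lock held): park leftovers in empty cache slots. */
static void refill_tmp(struct demo_pipe *p, struct prealloc *pa)
{
	for (size_t i = 0; i < TMP_CACHE && pa->count; i++)
		if (!p->tmp[i])
			p->tmp[i] = pa->buf[--pa->count];
}

/* Step 4 (lock dropped): free whatever did not fit in the cache. */
static void prealloc_free(struct prealloc *pa)
{
	while (pa->count)
		free(pa->buf[--pa->count]);
}

/*
 * Preallocate `want` buffers, consume `used` of them under the lock, and
 * return how many allocations still had to happen with the lock held.
 */
static size_t demo_write(struct demo_pipe *p, size_t want, size_t used)
{
	struct prealloc pa;
	size_t under_lock;

	prealloc_fill(&pa, want);

	pthread_mutex_lock(&p->lock);
	under_lock = p->locked_allocs;
	for (size_t i = 0; i < used; i++)
		free(get_buf(p, &pa));  /* a real writer would queue the buffer */
	refill_tmp(p, &pa);
	under_lock = p->locked_allocs - under_lock;
	pthread_mutex_unlock(&p->lock);

	prealloc_free(&pa);
	assert(pa.count == 0);      /* nothing may leak past this point */
	return under_lock;
}
```

With a zero-initialised struct demo_pipe whose lock is PTHREAD_MUTEX_INITIALIZER, demo_write(&p, 16, 10) caps the prealloc at 8 and takes the remaining 2 buffers from the allocator under the lock, which corresponds to the "writes larger than PIPE_PREALLOC_MAX" leftover case.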
alloc_pages_bulk_mempolicy() looked tempting but the bulk allocator
refuses __GFP_ACCOUNT under memcg -- it returns at most one page
when memcg_kmem_online() && (gfp & __GFP_ACCOUNT), see commit
8dcb3060d81d ("memcg: page_alloc: skip bulk allocator for
__GFP_ACCOUNT"). A per-page loop keeps memcg accounting and the
task NUMA mempolicy honoured uniformly without open-coding the
charge.
I also vibe-coded a microbenchmark to validate the change. It sweeps
writers x readers over {1,2,5} x {1,5,10} with 64KB writes against a
1 MB pipe and prints throughput + latency percentiles per config.
Measured on arm64 and also on x86 using virtme-ng (16 vCPUs, 64KB
writes, 1 MB pipe). The numbers below were collected on v1
(alloc_pages_bulk()); v2's per-page loop preserves the dominant
"allocation outside the mutex" win and is expected to land in the same
range.
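For readers who want to reproduce the shape of one measurement without the selftest, the following is a minimal sketch of a single writer and a single reader on one pipe. It is not pipe_bench.c: the function name pipe_run() and its parameters are hypothetical, it does not sweep configurations or collect latency percentiles, and the only kernel interfaces assumed are pipe(), fork(), and fcntl(F_SETPIPE_SZ).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#ifndef F_SETPIPE_SZ
#define F_SETPIPE_SZ 1031  /* Linux value, in case _GNU_SOURCE came too late */
#endif

/*
 * One writer (child) issues nchunks writes of chunk bytes; one reader
 * (parent) drains them. Returns the bytes read, or -1 on setup failure,
 * and stores the elapsed wall time in *secs.
 */
static long long pipe_run(int pipe_sz, size_t chunk, int nchunks, double *secs)
{
	struct timespec t0, t1;
	long long total = 0;
	ssize_t n;
	int fd[2];
	pid_t pid;
	char *buf = malloc(chunk);

	if (!buf)
		return -1;
	if (pipe(fd) < 0) {
		free(buf);
		return -1;
	}
	/* Best effort: the kernel may refuse sizes above pipe-max-size. */
	(void)fcntl(fd[1], F_SETPIPE_SZ, pipe_sz);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	pid = fork();
	if (pid < 0) {
		close(fd[0]);
		close(fd[1]);
		free(buf);
		return -1;
	}
	if (pid == 0) {                     /* writer */
		close(fd[0]);
		memset(buf, 'x', chunk);
		for (int i = 0; i < nchunks; i++) {
			size_t off = 0;
			while (off < chunk) {       /* handle short writes */
				n = write(fd[1], buf + off, chunk - off);
				if (n <= 0)
					_exit(1);
				off += (size_t)n;
			}
		}
		_exit(0);
	}

	close(fd[1]);                       /* reader sees EOF when writer exits */
	while ((n = read(fd[0], buf, chunk)) > 0)
		total += n;
	clock_gettime(CLOCK_MONOTONIC, &t1);

	close(fd[0]);
	waitpid(pid, NULL, 0);
	free(buf);
	*secs = (double)(t1.tv_sec - t0.tv_sec) +
		(double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
	return total;
}
```

Calling pipe_run(1 << 20, 64 * 1024, nchunks, &secs) matches the 64KB write and 1 MB pipe sizes quoted above; dividing the returned byte count by secs gives a rough throughput figure, though a run this short is dominated by fork() overhead and should not be compared with the tables.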
== No memory pressure (10s per config) ==
Throughput in MB/s (baseline -> patched, delta):
writers  readers=1             readers=5             readers=10
1        1119 -> 1354 (+21%)   1132 -> 1195 (+6%)    1060 -> 1240 (+17%)
2        1162 -> 1487 (+28%)   1034 -> 1285 (+24%)   1069 -> 1213 (+14%)
5        1152 -> 1357 (+18%)   1021 -> 1164 (+14%)   997 -> 1239 (+24%)
Avg write latency in ns (baseline -> patched, delta):
writers  readers=1                 readers=5                 readers=10
1        55786 -> 46103 (-17%)     55164 -> 52260 (-5%)      58906 -> 50370 (-14%)
2        107546 -> 84011 (-22%)    120837 -> 97206 (-20%)    116860 -> 103036 (-12%)
5        271293 -> 230170 (-15%)   306089 -> 268429 (-12%)   313300 -> 252232 (-19%)
Throughput improves +6% to +28% and average write latency drops 5%
to 22% across every configuration.
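The delta columns are presumably the relative change against the baseline, rounded to a whole percent. A minimal sketch of that arithmetic follows; the helper name delta_pct() is hypothetical, and because it works from the rounded figures printed in the tables it may differ by one point from a delta computed on the raw measurements.

```c
#include <assert.h>

/* Relative change of patched vs. baseline, rounded to the nearest percent. */
static long delta_pct(double baseline, double patched)
{
	double pct;

	assert(baseline > 0.0);     /* a zero baseline has no relative change */
	pct = (patched - baseline) / baseline * 100.0;
	return (long)(pct >= 0.0 ? pct + 0.5 : pct - 0.5);
}
```

For the writers=1, readers=1 cell, delta_pct(1119, 1354) gives 21 for throughput and delta_pct(55786, 46103) gives -17 for latency, matching the table.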
== Under memory pressure (--memory-pressure, 6s per config) ==
stress-ng --vm 2 --vm-bytes 50% --vm-keep is forked alongside the
sweep so the alloc_page() calls inside anon_pipe_write() routinely
hit direct reclaim -- exactly the regime the patch targets.
Throughput in MB/s (baseline -> patched, delta):
writers  readers=1             readers=5             readers=10
1        1088 -> 1438 (+32%)   996 -> 1477 (+48%)    989 -> 1194 (+21%)
2        1076 -> 1378 (+28%)   1007 -> 1269 (+26%)   1018 -> 1234 (+21%)
5        1052 -> 1311 (+25%)   986 -> 1225 (+24%)    972 -> 1249 (+29%)
Avg write latency in ns (baseline -> patched, delta):
writers  readers=1                 readers=5                 readers=10
1        57397 -> 43406 (-24%)     62690 -> 42272 (-33%)     63136 -> 52272 (-17%)
2        116121 -> 90700 (-22%)    124098 -> 98481 (-21%)    122754 -> 101217 (-18%)
5        297122 -> 238322 (-20%)   316836 -> 255095 (-19%)   321496 -> 250189 (-22%)
Throughput improves +21% to +48% and average write latency drops
17% to 33% -- a noticeably bigger win than the no-pressure run.
That tracks: when alloc_page() has to dip into reclaim, the cost
of holding pipe->mutex across it is highest, and pulling the
allocation out of the critical section pays the most.
Link: https://www.usenix.org/system/files/conference/atc13/atc13-bronson.pdf [1]
Signed-off-by: Breno Leitao <leitao@debian.org>
---
Changes in v3:
- Drop the anon_pipe_free_pages() call from the wait branch so
  leftover prealloc pages survive the wait_event sleep. The loop
  resumes once pipe_writable() becomes true and immediately wants
  pages again; freeing them forced the next iteration back into
  alloc_page() under pipe->mutex, defeating the patch for any write
  large enough to block mid-syscall. The out: label still frees any
  remainder on syscall exit, so nothing leaks. (Suggested by Oleg
  Nesterov.)
- Drop the in-loop anon_pipe_refill_tmp_pages() call in the wait
  branch as well: its only purpose was to rescue pages from the
  free_pages() above. With prealloc surviving the sleep,
  anon_pipe_get_page() drains it directly on the next iteration, and
  the out: label still refills tmp_page[] at syscall exit.
- Link to v2: https://patch.msgid.link/20260522-fix_pipe-v2-0-a8b35a78244e@debian.org
Changes in v2:
- Switch the prealloc path from alloc_pages_bulk_mempolicy() to a
  per-page alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT) loop.
- Split the prealloc work out of anon_pipe_write() into dedicated
  helpers (anon_pipe_get_page_prealloc / anon_pipe_prealloc_pop /
  anon_pipe_refill_tmp_pages / anon_pipe_free_pages) gathered in
  struct anon_pipe_prealloc, so the write path stays readable.
- Recycle leftover prealloc pages into pipe->tmp_page[] before
  unlocking.
- Link to v1: https://patch.msgid.link/20260515-fix_pipe-v1-0-b14c840c7555@debian.org
To: Alexander Viro <viro@zeniv.linux.org.uk>
To: Christian Brauner <brauner@kernel.org>
To: Jan Kara <jack@suse.cz>
To: Shuah Khan <shuah@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org
---
Breno Leitao (2):
      fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write
      selftests/pipe: add pipe_bench microbenchmark

 fs/pipe.c                                 | 103 ++++-
 tools/testing/selftests/Makefile          |   1 +
 tools/testing/selftests/pipe/.gitignore   |   1 +
 tools/testing/selftests/pipe/Makefile     |   9 +
 tools/testing/selftests/pipe/pipe_bench.c | 616 ++++++++++++++++++++++++++++++
 5 files changed, 727 insertions(+), 3 deletions(-)
---
base-commit: e98d21c170b01ddef366f023bbfcf6b31509fa83
change-id: 20260515-fix_pipe-c91677c187e7
Best regards,
--
Breno Leitao <leitao@debian.org>