[v2] mm: improve write performance with RWF_DONTCACHE

[PATCH v2 0/3] mm: improve write performance with RWF_DONTCACHE

Posted by Jeff Layton 2 months, 1 week ago

This version adopts Christoph's suggestion to have generic_write_sync()
kick the flusher thread for the superblock instead of initiating
writeback directly. This seems to perform as well or better in most
cases than doing the writeback directly.

Here are results on XFS, both local and exported via knfsd:

    nfsd: https://markdownpastebin.com/?id=1884b9487c404ff4b7094ed41cc48f05
    xfs: https://markdownpastebin.com/?id=3c6b262182184b25b7d58fb211374475

Ritesh had also asked about getting perf lock traces to confirm the
source of the contention. I did that (and I can post them if you like),
but the results from the unpatched dontcache runs didn't point out any
specific lock contention. That leads me to believe that the bottlenecks
were from normal queueing work, and not contention for the xa_lock after
all.

Kicking the writeback thread seems to be a clear improvement over the
status quo in my testing, but I do wonder if having dontcache writes
spamming writeback for the whole bdi is the best idea.

I'm benchmarking a patch that has the flusher do a
writeback_single_inode() for the work. I don't expect it to perform
measurably better in this testing, but it would better isolate the
DONTCACHE writeback behavior to just those inodes touched by DONTCACHE
writes.

Assuming that looks OK, I'll probably send a v3. Original cover letter
from v1 follows:

-----------------------------------8<------------------------------------

Recently, we've added controls that allow nfsd to use different IO modes
for reads and writes. There are currently 3 different settings for each:

- buffered: traditional buffered reads and writes (this is the default)
- dontcache: set the RWF_DONTCACHE flag on the read or write
- direct: use direct I/O

One of my goals for this half of the year was to do some benchmarking of
these different modes with different workloads to see if we can come
up with some guidance about what should be used and when.

I had Claude cook up a set of benchmarks that used fio's libnfs backend
and started testing the different modes. The initial results weren't
terribly surprising, but one thing that really stood out was how badly
RWF_DONTCACHE performed with write-heavy workloads. This turned out to
be the case on a local xfs with io_uring as well as with nfsd.

The nice thing about these new debugfs controls for nfsd is that it
makes it easy to experiment with different IO modes for nfsd. After
testing several different approaches, I think this patchset represents a
fairly clear improvement. The first two patches alleviate the flush
contention when RWF_DONTCACHE is used with heavy write activity.

The last two patches add the performance benchmarking scripts. I don't
expect us to merge those, but I wanted to include them to make it clear
how this was tested.  The results of my testing with all 4 modes
(buffered, direct, patched and unpatched dontcache) along with Claude's
analysis are at the links below:

nfsd results: https://markdownpastebin.com/?id=0eaf694bd54046b584a8572895abcec2
xfs results: https://markdownpastebin.com/?id=96249deb897a401ba32acbce05312dcc

I can also send them inline if people don't want to chase links.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
Changes in v2:
- kick flusher thread instead of initiating writeback inline
- add mechanism to run 'perf lock' around the testcases
- Link to v1: https://lore.kernel.org/r/20260401-dontcache-v1-0-1f5746fab47a@kernel.org

---
Jeff Layton (3):
      mm: kick writeback flusher instead of inline flush for IOCB_DONTCACHE
      testing: add nfsd-io-bench NFS server benchmark suite
      testing: add dontcache-bench local filesystem benchmark suite

 fs/fs-writeback.c                                  |  14 +
 include/linux/backing-dev-defs.h                   |   1 +
 include/linux/fs.h                                 |   6 +-
 include/trace/events/writeback.h                   |   3 +-
 .../dontcache-bench/fio-jobs/lat-reader.fio        |  12 +
 .../dontcache-bench/fio-jobs/multi-write.fio       |   9 +
 .../dontcache-bench/fio-jobs/noisy-writer.fio      |  12 +
 .../testing/dontcache-bench/fio-jobs/rand-read.fio |  13 +
 .../dontcache-bench/fio-jobs/rand-write.fio        |  13 +
 .../testing/dontcache-bench/fio-jobs/seq-read.fio  |  13 +
 .../testing/dontcache-bench/fio-jobs/seq-write.fio |  13 +
 .../dontcache-bench/scripts/parse-results.sh       | 238 +++++++++
 .../dontcache-bench/scripts/run-benchmarks.sh      | 562 ++++++++++++++++++++
 .../testing/nfsd-io-bench/fio-jobs/lat-reader.fio  |  15 +
 .../testing/nfsd-io-bench/fio-jobs/multi-write.fio |  14 +
 .../nfsd-io-bench/fio-jobs/noisy-writer.fio        |  14 +
 tools/testing/nfsd-io-bench/fio-jobs/rand-read.fio |  15 +
 .../testing/nfsd-io-bench/fio-jobs/rand-write.fio  |  15 +
 tools/testing/nfsd-io-bench/fio-jobs/seq-read.fio  |  14 +
 tools/testing/nfsd-io-bench/fio-jobs/seq-write.fio |  14 +
 .../testing/nfsd-io-bench/scripts/parse-results.sh | 238 +++++++++
 .../nfsd-io-bench/scripts/run-benchmarks.sh        | 591 +++++++++++++++++++++
 .../testing/nfsd-io-bench/scripts/setup-server.sh  |  94 ++++
 23 files changed, 1928 insertions(+), 5 deletions(-)
---
base-commit: 9147566d801602c9e7fc7f85e989735735bf38ba
change-id: 20260401-dontcache-5811efd7eaf3

Best regards,
-- 
Jeff Layton <jlayton@kernel.org>

Re: [PATCH v2 0/3] mm: improve write performance with RWF_DONTCACHE

Posted by Christoph Hellwig 2 months ago

On Wed, Apr 08, 2026 at 10:25:20AM -0400, Jeff Layton wrote:
> This version adopts Christoph's suggestion to have generic_write_sync()
> kick the flusher thread for the superblock instead of initiating
> writeback directly. This seems to perform as well or better in most
> cases than doing the writeback directly.
> 
> Here are results on XFS, both local and exported via knfsd:
> 
>     nfsd: https://markdownpastebin.com/?id=1884b9487c404ff4b7094ed41cc48f05
>     xfs: https://markdownpastebin.com/?id=3c6b262182184b25b7d58fb211374475

Please add the results into the patch.  Besides having them in the patch
vs the cover letter, URLs to random hosting sites tend to get stale
very easily.

Comments on the XFS numbers / benchmark setup:

How does O_DIRECT manage to create almost the same peak dirty as
buffered?  Similarly, even this patched dontcache should be lower, which
feels odd.

(the editorializing in these results feels odd, not sure what tool
generates them, but maybe edit it for the commit log to look a bit more
serious).

Given that this patch does not change the read path, how do the pure read
numbers for patched vs unpatched dontcache manage to differ so much?

Comments on the NFSD numbers / benchmark setup:

I thought you were testing using dontcache on the server, but the
comments seem to imply it is done on the client.  Which is the case
here?

Again, the read changes between patched and unpatched look suspect.
Please drill down into why they happen.  The fact that patched is slower
here and faster in XFS makes me wonder if the results might just be very
volatile?

Re: [PATCH v2 0/3] mm: improve write performance with RWF_DONTCACHE

Posted by Jeff Layton 2 months, 1 week ago

On Wed, 2026-04-08 at 10:25 -0400, Jeff Layton wrote:
> This version adopts Christoph's suggestion to have generic_write_sync()
> kick the flusher thread for the superblock instead of initiating
> writeback directly. This seems to perform as well or better in most
> cases than doing the writeback directly.
> 
> Here are results on XFS, both local and exported via knfsd:
> 
>     nfsd: https://markdownpastebin.com/?id=1884b9487c404ff4b7094ed41cc48f05
>     xfs: https://markdownpastebin.com/?id=3c6b262182184b25b7d58fb211374475
> 
> Ritesh had also asked about getting perf lock traces to confirm the
> source of the contention. I did that (and I can post them if you like),
> but the results from the unpatched dontcache runs didn't point out any
> specific lock contention. That leads me to believe that the bottlenecks
> were from normal queueing work, and not contention for the xa_lock after
> all.
> 
> Kicking the writeback thread seems to be a clear improvement over the
> status quo in my testing, but I do wonder if having dontcache writes
> spamming writeback for the whole bdi is the best idea.
> 
> I'm benchmarking a patch that has the flusher do a
> writeback_single_inode() for the work. I don't expect it to perform
> measurably better in this testing, but it would better isolate the
> DONTCACHE writeback behavior to just those inodes touched by DONTCACHE
> writes.
> 
> Assuming that looks OK, I'll probably send a v3. Original cover letter
> from v1 follows:
> 

Actually, that version regressed performance in a couple of cases. I think v2 is probably the best approach, on balance. Maybe we can get this into -next so that it can make v7.2?

Here's the comparison between this version and a writeback_single_inode() flush version:

------------------8<-----------------------

Comparing dontcache numbers against the previous whole-BDI flusher kernel (from /tmp/dontcache-local-4way-flusher.md):

  Per-Inode vs Whole-BDI Flusher: DONTCACHE on Local XFS

  Single-Client Writes

  ┌──────────────────┬───────────┬───────────┬─────────────┐
  │    Benchmark     │ Whole-BDI │ Per-Inode │   Change    │
  ├──────────────────┼───────────┼───────────┼─────────────┤
  │ Seq write MB/s   │ 1450      │ 1438      │ -1% (noise) │
  ├──────────────────┼───────────┼───────────┼─────────────┤
  │ Seq write p99.9  │ 23.5 ms   │ 23.5 ms   │ identical   │
  ├──────────────────┼───────────┼───────────┼─────────────┤
  │ Rand write MB/s  │ 363       │ 286       │ -21%        │
  ├──────────────────┼───────────┼───────────┼─────────────┤
  │ Rand write p99.9 │ 1.8 ms    │ 16.7 ms   │ regression  │
  └──────────────────┴───────────┴───────────┴─────────────┘

  Seq write is identical. Rand write regressed — the whole-BDI flusher batched all dirty pages in one pass with writeback_sb_inodes() under a single blk_plug, while per-inode write_inode_now() loses that batching.

  Single-Client Reads

  ┌────────────────┬───────────┬───────────┬────────┐
  │   Benchmark    │ Whole-BDI │ Per-Inode │ Change │
  ├────────────────┼───────────┼───────────┼────────┤
  │ Seq read MB/s  │ 2950      │ 2350      │ -20%   │
  ├────────────────┼───────────┼───────────┼────────┤
  │ Rand read MB/s │ 651       │ 519       │ -20%   │
  └────────────────┴───────────┴───────────┴────────┘

  Reads shouldn't be affected by writeback path changes. Buffered reads also dropped (2888 → 2331), suggesting different system conditions between runs rather than a per-inode regression.

  Multi-Writer (Scenario A)

  ┌────────────────┬───────────┬───────────┬────────────┐
  │     Metric     │ Whole-BDI │ Per-Inode │   Change   │
  ├────────────────┼───────────┼───────────┼────────────┤
  │ Aggregate MB/s │ 1478      │ 999       │ -32%       │
  ├────────────────┼───────────┼───────────┼────────────┤
  │ p99.9          │ 46 ms     │ 77 ms     │ +67% worse │
  └────────────────┴───────────┴───────────┴────────────┘

  This is the biggest regression. With whole-BDI, the flusher did one batched pass through all dirty inodes via writeback_sb_inodes(). With per-inode, each of 4 writers queues a separate work item processed serially by write_inode_now() — losing the batch I/O merging benefit.

  Scenario C & D (Noisy Neighbor)

  ┌─────────────────────────┬───────────┬───────────┬─────────────┐
  │         Metric          │ Whole-BDI │ Per-Inode │   Change    │
  ├─────────────────────────┼───────────┼───────────┼─────────────┤
  │ Scenario C writer       │ 1468      │ 1386      │ -6%         │
  ├─────────────────────────┼───────────┼───────────┼─────────────┤
  │ Scenario C readers      │ 18.7 MB/s │ 18.7 MB/s │ identical   │
  ├─────────────────────────┼───────────┼───────────┼─────────────┤
  │ Scenario D writer       │ 1472      │ 1467      │ identical   │
  ├─────────────────────────┼───────────┼───────────┼─────────────┤
  │ Scenario D readers      │ 496 MB/s  │ 507 MB/s  │ +2%         │
  ├─────────────────────────┼───────────┼───────────┼─────────────┤
  │ Scenario D reader p99.9 │ 440 us    │ 358 us    │ +19% better │
  └─────────────────────────┴───────────┴───────────┴─────────────┘

  Mixed-mode (Scenario D) is the intended production case and it's essentially identical or slightly better — per-inode writeback creates less device contention for buffered readers.

  Summary

  The per-inode approach is neutral-to-slightly-better for the production scenario (Scenario D), but regresses on multi-writer and random write workloads. The core issue is loss of I/O batching: writeback_sb_inodes() processes all dirty inodes in one blk_plug'd pass, while per-inode write_inode_now() calls are processed one at a time. The read regressions likely reflect different system conditions since buffered/direct reads also dropped ~20%.
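
  The Change columns above are relative differences against the whole-BDI run. A small sketch for recomputing them follows; the numbers are copied from the tables above, and the helper name pct_change() is just for illustration. For latency rows a positive change means higher latency, which is worse.

```python
def pct_change(baseline, candidate):
    """Relative change of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# (whole-BDI, per-inode) pairs taken from the tables in this mail.
rows = {
    "Seq write MB/s": (1450, 1438),
    "Rand write MB/s": (363, 286),
    "Seq read MB/s": (2950, 2350),
    "Rand read MB/s": (651, 519),
    "Multi-writer aggregate MB/s": (1478, 999),
    "Multi-writer p99.9 ms": (46, 77),
    "Scenario D reader p99.9 us": (440, 358),
}

for name, (whole_bdi, per_inode) in rows.items():
    # Print a signed, rounded percentage for each row.
    print(f"{name}: {pct_change(whole_bdi, per_inode):+.0f}%")
```

  Keeping the sign convention explicit avoids ambiguity on the latency rows, where "lower" and "better" point the same way but a throughput-style reading of the sign would not.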

-- 
Jeff Layton <jlayton@kernel.org>

Re: [PATCH v2 0/3] mm: improve write performance with RWF_DONTCACHE

Posted by Christoph Hellwig 2 months ago

On Wed, Apr 08, 2026 at 02:45:32PM -0400, Jeff Layton wrote:
> > Assuming that looks OK, I'll probably send a v3. Original cover letter
> > from v1 follows:
> > 
> 
> Actually, that version regressed performance in a couple of cases.
> I think v2 is probably the best approach, on balance.

Limiting writeback to a specific inode has many downsides as I explained
before, so this is not surprising.

> Maybe we can get this into -next so that it can make v7.2?

If you want to target 7.2 we're not in any rush.  If you were trying to
say 7.1 we're way too close to the merge window.  We also haven't
heard from Jens at all, whom I'd really like to look over this.