Chuck and I were discussing RWF_DONTCACHE and he suggested that this
might be an alternate approach. My main gripe with DONTCACHE was that it
kicks off writeback after every WRITE operation. With NFS, we generally
get a COMMIT operation at some point. Allowing us to batch up writes
until that point has traditionally been considered better for
performance.
Instead of RWF_DONTCACHE, this patch has nfsd issue generic_fadvise(...,
POSIX_FADV_DONTNEED) on the appropriate range after any READ, stable
WRITE or COMMIT operation. This means that it doesn't change how and
when dirty data gets flushed to the disk, but still keeps resident
pagecache to a minimum.
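To make that concrete, here is a minimal sketch of the kind of call
involved (illustrative only; the helper name is made up, and the actual
patches below are what matter):

#include <linux/fs.h>
#include <linux/fadvise.h>

/*
 * Illustrative only: after a successful READ, stable WRITE or COMMIT,
 * advise the kernel that the pagecache for the affected range is no
 * longer needed.  vfs_fadvise() falls back to generic_fadvise() for
 * filesystems that don't implement ->fadvise.
 */
static void nfsd_dontneed_range(struct file *file, loff_t offset, loff_t count)
{
        vfs_fadvise(file, offset, count, POSIX_FADV_DONTNEED);
}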
For reference, here are some numbers from a fio run doing sequential
reads and writes, with the server in "normal" buffered I/O mode, with
Mike's RWF_DONTCACHE patch enabled, and with fadvise(...DONTNEED).
Jobfile:
[global]
name=fio-seq-RW
filename=fio-seq-RW
rw=rw
rwmixread=60
rwmixwrite=40
bs=1M
direct=0
numjobs=16
time_based
runtime=300
[file1]
size=100G
ioengine=io_uring
iodepth=16
::::::::::::::::::::::::::::::::::::
3 runs each.
Baseline (nothing enabled):
Run status group 0 (all jobs):
READ: bw=2999MiB/s (3144MB/s), 185MiB/s-189MiB/s (194MB/s-198MB/s), io=879GiB (944GB), run=300014-300087msec
WRITE: bw=1998MiB/s (2095MB/s), 124MiB/s-126MiB/s (130MB/s-132MB/s), io=585GiB (629GB), run=300014-300087msec
READ: bw=2866MiB/s (3005MB/s), 177MiB/s-181MiB/s (185MB/s-190MB/s), io=844GiB (906GB), run=301294-301463msec
WRITE: bw=1909MiB/s (2002MB/s), 117MiB/s-121MiB/s (123MB/s-127MB/s), io=562GiB (604GB), run=301294-301463msec
READ: bw=2885MiB/s (3026MB/s), 177MiB/s-183MiB/s (186MB/s-192MB/s), io=846GiB (908GB), run=300017-300117msec
WRITE: bw=1923MiB/s (2016MB/s), 118MiB/s-122MiB/s (124MB/s-128MB/s), io=563GiB (605GB), run=300017-300117msec
RWF_DONTCACHE:
Run status group 0 (all jobs):
READ: bw=3088MiB/s (3238MB/s), 189MiB/s-195MiB/s (198MB/s-205MB/s), io=906GiB (972GB), run=300015-300276msec
WRITE: bw=2058MiB/s (2158MB/s), 126MiB/s-129MiB/s (132MB/s-136MB/s), io=604GiB (648GB), run=300015-300276msec
READ: bw=3116MiB/s (3267MB/s), 191MiB/s-197MiB/s (201MB/s-206MB/s), io=913GiB (980GB), run=300022-300074msec
WRITE: bw=2077MiB/s (2178MB/s), 128MiB/s-131MiB/s (134MB/s-137MB/s), io=609GiB (654GB), run=300022-300074msec
READ: bw=3011MiB/s (3158MB/s), 185MiB/s-191MiB/s (194MB/s-200MB/s), io=886GiB (951GB), run=301049-301133msec
WRITE: bw=2007MiB/s (2104MB/s), 123MiB/s-127MiB/s (129MB/s-133MB/s), io=590GiB (634GB), run=301049-301133msec
fadvise(..., POSIX_FADV_DONTNEED):
READ: bw=2918MiB/s (3060MB/s), 180MiB/s-184MiB/s (188MB/s-193MB/s), io=855GiB (918GB), run=300014-300111msec
WRITE: bw=1944MiB/s (2038MB/s), 120MiB/s-123MiB/s (125MB/s-129MB/s), io=570GiB (612GB), run=300014-300111msec
READ: bw=2951MiB/s (3095MB/s), 182MiB/s-188MiB/s (191MB/s-197MB/s), io=867GiB (931GB), run=300529-300695msec
WRITE: bw=1966MiB/s (2061MB/s), 121MiB/s-124MiB/s (127MB/s-130MB/s), io=577GiB (620GB), run=300529-300695msec
READ: bw=2971MiB/s (3115MB/s), 181MiB/s-188MiB/s (190MB/s-197MB/s), io=871GiB (935GB), run=300015-300077msec
WRITE: bw=1979MiB/s (2076MB/s), 122MiB/s-125MiB/s (128MB/s-131MB/s), io=580GiB (623GB), run=300015-300077msec
::::::::::::::::::::::::::::::
The numbers are pretty close, but it looks like RWF_DONTCACHE edges out
the other modes. Also, with the RWF_DONTCACHE and fadvise() modes the
pagecache utilization stays very low on the server (which is, of course,
the point).
I think next I'll test a hybrid mode. Use RWF_DONTCACHE for READ and
stable WRITE operations, and do the fadvise() only after COMMITs.
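Roughly, that hybrid might look like the sketch below. This assumes the
RWF_DONTCACHE plumbing from Mike's series is available; the helper names
are invented and the real nfsd read/write/commit paths are shaped
differently:

#include <linux/fs.h>
#include <linux/fadvise.h>
#include <linux/uio.h>

/* Stable WRITE (and READ) side: pass RWF_DONTCACHE down to the VFS so
 * the folios touched are marked for dropbehind as they are written back. */
static ssize_t hybrid_stable_write(struct file *file, struct iov_iter *iter,
                                   loff_t *pos)
{
        return vfs_iter_write(file, iter, pos, RWF_DONTCACHE);
}

/* COMMIT side: sync the range, then drop the now-clean pagecache.
 * "end" is an inclusive byte offset, as for vfs_fsync_range(). */
static int hybrid_commit(struct file *file, loff_t start, loff_t end)
{
        int err = vfs_fsync_range(file, start, end, 0);

        if (!err)
                err = vfs_fadvise(file, start, end - start + 1,
                                  POSIX_FADV_DONTNEED);
        return err;
}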
Plumbing this in for v4 will be "interesting" if we decide this approach
is sound, but it shouldn't be too bad if we only do it after a COMMIT.
Thoughts?
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
Jeff Layton (2):
sunrpc: delay pc_release callback until after sending a reply
nfsd: call generic_fadvise after v3 READ, stable WRITE or COMMIT
fs/nfsd/debugfs.c | 2 ++
fs/nfsd/nfs3proc.c | 59 +++++++++++++++++++++++++++++++++++++++++++++---------
fs/nfsd/nfsd.h | 1 +
fs/nfsd/nfsproc.c | 4 ++--
fs/nfsd/vfs.c | 21 ++++++++++++++-----
fs/nfsd/vfs.h | 5 +++--
fs/nfsd/xdr3.h | 3 +++
net/sunrpc/svc.c | 19 ++++++++++++++----
8 files changed, 92 insertions(+), 22 deletions(-)
---
base-commit: 38ddcbef7f4e9c5aa075c8ccf9f6d5293e027951
change-id: 20250701-nfsd-testing-12e7c8da5f1c
Best regards,
--
Jeff Layton <jlayton@kernel.org>
On Fri, 04 Jul 2025, Jeff Layton wrote:
> Chuck and I were discussing RWF_DONTCACHE and he suggested that this
> might be an alternate approach. My main gripe with DONTCACHE was that it
> kicks off writeback after every WRITE operation. With NFS, we generally
> get a COMMIT operation at some point. Allowing us to batch up writes
> until that point has traditionally been considered better for
> performance.

I wonder if that traditional consideration is justified, given your
subsequent results. The addition of COMMIT in v3 allowed us to both:
 - delay kicking off writes
 - not wait for writes to complete

I think the second was always primary. Maybe we didn't consider the
value of the first enough.

Obviously the client caches writes and delays the start of writeback.
Adding another delay on the server side does not seem to have a clear
justification. Maybe we *should* kick off writeback immediately. There
would still be opportunity for subsequent WRITE requests to be merged
into the writeback queue.

Ideally DONTCACHE should only affect cache usage and the latency of
subsequent READs. It shouldn't affect WRITE behaviour.

Thanks,
NeilBrown
On Fri, 2025-07-04 at 09:16 +1000, NeilBrown wrote:
> On Fri, 04 Jul 2025, Jeff Layton wrote:
> > Chuck and I were discussing RWF_DONTCACHE and he suggested that this
> > might be an alternate approach. My main gripe with DONTCACHE was that it
> > kicks off writeback after every WRITE operation. With NFS, we generally
> > get a COMMIT operation at some point. Allowing us to batch up writes
> > until that point has traditionally been considered better for
> > performance.
>
> I wonder if that traditional consideration is justified, given your
> subsequent results. The addition of COMMIT in v3 allowed us to both:
>  - delay kicking off writes
>  - not wait for writes to complete
>
> I think the second was always primary. Maybe we didn't consider the
> value of the first enough.
> Obviously the client caches writes and delays the start of writeback.
> Adding another delay on the server side does not seem to have a clear
> justification. Maybe we *should* kick off writeback immediately. There
> would still be opportunity for subsequent WRITE requests to be merged
> into the writeback queue.
>

That is the fundamental question: should we delay writeback or not? It
seems like delaying it is probably best, even in the modern era with
SSDs, but we do need more numbers here (ideally across a range of
workloads).

> Ideally DONTCACHE should only affect cache usage and the latency of
> subsequent READs. It shouldn't affect WRITE behaviour.
>

It definitely does affect it today. The ideal thing IMO would be to
just add the dropbehind flag to the folios on writes but not call
filemap_fdatawrite_range_kick() on every write operation. After a
COMMIT the pages should be clean, and the vfs_fadvise call should just
drop them from the cache, so this approach shouldn't materially change
how writeback behaves.
--
Jeff Layton <jlayton@kernel.org>
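To make the "dropbehind but no kick" idea concrete, a minimal, untested
sketch follows. It assumes the PG_dropbehind folio flag and a
folio_set_dropbehind() accessor from the uncached buffered I/O work; the
exact names may differ from what is (or ends up) in the tree:

#include <linux/pagemap.h>

/*
 * Sketch only: tag a folio touched by an unstable WRITE so that
 * writeback completion / reclaim drops it, without also kicking
 * writeback the way the current RWF_DONTCACHE write path does.
 * folio_set_dropbehind() is assumed here, not verified.
 */
static void mark_write_folio_dropbehind(struct folio *folio)
{
        folio_set_dropbehind(folio);
}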
On Sat, Jul 05, 2025 at 07:32:58AM -0400, Jeff Layton wrote:
> That is the fundamental question: should we delay writeback or not? It
> seems like delaying it is probably best, even in the modern era with
> SSDs, but we do need more numbers here (ideally across a range of
> workloads).

If you have asynchronous writeback there's probably no good reason to
delay it per se. But it does make sense to wait long enough to build up
a large I/O size: with some form of parity RAID you'll want to fill up
the chunk, and storage devices themselves also perform much better with
larger sizes. E.g. for HDDs you'll want to write in 1MB batches, and
similar write sizes also help for SSDs. While write performance itself
might not be much worse with smaller I/O, especially for high-quality
devices, large I/O helps to reduce internal fragmentation, which in
turn reduces garbage collection overhead and increases device lifetime.

> > Ideally DONTCACHE should only affect cache usage and the latency of
> > subsequent READs. It shouldn't affect WRITE behaviour.
> >
>
> It definitely does affect it today. The ideal thing IMO would be to
> just add the dropbehind flag to the folios on writes but not call
> filemap_fdatawrite_range_kick() on every write operation.

Yes, a mode that sets dropbehind but leaves writeback to the writeback
threads could be interesting. Right now it will still be bottlenecked
by the single writeback thread, but work on this is underway.
On 7/3/25 7:16 PM, NeilBrown wrote:
> On Fri, 04 Jul 2025, Jeff Layton wrote:
>> Chuck and I were discussing RWF_DONTCACHE and he suggested that this
>> might be an alternate approach. My main gripe with DONTCACHE was that it
>> kicks off writeback after every WRITE operation. With NFS, we generally
>> get a COMMIT operation at some point. Allowing us to batch up writes
>> until that point has traditionally been considered better for
>> performance.
>
> I wonder if that traditional consideration is justified, given your
> subsequent results. The addition of COMMIT in v3 allowed us to both:
>  - delay kicking off writes
>  - not wait for writes to complete
>
> I think the second was always primary. Maybe we didn't consider the
> value of the first enough.
> Obviously the client caches writes and delays the start of writeback.
> Adding another delay on the server side does not seem to have a clear
> justification. Maybe we *should* kick off writeback immediately. There
> would still be opportunity for subsequent WRITE requests to be merged
> into the writeback queue.

Dave Chinner had the same thought a while back. So I've experimented
with starting writes as part of nfsd_write(). Kicking off writes, even
without waiting, is actually pretty costly, and it resulted in worse
performance.

Now that .pc_release is called /after/ the WRITE response has been
sent, though, that might be a place where kicking off writeback could
be done without charging that latency to the client.

--
Chuck Lever
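For instance, kicking writeback from the release path could look roughly
like the fragment below. This is purely illustrative: how the file and
the written range would actually be recovered from the rqstp inside a
real .pc_release callback is hand-waved here.

#include <linux/fs.h>

/*
 * Illustrative helper for a .pc_release-style hook: by the time this
 * runs the WRITE reply is already on the wire, so kicking async
 * writeback here does not add latency to the client-visible RPC.
 * "file", "offset" and "count" would have to be stashed somewhere
 * reachable from the svc_rqst (e.g. the response struct); that
 * plumbing is omitted.
 */
static void write_release_kick(struct file *file, loff_t offset,
                               unsigned long count)
{
        filemap_fdatawrite_range_kick(file->f_mapping, offset,
                                      offset + count - 1);
}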
On Fri, 04 Jul 2025, Chuck Lever wrote:
> On 7/3/25 7:16 PM, NeilBrown wrote:
> > On Fri, 04 Jul 2025, Jeff Layton wrote:
> >> Chuck and I were discussing RWF_DONTCACHE and he suggested that this
> >> might be an alternate approach. My main gripe with DONTCACHE was that it
> >> kicks off writeback after every WRITE operation. With NFS, we generally
> >> get a COMMIT operation at some point. Allowing us to batch up writes
> >> until that point has traditionally been considered better for
> >> performance.
> >
> > I wonder if that traditional consideration is justified, given your
> > subsequent results. The addition of COMMIT in v3 allowed us to both:
> >  - delay kicking off writes
> >  - not wait for writes to complete
> >
> > I think the second was always primary. Maybe we didn't consider the
> > value of the first enough.
> > Obviously the client caches writes and delays the start of writeback.
> > Adding another delay on the server side does not seem to have a clear
> > justification. Maybe we *should* kick off writeback immediately. There
> > would still be opportunity for subsequent WRITE requests to be merged
> > into the writeback queue.
>
> Dave Chinner had the same thought a while back. So I've experimented
> with starting writes as part of nfsd_write(). Kicking off writes,
> even without waiting, is actually pretty costly, and it resulted in
> worse performance.

Was this with filemap_fdatawrite_range_kick() or something else?

> Now that .pc_release is called /after/ the WRITE response has been
> sent, though, that might be a place where kicking off writeback could
> be done without charging that latency to the client.

Certainly after is better. I was imagining kicking the writeback
threads, but they seem to only do whole filesystems. Maybe that is OK?
I doubt we want more than one thread kicking off writes for any given
inode at a time.

Maybe it would be best to add a "kicking_write" flag in the filecache
and, if it is already set, just add the range to some set of pending
writes. Maybe. I guess I should explore how to test :-)

NeilBrown
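A design sketch of that idea, with the caveat that
NFSD_FILE_KICKING_WRITE and the nf_pending_* fields are hypothetical
additions to struct nfsd_file, not existing ones:

#include <linux/fs.h>
#include "filecache.h"  /* struct nfsd_file (nfsd in-tree header) */

/*
 * Sketch: allow only one thread at a time to kick writeback for a
 * given nfsd_file.  If another thread is already kicking, just widen a
 * pending range that the kicker (or a later COMMIT) will pick up.
 * The flag, lock and nf_pending_* fields below do not exist today.
 * start/end are inclusive byte offsets.
 */
static void kick_or_queue_writeback(struct nfsd_file *nf, loff_t start,
                                    loff_t end)
{
        if (test_and_set_bit(NFSD_FILE_KICKING_WRITE, &nf->nf_flags)) {
                spin_lock(&nf->nf_pending_lock);
                nf->nf_pending_start = min(nf->nf_pending_start, start);
                nf->nf_pending_end = max(nf->nf_pending_end, end);
                spin_unlock(&nf->nf_pending_lock);
                return;
        }

        filemap_fdatawrite_range_kick(nf->nf_file->f_mapping, start, end);
        clear_bit(NFSD_FILE_KICKING_WRITE, &nf->nf_flags);
        /* a real version would recheck for ranges queued while kicking */
}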