Chuck and I were discussing RWF_DONTCACHE and he suggested that this
might be an alternate approach. My main gripe with DONTCACHE was that it
kicks off writeback after every WRITE operation. With NFS, we generally
get a COMMIT operation at some point. Allowing us to batch up writes
until that point has traditionally been considered better for
performance.

Instead of RWF_DONTCACHE, this patch has nfsd issue
generic_fadvise(..., POSIX_FADV_DONTNEED) on the appropriate range
after any READ, stable WRITE or COMMIT operation. This means that it
doesn't change how and when dirty data gets flushed to the disk, but
still keeps resident pagecache to a minimum.

For reference, here are some numbers from a fio run doing sequential
reads and writes, with the server in "normal" buffered I/O mode, with
Mike's RWF_DONTCACHE patch enabled, and with fadvise(...DONTNEED).

Jobfile:

[global]
name=fio-seq-RW
filename=fio-seq-RW
rw=rw
rwmixread=60
rwmixwrite=40
bs=1M
direct=0
numjobs=16
time_based
runtime=300

[file1]
size=100G
ioengine=io_uring
iodepth=16

::::::::::::::::::::::::::::::::::::

3 runs each.

Baseline (nothing enabled):

Run status group 0 (all jobs):
   READ: bw=2999MiB/s (3144MB/s), 185MiB/s-189MiB/s (194MB/s-198MB/s), io=879GiB (944GB), run=300014-300087msec
  WRITE: bw=1998MiB/s (2095MB/s), 124MiB/s-126MiB/s (130MB/s-132MB/s), io=585GiB (629GB), run=300014-300087msec

   READ: bw=2866MiB/s (3005MB/s), 177MiB/s-181MiB/s (185MB/s-190MB/s), io=844GiB (906GB), run=301294-301463msec
  WRITE: bw=1909MiB/s (2002MB/s), 117MiB/s-121MiB/s (123MB/s-127MB/s), io=562GiB (604GB), run=301294-301463msec

   READ: bw=2885MiB/s (3026MB/s), 177MiB/s-183MiB/s (186MB/s-192MB/s), io=846GiB (908GB), run=300017-300117msec
  WRITE: bw=1923MiB/s (2016MB/s), 118MiB/s-122MiB/s (124MB/s-128MB/s), io=563GiB (605GB), run=300017-300117msec

RWF_DONTCACHE:

Run status group 0 (all jobs):
   READ: bw=3088MiB/s (3238MB/s), 189MiB/s-195MiB/s (198MB/s-205MB/s), io=906GiB (972GB), run=300015-300276msec
  WRITE: bw=2058MiB/s (2158MB/s), 126MiB/s-129MiB/s (132MB/s-136MB/s), io=604GiB (648GB), run=300015-300276msec

   READ: bw=3116MiB/s (3267MB/s), 191MiB/s-197MiB/s (201MB/s-206MB/s), io=913GiB (980GB), run=300022-300074msec
  WRITE: bw=2077MiB/s (2178MB/s), 128MiB/s-131MiB/s (134MB/s-137MB/s), io=609GiB (654GB), run=300022-300074msec

   READ: bw=3011MiB/s (3158MB/s), 185MiB/s-191MiB/s (194MB/s-200MB/s), io=886GiB (951GB), run=301049-301133msec
  WRITE: bw=2007MiB/s (2104MB/s), 123MiB/s-127MiB/s (129MB/s-133MB/s), io=590GiB (634GB), run=301049-301133msec

fadvise(..., POSIX_FADV_DONTNEED):

   READ: bw=2918MiB/s (3060MB/s), 180MiB/s-184MiB/s (188MB/s-193MB/s), io=855GiB (918GB), run=300014-300111msec
  WRITE: bw=1944MiB/s (2038MB/s), 120MiB/s-123MiB/s (125MB/s-129MB/s), io=570GiB (612GB), run=300014-300111msec

   READ: bw=2951MiB/s (3095MB/s), 182MiB/s-188MiB/s (191MB/s-197MB/s), io=867GiB (931GB), run=300529-300695msec
  WRITE: bw=1966MiB/s (2061MB/s), 121MiB/s-124MiB/s (127MB/s-130MB/s), io=577GiB (620GB), run=300529-300695msec

   READ: bw=2971MiB/s (3115MB/s), 181MiB/s-188MiB/s (190MB/s-197MB/s), io=871GiB (935GB), run=300015-300077msec
  WRITE: bw=1979MiB/s (2076MB/s), 122MiB/s-125MiB/s (128MB/s-131MB/s), io=580GiB (623GB), run=300015-300077msec

::::::::::::::::::::::::::::::

The numbers are pretty close, but it looks like RWF_DONTCACHE edges out
the other modes. Also, with the RWF_DONTCACHE and fadvise() modes the
pagecache utilization stays very low on the server (which is, of
course, the point).

I think next I'll test a hybrid mode.
Use RWF_DONTCACHE for READ and stable WRITE operations, and do the
fadvise() only after COMMITs.

Plumbing this in for v4 will be "interesting" if we decide this
approach is sound, but it shouldn't be too bad if we only do it after a
COMMIT.

Thoughts?

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
Jeff Layton (2):
      sunrpc: delay pc_release callback until after sending a reply
      nfsd: call generic_fadvise after v3 READ, stable WRITE or COMMIT

 fs/nfsd/debugfs.c  |  2 ++
 fs/nfsd/nfs3proc.c | 59 +++++++++++++++++++++++++++++++++++++++++++++---------
 fs/nfsd/nfsd.h     |  1 +
 fs/nfsd/nfsproc.c  |  4 ++--
 fs/nfsd/vfs.c      | 21 ++++++++++++++-----
 fs/nfsd/vfs.h      |  5 +++--
 fs/nfsd/xdr3.h     |  3 +++
 net/sunrpc/svc.c   | 19 ++++++++++++++----
 8 files changed, 92 insertions(+), 22 deletions(-)
---
base-commit: 38ddcbef7f4e9c5aa075c8ccf9f6d5293e027951
change-id: 20250701-nfsd-testing-12e7c8da5f1c

Best regards,
--
Jeff Layton <jlayton@kernel.org>
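[For anyone who wants to poke at the semantics the series leans on,
here is a small userspace sketch (not part of the patches; filenames
and sizes are arbitrary and error handling is omitted). It shows why
the fadvise() approach shouldn't perturb writeback: POSIX_FADV_DONTNEED
frees clean pagecache pages but won't free dirty ones (at most it may
start asynchronous writeback on them), so issuing it once the data is
stable only trims the cache. Exact residency numbers will vary with how
quickly background writeback runs.]

/* dontneed-demo.c: illustrative only, not part of this series.
 * Build with: cc -O2 -o dontneed-demo dontneed-demo.c
 */
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define FILE_SIZE	(64UL * 1024 * 1024)

/* Count how many of the file's pages are resident in the pagecache. */
static size_t resident_pages(int fd, size_t len)
{
	size_t pages = (len + 4095) / 4096, i, count = 0;
	unsigned char *vec = malloc(pages);
	void *map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);

	mincore(map, len, vec);
	for (i = 0; i < pages; i++)
		count += vec[i] & 1;
	munmap(map, len);
	free(vec);
	return count;
}

int main(void)
{
	char *buf = malloc(FILE_SIZE);
	int fd = open("dontneed-demo.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);

	memset(buf, 'x', FILE_SIZE);
	write(fd, buf, FILE_SIZE);	/* like an unstable WRITE: pages dirty */

	/* DONTNEED won't free dirty pages; most stay resident. */
	posix_fadvise(fd, 0, FILE_SIZE, POSIX_FADV_DONTNEED);
	printf("dirty + DONTNEED: %zu pages resident\n",
	       resident_pages(fd, FILE_SIZE));

	fsync(fd);			/* like a COMMIT: pages now clean */

	/* Now the same call drops the (clean) pages from the cache. */
	posix_fadvise(fd, 0, FILE_SIZE, POSIX_FADV_DONTNEED);
	printf("fsync + DONTNEED: %zu pages resident\n",
	       resident_pages(fd, FILE_SIZE));

	free(buf);
	close(fd);
	unlink("dontneed-demo.dat");
	return 0;
}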
On Fri, 04 Jul 2025, Jeff Layton wrote:
> Chuck and I were discussing RWF_DONTCACHE and he suggested that this
> might be an alternate approach. My main gripe with DONTCACHE was that it
> kicks off writeback after every WRITE operation. With NFS, we generally
> get a COMMIT operation at some point. Allowing us to batch up writes
> until that point has traditionally been considered better for
> performance.

I wonder if that traditional consideration is justified, given your
subsequent results. The addition of COMMIT in v3 allowed us to both:
 - delay kicking off writes
 - not wait for writes to complete

I think the second was always primary. Maybe we didn't consider the
value of the first enough.

Obviously the client caches writes and delays the start of writeback.
Adding another delay on the server side does not seem to have a clear
justification. Maybe we *should* kick off writeback immediately. There
would still be opportunity for subsequent WRITE requests to be merged
into the writeback queue.

Ideally DONTCACHE should only affect cache usage and the latency of
subsequent READs. It shouldn't affect WRITE behaviour.

Thanks,
NeilBrown
On Fri, 2025-07-04 at 09:16 +1000, NeilBrown wrote:
> On Fri, 04 Jul 2025, Jeff Layton wrote:
> > Chuck and I were discussing RWF_DONTCACHE and he suggested that this
> > might be an alternate approach. My main gripe with DONTCACHE was that it
> > kicks off writeback after every WRITE operation. With NFS, we generally
> > get a COMMIT operation at some point. Allowing us to batch up writes
> > until that point has traditionally been considered better for
> > performance.
>
> I wonder if that traditional consideration is justified, given your
> subsequent results. The addition of COMMIT in v3 allowed us to both:
>  - delay kicking off writes
>  - not wait for writes to complete
>
> I think the second was always primary. Maybe we didn't consider the
> value of the first enough.
> Obviously the client caches writes and delays the start of writeback.
> Adding another delay on the server side does not seem to have a clear
> justification. Maybe we *should* kick off writeback immediately. There
> would still be opportunity for subsequent WRITE requests to be merged
> into the writeback queue.
>

That is the fundamental question: should we delay writeback or not? It
seems like delaying it is probably best, even in the modern era with
SSDs, but we do need more numbers here (ideally across a range of
workloads).

> Ideally DONTCACHE should only affect cache usage and the latency of
> subsequent READs. It shouldn't affect WRITE behaviour.
>

It definitely does affect it today. The ideal thing IMO would be to
just add the dropbehind flag to the folios on writes but not call
filemap_fdatawrite_range_kick() on every write operation.

After a COMMIT the pages should be clean and the vfs_fadvise call
should just drop them from the cache, so this approach shouldn't
materially change how writeback behaves.
--
Jeff Layton <jlayton@kernel.org>
On Sat, Jul 05, 2025 at 07:32:58AM -0400, Jeff Layton wrote:
> That is the fundamental question: should we delay writeback or not? It
> seems like delaying it is probably best, even in the modern era with
> SSDs, but we do need more numbers here (ideally across a range of
> workloads).

If you have asynchronous writeback there's probably no good reason to
delay it per se. But it does make sense to wait long enough to have a
large I/O size: especially with some form of parity RAID you'll want
to fill up the chunk, but storage devices themselves also perform much
better with a larger size. E.g. for HDDs you'll want to write 1MB
batches, and similar write sizes also help for SSDs. While the write
performance itself might not be much worse with smaller I/O, especially
for high-quality devices, large I/O helps to reduce internal
fragmentation, which in turn reduces later garbage collection overhead
and increases device lifetime.

> > Ideally DONTCACHE should only affect cache usage and the latency of
> > subsequent READs. It shouldn't affect WRITE behaviour.
> >
>
> It definitely does affect it today. The ideal thing IMO would be to
> just add the dropbehind flag to the folios on writes but not call
> filemap_fdatawrite_range_kick() on every write operation.

Yes, a mode that sets dropbehind but leaves writeback to the writeback
threads can be interesting. Right now it will still be bottlenecked by
the single writeback thread, but work on this is underway.
On 7/3/25 7:16 PM, NeilBrown wrote:
> On Fri, 04 Jul 2025, Jeff Layton wrote:
>> Chuck and I were discussing RWF_DONTCACHE and he suggested that this
>> might be an alternate approach. My main gripe with DONTCACHE was that it
>> kicks off writeback after every WRITE operation. With NFS, we generally
>> get a COMMIT operation at some point. Allowing us to batch up writes
>> until that point has traditionally been considered better for
>> performance.
>
> I wonder if that traditional consideration is justified, given your
> subsequent results. The addition of COMMIT in v3 allowed us to both:
>  - delay kicking off writes
>  - not wait for writes to complete
>
> I think the second was always primary. Maybe we didn't consider the
> value of the first enough.
> Obviously the client caches writes and delays the start of writeback.
> Adding another delay on the server side does not seem to have a clear
> justification. Maybe we *should* kick off writeback immediately. There
> would still be opportunity for subsequent WRITE requests to be merged
> into the writeback queue.

Dave Chinner had the same thought a while back. So I've experimented
with starting writes as part of nfsd_write(). Kicking off writes, even
without waiting, is actually pretty costly, and it resulted in worse
performance.

Now that .pc_release is called /after/ the WRITE response has been
sent, though, that might be a place where kicking off writeback could
be done without charging that latency to the client.

--
Chuck Lever
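[As a rough sketch only, and not the actual nfsd code: the shape of
post-reply work hung off a v3 .pc_release callback might look like the
following. The fields used to recover the open file and range from the
WRITE response (nf, offset, count-as-length) are invented here for
illustration.]

/*
 * Hypothetical sketch: run post-reply work from a v3 .pc_release
 * callback, which now executes after the response has been sent, so
 * none of this latency is charged to the client's WRITE reply.
 */
static void nfsd3_write_release(struct svc_rqst *rqstp)
{
	struct nfsd3_writeres *resp = rqstp->rq_resp;

	/*
	 * Only for stable writes: the data is already durable, so
	 * dropping the cached pages cannot change writeback behaviour.
	 */
	if (resp->committed && resp->nf) {	/* hypothetical fields */
		vfs_fadvise(resp->nf->nf_file, resp->offset, resp->count,
			    POSIX_FADV_DONTNEED);
		nfsd_file_put(resp->nf);
	}
}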
On Fri, 04 Jul 2025, Chuck Lever wrote:
> On 7/3/25 7:16 PM, NeilBrown wrote:
> > On Fri, 04 Jul 2025, Jeff Layton wrote:
> >> Chuck and I were discussing RWF_DONTCACHE and he suggested that this
> >> might be an alternate approach. My main gripe with DONTCACHE was that it
> >> kicks off writeback after every WRITE operation. With NFS, we generally
> >> get a COMMIT operation at some point. Allowing us to batch up writes
> >> until that point has traditionally been considered better for
> >> performance.
> >
> > I wonder if that traditional consideration is justified, given your
> > subsequent results. The addition of COMMIT in v3 allowed us to both:
> >  - delay kicking off writes
> >  - not wait for writes to complete
> >
> > I think the second was always primary. Maybe we didn't consider the
> > value of the first enough.
> > Obviously the client caches writes and delays the start of writeback.
> > Adding another delay on the server side does not seem to have a clear
> > justification. Maybe we *should* kick off writeback immediately. There
> > would still be opportunity for subsequent WRITE requests to be merged
> > into the writeback queue.
>
> Dave Chinner had the same thought a while back. So I've experimented
> with starting writes as part of nfsd_write(). Kicking off writes,
> even without waiting, is actually pretty costly, and it resulted in
> worse performance.

Was this with filemap_fdatawrite_range_kick() or something else?

>
> Now that .pc_release is called /after/ the WRITE response has been
> sent, though, that might be a place where kicking off writeback could
> be done without charging that latency to the client.

Certainly after is better. I was imagining kicking the writeback
threads, but they seem to only do whole filesystems. Maybe that is OK?
I doubt we want more than one thread kicking off writes for any given
inode at a time.

Maybe it would be best to add a "kicking_write" flag in the filecache
and, if it is already set, just add the range to some set of pending
writes. Maybe. I guess I should explore how to test :-)

NeilBrown