Chuck and I were discussing RWF_DONTCACHE and he suggested that this
might be an alternate approach. My main gripe with DONTCACHE was that it
kicks off writeback after every WRITE operation. With NFS, we generally
get a COMMIT operation at some point. Allowing us to batch up writes
until that point has traditionally been considered better for
performance.

Instead of RWF_DONTCACHE, this patch has nfsd issue
generic_fadvise(..., POSIX_FADV_DONTNEED) on the appropriate range
after any READ, stable WRITE or COMMIT operation. This means that it
doesn't change how and when dirty data gets flushed to the disk, but
still keeps resident pagecache to a minimum.

For reference, here are some numbers from a fio run doing sequential
reads and writes, with the server in "normal" buffered I/O mode, with
Mike's RWF_DONTCACHE patch enabled, and with fadvise(...DONTNEED).

Jobfile:

[global]
name=fio-seq-RW
filename=fio-seq-RW
rw=rw
rwmixread=60
rwmixwrite=40
bs=1M
direct=0
numjobs=16
time_based
runtime=300

[file1]
size=100G
ioengine=io_uring
iodepth=16

::::::::::::::::::::::::::::::::::::

3 runs each.

Baseline (nothing enabled):

Run status group 0 (all jobs):
   READ: bw=2999MiB/s (3144MB/s), 185MiB/s-189MiB/s (194MB/s-198MB/s), io=879GiB (944GB), run=300014-300087msec
  WRITE: bw=1998MiB/s (2095MB/s), 124MiB/s-126MiB/s (130MB/s-132MB/s), io=585GiB (629GB), run=300014-300087msec

   READ: bw=2866MiB/s (3005MB/s), 177MiB/s-181MiB/s (185MB/s-190MB/s), io=844GiB (906GB), run=301294-301463msec
  WRITE: bw=1909MiB/s (2002MB/s), 117MiB/s-121MiB/s (123MB/s-127MB/s), io=562GiB (604GB), run=301294-301463msec

   READ: bw=2885MiB/s (3026MB/s), 177MiB/s-183MiB/s (186MB/s-192MB/s), io=846GiB (908GB), run=300017-300117msec
  WRITE: bw=1923MiB/s (2016MB/s), 118MiB/s-122MiB/s (124MB/s-128MB/s), io=563GiB (605GB), run=300017-300117msec

RWF_DONTCACHE:

Run status group 0 (all jobs):
   READ: bw=3088MiB/s (3238MB/s), 189MiB/s-195MiB/s (198MB/s-205MB/s), io=906GiB (972GB), run=300015-300276msec
  WRITE: bw=2058MiB/s (2158MB/s), 126MiB/s-129MiB/s (132MB/s-136MB/s), io=604GiB (648GB), run=300015-300276msec

   READ: bw=3116MiB/s (3267MB/s), 191MiB/s-197MiB/s (201MB/s-206MB/s), io=913GiB (980GB), run=300022-300074msec
  WRITE: bw=2077MiB/s (2178MB/s), 128MiB/s-131MiB/s (134MB/s-137MB/s), io=609GiB (654GB), run=300022-300074msec

   READ: bw=3011MiB/s (3158MB/s), 185MiB/s-191MiB/s (194MB/s-200MB/s), io=886GiB (951GB), run=301049-301133msec
  WRITE: bw=2007MiB/s (2104MB/s), 123MiB/s-127MiB/s (129MB/s-133MB/s), io=590GiB (634GB), run=301049-301133msec

fadvise(..., POSIX_FADV_DONTNEED):

   READ: bw=2918MiB/s (3060MB/s), 180MiB/s-184MiB/s (188MB/s-193MB/s), io=855GiB (918GB), run=300014-300111msec
  WRITE: bw=1944MiB/s (2038MB/s), 120MiB/s-123MiB/s (125MB/s-129MB/s), io=570GiB (612GB), run=300014-300111msec

   READ: bw=2951MiB/s (3095MB/s), 182MiB/s-188MiB/s (191MB/s-197MB/s), io=867GiB (931GB), run=300529-300695msec
  WRITE: bw=1966MiB/s (2061MB/s), 121MiB/s-124MiB/s (127MB/s-130MB/s), io=577GiB (620GB), run=300529-300695msec

   READ: bw=2971MiB/s (3115MB/s), 181MiB/s-188MiB/s (190MB/s-197MB/s), io=871GiB (935GB), run=300015-300077msec
  WRITE: bw=1979MiB/s (2076MB/s), 122MiB/s-125MiB/s (128MB/s-131MB/s), io=580GiB (623GB), run=300015-300077msec

::::::::::::::::::::::::::::::

The numbers are pretty close, but it looks like RWF_DONTCACHE edges out
the other modes. Also, with the RWF_DONTCACHE and fadvise() modes the
pagecache utilization stays very low on the server (which is, of
course, the point).

I think next I'll test a hybrid mode.
Use RWF_DONTCACHE for READ and stable WRITE operations, and do the
fadvise() only after COMMITs.

Plumbing this in for v4 will be "interesting" if we decide this
approach is sound, but it shouldn't be too bad if we only do it after a
COMMIT.

Thoughts?

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
Jeff Layton (2):
      sunrpc: delay pc_release callback until after sending a reply
      nfsd: call generic_fadvise after v3 READ, stable WRITE or COMMIT

 fs/nfsd/debugfs.c  |  2 ++
 fs/nfsd/nfs3proc.c | 59 +++++++++++++++++++++++++++++++++++++++++++++---------
 fs/nfsd/nfsd.h     |  1 +
 fs/nfsd/nfsproc.c  |  4 ++--
 fs/nfsd/vfs.c      | 21 ++++++++++++++-----
 fs/nfsd/vfs.h      |  5 +++--
 fs/nfsd/xdr3.h     |  3 +++
 net/sunrpc/svc.c   | 19 ++++++++++++++----
 8 files changed, 92 insertions(+), 22 deletions(-)
---
base-commit: 38ddcbef7f4e9c5aa075c8ccf9f6d5293e027951
change-id: 20250701-nfsd-testing-12e7c8da5f1c

Best regards,
--
Jeff Layton <jlayton@kernel.org>
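[For anyone who wants to poke at the semantics the series leans on,
here is a small userspace sketch (not part of the patches; filenames
and sizes are arbitrary and error handling is omitted). It shows why
the fadvise() approach shouldn't perturb writeback: POSIX_FADV_DONTNEED
frees clean pagecache pages but won't free dirty ones (at most it may
start asynchronous writeback on them), so issuing it once the data is
stable only trims the cache. Exact residency numbers will vary with how
quickly background writeback runs.]

/* dontneed-demo.c: illustrative only, not part of this series.
 * Build with: cc -O2 -o dontneed-demo dontneed-demo.c
 */
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define FILE_SIZE	(64UL * 1024 * 1024)

/* Count how many of the file's pages are resident in the pagecache. */
static size_t resident_pages(int fd, size_t len)
{
	size_t pages = (len + 4095) / 4096, i, count = 0;
	unsigned char *vec = malloc(pages);
	void *map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);

	mincore(map, len, vec);
	for (i = 0; i < pages; i++)
		count += vec[i] & 1;
	munmap(map, len);
	free(vec);
	return count;
}

int main(void)
{
	char *buf = malloc(FILE_SIZE);
	int fd = open("dontneed-demo.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);

	memset(buf, 'x', FILE_SIZE);
	write(fd, buf, FILE_SIZE);	/* like an unstable WRITE: pages dirty */

	/* DONTNEED won't free dirty pages; most stay resident. */
	posix_fadvise(fd, 0, FILE_SIZE, POSIX_FADV_DONTNEED);
	printf("dirty + DONTNEED: %zu pages resident\n",
	       resident_pages(fd, FILE_SIZE));

	fsync(fd);			/* like a COMMIT: pages now clean */

	/* Now the same call drops the (clean) pages from the cache. */
	posix_fadvise(fd, 0, FILE_SIZE, POSIX_FADV_DONTNEED);
	printf("fsync + DONTNEED: %zu pages resident\n",
	       resident_pages(fd, FILE_SIZE));

	free(buf);
	close(fd);
	unlink("dontneed-demo.dat");
	return 0;
}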
On Fri, 04 Jul 2025, Jeff Layton wrote:
> Chuck and I were discussing RWF_DONTCACHE and he suggested that this
> might be an alternate approach. My main gripe with DONTCACHE was that it
> kicks off writeback after every WRITE operation. With NFS, we generally
> get a COMMIT operation at some point. Allowing us to batch up writes
> until that point has traditionally been considered better for
> performance.

I wonder if that traditional consideration is justified, given your
subsequent results. The addition of COMMIT in v3 allowed us to both:
 - delay kicking off writes
 - not wait for writes to complete

I think the second was always primary. Maybe we didn't consider the
value of the first enough.

Obviously the client caches writes and delays the start of writeback.
Adding another delay on the server side does not seem to have a clear
justification. Maybe we *should* kick off writeback immediately. There
would still be opportunity for subsequent WRITE requests to be merged
into the writeback queue.

Ideally DONTCACHE should only affect cache usage and the latency of
subsequent READs. It shouldn't affect WRITE behaviour.

Thanks,
NeilBrown
On Fri, 2025-07-04 at 09:16 +1000, NeilBrown wrote:
> On Fri, 04 Jul 2025, Jeff Layton wrote:
> > Chuck and I were discussing RWF_DONTCACHE and he suggested that this
> > might be an alternate approach. My main gripe with DONTCACHE was that it
> > kicks off writeback after every WRITE operation. With NFS, we generally
> > get a COMMIT operation at some point. Allowing us to batch up writes
> > until that point has traditionally been considered better for
> > performance.
>
> I wonder if that traditional consideration is justified, given your
> subsequent results. The addition of COMMIT in v3 allowed us to both:
>  - delay kicking off writes
>  - not wait for writes to complete
>
> I think the second was always primary. Maybe we didn't consider the
> value of the first enough.
> Obviously the client caches writes and delays the start of writeback.
> Adding another delay on the server side does not seem to have a clear
> justification. Maybe we *should* kick off writeback immediately. There
> would still be opportunity for subsequent WRITE requests to be merged
> into the writeback queue.
>

That is the fundamental question: should we delay writeback or not? It
seems like delaying it is probably best, even in the modern era with
SSDs, but we do need more numbers here (ideally across a range of
workloads).

> Ideally DONTCACHE should only affect cache usage and the latency of
> subsequent READs. It shouldn't affect WRITE behaviour.
>

It definitely does affect it today. The ideal thing IMO would be to
just add the dropbehind flag to the folios on writes but not call
filemap_fdatawrite_range_kick() on every write operation.

After a COMMIT the pages should be clean and the vfs_fadvise call
should just drop them from the cache, so this approach shouldn't
materially change how writeback behaves.
--
Jeff Layton <jlayton@kernel.org>
On Sat, Jul 05, 2025 at 07:32:58AM -0400, Jeff Layton wrote:
> That is the fundamental question: should we delay writeback or not? It
> seems like delaying it is probably best, even in the modern era with
> SSDs, but we do need more numbers here (ideally across a range of
> workloads).

If you have asynchronous writeback there's probably no good reason to
delay it per se. But it does make sense to wait long enough to have a
large I/O size: especially with some form of parity RAID you'll want
to fill up the chunk, but storage devices themselves also perform much
better with a larger size. E.g. for HDDs you'll want to write 1MB
batches, and similar write sizes also help for SSDs. While the write
performance itself might not be much worse with smaller I/O, especially
for high-quality devices, large I/O helps to reduce internal
fragmentation, which in turn reduces later garbage collection overhead
and increases device lifetime.

> > Ideally DONTCACHE should only affect cache usage and the latency of
> > subsequent READs. It shouldn't affect WRITE behaviour.
> >
>
> It definitely does affect it today. The ideal thing IMO would be to
> just add the dropbehind flag to the folios on writes but not call
> filemap_fdatawrite_range_kick() on every write operation.

Yes, a mode that sets dropbehind but leaves writeback to the writeback
threads can be interesting. Right now it will still be bottlenecked by
the single writeback thread, but work on this is underway.
On 7/3/25 7:16 PM, NeilBrown wrote:
> On Fri, 04 Jul 2025, Jeff Layton wrote:
>> Chuck and I were discussing RWF_DONTCACHE and he suggested that this
>> might be an alternate approach. My main gripe with DONTCACHE was that it
>> kicks off writeback after every WRITE operation. With NFS, we generally
>> get a COMMIT operation at some point. Allowing us to batch up writes
>> until that point has traditionally been considered better for
>> performance.
>
> I wonder if that traditional consideration is justified, given your
> subsequent results. The addition of COMMIT in v3 allowed us to both:
>  - delay kicking off writes
>  - not wait for writes to complete
>
> I think the second was always primary. Maybe we didn't consider the
> value of the first enough.
> Obviously the client caches writes and delays the start of writeback.
> Adding another delay on the server side does not seem to have a clear
> justification. Maybe we *should* kick off writeback immediately. There
> would still be opportunity for subsequent WRITE requests to be merged
> into the writeback queue.

Dave Chinner had the same thought a while back. So I've experimented
with starting writes as part of nfsd_write(). Kicking off writes, even
without waiting, is actually pretty costly, and it resulted in worse
performance.

Now that .pc_release is called /after/ the WRITE response has been
sent, though, that might be a place where kicking off writeback could
be done without charging that latency to the client.

--
Chuck Lever
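[As a rough sketch only, and not the actual nfsd code: the shape of
post-reply work hung off a v3 .pc_release callback might look like the
following. The fields used to recover the open file and range from the
WRITE response (nf, offset, count-as-length) are invented here for
illustration.]

/*
 * Hypothetical sketch: run post-reply work from a v3 .pc_release
 * callback, which now executes after the response has been sent, so
 * none of this latency is charged to the client's WRITE reply.
 */
static void nfsd3_write_release(struct svc_rqst *rqstp)
{
	struct nfsd3_writeres *resp = rqstp->rq_resp;

	/*
	 * Only for stable writes: the data is already durable, so
	 * dropping the cached pages cannot change writeback behaviour.
	 */
	if (resp->committed && resp->nf) {	/* hypothetical fields */
		vfs_fadvise(resp->nf->nf_file, resp->offset, resp->count,
			    POSIX_FADV_DONTNEED);
		nfsd_file_put(resp->nf);
	}
}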
On Fri, 04 Jul 2025, Chuck Lever wrote:
> On 7/3/25 7:16 PM, NeilBrown wrote:
> > On Fri, 04 Jul 2025, Jeff Layton wrote:
> >> Chuck and I were discussing RWF_DONTCACHE and he suggested that this
> >> might be an alternate approach. My main gripe with DONTCACHE was that it
> >> kicks off writeback after every WRITE operation. With NFS, we generally
> >> get a COMMIT operation at some point. Allowing us to batch up writes
> >> until that point has traditionally been considered better for
> >> performance.
> >
> > I wonder if that traditional consideration is justified, given your
> > subsequent results. The addition of COMMIT in v3 allowed us to both:
> >  - delay kicking off writes
> >  - not wait for writes to complete
> >
> > I think the second was always primary. Maybe we didn't consider the
> > value of the first enough.
> > Obviously the client caches writes and delays the start of writeback.
> > Adding another delay on the server side does not seem to have a clear
> > justification. Maybe we *should* kick off writeback immediately. There
> > would still be opportunity for subsequent WRITE requests to be merged
> > into the writeback queue.
>
> Dave Chinner had the same thought a while back. So I've experimented
> with starting writes as part of nfsd_write(). Kicking off writes,
> even without waiting, is actually pretty costly, and it resulted in
> worse performance.

Was this with filemap_fdatawrite_range_kick() or something else?

>
> Now that .pc_release is called /after/ the WRITE response has been
> sent, though, that might be a place where kicking off writeback could
> be done without charging that latency to the client.

Certainly after is better. I was imagining kicking the writeback
threads, but they seem to only do whole filesystems. Maybe that is OK?
I doubt we want more than one thread kicking off writes for any given
inode at a time.

Maybe it would be best to add a "kicking_write" flag in the filecache
and, if it is already set, just add the range to some set of pending
writes. Maybe. I guess I should explore how to test :-)

NeilBrown