[PATCH 10/16] mm/filemap: make buffered writes work with RWF_UNCACHED

Posted by Jens Axboe 1 week, 4 days ago
If RWF_UNCACHED is set for a write, mark new folios being written with
uncached. This is done by passing in the fact that it's an uncached write
through the folio pointer. We can only get there when IOCB_UNCACHED was
allowed, which can only happen if the file system opts in. Opting in means
they need to check for the LSB in the folio pointer to know if it's an
uncached write or not. If it is, then FGP_UNCACHED should be used if
creating new folios is necessary.

Uncached writes will drop any folios they create upon writeback
completion, but leave folios that may exist in that range alone. Since
->write_begin() doesn't currently take any flags, and to avoid needing
to change the callback kernel wide, use the foliop being passed in to
->write_begin() to signal if this is an uncached write or not. File
systems can then use that to mark newly created folios as uncached.

Add a helper, generic_uncached_write(), that generic_file_write_iter()
calls upon successful completion of an uncached write.
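
As a purely illustrative sketch (not part of this patch), a file system that
opts in via FOP_UNCACHED could handle the marker in its ->write_begin() along
these lines; the function name is made up, and a real implementation also has
to deal with partial folios and its own bookkeeping:

	static int example_write_begin(struct file *file,
				       struct address_space *mapping,
				       loff_t pos, unsigned len,
				       struct folio **foliop, void **fsdata)
	{
		fgf_t fgp = FGP_WRITEBEGIN;
		struct folio *folio;

		/* foliop carries the magic value only for IOCB_UNCACHED writes */
		if (foliop_is_uncached(foliop))
			fgp |= FGP_UNCACHED;

		folio = __filemap_get_folio(mapping, pos >> PAGE_SHIFT, fgp,
					    mapping_gfp_mask(mapping));
		if (IS_ERR(folio))
			return PTR_ERR(folio);

		*foliop = folio;
		return 0;
	}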

This provides similar benefits to using RWF_UNCACHED with reads. Testing
buffered writes on 32 files:

writing bs 65536, uncached 0
  1s: 196035MB/sec
  2s: 132308MB/sec
  3s: 132438MB/sec
  4s: 116528MB/sec
  5s: 103898MB/sec
  6s: 108893MB/sec
  7s: 99678MB/sec
  8s: 106545MB/sec
  9s: 106826MB/sec
 10s: 101544MB/sec
 11s: 111044MB/sec
 12s: 124257MB/sec
 13s: 116031MB/sec
 14s: 114540MB/sec
 15s: 115011MB/sec
 16s: 115260MB/sec
 17s: 116068MB/sec
 18s: 116096MB/sec

where it's quite obvious where the page cache filled, and performance
dropped to about half of where it started, settling in at around
115GB/sec. Meanwhile, 32 kswapds were running full steam trying to
reclaim pages.

Running the same test with uncached buffered writes:

writing bs 65536, uncached 1
  1s: 198974MB/sec
  2s: 189618MB/sec
  3s: 193601MB/sec
  4s: 188582MB/sec
  5s: 193487MB/sec
  6s: 188341MB/sec
  7s: 194325MB/sec
  8s: 188114MB/sec
  9s: 192740MB/sec
 10s: 189206MB/sec
 11s: 193442MB/sec
 12s: 189659MB/sec
 13s: 191732MB/sec
 14s: 190701MB/sec
 15s: 191789MB/sec
 16s: 191259MB/sec
 17s: 190613MB/sec
 18s: 191951MB/sec

and the behavior is fully predictable, performing the same throughout
even after the page cache would otherwise have fully filled with dirty
data. It's also about 65% faster, and uses half the CPU of the system
compared to the normal buffered write.
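
For reference, a userspace caller requests this behaviour per-write through
pwritev2(); a minimal sketch, with the RWF_UNCACHED value assumed from the
uapi change elsewhere in this series:

	#define _GNU_SOURCE
	#include <sys/uio.h>

	#ifndef RWF_UNCACHED
	#define RWF_UNCACHED	0x00000080	/* assumed value, see uapi patch */
	#endif

	static ssize_t uncached_write(int fd, const void *buf, size_t len,
				      off_t off)
	{
		struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };

		/* the series rejects the flag if the file system hasn't opted in */
		return pwritev2(fd, &iov, 1, off, RWF_UNCACHED);
	}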

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/pagemap.h | 29 +++++++++++++++++++++++++++++
 mm/filemap.c            | 17 +++++++++++++++--
 2 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index d55bf995bd9e..d35280744aa1 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -14,6 +14,7 @@
 #include <linux/gfp.h>
 #include <linux/bitops.h>
 #include <linux/hardirq.h> /* for in_interrupt() */
+#include <linux/writeback.h>
 #include <linux/hugetlb_inline.h>
 
 struct folio_batch;
@@ -70,6 +71,34 @@ static inline int filemap_write_and_wait(struct address_space *mapping)
 	return filemap_write_and_wait_range(mapping, 0, LLONG_MAX);
 }
 
+/*
+ * generic_uncached_write - start uncached writeback
+ * @iocb: the iocb that was written
+ * @written: the amount of bytes written
+ *
+ * When writeback has been handled by write_iter, this helper should be called
+ * if the file system supports uncached writes. If %IOCB_UNCACHED is set, it
+ * will kick off writeback for the specified range.
+ */
+static inline void generic_uncached_write(struct kiocb *iocb, ssize_t written)
+{
+	if (iocb->ki_flags & IOCB_UNCACHED) {
+		struct address_space *mapping = iocb->ki_filp->f_mapping;
+
+		/* kick off uncached writeback */
+		__filemap_fdatawrite_range(mapping, iocb->ki_pos,
+					   iocb->ki_pos + written, WB_SYNC_NONE);
+	}
+}
+
+/*
+ * Value passed in to ->write_begin() if IOCB_UNCACHED is set for the write,
+ * and the ->write_begin() handler on a file system supporting FOP_UNCACHED
+ * must check for this and pass FGP_UNCACHED for folio creation.
+ */
+#define foliop_uncached			((struct folio *) 0xfee1c001)
+#define foliop_is_uncached(foliop)	(*(foliop) == foliop_uncached)
+
 /**
  * filemap_set_wb_err - set a writeback error on an address_space
  * @mapping: mapping in which to set writeback error
diff --git a/mm/filemap.c b/mm/filemap.c
index 40debe742abe..0d312de4e20c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -430,6 +430,7 @@ int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
 
 	return filemap_fdatawrite_wbc(mapping, &wbc);
 }
+EXPORT_SYMBOL_GPL(__filemap_fdatawrite_range);
 
 static inline int __filemap_fdatawrite(struct address_space *mapping,
 	int sync_mode)
@@ -4076,7 +4077,7 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
 	ssize_t written = 0;
 
 	do {
-		struct folio *folio;
+		struct folio *folio = NULL;
 		size_t offset;		/* Offset into folio */
 		size_t bytes;		/* Bytes to write to folio */
 		size_t copied;		/* Bytes copied from user */
@@ -4104,6 +4105,16 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
 			break;
 		}
 
+		/*
+		 * If IOCB_UNCACHED is set here, we know the file system
+		 * supports it. And hence it'll know to check foliop for being
+		 * set to this magic value. If so, it's an uncached write.
+		 * Whenever ->write_begin() changes prototypes again, this
+		 * can go away and just pass iocb or iocb flags.
+		 */
+		if (iocb->ki_flags & IOCB_UNCACHED)
+			folio = foliop_uncached;
+
 		status = a_ops->write_begin(file, mapping, pos, bytes,
 						&folio, &fsdata);
 		if (unlikely(status < 0))
@@ -4234,8 +4245,10 @@ ssize_t generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		ret = __generic_file_write_iter(iocb, from);
 	inode_unlock(inode);
 
-	if (ret > 0)
+	if (ret > 0) {
+		generic_uncached_write(iocb, ret);
 		ret = generic_write_sync(iocb, ret);
+	}
 	return ret;
 }
 EXPORT_SYMBOL(generic_file_write_iter);
-- 
2.45.2
Re: [PATCH 10/16] mm/filemap: make buffered writes work with RWF_UNCACHED
Posted by Dave Chinner 1 week, 4 days ago
On Mon, Nov 11, 2024 at 04:37:37PM -0700, Jens Axboe wrote:
> If RWF_UNCACHED is set for a write, mark new folios being written with
> uncached. This is done by passing in the fact that it's an uncached write
> through the folio pointer. We can only get there when IOCB_UNCACHED was
> allowed, which can only happen if the file system opts in. Opting in means
> they need to check for the LSB in the folio pointer to know if it's an
> uncached write or not. If it is, then FGP_UNCACHED should be used if
> creating new folios is necessary.
> 
> Uncached writes will drop any folios they create upon writeback
> completion, but leave folios that may exist in that range alone. Since
> ->write_begin() doesn't currently take any flags, and to avoid needing
> to change the callback kernel wide, use the foliop being passed in to
> ->write_begin() to signal if this is an uncached write or not. File
> systems can then use that to mark newly created folios as uncached.
> 
> Add a helper, generic_uncached_write(), that generic_file_write_iter()
> calls upon successful completion of an uncached write.

This doesn't implement an "uncached" write operation. This
implements a cache write-through operation.

We've actually been talking about this for some time as a desirable
general buffered write trait on fast SSDs. Excessive write-behind
caching is a real problem in general, especially when doing
streaming sequential writes to pcie 4 and 5 nvme SSDs that can do
more than 7GB/s to disk. When the page cache fills up, we see all
the same problems you are trying to work around in this series
with "uncached" writes.

IOWS, what we really want is page cache write-through as an
automatic feature for buffered writes.


> @@ -70,6 +71,34 @@ static inline int filemap_write_and_wait(struct address_space *mapping)
>  	return filemap_write_and_wait_range(mapping, 0, LLONG_MAX);
>  }
>  
> +/*
> + * generic_uncached_write - start uncached writeback
> + * @iocb: the iocb that was written
> + * @written: the amount of bytes written
> + *
> + * When writeback has been handled by write_iter, this helper should be called
> + * if the file system supports uncached writes. If %IOCB_UNCACHED is set, it
> + * will kick off writeback for the specified range.
> + */
> +static inline void generic_uncached_write(struct kiocb *iocb, ssize_t written)
> +{
> +	if (iocb->ki_flags & IOCB_UNCACHED) {
> +		struct address_space *mapping = iocb->ki_filp->f_mapping;
> +
> +		/* kick off uncached writeback */
> +		__filemap_fdatawrite_range(mapping, iocb->ki_pos,
> +					   iocb->ki_pos + written, WB_SYNC_NONE);
> +	}
> +}

Yup, this is basically write-through.

> +
> +/*
> + * Value passed in to ->write_begin() if IOCB_UNCACHED is set for the write,
> + * and the ->write_begin() handler on a file system supporting FOP_UNCACHED
> + * must check for this and pass FGP_UNCACHED for folio creation.
> + */
> +#define foliop_uncached			((struct folio *) 0xfee1c001)
> +#define foliop_is_uncached(foliop)	(*(foliop) == foliop_uncached)
> +
>  /**
>   * filemap_set_wb_err - set a writeback error on an address_space
>   * @mapping: mapping in which to set writeback error
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 40debe742abe..0d312de4e20c 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -430,6 +430,7 @@ int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
>  
>  	return filemap_fdatawrite_wbc(mapping, &wbc);
>  }
> +EXPORT_SYMBOL_GPL(__filemap_fdatawrite_range);
>  
>  static inline int __filemap_fdatawrite(struct address_space *mapping,
>  	int sync_mode)
> @@ -4076,7 +4077,7 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
>  	ssize_t written = 0;
>  
>  	do {
> -		struct folio *folio;
> +		struct folio *folio = NULL;
>  		size_t offset;		/* Offset into folio */
>  		size_t bytes;		/* Bytes to write to folio */
>  		size_t copied;		/* Bytes copied from user */
> @@ -4104,6 +4105,16 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
>  			break;
>  		}
>  
> +		/*
> +		 * If IOCB_UNCACHED is set here, we know the file system
> +		 * supports it. And hence it'll know to check foliop for being
> +		 * set to this magic value. If so, it's an uncached write.
> +		 * Whenever ->write_begin() changes prototypes again, this
> +		 * can go away and just pass iocb or iocb flags.
> +		 */
> +		if (iocb->ki_flags & IOCB_UNCACHED)
> +			folio = foliop_uncached;
> +
>  		status = a_ops->write_begin(file, mapping, pos, bytes,
>  						&folio, &fsdata);
>  		if (unlikely(status < 0))
> @@ -4234,8 +4245,10 @@ ssize_t generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  		ret = __generic_file_write_iter(iocb, from);
>  	inode_unlock(inode);
>  
> -	if (ret > 0)
> +	if (ret > 0) {
> +		generic_uncached_write(iocb, ret);
>  		ret = generic_write_sync(iocb, ret);

Why isn't the writethrough check inside generic_write_sync()?
Having to add it to every filesystem that supports write-through is
unwieldy. If the IO is DSYNC or SYNC, we're going to run WB_SYNC_ALL
writeback through the generic_write_sync() path already, so the only time we
actually want to run WB_SYNC_NONE write-through here is if the iocb
is not marked as dsync.

Hence I think this write-through check should be done conditionally
inside generic_write_sync(), not in addition to the writeback
generic_write_sync() might need to do...
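
Roughly, that suggestion would look something like the sketch below; the
IOCB_UNCACHED branch is the new part, and its exact shape is an assumption,
not code from this series:

	static inline ssize_t generic_write_sync(struct kiocb *iocb, ssize_t count)
	{
		if (iocb_is_dsync(iocb)) {
			int ret = vfs_fsync_range(iocb->ki_filp,
					iocb->ki_pos - count, iocb->ki_pos - 1,
					(iocb->ki_flags & IOCB_SYNC) ? 0 : 1);
			if (ret)
				return ret;
		} else if (iocb->ki_flags & IOCB_UNCACHED) {
			/* not (d)sync: kick off WB_SYNC_NONE write-through */
			__filemap_fdatawrite_range(iocb->ki_filp->f_mapping,
					iocb->ki_pos - count, iocb->ki_pos - 1,
					WB_SYNC_NONE);
		}

		return count;
	}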

That also gives us a common place for adding cache write-through
trigger logic (think writebehind trigger logic similar to readahead)
and this is also a place where we could automatically tag mapping
ranges for reclaim on writeback completion....

-Dave.
-- 
Dave Chinner
david@fromorbit.com
Re: [PATCH 10/16] mm/filemap: make buffered writes work with RWF_UNCACHED
Posted by Jens Axboe 1 week, 4 days ago
On 11/11/24 5:57 PM, Dave Chinner wrote:
> On Mon, Nov 11, 2024 at 04:37:37PM -0700, Jens Axboe wrote:
>> If RWF_UNCACHED is set for a write, mark new folios being written with
>> uncached. This is done by passing in the fact that it's an uncached write
>> through the folio pointer. We can only get there when IOCB_UNCACHED was
>> allowed, which can only happen if the file system opts in. Opting in means
>> they need to check for the LSB in the folio pointer to know if it's an
>> uncached write or not. If it is, then FGP_UNCACHED should be used if
>> creating new folios is necessary.
>>
>> Uncached writes will drop any folios they create upon writeback
>> completion, but leave folios that may exist in that range alone. Since
>> ->write_begin() doesn't currently take any flags, and to avoid needing
>> to change the callback kernel wide, use the foliop being passed in to
>> ->write_begin() to signal if this is an uncached write or not. File
>> systems can then use that to mark newly created folios as uncached.
>>
>> Add a helper, generic_uncached_write(), that generic_file_write_iter()
>> calls upon successful completion of an uncached write.
> 
> This doesn't implement an "uncached" write operation. This
> implements a cache write-through operation.

It's uncached in the sense that the range gets pruned on writeback
completion. For write-through, I'd consider that just the fact that it
gets kicked off once dirtied rather than wait for writeback to get
kicked at some point.

So I'd say write-through is a subset of that.

> We've actually been talking about this for some time as a desirable
> general buffered write trait on fast SSDs. Excessive write-behind
> caching is a real problem in general, especially when doing
> streaming sequential writes to pcie 4 and 5 nvme SSDs that can do
> more than 7GB/s to disk. When the page cache fills up, we see all

Try twice that, 14GB/sec.

> the same problems you are trying to work around in this series
> with "uncached" writes.
> 
> IOWS, what we really want is page cache write-through as an
> automatic feature for buffered writes.

I don't know who "we" is here - what I really want is for the write to
get kicked off, but also reclaimed as part of completion. I don't want
kswapd to do that, as it's inefficient.

>> @@ -70,6 +71,34 @@ static inline int filemap_write_and_wait(struct address_space *mapping)
>>  	return filemap_write_and_wait_range(mapping, 0, LLONG_MAX);
>>  }
>>  
>> +/*
>> + * generic_uncached_write - start uncached writeback
>> + * @iocb: the iocb that was written
>> + * @written: the amount of bytes written
>> + *
>> + * When writeback has been handled by write_iter, this helper should be called
>> + * if the file system supports uncached writes. If %IOCB_UNCACHED is set, it
>> + * will kick off writeback for the specified range.
>> + */
>> +static inline void generic_uncached_write(struct kiocb *iocb, ssize_t written)
>> +{
>> +	if (iocb->ki_flags & IOCB_UNCACHED) {
>> +		struct address_space *mapping = iocb->ki_filp->f_mapping;
>> +
>> +		/* kick off uncached writeback */
>> +		__filemap_fdatawrite_range(mapping, iocb->ki_pos,
>> +					   iocb->ki_pos + written, WB_SYNC_NONE);
>> +	}
>> +}
> 
> Yup, this is basically write-through.

... with pruning once it completes.

>> + * Value passed in to ->write_begin() if IOCB_UNCACHED is set for the write,
>> + * and the ->write_begin() handler on a file system supporting FOP_UNCACHED
>> + * must check for this and pass FGP_UNCACHED for folio creation.
>> + */
>> +#define foliop_uncached			((struct folio *) 0xfee1c001)
>> +#define foliop_is_uncached(foliop)	(*(foliop) == foliop_uncached)
>> +
>>  /**
>>   * filemap_set_wb_err - set a writeback error on an address_space
>>   * @mapping: mapping in which to set writeback error
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 40debe742abe..0d312de4e20c 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -430,6 +430,7 @@ int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
>>  
>>  	return filemap_fdatawrite_wbc(mapping, &wbc);
>>  }
>> +EXPORT_SYMBOL_GPL(__filemap_fdatawrite_range);
>>  
>>  static inline int __filemap_fdatawrite(struct address_space *mapping,
>>  	int sync_mode)
>> @@ -4076,7 +4077,7 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
>>  	ssize_t written = 0;
>>  
>>  	do {
>> -		struct folio *folio;
>> +		struct folio *folio = NULL;
>>  		size_t offset;		/* Offset into folio */
>>  		size_t bytes;		/* Bytes to write to folio */
>>  		size_t copied;		/* Bytes copied from user */
>> @@ -4104,6 +4105,16 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
>>  			break;
>>  		}
>>  
>> +		/*
>> +		 * If IOCB_UNCACHED is set here, we know the file system
>> +		 * supports it. And hence it'll know to check foliop for being
>> +		 * set to this magic value. If so, it's an uncached write.
>> +		 * Whenever ->write_begin() changes prototypes again, this
>> +		 * can go away and just pass iocb or iocb flags.
>> +		 */
>> +		if (iocb->ki_flags & IOCB_UNCACHED)
>> +			folio = foliop_uncached;
>> +
>>  		status = a_ops->write_begin(file, mapping, pos, bytes,
>>  						&folio, &fsdata);
>>  		if (unlikely(status < 0))
>> @@ -4234,8 +4245,10 @@ ssize_t generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>>  		ret = __generic_file_write_iter(iocb, from);
>>  	inode_unlock(inode);
>>  
>> -	if (ret > 0)
>> +	if (ret > 0) {
>> +		generic_uncached_write(iocb, ret);
>>  		ret = generic_write_sync(iocb, ret);
> 
> Why isn't the writethrough check inside generic_write_sync()?
> Having to add it to every filesystem that supports write-through is
> unwieldy. If the IO is DSYNC or SYNC, we're going to run WB_SYNC_ALL
> writeback through the generic_write_sync() path already, so the only time we
> actually want to run WB_SYNC_NONE write-through here is if the iocb
> is not marked as dsync.
> 
> Hence I think this write-through check should be done conditionally
> inside generic_write_sync(), not in addition to the writeback
> generic_write_sync() might need to do...

True, good point - I'll move it to generic_write_sync() instead; it needs
to go there for all three spots where it's added anyway.

> That also gives us a common place for adding cache write-through
> trigger logic (think writebehind trigger logic similar to readahead)
> and this is also a place where we could automatically tag mapping
> ranges for reclaim on writeback completion....

I appreciate that you seemingly like the concept, but not that you are
also seemingly trying to commandeer this to be something else. Unless
you like the automatic reclaiming as well, it's not clear to me.

I'm certainly open to doing the reclaim differently, and I originally
wanted to do tagging for that - but there are no more tags available.
Yes we could add that, but then the question was whether that was
worthwhile. Hence I just went with a folio flag instead.

But the mechanism for doing that doesn't interest me as much as the
concept of it. Certainly open to ideas on that end.

-- 
Jens Axboe
Re: [PATCH 10/16] mm/filemap: make buffered writes work with RWF_UNCACHED
Posted by Dave Chinner 1 week, 4 days ago
On Mon, Nov 11, 2024 at 06:27:46PM -0700, Jens Axboe wrote:
> On 11/11/24 5:57 PM, Dave Chinner wrote:
> > On Mon, Nov 11, 2024 at 04:37:37PM -0700, Jens Axboe wrote:
> >> If RWF_UNCACHED is set for a write, mark new folios being written with
> >> uncached. This is done by passing in the fact that it's an uncached write
> >> through the folio pointer. We can only get there when IOCB_UNCACHED was
> >> allowed, which can only happen if the file system opts in. Opting in means
> >> they need to check for the LSB in the folio pointer to know if it's an
> >> uncached write or not. If it is, then FGP_UNCACHED should be used if
> >> creating new folios is necessary.
> >>
> >> Uncached writes will drop any folios they create upon writeback
> >> completion, but leave folios that may exist in that range alone. Since
> >> ->write_begin() doesn't currently take any flags, and to avoid needing
> >> to change the callback kernel wide, use the foliop being passed in to
> >> ->write_begin() to signal if this is an uncached write or not. File
> >> systems can then use that to mark newly created folios as uncached.
> >>
> >> Add a helper, generic_uncached_write(), that generic_file_write_iter()
> >> calls upon successful completion of an uncached write.
> > 
> > This doesn't implement an "uncached" write operation. This
> > implements a cache write-through operation.
> 
> It's uncached in the sense that the range gets pruned on writeback
> completion.

That's not the definition of "uncached". Direct IO is, by
definition, "uncached" because it bypasses the cache and is not
coherent with the contents of the cache.

This IO, however, is moving the data coherently through the cache
(both on read and write).  The cached folios are transient - i.e.
-temporarily resident- in the cache whilst the IO is in progress -
but this behaviour does not make it "uncached IO".

Calling it "uncached IO " is simply wrong from any direction I look
at it....

> For write-through, I'd consider that just the fact that it
> gets kicked off once dirtied rather than wait for writeback to get
> kicked at some point.
> 
> So I'd say write-through is a subset of that.

I think the post-IO invalidation that these IOs do is largely
irrelevant to how the page cache processes the write. Indeed,
from userspace, the functionality in this patchset would be
implemented like this:

oneshot_data_write(fd, buf, len, off)
{
	/* write into page cache */
	pwrite(fd, buf, len, off);

	/* force the write through the page cache */
	sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER);

	/* Invalidate the single use data in the cache now it is on disk */
	posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);
}

Allowing the application to control writeback and invalidation
granularity is a much more flexible solution to the problem here;
when IO is sequential, delayed allocation will be allowed to ensure
large contiguous extents are created and that will greatly reduce
file fragmentation on XFS, btrfs, bcachefs and ext4. For random
writes, it'll submit async IOs in batches...

Given that io_uring already supports sync_file_range() and
posix_fadvise(), I'm wondering why we need an new IO API to perform
this specific write-through behaviour in a way that is less flexible
than what applications can already implement through existing
APIs....

> > the same problems you are trying to work around in this series
> > with "uncached" writes.
> > 
> > IOWS, what we really want is page cache write-through as an
> > automatic feature for buffered writes.
> 
> I don't know who "we" is here - what I really want is for the write to
> get kicked off, but also reclaimed as part of completion. I don't want
> kswapd to do that, as it's inefficient.

"we" as in the general cohort of filesystem and mm
developers who interact closely with the page cache all the time.
There was a fair bit of talk about writethrough and other
transparent page cache IO path improvements at LSFMM this year.

> > That also gives us a common place for adding cache write-through
> > trigger logic (think writebehind trigger logic similar to readahead)
> > and this is also a place where we could automatically tag mapping
> > ranges for reclaim on writeback completion....
> 
> I appreciate that you seemingly like the concept, but not that you are
> also seemingly trying to commandeer this to be something else. Unless
> you like the automatic reclaiming as well, it's not clear to me.

I'm not trying to commandeer anything.

Having thought about it more, I think this new API is unnecessary
for custom written applications to perform fine grained control of
page cache residency of one-shot data. We already have APIs that
allow applications to do exactly what this patchset is doing. Rather
than choosing to modify whatever benchmark is being used to use
existing APIs, a choice was made to modify both the application and
the kernel to implement a whole new API....

I think that was the -wrong choice-.

I think this is partly because the kernel modifications don't
really help further us towards the goal of transparent mode
switching in the page cache.

Read-through should be a mode that the readahead control activates,
not be something triggered by a special read() syscall flag. We
already have access patterns and fadvise modes guiding this.
Write-through should be controlled in a similar way.

And making the data being read and written behave as transient page
cache objects should be done via an existing fadvise mode, too,
because the model you have implemented here exactly matches the 
definition of FADV_NOREUSE:

	POSIX_FADV_NOREUSE
              The specified data will be accessed only once.

Having a new per-IO flag that effectively collides existing
control functionality into a single inflexible API bit doesn't
really make a whole lot of sense to me.

IOWs, I'm not questioning whether we need rw-through modes and/or
IO-transient residency for page cache based IO - it's been on our
radar for a while. I'm more concerned that the chosen API in this
patchset is a poor one as it cannot replace any of the existing
controls we already have for these sorts of application directed
page cache manipulations...

-Dave.
-- 
Dave Chinner
david@fromorbit.com
Re: [PATCH 10/16] mm/filemap: make buffered writes work with RWF_UNCACHED
Posted by Jens Axboe 1 week, 4 days ago
On 11/12/24 1:02 AM, Dave Chinner wrote:
> On Mon, Nov 11, 2024 at 06:27:46PM -0700, Jens Axboe wrote:
>> On 11/11/24 5:57 PM, Dave Chinner wrote:
>>> On Mon, Nov 11, 2024 at 04:37:37PM -0700, Jens Axboe wrote:
>>>> If RWF_UNCACHED is set for a write, mark new folios being written with
>>>> uncached. This is done by passing in the fact that it's an uncached write
>>>> through the folio pointer. We can only get there when IOCB_UNCACHED was
>>>> allowed, which can only happen if the file system opts in. Opting in means
>>>> they need to check for the LSB in the folio pointer to know if it's an
>>>> uncached write or not. If it is, then FGP_UNCACHED should be used if
>>>> creating new folios is necessary.
>>>>
>>>> Uncached writes will drop any folios they create upon writeback
>>>> completion, but leave folios that may exist in that range alone. Since
>>>> ->write_begin() doesn't currently take any flags, and to avoid needing
>>>> to change the callback kernel wide, use the foliop being passed in to
>>>> ->write_begin() to signal if this is an uncached write or not. File
>>>> systems can then use that to mark newly created folios as uncached.
>>>>
>>>> Add a helper, generic_uncached_write(), that generic_file_write_iter()
>>>> calls upon successful completion of an uncached write.
>>>
>>> This doesn't implement an "uncached" write operation. This
>>> implements a cache write-through operation.
>>
>> It's uncached in the sense that the range gets pruned on writeback
>> completion.
> 
> That's not the definition of "uncached". Direct IO is, by
> definition, "uncached" because it bypasses the cache and is not
> coherent with the contents of the cache.

I grant you it's not the best word in the world to describe it, but it
is uncached in the sense that it's not persistent in cache. It does very
much use the page cache as the synchronization point, exactly to avoid
the pitfalls of the giant mess that is O_DIRECT. But it's not persistent
in cache, whereas write-through very much traditionally is. Hence I
think uncached is a much better word than write-through, though as
mentioned I'll be happy to take other suggestions. Write-through isn't
it though, as the uncached concept is as much about reads as it is about
writes.

> This IO, however, is moving the data coherently through the cache
> (both on read and write).  The cached folios are transient - i.e.
> -temporarily resident- in the cache whilst the IO is in progress -
> but this behaviour does not make it "uncached IO".
> 
> Calling it "uncached IO " is simply wrong from any direction I look
> at it....

As mentioned, better words welcome :-)

>> For write-through, I'd consider that just the fact that it
>> gets kicked off once dirtied rather than wait for writeback to get
>> kicked at some point.
>>
>> So I'd say write-through is a subset of that.
> 
> I think the post-IO invalidation that these IOs do is largely
> irrelevant to how the page cache processes the write. Indeed,
> from userspace, the functionality in this patchset would be
> implemented like this:
> 
> oneshot_data_write(fd, buf, len, off)
> {
> 	/* write into page cache */
> 	pwrite(fd, buf, len, off);
> 
> 	/* force the write through the page cache */
> 	sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER);
> 
> 	/* Invalidate the single use data in the cache now it is on disk */
> 	posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);
> }

Right, you could do that, it'd obviously just be much slower as you lose
the pipelining of the writes. This is the reason for the patch, after
all.

> Allowing the application to control writeback and invalidation
> granularity is a much more flexible solution to the problem here;
> when IO is sequential, delayed allocation will be allowed to ensure
> large contiguous extents are created and that will greatly reduce
> file fragmentation on XFS, btrfs, bcachefs and ext4. For random
> writes, it'll submit async IOs in batches...
> 
> Given that io_uring already supports sync_file_range() and
> posix_fadvise(), I'm wondering why we need an new IO API to perform
> this specific write-through behaviour in a way that is less flexible
> than what applications can already implement through existing
> APIs....

Just to make it available generically, it's just a read/write flag after
all. And yes, you can very much do this already with io_uring, just by
linking the ops. But the way I see it, it's a generic solution to a
generic problem.
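
For reference, the io_uring chaining mentioned above would look roughly like
the following liburing sketch (illustrative only; SQE NULL checks and error
handling omitted):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <liburing.h>

	static int linked_oneshot_write(struct io_uring *ring, int fd,
					const void *buf, unsigned len, off_t off)
	{
		struct io_uring_sqe *sqe;

		/* 1) buffered write into the page cache */
		sqe = io_uring_get_sqe(ring);
		io_uring_prep_write(sqe, fd, buf, len, off);
		sqe->flags |= IOSQE_IO_LINK;

		/* 2) push the dirty range to disk */
		sqe = io_uring_get_sqe(ring);
		io_uring_prep_sync_file_range(sqe, fd, len, off,
					      SYNC_FILE_RANGE_WRITE |
					      SYNC_FILE_RANGE_WAIT_AFTER);
		sqe->flags |= IOSQE_IO_LINK;

		/* 3) drop the now-clean folios from the page cache */
		sqe = io_uring_get_sqe(ring);
		io_uring_prep_fadvise(sqe, fd, off, len, POSIX_FADV_DONTNEED);

		return io_uring_submit(ring);
	}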

>>> That also gives us a common place for adding cache write-through
>>> trigger logic (think writebehind trigger logic similar to readahead)
>>> and this is also a place where we could automatically tag mapping
>>> ranges for reclaim on writeback completion....
>>
>> I appreciate that you seemingly like the concept, but not that you are
>> also seemingly trying to commandeer this to be something else. Unless
>> you like the automatic reclaiming as well, it's not clear to me.
> 
> I'm not trying to commandeer anything.

No? You're very much trying to steer it in a direction that you find
better. There's a difference between making suggestions and speaking
like you are sitting on the ultimate truth.

> Having thought about it more, I think this new API is unnecessary
> for custom written applications to perform fine grained control of
> page cache residency of one-shot data. We already have APIs that
> allow applications to do exactly what this patchset is doing. Rather
> than choosing to modify whatever benchmark is being used to use
> existing APIs, a choice was made to modify both the application and
> the kernel to implement a whole new API....
> 
> I think that was the -wrong choice-.
> 
> I think this is partly because the kernel modifications don't
> really help further us towards the goal of transparent mode
> switching in the page cache.
> 
> Read-through should be a mode that the readahead control activates,
> not be something triggered by a special read() syscall flag. We
> already have access patterns and fadvise modes guiding this.
> Write-through should be controlled in a similar way.
> 
> And making the data being read and written behave as transient page
> cache objects should be done via an existing fadvise mode, too,
> because the model you have implemented here exactly matches the 
> definition of FADV_NOREUSE:
> 
> 	POSIX_FADV_NOREUSE
>               The specified data will be accessed only once.
> 
> Having a new per-IO flag that effectively collides existing
> control functionality into a single inflexible API bit doesn't
> really make a whole lot of sense to me.
> 
> IOWs, I'm not questioning whether we need rw-through modes and/or
> IO-transient residency for page cache based IO - it's been on our
> radar for a while. I'm more concerned that the chosen API in this
> patchset is a poor one as it cannot replace any of the existing
> controls we already have for these sorts of application directed
> page cache manipulations...

We'll just have to disagree, then. Per-file settings are fine for sync
IO; for anything async, per-IO is the way to go. It's why we have things
like RWF_NOWAIT as well, where O_NONBLOCK exists too. I'd argue that
RWF_NOWAIT should always have been a thing, and O_NONBLOCK is a mistake.
That's why RWF_UNCACHED exists. And yes, the FADV_NOREUSE was already
discussed with Willy and Yu, and I already did a poc patch to just
unconditionally set RWF_UNCACHED for FADV_NOREUSE enabled files. While
it's not exactly the same concept, I think the overlap is large enough
that it makes sense to do that. Especially since, historically,
FADV_NOREUSE has been largely a no-op, and even now it doesn't have
well-defined semantics.
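
A sketch of what such a mapping might look like (assumed shape, not the
actual poc patch):

	/* illustrative only: treat FADV_NOREUSE files as uncached IO */
	static void kiocb_apply_noreuse(struct kiocb *iocb)
	{
		if (iocb->ki_filp->f_mode & FMODE_NOREUSE)
			iocb->ki_flags |= IOCB_UNCACHED;
	}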

-- 
Jens Axboe
Re: [PATCH 10/16] mm/filemap: make buffered writes work with RWF_UNCACHED
Posted by Kirill A. Shutemov 1 week, 4 days ago
On Tue, Nov 12, 2024 at 07:02:33PM +1100, Dave Chinner wrote:
> I think the post-IO invalidation that these IOs do is largely
> irrelevant to how the page cache processes the write. Indeed,
> from userspace, the functionality in this patchset would be
> implemented like this:
> 
> oneshot_data_write(fd, buf, len, off)
> {
> 	/* write into page cache */
> 	pwrite(fd, buf, len, off);
> 
> 	/* force the write through the page cache */
> 	sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER);
> 
> 	/* Invalidate the single use data in the cache now it is on disk */
> 	posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);
> }
> 
> Allowing the application to control writeback and invalidation
> granularity is a much more flexible solution to the problem here;
> when IO is sequential, delayed allocation will be allowed to ensure
> large contiguous extents are created and that will greatly reduce
> file fragmentation on XFS, btrfs, bcachefs and ext4. For random
> writes, it'll submit async IOs in batches...
> 
> Given that io_uring already supports sync_file_range() and
> posix_fadvise(), I'm wondering why we need an new IO API to perform
> this specific write-through behaviour in a way that is less flexible
> than what applications can already implement through existing
> APIs....

Attaching the hint to the IO operation allows the kernel to keep the data in
the page cache if it is there for another reason. You cannot do it with a
separate syscall.

Consider a scenario of a nightly backup of the data. The same data is in
cache because the actual workload needs it. You don't want the backup task to
invalidate the data from cache. Your snippet would do that.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov
Re: [PATCH 10/16] mm/filemap: make buffered writes work with RWF_UNCACHED
Posted by Dave Chinner 1 week, 4 days ago
On Tue, Nov 12, 2024 at 11:50:46AM +0200, Kirill A. Shutemov wrote:
> On Tue, Nov 12, 2024 at 07:02:33PM +1100, Dave Chinner wrote:
> > I think the post-IO invalidation that these IOs do is largely
> > irrelevant to how the page cache processes the write. Indeed,
> > from userspace, the functionality in this patchset would be
> > implemented like this:
> > 
> > oneshot_data_write(fd, buf, len, off)
> > {
> > 	/* write into page cache */
> > 	pwrite(fd, buf, len, off);
> > 
> > 	/* force the write through the page cache */
> > 	sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER);
> > 
> > 	/* Invalidate the single use data in the cache now it is on disk */
> > 	posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);
> > }
> > 
> > Allowing the application to control writeback and invalidation
> > granularity is a much more flexible solution to the problem here;
> > when IO is sequential, delayed allocation will be allowed to ensure
> > large contiguous extents are created and that will greatly reduce
> > file fragmentation on XFS, btrfs, bcachefs and ext4. For random
> > writes, it'll submit async IOs in batches...
> > 
> > Given that io_uring already supports sync_file_range() and
> > posix_fadvise(), I'm wondering why we need an new IO API to perform
> > this specific write-through behaviour in a way that is less flexible
> > than what applications can already implement through existing
> > APIs....
> 
> Attaching the hint to the IO operation allows kernel to keep the data in
> page cache if it is there for other reason. You cannot do it with a
> separate syscall.

Sure we can. FADV_NOREUSE is attached to the struct file - that's
available to every IO that is done on that file. Hence we know
before we start every IO on that file if we only need to preserve
existing page cache or all data we access.

Having a file marked like this doesn't affect any other application
that is accessing the same inode. It just means that the specific
fd opened by a specific process will not perturb the long term
residency of the page cache on that inode.

> Consider a scenario of a nightly backup of the data. The same data is in
> cache because the actual workload needs it. You don't want backup task to
> invalidate the data from cache. Your snippet would do that.

The code I presented was essentially just a demonstration of what
"uncached IO" was doing. That it is actually cached IO, and that it
can be done from userspace right now. Yes, it's not exactly the same
cache invalidation semantics, but that's not the point.

The point was that the existing APIs are *much more flexible* than
this proposal, and we don't actually need new kernel functionality
for applications to see the same benchmark results as Jens has
presented. All they need to do is be modified to use existing APIs.

The additional point to that end is that FADV_NOREUSE should be
hooked up to the conditional cache invalidation mechanism Jens added
to the page cache IO paths. Then we have all the functionality of
this patch set individually selectable by userspace applications
without needing a new IO API to be rolled out. i.e. the snippet
then becomes:

	/* don't cache after IO */
	fadvise(fd, FADV_NOREUSE);
	....
	pwrite(fd, buf, len, off);
	/* write through */
	sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE);

Note how this doesn't need to block in sync_file_range() before
doing the invalidation anymore? We've separated the cache control
behaviour from the writeback behaviour. We can now do both write
back and write through buffered writes that clean up the page cache
after IO completion has occurred - write-through is not restricted
to uncached writes, nor is the cache purge after writeback
completion.

IOWs, we can do:

	/* don't cache after IO */
	fadvise(fd, FADV_NOREUSE);
	....
	off = pos;
	count = 4096;
	while (off < pos + len) {
		ret = pwrite(fd, buf, count, off);
		/* get more data and put it in buf */
		off += ret;
	}
	/* write through */
	sync_file_range(fd, pos, len, SYNC_FILE_RANGE_WRITE);

And now we only do one set of writeback on the file range, instead
of one per IO. And we still get the page cache being released on
writeback IO completion.

This is a *much* better API for IO and page cache control. It is not
constrained to individual IOs, so applications can allow the page
cache to write-combine data from multiple syscalls into a single
physical extent allocation and writeback IO.

This is much more efficient for modern filesystems - the "writeback
per IO" model forces filesystems like XFS and ext4 to work like ext3
did, and defeats buffered write IO optimisations like delayed
allocation. If we are going to do small "allocation and write IO"
patterns, we may as well be using direct IO as it is optimised for
that sort of behaviour.

So let's consider the backup application example. IMO, backup
applications really don't want to use this new uncached IO
mechanism for either reading or writing data.

Backup programs do sequential data read IO as they walk the backup set -
if they are doing buffered IO then we -really- want readahead to be
active.

However, uncached IO turns off readahead, which is the equivalent of
the backup application doing:

	fadvise(fd, FADV_RANDOM);
	while (len > 0) {
		ret = pread(fd, buf, len, off);
		fadvise(fd, FADV_DONTNEED, off, len);

		/* do stuff with buf */

		off += ret;
		len -= ret;
	}

Sequential buffered read IO after setting FADV_RANDOM absolutely
*sucks* from a performance perspective.

This is when FADV_NOREUSE is useful. We can leave readahead turned
on, and when we do the first read from the page cache after
readahead completes, we can then apply the NOREUSE policy. i.e. if
the data we are reading has not been accessed, then turf it after
reading if NOREUSE is set. If the data was already resident in
cache, then leave it there as per a normal read.

IOWs, if we separate the cache control from the read IO itself,
there is no need to turn off readahead to implement "drop cache
on-read" semantics. We just need to know if the folio has been
accessed or not to determine what to do with it.
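
Something along these lines, purely as a sketch of the idea (the helper and
its placement in the read path are assumptions):

	/* illustrative only: drop-behind for FADV_NOREUSE reads */
	static void noreuse_after_read(struct file *file, struct folio *folio)
	{
		if (!(file->f_mode & FMODE_NOREUSE))
			return;

		/* previously accessed data: someone else cares about it, keep it */
		if (folio_test_referenced(folio))
			return;

		/* one-shot data: age it out quickly instead of polluting the LRU */
		folio_deactivate(folio);
	}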

Let's also consider the backup data file - that is written
sequentially.  It's going to be large and we don't know it's size
ahead of time. If we are using buffered writes we want delayed
allocation to optimise the file layout and hence writeback IO
throughput.  We also want to drop the page cache when writeback
eventually happens, but we really don't want writeback to happen on
every write.

IOWs, backup programs can take advantage of "drop cache when clean"
semantics, but can't really take any significant advantage from
per-IO write-through semantics. IOWs, backup applications really
want per-file NOREUSE write semantics that are separately controlled
w.r.t. cache write-through behaviour.

One of the points I tried to make was that the uncached IO proposal
smashes multiple disparate semantics into a single per-IO control
bit. The backup application example above shows exactly how that API
isn't actually very good for the applications that could benefit
from the functionality this patchset adds to the page cache to
support that single control bit...

-Dave.
-- 
Dave Chinner
david@fromorbit.com