[v4] allow partial folio write with iomap_folio_state

[PATCH 4/4] iomap: don't abandon the whole copy when we have iomap_folio_state

Posted by alexjlzheng@gmail.com 2 weeks, 5 days ago

From: Jinliang Zheng <alexjlzheng@tencent.com>

Currently, if a partial write occurs during a buffered write, the entire
write is discarded. While this is an uncommon case, it is still wasteful
and we can do better.

With iomap_folio_state, we can track uptodate state at block
granularity, and read_folio can correctly handle partially uptodate
folios.

Therefore, when a partial write occurs, accept the block-aligned
partial write instead of rejecting the entire write.

For example, suppose a folio is 2MB, blocksize is 4kB, and the copied
bytes are 2MB-3kB.

Without this patchset, we need to recopy from the beginning of the
folio in the next iteration, which means 2MB-3kB bytes are copied
twice.

 |<-------------------- 2MB -------------------->|
 +-------+-------+-------+-------+-------+-------+
 | block |  ...  | block | block |  ...  | block | folio
 +-------+-------+-------+-------+-------+-------+
 |<-4kB->|

 |<--------------- copied 2MB-3kB --------->|       first time copied
 |<-------- 1MB -------->|                          next time we need copy (chunk /= 2)
                         |<-------- 1MB -------->|  next next time we need copy.

 |<------ 2MB-3kB bytes duplicate copy ---->|

With this patchset, we can accept 2MB-4kB bytes, which is block-aligned.
Only the remaining 4kB has to be processed in the next iteration, so just
1kB is copied twice.

 |<-------------------- 2MB -------------------->|
 +-------+-------+-------+-------+-------+-------+
 | block |  ...  | block | block |  ...  | block | folio
 +-------+-------+-------+-------+-------+-------+
 |<-4kB->|

 |<--------------- copied 2MB-3kB --------->|       first time copied
                                         |<-4kB->|  next time we need copy

                                         |<>|
                              only 1kB bytes duplicate copy
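
The numbers in the example above can be checked with a small user-space
sketch. This is not kernel code: accepted_bytes() and recopied_bytes() are
hypothetical helpers that only model the arithmetic, and they assume the
block holding the last copied byte is not uptodate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model: bytes that can be kept after a short copy of
 * `copied` bytes starting at byte offset `pos`, assuming the block that
 * holds the last copied byte is NOT uptodate.
 */
static size_t accepted_bytes(uint64_t pos, size_t copied, unsigned blkbits)
{
	uint64_t block_size;
	size_t tail;

	assert(blkbits < 32);
	block_size = (uint64_t)1 << blkbits;

	/* Offset of the end of the copy inside its block. */
	tail = (size_t)((pos + copied) & (block_size - 1));
	/* Never trim more than was actually copied. */
	if (tail > copied)
		tail = copied;
	return copied - tail;
}

/* Bytes that have to be copied again in the next iteration. */
static size_t recopied_bytes(uint64_t pos, size_t copied, unsigned blkbits)
{
	return copied - accepted_bytes(pos, copied, blkbits);
}
```

With pos = 0, copied = 2MB-3kB and 4kB blocks (blkbits = 12), this model
accepts 2MB-4kB and leaves 1kB to be copied again. Under the old behaviour
nothing is accepted, so all 2MB-3kB bytes are copied again.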

Although partial writes are relatively rare and seldom show up in
performance testing, the optimization still pays off in large-scale data
centers.

Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
---
 fs/iomap/buffered-io.c | 44 +++++++++++++++++++++++++++++++++---------
 1 file changed, 35 insertions(+), 9 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 7b9193f8243a..0952a3debe11 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -873,6 +873,25 @@ static int iomap_write_begin(struct iomap_iter *iter,
 	return status;
 }
 
+static int iomap_trim_tail_partial(struct inode *inode, loff_t pos,
+		size_t copied, struct folio *folio)
+{
+	struct iomap_folio_state *ifs = folio->private;
+	unsigned block_size, last_blk, last_blk_bytes;
+
+	if (!ifs || !copied)
+		return 0;
+
+	block_size = 1 << inode->i_blkbits;
+	last_blk = offset_in_folio(folio, pos + copied - 1) >> inode->i_blkbits;
+	last_blk_bytes = (pos + copied) & (block_size - 1);
+
+	if (!ifs_block_is_uptodate(ifs, last_blk))
+		copied -= min(copied, last_blk_bytes);
+
+	return copied;
+}
+
 static int __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
 		size_t copied, struct folio *folio)
 {
@@ -881,17 +900,24 @@ static int __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
 	/*
 	 * The blocks that were entirely written will now be uptodate, so we
 	 * don't have to worry about a read_folio reading them and overwriting a
-	 * partial write.  However, if we've encountered a short write and only
-	 * partially written into a block, it will not be marked uptodate, so a
-	 * read_folio might come in and destroy our partial write.
+	 * partial write.
 	 *
-	 * Do the simplest thing and just treat any short write to a
-	 * non-uptodate page as a zero-length write, and force the caller to
-	 * redo the whole thing.
+	 * However, if we've encountered a short write and only partially
+	 * written into a block, we must discard the short-written _tail_ block
+	 * and not mark it uptodate in the ifs, to ensure a read_folio reading
+	 * can handle it correctly via iomap_adjust_read_range(). It's safe to
+	 * keep the non-tail block writes because we know that for a non-tail
+	 * block:
+	 * - is either fully written, since copy_from_user() is sequential
+	 * - or is a partially written head block that has already been read in
+	 *   and marked uptodate in the ifs by iomap_write_begin().
 	 */
-	if (unlikely(copied < len && !folio_test_uptodate(folio)))
-		return 0;
-	iomap_set_range_uptodate(folio, offset_in_folio(folio, pos), len);
+	if (unlikely(copied < len && !folio_test_uptodate(folio))) {
+		copied = iomap_trim_tail_partial(inode, pos, copied, folio);
+		if (!copied)
+			return 0;
+	}
+	iomap_set_range_uptodate(folio, offset_in_folio(folio, pos), copied);
 	iomap_set_range_dirty(folio, offset_in_folio(folio, pos), copied);
 	filemap_dirty_folio(inode->i_mapping, folio);
 	return copied;
-- 
2.49.0

Re: [PATCH 4/4] iomap: don't abandon the whole copy when we have iomap_folio_state

Posted by Pankaj Raghav (Samsung) 2 weeks, 3 days ago

> +static int iomap_trim_tail_partial(struct inode *inode, loff_t pos,
> +		size_t copied, struct folio *folio)
> +{
> +	struct iomap_folio_state *ifs = folio->private;
> +	unsigned block_size, last_blk, last_blk_bytes;
> +
> +	if (!ifs || !copied)
> +		return 0;
> +
> +	block_size = 1 << inode->i_blkbits;
> +	last_blk = offset_in_folio(folio, pos + copied - 1) >> inode->i_blkbits;
> +	last_blk_bytes = (pos + copied) & (block_size - 1);
> +
> +	if (!ifs_block_is_uptodate(ifs, last_blk))
> +		copied -= min(copied, last_blk_bytes);

If pos is aligned to block_size, is there a scenario where 
copied < last_blk_bytes?

Trying to understand why you are using a min() here.
--
Pankaj

Re: [PATCH 4/4] iomap: don't abandon the whole copy when we have iomap_folio_state

Posted by Jinliang Zheng 2 weeks, 3 days ago

On Mon, 15 Sep 2025 12:50:54 +0200, kernel@pankajraghav.com wrote:
> > +static int iomap_trim_tail_partial(struct inode *inode, loff_t pos,
> > +		size_t copied, struct folio *folio)
> > +{
> > +	struct iomap_folio_state *ifs = folio->private;
> > +	unsigned block_size, last_blk, last_blk_bytes;
> > +
> > +	if (!ifs || !copied)
> > +		return 0;
> > +
> > +	block_size = 1 << inode->i_blkbits;
> > +	last_blk = offset_in_folio(folio, pos + copied - 1) >> inode->i_blkbits;
> > +	last_blk_bytes = (pos + copied) & (block_size - 1);
> > +
> > +	if (!ifs_block_is_uptodate(ifs, last_blk))
> > +		copied -= min(copied, last_blk_bytes);
> 
> If pos is aligned to block_size, is there a scenario where 
> copied < last_blk_bytes?

I believe there is no other scenario. The min() here is specifically to handle cases where
pos is not aligned to block_size. But please note that the pos here is unrelated to the pos
in iomap_adjust_read_range().
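
To make the unaligned case concrete, here is a minimal user-space sketch of
the same arithmetic. trim_tail() is a hypothetical stand-in, and its
tail_uptodate argument only models the result of ifs_block_is_uptodate()
for the last touched block; whether such a block can really be
non-uptodate at this point depends on iomap_write_begin().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model of the trim arithmetic. last_blk_bytes is measured
 * from the start of the block, so with an unaligned pos it also counts
 * the bytes in front of pos and can exceed `copied`.
 */
static size_t trim_tail(uint64_t pos, size_t copied, unsigned blkbits,
			int tail_uptodate)
{
	uint64_t block_size;
	size_t last_blk_bytes, trim;

	assert(blkbits < 32);
	block_size = (uint64_t)1 << blkbits;

	if (!copied || tail_uptodate)
		return copied;

	last_blk_bytes = (size_t)((pos + copied) & (block_size - 1));
	/* The clamp plays the role of min(): avoid size_t underflow. */
	trim = last_blk_bytes < copied ? last_blk_bytes : copied;
	return copied - trim;
}
```

For pos = 1kB, copied = 2kB and 4kB blocks, last_blk_bytes is 3kB, which is
larger than copied; the clamp makes the result 0 instead of wrapping
around. For an aligned pos = 4kB with copied = 5kB, last_blk_bytes is 1kB
and the result is 4kB.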

thanks,
Jinliang Zheng. :)

> 
> Trying to understand why you are using a min() here.
> --
> Pankaj

Re: [PATCH 4/4] iomap: don't abandon the whole copy when we have iomap_folio_state

Posted by Pankaj Raghav (Samsung) 2 weeks, 3 days ago

On Mon, Sep 15, 2025 at 07:12:28PM +0800, Jinliang Zheng wrote:
> On Mon, 15 Sep 2025 12:50:54 +0200, kernel@pankajraghav.com wrote:
> > > +static int iomap_trim_tail_partial(struct inode *inode, loff_t pos,
> > > +		size_t copied, struct folio *folio)
> > > +{
> > > +	struct iomap_folio_state *ifs = folio->private;
> > > +	unsigned block_size, last_blk, last_blk_bytes;
> > > +
> > > +	if (!ifs || !copied)
> > > +		return 0;
> > > +
> > > +	block_size = 1 << inode->i_blkbits;
> > > +	last_blk = offset_in_folio(folio, pos + copied - 1) >> inode->i_blkbits;
> > > +	last_blk_bytes = (pos + copied) & (block_size - 1);
> > > +
> > > +	if (!ifs_block_is_uptodate(ifs, last_blk))
> > > +		copied -= min(copied, last_blk_bytes);
> > 
> > If pos is aligned to block_size, is there a scenario where 
> > copied < last_blk_bytes?
> 
> I believe there is no other scenario. The min() here is specifically to handle cases where
> pos is not aligned to block_size. But please note that the pos here is unrelated to the pos
> in iomap_adjust_read_range().

Ah, you are right. This is about write and not read. I got a bit
confused after reading both the patches back to back.

--
Pankaj