From: Jinliang Zheng <alexjlzheng@tencent.com>

Currently, if a partial write occurs during a buffered write, the entire
write is discarded. While this is an uncommon case, it is still wasteful
and we can do better.

With iomap_folio_state, we can track uptodate state at the block level,
and read_folio can correctly handle partially uptodate folios.

Therefore, when a partial write occurs, accept the block-aligned part of
the write instead of rejecting the entire write.

For example, suppose a folio is 2MB, the block size is 4kB, and 2MB-3kB
bytes were copied.

Without this patchset, we'd need to recopy from the beginning of the
folio in the next iteration, which means 2MB-3kB bytes are copied twice.

  |<-------------------- 2MB -------------------->|
  +-------+-------+-------+-------+-------+-------+
  | block |  ...  | block | block |  ...  | block | folio
  +-------+-------+-------+-------+-------+-------+
  |<-4kB->|

  |<--------------- copied 2MB-3kB --------->|       first time copied
  |<-------- 1MB -------->|                          next time we need copy (chunk /= 2)
                          |<-------- 1MB -------->|  next next time we need copy.

  |<------ 2MB-3kB bytes duplicate copy ---->|

With this patchset, we can accept 2MB-4kB bytes, which is block-aligned.
This means we only need to process the remaining 4kB in the next iteration,
so only 1kB has to be copied twice.

  |<-------------------- 2MB -------------------->|
  +-------+-------+-------+-------+-------+-------+
  | block |  ...  | block | block |  ...  | block | folio
  +-------+-------+-------+-------+-------+-------+
  |<-4kB->|

  |<--------------- copied 2MB-3kB --------->|       first time copied
                                          |<-4kB->|  next time we need copy

                                          |<>|
                              only 1kB bytes duplicate copy

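
The arithmetic in the example above can be checked with a small userspace
model. This is only a sketch, not the kernel code: model_trim_tail() and its
tail_uptodate flag are made-up stand-ins for iomap_trim_tail_partial() and
ifs_block_is_uptodate(), and the folio is assumed to start at pos 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Userspace model of the tail trimming; NOT the kernel function.
 * tail_uptodate is a hypothetical stand-in for
 * ifs_block_is_uptodate(ifs, last_blk).
 */
static size_t model_trim_tail(uint64_t pos, size_t copied,
			      unsigned int blkbits, bool tail_uptodate)
{
	size_t block_size = (size_t)1 << blkbits;
	size_t last_blk_bytes;

	if (!copied)
		return 0;

	/* In-block offset at which the short copy stopped. */
	last_blk_bytes = (pos + copied) & (block_size - 1);

	/* Drop the partially written tail block unless it is already uptodate. */
	if (!tail_uptodate)
		copied -= last_blk_bytes < copied ? last_blk_bytes : copied;

	return copied;
}

/* The 2MB folio / 4kB block example from the commit message. */
static void example_2mb_folio(void)
{
	const size_t kB = 1024, MB = 1024 * kB;
	size_t copied = 2 * MB - 3 * kB;
	size_t accepted = model_trim_tail(0, copied, 12, false);

	assert(accepted == 2 * MB - 4 * kB);	/* block-aligned part is kept */
	assert(copied - accepted == 1 * kB);	/* only this much is copied again */
}
```

Calling example_2mb_folio() from a test harness exercises both assertions. If
the tail block is already uptodate, the model returns copied unchanged.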
Although partial writes are relatively unusual and account for only a small
share of performance testing, the optimization still makes sense in
large-scale data centers.

Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
---
fs/iomap/buffered-io.c | 44 +++++++++++++++++++++++++++++++++---------
1 file changed, 35 insertions(+), 9 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 7b9193f8243a..0952a3debe11 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -873,6 +873,25 @@ static int iomap_write_begin(struct iomap_iter *iter,
 	return status;
 }
 
+static int iomap_trim_tail_partial(struct inode *inode, loff_t pos,
+		size_t copied, struct folio *folio)
+{
+	struct iomap_folio_state *ifs = folio->private;
+	unsigned block_size, last_blk, last_blk_bytes;
+
+	if (!ifs || !copied)
+		return 0;
+
+	block_size = 1 << inode->i_blkbits;
+	last_blk = offset_in_folio(folio, pos + copied - 1) >> inode->i_blkbits;
+	last_blk_bytes = (pos + copied) & (block_size - 1);
+
+	if (!ifs_block_is_uptodate(ifs, last_blk))
+		copied -= min(copied, last_blk_bytes);
+
+	return copied;
+}
+
 static int __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
 		size_t copied, struct folio *folio)
 {
@@ -881,17 +900,24 @@ static int __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
 	/*
 	 * The blocks that were entirely written will now be uptodate, so we
 	 * don't have to worry about a read_folio reading them and overwriting a
-	 * partial write.  However, if we've encountered a short write and only
-	 * partially written into a block, it will not be marked uptodate, so a
-	 * read_folio might come in and destroy our partial write.
+	 * partial write.
 	 *
-	 * Do the simplest thing and just treat any short write to a
-	 * non-uptodate page as a zero-length write, and force the caller to
-	 * redo the whole thing.
+	 * However, if we've encountered a short write and only partially
+	 * written into a block, we must discard the short-written _tail_ block
+	 * and not mark it uptodate in the ifs, to ensure a read_folio reading
+	 * can handle it correctly via iomap_adjust_read_range(). It's safe to
+	 * keep the non-tail block writes because we know that for a non-tail
+	 * block:
+	 * - is either fully written, since copy_from_user() is sequential
+	 * - or is a partially written head block that has already been read in
+	 *   and marked uptodate in the ifs by iomap_write_begin().
 	 */
-	if (unlikely(copied < len && !folio_test_uptodate(folio)))
-		return 0;
-	iomap_set_range_uptodate(folio, offset_in_folio(folio, pos), len);
+	if (unlikely(copied < len && !folio_test_uptodate(folio))) {
+		copied = iomap_trim_tail_partial(inode, pos, copied, folio);
+		if (!copied)
+			return 0;
+	}
+	iomap_set_range_uptodate(folio, offset_in_folio(folio, pos), copied);
 	iomap_set_range_dirty(folio, offset_in_folio(folio, pos), copied);
 	filemap_dirty_folio(inode->i_mapping, folio);
 	return copied;
--
2.49.0

> +static int iomap_trim_tail_partial(struct inode *inode, loff_t pos,
> +		size_t copied, struct folio *folio)
> +{
> +	struct iomap_folio_state *ifs = folio->private;
> +	unsigned block_size, last_blk, last_blk_bytes;
> +
> +	if (!ifs || !copied)
> +		return 0;
> +
> +	block_size = 1 << inode->i_blkbits;
> +	last_blk = offset_in_folio(folio, pos + copied - 1) >> inode->i_blkbits;
> +	last_blk_bytes = (pos + copied) & (block_size - 1);
> +
> +	if (!ifs_block_is_uptodate(ifs, last_blk))
> +		copied -= min(copied, last_blk_bytes);

If pos is aligned to block_size, is there a scenario where
copied < last_blk_bytes?

Trying to understand why you are using a min() here.

--
Pankaj

On Mon, 15 Sep 2025 12:50:54 +0200, kernel@pankajraghav.com wrote:
> > +static int iomap_trim_tail_partial(struct inode *inode, loff_t pos,
> > +		size_t copied, struct folio *folio)
> > +{
> > +	struct iomap_folio_state *ifs = folio->private;
> > +	unsigned block_size, last_blk, last_blk_bytes;
> > +
> > +	if (!ifs || !copied)
> > +		return 0;
> > +
> > +	block_size = 1 << inode->i_blkbits;
> > +	last_blk = offset_in_folio(folio, pos + copied - 1) >> inode->i_blkbits;
> > +	last_blk_bytes = (pos + copied) & (block_size - 1);
> > +
> > +	if (!ifs_block_is_uptodate(ifs, last_blk))
> > +		copied -= min(copied, last_blk_bytes);
>
> If pos is aligned to block_size, is there a scenario where
> copied < last_blk_bytes?

I believe there is no other scenario. The min() here is specifically to handle cases where
pos is not aligned to block_size. But please note that the pos here is unrelated to the pos
in iomap_adjust_read_range().

thanks,
Jinliang Zheng. :)

>
> Trying to understand why you are using a min() here.
> --
> Pankaj

On Mon, Sep 15, 2025 at 07:12:28PM +0800, Jinliang Zheng wrote:
> On Mon, 15 Sep 2025 12:50:54 +0200, kernel@pankajraghav.com wrote:
> > > +static int iomap_trim_tail_partial(struct inode *inode, loff_t pos,
> > > +		size_t copied, struct folio *folio)
> > > +{
> > > +	struct iomap_folio_state *ifs = folio->private;
> > > +	unsigned block_size, last_blk, last_blk_bytes;
> > > +
> > > +	if (!ifs || !copied)
> > > +		return 0;
> > > +
> > > +	block_size = 1 << inode->i_blkbits;
> > > +	last_blk = offset_in_folio(folio, pos + copied - 1) >> inode->i_blkbits;
> > > +	last_blk_bytes = (pos + copied) & (block_size - 1);
> > > +
> > > +	if (!ifs_block_is_uptodate(ifs, last_blk))
> > > +		copied -= min(copied, last_blk_bytes);
> >
> > If pos is aligned to block_size, is there a scenario where
> > copied < last_blk_bytes?
>
> I believe there is no other scenario. The min() here is specifically to handle cases where
> pos is not aligned to block_size. But please note that the pos here is unrelated to the pos
> in iomap_adjust_read_range().

Ah, you are right. This is about write and not read. I got a bit
confused after reading both the patches back to back.

--
Pankaj
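
For readers following the min() question in this exchange, the two cases can
be worked through with plain arithmetic. This is a rough userspace sketch under
stated assumptions: tail_bytes_to_drop() is a hypothetical helper that mirrors
only the min(copied, last_blk_bytes) expression from the patch, and it says
nothing about whether the block in question is actually uptodate in the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not a kernel function: how many bytes the
 * min(copied, last_blk_bytes) expression would subtract from copied.
 */
static size_t tail_bytes_to_drop(uint64_t pos, size_t copied,
				 unsigned int blkbits)
{
	size_t block_size = (size_t)1 << blkbits;
	size_t last_blk_bytes = (pos + copied) & (block_size - 1);

	/* Clamp so that "copied -= ..." can never wrap below zero. */
	return last_blk_bytes < copied ? last_blk_bytes : copied;
}

static void example_min_clamp(void)
{
	/*
	 * Block-aligned pos: 5kB copied from offset 4kB ends 1kB into a
	 * block, and that in-block offset cannot exceed copied.
	 */
	assert(tail_bytes_to_drop(4096, 5120, 12) == 1024);

	/*
	 * Unaligned pos: the write starts 3kB into a block and copies only
	 * 512 bytes, so last_blk_bytes is 3584 while copied is 512. The
	 * clamp limits the subtraction to the 512 bytes actually copied.
	 */
	assert(tail_bytes_to_drop(4096 + 3072, 512, 12) == 512);
}
```

In the second case an unclamped subtraction of 3584 from a size_t of 512 would
wrap around, which is the situation the min() guards against.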