[RFC 2/3] iomap: Enable stable writes for RWF_WRITETHROUGH inodes

Ojaswin Mujoo posted 3 patches 1 month ago
There is a newer version of this series
[RFC 2/3] iomap: Enable stable writes for RWF_WRITETHROUGH inodes
Posted by Ojaswin Mujoo 1 month ago
Currently, RWF_WRITETHROUGH writes wait for writeback to complete
on a folio before performing the writethrough. This serializes
writethrough with each other and the writeback path. However, it is also
desirable have similar guarantees between RWF_WRITETHROUGH and non
writethrough writes.

Hence, ensure stable writes are enabled on an inode's mapping as
long as a writethrough write is ongoing. This way, all paths will
wait for RWF_WRITETHROUGH to complete on a folio before proceeding.

To track inflight writethrough writes, we use an atomic counter in the
inode->i_mapping. This struct was chosen because (i) writethrough is an
operation on the folio and (ii) we don't want to add bloat to struct
inode.

Suggested-by: Dave Chinner <dgc@kernel.org>
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
---
 fs/inode.c             |  1 +
 fs/iomap/buffered-io.c | 35 +++++++++++++++++++++++++++++++++--
 fs/iomap/direct-io.c   |  2 ++
 include/linux/fs.h     |  2 ++
 include/linux/iomap.h  |  2 ++
 5 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index cc12b68e021b..5b779c112ff8 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -280,6 +280,7 @@ int inode_init_always_gfp(struct super_block *sb, struct inode *inode, gfp_t gfp
 	mapping->flags = 0;
 	mapping->wb_err = 0;
 	atomic_set(&mapping->i_mmap_writable, 0);
+	atomic_set(&mapping->i_wt_count, 0);
 #ifdef CONFIG_READ_ONLY_THP_FOR_FS
 	atomic_set(&mapping->nr_thps, 0);
 #endif
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index ab169daa1126..9d4d459af1a0 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1150,11 +1150,41 @@ static bool iomap_writethrough_checks(struct kiocb *iocb, size_t off, loff_t len
 	return true;
 }
 
+/**
+ * inode_writethrough_begin - signal start of a RWF_WRITETHROUGH request
+ * @inode: inode the writethrough happens on
+ *
+ * This is called when we are about to start a writethrough on an inode.
+ * If it is the first writethrough, set the mapping as stable to ensure
+ * other folio operations wait for writeback to finish.
+ *
+ * To avoid a race, just set the mapping stable first and then increment
+ * writethrough count, so that the stable writes are enforced as soon as
+ * writethrough count becomes non zero.
+ */
+inline void inode_writethrough_begin(struct inode *inode)
+{
+	mapping_set_stable_writes(inode->i_mapping);
+	atomic_inc(&inode->i_mapping->i_wt_count);
+}
+
+/**
+ * inode_writethrough_end - signal finish of a RWF_WRITETHROUGH request
+ * @inode: inode the writethrough I/O happened on
+ *
+ * This is called once we've finished processing a writethrough request
+ */
+inline void inode_writethrough_end(struct inode *inode)
+{
+	if (atomic_dec_and_test(&inode->i_mapping->i_wt_count))
+		mapping_clear_stable_writes(inode->i_mapping);
+}
+
 /*
  * With writethrough, we might potentially be writing through a partial
  * folio hence we don't clear the dirty bit (yet)
  */
-static void folio_prepare_writethrough(struct folio *folio)
+static void folio_prepare_writethrough(struct inode *inode, struct folio *folio)
 {
 	if (folio_test_writeback(folio))
 		folio_wait_writeback(folio);
@@ -1167,6 +1197,7 @@ static void folio_prepare_writethrough(struct folio *folio)
 		/* Refer folio_clear_dirty_for_io() for why this is needed */
 		folio_mark_dirty(folio);
 
+	inode_writethrough_begin(inode);
 }
 
 /**
@@ -1203,7 +1234,7 @@ static int iomap_writethrough_begin(struct kiocb *iocb, struct folio *folio,
 	bool fully_written;
 	u64 zero = 0;
 
-	folio_prepare_writethrough(folio);
+	folio_prepare_writethrough(iter->inode, folio);
 
 	wt_ctx->bvec = kmalloc(sizeof(struct bio_vec), GFP_KERNEL | GFP_NOFS);
 	if (!wt_ctx->bvec)
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index f4d8ff08a83a..12680d97d765 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -140,6 +140,8 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
 		kiocb_invalidate_post_direct_write(iocb, dio->size);
 
 	inode_dio_end(file_inode(iocb->ki_filp));
+	if (dio->flags & IOMAP_DIO_BUF_WRITETHROUGH)
+		inode_writethrough_end(file_inode(iocb->ki_filp));
 
 	if (ret > 0) {
 		iocb->ki_pos += ret;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ca291957140e..6b7491fdd51a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -456,6 +456,7 @@ extern const struct address_space_operations empty_aops;
  *   memory mappings.
  * @gfp_mask: Memory allocation flags to use for allocating pages.
  * @i_mmap_writable: Number of VM_SHARED, VM_MAYWRITE mappings.
+ * @i_wt_count: Number of RWF_WRITETHROUGH writes ongoing in mapping.
  * @nr_thps: Number of THPs in the pagecache (non-shmem only).
  * @i_mmap: Tree of private and shared mappings.
  * @i_mmap_rwsem: Protects @i_mmap and @i_mmap_writable.
@@ -474,6 +475,7 @@ struct address_space {
 	struct rw_semaphore	invalidate_lock;
 	gfp_t			gfp_mask;
 	atomic_t		i_mmap_writable;
+	atomic_t		i_wt_count;
 #ifdef CONFIG_READ_ONLY_THP_FOR_FS
 	/* number of thp, only for non-shmem files */
 	atomic_t		nr_thps;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index b96574bb2918..6d08b966ceaf 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -630,6 +630,8 @@ struct iomap_writethrough_ops {
 ssize_t iomap_file_writethrough_write(struct kiocb *iocb, struct iov_iter *i,
 				      const struct iomap_writethrough_ops *wt_ops,
 				      void *private);
+inline void inode_writethrough_begin(struct inode *inode);
+inline void inode_writethrough_end(struct inode *inode);
 
 #ifdef CONFIG_SWAP
 struct file;
-- 
2.52.0
Re: [RFC 2/3] iomap: Enable stable writes for RWF_WRITETHROUGH inodes
Posted by Darrick J. Wong 1 month ago
On Mon, Mar 09, 2026 at 11:04:32PM +0530, Ojaswin Mujoo wrote:
> Currently, RWF_WRITETHROUGH writes wait for writeback to complete
> on a folio before performing the writethrough. This serializes
> writethrough with each other and the writeback path. However, it is also
> desirable have similar guarantees between RWF_WRITETHROUGH and non
> writethrough writes.
> 
> Hence, ensure stable writes are enabled on an inode's mapping as
> long as a writethrough write is ongoing. This way, all paths will
> wait for RWF_WRITETHROUGH to complete on a folio before proceeding.
> 
> To track inflight writethrough writes, we use an atomic counter in the
> inode->i_mapping. This struct was chosen because (i) writethrough is an
> operation on the folio and (ii) we don't want to add bloat to struct
> inode.

What if we just set it whenever someone successfully initiates a
RWF_WRITETHROUGH write?  Then we wouldn't need all this atomic counter
machinery.

Also: What if some filesystem (not xfs, obviously) finds a need to
change the stablepages bit while there might be writethrough writes in
progress?  It's a little awkward to have a flag /and/ a counter; why not
change mapping_{set,clear}_stable_pages to inc and dec the counter and
base the test off that?

--D

> Suggested-by: Dave Chinner <dgc@kernel.org>
> Suggested-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> ---
>  fs/inode.c             |  1 +
>  fs/iomap/buffered-io.c | 35 +++++++++++++++++++++++++++++++++--
>  fs/iomap/direct-io.c   |  2 ++
>  include/linux/fs.h     |  2 ++
>  include/linux/iomap.h  |  2 ++
>  5 files changed, 40 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index cc12b68e021b..5b779c112ff8 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -280,6 +280,7 @@ int inode_init_always_gfp(struct super_block *sb, struct inode *inode, gfp_t gfp
>  	mapping->flags = 0;
>  	mapping->wb_err = 0;
>  	atomic_set(&mapping->i_mmap_writable, 0);
> +	atomic_set(&mapping->i_wt_count, 0);
>  #ifdef CONFIG_READ_ONLY_THP_FOR_FS
>  	atomic_set(&mapping->nr_thps, 0);
>  #endif
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index ab169daa1126..9d4d459af1a0 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1150,11 +1150,41 @@ static bool iomap_writethrough_checks(struct kiocb *iocb, size_t off, loff_t len
>  	return true;
>  }
>  
> +/**
> + * inode_writethrough_begin - signal start of a RWF_WRITETHROUGH request
> + * @inode: inode the writethrough happens on
> + *
> + * This is called when we are about to start a writethrough on an inode.
> + * If it is the first writethrough, set the mapping as stable to ensure
> + * other folio operations wait for writeback to finish.
> + *
> + * To avoid a race, just set the mapping stable first and then increment
> + * writethrough count, so that the stable writes are enforced as soon as
> + * writethrough count becomes non zero.
> + */
> +inline void inode_writethrough_begin(struct inode *inode)
> +{
> +	mapping_set_stable_writes(inode->i_mapping);
> +	atomic_inc(&inode->i_mapping->i_wt_count);
> +}
> +
> +/**
> + * inode_writethrough_end - signal finish of a RWF_WRITETHROUGH request
> + * @inode: inode the writethrough I/O happened on
> + *
> + * This is called once we've finished processing a writethrough request
> + */
> +inline void inode_writethrough_end(struct inode *inode)
> +{
> +	if (atomic_dec_and_test(&inode->i_mapping->i_wt_count))
> +		mapping_clear_stable_writes(inode->i_mapping);
> +}
> +
>  /*
>   * With writethrough, we might potentially be writing through a partial
>   * folio hence we don't clear the dirty bit (yet)
>   */
> -static void folio_prepare_writethrough(struct folio *folio)
> +static void folio_prepare_writethrough(struct inode *inode, struct folio *folio)
>  {
>  	if (folio_test_writeback(folio))
>  		folio_wait_writeback(folio);
> @@ -1167,6 +1197,7 @@ static void folio_prepare_writethrough(struct folio *folio)
>  		/* Refer folio_clear_dirty_for_io() for why this is needed */
>  		folio_mark_dirty(folio);
>  
> +	inode_writethrough_begin(inode);
>  }
>  
>  /**
> @@ -1203,7 +1234,7 @@ static int iomap_writethrough_begin(struct kiocb *iocb, struct folio *folio,
>  	bool fully_written;
>  	u64 zero = 0;
>  
> -	folio_prepare_writethrough(folio);
> +	folio_prepare_writethrough(iter->inode, folio);
>  
>  	wt_ctx->bvec = kmalloc(sizeof(struct bio_vec), GFP_KERNEL | GFP_NOFS);
>  	if (!wt_ctx->bvec)
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index f4d8ff08a83a..12680d97d765 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -140,6 +140,8 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
>  		kiocb_invalidate_post_direct_write(iocb, dio->size);
>  
>  	inode_dio_end(file_inode(iocb->ki_filp));
> +	if (dio->flags & IOMAP_DIO_BUF_WRITETHROUGH)
> +		inode_writethrough_end(file_inode(iocb->ki_filp));
>  
>  	if (ret > 0) {
>  		iocb->ki_pos += ret;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index ca291957140e..6b7491fdd51a 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -456,6 +456,7 @@ extern const struct address_space_operations empty_aops;
>   *   memory mappings.
>   * @gfp_mask: Memory allocation flags to use for allocating pages.
>   * @i_mmap_writable: Number of VM_SHARED, VM_MAYWRITE mappings.
> + * @i_wt_count: Number of RWF_WRITETHROUGH writes ongoing in mapping.
>   * @nr_thps: Number of THPs in the pagecache (non-shmem only).
>   * @i_mmap: Tree of private and shared mappings.
>   * @i_mmap_rwsem: Protects @i_mmap and @i_mmap_writable.
> @@ -474,6 +475,7 @@ struct address_space {
>  	struct rw_semaphore	invalidate_lock;
>  	gfp_t			gfp_mask;
>  	atomic_t		i_mmap_writable;
> +	atomic_t		i_wt_count;
>  #ifdef CONFIG_READ_ONLY_THP_FOR_FS
>  	/* number of thp, only for non-shmem files */
>  	atomic_t		nr_thps;
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index b96574bb2918..6d08b966ceaf 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -630,6 +630,8 @@ struct iomap_writethrough_ops {
>  ssize_t iomap_file_writethrough_write(struct kiocb *iocb, struct iov_iter *i,
>  				      const struct iomap_writethrough_ops *wt_ops,
>  				      void *private);
> +inline void inode_writethrough_begin(struct inode *inode);
> +inline void inode_writethrough_end(struct inode *inode);
>  
>  #ifdef CONFIG_SWAP
>  struct file;
> -- 
> 2.52.0
> 
>
Re: [RFC 2/3] iomap: Enable stable writes for RWF_WRITETHROUGH inodes
Posted by Ritesh Harjani (IBM) 1 month ago
"Darrick J. Wong" <djwong@kernel.org> writes:

> On Mon, Mar 09, 2026 at 11:04:32PM +0530, Ojaswin Mujoo wrote:
>> Currently, RWF_WRITETHROUGH writes wait for writeback to complete
>> on a folio before performing the writethrough. This serializes
>> writethrough with each other and the writeback path. However, it is also
>> desirable have similar guarantees between RWF_WRITETHROUGH and non
>> writethrough writes.
>> 
>> Hence, ensure stable writes are enabled on an inode's mapping as
>> long as a writethrough write is ongoing. This way, all paths will
>> wait for RWF_WRITETHROUGH to complete on a folio before proceeding.
>> 
>> To track inflight writethrough writes, we use an atomic counter in the
>> inode->i_mapping. This struct was chosen because (i) writethrough is an
>> operation on the folio and (ii) we don't want to add bloat to struct
>> inode.

Now I am also questioning the need of this counter.
If mapping has AS_STABLE_WRITES bit set, then that means the
inode->mapping is going through stable writes until that bit is
cleared. And since in future we are going to add support of async
buffered write-through, so the stable writes bit should get cleared in
the completion path (like how it is done now.) 

>
> What if we just set it whenever someone successfully initiates a
> RWF_WRITETHROUGH write?  Then we wouldn't need all this atomic counter
> machinery.
>

I agree. If we set the mapping as stable before initiating
iomap_write_begin() itself, then we don't need this atomic counter.

Maybe, we can set it in iomap_file_writethrough_write() itself
(we have mapping available from iocb).


> Also: What if some filesystem (not xfs, obviously) finds a need to
> change the stablepages bit while there might be writethrough writes in
> progress?

Is there a usecase where this can happen (just curious)?

> It's a little awkward to have a flag /and/ a counter; why not
> change mapping_{set,clear}_stable_pages to inc and dec the counter and
> base the test off that?
>

Yes, either ways, I agree that I don't see the need of an extra counter here.

-ritesh
Re: [RFC 2/3] iomap: Enable stable writes for RWF_WRITETHROUGH inodes
Posted by Ojaswin Mujoo 4 weeks, 1 day ago
On Tue, Mar 10, 2026 at 10:55:04AM +0530, Ritesh Harjani wrote:
> "Darrick J. Wong" <djwong@kernel.org> writes:
> 
> > On Mon, Mar 09, 2026 at 11:04:32PM +0530, Ojaswin Mujoo wrote:
> >> Currently, RWF_WRITETHROUGH writes wait for writeback to complete
> >> on a folio before performing the writethrough. This serializes
> >> writethrough with each other and the writeback path. However, it is also
> >> desirable have similar guarantees between RWF_WRITETHROUGH and non
> >> writethrough writes.
> >> 
> >> Hence, ensure stable writes are enabled on an inode's mapping as
> >> long as a writethrough write is ongoing. This way, all paths will
> >> wait for RWF_WRITETHROUGH to complete on a folio before proceeding.
> >> 
> >> To track inflight writethrough writes, we use an atomic counter in the
> >> inode->i_mapping. This struct was chosen because (i) writethrough is an
> >> operation on the folio and (ii) we don't want to add bloat to struct
> >> inode.
> 
> Now I am also questioning the need of this counter.
> If mapping has AS_STABLE_WRITES bit set, then that means the
> inode->mapping is going through stable writes until that bit is
> cleared. And since in future we are going to add support of async
> buffered write-through, so the stable writes bit should get cleared in
> the completion path (like how it is done now.) 
> 
> >
> > What if we just set it whenever someone successfully initiates a
> > RWF_WRITETHROUGH write?  Then we wouldn't need all this atomic counter
> > machinery.
> >

> 
> I agree. If we set the mapping as stable before initiating
> iomap_write_begin() itself, then we don't need this atomic counter.
> 
> Maybe, we can set it in iomap_file_writethrough_write() itself
> (we have mapping available from iocb).

Hi Darrick, Ritesh,

Yes, I think we don't need the counter to know when to switch stable
writes on and off. Now that I'm thinking about it, maybe a mapping level
stable write is too restrictive? I understand that for certain hardware we
need it at mapping level but for cases like writethrough, all we need is
that particular folio to complete writeback. Why should we serialize
it with other non overlapping writes.

Maybe implementing a folio level stable writes or sprinkling around
folio_wait_writeback() makes more sense?

Also since we are on this topic, another thing that I should change is
where we call folio_mkclean(). Right now we call folio_mkclean() after
copying user data to pagecache, which means theres a window where mmap
write might change the data. I think we should proactively call it
before the memcpy?

> 
> 
> > Also: What if some filesystem (not xfs, obviously) finds a need to
> > change the stablepages bit while there might be writethrough writes in
> > progress?
> 
> Is there a usecase where this can happen (just curious)?
> 
> > It's a little awkward to have a flag /and/ a counter; why not
> > change mapping_{set,clear}_stable_pages to inc and dec the counter and
> > base the test off that?
> >
> 
> Yes, either ways, I agree that I don't see the need of an extra counter here.
> 
> -ritesh