[v8] large atomic writes for xfs

[PATCH v8 11/15] xfs: commit CoW-based atomic writes atomically

Posted by John Garry 9 months, 3 weeks ago

When completing a CoW-based write, each extent range mapping update is
covered by a separate transaction.

For a CoW-based atomic write, all mappings must be changed at once, so
change to use a single transaction.

Note that there is a limit on the number of log intent items which can
fit into a single transaction, but this is being ignored for now since
the count of items for a typical atomic write would be much less than is
typically supported. A typical atomic write would be expected to be 64KB
or less, which (with 4KB filesystem blocks) means at most 16 extent
unmaps, which is quite small.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: add tr_atomic_ioend]
Signed-off-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_trans_resv.c | 18 +++++++++++
 fs/xfs/libxfs/xfs_trans_resv.h |  1 +
 fs/xfs/xfs_file.c              |  5 ++-
 fs/xfs/xfs_reflink.c           | 56 ++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_reflink.h           |  2 ++
 5 files changed, 81 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
index 580d00ae2857..6c74f47f980a 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.c
+++ b/fs/xfs/libxfs/xfs_trans_resv.c
@@ -1284,6 +1284,18 @@ xfs_calc_namespace_reservations(
 	resp->tr_mkdir.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
 }
 
+STATIC void
+xfs_calc_default_atomic_ioend_reservation(
+	struct xfs_mount	*mp,
+	struct xfs_trans_resv	*resp)
+{
+	if (xfs_has_reflink(mp))
+		resp->tr_atomic_ioend = resp->tr_itruncate;
+	else
+		memset(&resp->tr_atomic_ioend, 0,
+				sizeof(resp->tr_atomic_ioend));
+}
+
 void
 xfs_trans_resv_calc(
 	struct xfs_mount	*mp,
@@ -1378,4 +1390,10 @@ xfs_trans_resv_calc(
 	resp->tr_itruncate.tr_logcount += logcount_adj;
 	resp->tr_write.tr_logcount += logcount_adj;
 	resp->tr_qm_dqalloc.tr_logcount += logcount_adj;
+
+	/*
+	 * Now that we've finished computing the static reservations, we can
+	 * compute the dynamic reservation for atomic writes.
+	 */
+	xfs_calc_default_atomic_ioend_reservation(mp, resp);
 }
diff --git a/fs/xfs/libxfs/xfs_trans_resv.h b/fs/xfs/libxfs/xfs_trans_resv.h
index d9d0032cbbc5..670045d417a6 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.h
+++ b/fs/xfs/libxfs/xfs_trans_resv.h
@@ -48,6 +48,7 @@ struct xfs_trans_resv {
 	struct xfs_trans_res	tr_qm_dqalloc;	/* allocate quota on disk */
 	struct xfs_trans_res	tr_sb;		/* modify superblock */
 	struct xfs_trans_res	tr_fsyncts;	/* update timestamps on fsync */
+	struct xfs_trans_res	tr_atomic_ioend; /* untorn write completion */
 };
 
 /* shorthand way of accessing reservation structure */
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 1302783a7157..ba4b02abc6e4 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -576,7 +576,10 @@ xfs_dio_write_end_io(
 	nofs_flag = memalloc_nofs_save();
 
 	if (flags & IOMAP_DIO_COW) {
-		error = xfs_reflink_end_cow(ip, offset, size);
+		if (iocb->ki_flags & IOCB_ATOMIC)
+			error = xfs_reflink_end_atomic_cow(ip, offset, size);
+		else
+			error = xfs_reflink_end_cow(ip, offset, size);
 		if (error)
 			goto out;
 	}
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index f5d338916098..218dee76768b 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -984,6 +984,62 @@ xfs_reflink_end_cow(
 	return error;
 }
 
+/*
+ * Fully remap all of the file's data fork at once, which is the critical part
+ * in achieving atomic behaviour.
+ * The regular CoW end path does not use this function so as to keep the block
+ * reservation per transaction as low as possible.
+ */
+int
+xfs_reflink_end_atomic_cow(
+	struct xfs_inode		*ip,
+	xfs_off_t			offset,
+	xfs_off_t			count)
+{
+	xfs_fileoff_t			offset_fsb;
+	xfs_fileoff_t			end_fsb;
+	int				error = 0;
+	struct xfs_mount		*mp = ip->i_mount;
+	struct xfs_trans		*tp;
+	unsigned int			resblks;
+
+	trace_xfs_reflink_end_cow(ip, offset, count);
+
+	offset_fsb = XFS_B_TO_FSBT(mp, offset);
+	end_fsb = XFS_B_TO_FSB(mp, offset + count);
+
+	/*
+	 * Each remapping operation could cause a btree split, so in the worst
+	 * case that's one for each block.
+	 */
+	resblks = (end_fsb - offset_fsb) *
+			XFS_NEXTENTADD_SPACE_RES(mp, 1, XFS_DATA_FORK);
+
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_atomic_ioend, resblks, 0,
+			XFS_TRANS_RESERVE, &tp);
+	if (error)
+		return error;
+
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	xfs_trans_ijoin(tp, ip, 0);
+
+	while (end_fsb > offset_fsb && !error) {
+		error = xfs_reflink_end_cow_extent_locked(tp, ip, &offset_fsb,
+				end_fsb);
+	}
+	if (error) {
+		trace_xfs_reflink_end_cow_error(ip, error, _RET_IP_);
+		goto out_cancel;
+	}
+	error = xfs_trans_commit(tp);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	return error;
+out_cancel:
+	xfs_trans_cancel(tp);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	return error;
+}
+
 /*
  * Free all CoW staging blocks that are still referenced by the ondisk refcount
  * metadata.  The ondisk metadata does not track which inode created the
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index 379619f24247..412e9b6f2082 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -45,6 +45,8 @@ extern int xfs_reflink_cancel_cow_range(struct xfs_inode *ip, xfs_off_t offset,
 		xfs_off_t count, bool cancel_real);
 extern int xfs_reflink_end_cow(struct xfs_inode *ip, xfs_off_t offset,
 		xfs_off_t count);
+int xfs_reflink_end_atomic_cow(struct xfs_inode *ip, xfs_off_t offset,
+		xfs_off_t count);
 extern int xfs_reflink_recover_cow(struct xfs_mount *mp);
 extern loff_t xfs_reflink_remap_range(struct file *file_in, loff_t pos_in,
 		struct file *file_out, loff_t pos_out, loff_t len,
-- 
2.31.1

Re: [PATCH v8 11/15] xfs: commit CoW-based atomic writes atomically

Posted by Christoph Hellwig 9 months, 3 weeks ago

On Tue, Apr 22, 2025 at 12:27:35PM +0000, John Garry wrote:
> +STATIC void

Didn't we phase out STATIC for new code?

> +xfs_calc_default_atomic_ioend_reservation(
> +	struct xfs_mount	*mp,
> +	struct xfs_trans_resv	*resp)
> +{
> +	if (xfs_has_reflink(mp))
> +		resp->tr_atomic_ioend = resp->tr_itruncate;
> +	else
> +		memset(&resp->tr_atomic_ioend, 0,
> +				sizeof(resp->tr_atomic_ioend));
> +}

What is the point of zeroing out the structure for the non-reflink
case?  Just as a poison for not using it when not supported as no
code should be doing that?  Just thinking of this because it is a
potentially nasty landmine for the zoned atomic support.

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

Re: [PATCH v8 11/15] xfs: commit CoW-based atomic writes atomically

Posted by Darrick J. Wong 9 months, 3 weeks ago

On Wed, Apr 23, 2025 at 10:23:07AM +0200, Christoph Hellwig wrote:
> On Tue, Apr 22, 2025 at 12:27:35PM +0000, John Garry wrote:
> > +STATIC void
> 
> Didn't we phase out STATIC for new code?
> 
> > +xfs_calc_default_atomic_ioend_reservation(
> > +	struct xfs_mount	*mp,
> > +	struct xfs_trans_resv	*resp)
> > +{
> > +	if (xfs_has_reflink(mp))
> > +		resp->tr_atomic_ioend = resp->tr_itruncate;
> > +	else
> > +		memset(&resp->tr_atomic_ioend, 0,
> > +				sizeof(resp->tr_atomic_ioend));
> > +}
> 
> What is the point of zeroing out the structure for the non-reflink
> case?  Just as a poison for not using it when not supported as no
> code should be doing that?  Just thinking of this because it is a
> potentially nasty landmine for the zoned atomic support.

Yes.  I thought about adding a really stupid helper:

static inline bool xfs_has_sw_atomic_write(struct xfs_mount *mp)
{
	return xfs_has_reflink(mp);
}

But that seemed too stupid so I left it out.  Maybe it wasn't so dumb,
since that would be where you'd enable ZNS support by changing that to:

	return xfs_has_reflink(mp) || xfs_has_zoned(mp);

--D

> Otherwise looks good:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
>

Re: [PATCH v8 11/15] xfs: commit CoW-based atomic writes atomically

Posted by Christoph Hellwig 9 months, 3 weeks ago

On Wed, Apr 23, 2025 at 07:58:50AM -0700, Darrick J. Wong wrote:
> > > +xfs_calc_default_atomic_ioend_reservation(
> > > +	struct xfs_mount	*mp,
> > > +	struct xfs_trans_resv	*resp)
> > > +{
> > > +	if (xfs_has_reflink(mp))
> > > +		resp->tr_atomic_ioend = resp->tr_itruncate;
> > > +	else
> > > +		memset(&resp->tr_atomic_ioend, 0,
> > > +				sizeof(resp->tr_atomic_ioend));
> > > +}
> > 
> > What is the point of zeroing out the structure for the non-reflink
> > case?  Just as a poison for not using it when not supported as no
> > code should be doing that?  Just thinking of this because it is a
> > potentially nasty landmine for the zoned atomic support.
> 
> Yes.  I thought about adding a really stupid helper:

Why don't we just always set up the xfs_trans_resv structure?  We
do that for all kinds of other transactions not supported as well,
don't we?

> static inline bool xfs_has_sw_atomic_write(struct xfs_mount *mp)
> {
> 	return xfs_has_reflink(mp);
> }
> 
> But that seemed too stupid so I left it out.  Maybe it wasn't so dumb,
> since that would be where you'd enable ZNS support by changing that to:
> 
> 	return xfs_has_reflink(mp) || xfs_has_zoned(mp);

But that helper might actually be useful in various places, so
independent of the above I'm in favor of it.

Re: [PATCH v8 11/15] xfs: commit CoW-based atomic writes atomically

Posted by Darrick J. Wong 9 months, 3 weeks ago

On Wed, Apr 23, 2025 at 05:53:40PM +0200, Christoph Hellwig wrote:
> On Wed, Apr 23, 2025 at 07:58:50AM -0700, Darrick J. Wong wrote:
> > > > +xfs_calc_default_atomic_ioend_reservation(
> > > > +	struct xfs_mount	*mp,
> > > > +	struct xfs_trans_resv	*resp)
> > > > +{
> > > > +	if (xfs_has_reflink(mp))
> > > > +		resp->tr_atomic_ioend = resp->tr_itruncate;
> > > > +	else
> > > > +		memset(&resp->tr_atomic_ioend, 0,
> > > > +				sizeof(resp->tr_atomic_ioend));
> > > > +}
> > > 
> > > What is the point of zeroing out the structure for the non-reflink
> > > case?  Just as a poison for not using it when not supported as no
> > > code should be doing that?  Just thinking of this because it is a
> > > potentially nasty landmine for the zoned atomic support.
> > 
> > Yes.  I thought about adding a really stupid helper:
> 
> Why don't we just always set up the xfs_trans_resv structure?  We
> do that for all kinds of other transactions not supported as well,
> don't we?

Works for me.  There's really no harm in it mirroring tr_itruncate since
it won't affect the log size calculation.

> > static inline bool xfs_has_sw_atomic_write(struct xfs_mount *mp)
> > {
> > 	return xfs_has_reflink(mp);
> > }
> > 
> > But that seemed too stupid so I left it out.  Maybe it wasn't so dumb,
> > since that would be where you'd enable ZNS support by changing that to:
> > 
> > 	return xfs_has_reflink(mp) || xfs_has_zoned(mp);
> 
> But that helper might actually be useful in various places, so
> independent of the above I'm in favor of it.

<nod> John, who should work on the next round, you or me?

--D

Re: [PATCH v8 11/15] xfs: commit CoW-based atomic writes atomically

Posted by John Garry 9 months, 3 weeks ago

On 23/04/2025 16:58, Darrick J. Wong wrote:
> <nod> John, who should work on the next round, you or me?

If you can just share your changes then I can re-post. I can fix up the 
smaller things, like commit messages.

Thanks,
John