[v8] large atomic writes for xfs

[PATCH v8 12/15] xfs: add xfs_file_dio_write_atomic()

Posted by John Garry 9 months, 3 weeks ago

Add xfs_file_dio_write_atomic() for dedicated handling of atomic writes.

The function works based on two operating modes:
- HW offload, i.e. REQ_ATOMIC-based
- CoW based with out-of-places write and atomic extent remapping

The preferred method is HW offload as it will be faster. If HW offload is
not possible, then we fallback to the CoW-based method.

HW offload would not be possible for the write length exceeding the HW
offload limit, the write spanning multiple extents, unaligned disk blocks,
etc.

Apart from the write exceeding the HW offload limit, other conditions for
HW offload can only be detected in the iomap handling for the write. As
such, we use a fallback method to issue the write if we detect in the
->iomap_begin() handler that HW offload is not possible. Special code
-ENOPROTOOPT is returned from ->iomap_begin() to inform that HW offload
not possible.

Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/xfs_file.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 68 insertions(+)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index ba4b02abc6e4..000bbb0d1413 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -728,6 +728,72 @@ xfs_file_dio_write_zoned(
 	return ret;
 }
 
+/*
+ * Handle block atomic writes
+ *
+ * Two methods of atomic writes are supported:
+ * - REQ_ATOMIC-based, which would typically use some form of HW offload in the
+ *   disk
+ * - COW-based, which uses a COW fork as a staging extent for data updates
+ *   before atomically updating extent mappings for the range being written
+ *
+ */
+static noinline ssize_t
+xfs_file_dio_write_atomic(
+	struct xfs_inode	*ip,
+	struct kiocb		*iocb,
+	struct iov_iter		*from)
+{
+	unsigned int		iolock = XFS_IOLOCK_SHARED;
+	ssize_t			ret, ocount = iov_iter_count(from);
+	const struct iomap_ops	*dops;
+
+	/*
+	 * HW offload should be faster, so try that first if it is already
+	 * known that the write length is not too large.
+	 */
+	if (ocount > xfs_inode_hw_atomicwrite_max(ip))
+		dops = &xfs_atomic_write_cow_iomap_ops;
+	else
+		dops = &xfs_direct_write_iomap_ops;
+
+retry:
+	ret = xfs_ilock_iocb_for_write(iocb, &iolock);
+	if (ret)
+		return ret;
+
+	ret = xfs_file_write_checks(iocb, from, &iolock, NULL);
+	if (ret)
+		goto out_unlock;
+
+	/* Demote similar to xfs_file_dio_write_aligned() */
+	if (iolock == XFS_IOLOCK_EXCL) {
+		xfs_ilock_demote(ip, XFS_IOLOCK_EXCL);
+		iolock = XFS_IOLOCK_SHARED;
+	}
+
+	trace_xfs_file_direct_write(iocb, from);
+	ret = iomap_dio_rw(iocb, from, dops, &xfs_dio_write_ops,
+			0, NULL, 0);
+
+	/*
+	 * The retry mechanism is based on the ->iomap_begin method returning
+	 * -ENOPROTOOPT, which would be when the REQ_ATOMIC-based write is not
+	 * possible. The REQ_ATOMIC-based method typically not be possible if
+	 * the write spans multiple extents or the disk blocks are misaligned.
+	 */
+	if (ret == -ENOPROTOOPT && dops == &xfs_direct_write_iomap_ops) {
+		xfs_iunlock(ip, iolock);
+		dops = &xfs_atomic_write_cow_iomap_ops;
+		goto retry;
+	}
+
+out_unlock:
+	if (iolock)
+		xfs_iunlock(ip, iolock);
+	return ret;
+}
+
 /*
  * Handle block unaligned direct I/O writes
  *
@@ -843,6 +909,8 @@ xfs_file_dio_write(
 		return xfs_file_dio_write_unaligned(ip, iocb, from);
 	if (xfs_is_zoned_inode(ip))
 		return xfs_file_dio_write_zoned(ip, iocb, from);
+	if (iocb->ki_flags & IOCB_ATOMIC)
+		return xfs_file_dio_write_atomic(ip, iocb, from);
 	return xfs_file_dio_write_aligned(ip, iocb, from,
 			&xfs_direct_write_iomap_ops, &xfs_dio_write_ops, NULL);
 }
-- 
2.31.1

Re: [PATCH v8 12/15] xfs: add xfs_file_dio_write_atomic()

Posted by Christoph Hellwig 9 months, 3 weeks ago

On Tue, Apr 22, 2025 at 12:27:36PM +0000, John Garry wrote:
> Add xfs_file_dio_write_atomic() for dedicated handling of atomic writes.
> 
> The function works based on two operating modes:
> - HW offload, i.e. REQ_ATOMIC-based
> - CoW based with out-of-places write and atomic extent remapping
> 
> The preferred method is HW offload as it will be faster. If HW offload is
> not possible, then we fallback to the CoW-based method.
> 
> HW offload would not be possible for the write length exceeding the HW
> offload limit, the write spanning multiple extents, unaligned disk blocks,
> etc.
> 
> Apart from the write exceeding the HW offload limit, other conditions for
> HW offload can only be detected in the iomap handling for the write. As
> such, we use a fallback method to issue the write if we detect in the
> ->iomap_begin() handler that HW offload is not possible. Special code
> -ENOPROTOOPT is returned from ->iomap_begin() to inform that HW offload
> not possible.

This text could use a little rewrite starting with the fact that the
hardware offload now isn't required to start with and entirely
optional and then flow from the there to state when we can use it
instead of when we can't use it.

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

Re: [PATCH v8 12/15] xfs: add xfs_file_dio_write_atomic()

Posted by John Garry 9 months, 3 weeks ago

On 23/04/2025 09:25, Christoph Hellwig wrote:
> On Tue, Apr 22, 2025 at 12:27:36PM +0000, John Garry wrote:
>> Add xfs_file_dio_write_atomic() for dedicated handling of atomic writes.
>>
>> The function works based on two operating modes:
>> - HW offload, i.e. REQ_ATOMIC-based
>> - CoW based with out-of-places write and atomic extent remapping
>>
>> The preferred method is HW offload as it will be faster. If HW offload is
>> not possible, then we fallback to the CoW-based method.
>>
>> HW offload would not be possible for the write length exceeding the HW
>> offload limit, the write spanning multiple extents, unaligned disk blocks,
>> etc.
>>
>> Apart from the write exceeding the HW offload limit, other conditions for
>> HW offload can only be detected in the iomap handling for the write. As
>> such, we use a fallback method to issue the write if we detect in the
>> ->iomap_begin() handler that HW offload is not possible. Special code
>> -ENOPROTOOPT is returned from ->iomap_begin() to inform that HW offload
>> not possible.
> 
> This text could use a little rewrite starting with the fact that the
> hardware offload now isn't required to start with and entirely
> optional and then flow from the there to state when we can use it
> instead of when we can't use it.

ok, sure

> 
> Otherwise looks good:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Thanks!