[v7] large atomic writes for xfs

[PATCH v7 09/14] xfs: add large atomic writes checks in xfs_direct_write_iomap_begin()

Posted by John Garry 10 months ago

For when large atomic writes (> 1x FS block) are supported, there will be
various occasions when HW offload may not be possible.

Such instances include:
- unaligned extent mapping wrt write length
- extent mappings which do not cover the full write, e.g. the write spans
  sparse or mixed-mapping extents
- the write length is greater than HW offload can support

In those cases, we need to fallback to the CoW-based atomic write mode. For
this, report special code -ENOPROTOOPT to inform the caller that HW
offload-based method is not possible.

In addition to the occasions mentioned, if the write covers an unallocated
range, we again judge that we need to rely on the CoW-based method when we
would need to allocate anything more than 1x block. This is because if we
allocate less blocks that is required for the write, then again HW
offload-based method would not be possible. So we are taking a pessimistic
approach to writes covering unallocated space.

Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/xfs_iomap.c | 65 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 63 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 049655ebc3f7..02bb8257ea24 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -798,6 +798,41 @@ imap_spans_range(
 	return true;
 }
 
+static bool
+xfs_bmap_hw_atomic_write_possible(
+	struct xfs_inode	*ip,
+	struct xfs_bmbt_irec	*imap,
+	xfs_fileoff_t		offset_fsb,
+	xfs_fileoff_t		end_fsb)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	xfs_fsize_t		len = XFS_FSB_TO_B(mp, end_fsb - offset_fsb);
+
+	/*
+	 * atomic writes are required to be naturally aligned for disk blocks,
+	 * which ensures that we adhere to block layer rules that we won't
+	 * straddle any boundary or violate write alignment requirement.
+	 */
+	if (!IS_ALIGNED(imap->br_startblock, imap->br_blockcount))
+		return false;
+
+	/*
+	 * Spanning multiple extents would mean that multiple BIOs would be
+	 * issued, and so would lose atomicity required for REQ_ATOMIC-based
+	 * atomics.
+	 */
+	if (!imap_spans_range(imap, offset_fsb, end_fsb))
+		return false;
+
+	/*
+	 * The ->iomap_begin caller should ensure this, but check anyway.
+	 */
+	if (len > xfs_inode_buftarg(ip)->bt_bdev_awu_max)
+		return false;
+
+	return true;
+}
+
 static int
 xfs_direct_write_iomap_begin(
 	struct inode		*inode,
@@ -812,9 +847,11 @@ xfs_direct_write_iomap_begin(
 	struct xfs_bmbt_irec	imap, cmap;
 	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
 	xfs_fileoff_t		end_fsb = xfs_iomap_end_fsb(mp, offset, length);
+	xfs_fileoff_t		orig_end_fsb = end_fsb;
 	int			nimaps = 1, error = 0;
 	bool			shared = false;
 	u16			iomap_flags = 0;
+	bool			needs_alloc;
 	unsigned int		lockmode;
 	u64			seq;
 
@@ -875,13 +912,37 @@ xfs_direct_write_iomap_begin(
 				(flags & IOMAP_DIRECT) || IS_DAX(inode));
 		if (error)
 			goto out_unlock;
-		if (shared)
+		if (shared) {
+			if ((flags & IOMAP_ATOMIC) &&
+			    !xfs_bmap_hw_atomic_write_possible(ip, &cmap,
+					offset_fsb, end_fsb)) {
+				error = -ENOPROTOOPT;
+				goto out_unlock;
+			}
 			goto out_found_cow;
+		}
 		end_fsb = imap.br_startoff + imap.br_blockcount;
 		length = XFS_FSB_TO_B(mp, end_fsb) - offset;
 	}
 
-	if (imap_needs_alloc(inode, flags, &imap, nimaps))
+	needs_alloc = imap_needs_alloc(inode, flags, &imap, nimaps);
+
+	if (flags & IOMAP_ATOMIC) {
+		error = -ENOPROTOOPT;
+		/*
+		 * If we allocate less than what is required for the write
+		 * then we may end up with multiple extents, which means that
+		 * REQ_ATOMIC-based cannot be used, so avoid this possibility.
+		 */
+		if (needs_alloc && orig_end_fsb - offset_fsb > 1)
+			goto out_unlock;
+
+		if (!xfs_bmap_hw_atomic_write_possible(ip, &imap, offset_fsb,
+				orig_end_fsb))
+			goto out_unlock;
+	}
+
+	if (needs_alloc)
 		goto allocate_blocks;
 
 	/*
-- 
2.31.1

Re: [PATCH v7 09/14] xfs: add large atomic writes checks in xfs_direct_write_iomap_begin()

Posted by Darrick J. Wong 9 months, 4 weeks ago

On Tue, Apr 15, 2025 at 12:14:20PM +0000, John Garry wrote:
> For when large atomic writes (> 1x FS block) are supported, there will be
> various occasions when HW offload may not be possible.
> 
> Such instances include:
> - unaligned extent mapping wrt write length
> - extent mappings which do not cover the full write, e.g. the write spans
>   sparse or mixed-mapping extents
> - the write length is greater than HW offload can support
> 
> In those cases, we need to fallback to the CoW-based atomic write mode. For
> this, report special code -ENOPROTOOPT to inform the caller that HW
> offload-based method is not possible.
> 
> In addition to the occasions mentioned, if the write covers an unallocated
> range, we again judge that we need to rely on the CoW-based method when we
> would need to allocate anything more than 1x block. This is because if we
> allocate less blocks that is required for the write, then again HW
> offload-based method would not be possible. So we are taking a pessimistic
> approach to writes covering unallocated space.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
> ---
>  fs/xfs/xfs_iomap.c | 65 ++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 63 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 049655ebc3f7..02bb8257ea24 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -798,6 +798,41 @@ imap_spans_range(
>  	return true;
>  }
>  
> +static bool
> +xfs_bmap_hw_atomic_write_possible(
> +	struct xfs_inode	*ip,
> +	struct xfs_bmbt_irec	*imap,
> +	xfs_fileoff_t		offset_fsb,
> +	xfs_fileoff_t		end_fsb)
> +{
> +	struct xfs_mount	*mp = ip->i_mount;
> +	xfs_fsize_t		len = XFS_FSB_TO_B(mp, end_fsb - offset_fsb);
> +
> +	/*
> +	 * atomic writes are required to be naturally aligned for disk blocks,
> +	 * which ensures that we adhere to block layer rules that we won't
> +	 * straddle any boundary or violate write alignment requirement.
> +	 */
> +	if (!IS_ALIGNED(imap->br_startblock, imap->br_blockcount))
> +		return false;
> +
> +	/*
> +	 * Spanning multiple extents would mean that multiple BIOs would be
> +	 * issued, and so would lose atomicity required for REQ_ATOMIC-based
> +	 * atomics.
> +	 */
> +	if (!imap_spans_range(imap, offset_fsb, end_fsb))
> +		return false;
> +
> +	/*
> +	 * The ->iomap_begin caller should ensure this, but check anyway.
> +	 */
> +	if (len > xfs_inode_buftarg(ip)->bt_bdev_awu_max)
> +		return false;

This needs to check len against bt_bdev_awu_min so that we don't submit
too-short atomic writes to the hardware.  Let's say that the hardware
minimum is 32k and the fsblock size is 4k.  XFS can perform an out of
place write for 4k-16k writes, but right now we'll just throw invalid
commands at the bdev, and it'll return EINVAL.

/me wonders if statx should grow a atomic_write_unit_min_opt field
too, unless everyone in block layer land is convinced that awu_min will
always match lbasize?  (I probably missed that conversation)

--D

> +
> +	return true;
> +}
> +
>  static int
>  xfs_direct_write_iomap_begin(
>  	struct inode		*inode,
> @@ -812,9 +847,11 @@ xfs_direct_write_iomap_begin(
>  	struct xfs_bmbt_irec	imap, cmap;
>  	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
>  	xfs_fileoff_t		end_fsb = xfs_iomap_end_fsb(mp, offset, length);
> +	xfs_fileoff_t		orig_end_fsb = end_fsb;
>  	int			nimaps = 1, error = 0;
>  	bool			shared = false;
>  	u16			iomap_flags = 0;
> +	bool			needs_alloc;
>  	unsigned int		lockmode;
>  	u64			seq;
>  
> @@ -875,13 +912,37 @@ xfs_direct_write_iomap_begin(
>  				(flags & IOMAP_DIRECT) || IS_DAX(inode));
>  		if (error)
>  			goto out_unlock;
> -		if (shared)
> +		if (shared) {
> +			if ((flags & IOMAP_ATOMIC) &&
> +			    !xfs_bmap_hw_atomic_write_possible(ip, &cmap,
> +					offset_fsb, end_fsb)) {
> +				error = -ENOPROTOOPT;
> +				goto out_unlock;
> +			}
>  			goto out_found_cow;
> +		}
>  		end_fsb = imap.br_startoff + imap.br_blockcount;
>  		length = XFS_FSB_TO_B(mp, end_fsb) - offset;
>  	}
>  
> -	if (imap_needs_alloc(inode, flags, &imap, nimaps))
> +	needs_alloc = imap_needs_alloc(inode, flags, &imap, nimaps);
> +
> +	if (flags & IOMAP_ATOMIC) {
> +		error = -ENOPROTOOPT;
> +		/*
> +		 * If we allocate less than what is required for the write
> +		 * then we may end up with multiple extents, which means that
> +		 * REQ_ATOMIC-based cannot be used, so avoid this possibility.
> +		 */
> +		if (needs_alloc && orig_end_fsb - offset_fsb > 1)
> +			goto out_unlock;
> +
> +		if (!xfs_bmap_hw_atomic_write_possible(ip, &imap, offset_fsb,
> +				orig_end_fsb))
> +			goto out_unlock;
> +	}
> +
> +	if (needs_alloc)
>  		goto allocate_blocks;
>  
>  	/*
> -- 
> 2.31.1
> 
>

Re: [PATCH v7 09/14] xfs: add large atomic writes checks in xfs_direct_write_iomap_begin()

Posted by John Garry 9 months, 4 weeks ago

On 15/04/2025 18:34, Darrick J. Wong wrote:
>> +	/*
>> +	 * Spanning multiple extents would mean that multiple BIOs would be
>> +	 * issued, and so would lose atomicity required for REQ_ATOMIC-based
>> +	 * atomics.
>> +	 */
>> +	if (!imap_spans_range(imap, offset_fsb, end_fsb))
>> +		return false;
>> +
>> +	/*
>> +	 * The ->iomap_begin caller should ensure this, but check anyway.
>> +	 */
>> +	if (len > xfs_inode_buftarg(ip)->bt_bdev_awu_max)
>> +		return false;
> This needs to check len against bt_bdev_awu_min so that we don't submit
> too-short atomic writes to the hardware. 

Right, let me check this.

I think that we should only support sane HW which can write 1x FS block 
or more.

> Let's say that the hardware
> minimum is 32k and the fsblock size is 4k.  XFS can perform an out of
> place write for 4k-16k writes, but right now we'll just throw invalid
> commands at the bdev, and it'll return EINVAL.
> 
> /me wonders if statx should grow a atomic_write_unit_min_opt field
> too, unless everyone in block layer land is convinced that awu_min will
> always match lbasize?  (I probably missed that conversation)

Nothing states that it should (match lbasize), but again HW which can 
only write >1 FS block is something which I don't want to support (yet).

Thanks,
John