xfs: coalesce contiguous unwritten suffix into iomap for NOWAIT writes

[PATCH] xfs: coalesce contiguous unwritten suffix into iomap for NOWAIT writes

Posted by Brandon Allard 1 week, 2 days ago

In append-only direct-IO workloads with fallocate(with KEEP_SIZE +
ZERO_RANGE) writes can span the written/unwritten boundary of the
preallocated extent. This is from callers needing to overwrite the
previously written block if the last write didn't completely fill it.

Prior to commit 883a790a8440 ("xfs: don't allow NOWAIT DIO across extent
boundaries") these writes succeeded by issuing two bios: one against the
NORM head, one against the UNWRITTEN tail. That commit closed off the
two-bio path to prevent NOWAIT callers from observing short writes when
one bio succeeded and another returned -EAGAIN. The side effect is that
the workload above now returns -EAGAIN on the spanning write.

This patch restores the success path by coalescing the (NORM, UNWRITTEN)
pair into a single in-memory iomap when the two records are physically
contiguous on disk. The merged range is reported as UNWRITTEN so
iomap_dio submits a single contiguous bio and the completion handler
runs xfs_iomap_write_unwritten across the merged range.

Short writes are still prevented. The merged iomap describes a single
contiguous physical extent, so iomap_dio still submits a single bio
for the full range. The caller observes the same all-or-nothing
behaviour as a write into a single UNWRITTEN extent.

The completion of the write also remains correct.
xfs_bmapi_convert_unwritten returns early on the already-NORM prefix
without dirtying the transaction, while the UNWRITTEN suffix flips to
NORM and the two adjacent same-state records coalesce in the bmbt as a
side effect of xfs_bmap_add_extent_unwritten_real.

A companion regression test will be posted separately to
fstests@vger.kernel.org.

Signed-off-by: Brandon Allard <brandon@redpanda.com>

Tested with xfstests against linux-xfs HEAD on a 4k-block XFS via
virtme-ng (vng --memory=64G --overlay-rwdir=/tmp -- ./check -g auto).
1266/1271 PASS. The failures seem unrelated and reproduce on 7.0:
  * generic/753 - flaky dm-error + log-recovery loop, unrelated path
  * generic/754 - pending xfs_repair fix (test self-declares xfsprogs
    commit XXXXXXXXXXXXX as a precondition)
  * xfs/078     - newer mkfs.xfs AG/log geometry vs stale golden
  * xfs/216     - newer mkfs.xfs log-size heuristic vs stale golden
  * xfs/217     - same heuristic, larger-fs sibling of xfs/216
---
 fs/xfs/xfs_iomap.c | 44 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 43 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index be86d43044df..d7164d00a7b2 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -809,6 +809,41 @@ imap_spans_range(
 	return true;
 }
 
+/*
+ * If imap doesn't span the requested range but the immediately-following bmbt
+ * record is unwritten and physically contiguous, extend imap to cover it and
+ * report the merged range as unwritten.
+ */
+static bool
+imap_extend_into_unwritten(
+	struct xfs_inode	*ip,
+	struct xfs_bmbt_irec	*imap,
+	xfs_fileoff_t		end_fsb)
+{
+	struct xfs_bmbt_irec	next;
+	xfs_fileoff_t		next_off;
+	int			nimaps = 1;
+	int			error;
+
+	if (!xfs_bmap_is_written_extent(imap))
+		return false;
+
+	next_off = imap->br_startoff + imap->br_blockcount;
+	error = xfs_bmapi_read(ip, next_off, end_fsb - next_off, &next,
+			       &nimaps, 0);
+	if (error || nimaps == 0)
+		return false;
+
+	if (next.br_state != XFS_EXT_UNWRITTEN ||
+	    next.br_startblock != imap->br_startblock + imap->br_blockcount ||
+	    !imap_spans_range(&next, next_off, end_fsb))
+		return false;
+
+	imap->br_blockcount += next.br_blockcount;
+	imap->br_state = XFS_EXT_UNWRITTEN;
+	return true;
+}
+
 static bool
 xfs_bmap_hw_atomic_write_possible(
 	struct xfs_inode	*ip,
@@ -961,7 +996,14 @@ xfs_direct_write_iomap_begin(
 	 */
 	if (flags & (IOMAP_NOWAIT | IOMAP_OVERWRITE_ONLY)) {
 		error = -EAGAIN;
-		if (!imap_spans_range(&imap, offset_fsb, end_fsb))
+		/*
+		 * Sequential append workloads with adaptive preallocation can
+		 * span the NORM->UNWRITTEN boundary that unwritten conversion
+		 * leaves behind. Try to coalesce the next contiguous unwritten record
+		 * into imap before returning -EAGAIN.
+		 */
+		if (!imap_spans_range(&imap, offset_fsb, end_fsb) &&
+		    !imap_extend_into_unwritten(ip, &imap, end_fsb))
 			goto out_unlock;
 	}
 
-- 
2.48.1

Re: [PATCH] xfs: coalesce contiguous unwritten suffix into iomap for NOWAIT writes

Posted by Christoph Hellwig 3 days ago

On Fri, May 29, 2026 at 03:11:00PM -0400, Brandon Allard wrote:
> In append-only direct-IO workloads with fallocate(with KEEP_SIZE +
> ZERO_RANGE) writes can span the written/unwritten boundary of the
> preallocated extent. This is from callers needing to overwrite the
> previously written block if the last write didn't completely fill it.
> 
> Prior to commit 883a790a8440 ("xfs: don't allow NOWAIT DIO across extent
> boundaries") these writes succeeded by issuing two bios: one against the
> NORM head, one against the UNWRITTEN tail. That commit closed off the
> two-bio path to prevent NOWAIT callers from observing short writes when
> one bio succeeded and another returned -EAGAIN. The side effect is that
> the workload above now returns -EAGAIN on the spanning write.
> 
> This patch restores the success path by coalescing the (NORM, UNWRITTEN)
> pair into a single in-memory iomap when the two records are physically
> contiguous on disk. The merged range is reported as UNWRITTEN so
> iomap_dio submits a single contiguous bio and the completion handler
> runs xfs_iomap_write_unwritten across the merged range.
> 
> Short writes are still prevented. The merged iomap describes a single
> contiguous physical extent, so iomap_dio still submits a single bio
> for the full range. The caller observes the same all-or-nothing
> behaviour as a write into a single UNWRITTEN extent.
> 
> The completion of the write also remains correct.
> xfs_bmapi_convert_unwritten returns early on the already-NORM prefix
> without dirtying the transaction, while the UNWRITTEN suffix flips to
> NORM and the two adjacent same-state records coalesce in the bmbt as a
> side effect of xfs_bmap_add_extent_unwritten_real.

I think the better way to do this would be a new XFS_BMAPI_ flag that
allows xfs_bmapi_read to merge maps with mismatching unwritten/written
state and just return your a single map.  The big benefits are that we
only need to do one search for the extent and can then iterate the
cursor, and I think it will also end up with less and more readable code.

We actually had such a flag until v4.19, so you might be able to just
look into a manual revert of commit c3a2f9fff1bb ("xfs: remove the now
unused XFS_BMAPI_IGSTATE flag") for a start, although for use case
you'd need to ensure that the unwritten flag is set on the output
map if it is set on any of the input maps.