fs/xfs/xfs_iomap.c | 44 +++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 43 insertions(+), 1 deletion(-)
In append-only direct-IO workloads with fallocate(with KEEP_SIZE +
ZERO_RANGE) writes can span the written/unwritten boundary of the
preallocated extent. This is from callers needing to overwrite the
previously written block if the last write didn't completely fill it.
Prior to commit 883a790a8440 ("xfs: don't allow NOWAIT DIO across extent
boundaries") these writes succeeded by issuing two bios: one against the
NORM head, one against the UNWRITTEN tail. That commit closed off the
two-bio path to prevent NOWAIT callers from observing short writes when
one bio succeeded and another returned -EAGAIN. The side effect is that
the workload above now returns -EAGAIN on the spanning write.
This patch restores the success path by coalescing the (NORM, UNWRITTEN)
pair into a single in-memory iomap when the two records are physically
contiguous on disk. The merged range is reported as UNWRITTEN so
iomap_dio submits a single contiguous bio and the completion handler
runs xfs_iomap_write_unwritten across the merged range.
Short writes are still prevented. The merged iomap describes a single
contiguous physical extent, so iomap_dio still submits a single bio
for the full range. The caller observes the same all-or-nothing
behaviour as a write into a single UNWRITTEN extent.
The completion of the write also remains correct.
xfs_bmapi_convert_unwritten returns early on the already-NORM prefix
without dirtying the transaction, while the UNWRITTEN suffix flips to
NORM and the two adjacent same-state records coalesce in the bmbt as a
side effect of xfs_bmap_add_extent_unwritten_real.
A companion regression test will be posted separately to
fstests@vger.kernel.org.
Signed-off-by: Brandon Allard <brandon@redpanda.com>
Tested with xfstests against linux-xfs HEAD on a 4k-block XFS via
virtme-ng (vng --memory=64G --overlay-rwdir=/tmp -- ./check -g auto).
1266/1271 PASS. The failures seem unrelated and reproduce on 7.0:
* generic/753 - flaky dm-error + log-recovery loop, unrelated path
* generic/754 - pending xfs_repair fix (test self-declares xfsprogs
commit XXXXXXXXXXXXX as a precondition)
* xfs/078 - newer mkfs.xfs AG/log geometry vs stale golden
* xfs/216 - newer mkfs.xfs log-size heuristic vs stale golden
* xfs/217 - same heuristic, larger-fs sibling of xfs/216
---
fs/xfs/xfs_iomap.c | 44 +++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 43 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index be86d43044df..d7164d00a7b2 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -809,6 +809,41 @@ imap_spans_range(
return true;
}
+/*
+ * If imap doesn't span the requested range but the immediately-following bmbt
+ * record is unwritten and physically contiguous, extend imap to cover it and
+ * report the merged range as unwritten.
+ */
+static bool
+imap_extend_into_unwritten(
+ struct xfs_inode *ip,
+ struct xfs_bmbt_irec *imap,
+ xfs_fileoff_t end_fsb)
+{
+ struct xfs_bmbt_irec next;
+ xfs_fileoff_t next_off;
+ int nimaps = 1;
+ int error;
+
+ if (!xfs_bmap_is_written_extent(imap))
+ return false;
+
+ next_off = imap->br_startoff + imap->br_blockcount;
+ error = xfs_bmapi_read(ip, next_off, end_fsb - next_off, &next,
+ &nimaps, 0);
+ if (error || nimaps == 0)
+ return false;
+
+ if (next.br_state != XFS_EXT_UNWRITTEN ||
+ next.br_startblock != imap->br_startblock + imap->br_blockcount ||
+ !imap_spans_range(&next, next_off, end_fsb))
+ return false;
+
+ imap->br_blockcount += next.br_blockcount;
+ imap->br_state = XFS_EXT_UNWRITTEN;
+ return true;
+}
+
static bool
xfs_bmap_hw_atomic_write_possible(
struct xfs_inode *ip,
@@ -961,7 +996,14 @@ xfs_direct_write_iomap_begin(
*/
if (flags & (IOMAP_NOWAIT | IOMAP_OVERWRITE_ONLY)) {
error = -EAGAIN;
- if (!imap_spans_range(&imap, offset_fsb, end_fsb))
+ /*
+ * Sequential append workloads with adaptive preallocation can
+ * span the NORM->UNWRITTEN boundary that unwritten conversion
+ * leaves behind. Try to coalesce the next contiguous unwritten record
+ * into imap before returning -EAGAIN.
+ */
+ if (!imap_spans_range(&imap, offset_fsb, end_fsb) &&
+ !imap_extend_into_unwritten(ip, &imap, end_fsb))
goto out_unlock;
}
--
2.48.1
On Fri, May 29, 2026 at 03:11:00PM -0400, Brandon Allard wrote:
> In append-only direct-IO workloads with fallocate(with KEEP_SIZE +
> ZERO_RANGE) writes can span the written/unwritten boundary of the
> preallocated extent. This is from callers needing to overwrite the
> previously written block if the last write didn't completely fill it.
>
> Prior to commit 883a790a8440 ("xfs: don't allow NOWAIT DIO across extent
> boundaries") these writes succeeded by issuing two bios: one against the
> NORM head, one against the UNWRITTEN tail. That commit closed off the
> two-bio path to prevent NOWAIT callers from observing short writes when
> one bio succeeded and another returned -EAGAIN. The side effect is that
> the workload above now returns -EAGAIN on the spanning write.
>
> This patch restores the success path by coalescing the (NORM, UNWRITTEN)
> pair into a single in-memory iomap when the two records are physically
> contiguous on disk. The merged range is reported as UNWRITTEN so
> iomap_dio submits a single contiguous bio and the completion handler
> runs xfs_iomap_write_unwritten across the merged range.
>
> Short writes are still prevented. The merged iomap describes a single
> contiguous physical extent, so iomap_dio still submits a single bio
> for the full range. The caller observes the same all-or-nothing
> behaviour as a write into a single UNWRITTEN extent.
>
> The completion of the write also remains correct.
> xfs_bmapi_convert_unwritten returns early on the already-NORM prefix
> without dirtying the transaction, while the UNWRITTEN suffix flips to
> NORM and the two adjacent same-state records coalesce in the bmbt as a
> side effect of xfs_bmap_add_extent_unwritten_real.
I think the better way to do this would be a new XFS_BMAPI_ flag that
allows xfs_bmapi_read to merge maps with mismatching unwritten/written
state and just return your a single map. The big benefits are that we
only need to do one search for the extent and can then iterate the
cursor, and I think it will also end up with less and more readable code.
We actually had such a flag until v4.19, so you might be able to just
look into a manual revert of commit c3a2f9fff1bb ("xfs: remove the now
unused XFS_BMAPI_IGSTATE flag") for a start, although for use case
you'd need to ensure that the unwritten flag is set on the output
map if it is set on any of the input maps.
© 2016 - 2026 Red Hat, Inc.