Set BIO_COMPLETE_IN_TASK on iomap writeback bios when
IOMAP_IOEND_DONTCACHE is set. This ensures that bi_end_io runs in task
context, where folio_end_dropbehind() can safely invalidate folios.

With the bio layer now handling task-context deferral generically, XFS
no longer needs to route DONTCACHE ioends through its completion
workqueue for page cache invalidation. Remove the DONTCACHE check from
xfs_ioend_needs_wq_completion().

Signed-off-by: Tal Zussman <tz2294@columbia.edu>
---
fs/iomap/ioend.c | 2 ++
fs/xfs/xfs_aops.c | 4 ----
2 files changed, 2 insertions(+), 4 deletions(-)
diff --git a/fs/iomap/ioend.c b/fs/iomap/ioend.c
index e4d57cb969f1..6b8375d11cc0 100644
--- a/fs/iomap/ioend.c
+++ b/fs/iomap/ioend.c
@@ -113,6 +113,8 @@ static struct iomap_ioend *iomap_alloc_ioend(struct iomap_writepage_ctx *wpc,
 			GFP_NOFS, &iomap_ioend_bioset);
 	bio->bi_iter.bi_sector = iomap_sector(&wpc->iomap, pos);
 	bio->bi_write_hint = wpc->inode->i_write_hint;
+	if (ioend_flags & IOMAP_IOEND_DONTCACHE)
+		bio_set_flag(bio, BIO_COMPLETE_IN_TASK);
 	wbc_init_bio(wpc->wbc, bio);
 	wpc->nr_folios = 0;
 	return iomap_init_ioend(wpc->inode, bio, pos, ioend_flags);
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 76678814f46f..0d469b91377d 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -510,10 +510,6 @@ xfs_ioend_needs_wq_completion(
 	if (ioend->io_flags & (IOMAP_IOEND_UNWRITTEN | IOMAP_IOEND_SHARED))
 		return true;
 
-	/* Page cache invalidation cannot be done in irq context. */
-	if (ioend->io_flags & IOMAP_IOEND_DONTCACHE)
-		return true;
-
 	return false;
 }
--
2.39.5
On Wed, Mar 25, 2026 at 02:43:01PM -0400, Tal Zussman wrote:
[...]
> +	if (ioend_flags & IOMAP_IOEND_DONTCACHE)
> +		bio_set_flag(bio, BIO_COMPLETE_IN_TASK);
[...]
> -	/* Page cache invalidation cannot be done in irq context. */
> -	if (ioend->io_flags & IOMAP_IOEND_DONTCACHE)
> -		return true;
> -

Ok, so higher layers can set it.

At this point, I'd suggest that we should not be making random one-off
changes to the iomap and filesystem layers like this just for one
operation that needs deferred IO completion work. This needs to be
considered from the overall perspective of how we defer completion
work - there are lots of different paths through filesystems and/or
iomap that require/use task deferral for IO completion. We want them
all to use the same mechanism - splitting deferral between multiple
layers depending on IO type is not a particularly nice thing to be
doing...

-Dave.

-- 
Dave Chinner
dgc@kernel.org
On Thu, Mar 26, 2026 at 07:34:45AM +1100, Dave Chinner wrote:
> At this point, I'd suggest that we should not be making random
> one-off changes to the iomap and filesystem layers like this just
> for one operation that needs deferred IO completion work. This needs
> to be considered from the overall perspective of how we defer
> completion work - there are lots of different paths through
> filesystems and/or iomap that require/use task deferral for IO
> completion. We want them all to use the same mechanism - splitting
> deferral between multiple layers depending on IO type is not a
> particularly nice thing to be doing...

Yes and no. The XFS/iomap write completions need special handling
for merging operations, using different workqueues, and also the
serialization provided by the per-inode list.

Everything that just needs a dumb user context should be the same,
though. And this mechanism should work just fine for the T10 PI
checksums. It does not currently work for the defer to user on error
used by the fserror reporting, but should be adaptable to that by
allowing an I/O completion to also be deferred from an already
running end_io handler, although that might get ugly.

It should work really well for other places that defer bio completions,
like the erofs decompression handler that recently came up, and it will
be very useful to implement actually working REQ_NOWAIT support for
file system writes. So yes, I think we need to look more at the whole
picture, and I think this is a good building block considering the
whole picture. I don't think we can converge on just a single
mechanism, but having a few generic ones is good.
Hi Christoph,
On 2026/3/27 14:08, Christoph Hellwig wrote:
> On Thu, Mar 26, 2026 at 07:34:45AM +1100, Dave Chinner wrote:
>> At this point, I'd suggest that we should not be making random
>> one-off changes to the iomap and filesystem layers like this just
>> for one operation that needs deferred IO completion work. This needs
>> to be considered from the overall perspective of how we defer
>> completion work - there are lots of different paths through
>> filesystems and/or iomap that require/use task deferral for IO
>> completion. We want them all to use the same mechanism - splitting
>> deferral between multiple layers depending on IO type is not a
>> particularly nice thing to be doing...
>
> Yes and no. The XFS/iomap write completions need special handling
> for merging operations, using different workqueues, and also the
> serialization provided by the per-inode list.
>
> Everything that just needs a dumb user context should be the same,
> though. And this mechanism should work just fine for the T10 PI
> checksums. It does not currently work for the defer to user on error
> used by the fserror reporting, but should be adaptable to that by
> allowing an I/O completion to also be deferred from an already
> running end_io handler, although that might get ugly.
>
> It should work really well for other places that defer bio completions
> like the erofs decompression handler that recently came up, and it will
I noticed this work, but typically the current EROFS
decompression has two latency-sensitive cases:
- dm-verity calls EROFS completion, yes, in that case, this
work can work well since dm-verity already takes some
merkle tree latencies, and we just don't want to add more
scheduling latencies with another workqueue;
- use EROFS directly, in that case, we still need process
contexts to decompress, but due to Android latency
requirements, they really need per-cpu RT threads instead,
otherwise it will cause serious regression too; but I'm not
sure that case can be replaced by this work since workqueues
don't support RT threads and I guess generic block layer
won't be bothered with that too.
Thanks,
Gao Xiang
> be very useful to implement actually working REQ_NOWAIT support for
> file system writes. So yes, I think we need to look more at the whole
> picture, and I think this is a good building block considering the
> whole picture. I don't think we can coverge on just a single mechanism,
> but having few and generic ones is good.
>
>
On Fri, Mar 27, 2026 at 02:24:02PM +0800, Gao Xiang wrote:
> - use EROFS directly, in that case, we still need process
>   contexts to decompress, but due to Android latency
>   requirements, they really need per-cpu RT threads instead,
>   otherwise it will cause serious regression too; but I'm not
>   sure that case can be replaced by this work since workqueues
>   don't support RT threads and I guess generic block layer
>   won't be bothered with that too.

All of the I/O completions should be latency sensitive. So I think it
would be great if you could help out here with the requirements and
implementation.
On 2026/3/27 14:27, Christoph Hellwig wrote:
> On Fri, Mar 27, 2026 at 02:24:02PM +0800, Gao Xiang wrote:
>> - use EROFS directly, in that case, we still need process
>> contexts to decompress, but due to Android latency
>> requirements, they really need per-cpu RT threads instead,
>> otherwise it will cause serious regression too; but I'm not
>> sure that case can be replaced by this work since workqueues
>> don't support RT threads and I guess generic block layer
>> won't be bothered with that too.
>
> All of the I/O completions should be latency sensitive. So I think it
> would be great if you could help out here with the requirements and
> implementation.
Yes, especially for sync read completion. Our requirement can
be outlined as:
- a mark to make the whole bio completion run in task
  context, so that we don't need to worry about that;
- another per-CPU RT thread flag (or similar) related to
  a bio or some other things, so that bio completion can be
  handled by per-cpu RT threads instead of workqueues.
If both are met, I think that would be very helpful, at least
to clean up our internal codebase.
Thanks,
Gao Xiang
On Wed, Mar 25, 2026 at 02:43:01PM -0400, Tal Zussman wrote:
[...]
> 	bio->bi_iter.bi_sector = iomap_sector(&wpc->iomap, pos);
> 	bio->bi_write_hint = wpc->inode->i_write_hint;
> +	if (ioend_flags & IOMAP_IOEND_DONTCACHE)
> +		bio_set_flag(bio, BIO_COMPLETE_IN_TASK);
> 	wbc_init_bio(wpc->wbc, bio);
> 	wpc->nr_folios = 0;
> 	return iomap_init_ioend(wpc->inode, bio, pos, ioend_flags);

Can't we delete IOMAP_IOEND_DONTCACHE, and just do:

	if (folio_test_dropbehind(folio))
		bio_set_flag(&ioend->io_bio, BIO_COMPLETE_IN_TASK);

It'd need to move down a few lines in iomap_add_to_ioend() to after
bio_add_folio() succeeds.
On Wed, Mar 25, 2026 at 08:21:28PM +0000, Matthew Wilcox wrote:
> > +	if (ioend_flags & IOMAP_IOEND_DONTCACHE)
> > +		bio_set_flag(bio, BIO_COMPLETE_IN_TASK);
> > 	wbc_init_bio(wpc->wbc, bio);
> > 	wpc->nr_folios = 0;
> > 	return iomap_init_ioend(wpc->inode, bio, pos, ioend_flags);
>
> Can't we delete IOMAP_IOEND_DONTCACHE, and just do:
>
> 	if (folio_test_dropbehind(folio))
> 		bio_set_flag(&ioend->io_bio, BIO_COMPLETE_IN_TASK);
>
> It'd need to move down a few lines in iomap_add_to_ioend() to after
> bio_add_folio() succeeds.

Yes, that sounds sensible.
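For reference, that suggestion might look roughly like the following in
iomap_add_to_ioend(). This is an untested sketch; the context lines
around bio_add_folio() are illustrative and the exact placement would
need checking against the real function:

```diff
 	if (!bio_add_folio(&wpc->ioend->io_bio, folio, len, poff))
 		goto new_ioend;
+
+	if (folio_test_dropbehind(folio))
+		bio_set_flag(&wpc->ioend->io_bio, BIO_COMPLETE_IN_TASK);
```

Checking the per-folio dropbehind state here would let the
IOMAP_IOEND_DONTCACHE flag go away entirely, as Willy notes.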