[v3] loop: Fix NULL pointer dereference in lo_rw_aio()

[PATCH v3] loop: Fix NULL pointer dereference in lo_rw_aio()

Posted by Tetsuo Handa 2 weeks ago

Some commit merged during the merge window for 7.1 broke the loop
driver: it introduced a race window in which lo_release() clears the
backing file via __loop_clr_fd() while some I/O requests are still
pending [1][2].

The exact commit that changed the behavior is not known, because there is
no reproducer and the behavior is timing dependent. Still, it seems that
we need to solve this problem in the loop driver, even though the loop
driver itself did not change during this merge window.

To close this race, try to flush pending I/O requests. However, calling
drain_workqueue() from __loop_clr_fd() with disk->open_mutex held causes
lockdep warnings [3][4]. We need to flush pending I/O requests without
disk->open_mutex held.

In the past, commit 322c4293ecc5 ("loop: make autoclear operation
asynchronous") tried to defer __loop_clr_fd() to WQ context. But it was
reverted by commit bf23747ee053 ("loop: revert "make autoclear operation
asynchronous"") because userspace might be expecting that fput() on the
backing file is processed before lo_release() from close() returns to user
mode.

Therefore, this patch tries to defer __loop_clr_fd() to task work context.
__loop_clr_fd() is split into three steps:

  Step 1: Flush pending I/O requests without holding disk->open_mutex.

  Step 2: Do what __loop_clr_fd() from lo_release() was doing with
          disk->open_mutex held.

  Step 3: Drop refcounts without holding disk->open_mutex.
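
The ordering of the three steps above can be modeled in userspace as
follows. This is only a sketch under stated assumptions: struct model_dev,
model_dev_init() and model_clr_fd() are hypothetical names that do not
exist in the kernel, a boolean flag stands in for disk->open_mutex, and a
plain counter stands in for pending I/O requests. It illustrates the
ordering only, not the actual kernel change.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct loop_device (not kernel code). */
struct model_dev {
	bool open_mutex_held; /* models whether disk->open_mutex is held */
	int pending_io;       /* models I/O requests still in flight */
	const char *backing;  /* models lo->lo_backing_file */
	int refs;             /* models the references pinned by lo_release() */
};

void model_dev_init(struct model_dev *d, const char *backing, int pending_io)
{
	d->open_mutex_held = false;
	d->pending_io = pending_io;
	d->backing = backing;
	d->refs = 1;
}

/* Runs the three steps in order and returns the detached backing object. */
const char *model_clr_fd(struct model_dev *d)
{
	const char *old;

	/* Step 1: flush pending I/O without holding open_mutex. */
	assert(!d->open_mutex_held);
	while (d->pending_io > 0)
		d->pending_io--; /* stand-in for waiting on one request */

	/* Step 2: detach the backing object with open_mutex held. */
	d->open_mutex_held = true;
	assert(d->pending_io == 0); /* nothing may still use the backing object */
	old = d->backing;
	d->backing = NULL;
	d->open_mutex_held = false;

	/* Step 3: drop the pinned reference without holding open_mutex. */
	assert(!d->open_mutex_held);
	d->refs--;
	return old;
}
```

The point of the ordering is that the backing object is detached only after
the pending counter has reached zero, and that the flush itself never runs
while the (modeled) mutex is held.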

A potential side effect of this approach is that a userspace program that
issues open() before __loop_clr_fd() completes might be confused by
observing -ENXIO, because lo_open() can now be called before
__loop_clr_fd() completes.

Except for the side effect above, I expect this patch to work for the
following reasons.

- The existing Lo_rundown state safely guarantees that any subsequent
  lo_open() attempts will immediately fail with -ENXIO, preventing races
  even after disk->open_mutex is temporarily released.

- Since returning from lo_release() normally allows the block layer to
  immediately drop module and device references, this patch explicitly
  increments the refcounts (__module_get() and get_device()) before
  deferring the work, and safely releases them at the end of Step 3
  inside __loop_clr_fd().

- The patch prefers task_work so that userspace processes expecting
  immediate completion (such as fput() side effects) see deterministic
  behavior before close() returns. It falls back to schedule_work()
  if the current context is a kernel thread (PF_KTHREAD) or if
  task_work_add() fails.
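
The reference pinning and the task_work/workqueue fallback described in the
bullets above can be sketched as a small userspace model. Everything here
is an assumption made for illustration: enum defer_path, struct
pinned_refs, pin_refs(), unpin_refs(), choose_defer_path() and the two stub
callbacks are hypothetical names, and the callback merely imitates the
"returns 0 on success" convention of task_work_add().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical names for illustration only (not kernel code). */
enum defer_path { DEFER_TASK_WORK, DEFER_WORKQUEUE };

struct pinned_refs {
	int device; /* models get_device() / put_device() */
	int module; /* models __module_get() / module_put() */
};

/* Take the references before the cleanup is deferred. */
void pin_refs(struct pinned_refs *r)
{
	r->device++;
	r->module++;
}

/* Drop them at the very end of the deferred cleanup. */
void unpin_refs(struct pinned_refs *r)
{
	assert(r->device > 0 && r->module > 0);
	r->module--;
	r->device--;
}

/* Stub callbacks that imitate a successful and a failed task_work_add(). */
int task_work_ok(void)    { return 0; }
int task_work_fails(void) { return -ESRCH; }

/*
 * is_kthread models (current->flags & PF_KTHREAD).
 * try_task_work models task_work_add() and returns 0 on success; it is
 * only called when the caller is not a kernel thread.
 */
enum defer_path choose_defer_path(bool is_kthread, int (*try_task_work)(void))
{
	if (is_kthread || try_task_work() != 0)
		return DEFER_WORKQUEUE;
	return DEFER_TASK_WORK;
}
```

In this model the workqueue path is taken both for kernel threads and when
queuing the task work fails, which mirrors the condition described in the
last bullet.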

Link: https://syzkaller.appspot.com/bug?extid=cd8a9a308e879a4e2c28 [1]
Link: https://syzkaller.appspot.com/bug?extid=bc273027d5643e48e5b3 [2]
Link: https://syzkaller.appspot.com/bug?extid=2f62807dc3239b8f584e [3]
Link: https://syzkaller.appspot.com/bug?extid=c4e9d077bcc86bee08dc [4]
Analyzed-by: AI Mode in Google Search (no mail address)
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 drivers/block/loop.c | 86 ++++++++++++++++++++++++++++++++++++--------
 kernel/task_work.c   |  1 +
 2 files changed, 73 insertions(+), 14 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 0000913f7efc..d97aa2c209e3 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -36,6 +36,7 @@
 #include <linux/blk-mq.h>
 #include <linux/spinlock.h>
 #include <uapi/linux/loop.h>
+#include <linux/task_work.h>
 
 /* Possible states of device */
 enum {
@@ -74,6 +75,10 @@ struct loop_device {
 	struct gendisk		*lo_disk;
 	struct mutex		lo_mutex;
 	bool			idr_visible;
+	union {
+		struct callback_head lo_clr_task_work;
+		struct work_struct lo_clr_work;
+	};
 };
 
 struct loop_cmd {
@@ -1112,12 +1117,34 @@ static int loop_configure(struct loop_device *lo, blk_mode_t mode,
 	return error;
 }
 
-static void __loop_clr_fd(struct loop_device *lo)
+static void __loop_clr_fd(struct callback_head *callback)
 {
+	struct loop_device *lo = container_of(callback, struct loop_device, lo_clr_task_work);
 	struct queue_limits lim;
 	struct file *filp;
 	gfp_t gfp = lo->old_gfp_mask;
 
+	/* Step 1: Flush all outstanding I/O, without open_mutex held. */
+
+	/*
+	 * Now that loop_queue_rq() sees lo->lo_state != Lo_bound,
+	 * wait for already started loop_queue_rq() to complete.
+	 */
+	synchronize_rcu();
+	/*
+	 * Now that no more works are scheduled by loop_queue_rq(),
+	 * wait for already scheduled works to complete.
+	 */
+	drain_workqueue(lo->workqueue);
+	/*
+	 * Now that no more AIO requests are scheduled by lo_rw_aio(),
+	 * wait for already started AIO to complete.
+	 */
+	blk_mq_unfreeze_queue(lo->lo_queue, blk_mq_freeze_queue(lo->lo_queue));
+
+	/* Step 2: Perform remaining cleanup, with open_mutex held. */
+	mutex_lock(&lo->lo_disk->open_mutex);
+
 	spin_lock_irq(&lo->lo_lock);
 	filp = lo->lo_backing_file;
 	lo->lo_backing_file = NULL;
@@ -1128,12 +1155,7 @@ static void __loop_clr_fd(struct loop_device *lo)
 	lo->lo_sizelimit = 0;
 	memset(lo->lo_file_name, 0, LO_NAME_SIZE);
 
-	/*
-	 * Reset the block size to the default.
-	 *
-	 * No queue freezing needed because this is called from the final
-	 * ->release call only, so there can't be any outstanding I/O.
-	 */
+	/* Reset the block size to the default. */
 	lim = queue_limits_start_update(lo->lo_queue);
 	lim.logical_block_size = SECTOR_SIZE;
 	lim.physical_block_size = SECTOR_SIZE;
@@ -1145,8 +1167,6 @@ static void __loop_clr_fd(struct loop_device *lo)
 	/* let user-space know about this change */
 	kobject_uevent(&disk_to_dev(lo->lo_disk)->kobj, KOBJ_CHANGE);
 	mapping_set_gfp_mask(filp->f_mapping, gfp);
-	/* This is safe: open() is still holding a reference. */
-	module_put(THIS_MODULE);
 
 	disk_force_media_change(lo->lo_disk);
 
@@ -1154,9 +1174,6 @@ static void __loop_clr_fd(struct loop_device *lo)
 		int err;
 
 		/*
-		 * open_mutex has been held already in release path, so don't
-		 * acquire it if this function is called in such case.
-		 *
 		 * If the reread partition isn't from release path, lo_refcnt
 		 * must be at least one and it can only become zero when the
 		 * current holder is released.
@@ -1181,12 +1198,31 @@ static void __loop_clr_fd(struct loop_device *lo)
 	WRITE_ONCE(lo->lo_state, Lo_unbound);
 	mutex_unlock(&lo->lo_mutex);
 
+	/* Step 3: Drop refcounts, without open_mutex held. */
+	mutex_unlock(&lo->lo_disk->open_mutex);
+
 	/*
 	 * Need not hold lo_mutex to fput backing file. Calling fput holding
 	 * lo_mutex triggers a circular lock dependency possibility warning as
 	 * fput can take open_mutex which is usually taken before lo_mutex.
 	 */
 	fput(filp);
+
+	/*
+	 * Drop all references that would have been dropped as soon as
+	 * returning from lo_release() and releasing disk->open_mutex.
+	 */
+	module_put(lo->lo_disk->fops->owner);
+	put_device(disk_to_dev(lo->lo_disk));
+
+	module_put(THIS_MODULE);
+}
+
+static void loop_clr_work(struct work_struct *work)
+{
+	struct loop_device *lo = container_of(work, struct loop_device, lo_clr_work);
+
+	__loop_clr_fd(&lo->lo_clr_task_work);
 }
 
 static int loop_clr_fd(struct loop_device *lo)
@@ -1747,8 +1783,30 @@ static void lo_release(struct gendisk *disk)
 	need_clear = (lo->lo_state == Lo_rundown);
 	mutex_unlock(&lo->lo_mutex);
 
-	if (need_clear)
-		__loop_clr_fd(lo);
+	/*
+	 * In order to flush pending I/O requests before clearing the backing device,
+	 * defer __loop_clr_fd() to task work context or normal workqueue context.
+	 * The Lo_rundown state guarantees that lo_open() will fail with -ENXIO.
+	 */
+	if (need_clear) {
+		/*
+		 * Grab all references that will be dropped as soon as returning from
+		 * lo_release() and releasing disk->open_mutex.
+		 */
+		get_device(disk_to_dev(disk));
+		__module_get(disk->fops->owner);
+		/*
+		 * Prefer task work, for userspace might be expecting that fput()
+		 * on the backing file is processed before lo_release() from close()
+		 * returns to user mode.
+		 */
+		init_task_work(&lo->lo_clr_task_work, __loop_clr_fd);
+		if ((current->flags & PF_KTHREAD) ||
+		    task_work_add(current, &lo->lo_clr_task_work, TWA_RESUME)) {
+			INIT_WORK(&lo->lo_clr_work, loop_clr_work);
+			schedule_work(&lo->lo_clr_work);
+		}
+	}
 }
 
 static void lo_free_disk(struct gendisk *disk)
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 0f7519f8e7c9..45fd146b85df 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -102,6 +102,7 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(task_work_add);
 
 /**
  * task_work_cancel_match - cancel a pending work added by task_work_add()
-- 
2.54.0

Re: [PATCH v3] loop: Fix NULL pointer dereference in lo_rw_aio()

Posted by Ming Lei 2 weeks ago

On Mon, May 25, 2026 at 12:40:19PM +0900, Tetsuo Handa wrote:
> Some commit which was merged in the merge window for 7.1 broke the loop
> driver; a race window where lo_release() clears the backing file via
> __loop_clr_fd() despite some I/O requests are pending was introduced [1][2].
> 
> The exact commit which changed the behavior is not known due to lack of
> reproducer and timing dependent behavior, but it seems that we need to
> solve this problem in the loop driver despite there was no change for the
> loop driver during this merge window.
> 
> To close this race, try to flush pending I/O requests. However, calling
> drain_workqueue() from __loop_clr_fd() with disk->open_mutex held causes
> lockdep warnings [3][4]. We need to flush pending I/O requests without
> disk->open_mutex held.

No, please don't work around the problem before finding the root cause.

Nothing proves that the issue is in the block layer or the loop driver. The
IO isn't expected; you need to figure out why btrfs still issues IO after
this loop disk is closed by everyone and writeback is done.

https://syzkaller.appspot.com/x/log.txt?x=101e4702580000


Thanks,
Ming

Re: [PATCH v3] loop: Fix NULL pointer dereference in lo_rw_aio()

Posted by Tetsuo Handa 1 week, 6 days ago

On 2026/05/26 0:19, Ming Lei wrote:
> On Mon, May 25, 2026 at 12:40:19PM +0900, Tetsuo Handa wrote:
>> Some commit which was merged in the merge window for 7.1 broke the loop
>> driver; a race window where lo_release() clears the backing file via
>> __loop_clr_fd() despite some I/O requests are pending was introduced [1][2].
>>
>> The exact commit which changed the behavior is not known due to lack of
>> reproducer and timing dependent behavior, but it seems that we need to
>> solve this problem in the loop driver despite there was no change for the
>> loop driver during this merge window.
>>
>> To close this race, try to flush pending I/O requests. However, calling
>> drain_workqueue() from __loop_clr_fd() with disk->open_mutex held causes
>> lockdep warnings [3][4]. We need to flush pending I/O requests without
>> disk->open_mutex held.
> 
> No, please don't workaround before root cause.
> 
> No proof shows that the issue is in block layer or loop driver, the IO isn't
> expected, you need to figure out why btrfs still issues IO after this loop
> disk is closed by everyone and writeback is done.
> 
> https://syzkaller.appspot.com/x/log.txt?x=101e4702580000
> 

Of course we should try to figure out the root cause first, but how can we do that?

  Absolute fact:

    This problem started happening no later than next-20260413 in the linux-next.git tree.
    ( syzbot was unable to test next-202604{03,06,07,08,09,10} due to a different bug. )

    This problem is still happening as of v7.1-rc5 in the linux.git tree.

    No one has succeeded in establishing steps to reproduce this problem.

    No one has identified the exact commit that is causing this problem.

  Likely fact:

    Since this problem was never observed while next-20260402 from the linux-next.git tree was
    in use (until 2026/04/13 16:31), this problem probably did not exist as of next-20260402.

    Since this problem was not observed up to v7.0, it probably did not exist as of v7.0.
    (Although last-minute changes for v7.0-rc{6,7} or v7.0 could be the culprit,
     the merge window, which accepts big changes for v7.1, is more likely.)

  My guess:

    The culprit commit is between commit a028739a4330 ("Merge tag 'block-7.0-20260305' of
    git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux") and commit 7fe6ac157b7e ("Merge tag
    'for-7.1/block-20260411' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux"), because
    changes related to bio handling were merged in this period.

    "git log --oneline block/ drivers/block/" between next-20260402 and next-20260413 shows the following diff:

--------------------
-da93b347876b Merge branch 'master' of https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
-9b75c6e054b7 Merge branch 'for-next' of https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux.git
-265720725a47 Merge branch 'fs-next' of linux-next
-ac9e99118030 Merge branch into tip/master: 'x86/cleanups'
-8ea5c0750d36 zram: do not forget to endio for partial discard requests
-0476d2e93477 zram: change scan_slots to return void
-1207420afea8 zram: propagate read_from_bdev_async() errors
-aafa569edb41 zram: optimize LZ4 dictionary compression performance
-24c76a259819 Merge branch 'for-7.1/block' into for-next
-eca714c0aac1 Merge branch 'vfs-7.1.bh.metadata' into vfs.all
+7d8d908556ca Merge branch 'master' of https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
+4391dc7df11d Merge branch 'for-next' of https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux.git
+18c6a4c24187 Merge branch 'fs-next' of linux-next
+7f828a86cfef Merge branch into tip/master: 'x86/cleanups'
+716aa108c5bb zram: reject unrecognized type= values in recompress_store()
+3470a1d34f40 zram: do not forget to endio for partial discard requests
+88a57e158619 Merge branch 'for-7.1/block' into for-next
+36446de0c30c ublk: fix tautological comparison warning in ublk_ctrl_reg_buf
+f2bab85781e8 Merge branch 'vfs-7.1.bh.metadata' into vfs.all
+9357dc97533a Merge branch 'vfs-7.1.integrity' into vfs.all
+e0b15707598c Merge branch 'for-7.1/block' into for-next
+539fb773a3f7 block: refactor blkdev_zone_mgmt_ioctl
+ddc1dfffcbea Merge branch 'for-7.1/block' into for-next
+365ea7cc6244 ublk: allow buffer registration before device is started
+5e864438e285 ublk: replace xarray with IDA for shmem buffer index allocation
+8ea8566a9aee ublk: simplify PFN range loop in __ublk_ctrl_reg_buf
+211ff1602b67 ublk: verify all pages in multi-page bvec fall within registered range
+23b3b6f0b584 ublk: widen ublk_shmem_buf_reg.len to __u64 for 4GB buffer support
+cb793ff1353d Merge branch 'for-7.1/block' into for-next
+92c3737a2473 block: add a bio_submit_or_kill helper
+6fa747550e35 block: factor out a bio_await helper
+65565ca5f99b block: unify the synchronous bi_end_io callbacks
+cc91702dedc5 Merge branch 'for-7.1/block' into for-next
+8a34e88769f6 ublk: eliminate permanent pages[] array from struct ublk_buf
+08677040a911 ublk: enable UBLK_F_SHMEM_ZC feature flag
+4d4a512a1f87 ublk: add PFN-based buffer matching in I/O path
+2fb0ded237bb ublk: add UBLK_U_CMD_REG_BUF/UNREG_BUF control commands
+dec615fa43c3 Merge branch 'for-7.1/block' into for-next
+fa0cac9a5158 drbd: use get_random_u64() where appropriate
+0b581d2fb4cf Merge branch 'for-7.1/block' into for-next
+a9c4b1d37622 drbd: remove DRBD_GENLA_F_MANDATORY flag handling
+d436cfb3a259 Merge branch 'for-7.1/block' into for-next
+e9b004ff8306 blk-wbt: remove WARN_ON_ONCE from wbt_init_enable_default()
+09ebc43b5edc Merge branch 'for-7.1/block' into for-next
+0842186d2c4e ublk: reset per-IO canceled flag on each fetch
+cba82993308d zram: change scan_slots to return void
+bf989ade270d zram: propagate read_from_bdev_async() errors
+f0f6f7871430 zram: optimize LZ4 dictionary compression performance
+301f39220096 zram: unify and harden algo/priority params handling
+cedfa028b54e zram: remove chained recompression
+5004a27edba5 zram: drop ->num_active_comps
+ed19b9d5504f zram: do not autocorrect bad recompression parameters
+241f9005b1c8 zram: do not permit params change after init
+c09fb53d293a zram: use statically allocated compression algorithm names
+6030f93e5c71 Merge branch 'for-7.1/io_uring-fuse' into for-next
+29ebfdd7db89 io_uring/rsrc: rename io_buffer_register_bvec()/io_buffer_unregister_bvec()
+6568edbea553 Merge branch 'for-7.1/block' into for-next
+a175ee827331 block: use sysfs_emit in sysfs show functions
+c691e4b0d80b bio: fix kmemleak false positives from percpu bio alloc cache
 f91ffe89b201 blk-iocost: fix busy_level reset when no IOs complete
 23308af722fe blk-cgroup: fix disk reference leak in blkcg_maybe_throttle_current()
 b2a78fec344e zloop: add max_open_zones option
 2a2f520fda82 block: fix zones_cond memory leak on zone revalidation error paths
 267ec4d7223a loop: fix partition scan race between udev and loop_reread_partitions()
 499d2d2f4cf9 sed-opal: Add STACK_RESET command
-c61825bb46bc Merge branch 'vfs-7.1.integrity' into vfs.all
-fc2093641448 zram: unify and harden algo/priority params handling
-4fd453f16446 zram: remove chained recompression
-e2b717936d1a zram: drop ->num_active_comps
-3578bb37f7d1 zram: do not autocorrect bad recompression parameters
-5331373bfebd zram: do not permit params change after init
 2b31e86387e6 drbd: Balance RCU calls in drbd_adm_dump_devices()
 f9480ecf939d bdev: Drop pointless invalidate_inode_buffers() call
-b00ff1b25f85 zram: use statically allocated compression algorithm names
 630bbba45cfd drbd: use genl pre_doit/post_doit
 829def1e35ca zloop: forget write cache on force removal
 eff8d1656e83 zloop: refactor zloop_rw
--------------------

    "git log --oneline block/" between next-20260402 and next-20260413 shows the following diff:

--------------------
-9b75c6e054b7 Merge branch 'for-next' of https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux.git
-eca714c0aac1 Merge branch 'vfs-7.1.bh.metadata' into vfs.all
+4391dc7df11d Merge branch 'for-next' of https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux.git
+f2bab85781e8 Merge branch 'vfs-7.1.bh.metadata' into vfs.all
+539fb773a3f7 block: refactor blkdev_zone_mgmt_ioctl
+92c3737a2473 block: add a bio_submit_or_kill helper
+6fa747550e35 block: factor out a bio_await helper
+65565ca5f99b block: unify the synchronous bi_end_io callbacks
+e9b004ff8306 blk-wbt: remove WARN_ON_ONCE from wbt_init_enable_default()
+a175ee827331 block: use sysfs_emit in sysfs show functions
+c691e4b0d80b bio: fix kmemleak false positives from percpu bio alloc cache
 f91ffe89b201 blk-iocost: fix busy_level reset when no IOs complete
 23308af722fe blk-cgroup: fix disk reference leak in blkcg_maybe_throttle_current()
 2a2f520fda82 block: fix zones_cond memory leak on zone revalidation error paths
--------------------

Possible approaches for finding the exact commit that is causing this problem:

  (a) Revert all changes in the block layer from linux.git and monitor for one week for whether this
      problem is still happening (because linux.git is more frequently hitting this problem than
      linux-next.git ).

  (b) Revert all changes in the block layer from linux-next.git and monitor for two weeks for
      whether this problem is still happening (less reliable than linux.git but a candidate).

  (c) Let sashiko review all changes between v7.0 and v7.1 that may cause this problem.
      (Human developers have no time to review. But is investigation with moving baseline commit
      possible for sashiko ?)

  (d) Any ideas?

P.S. Since the loop driver is critical infrastructure for syzbot's filesystem testing,
I want this problem to be addressed before 7.1 is released.

Re: [PATCH v3] loop: Fix NULL pointer dereference in lo_rw_aio()

Posted by Ming Lei 1 week, 5 days ago

On Tue, May 26, 2026 at 09:25:30AM +0900, Tetsuo Handa wrote:
> On 2026/05/26 0:19, Ming Lei wrote:
> > On Mon, May 25, 2026 at 12:40:19PM +0900, Tetsuo Handa wrote:
> >> Some commit which was merged in the merge window for 7.1 broke the loop
> >> driver; a race window where lo_release() clears the backing file via
> >> __loop_clr_fd() despite some I/O requests are pending was introduced [1][2].
> >>
> >> The exact commit which changed the behavior is not known due to lack of
> >> reproducer and timing dependent behavior, but it seems that we need to
> >> solve this problem in the loop driver despite there was no change for the
> >> loop driver during this merge window.
> >>
> >> To close this race, try to flush pending I/O requests. However, calling
> >> drain_workqueue() from __loop_clr_fd() with disk->open_mutex held causes
> >> lockdep warnings [3][4]. We need to flush pending I/O requests without
> >> disk->open_mutex held.
> > 
> > No, please don't workaround before root cause.
> > 
> > No proof shows that the issue is in block layer or loop driver, the IO isn't
> > expected, you need to figure out why btrfs still issues IO after this loop
> > disk is closed by everyone and writeback is done.
> > 
> > https://syzkaller.appspot.com/x/log.txt?x=101e4702580000
> > 
> 
> Of course we should try to figure out the root cause first, but how can we do?

Unexpected write IO from btrfs (after umount and after the loop device is closed) is definitely
more serious, since it may cause data loss, so I am CCing the btrfs list and maintainer.

...
 
> Possible approaches for finding the exact commit that is causing this problem:
> 
>   (a) Revert all changes in the block layer from linux.git and monitor for one week for whether this
>       problem is still happening (because linux.git is more frequently hitting this problem than
>       linux-next.git ).
> 
>   (b) Revert all changes in the block layer from linux-next.git and monitor for two weeks for
>       whether this problem is still happening (less reliable than linux.git but a candidate).
> 
>   (c) Let sashiko review all changes between v7.0 and v7.1 that may cause this problem.
>       (Human developers have no time to review. But is investigation with moving baseline commit
>       possible for sashiko ?)
> 
>   (d) Any ideas?
> 
> P.S. Since the loop driver is a critical infrastructure for testing filesystems by syzbot,
> I want this problem be addressed before 7.1 is released.

syzbot is for finding real problems; here the real trouble is unexpected write IO from btrfs.

So please do not try to paper over a real bug by 'fixing' loop.


Thanks,
Ming

Re: [PATCH v3] loop: Fix NULL pointer dereference in lo_rw_aio()

Posted by Tetsuo Handa 1 week, 5 days ago

On 2026/05/27 10:20, Ming Lei wrote:
>> Of course we should try to figure out the root cause first, but how can we do?
> 
> Definitely unexpected write IO(after umount & loop closed) from btrfs is more serious,
> which may cause data loss, so CC btrfs list and maintainer.

Why do you assume that the culprit is btrfs?

https://syzkaller.appspot.com/bug?extid=bc273027d5643e48e5b3 indicates that
a similar race is also happening with jfs.

[  678.816570][ T1038] read_mapping_page failed!
[  678.816584][ T1038] ERROR: (device loop3): txCommit: 
[  678.816584][ T1038] 
[  678.816633][ T1038] jfs_write_inode: jfs_commit_inode failed!
[  678.895688][ T2183] lo_rw_aio(loop3) starting write with raw_refcnt=0x0, refcnt=1
[  678.956225][ T2183] lo_rw_aio(loop3) starting write with raw_refcnt=0x0, refcnt=1
[  678.970652][   T12] lo_rw_aio(loop3) starting write with raw_refcnt=0x0, refcnt=1
[  679.102838][ T4281] lo_rw_aio(loop3) starting write with raw_refcnt=0x0, refcnt=1
[  679.104701][ T4281] lo_rw_aio(loop3) starting write with raw_refcnt=0x0, refcnt=1
[  679.121329][ T2183] lo_rw_aio(loop3) starting write with raw_refcnt=0x0, refcnt=1
[  679.122119][ T2183] lo_rw_aio(loop3) starting write with raw_refcnt=0x0, refcnt=1
[  679.199283][ T2183] lo_rw_aio(loop3) starting read with raw_refcnt=0x0, refcnt=1
[  679.200014][ T2183] lo_rw_aio(loop3) starting write with raw_refcnt=0x0, refcnt=1
[  679.275613][ T5615] __loop_clr_fd(loop3) clearing lo_backing_file with raw_refcnt=0x0, refcnt=1
[  679.397358][   T13] bridge_slave_1: left allmulticast mode
[  679.397399][   T13] bridge_slave_1: left promiscuous mode
[  679.410004][   T13] bridge0: port 2(bridge_slave_1) entered disabled state
[  679.433576][ T2183] ------------[ cut here ]------------
[  679.433592][ T2183] d_inode(dentry) != file_inode(file)
[  679.433617][ T2183] WARNING: ./include/linux/fs.h:1368 at file_remove_privs_flags+0x58c/0x640, CPU#0: kworker/u8:12/2183
[  679.433676][ T2183] Modules linked in:
[  679.433695][ T2183] CPU: 0 UID: 0 PID: 2183 Comm: kworker/u8:12 Not tainted syzkaller #0 PREEMPT_{RT,(full)} 
[  679.433720][ T2183] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/18/2026
[  679.433739][ T2183] Workqueue: loop3 loop_workfn
[  679.433805][ T2183] RIP: 0010:file_remove_privs_flags+0x58c/0x640
[  679.433848][ T2183] Code: 00 75 4d 44 89 e8 48 8d 65 d8 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc cc e8 5f d4 80 ff e9 90 fe ff ff e8 55 d4 80 ff 90 <0f> 0b 90 e9 85 fb ff ff 44 89 f1 80 e1 07 80 c1 03 38 c1 0f 8c b7
[  679.433867][ T2183] RSP: 0018:ffffc90007e374e0 EFLAGS: 00010293
[  679.433885][ T2183] RAX: ffffffff8243f7cb RBX: ffff888036fa8ca0 RCX: ffff88802c0abd80
[  679.433902][ T2183] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[  679.433933][ T2183] RBP: ffffc90007e37638 R08: 0000000000000000 R09: 0000000000000000
[  679.433946][ T2183] R10: dffffc0000000000 R11: fffffbfff1f1597f R12: ffff888063726220
[  679.433962][ T2183] R13: 1ffff11006df5194 R14: 0000000000000000 R15: 1ffff1100c6e4c44
[  679.433978][ T2183] FS:  0000000000000000(0000) GS:ffff888125f1f000(0000) knlGS:0000000000000000
[  679.433998][ T2183] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  679.434016][ T2183] CR2: 00007f22e1be7dac CR3: 000000003e332000 CR4: 00000000003526f0
[  679.434038][ T2183] Call Trace:
[  679.434049][ T2183]  <TASK>
[  679.434072][ T2183]  ? __pfx_file_remove_privs_flags+0x10/0x10
[  679.434118][ T2183]  ? rt_mutex_post_schedule+0xd1/0x1c0
[  679.434172][ T2183]  ? generic_write_checks_count+0x449/0x550
[  679.434212][ T2183]  ? generic_write_checks+0xc8/0x110
[  679.434249][ T2183]  shmem_file_write_iter+0xaa/0x120
[  679.434286][ T2183]  lo_rw_aio+0xef0/0x1170
[  679.434349][ T2183]  ? __pfx_lo_rw_aio+0x10/0x10
[  679.434401][ T2183]  ? kthread_associate_blkcg+0x490/0x600
[  679.434432][ T2183]  ? rt_spin_unlock+0x160/0x200
[  679.434476][ T2183]  loop_process_work+0x637/0x11b0
[  679.434539][ T2183]  ? __pfx_loop_process_work+0x10/0x10
[  679.434582][ T2183]  ? look_up_lock_class+0x57/0x110
[  679.434626][ T2183]  ? register_lock_class+0x31/0x2e0
[  679.434661][ T2183]  ? __lock_acquire+0x6b5/0x2d10
[  679.434741][ T2183]  ? do_raw_spin_lock+0x12b/0x2f0
[  679.434785][ T2183]  ? __pfx_do_raw_spin_lock+0x10/0x10
[  679.434830][ T2183]  ? process_one_work+0x8be/0x1630
[  679.434870][ T2183]  ? process_one_work+0x8be/0x1630
[  679.434922][ T2183]  ? process_one_work+0x8be/0x1630
[  679.434959][ T2183]  process_one_work+0x98b/0x1630
[  679.435026][ T2183]  ? __pfx_process_one_work+0x10/0x10
[  679.435060][ T2183]  ? do_raw_spin_lock+0x12b/0x2f0
[  679.435128][ T2183]  worker_thread+0xb49/0x1140
[  679.435202][ T2183]  kthread+0x388/0x470
[  679.435233][ T2183]  ? __pfx_worker_thread+0x10/0x10
[  679.435276][ T2183]  ? __pfx_kthread+0x10/0x10
[  679.435309][ T2183]  ret_from_fork+0x514/0xb70
[  679.435348][ T2183]  ? __pfx_ret_from_fork+0x10/0x10
[  679.435382][ T2183]  ? __switch_to+0xc79/0x1410
[  679.435415][ T2183]  ? __pfx_kthread+0x10/0x10
[  679.435447][ T2183]  ret_from_fork_asm+0x1a/0x30
[  679.435517][ T2183]  </TASK>

Re: [PATCH v3] loop: Fix NULL pointer dereference in lo_rw_aio()

Posted by Ming Lei 1 week, 5 days ago

On Wed, May 27, 2026 at 10:35:56AM +0900, Tetsuo Handa wrote:
> On 2026/05/27 10:20, Ming Lei wrote:
> >> Of course we should try to figure out the root cause first, but how can we do?
> > 
> > Definitely unexpected write IO(after umount & loop closed) from btrfs is more serious,
> > which may cause data loss, so CC btrfs list and maintainer.
> 
> Why do you assume that the culprit is btrfs?
> 
> https://syzkaller.appspot.com/bug?extid=bc273027d5643e48e5b3 indicated that
> this similar race is also happening with jfs.

I hadn't seen the above report on jfs.

It doesn't change anything; the same question still stands: why is unexpected write IO issued
after, or allowed to cross, umount and the last close of the loop disk?



Thanks,
Ming

Re: [PATCH v3] loop: Fix NULL pointer dereference in lo_rw_aio()

Posted by Hillf Danton 1 week, 4 days ago

On Tue, 26 May 2026 22:00:49 -0500 Ming Lei wrote:
>On Wed, May 27, 2026 at 10:35:56AM +0900, Tetsuo Handa wrote:
>> On 2026/05/27 10:20, Ming Lei wrote:
>> >> Of course we should try to figure out the root cause first, but how can we do?
>> > 
>> > Definitely unexpected write IO(after umount & loop closed) from btrfs is more serious,
>> > which may cause data loss, so CC btrfs list and maintainer.
>> 
>> Why do you assume that the culprit is btrfs?
>> 
>> https://syzkaller.appspot.com/bug?extid=bc273027d5643e48e5b3 indicated that
>> this similar race is also happening with jfs.
>
> I just didn't see the above report on jfs.
> 
> It doesn't change anything, the same question still stands: unexpected write IO is issued
> or crosses umount & last closing of loop disk.
>
Given the loop workqueue that triggered the jfs warning, can you specify
the reason why the workqueue in question is NOT flushed while closing disk?

Re: [PATCH v3] loop: Fix NULL pointer dereference in lo_rw_aio()

Posted by Hillf Danton 1 week, 4 days ago

On Thu, 28 May 2026 13:43:31 +0800 Hillf Danton wrote:
>On Tue, 26 May 2026 22:00:49 -0500 Ming Lei wrote:
>>On Wed, May 27, 2026 at 10:35:56AM +0900, Tetsuo Handa wrote:
>>> On 2026/05/27 10:20, Ming Lei wrote:
>>> >> Of course we should try to figure out the root cause first, but how can we do?
>>> > 
>>> > Definitely unexpected write IO(after umount & loop closed) from btrfs is more serious,
>>> > which may cause data loss, so CC btrfs list and maintainer.
>>> 
>>> Why do you assume that the culprit is btrfs?
>>> 
>>> https://syzkaller.appspot.com/bug?extid=bc273027d5643e48e5b3 indicated that
>>> this similar race is also happening with jfs.
>>
>> I just didn't see the above report on jfs.
>> 
>> It doesn't change anything, the same question still stands: unexpected write IO is issued
>> or crosses umount & last closing of loop disk.
>>
> Given the loop workqueue that triggered the jfs warning, can you specify
> the reason why the workqueue in question is NOT flushed while closing disk?
>
Got it, the loop workqueue is NOT flushed to avoid deadlock, see d292dc80686a
("loop: don't destroy lo->workqueue in __loop_clr_fd") for detail.
And the deadlock can be reproduced by flushing the loop workqueue with
disk->open_mutex held [1].

[1] Subject: Re: [syzbot] possible deadlock in blkdev_put (3)
https://lore.kernel.org/lkml/000000000000ea753505da2658d5@google.com/

Re: [PATCH v3] loop: Fix NULL pointer dereference in lo_rw_aio()

Posted by Tetsuo Handa 1 week, 3 days ago

On 2026/05/29 8:00, Hillf Danton wrote:
>> Given the loop workqueue that triggered the jfs warning, can you specify
>> the reason why the workqueue in question is NOT flushed while closing disk?
>>
> Got it, the loop workqueue is NOT flushed to avoid deadlock, see d292dc80686a
> ("loop: don't destroy lo->workqueue in __loop_clr_fd") for detail.
> And the deadlock can be reproduced by flushing the loop workqueue with
> disk->open_mutex held [1].
> 
> [1] Subject: Re: [syzbot] possible deadlock in blkdev_put (3)
> https://lore.kernel.org/lkml/000000000000ea753505da2658d5@google.com/

We can avoid the following lockdep warnings (including [1] you mentioned)

  https://syzkaller.appspot.com/bug?extid=2f62807dc3239b8f584e
  https://syzkaller.appspot.com/bug?extid=c4e9d077bcc86bee08dc
  https://syzkaller.appspot.com/bug?extid=0f427123ae84b3ba6dc7
  https://syzkaller.appspot.com/bug?extid=4feabfc9641267769c97
  https://syzkaller.appspot.com/bug?extid=fb0ff9bfe34ad282ebd4

caused by "drain_workqueue() with disk->open_mutex held" if we assign a
caller-specific lockdep class to disk->open_mutex

  https://sourceforge.net/p/tomoyo/tomoyo.git/ci/c2245c765ebeba9dcb924d9171d8d470a9ac41c8/

.

Also, we can avoid the lockdep warning caused by "drain_workqueue() with disk->open_mutex held" +
"holding system_transition_mutex" if we forbid binding pseudo files as backing files
in the loop driver

  https://lkml.kernel.org/r/d38e4600-3c32-491f-aa49-905f4fad1bfb@I-love.SAKURA.ne.jp

which we can reproduce with

  echo 7:0 > /sys/power/resume
  losetup /dev/loop0 /sys/power/resume
  cat /dev/loop0 > /dev/null
  losetup -d /dev/loop0

.

Therefore, I think we can address this problem by "drain_workqueue() with disk->open_mutex
held" on the loop driver side.



However, there will still be a possibility that a last-millisecond writeback
request from a filesystem (one which runs during the unmount operation) fails
due to the

    if (data_race(READ_ONCE(lo->lo_state)) != Lo_bound)
        return BLK_STS_IOERR;

check in loop_queue_rq(). Therefore, addressing this problem within each
individual filesystem would be the stricter solution. But judging from the
pace at which jfs fixes bugs, it would take a long time before we stop seeing
this problem...
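
The check quoted above can be modelled in userspace as follows. This is only a minimal sketch under stated assumptions: `model_loop`, `model_queue_rq()`, `model_demo()` and the `MODEL_*` names are hypothetical stand-ins rather than kernel symbols, and a relaxed C11 atomic load stands in for data_race(READ_ONCE(...)). It shows that a request which arrives after the state has left the bound state is failed at queue time.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-ins for Lo_unbound / Lo_bound / Lo_rundown. */
enum model_state { MODEL_UNBOUND, MODEL_BOUND, MODEL_RUNDOWN };

/* Hypothetical stand-ins for BLK_STS_OK / BLK_STS_IOERR. */
enum model_status { MODEL_STS_OK, MODEL_STS_IOERR };

struct model_loop {
	_Atomic int state;	/* read locklessly by the submission path */
	int accepted;		/* requests handed on for processing */
};

/* Model of the state check at the top of loop_queue_rq(). */
enum model_status model_queue_rq(struct model_loop *lo)
{
	/* Lockless snapshot; the state may change right after this load. */
	if (atomic_load_explicit(&lo->state, memory_order_relaxed) != MODEL_BOUND)
		return MODEL_STS_IOERR;
	lo->accepted++;
	return MODEL_STS_OK;
}

/* Usage: a request before rundown is accepted, one after it is failed. */
int model_demo(void)
{
	struct model_loop lo = { .accepted = 0 };

	atomic_init(&lo.state, MODEL_BOUND);
	assert(model_queue_rq(&lo) == MODEL_STS_OK);
	atomic_store(&lo.state, MODEL_RUNDOWN);
	assert(model_queue_rq(&lo) == MODEL_STS_IOERR);
	return lo.accepted;
}
```

In this model the second request is rejected even though its submitter did nothing wrong locally, which is why a late writeback request would still fail with an I/O error.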

Re: [PATCH v3] loop: Fix NULL pointer dereference in lo_rw_aio()

Posted by Hillf Danton 1 week, 3 days ago

On Fri, 29 May 2026 09:14:47 +0900 Tetsuo Handa wrote:
>On 2026/05/29 8:00, Hillf Danton wrote:
>>> Given the loop workqueue that triggered the jfs warning, can you specify
>>> the reason why the workqueue in question is NOT flushed while closing disk?
>>>
>> Got it, the loop workqueue is NOT flushed to avoid deadlock, see d292dc80686a
>> ("loop: don't destroy lo->workqueue in __loop_clr_fd") for detail.
>> And the deadlock can be reproduced by flushing the loop workqueue with
>> disk->open_mutex held [1].
>> 
>> [1] Subject: Re: [syzbot] possible deadlock in blkdev_put (3)
>> https://lore.kernel.org/lkml/000000000000ea753505da2658d5@google.com/
>
>We can avoid the following lockdep warnings (including [1] you mentioned)
>
>  https://syzkaller.appspot.com/bug?extid=2f62807dc3239b8f584e
>  https://syzkaller.appspot.com/bug?extid=c4e9d077bcc86bee08dc
>  https://syzkaller.appspot.com/bug?extid=0f427123ae84b3ba6dc7
>  https://syzkaller.appspot.com/bug?extid=4feabfc9641267769c97
>  https://syzkaller.appspot.com/bug?extid=fb0ff9bfe34ad282ebd4
>
>caused by "drain_workqueue() with disk->open_mutex held" if we assign
>caller-specific lockdep class to disk->open_mutex
>
>  https://sourceforge.net/p/tomoyo/tomoyo.git/ci/c2245c765ebeba9dcb924d9171d8d470a9ac41c8/
>
>.
>
>Also, we can avoid lockdep warning caused by "drain_workqueue() with disk->open_mutex held" +
>"holding system_transition_mutex" if we forbid binding to pseudo files as backing file
>in the loop driver
>
>  https://lkml.kernel.org/r/d38e4600-3c32-491f-aa49-905f4fad1bfb@I-love.SAKURA.ne.jp
>
>which we can reproduce with
>
>  echo 7:0 > /sys/power/resume
>  losetup /dev/loop0 /sys/power/resume
>  cat /dev/loop0 > /dev/null
>  losetup -d /dev/loop0
>
>.
>
>Therefore, I think we can address this problem by "drain_workqueue() with disk->open_mutex
>held" in the loop driver side.
>
Good news.
>
>
>However, the possibility that the last milli-second writeback request
>(which runs during unmount operation) from filesystem fails due to
>
>    if (data_race(READ_ONCE(lo->lo_state)) != Lo_bound)
>        return BLK_STS_IOERR;
>
>check in loop_queue_rq() will remain.

This conflicts with "There is no need to destroy the workqueue when
clearing unbinding a loop device from a backing file." in d292dc80686a

>Therefore, addressing this problem
>within individual filesystem will be more strict solution. But guessing from

Conflicts with "Another thing is, if it's some btrfs bios on-the-fly after 
close_ctree(), the most common symptom should be NULL pointer 
dereference inside various btrfs endio functions." [2] once more.

And you need to pay the fs guys more than two cents I think for cooking
a FIX.

[2] Subject: Re: [PATCH v3] loop: Fix NULL pointer dereference in lo_rw_aio()
https://lore.kernel.org/lkml/36571f8a-4df8-4152-b078-d82dbff4ad7e@suse.com/

>the pace jfs fixes bugs, it would take long time before we stop seeing
>this problem...

Re: [PATCH v3] loop: Fix NULL pointer dereference in lo_rw_aio()

Posted by Hillf Danton 1 week, 3 days ago

On Fri, 29 May 2026 15:04:10 +0800 Hillf Danton wrote:
>On Fri, 29 May 2026 09:14:47 +0900 Tetsuo Handa wrote:
>>On 2026/05/29 8:00, Hillf Danton wrote:
>>>> Given the loop workqueue that triggered the jfs warning, can you specify
>>>> the reason why the workqueue in question is NOT flushed while closing disk?
>>>>
>>> Got it, the loop workqueue is NOT flushed to avoid deadlock, see d292dc80686a
>>> ("loop: don't destroy lo->workqueue in __loop_clr_fd") for detail.
>>> And the deadlock can be reproduced by flushing the loop workqueue with
>>> disk->open_mutex held [1].
>>> 
>>> [1] Subject: Re: [syzbot] possible deadlock in blkdev_put (3)
>>> https://lore.kernel.org/lkml/000000000000ea753505da2658d5@google.com/
>>
>>We can avoid the following lockdep warnings (including [1] you mentioned)
>>
>>  https://syzkaller.appspot.com/bug?extid=2f62807dc3239b8f584e
>>  https://syzkaller.appspot.com/bug?extid=c4e9d077bcc86bee08dc
>>  https://syzkaller.appspot.com/bug?extid=0f427123ae84b3ba6dc7
>>  https://syzkaller.appspot.com/bug?extid=4feabfc9641267769c97
>>  https://syzkaller.appspot.com/bug?extid=fb0ff9bfe34ad282ebd4
>>
>>caused by "drain_workqueue() with disk->open_mutex held" if we assign
>>caller-specific lockdep class to disk->open_mutex
>>
>>  https://sourceforge.net/p/tomoyo/tomoyo.git/ci/c2245c765ebeba9dcb924d9171d8d470a9ac41c8/
>>
>>.
>>
>>Also, we can avoid lockdep warning caused by "drain_workqueue() with disk->open_mutex held" +
>>"holding system_transition_mutex" if we forbid binding to pseudo files as backing file
>>in the loop driver
>>
>>  https://lkml.kernel.org/r/d38e4600-3c32-491f-aa49-905f4fad1bfb@I-love.SAKURA.ne.jp
>>
>>which we can reproduce with
>>
>>  echo 7:0 > /sys/power/resume
>>  losetup /dev/loop0 /sys/power/resume
>>  cat /dev/loop0 > /dev/null
>>  losetup -d /dev/loop0
>>
>>.
>>
>> Therefore, I think we can address this problem by "drain_workqueue() with disk->open_mutex
>> held" in the loop driver side.
>>
> Good news.
>
Bad news: Subject: [syzbot] [block?] possible deadlock in loop_process_work
[3] https://lore.kernel.org/lkml/6a19f5f7.5099cdd9.8e407.0004.GAE@google.com/

syzbot found the following issue on:

HEAD commit:    c1ecb239fa34 Add linux-next specific files for 20260522
git tree:       linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=12fa6336580000
kernel config:  https://syzkaller.appspot.com/x/.config?x=77a9211ff284de54
dashboard link: https://syzkaller.appspot.com/bug?extid=78ad2c6a58c0a1faa5f5
compiler:       Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8

Unfortunately, I don't have any reproducer for this issue yet.

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/4cb88c910144/disk-c1ecb239.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/4a9bc938cf88/vmlinux-c1ecb239.xz
kernel image: https://storage.googleapis.com/syzbot-assets/684f1e33f264/bzImage-c1ecb239.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+78ad2c6a58c0a1faa5f5@syzkaller.appspotmail.com

======================================================
WARNING: possible circular locking dependency detected
syzkaller #0 Tainted: G             L
------------------------------------------------------
kworker/u8:15/1491 is trying to acquire lock:
ffff88805e1a6480 (sb_writers#5){.+.+}-{0:0}, at: do_req_filebacked drivers/block/loop.c:433 [inline]
ffff88805e1a6480 (sb_writers#5){.+.+}-{0:0}, at: loop_handle_cmd drivers/block/loop.c:1941 [inline]
ffff88805e1a6480 (sb_writers#5){.+.+}-{0:0}, at: loop_process_work+0x637/0x11b0 drivers/block/loop.c:1976

but task is already holding lock:
ffffc90006e27c40 ((work_completion)(&worker->work)){+.+.}-{0:0}, at: process_one_work+0x8be/0x1630 kernel/workqueue.c:3294

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #7 ((work_completion)(&worker->work)){+.+.}-{0:0}:
       process_one_work+0x8d7/0x1630 kernel/workqueue.c:3294
       process_scheduled_works kernel/workqueue.c:3401 [inline]
       worker_thread+0xb49/0x1140 kernel/workqueue.c:3482
       kthread+0x388/0x470 kernel/kthread.c:436
       ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
       ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245

-> #6 ((wq_completion)loop4){+.+.}-{0:0}:
       touch_wq_lockdep_map+0xcb/0x180 kernel/workqueue.c:4033
       __flush_workqueue+0x14b/0x14f0 kernel/workqueue.c:4075
       drain_workqueue+0xd3/0x390 kernel/workqueue.c:4239
       __loop_clr_fd drivers/block/loop.c:1130 [inline]
       lo_release+0x287/0x8f0 drivers/block/loop.c:1767
       bdev_release+0x541/0x660 block/bdev.c:-1
       blkdev_release+0x15/0x20 block/fops.c:705
       __fput+0x461/0xa70 fs/file_table.c:510
       fput_close_sync+0x11f/0x240 fs/file_table.c:615
       __do_sys_close fs/open.c:1511 [inline]
       __se_sys_close fs/open.c:1496 [inline]
       __x64_sys_close+0x7e/0x110 fs/open.c:1496
       do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
       do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

-> #5 (&disk->open_mutex){+.+.}-{4:4}:
       __mutex_lock_common kernel/locking/rtmutex_api.c:559 [inline]
       mutex_lock_nested+0x5a/0x1d0 kernel/locking/rtmutex_api.c:578
       __del_gendisk+0x127/0x980 block/genhd.c:710
       del_gendisk+0xe7/0x160 block/genhd.c:823
       nbd_dev_remove drivers/block/nbd.c:268 [inline]
       nbd_dev_remove_work+0x47/0xe0 drivers/block/nbd.c:284
       process_one_work+0x98b/0x1630 kernel/workqueue.c:3318
       process_scheduled_works kernel/workqueue.c:3401 [inline]
       worker_thread+0xb49/0x1140 kernel/workqueue.c:3482
       kthread+0x388/0x470 kernel/kthread.c:436
       ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
       ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245

-> #4 (&set->update_nr_hwq_lock){++++}-{4:4}:
       down_read+0x97/0x200 kernel/locking/rwsem.c:1568
       add_disk_fwnode+0xe7/0x480 block/genhd.c:596
       add_disk include/linux/blkdev.h:794 [inline]
       nbd_dev_add+0x72c/0xb50 drivers/block/nbd.c:1984
       nbd_genl_connect+0x965/0x1c80 drivers/block/nbd.c:2125
       genl_family_rcv_msg_doit+0x22a/0x330 net/netlink/genetlink.c:1114
       genl_family_rcv_msg net/netlink/genetlink.c:1194 [inline]
       genl_rcv_msg+0x61c/0x7a0 net/netlink/genetlink.c:1209
       netlink_rcv_skb+0x232/0x4b0 net/netlink/af_netlink.c:2551
       genl_rcv+0x28/0x40 net/netlink/genetlink.c:1218
       netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
       netlink_unicast+0x780/0x920 net/netlink/af_netlink.c:1345
       netlink_sendmsg+0x813/0xb40 net/netlink/af_netlink.c:1895
       sock_sendmsg_nosec+0x112/0x150 net/socket.c:797
       __sock_sendmsg net/socket.c:812 [inline]
       ____sys_sendmsg+0x55c/0x870 net/socket.c:2716
       ___sys_sendmsg+0x2a5/0x360 net/socket.c:2770
       __sys_sendmsg net/socket.c:2802 [inline]
       __do_sys_sendmsg net/socket.c:2807 [inline]
       __se_sys_sendmsg net/socket.c:2805 [inline]
       __x64_sys_sendmsg+0x1c3/0x2a0 net/socket.c:2805
       do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
       do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

-> #3 (genl_mutex){+.+.}-{4:4}:
       __mutex_lock_common kernel/locking/rtmutex_api.c:559 [inline]
       mutex_lock_nested+0x5a/0x1d0 kernel/locking/rtmutex_api.c:578
       genl_lock net/netlink/genetlink.c:35 [inline]
       genl_lock_all net/netlink/genetlink.c:48 [inline]
       genl_register_family+0x7b9/0x17b0 net/netlink/genetlink.c:784
       vdpa_init+0x39/0x70 drivers/vdpa/vdpa.c:1565
       do_one_initcall+0x250/0x870 init/main.c:1347
       do_initcall_level+0x104/0x190 init/main.c:1409
       do_initcalls+0x59/0xa0 init/main.c:1425
       kernel_init_freeable+0x2a6/0x3e0 init/main.c:1658
       kernel_init+0x1d/0x1d0 init/main.c:1548
       ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
       ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245

-> #2 (cb_lock){++++}-{4:4}:
       down_read+0x97/0x200 kernel/locking/rwsem.c:1568
       genl_rcv+0x19/0x40 net/netlink/genetlink.c:1217
       netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
       netlink_unicast+0x780/0x920 net/netlink/af_netlink.c:1345
       netlink_sendmsg+0x813/0xb40 net/netlink/af_netlink.c:1895
       sock_sendmsg_nosec+0x112/0x150 net/socket.c:797
       __sock_sendmsg net/socket.c:812 [inline]
       sock_sendmsg+0x1ca/0x2d0 net/socket.c:835
       splice_to_socket+0xae5/0x11f0 fs/splice.c:884
       do_splice_from fs/splice.c:936 [inline]
       do_splice+0xef8/0x1940 fs/splice.c:1349
       __do_splice fs/splice.c:1431 [inline]
       __do_sys_splice fs/splice.c:1634 [inline]
       __se_sys_splice+0x353/0x490 fs/splice.c:1616
       do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
       do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

-> #1 (&pipe->mutex){+.+.}-{4:4}:
       __mutex_lock_common kernel/locking/rtmutex_api.c:559 [inline]
       mutex_lock_nested+0x5a/0x1d0 kernel/locking/rtmutex_api.c:578
       iter_file_splice_write+0x1f3/0x10f0 fs/splice.c:682
       do_splice_from fs/splice.c:936 [inline]
       do_splice+0xef8/0x1940 fs/splice.c:1349
       __do_splice fs/splice.c:1431 [inline]
       __do_sys_splice fs/splice.c:1634 [inline]
       __se_sys_splice+0x353/0x490 fs/splice.c:1616
       do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
       do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

-> #0 (sb_writers#5){.+.+}-{0:0}:
       check_prev_add kernel/locking/lockdep.c:3167 [inline]
       check_prevs_add kernel/locking/lockdep.c:3286 [inline]
       validate_chain kernel/locking/lockdep.c:3910 [inline]
       __lock_acquire+0x15a5/0x2d10 kernel/locking/lockdep.c:5239
       lock_acquire+0x106/0x350 kernel/locking/lockdep.c:5870
       percpu_down_read_internal include/linux/percpu-rwsem.h:53 [inline]
       percpu_down_read_freezable include/linux/percpu-rwsem.h:83 [inline]
       __sb_start_write include/linux/fs/super.h:19 [inline]
       sb_start_write include/linux/fs/super.h:125 [inline]
       kiocb_start_write include/linux/fs.h:2767 [inline]
       lo_rw_aio+0xb1b/0xf00 drivers/block/loop.c:401
       do_req_filebacked drivers/block/loop.c:433 [inline]
       loop_handle_cmd drivers/block/loop.c:1941 [inline]
       loop_process_work+0x637/0x11b0 drivers/block/loop.c:1976
       process_one_work+0x98b/0x1630 kernel/workqueue.c:3318
       process_scheduled_works kernel/workqueue.c:3401 [inline]
       worker_thread+0xb49/0x1140 kernel/workqueue.c:3482
       kthread+0x388/0x470 kernel/kthread.c:436
       ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
       ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245

other info that might help us debug this:

Chain exists of:
  sb_writers#5 --> (wq_completion)loop4 --> (work_completion)(&worker->work)

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock((work_completion)(&worker->work));
                               lock((wq_completion)loop4);
                               lock((work_completion)(&worker->work));
  rlock(sb_writers#5);

 *** DEADLOCK ***
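
The report above can be read as a graph problem: an edge A -> B means "B was acquired while A was held", and a new acquisition is unsafe when it would close a cycle. The following is a minimal userspace sketch of that idea only. It is not how lockdep itself is implemented (for example it ignores read/write distinctions), and `lock_graph`, `lock_graph_would_cycle()`, `build_reported_chain()` and the enum names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_LOCKS 16

/* Lock classes from the report, in dependency order (#0 ... #7). */
enum {
	SB_WRITERS, PIPE_MUTEX, CB_LOCK, GENL_MUTEX,
	UPDATE_NR_HWQ_LOCK, OPEN_MUTEX, WQ_LOOP4, WORK_COMPLETION,
};

/* edge[a][b] is true when lock b was once acquired while lock a was held. */
struct lock_graph {
	bool edge[MAX_LOCKS][MAX_LOCKS];
};

void lock_graph_add(struct lock_graph *g, int held, int acquired)
{
	assert(held >= 0 && held < MAX_LOCKS);
	assert(acquired >= 0 && acquired < MAX_LOCKS);
	g->edge[held][acquired] = true;
}

/* Depth-first search: is "to" reachable from "from" via recorded edges? */
static bool reachable(const struct lock_graph *g, int from, int to, bool *seen)
{
	if (from == to)
		return true;
	seen[from] = true;
	for (int next = 0; next < MAX_LOCKS; next++) {
		if (g->edge[from][next] && !seen[next] &&
		    reachable(g, next, to, seen))
			return true;
	}
	return false;
}

/*
 * Acquiring "wanted" while holding "held" adds the edge held -> wanted.
 * That closes a cycle exactly when "held" is already reachable from "wanted".
 */
bool lock_graph_would_cycle(const struct lock_graph *g, int held, int wanted)
{
	bool seen[MAX_LOCKS] = { false };

	return reachable(g, wanted, held, seen);
}

/* Record the existing chain sb_writers -> ... -> work_completion. */
void build_reported_chain(struct lock_graph *g)
{
	memset(g, 0, sizeof(*g));
	for (int i = SB_WRITERS; i < WORK_COMPLETION; i++)
		lock_graph_add(g, i, i + 1);
}
```

In this model, acquiring sb_writers while holding the work completion closes the cycle. Removing the open_mutex -> (wq_completion)loop4 edge, which is what draining the workqueue without disk->open_mutex held amounts to, breaks it.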

Re: [PATCH v3] loop: Fix NULL pointer dereference in lo_rw_aio()

Posted by Tetsuo Handa 1 week, 2 days ago

On 2026/05/30 7:05, Hillf Danton wrote:
>>> Therefore, I think we can address this problem by "drain_workqueue() with disk->open_mutex
>>> held" in the loop driver side.
>>>
>> Good news.
>>
> Bad news: Subject: [syzbot] [block?] possible deadlock in loop_process_work
> [3] https://lore.kernel.org/lkml/6a19f5f7.5099cdd9.8e407.0004.GAE@google.com/
> 

OK. I sent two patches

  https://lkml.kernel.org/r/147ed056-03d9-4214-b925-0f10fc00cf27@I-love.SAKURA.ne.jp
  https://lkml.kernel.org/r/148efba2-a0b6-47d7-ac76-b19d2f4b696c@I-love.SAKURA.ne.jp

as preparation for evaluating the possibility of calling drain_workqueue() from
__loop_clr_fd(). But as far as syzbot has tested using the linux-next tree,

  https://syzkaller.appspot.com/bug?extid=c4e9d077bcc86bee08dc
  https://syzkaller.appspot.com/bug?extid=4feabfc9641267769c97

seem to remain even with the above patches applied.

Therefore, I think that we need to call drain_workqueue() from __loop_clr_fd()
without holding disk->open_mutex (if we address this NULL pointer dereference
problem by updating the loop driver).

"[PATCH v3] loop: Fix NULL pointer dereference in lo_rw_aio()" was an attempt to call
drain_workqueue() from __loop_clr_fd() without holding disk->open_mutex. But Sashiko's
review ( https://sashiko.dev/#/patchset/fda8abc8-6aa2-463b-bf72-865f6b838034%40I-love.SAKURA.ne.jp )
pointed out that the "module_put(THIS_MODULE);" executed as the last step of __loop_clr_fd()
opens a race window: the module refcount of the loop driver can become 0 due to this
module_put(THIS_MODULE) call, so a module unload operation can be triggered concurrently.
In other words, we cannot safely manage the refcount of the loop module without support
from the caller of lo_release() (i.e. bdev_release()).

  void bdev_release(struct file *bdev_file)
  {
  (...snipped...)
  	if (bdev_is_partition(bdev))
  		blkdev_put_part(bdev);
  	else
  		blkdev_put_whole(bdev);
  	mutex_unlock(&disk->open_mutex); // <= Continuing to hold disk->open_mutex until __loop_clr_fd() completes causes a circular locking problem.

  	module_put(disk->fops->owner); // <= Calling after __loop_clr_fd() completed is required for managing module refcount safely.
  put_no_open:
  	blkdev_put_no_open(bdev);
  }

Therefore, I think that the only robust and safe approach is, although you won't be
happy to see a layering violation / tricky code, either

  (a) allow __loop_clr_fd() to temporarily drop disk->open_mutex

or

  (b) add a new callback for the loop driver which is called between mutex_unlock(&disk->open_mutex) and module_put(disk->fops->owner)

. Jens, what do you think?

One might argue that this problem should be fixed on the filesystem side by
ensuring all filesystems wait for I/O requests safely. However, from the
perspective of defensive programming, the loop driver should be robust enough
to handle incomplete I/O serialization from underlying layers to prevent GPF.
Furthermore, without adding noisy debug printk() messages, it is extremely
difficult to pinpoint which specific layer or filesystem failed to wait for
the I/O requests.

[PATCH v4] loop: Fix NULL pointer dereference in lo_rw_aio()

Posted by Tetsuo Handa 1 day, 13 hours ago

syzbot is reporting NULL pointer dereference in lo_rw_aio() [1][2].
An analysis by the Gemini AI collaborator [3] considers that this problem
is caused by a timing shift primarily exposed by commit 65565ca5f99b
("block: unify the synchronous bi_end_io callbacks"), along with helper
refactorings like commit 92c3737a2473 ("block: add a bio_submit_or_kill
helper").

But because this race is difficult to reproduce, the discussion about what
is happening and how to fix this problem has stalled. Also, we haven't
identified how many filesystems are subject to this problem.

Therefore, this patch introduces a grace period for flushing pending I/O
requests (which should be a good thing from the perspective of defensive
programming) so that we won't hit the NULL pointer dereference problem, and
also emits a BUG: message in order to help filesystem developers identify
the caller of an I/O request that failed to wait for completion, so that
they can fix such a caller to wait for completion.

Note that emitting the BUG: message is enabled only if CONFIG_KCOV=y,
because this check is a waste of computation resources for almost all users.

Link: https://syzkaller.appspot.com/bug?extid=cd8a9a308e879a4e2c28 [1]
Link: https://syzkaller.appspot.com/bug?extid=bc273027d5643e48e5b3 [2]
Link: https://lkml.kernel.org/r/fbb3edda-f108-4e5b-acf2-266f043f8125@I-love.SAKURA.ne.jp [3]
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 drivers/block/loop.c | 82 ++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 80 insertions(+), 2 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 0000913f7efc..4ff254d8b623 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -85,8 +85,26 @@ struct loop_cmd {
 	struct bio_vec *bvec;
 	struct cgroup_subsys_state *blkcg_css;
 	struct cgroup_subsys_state *memcg_css;
+#ifdef CONFIG_KCOV
+	unsigned long stack_entries[30];
+	int stack_nr;
+	pid_t pid;
+	char comm[TASK_COMM_LEN];
+#endif
 };
 
+static void loop_check_io_race(struct loop_device *lo, struct loop_cmd *cmd)
+{
+#ifdef CONFIG_KCOV
+	if (unlikely(data_race(READ_ONCE(lo->lo_state)) == Lo_rundown)) {
+		pr_err("BUG: %s/%u is doing I/O request on loop%d in Lo_rundown state.\n",
+		       cmd->comm, cmd->pid, lo->lo_number);
+		printk("Call trace:\n");
+		stack_trace_print(cmd->stack_entries, cmd->stack_nr, 4);
+	}
+#endif
+}
+
 #define LOOP_IDLE_WORKER_TIMEOUT (60 * HZ)
 #define LOOP_DEFAULT_HW_Q_DEPTH 128
 
@@ -1747,8 +1765,59 @@ static void lo_release(struct gendisk *disk)
 	need_clear = (lo->lo_state == Lo_rundown);
 	mutex_unlock(&lo->lo_mutex);
 
-	if (need_clear)
+	if (need_clear) {
+		/*
+		 * Temporarily release disk->open_mutex in order to flush pending I/O
+		 * requests before clearing the backing device.
+		 *
+		 * This is a layering violation. But since bdev->bd_disk->fops->release()
+		 * (which is mapped to lo_release()) is the final function which
+		 * blkdev_put_whole() from bdev_release() calls immediately before
+		 * releasing disk->open_mutex, this changes nothing except opens a new
+		 * race window for allowing disk->fops->open() (which is mapped to
+		 * lo_open()) to be called.
+		 *
+		 * Even if lo_open() is called from blkdev_get_whole() due to this race,
+		 * the Lo_rundown state guarantees that lo_open() will fail with -ENXIO.
+		 * Thus, there will be effectively no change caused by this violation.
+		 */
+		mutex_unlock(&lo->lo_disk->open_mutex);
+		/*
+		 * Now that loop_queue_rq() sees lo->lo_state != Lo_bound,
+		 * wait for already started loop_queue_rq() to complete.
+		 */
+		synchronize_rcu();
+		/*
+		 * Now that no more works are scheduled by loop_queue_rq(),
+		 * wait for already scheduled works to complete.
+		 */
+		drain_workqueue(lo->workqueue);
+		/*
+		 * Now that no more AIO requests are scheduled by lo_rw_aio(),
+		 * wait for already started AIO to complete.
+		 *
+		 * Due to synchronize_rcu() + drain_workqueue() sequence above,
+		 * calling blk_mq_unfreeze_queue() immediately after blk_mq_freeze_queue()
+		 * returns has to be safe, for loop_queue_rq() no longer schedules new
+		 * lo_rw_aio() works and lo_rw_aio() no longer submits new AIO requests.
+		 *
+		 * Deferring blk_mq_unfreeze_queue() does not help because we are about
+		 * to clear the backing device and drop the refcount for the backing device.
+		 * There is nothing we can do if blk_mq_freeze_queue() fails to flush.
+		 */
+		blk_mq_unfreeze_queue(lo->lo_queue, blk_mq_freeze_queue(lo->lo_queue));
+		/*
+		 * Perform remaining cleanup, with disk->open_mutex held.
+		 *
+		 * The lo->lo_state should remain Lo_rundown despite we temporarily
+		 * released disk->open_mutex, for I am the only and the last user of
+		 * this loop device because lo_open() cannot succeed.
+		 */
+		mutex_lock(&lo->lo_disk->open_mutex);
+		if (WARN_ON(data_race(READ_ONCE(lo->lo_state)) != Lo_rundown))
+			return;
 		__loop_clr_fd(lo);
+	}
 }
 
 static void lo_free_disk(struct gendisk *disk)
@@ -1855,10 +1924,18 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
 	struct loop_cmd *cmd = blk_mq_rq_to_pdu(rq);
 	struct loop_device *lo = rq->q->queuedata;
 
+#ifdef CONFIG_KCOV
+	cmd->stack_nr = stack_trace_save(cmd->stack_entries, ARRAY_SIZE(cmd->stack_entries), 0);
+	cmd->pid = current->pid;
+	get_task_comm(cmd->comm, current);
+#endif
+
 	blk_mq_start_request(rq);
 
-	if (data_race(READ_ONCE(lo->lo_state)) != Lo_bound)
+	if (data_race(READ_ONCE(lo->lo_state)) != Lo_bound) {
+		loop_check_io_race(lo, cmd);
 		return BLK_STS_IOERR;
+	}
 
 	switch (req_op(rq)) {
 	case REQ_OP_FLUSH:
@@ -1901,6 +1978,7 @@ static void loop_handle_cmd(struct loop_cmd *cmd)
 	int ret = 0;
 	struct mem_cgroup *old_memcg = NULL;
 
+	loop_check_io_race(lo, cmd);
 	if (write && (lo->lo_flags & LO_FLAGS_READ_ONLY)) {
 		ret = -EIO;
 		goto failed;
-- 
2.47.3
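
The ordering that the lo_release() hunk above relies on can be illustrated with a small single-threaded model. This is only a sketch under stated assumptions: it does not model real concurrency or locking, every `model_*` name is hypothetical, and each stage simply stands in for the corresponding call named in the patch comments (synchronize_rcu(), drain_workqueue(), and the freeze/unfreeze pair). It shows why the backing file should be cleared only after all three stages have been flushed in that order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_state { MODEL_BOUND, MODEL_RUNDOWN, MODEL_UNBOUND };

struct model_loop {
	enum model_state state;
	const char *backing;	/* stands in for lo->lo_backing_file */
	int in_queue_rq;	/* submitters past the state check, not yet queued */
	int queued_work;	/* work items waiting on the workqueue */
	int inflight_aio;	/* AIO submitted but not yet completed */
	int null_derefs;	/* stage executions that saw backing == NULL */
};

/* Stage 1 (synchronize_rcu() in the patch): submitters finish queueing. */
void model_finish_queue_rq(struct model_loop *lo)
{
	lo->queued_work += lo->in_queue_rq;
	lo->in_queue_rq = 0;
}

/* Stage 2 (drain_workqueue()): queued work runs and submits AIO. */
void model_drain_workqueue(struct model_loop *lo)
{
	if (lo->queued_work > 0 && lo->backing == NULL)
		lo->null_derefs += lo->queued_work;
	lo->inflight_aio += lo->queued_work;
	lo->queued_work = 0;
}

/* Stage 3 (blk_mq_freeze_queue()/blk_mq_unfreeze_queue()): AIO completes. */
void model_wait_aio(struct model_loop *lo)
{
	if (lo->inflight_aio > 0 && lo->backing == NULL)
		lo->null_derefs += lo->inflight_aio;
	lo->inflight_aio = 0;
}

/* Model of __loop_clr_fd(): only valid in the rundown state. */
void model_clr_fd(struct model_loop *lo)
{
	assert(lo->state == MODEL_RUNDOWN);
	lo->backing = NULL;
	lo->state = MODEL_UNBOUND;
}

/* Returns the number of modelled NULL pointer dereferences. */
int model_release(struct model_loop *lo, bool flush_first)
{
	lo->state = MODEL_RUNDOWN;
	if (flush_first) {
		model_finish_queue_rq(lo);
		model_drain_workqueue(lo);
		model_wait_aio(lo);
	}
	model_clr_fd(lo);
	/* Anything still pending now runs against a cleared backing file. */
	model_finish_queue_rq(lo);
	model_drain_workqueue(lo);
	model_wait_aio(lo);
	return lo->null_derefs;
}
```

Calling model_release() with flush_first set to false leaves pending requests to run after the backing file has been cleared, while flushing first leaves nothing pending at that point. Each stage feeds the next one, which is why the model flushes them in submission order.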

Re: [PATCH v3] loop: Fix NULL pointer dereference in lo_rw_aio()

Posted by Tetsuo Handa 1 week, 5 days ago

On 2026/05/27 12:00, Ming Lei wrote:
> On Wed, May 27, 2026 at 10:35:56AM +0900, Tetsuo Handa wrote:
>> On 2026/05/27 10:20, Ming Lei wrote:
>>>> Of course we should try to figure out the root cause first, but how can we do?
>>>
>>> Definitely unexpected write IO(after umount & loop closed) from btrfs is more serious,
>>> which may cause data loss, so CC btrfs list and maintainer.
>>

I had a conversation with Google AI mode, and received the following response.

--------------------------------------------------------------------------------
Technical Analysis: lo_rw_aio Null Pointer Dereference / UAF since v7.1-rc1

1. The Root Cause of the Timing Shift

This regression was introduced during the v7.1-rc1 merge window, primarily exposed by
Commit 65565ca5f99b ("block: unify the synchronous bi_end_io callbacks"), along with
helper refactorings like Commit 92c3737a2473 ("block: add a bio_submit_or_kill helper").

Prior to v7.0, the synchronous I/O completion path inherently contained execution lags (due
to serialized completion handling and context switches) before notifying upper layers. This
latency accidentally acted as a natural safety barrier. It ensured that by the time a file
system completed its final sync_filesystem() and initiated umount, the loop driver's internal
workqueue (lo_rw_aio) had already finished processing everything.

In v7.1, the unification and optimization of bi_end_io significantly minimized this latency.
The filesystem now learns of "I/O completion" much faster. Consequently, highly-concurrent
execution pipelines like btrfs or jfs proceed rapidly through kill_sb() and blkdev_put(),
ultimately invoking lo_release() -> __loop_clr_fd() while the loop driver's backend kworker
is still in the middle of executing the last sub-millisecond asynchronous file-backed I/O
request.

2. Why the Block Layer's Built-in Quiesce/Freeze Fails

There is an implicit assumption that standard block layer freeze mechanisms (blk_mq_freeze_queue())
protect the device lifetime during release. However, the v7.1 BIO helper refactoring introduced
a synchronization gap:

  1. The filesystem triggers its final metadata or journal updates (e.g., txCommit in jfs or
     delayed refcount updates in btrfs) right during the unmount/close boundary.
  2. Due to the optimized execution path, these requests bypass the block layer's active
     request-tracking metrics at the exact moment blk_mq_freeze_queue() or state validation
     checks evaluated them as zero.
  3. The block layer assumes the queue is safe and silent, allowing __loop_clr_fd() to
     progress and nullify lo->lo_backing_file (or trigger fput()).
  4. The leaked asynchronous kworker wakes up a fraction of a millisecond too late, attempts
     to access lo->lo_backing_file or invokes kiocb_end_write() -> file_inode(), leading to
     either a general protection fault (Null pointer dereference) or a Use-After-Free (UAF).

3. Why This Isn't Just an "Unexpected FS Bug"

While the write I/O originates from file systems like btrfs and jfs post-close, blaming the
file systems entirely ignores the underlying infrastructure change. The core issue is that the
block layer altered its synchronization behavior, breaking the barrier contract that
VFS and file systems historically relied on during the device release path.

Papering over this inside individual file systems would require adding heavy, duplicated
barriers inside every single filesystem's unmount path.

Re: [PATCH v3] loop: Fix NULL pointer dereference in lo_rw_aio()

Posted by Damien Le Moal 1 week, 5 days ago

On 2026/05/27 20:29, Tetsuo Handa wrote:
> On 2026/05/27 12:00, Ming Lei wrote:
>> On Wed, May 27, 2026 at 10:35:56AM +0900, Tetsuo Handa wrote:
>>> On 2026/05/27 10:20, Ming Lei wrote:
>>>>> Of course we should try to figure out the root cause first, but how can we do?
>>>>
>>>> Definitely unexpected write IO(after umount & loop closed) from btrfs is more serious,
>>>> which may cause data loss, so CC btrfs list and maintainer.
>>>
> 
> I had a conversation with Google AI mode, and received the following response.
> 
> --------------------------------------------------------------------------------
> Technical Analysis: lo_rw_aio Null Pointer Dereference / UAF since v7.1-rc1
> 
> 
> 1. The Root Cause of the Timing Shift
> 
> This regression was introduced during the v7.1-rc1 merge window, primarily exposed by
> Commit 65565ca5f99b ("block: unify the synchronous bi_end_io callbacks"), along with
> helper refactorings like Commit 92c3737a2473 ("block: add a bio_submit_or_kill helper").
> 
> Prior to v7.0, the synchronous I/O completion path inherently contained execution lags (due
> to serialized completion handling and context switches) before notifying upper layers. This
> latency accidentally acted as a natural safety barrier. It ensured that by the time a file
> system completed its final sync_filesystem() and initiated umount, the loop driver's internal
> workqueue (lo_rw_aio) had already finished processing everything.
> 
> In v7.1, the unification and optimization of bi_end_io significantly minimized this latency.
> The filesystem now learns of "I/O completion" much faster. Consequently, highly-concurrent
> execution pipelines like btrfs or jfs proceed rapidly through kill_sb() and blkdev_put(),
> ultimately invoking lo_release() -> __loop_clr_fd() while the loop driver's backend kworker
> is still in the middle of executing the last sub-millisecond asynchronous file-backed I/O
> request.
> 
> 
> 2. Why the Block Layer's Built-in Quiesce/Freeze Fails
> 
> There is an implicit assumption that standard block layer freeze mechanisms (blk_mq_freeze_queue())
> protect the device lifetime during release. However, the v7.1 BIO helper refactoring introduced
> a synchronization gap:
> 
>   1. The filesystem triggers its final metadata or journal updates (e.g., txCommit in jfs or
>      delayed refcount updates in btrfs) right during the unmount/close boundary.
>   2. Due to the optimized execution path, these requests bypass the block layer's active
>      request-tracking metrics at the exact moment blk_mq_freeze_queue() or state validation
>      checks evaluated them as zero.
>   3. The block layer assumes the queue is safe and silent, allowing __loop_clr_fd() to
>      progress and nullify lo->lo_backing_file (or trigger fput()).
>   4. The leaked asynchronous kworker wakes up a fraction of a millisecond too late, attempts
>      to access lo->lo_backing_file or invokes kiocb_end_write() -> file_inode(), leading to
>      either a general protection fault (Null pointer dereference) or a Use-After-Free (UAF).
> 
> 
> 3. Why This Isn't Just an "Unexpected FS Bug"
> 
> While the write I/O originates from file systems like btrfs and jfs post-close, blaming the
> file systems entirely ignores the underlying infrastructure change. The core issue is that the
> block layer altered its synchronization behavior, breaking the barrier contract that
> VFS and file systems historically relied on during the device release path.
> 
> Papering over this inside individual file systems would require adding heavy, duplicated
> barriers inside every single filesystem's unmount path.

It sounds like the VFS unmount call needs to have something that waits for
sync() to complete. Though, it really feels very strange that an FS can complete
unmount without itself ensuring that there are no more IOs in flight. The
generic VFS layer cannot know what the FS needs to flush on unmount, so waiting
on a generic sync might not be enough.

It really feels like this is a btrfs and jfs issue, unless the same can be
reproduced with any file system (XFS, ext4, f2fs, ...).

Just my 2 cents.


-- 
Damien Le Moal
Western Digital Research

Re: [PATCH v3] loop: Fix NULL pointer dereference in lo_rw_aio()

Posted by Christoph Hellwig 1 week, 4 days ago

On Thu, May 28, 2026 at 03:11:05AM +0900, Damien Le Moal wrote:
> It sounds like the VFS unmount call needs to have something that waits for
> sync() to complete. Though, it really feels very strange that an FS can complete

I don't think this is the VFS-controlled VFS file data writeback, which
we wait on, but some kind of fs controlled metadata.  And yes, it looks
like those file systems are buggy in that area.  We definitively had
such bugs in XFS before and fixed them.

e.g. 9c7504aa72b6 ("xfs: track and serialize in-flight async buffers against
unmount")
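
For readers unfamiliar with the general idea of tracking in-flight I/O
against unmount, here is a minimal userspace sketch. It is not the XFS
implementation; struct fs_model and the helper names are hypothetical, and a
real filesystem would use an atomic or per-CPU counter and sleep until it
drains rather than poll.

```c
#include <assert.h>

/* Hypothetical model of a filesystem that counts its own async I/O. */
struct fs_model {
	unsigned int inflight;	/* I/Os submitted but not yet completed */
};

/* Called when the filesystem submits an async metadata I/O. */
static void io_submit(struct fs_model *fs)
{
	fs->inflight++;
}

/* Called from the I/O completion path. */
static void io_complete(struct fs_model *fs)
{
	assert(fs->inflight > 0);
	fs->inflight--;
}

/*
 * Returns 1 when unmount may release the block device, 0 when it must
 * keep waiting because some I/O has not completed yet.
 */
static int unmount_may_release(const struct fs_model *fs)
{
	return fs->inflight == 0;
}
```

The point of the sketch is only that the filesystem itself, not the VFS,
knows which I/Os it still has outstanding.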

Re: [PATCH v3] loop: Fix NULL pointer dereference in lo_rw_aio()

Posted by Qu Wenruo 1 week, 4 days ago

On 2026/5/28 18:08, Christoph Hellwig wrote:
> On Thu, May 28, 2026 at 03:11:05AM +0900, Damien Le Moal wrote:
>> It sounds like the VFS unmount call needs to have something that waits for
>> sync() to complete. Though, it really feels very strange that an FS can complete
> 
> I don't think this is the VFS-controlled VFS file data writeback, which
> we wait on, but some kind of fs controlled metadata.  And yes, it looks
> like those file systems are buggy in that area.  We definitively had
> such bugs in XFS before and fixed them.
> 
> e.g. 9c7504aa72b6 ("xfs: track and serialize in-flight async buffers against
> unmount")
Considering the xfs fix is pretty old, it's before the fix hint thus no 
such mention in fstests.

Do you happen to know which test case is for that fix?
I'd like to adapt it for btrfs as a reproducer.

This syzbot report doesn't provide a reproducer.

Another thing is, if it's some btrfs bios on-the-fly after 
close_ctree(), the most common symptom should be NULL pointer 
dereference inside various btrfs endio functions.
As all those end_bbio_*() functions are referring to either fs_info or 
inode/eb, thus if the fs is unmounted before the bio finished, they 
should all cause use-after-free.

The only exception is discard, which is using blkdev_issue_discard() 
thus has no such reference to btrfs internal structure, but that's out 
of my understanding.

Thanks,
Qu

Re: [PATCH v3] loop: Fix NULL pointer dereference in lo_rw_aio()

Posted by Ming Lei 1 week ago

On Thu, May 28, 2026 at 5:16 AM Qu Wenruo <wqu@suse.com> wrote:
>
>
>
> On 2026/5/28 18:08, Christoph Hellwig wrote:
> > On Thu, May 28, 2026 at 03:11:05AM +0900, Damien Le Moal wrote:
> >> It sounds like the VFS unmount call needs to have something that waits for
> >> sync() to complete. Though, it really feels very strange that an FS can complete
> >
> > I don't think this is the VFS-controlled VFS file data writeback, which
> > we wait on, but some kind of fs controlled metadata.  And yes, it looks
> > like those file systems are buggy in that area.  We definitively had
> > such bugs in XFS before and fixed them.
> >
> > e.g. 9c7504aa72b6 ("xfs: track and serialize in-flight async buffers against
> > unmount")
> Considering the xfs fix is pretty old, it's before the fix hint thus no
> such mention in fstests.
>
> Do you happen to know which test case is for that fix?
> I'd like to adapt it for btrfs as a reproducer.
>
> This syzbot report doesn't provide a reproducer.
>
>
> Another thing is, if it's some btrfs bios on-the-fly after
> close_ctree(), the most common symptom should be NULL pointer
> dereference inside various btrfs endio functions.
> As all those end_bbio_*() functions are referring to either fs_info or
> inode/eb, thus if the fs is unmounted before the bio finished, they
> should all cause use-after-free.
>
> The only exception is discard, which is using blkdev_issue_discard()
> thus has no such reference to btrfs internal structure, but that's out
> of my understanding.

syzbot log shows the null-ptr-deref  is on WRITE, instead of DISCARD.

https://syzkaller.appspot.com/bug?extid=cd8a9a308e879a4e2c28

Adding WARN_ON(!lo->lo_backing_file) in loop_queue_rq() might capture
this bio submission context if this req isn't issued via wq.
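
A rough userspace model of that debugging aid, as a sketch only: the real
change would be a one-line WARN_ON() in drivers/block/loop.c. Here
model_warn_on(), warn_count and model_loop_queue_rq() are hypothetical
stand-ins, and the placement after the state check is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Userspace model; names mirror the driver but this is not kernel code. */
enum lo_state { Lo_unbound, Lo_bound, Lo_rundown };
enum blk_status { BLK_STS_OK, BLK_STS_IOERR };

struct loop_device {
	enum lo_state lo_state;
	const void *lo_backing_file;	/* stands in for struct file * */
	unsigned int warn_count;	/* hypothetical: counts warning hits */
};

/* Hypothetical stand-in for the kernel's WARN_ON(): log and count. */
static int model_warn_on(struct loop_device *lo, int cond, const char *what)
{
	if (cond) {
		lo->warn_count++;
		fprintf(stderr, "WARNING: %s\n", what);
	}
	return cond;
}

/* Model of the request-queuing entry point with the extra check added. */
static enum blk_status model_loop_queue_rq(struct loop_device *lo)
{
	if (lo->lo_state != Lo_bound)
		return BLK_STS_IOERR;

	/* Debug aid only: report the submitter, do not change the result. */
	model_warn_on(lo, lo->lo_backing_file == NULL, "!lo->lo_backing_file");

	return BLK_STS_OK;
}
```

In this model the warning fires only when the state check passes while the
backing file is already cleared, which is the submission context of interest.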

Thanks,
Ming Lei

Re: [PATCH v3] loop: Fix NULL pointer dereference in lo_rw_aio()

Posted by Hillf Danton 1 week ago

On Mon, 1 Jun 2026 10:29:25 -0500 Ming Lei wrote:
>On Thu, May 28, 2026 at 5:16 AM Qu Wenruo <wqu@suse.com> wrote:
>> On 2026/5/28 18:08, Christoph Hellwig wrote:
>> > On Thu, May 28, 2026 at 03:11:05AM +0900, Damien Le Moal wrote:
>> >> It sounds like the VFS unmount call needs to have something that waits for
>> >> sync() to complete. Though, it really feels very strange that an FS can complete
>> >
>> > I don't think this is the VFS-controlled VFS file data writeback, which
>> > we wait on, but some kind of fs controlled metadata.  And yes, it looks
>> > like those file systems are buggy in that area.  We definitively had
>> > such bugs in XFS before and fixed them.
>> >
>> > e.g. 9c7504aa72b6 ("xfs: track and serialize in-flight async buffers against
>> > unmount")
>> Considering the xfs fix is pretty old, it's before the fix hint thus no
>> such mention in fstests.
>>
>> Do you happen to know which test case is for that fix?
>> I'd like to adapt it for btrfs as a reproducer.
>>
>> This syzbot report doesn't provide a reproducer.
>>
>>
>> Another thing is, if it's some btrfs bios on-the-fly after
>> close_ctree(), the most common symptom should be NULL pointer
>> dereference inside various btrfs endio functions.
>> As all those end_bbio_*() functions are referring to either fs_info or
>> inode/eb, thus if the fs is unmounted before the bio finished, they
>> should all cause use-after-free.
>>
>> The only exception is discard, which is using blkdev_issue_discard()
>> thus has no such reference to btrfs internal structure, but that's out
>> of my understanding.
>
> syzbot log shows the null-ptr-deref  is on WRITE, instead of DISCARD.
>
> https://syzkaller.appspot.com/bug?extid=cd8a9a308e879a4e2c28
>
> Adding WARN_ON(!lo->lo_backing_file) in loop_queue_rq() might capture
> this bio submission context if this req isn't issued via wq.
>
I suspect this makes $.02 sense given the check of Lo_bound upon queuing rq.

static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
		const struct blk_mq_queue_data *bd)
{
	struct request *rq = bd->rq;
	struct loop_cmd *cmd = blk_mq_rq_to_pdu(rq);
	struct loop_device *lo = rq->q->queuedata;

	blk_mq_start_request(rq);

	if (data_race(READ_ONCE(lo->lo_state)) != Lo_bound)
		return BLK_STS_IOERR;

Re: [PATCH v3] loop: Fix NULL pointer dereference in lo_rw_aio()

Posted by Ming Lei 1 week ago

On Tue, Jun 02, 2026 at 05:51:26AM +0800, Hillf Danton wrote:
> On Mon, 1 Jun 2026 10:29:25 -0500 Ming Lei wrote:
> >On Thu, May 28, 2026 at 5:16 AM Qu Wenruo <wqu@suse.com> wrote:
> >> On 2026/5/28 18:08, Christoph Hellwig wrote:
> >> > On Thu, May 28, 2026 at 03:11:05AM +0900, Damien Le Moal wrote:
> >> >> It sounds like the VFS unmount call needs to have something that waits for
> >> >> sync() to complete. Though, it really feels very strange that an FS can complete
> >> >
> >> > I don't think this is the VFS-controlled VFS file data writeback, which
> >> > we wait on, but some kind of fs controlled metadata.  And yes, it looks
> >> > like those file systems are buggy in that area.  We definitively had
> >> > such bugs in XFS before and fixed them.
> >> >
> >> > e.g. 9c7504aa72b6 ("xfs: track and serialize in-flight async buffers against
> >> > unmount")
> >> Considering the xfs fix is pretty old, it's before the fix hint thus no
> >> such mention in fstests.
> >>
> >> Do you happen to know which test case is for that fix?
> >> I'd like to adapt it for btrfs as a reproducer.
> >>
> >> This syzbot report doesn't provide a reproducer.
> >>
> >>
> >> Another thing is, if it's some btrfs bios on-the-fly after
> >> close_ctree(), the most common symptom should be NULL pointer
> >> dereference inside various btrfs endio functions.
> >> As all those end_bbio_*() functions are referring to either fs_info or
> >> inode/eb, thus if the fs is unmounted before the bio finished, they
> >> should all cause use-after-free.
> >>
> >> The only exception is discard, which is using blkdev_issue_discard()
> >> thus has no such reference to btrfs internal structure, but that's out
> >> of my understanding.
> >
> > syzbot log shows the null-ptr-deref  is on WRITE, instead of DISCARD.
> >
> > https://syzkaller.appspot.com/bug?extid=cd8a9a308e879a4e2c28
> >
> > Adding WARN_ON(!lo->lo_backing_file) in loop_queue_rq() might capture
> > this bio submission context if this req isn't issued via wq.
> >
> I suspect this makes $.02 sense given the check of Lo_bound upon queuing rq.

Can't lo->lo_state be updated after the check? It is totally lockless...
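
To make the window concrete, here is a single-threaded userspace replay of
one possible interleaving (check, teardown, use). It is a sketch under
simplifying assumptions: the helper names are hypothetical, and
model_clr_fd() is only loosely modelled on the lo_release() ->
__loop_clr_fd() path.

```c
#include <assert.h>
#include <stddef.h>

enum lo_state { Lo_unbound, Lo_bound, Lo_rundown };

struct loop_device {
	enum lo_state lo_state;
	const char *lo_backing_file;	/* stands in for struct file * */
};

/* Step 1: the submitter's lockless state check. */
static int submit_check_passes(const struct loop_device *lo)
{
	return lo->lo_state == Lo_bound;
}

/* Step 2: simplified teardown that another CPU could run at any time. */
static void model_clr_fd(struct loop_device *lo)
{
	lo->lo_state = Lo_rundown;
	lo->lo_backing_file = NULL;
}

/* Step 3: the worker is about to use the backing file; 1 means NULL. */
static int worker_sees_null(const struct loop_device *lo)
{
	return lo->lo_backing_file == NULL;
}

/*
 * Replay one ordering. With teardown_between != 0 the teardown runs
 * after the check has passed but before the worker touches the file.
 */
static int replay_race(int teardown_between)
{
	struct loop_device lo = { Lo_bound, "backing" };

	if (!submit_check_passes(&lo))
		return 0;
	if (teardown_between)
		model_clr_fd(&lo);
	return worker_sees_null(&lo);
}
```

replay_race(0) returns 0 and replay_race(1) returns 1: the state check alone
does not stop the worker from observing a cleared backing file.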


Thanks,
Ming

Re: [PATCH v3] loop: Fix NULL pointer dereference in lo_rw_aio()

Posted by Hillf Danton 1 week ago

On Mon, 1 Jun 2026 17:14:59 -0500 Ming Lei wrote:
> On Tue, Jun 02, 2026 at 05:51:26AM +0800, Hillf Danton wrote:
> > On Mon, 1 Jun 2026 10:29:25 -0500 Ming Lei wrote:
> > > syzbot log shows the null-ptr-deref  is on WRITE, instead of DISCARD.
> > >
> > > https://syzkaller.appspot.com/bug?extid=cd8a9a308e879a4e2c28
> > >
> > > Adding WARN_ON(!lo->lo_backing_file) in loop_queue_rq() might capture
> > > this bio submission context if this req isn't issued via wq.
> > >
> > I suspect this makes $.02 sense given the check of Lo_bound upon queuing rq.
> 
> Can't lo->lo_state be updated after the check? It is totally lockless...
>
Sounds good hm... do you mean it is UNWISE to not flush the loop workqueue
when closing disk?

Re: [PATCH v3] loop: Fix NULL pointer dereference in lo_rw_aio()

Posted by Ming Lei 1 week ago

On Tue, Jun 02, 2026 at 07:17:30AM +0800, Hillf Danton wrote:
> On Mon, 1 Jun 2026 17:14:59 -0500 Ming Lei wrote:
> > On Tue, Jun 02, 2026 at 05:51:26AM +0800, Hillf Danton wrote:
> > > On Mon, 1 Jun 2026 10:29:25 -0500 Ming Lei wrote:
> > > > syzbot log shows the null-ptr-deref  is on WRITE, instead of DISCARD.
> > > >
> > > > https://syzkaller.appspot.com/bug?extid=cd8a9a308e879a4e2c28
> > > >
> > > > Adding WARN_ON(!lo->lo_backing_file) in loop_queue_rq() might capture
> > > > this bio submission context if this req isn't issued via wq.
> > > >
> > > I suspect this makes $.02 sense given the check of Lo_bound upon queuing rq.
> > 
> > Can't lo->lo_state be updated after the check? It is totally lockless...
> >
> Sounds good hm... do you mean it is UNWISE to not flush the loop workqueue
> when closing disk?

Quite the opposite, it is wise to not flush wq in __loop_clr_fd(), please
see my previous comment.


Thanks,
Ming

Re: [PATCH v3] loop: Fix NULL pointer dereference in lo_rw_aio()

Posted by Hillf Danton 6 days, 21 hours ago

On Mon, 1 Jun 2026 18:36:19 -0500 Ming Lei wrote:
> On Tue, Jun 02, 2026 at 07:17:30AM +0800, Hillf Danton wrote:
> > On Mon, 1 Jun 2026 17:14:59 -0500 Ming Lei wrote:
> > > On Tue, Jun 02, 2026 at 05:51:26AM +0800, Hillf Danton wrote:
> > > > On Mon, 1 Jun 2026 10:29:25 -0500 Ming Lei wrote:
> > > > > syzbot log shows the null-ptr-deref  is on WRITE, instead of DISCARD.
> > > > >
> > > > > https://syzkaller.appspot.com/bug?extid=cd8a9a308e879a4e2c28
> > > > >
> > > > > Adding WARN_ON(!lo->lo_backing_file) in loop_queue_rq() might capture
> > > > > this bio submission context if this req isn't issued via wq.
> > > > >
> > > > I suspect this makes $.02 sense given the check of Lo_bound upon queuing rq.
> > > 
> > > Can't lo->lo_state be updated after the check? It is totally lockless...
> > >
> > Sounds good hm... do you mean it is UNWISE to not flush the loop workqueue
> > when closing disk?
> 
> Quite the opposite, it is wise to not flush wq in __loop_clr_fd(), please
> see my previous comment.
>
When queuing rq, if lo_state is updated after checking Lo_bound, I see nothing
that prevents syzbot from reporting the null-ptr-deref. Can you pinpoint
why flush is NOT needed if you are right?

Re: [PATCH v3] loop: Fix NULL pointer dereference in lo_rw_aio()

Posted by Christoph Hellwig 1 week ago

On Thu, May 28, 2026 at 07:46:24PM +0930, Qu Wenruo wrote:
>> e.g. 9c7504aa72b6 ("xfs: track and serialize in-flight async buffers against
>> unmount")
> Considering the xfs fix is pretty old, it's before the fix hint thus no 
> such mention in fstests.
>
> Do you happen to know which test case is for that fix?
> I'd like to adapt it for btrfs as a reproducer.

No.  Adding Brian who authored that commit.

Re: [PATCH v3] loop: Fix NULL pointer dereference in lo_rw_aio()

Posted by Brian Foster 1 week ago

On Mon, Jun 01, 2026 at 04:40:34PM +0200, Christoph Hellwig wrote:
> On Thu, May 28, 2026 at 07:46:24PM +0930, Qu Wenruo wrote:
> >> e.g. 9c7504aa72b6 ("xfs: track and serialize in-flight async buffers against
> >> unmount")
> > Considering the xfs fix is pretty old, it's before the fix hint thus no 
> > such mention in fstests.
> >
> > Do you happen to know which test case is for that fix?
> > I'd like to adapt it for btrfs as a reproducer.
> 
> No.  Adding Brian who authored that commit.
> 

I haven't followed through the full thread here... But if you're just
looking for an existing test case associated with the commit above on
XFS, I did some quick digging and xfs/311 is the original reproducer for
that one.

Brian

Re: [PATCH v3] loop: Fix NULL pointer dereference in lo_rw_aio()

Posted by Qu Wenruo 1 week ago


On 2026/6/2 01:59, Brian Foster wrote:
> On Mon, Jun 01, 2026 at 04:40:34PM +0200, Christoph Hellwig wrote:
>> On Thu, May 28, 2026 at 07:46:24PM +0930, Qu Wenruo wrote:
>>>> e.g. 9c7504aa72b6 ("xfs: track and serialize in-flight async buffers against
>>>> unmount")
>>> Considering the xfs fix is pretty old, it's before the fix hint thus no
>>> such mention in fstests.
>>>
>>> Do you happen to know which test case is for that fix?
>>> I'd like to adapt it for btrfs as a reproducer.
>>
>> No.  Adding Brian who authored that commit.
>>
> 
> I haven't followed through the full thread here... But if you're just
> looking for an existing test case associated with the commit above on
> XFS, I did some quick digging and xfs/311 is the original reproducer for
> that one.

Thanks a lot! I'll use the same delayed umount to verify the behavior of 
btrfs.

Thanks,
Qu

> 
> Brian
>