io_ring_ctx's mutex uring_lock can be quite expensive in high-IOPS
workloads. Even when only one thread pinned to a single CPU is accessing
the io_ring_ctx, the atomic CASes required to lock and unlock the mutex
are very hot instructions. The mutex's primary purpose is to prevent
concurrent io_uring system calls on the same io_ring_ctx. However, there
is already a flag IORING_SETUP_SINGLE_ISSUER that promises only one
task will make io_uring_enter() and io_uring_register() system calls on
the io_ring_ctx once it's enabled.
So if the io_ring_ctx is set up with IORING_SETUP_SINGLE_ISSUER, skip the
uring_lock mutex_lock() and mutex_unlock() on the submitter_task. When
another task needs to acquire the ctx uring lock, use a task work item
to suspend the submitter_task for the critical section.
If the io_ring_ctx is IORING_SETUP_R_DISABLED (possible during
io_uring_setup(), io_uring_register(), or io_uring exit), submitter_task
may be set concurrently, so acquire the uring_lock before checking it.
If submitter_task isn't set yet, the uring_lock suffices to provide
mutual exclusion.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Tested-by: syzbot@syzkaller.appspotmail.com
---
io_uring/io_uring.c | 12 +++++
io_uring/io_uring.h | 114 ++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 123 insertions(+), 3 deletions(-)
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index ac71350285d7..9a9dfcb0476e 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -363,10 +363,22 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
xa_destroy(&ctx->io_bl_xa);
kfree(ctx);
return NULL;
}
+void io_ring_suspend_work(struct callback_head *cb_head)
+{
+ struct io_ring_suspend_work *suspend_work =
+ container_of(cb_head, struct io_ring_suspend_work, cb_head);
+ DECLARE_COMPLETION_ONSTACK(suspend_end);
+
+ *suspend_work->suspend_end = &suspend_end;
+ complete(&suspend_work->suspend_start);
+
+ wait_for_completion(&suspend_end);
+}
+
static void io_clean_op(struct io_kiocb *req)
{
if (unlikely(req->flags & REQ_F_BUFFER_SELECTED))
io_kbuf_drop_legacy(req);
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 57c3eef26a88..2b08d0ddab30 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -1,8 +1,9 @@
#ifndef IOU_CORE_H
#define IOU_CORE_H
+#include <linux/completion.h>
#include <linux/errno.h>
#include <linux/lockdep.h>
#include <linux/resume_user_mode.h>
#include <linux/kasan.h>
#include <linux/poll.h>
@@ -195,19 +196,85 @@ void io_queue_next(struct io_kiocb *req);
void io_task_refs_refill(struct io_uring_task *tctx);
bool __io_alloc_req_refill(struct io_ring_ctx *ctx);
void io_activate_pollwq(struct io_ring_ctx *ctx);
+/*
+ * The ctx uring lock protects most of the mutable struct io_ring_ctx state
+ * accessed in the struct io_kiocb issue path. In the I/O path, it is typically
+ * acquired in the io_uring_enter() syscall and in io_handle_tw_list(). For
+ * IORING_SETUP_SQPOLL, it's acquired by io_sq_thread() instead. io_kiocbs
+ * issued with IO_URING_F_UNLOCKED in issue_flags (e.g. by io_wq_submit_work())
+ * acquire and release the ctx uring lock whenever they must touch io_ring_ctx
+ * state. io_uring_register() also acquires the ctx uring lock because most
+ * opcodes mutate io_ring_ctx state accessed in the issue path.
+ *
+ * For !IORING_SETUP_SINGLE_ISSUER io_ring_ctxs, acquiring the ctx uring lock
+ * is done via mutex_(try)lock(&ctx->uring_lock).
+ *
+ * However, for IORING_SETUP_SINGLE_ISSUER, we can avoid the mutex_lock() +
+ * mutex_unlock() overhead on submitter_task because a single thread can't race
+ * with itself. In the uncommon case where the ctx uring lock is needed on
+ * another thread, it must suspend submitter_task by scheduling a task work item
+ * on it. io_ring_ctx_lock() returns once the task work item has started.
+ * io_ring_ctx_unlock() allows the task work item to complete.
+ * If io_ring_ctx_lock() is called while the ctx is IORING_SETUP_R_DISABLED
+ * (e.g. during ctx create or exit), it must acquire uring_lock because
+ * submitter_task may not be set yet. submitter_task can be accessed once
+ * uring_lock is held. If submitter_task exists, we do the same thing as in the
+ * non-IORING_SETUP_R_DISABLED case (except with uring_lock also held). If
+ * submitter_task isn't set, all other io_ring_ctx_lock() callers will also
+ * acquire uring_lock, so it suffices for mutual exclusion.
+ */
+
+struct io_ring_suspend_work {
+ struct callback_head cb_head;
+ struct completion suspend_start;
+ struct completion **suspend_end;
+};
+
+void io_ring_suspend_work(struct callback_head *cb_head);
+
struct io_ring_ctx_lock_state {
+ bool need_mutex;
+ struct completion *suspend_end;
};
/* Acquire the ctx uring lock with the given nesting level */
static inline void io_ring_ctx_lock_nested(struct io_ring_ctx *ctx,
unsigned int subclass,
struct io_ring_ctx_lock_state *state)
{
- mutex_lock_nested(&ctx->uring_lock, subclass);
+ struct io_ring_suspend_work suspend_work;
+
+ if (!(ctx->flags & IORING_SETUP_SINGLE_ISSUER)) {
+ mutex_lock_nested(&ctx->uring_lock, subclass);
+ return;
+ }
+
+ state->suspend_end = NULL;
+ state->need_mutex =
+ !!(smp_load_acquire(&ctx->flags) & IORING_SETUP_R_DISABLED);
+ if (unlikely(state->need_mutex)) {
+ mutex_lock_nested(&ctx->uring_lock, subclass);
+ if (likely(!ctx->submitter_task))
+ return;
+ }
+
+ if (likely(current == ctx->submitter_task))
+ return;
+
+ /* Use task work to suspend submitter_task */
+ init_task_work(&suspend_work.cb_head, io_ring_suspend_work);
+ init_completion(&suspend_work.suspend_start);
+ suspend_work.suspend_end = &state->suspend_end;
+ /* If task_work_add() fails, task is exiting, so no need to suspend */
+ if (unlikely(task_work_add(ctx->submitter_task, &suspend_work.cb_head,
+ TWA_SIGNAL)))
+ return;
+
+ wait_for_completion(&suspend_work.suspend_start);
}
/* Acquire the ctx uring lock */
static inline void io_ring_ctx_lock(struct io_ring_ctx *ctx,
struct io_ring_ctx_lock_state *state)
@@ -217,29 +284,70 @@ static inline void io_ring_ctx_lock(struct io_ring_ctx *ctx,
/* Attempt to acquire the ctx uring lock without blocking */
static inline bool io_ring_ctx_trylock(struct io_ring_ctx *ctx,
struct io_ring_ctx_lock_state *state)
{
- return mutex_trylock(&ctx->uring_lock);
+ if (!(ctx->flags & IORING_SETUP_SINGLE_ISSUER))
+ return mutex_trylock(&ctx->uring_lock);
+
+ state->suspend_end = NULL;
+ state->need_mutex =
+ !!(smp_load_acquire(&ctx->flags) & IORING_SETUP_R_DISABLED);
+ if (unlikely(state->need_mutex)) {
+ if (!mutex_trylock(&ctx->uring_lock))
+ return false;
+ if (likely(!ctx->submitter_task))
+ return true;
+ }
+
+ if (likely(current == ctx->submitter_task))
+ return true;
+
+ if (unlikely(state->need_mutex))
+ mutex_unlock(&ctx->uring_lock);
+ return false;
}
/* Release the ctx uring lock */
static inline void io_ring_ctx_unlock(struct io_ring_ctx *ctx,
struct io_ring_ctx_lock_state *state)
{
- mutex_unlock(&ctx->uring_lock);
+ if (!(ctx->flags & IORING_SETUP_SINGLE_ISSUER)) {
+ mutex_unlock(&ctx->uring_lock);
+ return;
+ }
+
+ if (unlikely(state->need_mutex))
+ mutex_unlock(&ctx->uring_lock);
+ if (unlikely(state->suspend_end))
+ complete(state->suspend_end);
}
/* Return (if CONFIG_LOCKDEP) whether the ctx uring lock is held */
static inline bool io_ring_ctx_lock_held(const struct io_ring_ctx *ctx)
{
+ /*
+ * No straightforward way to check that submitter_task is suspended
+ * without access to struct io_ring_ctx_lock_state
+ */
+ if (ctx->flags & IORING_SETUP_SINGLE_ISSUER &&
+ !(ctx->flags & IORING_SETUP_R_DISABLED))
+ return true;
+
return lockdep_is_held(&ctx->uring_lock);
}
/* Assert (if CONFIG_LOCKDEP) that the ctx uring lock is held */
static inline void io_ring_ctx_assert_locked(const struct io_ring_ctx *ctx)
{
+ if (ctx->flags & IORING_SETUP_SINGLE_ISSUER &&
+ !(ctx->flags & IORING_SETUP_R_DISABLED))
+ return;
+
lockdep_assert_held(&ctx->uring_lock);
}
static inline void io_lockdep_assert_cq_locked(struct io_ring_ctx *ctx)
{
--
2.45.2
On Tue, Dec 16, 2025 at 4:10 AM Caleb Sander Mateos
<csander@purestorage.com> wrote:
>
> So if the io_ring_ctx is setup with IORING_SETUP_SINGLE_ISSUER, skip the
> uring_lock mutex_lock() and mutex_unlock() on the submitter_task. On
> other tasks acquiring the ctx uring lock, use a task work item to
> suspend the submitter_task for the critical section.

Does this open the pathway to various data corruption issues, since the
submitter task can be suspended while it's in the middle of executing a
section of logic that was previously protected by the mutex? With this
patch (if I'm understanding it correctly), there's now no guarantee that
the logic inside the mutexed section for IORING_SETUP_SINGLE_ISSUER
submitter tasks is "atomically bundled", so if it gets suspended between
two state changes that need to be atomic / bundled together, then I
think the task that does the suspend would now see corrupt state.

I did a quick grep and I think one example of this race shows up in
io_uring/rsrc.c for buffer cloning: if the src_ctx has
IORING_SETUP_SINGLE_ISSUER set and the cloning happens at the same time
the submitter task is unregistering the buffers, then this chain of
events happens:
* submitter task is executing the logic in io_sqe_buffers_unregister()
  -> io_rsrc_data_free(), and frees data->nodes but data->nr is not yet
  updated
* submitter task gets suspended through io_register_clone_buffers() ->
  lock_two_rings() -> mutex_lock_nested(&ctx2->uring_lock, ...)
* after suspending the src ctx, io_clone_buffers() runs, which will get
  the incorrect "nbufs = src_ctx->buf_table.nr;" value
* io_clone_buffers() calls io_rsrc_node_lookup(), which will dereference
  a NULL pointer

Thanks,
Joanne
On Mon, Dec 15, 2025 at 8:46 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> Does this open the pathway to various data corruption issues since the
> submitter task can be suspended while it's in the middle of executing
> a section of logic that was previously protected by the mutex?

I don't think so. The submitter task is suspended by having it run a
task work item that blocks it until the uring lock is released by the
other task. Any section where the uring lock is held should either be on
kernel threads, contained within an io_uring syscall, or contained
within a task work item, none of which run other task work items. So
whenever the submitter task runs the suspend task work, it shouldn't be
in a uring-lock-protected section.

> With this patch (if I'm understandng it correctly), there's now no
> guarantee that the logic inside the mutexed section for
> IORING_SETUP_SINGLE_ISSUER submitter tasks is "atomically bundled", so
> if it gets suspended between two state changes that need to be atomic
> / bundled together, then I think the task that does the suspend would
> now see corrupt state.

Yes, I suppose there's nothing that prevents code from holding the uring
lock across syscalls or task work items, but that would already be
problematic. If a task holds the uring lock on return from a syscall or
task work and then runs another task work item that tries to acquire the
uring lock, it would deadlock.

> I did a quick grep and I think one example of this race shows up in
> io_uring/rsrc.c for buffer cloning [...]
> * submitter task gets suspended through io_register_clone_buffers() ->
> lock_two_rings() -> mutex_lock_nested(&ctx2->uring_lock, ...)

I think what this is missing is that the submitter task can't get
suspended at arbitrary points. It gets suspended in task work, and task
work only runs when returning from the kernel to userspace. At which
point "nothing" should be running on the task in userspace or the kernel
and it should be safe to run arbitrary task work items on the task.
Though Ming recently found an interesting deadlock caused by acquiring a
mutex in task work that runs on an unlucky ublk server thread[1].

[1] https://lore.kernel.org/linux-block/20251212143415.485359-1-ming.lei@redhat.com/

Best,
Caleb
On Tue, Dec 16, 2025 at 2:24 PM Caleb Sander Mateos
<csander@purestorage.com> wrote:
>
> I think what this is missing is that the submitter task can't get
> suspended at arbitrary points. It gets suspended in task work, and
> task work only runs when returning from the kernel to userspace.

Ahh I see, thanks for the explanation. The documentation for TWA_SIGNAL
in task_work_add() says "@TWA_SIGNAL works like signals, in that the it
will interrupt the targeted task and run the task_work, regardless of
whether the task is currently running in the kernel or userspace", so I
had assumed this preempts the kernel.

Thanks,
Joanne
On Tue, Dec 16, 2025 at 3:47 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> Ahh I see, thanks for the explanation. The documentation for TWA_SIGNAL
> in task_work_add() says "@TWA_SIGNAL works like signals, in that the it
> will interrupt the targeted task and run the task_work, regardless of
> whether the task is currently running in the kernel or userspace", so I
> had assumed this preempts the kernel.

Hmm, thinking about this buffer cloning + IORING_SETUP_SINGLE_ISSUER
submitter-task buffer unregistration some more, though: doesn't this
same race with the corrupted values exist if the cloning logic acquires
the mutex before the submitter task formally runs, and the submitter
task then starts executing the buffer unregistration logic immediately
afterwards, while the cloning logic is simultaneously executing the
logic inside the mutexed section? In the io_ring_ctx_lock_nested()
logic, I'm not seeing where this checks whether the lock is currently
acquired by other tasks, or am I missing something here and this is
already accounted for?

Thanks,
Joanne