[PATCH v3 4/4] io_uring: avoid uring_lock for IORING_SETUP_SINGLE_ISSUER

Caleb Sander Mateos posted 4 patches 5 days, 23 hours ago
Posted by Caleb Sander Mateos 5 days, 23 hours ago
io_ring_ctx's mutex uring_lock can be quite expensive in high-IOPS
workloads. Even when only one thread pinned to a single CPU is accessing
the io_ring_ctx, the atomic CASes required to lock and unlock the mutex
are very hot instructions. The mutex's primary purpose is to prevent
concurrent io_uring system calls on the same io_ring_ctx. However, there
is already a flag IORING_SETUP_SINGLE_ISSUER that promises only one
task will make io_uring_enter() and io_uring_register() system calls on
the io_ring_ctx once it's enabled.

So if the io_ring_ctx is set up with IORING_SETUP_SINGLE_ISSUER, skip the
uring_lock mutex_lock() and mutex_unlock() on the submitter_task. On
other tasks acquiring the ctx uring lock, use a task work item to
suspend the submitter_task for the critical section.

In io_uring_register(), continue to always acquire the uring_lock mutex.
io_uring_register() can be called on a disabled io_ring_ctx (indeed,
it's required to enable it), when submitter_task isn't set yet. After
submitter_task is set, io_uring_register() is only permitted on
submitter_task, so uring_lock suffices to exclude all other users.
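
The suspend handshake described above can be modelled in userspace. The
following is only a sketch under stated assumptions, not kernel code: POSIX
semaphores stand in for struct completion, an atomic pointer polled by the
owner stands in for task work queued with task_work_add(), and only one
non-owner locker is assumed at a time. All names (struct ring, ring_lock(),
owner_thread(), demo_owner_frozen_while_locked()) are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* A semaphore initialised to 0 stands in for the kernel's struct completion. */
struct lock_state {
	sem_t *suspend_end;		/* published by the owner once it is parked */
};

struct suspend_work {
	sem_t suspend_start;		/* owner posts this: "I am parked" */
	struct lock_state *lock_state;
};

struct ring {
	_Atomic(struct suspend_work *) pending;	/* models queued task work */
	atomic_bool stop;
	long counter;			/* state the owner mutates without a lock */
};

/* Owner loop: lock-free fast path; parks when suspend work is pending. */
static void *owner_thread(void *arg)
{
	struct ring *r = arg;

	while (!atomic_load(&r->stop)) {
		struct suspend_work *w = atomic_load(&r->pending);

		if (w) {
			sem_t suspend_end;

			atomic_store(&r->pending, NULL);
			sem_init(&suspend_end, 0, 0);
			w->lock_state->suspend_end = &suspend_end;
			sem_post(&w->suspend_start);	/* w is off limits after this */
			sem_wait(&suspend_end);		/* parked until unlock */
		}
		r->counter++;
	}
	return NULL;
}

/* Non-owner lock: queue suspend work, then wait until the owner is parked. */
static void ring_lock(struct ring *r, struct lock_state *state)
{
	struct suspend_work w;		/* lives on the locker's stack */

	sem_init(&w.suspend_start, 0, 0);
	w.lock_state = state;
	atomic_store(&r->pending, &w);
	sem_wait(&w.suspend_start);
}

/* Non-owner unlock: let the owner resume. */
static void ring_unlock(struct lock_state *state)
{
	assert(state->suspend_end != NULL);
	sem_post(state->suspend_end);
}

/* Returns 1 if the owner made no progress while the lock was held. */
int demo_owner_frozen_while_locked(void)
{
	struct ring r = { .counter = 0 };
	struct lock_state state = { .suspend_end = NULL };
	struct timespec ts = { .tv_sec = 0, .tv_nsec = 1000000 };
	pthread_t owner;
	long before, after;

	atomic_init(&r.pending, NULL);
	atomic_init(&r.stop, false);
	if (pthread_create(&owner, NULL, owner_thread, &r))
		return -1;

	ring_lock(&r, &state);
	before = r.counter;
	nanosleep(&ts, NULL);		/* give a non-parked owner time to run */
	after = r.counter;
	ring_unlock(&state);

	atomic_store(&r.stop, true);
	pthread_join(owner, NULL);
	return before == after;
}
```

As in the patch, the owner publishes its on-stack suspend_end before
signalling suspend_start, so the locker can find it at unlock time. Unlike
the patch, the model's owner polls an atomic on every iteration, whereas the
kernel only runs the work item when task work is processed.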

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
---
 io_uring/io_uring.c |  11 +++++
 io_uring/io_uring.h | 101 ++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 109 insertions(+), 3 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index e05e56a840f9..64e4e57e2c11 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -363,10 +363,21 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	xa_destroy(&ctx->io_bl_xa);
 	kfree(ctx);
 	return NULL;
 }
 
+void io_ring_suspend_work(struct callback_head *cb_head)
+{
+	struct io_ring_suspend_work *suspend_work =
+		container_of(cb_head, struct io_ring_suspend_work, cb_head);
+	DECLARE_COMPLETION_ONSTACK(suspend_end);
+
+	suspend_work->lock_state->suspend_end = &suspend_end;
+	complete(&suspend_work->suspend_start);
+	wait_for_completion(&suspend_end);
+}
+
 static void io_clean_op(struct io_kiocb *req)
 {
 	if (unlikely(req->flags & REQ_F_BUFFER_SELECTED))
 		io_kbuf_drop_legacy(req);
 
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 23dae0af530b..262971224cc6 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -1,8 +1,9 @@
 #ifndef IOU_CORE_H
 #define IOU_CORE_H
 
+#include <linux/completion.h>
 #include <linux/errno.h>
 #include <linux/lockdep.h>
 #include <linux/resume_user_mode.h>
 #include <linux/kasan.h>
 #include <linux/poll.h>
@@ -195,36 +196,130 @@ void io_queue_next(struct io_kiocb *req);
 void io_task_refs_refill(struct io_uring_task *tctx);
 bool __io_alloc_req_refill(struct io_ring_ctx *ctx);
 
 void io_activate_pollwq(struct io_ring_ctx *ctx);
 
+/*
+ * The ctx uring lock protects most of the mutable struct io_ring_ctx state
+ * accessed in the struct io_kiocb issue path. In the I/O path, it is typically
+ * acquired in the io_uring_enter() syscall and io_handle_tw_list(). For
+ * IORING_SETUP_SQPOLL, it's acquired by io_sq_thread() instead. io_kiocb's
+ * issued with IO_URING_F_UNLOCKED in issue_flags (e.g. by io_wq_submit_work())
+ * acquire and release the ctx uring lock whenever they must touch io_ring_ctx
+ * state. io_uring_register() also acquires the ctx uring lock because most
+ * opcodes mutate io_ring_ctx state accessed in the issue path.
+ *
+ * For !IORING_SETUP_SINGLE_ISSUER io_ring_ctx's, acquiring the ctx uring lock
+ * is always done via mutex_(try)lock(&ctx->uring_lock).
+ *
+ * However, for IORING_SETUP_SINGLE_ISSUER, we can avoid the mutex_lock() +
+ * mutex_unlock() overhead on submitter_task because a single thread can't race
+ * with itself. In the uncommon case where the ctx uring lock is needed on
+ * another thread, it must suspend submitter_task by scheduling a task work item
+ * on it. io_ring_ctx_lock() returns once the task work item has started.
+ * submitter_task is unblocked once io_ring_ctx_unlock() is called.
+ *
+ * io_uring_register() requires special treatment for IORING_SETUP_SINGLE_ISSUER
+ * since it's allowed on an IORING_SETUP_R_DISABLED io_ring_ctx, where
+ * submitter_task isn't set yet. Hence the io_ring_register_ctx_*() family
+ * of helpers. They unconditionally acquire the uring_lock mutex, which always
+ * works to exclude other ctx uring lock users:
+ * - For !IORING_SETUP_SINGLE_ISSUER, all users acquire the ctx uring lock via
+ *   the uring_lock mutex
+ * - For IORING_SETUP_SINGLE_ISSUER and IORING_SETUP_R_DISABLED, only
+ *   io_uring_register() is allowed before the io_ring_ctx is enabled.
+ *   So again, all ctx uring lock users acquire the uring_lock mutex.
+ * - For IORING_SETUP_SINGLE_ISSUER and !IORING_SETUP_R_DISABLED,
+ *   io_uring_register() is only permitted on submitter_task, which is always
+ *   granted the ctx uring lock unless suspended.
+ *   Acquiring the uring_lock mutex is unnecessary but still correct.
+ */
+
 struct io_ring_ctx_lock_state {
+	struct completion *suspend_end;
 };
 
+struct io_ring_suspend_work {
+	struct callback_head cb_head;
+	struct completion suspend_start;
+	struct io_ring_ctx_lock_state *lock_state;
+};
+
+void io_ring_suspend_work(struct callback_head *cb_head);
+
 /* Acquire the ctx uring lock */
 static inline void io_ring_ctx_lock(struct io_ring_ctx *ctx,
 				    struct io_ring_ctx_lock_state *state)
 {
-	mutex_lock(&ctx->uring_lock);
+	struct io_ring_suspend_work suspend_work;
+	struct task_struct *submitter_task;
+
+	if (!(ctx->flags & IORING_SETUP_SINGLE_ISSUER)) {
+		mutex_lock(&ctx->uring_lock);
+		return;
+	}
+
+	submitter_task = ctx->submitter_task;
+	/*
+	 * Not suitable for use while IORING_SETUP_R_DISABLED.
+	 * Must use io_ring_register_ctx_lock() in that case.
+	 */
+	WARN_ON_ONCE(!submitter_task);
+	if (likely(current == submitter_task))
+		return;
+
+	/* Use task work to suspend submitter_task */
+	init_task_work(&suspend_work.cb_head, io_ring_suspend_work);
+	init_completion(&suspend_work.suspend_start);
+	suspend_work.lock_state = state;
+	/* If task_work_add() fails, task is exiting, so no need to suspend */
+	if (unlikely(task_work_add(submitter_task, &suspend_work.cb_head,
+				   TWA_SIGNAL))) {
+		state->suspend_end = NULL;
+		return;
+	}
+
+	wait_for_completion(&suspend_work.suspend_start);
 }
 
 /* Attempt to acquire the ctx uring lock without blocking */
 static inline bool io_ring_ctx_trylock(struct io_ring_ctx *ctx)
 {
-	return mutex_trylock(&ctx->uring_lock);
+	if (!(ctx->flags & IORING_SETUP_SINGLE_ISSUER))
+		return mutex_trylock(&ctx->uring_lock);
+
+	/* Not suitable for use while IORING_SETUP_R_DISABLED */
+	WARN_ON_ONCE(!ctx->submitter_task);
+	return current == ctx->submitter_task;
 }
 
 /* Release the ctx uring lock */
 static inline void io_ring_ctx_unlock(struct io_ring_ctx *ctx,
 				      struct io_ring_ctx_lock_state *state)
 {
-	mutex_unlock(&ctx->uring_lock);
+	if (!(ctx->flags & IORING_SETUP_SINGLE_ISSUER)) {
+		mutex_unlock(&ctx->uring_lock);
+		return;
+	}
+
+	if (likely(current == ctx->submitter_task))
+		return;
+
+	if (likely(state->suspend_end))
+		complete(state->suspend_end);
 }
 
 /* Assert (if CONFIG_LOCKDEP) that the ctx uring lock is held */
 static inline void io_ring_ctx_assert_locked(const struct io_ring_ctx *ctx)
 {
+	/*
+	 * No straightforward way to check that submitter_task is suspended
+	 * without access to struct io_ring_ctx_lock_state
+	 */
+	if (ctx->flags & IORING_SETUP_SINGLE_ISSUER)
+		return;
+
 	lockdep_assert_held(&ctx->uring_lock);
 }
 
 /* Acquire the ctx uring lock during the io_uring_register() syscall */
 static inline void io_ring_register_ctx_lock(struct io_ring_ctx *ctx)
-- 
2.45.2
Re: [PATCH v3 4/4] io_uring: avoid uring_lock for IORING_SETUP_SINGLE_ISSUER
Posted by kernel test robot 4 days, 17 hours ago

Hello,

kernel test robot noticed "WARNING:at_io_uring/io_uring.h:#io_ring_ctx_wait_and_kill" on:

commit: f82bff8359e8780d831f1180c9e8196d2d20f03c ("[PATCH v3 4/4] io_uring: avoid uring_lock for IORING_SETUP_SINGLE_ISSUER")
url: https://github.com/intel-lab-lkp/linux/commits/Caleb-Sander-Mateos/io_uring-clear-IORING_SETUP_SINGLE_ISSUER-for-IORING_SETUP_SQPOLL/20251126-074303
base: https://git.kernel.org/cgit/linux/kernel/git/axboe/linux.git for-next
patch link: https://lore.kernel.org/all/20251125233928.3962947-5-csander@purestorage.com/
patch subject: [PATCH v3 4/4] io_uring: avoid uring_lock for IORING_SETUP_SINGLE_ISSUER

in testcase: trinity
version: 
with following parameters:

	runtime: 300s
	group: group-01
	nr_groups: 5



config: x86_64-kexec
compiler: clang-20
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 32G

(please refer to attached dmesg/kmsg for entire log/backtrace)



If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202511270630.4e038598-lkp@intel.com


[   22.144951][ T6341] ------------[ cut here ]------------
[   22.145674][ T1564] [main] futex: 0 owner:0
[   22.146000][ T6341] WARNING: CPU: 1 PID: 6341 at io_uring/io_uring.h:266 io_ring_ctx_wait_and_kill (io_uring/io_uring.h:266)
[   22.146149][ T1564]
[   22.146482][ T6341] Modules linked in:
[   22.147631][ T1564] [main] futex: 0 owner:0
[   22.147751][ T6341]  can_bcm
[   22.148030][ T1564]
[   22.148330][ T6341]  can_raw
[   22.148974][ T1564] [main] futex: 0 owner:0
[   22.149038][ T6341]  can
[   22.149264][ T1564]
[   22.149572][ T6341]  cn scsi_transport_iscsi sr_mod cdrom
[   22.149579][ T6341] CPU: 1 UID: 65534 PID: 6341 Comm: trinity-c3 Not tainted 6.18.0-rc6-00278-gf82bff8359e8 #1 PREEMPT(voluntary)
[   22.149582][ T6341] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[   22.151109][ T1564] [main] Reserved/initialized 10 futexes.
[   22.151645][ T6341] RIP: 0010:io_ring_ctx_wait_and_kill (io_uring/io_uring.h:266)
[   22.152382][ T1564]
[   22.152774][ T6341] Code: 02 00 00 e8 af e2 0b 00 65 48 8b 05 97 1a 5f 02 48 3b 44 24 48 0f 85 98 00 00 00 48 83 c4 50 5b 41 5e 41 5f e9 be c6 f8 00 cc <0f> 0b e9 ed fe ff ff 31 c0 4c 8d 7c 24 28 49 89 47 18 49 89 47 10
All code
========
   0:	02 00                	add    (%rax),%al
   2:	00 e8                	add    %ch,%al
   4:	af                   	scas   %es:(%rdi),%eax
   5:	e2 0b                	loop   0x12
   7:	00 65 48             	add    %ah,0x48(%rbp)
   a:	8b 05 97 1a 5f 02    	mov    0x25f1a97(%rip),%eax        # 0x25f1aa7
  10:	48 3b 44 24 48       	cmp    0x48(%rsp),%rax
  15:	0f 85 98 00 00 00    	jne    0xb3
  1b:	48 83 c4 50          	add    $0x50,%rsp
  1f:	5b                   	pop    %rbx
  20:	41 5e                	pop    %r14
  22:	41 5f                	pop    %r15
  24:	e9 be c6 f8 00       	jmp    0xf8c6e7
  29:	cc                   	int3
  2a:*	0f 0b                	ud2		<-- trapping instruction
  2c:	e9 ed fe ff ff       	jmp    0xffffffffffffff1e
  31:	31 c0                	xor    %eax,%eax
  33:	4c 8d 7c 24 28       	lea    0x28(%rsp),%r15
  38:	49 89 47 18          	mov    %rax,0x18(%r15)
  3c:	49 89 47 10          	mov    %rax,0x10(%r15)

Code starting with the faulting instruction
===========================================
   0:	0f 0b                	ud2
   2:	e9 ed fe ff ff       	jmp    0xfffffffffffffef4
   7:	31 c0                	xor    %eax,%eax
   9:	4c 8d 7c 24 28       	lea    0x28(%rsp),%r15
   e:	49 89 47 18          	mov    %rax,0x18(%r15)
  12:	49 89 47 10          	mov    %rax,0x10(%r15)
[   22.154288][ T1564] [main] sysv_shm: id:0 size:40960 flags:7b0 ptr:(nil)
[   22.155063][ T6341] RSP: 0018:ffffc90000217db0 EFLAGS: 00010246
[   22.155249][ T1564]
[   22.155714][ T6341]
[   22.157177][ T1564] [main] sysv_shm: id:1 size:4096 flags:17b0 ptr:(nil)
[   22.157242][ T6341] RAX: 0000000000000000 RBX: ffff88810d435800 RCX: 0000000000000001
[   22.157489][ T1564]
[   22.157954][ T6341] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88810d435800
[   22.159286][ T1564] [main] Added 28 filenames from /dev
[   22.159684][ T6341] RBP: 00000000fffffff2 R08: 0000000000000000 R09: ffffffff00000000
[   22.159688][ T6341] R10: 0000000000000010 R11: 0000000000100000 R12: 0000000000000000
[   22.160050][ T1564]
[   22.160414][ T6341] R13: 0000000000000000 R14: 0000000000000000 R15: ffff888119bf9000
[   22.161762][ T1564] [main] Added 17230 filenames from /proc
[   22.161924][ T6341] FS:  00000000010a2880(0000) GS:ffff88889c519000(0000) knlGS:0000000000000000
[   22.162568][ T1564]
[   22.162958][ T6341] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   22.164372][ T1564] [main] Added 15134 filenames from /sys
[   22.164448][ T6341] CR2: 0000000000000008 CR3: 0000000165c9c000 CR4: 00000000000406f0
[   22.164456][ T6341] Call Trace:
[   22.164626][ T1564]
[   22.166367][ T6341]  <TASK>
[   22.168128][ T1564] [main] Couldn't open socket (30:1:0). Address family not supported by protocol
[   22.168186][ T6341]  io_uring_create (io_uring/io_uring.c:?)
[   22.168352][ T1564]
[   22.168974][ T6341]  __x64_sys_io_uring_setup (io_uring/io_uring.c:3766)
[   22.170902][ T1564] [main] Couldn't open socket (27:1:3). Address family not supported by protocol
[   22.172459][ T6341]  do_syscall_64 (arch/x86/entry/syscall_64.c:?)
[   22.172512][ T1564]
[   22.173163][ T6341]  ? exc_page_fault (arch/x86/include/asm/irqflags.h:37 arch/x86/include/asm/irqflags.h:114 arch/x86/mm/fault.c:1484 arch/x86/mm/fault.c:1532)
[   22.175090][ T1564] [main] Couldn't open socket (44:3:0). Address family not supported by protocol
[   22.175097][ T6341]  entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
[   22.175309][ T1564]
[   22.175935][ T6341] RIP: 0033:0x453b29
[   22.176865][ T1564] Can't do protocol NETBEUI
[   22.176935][ T6341] Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 3b 84 00 00 c3 66 2e 0f 1f 84 00 00 00 00
All code
========
   0:	00 f3                	add    %dh,%bl
   2:	c3                   	ret
   3:	66 2e 0f 1f 84 00 00 	cs nopw 0x0(%rax,%rax,1)
   a:	00 00 00 
   d:	0f 1f 40 00          	nopl   0x0(%rax)
  11:	48 89 f8             	mov    %rdi,%rax
  14:	48 89 f7             	mov    %rsi,%rdi
  17:	48 89 d6             	mov    %rdx,%rsi
  1a:	48 89 ca             	mov    %rcx,%rdx
  1d:	4d 89 c2             	mov    %r8,%r10
  20:	4d 89 c8             	mov    %r9,%r8
  23:	4c 8b 4c 24 08       	mov    0x8(%rsp),%r9
  28:	0f 05                	syscall
  2a:*	48 3d 01 f0 ff ff    	cmp    $0xfffffffffffff001,%rax		<-- trapping instruction
  30:	0f 83 3b 84 00 00    	jae    0x8471
  36:	c3                   	ret
  37:	66                   	data16
  38:	2e                   	cs
  39:	0f                   	.byte 0xf
  3a:	1f                   	(bad)
  3b:	84 00                	test   %al,(%rax)
  3d:	00 00                	add    %al,(%rax)
	...

Code starting with the faulting instruction
===========================================
   0:	48 3d 01 f0 ff ff    	cmp    $0xfffffffffffff001,%rax
   6:	0f 83 3b 84 00 00    	jae    0x8447
   c:	c3                   	ret
   d:	66                   	data16
   e:	2e                   	cs
   f:	0f                   	.byte 0xf
  10:	1f                   	(bad)
  11:	84 00                	test   %al,(%rax)
  13:	00 00                	add    %al,(%rax)


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20251127/202511270630.4e038598-lkp@intel.com
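
One possible reading of the trace, inferred from the posted diff rather than
stated in the report: the hunk header `@@ -195,36 +196,130 @@` puts
`WARN_ON_ONCE(!submitter_task)` in io_ring_ctx_lock() at io_uring.h line 266,
the line the warning names. If so, io_ring_ctx_wait_and_kill() took the ctx
lock from io_uring_create() on a ring whose submitter_task was still NULL. The
sketch below models only the branch order of io_ring_ctx_lock(). The helper
classify_lock_path() and enum lock_path are hypothetical, and the flag value
is copied from the io_uring UAPI header so the example builds on its own.

```c
#include <assert.h>
#include <stddef.h>

/* Value from the io_uring UAPI header, redefined so the sketch builds alone. */
#define IORING_SETUP_SINGLE_ISSUER	(1U << 12)

enum lock_path {
	LOCK_PATH_MUTEX,		/* !SINGLE_ISSUER: plain mutex_lock() */
	LOCK_PATH_WARN_NO_SUBMITTER,	/* submitter_task == NULL: WARN_ON_ONCE() */
	LOCK_PATH_OWNER_FAST,		/* current == submitter_task: no locking */
	LOCK_PATH_SUSPEND_OWNER,	/* other task: task work + completion */
};

/*
 * Hypothetical helper mirroring the order of checks in io_ring_ctx_lock().
 * The real function does not return after the WARN_ON_ONCE(); it is modelled
 * as a separate outcome here only to make that case visible.
 */
enum lock_path classify_lock_path(unsigned int flags,
				  const void *submitter_task,
				  const void *current_task)
{
	if (!(flags & IORING_SETUP_SINGLE_ISSUER))
		return LOCK_PATH_MUTEX;
	if (!submitter_task)
		return LOCK_PATH_WARN_NO_SUBMITTER;
	if (current_task == submitter_task)
		return LOCK_PATH_OWNER_FAST;
	return LOCK_PATH_SUSPEND_OWNER;
}

/* Usage: a SINGLE_ISSUER ring torn down before submitter_task is assigned. */
void example_trace_case(void)
{
	int current_task;	/* any non-NULL address stands in for current */

	assert(classify_lock_path(IORING_SETUP_SINGLE_ISSUER, NULL,
				  &current_task) == LOCK_PATH_WARN_NO_SUBMITTER);
}
```

Under this reading, teardown paths that can run before submitter_task is set
would need the unconditional mutex, as io_uring_register() does in the patch.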



-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki