[PATCH V2] io_uring: validate user-controlled cq.head in io_cqe_cache_refill()

Posted by Zizhi Wo 4 weeks, 1 day ago

From: Zizhi Wo <wozizhi@huawei.com>

[BUG]
A fuzzing run reproduced an unkillable io_uring task stuck at ~100% CPU:

    [root@fedora io_uring_stress]# ps -ef | grep io_uring
    root  1240  1  99 13:36 ?  00:01:35 [io_uring_stress] <defunct>

The task loops inside io_cqring_wait() and never returns to userspace, and
SIGKILL has no effect.

[CAUSE]
The CQ ring exposes rings->cq.head to userspace as writable, while the
authoritative tail lives in kernel-private ctx->cached_cq_tail.
io_cqe_cache_refill() computes free space as an unsigned subtraction:

    free = ctx->cq_entries - min(tail - head, ctx->cq_entries);

As long as head stays at or behind tail, the subtraction yields the real
number of queued CQEs and min() is only a defensive clamp. But if userspace
advances head past tail, (tail - head) wraps to a huge value, min() clamps
it to cq_entries, free becomes 0, and io_cqe_cache_refill() fails. The CQE
is then pushed onto the overflow list and IO_CHECK_CQ_OVERFLOW_BIT is set.
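
The wraparound can be reproduced in plain userspace C. The following is only
a sketch of the arithmetic, not kernel code: cq_free_unsigned() and
demo_min_u32() are hypothetical names, and the ring state is passed in as
plain 32-bit integers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper for this sketch; stands in for the kernel's min(). */
static uint32_t demo_min_u32(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

/*
 * Hypothetical model of the free-space computation described above:
 * an unsigned subtraction followed by a clamp to cq_entries.
 */
static uint32_t cq_free_unsigned(uint32_t tail, uint32_t head,
				 uint32_t cq_entries)
{
	assert(cq_entries > 0);
	/* If head is ahead of tail, tail - head wraps modulo 2^32. */
	return cq_entries - demo_min_u32(tail - head, cq_entries);
}
```

With tail = 10 and cq_entries = 16, head = 4 gives 10 free slots, while
head = 11 (one past tail) gives 0 even though the ring is nearly empty.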

The wait loop in io_cqring_wait() relies on an invariant: refill() only
fails when the CQ is *physically* full, in which case rings->cq.tail has
been advanced to iowq->cq_tail and io_should_wake() returns true. The
tampered head breaks this: refill() fails while the ring is not full, no
OCQE is copied in, rings->cq.tail never catches up, io_should_wake() stays
false, and io_cqring_wait_schedule() keeps returning early because
IO_CHECK_CQ_OVERFLOW_BIT is still set. The result is a tight retry loop
that never returns to userspace.
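
A toy model may make the retry loop easier to follow. This is a heavily
simplified sketch and not the kernel's code: struct toy_ring, toy_free(),
toy_flush(), toy_wait() and the TOY_* constants are all hypothetical, and an
iteration cap stands in for "spins forever".

```c
#include <assert.h>
#include <stdint.h>

#define TOY_SPIN	(-1)	/* iteration cap hit without progress */
#define TOY_SLEEP	(-2)	/* nothing pending, a real waiter would sleep */

/* Toy model only: every name in this block is hypothetical. */
struct toy_ring {
	uint32_t cq_entries;
	uint32_t head;		/* user-writable in the real ring */
	uint32_t tail;		/* advanced only when a CQE is posted */
	uint32_t overflowed;	/* CQEs parked on the overflow list */
};

/* Free-space check using the unsigned subtraction plus clamp. */
static uint32_t toy_free(const struct toy_ring *r)
{
	uint32_t queued = r->tail - r->head;

	assert(r->cq_entries > 0);
	if (queued > r->cq_entries)
		queued = r->cq_entries;
	return r->cq_entries - queued;
}

/* Move overflowed CQEs into the ring until no slot looks free. */
static void toy_flush(struct toy_ring *r)
{
	while (r->overflowed && toy_free(r) > 0) {
		r->tail++;
		r->overflowed--;
	}
}

/* Returns the iteration at which the waiter woke, or a TOY_* code. */
static int toy_wait(struct toy_ring *r, uint32_t wake_tail, int max_iters)
{
	for (int i = 0; i < max_iters; i++) {
		toy_flush(r);
		if ((int32_t)(r->tail - wake_tail) >= 0)
			return i;	/* enough CQEs posted: wake up */
		if (r->overflowed)
			continue;	/* overflow pending: retry at once */
		return TOY_SLEEP;
	}
	return TOY_SPIN;
}
```

In this model a ring with head = 0, tail = 2 and one overflowed CQE wakes on
the first pass, whereas the same ring with head = 3 never frees a slot and
runs until the cap.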

[FIX]
Introduce io_cqring_queued() as the single point that converts the
(tail, head) pair into a trustworthy queued count. Since the real
head/tail distance is bounded by cq_entries (far below 2^31),
a signed comparison reliably detects userspace moving head past tail;
in that case treat the queue as empty so callers see the full cache as
free and forward progress is preserved.

CQEs that would otherwise be delivered may be lost when the application
corrupts its own head pointer, but that is an application-visible
consequence of its own action; the kernel's responsibility here is limited
to keeping the task killable and making forward progress.
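
The signed comparison can be tried outside the kernel as well. The sketch
below mirrors the logic of io_cqring_queued() from the diff but takes plain
integers; cq_queued_signed() and u32_min() are hypothetical names, and the
cast assumes the usual two's-complement conversion from unsigned to signed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper for this sketch; stands in for the kernel's min(). */
static uint32_t u32_min(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

/*
 * Hypothetical userspace model of the queued-count helper: interpret the
 * wrapped difference as signed, treat "head past tail" as an empty queue,
 * and clamp everything else to cq_entries.
 */
static uint32_t cq_queued_signed(uint32_t tail, uint32_t head,
				 uint32_t cq_entries)
{
	/* Assumes two's-complement conversion, as on mainstream compilers. */
	int32_t diff = (int32_t)(tail - head);

	assert(cq_entries > 0);
	if (diff < 0)
		return 0;	/* head moved past tail: report nothing queued */
	return u32_min((uint32_t)diff, cq_entries);
}
```

With cq_entries = 16, tail = 10 and head = 4 report 6 queued, head = 11
reports 0, and a legitimate counter wrap (tail = 2, head = 0xFFFFFFFE) still
reports 4.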

Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
---
 io_uring/io_uring.c |  5 ++---
 io_uring/wait.h     | 19 +++++++++++++++++++
 2 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 4ed998d60c09..458f4a53179f 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -692,7 +692,7 @@ static struct io_overflow_cqe *io_alloc_ocqe(struct io_ring_ctx *ctx,
  */
 static bool io_fill_nop_cqe(struct io_ring_ctx *ctx, unsigned int off)
 {
-	if (__io_cqring_events(ctx) < ctx->cq_entries) {
+	if (io_cqring_queued(ctx) < ctx->cq_entries) {
 		struct io_uring_cqe *cqe = &ctx->rings->cqes[off];
 
 		cqe->user_data = 0;
@@ -733,8 +733,7 @@ bool io_cqe_cache_refill(struct io_ring_ctx *ctx, bool overflow, bool cqe32)
 		off = 0;
 	}
 
-	/* userspace may cheat modifying the tail, be safe and do min */
-	queued = min(__io_cqring_events(ctx), ctx->cq_entries);
+	queued = io_cqring_queued(ctx);
 	free = ctx->cq_entries - queued;
 	/* we need a contiguous range, limit based on the current array offset */
 	len = min(free, ctx->cq_entries - off);
diff --git a/io_uring/wait.h b/io_uring/wait.h
index a4274b137f81..b987837b9051 100644
--- a/io_uring/wait.h
+++ b/io_uring/wait.h
@@ -50,4 +50,23 @@ static inline unsigned io_cqring_events(struct io_ring_ctx *ctx)
 	return __io_cqring_events(ctx);
 }
 
+/*
+ * Compute queued CQEs for free-space calculation, clamped to cq_entries.
+ *
+ * rings->cq.head is user-writable. If userspace advances it past
+ * cached_cq_tail, an unsigned (tail - head) underflows to a huge
+ * value, which traps io_cqring_wait() in an unkillable loop via the
+ * overflow path. Use a signed comparison to handle it.
+ */
+static inline unsigned int io_cqring_queued(struct io_ring_ctx *ctx)
+{
+	struct io_rings *rings = io_get_rings(ctx);
+	int diff;
+
+	diff = (int)(ctx->cached_cq_tail - READ_ONCE(rings->cq.head));
+	if (diff >= 0)
+		return min((unsigned int)diff, ctx->cq_entries);
+	return 0;
+}
+
 #endif
-- 
2.52.0

Re: [PATCH V2] io_uring: validate user-controlled cq.head in io_cqe_cache_refill()

Posted by Jens Axboe 4 weeks, 1 day ago

On 5/13/26 8:18 PM, Zizhi Wo wrote:
> From: Zizhi Wo <wozizhi@huawei.com>
> 
> [BUG]
> A fuzzing run reproduced an unkillable io_uring task stuck at ~100% CPU:
> 
>     [root@fedora io_uring_stress]# ps -ef | grep io_uring
>     root  1240  1  99 13:36 ?  00:01:35 [io_uring_stress] <defunct>
> 
> The task loops inside io_cqring_wait() and never returns to userspace, and
> SIGKILL has no effect.

Thanks - applied with a few edits, see final result here:

https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux.git/commit/?h=io_uring-7.1&id=f44d38a31f1802b7222adaea9ee69f9d280f698a

The comments (and commit message) read very LLM'ish, so dialed that back
a bit. And there's no reason to put io_cqring_queued() in a header file
when it's only used in io_uring.c. And finally, the 'queued' variable is
now useless, so kill that too.

-- 
Jens Axboe

Re: [PATCH V2] io_uring: validate user-controlled cq.head in io_cqe_cache_refill()

Posted by Zizhi Wo 4 weeks, 1 day ago


On 2026/5/14 21:26, Jens Axboe wrote:
> On 5/13/26 8:18 PM, Zizhi Wo wrote:
>> From: Zizhi Wo <wozizhi@huawei.com>
>>
>> [BUG]
>> A fuzzing run reproduced an unkillable io_uring task stuck at ~100% CPU:
>>
>>      [root@fedora io_uring_stress]# ps -ef | grep io_uring
>>      root  1240  1  99 13:36 ?  00:01:35 [io_uring_stress] <defunct>
>>
>> The task loops inside io_cqring_wait() and never returns to userspace, and
>> SIGKILL has no effect.
> 
> Thanks - applied with a few edits, see final result here:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux.git/commit/?h=io_uring-7.1&id=f44d38a31f1802b7222adaea9ee69f9d280f698a
> 
> The comments (and commit message) read very LLM'ish, so dialed that back

Indeed. Since my English isn't great, I used AI to polish it up a bit :)

> a bit. And there's no reason to put io_cqring_queued() in a header file
> when it's only used in io_uring.c. And finally, the 'queued' variable is
> now useless, so kill that too.
> 
Thanks for pointing that out.

Thanks,
Zizhi Wo

Re: [PATCH V2] io_uring: validate user-controlled cq.head in io_cqe_cache_refill()

Posted by Jens Axboe 4 weeks, 1 day ago

On Thu, 14 May 2026 10:18:47 +0800, Zizhi Wo wrote:
> [BUG]
> A fuzzing run reproduced an unkillable io_uring task stuck at ~100% CPU:
> 
>     [root@fedora io_uring_stress]# ps -ef | grep io_uring
>     root  1240  1  99 13:36 ?  00:01:35 [io_uring_stress] <defunct>
> 
> The task loops inside io_cqring_wait() and never returns to userspace, and
> SIGKILL has no effect.
> 
> [...]

Applied, thanks!

[1/1] io_uring: validate user-controlled cq.head in io_cqe_cache_refill()
      commit: f44d38a31f1802b7222adaea9ee69f9d280f698a

Best regards,
-- 
Jens Axboe