[v1] io_uring: validate user-controlled cq.head in io_cqe_cache_refill()

[PATCH AUTOSEL 7.0-6.18] io_uring: validate user-controlled cq.head in io_cqe_cache_refill()
Posted by Sasha Levin 4 days, 11 hours ago
From: Zizhi Wo <wozizhi@huawei.com>

[ Upstream commit f44d38a31f1802b7222adaea9ee69f9d280f698a ]

A fuzzing run reproduced an unkillable io_uring task stuck at ~100% CPU:

[root@fedora io_uring_stress]# ps -ef | grep io_uring
root  1240  1  99 13:36 ?  00:01:35 [io_uring_stress] <defunct>

The task loops inside io_cqring_wait() and never returns to userspace,
and SIGKILL has no effect.

This is caused by the CQ ring exposing rings->cq.head to userspace as
writable, while the authoritative tail lives in kernel-private
ctx->cached_cq_tail. io_cqe_cache_refill() computes free space as an
unsigned subtraction:

    free = ctx->cq_entries - min(tail - head, ctx->cq_entries);

If userspace keeps head within [0, tail], the subtraction is well
defined and min() just acts as a defensive clamp. But if userspace
advances head past tail, (tail - head) wraps to a huge value, free
becomes 0, and io_cqe_cache_refill() fails. The CQE is pushed onto the
overflow list and IO_CHECK_CQ_OVERFLOW_BIT is set.

The wait loop in io_cqring_wait() relies on an invariant: refill() only
fails when the CQ is *physically* full, in which case rings->cq.tail has
been advanced to iowq->cq_tail and io_should_wake() returns true. The
tampered head breaks this: refill() fails while the ring is not full, no
OCQE is copied in, rings->cq.tail never catches up, io_should_wake()
stays false, and io_cqring_wait_schedule() keeps returning early because
IO_CHECK_CQ_OVERFLOW_BIT is still set. The result is a tight retry loop
that never returns to userspace.

Introduce io_cqring_queued() as the single point that converts the
(tail, head) pair into a trustworthy queued count. Since the real
head/tail distance is bounded by cq_entries (far below 2^31), a signed
comparison reliably detects userspace moving head past tail; in that
case treat the queue as empty so callers see the full cache as free and
forward progress is preserved.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Link: https://patch.msgid.link/20260514021847.4062782-1-wozizhi@huaweicloud.com
[axboe: fixup commit message, kill 'queued' var, and keep it all in
io_uring.c]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

## Phase 1: Commit Message Forensics
Record: Subsystem `io_uring`; action verb `validate`; intent is to
validate a user-controlled CQ head value used by
`io_cqe_cache_refill()`.

Record: Tags found: `Suggested-by: Jens Axboe`,
`Signed-off-by: Zizhi Wo`,
`Link: https://patch.msgid.link/20260514021847.4062782-1-wozizhi@huaweicloud.com`,
maintainer edit note from Jens, and `Signed-off-by: Jens Axboe`. No
`Fixes:`, `Reported-by:`, `Tested-by:`, `Reviewed-by:`, or `Cc: stable`
tags were present.

Record: The commit describes a fuzzed, reproducible user-visible
failure: an `io_uring` task spins at about 100% CPU inside
`io_cqring_wait()`, never returns to userspace, and ignores `SIGKILL`.
Root cause is that userspace can write `rings->cq.head`; if it advances
`head` past the kernel-private `ctx->cached_cq_tail`, unsigned
subtraction wraps, `io_cqe_cache_refill()` sees no free space, overflow
stays set, and the wait loop keeps retrying.

Record: This is not hidden cleanup. It is an explicit bug fix for a
userspace-triggerable livelock/unkillable task.

## Phase 2: Diff Analysis
Record: One file changed: `io_uring/io_uring.c`, 17 insertions and 5
deletions. Modified/added functions: new `io_cqring_queued()`, modified
`io_fill_nop_cqe()`, modified `io_cqe_cache_refill()`. Scope is a
single-file surgical fix.

Record: Before, free CQ space was computed as `ctx->cq_entries -
min(__io_cqring_events(ctx), ctx->cq_entries)`, where
`__io_cqring_events()` is `cached_cq_tail - user_head`. If `user_head >
cached_cq_tail`, that unsigned subtraction wraps and is clamped to
`cq_entries`, making `free` zero.

Record: After, `io_cqring_queued()` casts the tail-head difference to
signed `int`; non-negative values are clamped to `cq_entries`, while
negative values are treated as zero queued entries. `io_fill_nop_cqe()`
uses the same trusted queued-count helper.
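
The before/after behavior can be modeled in plain userspace C. This is
a minimal sketch, not the kernel code: `queued_old()`, `queued_new()`,
and `cq_free()` are hypothetical names, and the `tail`, `head`, and
`entries` parameters stand in for `ctx->cached_cq_tail`,
`rings->cq.head`, and `ctx->cq_entries`. It assumes the usual
two's-complement conversion from unsigned to signed that the kernel
itself relies on.

```c
#include <stdint.h>

static uint32_t min_u32(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

/* Old calculation: unsigned distance, clamped to the ring size. */
static uint32_t queued_old(uint32_t tail, uint32_t head, uint32_t entries)
{
	/* If head > tail, (tail - head) wraps to a huge value and clamps to entries. */
	return min_u32(tail - head, entries);
}

/* New calculation, modeled on io_cqring_queued() from the diff. */
static uint32_t queued_new(uint32_t tail, uint32_t head, uint32_t entries)
{
	/* Reinterpret the distance as signed; a bogus head shows up as negative. */
	int32_t diff = (int32_t)(tail - head);

	if (diff >= 0)
		return min_u32((uint32_t)diff, entries);
	return 0;	/* head moved past tail: treat the queue as empty */
}

/* Free slots derived from a queued count that is already <= entries. */
static uint32_t cq_free(uint32_t queued, uint32_t entries)
{
	return entries - queued;
}
```

With `entries = 8`, `tail = 5`, and a tampered `head = 9`,
`queued_old()` returns 8 (so free space is 0), while `queued_new()`
returns 0 (so free space is 8). A legitimate counter wrap such as
`tail = 3`, `head = 0xFFFFFFFE` yields 5 from both helpers, which is
why the signed cast does not disturb valid rings.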

Record: Bug category is a logic/correctness error: a user-controlled
index is used without validation, which causes an overflow-path
livelock. It is not a feature, API, refactor, or hardware enablement.

Record: Fix quality is good: for valid rings it preserves existing
behavior; for invalid `head > tail` it chooses forward progress.
Regression risk is low because the helper is local and affects only CQ
free-space calculation. The only semantic change is for corrupted user
CQ head state.

## Phase 3: Git History Investigation
Record: `git blame` shows the affected free-space calculation in
`io_cqe_cache_refill()` comes from `faf88dde060f74` (`io_uring: don't
inline __io_get_cqe()`), first contained in `v6.0-rc1~181^2~85`. The
overflow ordering guard comes from `aa1df3a360a0c5` (`io_uring: fix CQE
reordering`), first contained in `v6.1-rc1~135^2~10`. The later
`cqe32`/NOP path comes from `e26dca67fde19`, first contained in
`v6.18-rc1~137^2~45`.

Record: No `Fixes:` tag is present, so there was no tagged introducing
commit to follow.

Record: Recent file history shows multiple `io_uring` fixes around
CQ/ring handling, including `61a11cf481272` protecting lockless
`ctx->rings` accesses and `a7d755ed9ce97` fixing overflow CQE
reordering. No prerequisite specific to this helper was identified.

Record: Author Zizhi Wo has other kernel commits, but no recent
`io_uring` commits from this author were found locally. Jens Axboe is
the `IO_URING` maintainer in `MAINTAINERS` and applied the final patch
with edits.

Record: Dependencies: the fix depends only on existing
`ctx->cached_cq_tail`, `ctx->cq_entries`, `READ_ONCE(rings->cq.head)`,
and `min()`. It can be backported standalone, though older stable trees
need context adjustment because the exact function signature and file
layout differ.

## Phase 4: Mailing List And External Research
Record: `b4 dig -c f44d38a31f1802b7222adaea9ee69f9d280f698a` found the
original v2 submission at
`https://patch.msgid.link/20260514021847.4062782-1-wozizhi@huaweicloud.com`.

Record: `b4 dig -a` found v1 and v2. v1 was
`20260513063254.1122354-1-wozizhi@huaweicloud.com`; v2 was the submitted
version that matches the final fix concept. Jens reviewed v1 and said
snapshotting `tail` before a possible NOP fill looked wrong, and noted
the refill path had the same unsigned issue. v2 addressed this by
introducing a helper used by both paths.

Record: `b4 dig -w` showed the right recipients: Jens Axboe, Pavel
Begunkov, `io-uring@vger.kernel.org`, `linux-kernel@vger.kernel.org`,
and related Huawei contacts.

Record: The v2 mbox shows Jens applied it and then further edited it by
moving the helper into `io_uring.c`, removing the now-unused `queued`
variable, and trimming the comments/message. No NAK was found. No stable
nomination was found in the fetched thread.

Record: WebFetch access to lore search pages and git.kernel.org was
blocked by Anubis, so stable-list web search could not be verified
through WebFetch. Local `git log --grep` on sampled stable branches
found no existing exact stable commit.

## Phase 5: Code Semantic Analysis
Record: Key functions: `io_cqring_queued()`, `io_fill_nop_cqe()`,
`io_cqe_cache_refill()`.

Record: Callers: `io_cqe_cache_refill()` is called by
`io_get_cqe_overflow()` in `io_uring/io_uring.h`, which feeds normal CQE
posting, auxiliary CQEs, request completions, multishot completions,
message-ring completions, and overflow flushing. `io_cqring_wait()` is
reached from `SYSCALL_DEFINE6(io_uring_enter)` when
`IORING_ENTER_GETEVENTS` is used.

Record: Callees/side effects: the affected code reads the user-writable
CQ head, computes queue occupancy/free space, sets
`ctx->cqe_cached`/`ctx->cqe_sentinel`, and decides whether completions
go directly to the CQ ring or the overflow list.
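
How the free count feeds the direct-versus-overflow decision can be
sketched as below, based only on the lines visible in the diff context
(`off`, `free`, `len`, and the `len < (cqe32 + 1)` check). The helper
names `cq_contig_len()` and `cq_refill_ok()` are hypothetical, the ring
size is assumed to be a power of two, and the NOP-padding step for 32b
CQEs near the end of the ring is deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Number of contiguous CQE slots available starting at the tail.
 * 'queued' is assumed to be already clamped to 'entries'.
 */
static uint32_t cq_contig_len(uint32_t tail, uint32_t queued, uint32_t entries)
{
	uint32_t off = tail & (entries - 1);	/* array offset of the tail */
	uint32_t free_slots = entries - queued;	/* total free slots */
	uint32_t to_end = entries - off;	/* slots before the array wraps */

	/* We need a contiguous range, so limit by the distance to the end. */
	return free_slots < to_end ? free_slots : to_end;
}

/* A 32b CQE needs two slots, a 16b CQE needs one. */
static bool cq_refill_ok(uint32_t len, bool cqe32)
{
	return len >= (uint32_t)cqe32 + 1;
}
```

When `cq_refill_ok()` is false in this model, the completion would go
to the overflow list. With the old calculation, a tampered head forces
`queued` to `entries`, so `cq_contig_len()` returns 0 and every refill
fails even though the ring has room.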

Record: Reachability is verified from userspace through
`io_uring_enter()`. The provided reproduction ran as root; unprivileged
triggerability was not independently verified, but the affected state is
controlled by the userspace owner of the mmaped CQ ring.

Record: Similar pattern found: `__io_cqring_events()` in current code
and stable branches computes `cached_cq_tail - READ_ONCE(cq.head)`, so
the unsigned wrap condition is real in the relevant code paths.

## Phase 6: Cross-Referencing And Stable Tree Analysis
Record: The buggy free-space logic exists in sampled stable trees:
`stable/linux-6.1.y` has it in `__io_get_cqe()`, and
`stable/linux-6.6.y`, `stable/linux-6.12.y`, `stable/linux-6.18.y`, and
`stable/linux-6.19.y` have it in `io_cqe_cache_refill()` or equivalent.
The specific min/free logic was introduced for v6.0-rc1, so v6.1+ stable
trees are affected.

Record: `stable/linux-5.15.y` has an older `io_get_cqe()` form using
`__io_cqring_events(ctx) == ctx->cq_entries`, not the same `min(tail -
head, cq_entries)` free-space calculation. I did not verify that the
exact livelock fixed here applies to 5.15, so this decision is driven by
verified v6.1+ evidence.

Record: Expected backport difficulty: low to moderate. 6.18/6.19 are
close but may lack the exact split into `wait.c`/`wait.h` seen in
current 7.0; 6.6/6.12 need a smaller adaptation because there is no
`cqe32`/NOP path; 6.1 needs the helper folded into the older
`__io_get_cqe()` path. The semantic fix is standalone.

Record: No related fix already present was found by exact subject search
in sampled stable branches.

## Phase 7: Subsystem And Maintainer Context
Record: Subsystem is `io_uring`, a core async I/O userspace API.
Criticality is IMPORTANT/CORE-adjacent because it is syscall reachable
and used by databases, storage/network software, runtimes, and fuzzers.

Record: Subsystem activity is high: recent local history shows many
`io_uring` fixes and refactors. The patch was handled by Jens Axboe,
listed maintainer for `IO_URING`.

## Phase 8: Impact And Risk Assessment
Record: Affected users are systems using `io_uring`; trigger requires a
userspace process manipulating its CQ head and waiting for completions.
The reproduction is a fuzzing/stress case with direct userspace control
of the mapped CQ ring.

Record: Trigger likelihood is not “everyday normal app behavior”, but it
is syscall/userspace reachable and can create an unkillable high-CPU
task. Unprivileged triggerability was not independently verified beyond
normal `io_uring` userspace reachability.

Record: Failure mode is HIGH severity: livelock/tight retry loop, 100%
CPU, no return to userspace, and `SIGKILL` ineffective per the commit
and mailing-list patch.

Record: Benefit is high for affected stable trees because it prevents a
userspace-triggered unkillable task. Risk is low because the change is
small, local, and only changes behavior for invalid user-controlled CQ
head state. Risk/benefit strongly favors backporting.

## Phase 9: Final Synthesis
Record: Evidence for backporting: real fuzzed bug; clear root cause;
userspace-reachable path; severe livelock/unkillable task; small local
fix; maintainer-reviewed evolution from v1 to v2; final
maintainer-applied version; verified affected code in v6.1+ stable
branches.

Record: Evidence against backporting: no explicit `Cc: stable`, no
`Fixes:` tag, no `Tested-by`, and exact patch may need small
branch-specific backport adjustments. These are not enough to outweigh
the verified bug severity and small fix.

Record: Unresolved questions: exact applicability to 5.15 was not
established; exact clean-apply status on each stable branch was not
tested; unprivileged triggerability beyond ordinary userspace `io_uring`
access was not independently proven.

Stable rules checklist:
1. Obviously correct and tested: mostly yes by inspection and maintainer
   review; no explicit `Tested-by` and no local runtime test.
2. Fixes a real bug: yes, fuzzed livelock/unkillable task.
3. Important issue: yes, high-severity CPU spin and unkillable wait.
4. Small and contained: yes, one file, 17 insertions and 5 deletions.
5. No new feature/API: yes, static helper only.
6. Can apply to stable: yes with likely minor backport adjustments for
   older branches.

Exception category: none; this is a direct bug fix, not a device ID,
quirk, DT, build, or documentation exception.

Decision: backport to affected stable trees, especially v6.1+ where the
buggy free-space calculation was verified. Avoid claiming 5.15 without a
separate targeted analysis/backport.

## Verification
- [Phase 1] Parsed commit object
  `f44d38a31f1802b7222adaea9ee69f9d280f698a` with `git show`; confirmed
  subject, tags, and 17/5 diffstat.
- [Phase 2] Inspected the candidate diff with `git show`; confirmed new
  `io_cqring_queued()` and replacements in `io_fill_nop_cqe()` and
  `io_cqe_cache_refill()`.
- [Phase 3] Ran `git blame` on affected lines; confirmed key code came
  from `faf88dde060f74`, `aa1df3a360a0c5`, and `e26dca67fde19`.
- [Phase 3] Ran `git describe --contains`; confirmed first containment
  around v6.0-rc1, v6.1-rc1, and v6.18-rc1 respectively.
- [Phase 4] Ran `b4 dig -c`, `b4 dig -a`, and `b4 dig -w`; confirmed
  v1/v2 patch history, recipients, and maintainer involvement.
- [Phase 4] Fetched v1/v2 mboxes with `b4 mbox`; confirmed Jens’ v1
  concern and v2 application with edits.
- [Phase 5] Used `rg` and file reads to trace `io_cqe_cache_refill()`
  through CQE posting and `io_uring_enter()`/`IORING_ENTER_GETEVENTS`.
- [Phase 6] Checked stable branches with `git blame`; verified affected
  logic in sampled v6.1, v6.6, v6.12, v6.18, and v6.19 branches.
- [Phase 7] Checked `MAINTAINERS`; verified Jens Axboe is listed
  maintainer for `IO_URING`.
- [Phase 8] Verified failure mode from commit message and mailing-list
  patch body; did not independently run the fuzzer or reproducer.
- UNVERIFIED: exact clean apply on every stable tree, exact
  applicability to 5.15, and unprivileged triggerability.

**YES**

 io_uring/io_uring.c | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index a72efb3a62bac..431d157e81595 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -680,13 +680,27 @@ static struct io_overflow_cqe *io_alloc_ocqe(struct io_ring_ctx *ctx,
 	return ocqe;
 }
 
+/*
+ * Compute queued CQEs for free-space calculation, clamped to cq_entries.
+ */
+static unsigned int io_cqring_queued(struct io_ring_ctx *ctx)
+{
+	struct io_rings *rings = io_get_rings(ctx);
+	int diff;
+
+	diff = (int)(ctx->cached_cq_tail - READ_ONCE(rings->cq.head));
+	if (diff >= 0)
+		return min((unsigned int)diff, ctx->cq_entries);
+	return 0;
+}
+
 /*
  * Fill an empty dummy CQE, in case alignment is off for posting a 32b CQE
  * because the ring is a single 16b entry away from wrapping.
  */
 static bool io_fill_nop_cqe(struct io_ring_ctx *ctx, unsigned int off)
 {
-	if (__io_cqring_events(ctx) < ctx->cq_entries) {
+	if (io_cqring_queued(ctx) < ctx->cq_entries) {
 		struct io_uring_cqe *cqe = &ctx->rings->cqes[off];
 
 		cqe->user_data = 0;
@@ -707,7 +721,7 @@ bool io_cqe_cache_refill(struct io_ring_ctx *ctx, bool overflow, bool cqe32)
 {
 	struct io_rings *rings = ctx->rings;
 	unsigned int off = ctx->cached_cq_tail & (ctx->cq_entries - 1);
-	unsigned int free, queued, len;
+	unsigned int free, len;
 
 	/*
 	 * Posting into the CQ when there are pending overflowed CQEs may break
@@ -727,9 +741,7 @@ bool io_cqe_cache_refill(struct io_ring_ctx *ctx, bool overflow, bool cqe32)
 		off = 0;
 	}
 
-	/* userspace may cheat modifying the tail, be safe and do min */
-	queued = min(__io_cqring_events(ctx), ctx->cq_entries);
-	free = ctx->cq_entries - queued;
+	free = ctx->cq_entries - io_cqring_queued(ctx);
 	/* we need a contiguous range, limit based on the current array offset */
 	len = min(free, ctx->cq_entries - off);
 	if (len < (cqe32 + 1))
-- 
2.53.0