From: Sasha Levin
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: Zizhi Wo, Jens Axboe, Sasha Levin, io-uring@vger.kernel.org,
 linux-kernel@vger.kernel.org
Subject: [PATCH AUTOSEL 7.0-6.18] io_uring: validate user-controlled cq.head in io_cqe_cache_refill()
Date: Wed, 20 May 2026 07:19:22 -0400
Message-ID: <20260520111944.3424570-50-sashal@kernel.org>
X-Mailer: git-send-email 2.53.0
In-Reply-To: <20260520111944.3424570-1-sashal@kernel.org>
References: <20260520111944.3424570-1-sashal@kernel.org>
Precedence: bulk
X-Mailing-List: stable@vger.kernel.org
MIME-Version: 1.0
X-stable: review
X-Patchwork-Hint: Ignore
X-stable-base: Linux 7.0.9
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

From: Zizhi Wo

[ Upstream commit f44d38a31f1802b7222adaea9ee69f9d280f698a ]

A fuzzing run reproduced an unkillable io_uring task stuck at ~100% CPU:

  [root@fedora io_uring_stress]# ps -ef | grep io_uring
  root 1240 1 99 13:36 ? 00:01:35 [io_uring_stress]

The task loops inside io_cqring_wait() and never returns to userspace,
and SIGKILL has no effect.

This is caused by the CQ ring exposing rings->cq.head to userspace as
writable, while the authoritative tail lives in kernel-private
ctx->cached_cq_tail. io_cqe_cache_refill() computes free space as an
unsigned subtraction:

  free = ctx->cq_entries - min(tail - head, ctx->cq_entries);

If userspace keeps head within [0, tail], the subtraction is well
defined and min() just acts as a defensive clamp. But if userspace
advances head past tail, (tail - head) wraps to a huge value, free
becomes 0, and io_cqe_cache_refill() fails. The CQE is pushed onto the
overflow list and IO_CHECK_CQ_OVERFLOW_BIT is set.

The wait loop in io_cqring_wait() relies on an invariant: refill() only
fails when the CQ is *physically* full, in which case rings->cq.tail has
been advanced to iowq->cq_tail and io_should_wake() returns true. The
tampered head breaks this: refill() fails while the ring is not full, no
OCQE is copied in, rings->cq.tail never catches up, io_should_wake()
stays false, and io_cqring_wait_schedule() keeps returning early because
IO_CHECK_CQ_OVERFLOW_BIT is still set. The result is a tight retry loop
that never returns to userspace.

Introduce io_cqring_queued() as the single point that converts the
(tail, head) pair into a trustworthy queued count. Since the real
head/tail distance is bounded by cq_entries (far below 2^31), a signed
comparison reliably detects userspace moving head past tail; in that
case treat the queue as empty so callers see the full cache as free and
forward progress is preserved.

Suggested-by: Jens Axboe
Signed-off-by: Zizhi Wo
Link: https://patch.msgid.link/20260514021847.4062782-1-wozizhi@huaweicloud.com
[axboe: fixup commit message, kill 'queued' var, and keep it all in
 io_uring.c]
Signed-off-by: Jens Axboe
Signed-off-by: Sasha Levin
---

LLM Generated explanations, may be completely bogus:

## Phase 1: Commit Message Forensics

Record: Subsystem `io_uring`; action verb `validate`; intent is to
validate a user-controlled CQ head value used by
`io_cqe_cache_refill()`.

Record: Tags found: `Suggested-by: Jens Axboe`, `Signed-off-by: Zizhi
Wo`, `Link: https://patch.msgid.link/20260514021847.4062782-1-wozizhi@huaweicloud.com`,
maintainer edit note from Jens, `Signed-off-by: Jens Axboe`. No
`Fixes:`, `Reported-by:`, `Tested-by:`, `Reviewed-by`, or `Cc: stable`
tags were present.

Record: The commit describes a fuzzed, reproducible user-visible
failure: an `io_uring` task spins at about 100% CPU inside
`io_cqring_wait()`, never returns to userspace, and ignores `SIGKILL`.
Root cause is that userspace can write `rings->cq.head`; if it advances
`head` past the kernel-private `ctx->cached_cq_tail`, unsigned
subtraction wraps, `io_cqe_cache_refill()` sees no free space, overflow
stays set, and the wait loop keeps retrying.

Record: This is not hidden cleanup. It is an explicit bug fix for a
userspace-triggerable livelock/unkillable task.

## Phase 2: Diff Analysis

Record: One file changed: `io_uring/io_uring.c`, 17 insertions and 5
deletions. Modified/added functions: new `io_cqring_queued()`, modified
`io_fill_nop_cqe()`, modified `io_cqe_cache_refill()`. Scope is a
single-file surgical fix.

Record: Before, free CQ space was computed as `ctx->cq_entries -
min(__io_cqring_events(ctx), ctx->cq_entries)`, where
`__io_cqring_events()` is `cached_cq_tail - user_head`. If `user_head >
cached_cq_tail`, that unsigned subtraction wraps and is clamped to
`cq_entries`, making `free` zero.

Record: After, `io_cqring_queued()` casts the tail-head difference to
signed `int`; non-negative values are clamped to `cq_entries`, while
negative values are treated as zero queued entries. `io_fill_nop_cqe()`
uses the same trusted queued-count helper.

Record: Bug category is logic/correctness with user-controlled index
validation failure, causing an overflow-path livelock. It is not a
feature, API, refactor, or hardware enablement.

Record: Fix quality is good: for valid rings it preserves existing
behavior; for invalid `head > tail` it chooses forward progress.
Regression risk is low because the helper is local and affects only CQ
free-space calculation. The only semantic change is for corrupted user
CQ head state.

## Phase 3: Git History Investigation

Record: `git blame` shows the affected free-space calculation in
`io_cqe_cache_refill()` comes from `faf88dde060f74` (`io_uring: don't
inline __io_get_cqe()`), first contained in `v6.0-rc1~181^2~85`.
The overflow ordering guard comes from `aa1df3a360a0c5` (`io_uring: fix
CQE reordering`), first contained in `v6.1-rc1~135^2~10`. The later
`cqe32`/NOP path comes from `e26dca67fde19`, first contained in
`v6.18-rc1~137^2~45`.

Record: No `Fixes:` tag is present, so there was no tagged introducing
commit to follow.

Record: Recent file history shows multiple `io_uring` fixes around
CQ/ring handling, including `61a11cf481272` protecting lockless
`ctx->rings` accesses and `a7d755ed9ce97` fixing overflow CQE
reordering. No prerequisite specific to this helper was identified.

Record: Author Zizhi Wo has other kernel commits, but no recent local
`io_uring` commits found. Jens Axboe is the `IO_URING` maintainer in
`MAINTAINERS` and applied the final patch with edits.

Record: Dependencies: the fix depends only on existing
`ctx->cached_cq_tail`, `ctx->cq_entries`, `READ_ONCE(rings->cq.head)`,
and `min()`. It can be backported standalone, though older stable trees
need context adjustment because the exact function signature and file
layout differ.

## Phase 4: Mailing List And External Research

Record: `b4 dig -c f44d38a31f1802b7222adaea9ee69f9d280f698a` found the
original v2 submission at
`https://patch.msgid.link/20260514021847.4062782-1-wozizhi@huaweicloud.com`.

Record: `b4 dig -a` found v1 and v2. v1 was
`20260513063254.1122354-1-wozizhi@huaweicloud.com`; v2 was the submitted
version that matches the final fix concept. Jens reviewed v1 and said
snapshotting `tail` before a possible NOP fill looked wrong, and noted
the refill path had the same unsigned issue. v2 addressed this by
introducing a helper used by both paths.

Record: `b4 dig -w` showed the right recipients: Jens Axboe, Pavel
Begunkov, `io-uring@vger.kernel.org`, `linux-kernel@vger.kernel.org`,
and related Huawei contacts.

Record: The v2 mbox shows Jens applied it and then further edited it by
moving the helper into `io_uring.c`, removing the now-unused `queued`
variable, and trimming the comments/message.
No NAK was found. No stable nomination was found in the fetched thread.

Record: WebFetch access to lore search pages and git.kernel.org was
blocked by Anubis, so stable-list web search could not be verified
through WebFetch. Local `git log --grep` on sampled stable branches
found no existing exact stable commit.

## Phase 5: Code Semantic Analysis

Record: Key functions: `io_cqring_queued()`, `io_fill_nop_cqe()`,
`io_cqe_cache_refill()`.

Record: Callers: `io_cqe_cache_refill()` is called by
`io_get_cqe_overflow()` in `io_uring/io_uring.h`, which feeds normal CQE
posting, auxiliary CQEs, request completions, multishot completions,
message-ring completions, and overflow flushing. `io_cqring_wait()` is
reached from `SYSCALL_DEFINE6(io_uring_enter)` when
`IORING_ENTER_GETEVENTS` is used.

Record: Callees/side effects: the affected code reads the user-writable
CQ head, computes queue occupancy/free space, sets
`ctx->cqe_cached`/`ctx->cqe_sentinel`, and decides whether completions
go directly to the CQ ring or the overflow list.

Record: Reachability is verified from userspace through
`io_uring_enter()`. The provided reproduction ran as root; unprivileged
triggerability was not independently verified, but the affected state is
controlled by the userspace owner of the mmaped CQ ring.

Record: Similar pattern found: `__io_cqring_events()` in current code
and stable branches computes `cached_cq_tail - READ_ONCE(cq.head)`, so
the unsigned wrap condition is real in the relevant code paths.

## Phase 6: Cross-Referencing And Stable Tree Analysis

Record: The buggy free-space logic exists in sampled stable trees:
`stable/linux-6.1.y` has it in `__io_get_cqe()`, and
`stable/linux-6.6.y`, `stable/linux-6.12.y`, `stable/linux-6.18.y`, and
`stable/linux-6.19.y` have it in `io_cqe_cache_refill()` or equivalent.
The specific min/free logic was introduced for v6.0-rc1, so v6.1+ stable
trees are affected.

Record: `stable/linux-5.15.y` has an older `io_get_cqe()` form using
`__io_cqring_events(ctx) == ctx->cq_entries`, not the same `min(tail -
head, cq_entries)` free-space calculation. I did not verify that the
exact livelock fixed here applies to 5.15, so this decision is driven by
verified v6.1+ evidence.

Record: Expected backport difficulty: low to moderate. 6.18/6.19 are
close but may lack the exact split into `wait.c`/`wait.h` seen in
current 7.0; 6.6/6.12 need a smaller adaptation because there is no
`cqe32`/NOP path; 6.1 needs the helper folded into the older
`__io_get_cqe()` path. The semantic fix is standalone.

Record: No related fix already present was found by exact subject search
in sampled stable branches.

## Phase 7: Subsystem And Maintainer Context

Record: Subsystem is `io_uring`, a core async I/O userspace API.
Criticality is IMPORTANT/CORE-adjacent because it is syscall reachable
and used by databases, storage/network software, runtimes, and fuzzers.

Record: Subsystem activity is high: recent local history shows many
`io_uring` fixes and refactors. The patch was handled by Jens Axboe,
listed maintainer for `IO_URING`.

## Phase 8: Impact And Risk Assessment

Record: Affected users are systems using `io_uring`; trigger requires a
userspace process manipulating its CQ head and waiting for completions.
The reproduction is a fuzzing/stress case with direct userspace control
of the mapped CQ ring.

Record: Trigger likelihood is not “everyday normal app behavior”, but it
is syscall/userspace reachable and can create an unkillable high-CPU
task. Unprivileged triggerability was not independently verified beyond
normal `io_uring` userspace reachability.

Record: Failure mode is HIGH severity: livelock/tight retry loop, 100%
CPU, no return to userspace, and `SIGKILL` ineffective per the commit
and mailing-list patch.

Record: Benefit is high for affected stable trees because it prevents a
userspace-triggered unkillable task.
Risk is low because the change is small, local, and only changes
behavior for invalid user-controlled CQ head state. Risk/benefit
strongly favors backporting.

## Phase 9: Final Synthesis

Record: Evidence for backporting: real fuzzed bug; clear root cause;
userspace-reachable path; severe livelock/unkillable task; small local
fix; maintainer-reviewed evolution from v1 to v2; final
maintainer-applied version; verified affected code in v6.1+ stable
branches.

Record: Evidence against backporting: no explicit `Cc: stable`, no
`Fixes:` tag, no `Tested-by`, and exact patch may need small
branch-specific backport adjustments. These are not enough to outweigh
the verified bug severity and small fix.

Record: Unresolved questions: exact applicability to 5.15 was not
established; exact clean-apply status on each stable branch was not
tested; unprivileged triggerability beyond ordinary userspace `io_uring`
access was not independently proven.

Stable rules checklist:

1. Obviously correct and tested: mostly yes by inspection and maintainer
   review; no explicit `Tested-by` and no local runtime test.
2. Fixes a real bug: yes, fuzzed livelock/unkillable task.
3. Important issue: yes, high-severity CPU spin and unkillable wait.
4. Small and contained: yes, one file, 17 insertions and 5 deletions.
5. No new feature/API: yes, static helper only.
6. Can apply to stable: yes with likely minor backport adjustments for
   older branches.

Exception category: none; this is a direct bug fix, not a device ID,
quirk, DT, build, or documentation exception.

Decision: backport to affected stable trees, especially v6.1+ where the
buggy free-space calculation was verified. Avoid claiming 5.15 without a
separate targeted analysis/backport.

## Verification

- [Phase 1] Parsed commit object
  `f44d38a31f1802b7222adaea9ee69f9d280f698a` with `git show`; confirmed
  subject, tags, and 17/5 diffstat.
- [Phase 2] Inspected the candidate diff with `git show`; confirmed new
  `io_cqring_queued()` and replacements in `io_fill_nop_cqe()` and
  `io_cqe_cache_refill()`.
- [Phase 3] Ran `git blame` on affected lines; confirmed key code came
  from `faf88dde060f74`, `aa1df3a360a0c5`, and `e26dca67fde19`.
- [Phase 3] Ran `git describe --contains`; confirmed first containment
  around v6.0-rc1, v6.1-rc1, and v6.18-rc1 respectively.
- [Phase 4] Ran `b4 dig -c`, `b4 dig -a`, and `b4 dig -w`; confirmed
  v1/v2 patch history, recipients, and maintainer involvement.
- [Phase 4] Fetched v1/v2 mboxes with `b4 mbox`; confirmed Jens’ v1
  concern and v2 application with edits.
- [Phase 5] Used `rg` and file reads to trace `io_cqe_cache_refill()`
  through CQE posting and `io_uring_enter()`/`IORING_ENTER_GETEVENTS`.
- [Phase 6] Checked stable branches with `git blame`; verified affected
  logic in sampled v6.1, v6.6, v6.12, v6.18, and v6.19 branches.
- [Phase 7] Checked `MAINTAINERS`; verified Jens Axboe is listed
  maintainer for `IO_URING`.
- [Phase 8] Verified failure mode from commit message and mailing-list
  patch body; did not independently run the fuzzer or reproducer.
- UNVERIFIED: exact clean apply on every stable tree, exact
  applicability to 5.15, and unprivileged triggerability.

**YES**

 io_uring/io_uring.c | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index a72efb3a62bac..431d157e81595 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -680,13 +680,27 @@ static struct io_overflow_cqe *io_alloc_ocqe(struct io_ring_ctx *ctx,
 	return ocqe;
 }
 
+/*
+ * Compute queued CQEs for free-space calculation, clamped to cq_entries.
+ */
+static unsigned int io_cqring_queued(struct io_ring_ctx *ctx)
+{
+	struct io_rings *rings = io_get_rings(ctx);
+	int diff;
+
+	diff = (int)(ctx->cached_cq_tail - READ_ONCE(rings->cq.head));
+	if (diff >= 0)
+		return min((unsigned int)diff, ctx->cq_entries);
+	return 0;
+}
+
 /*
  * Fill an empty dummy CQE, in case alignment is off for posting a 32b CQE
  * because the ring is a single 16b entry away from wrapping.
  */
 static bool io_fill_nop_cqe(struct io_ring_ctx *ctx, unsigned int off)
 {
-	if (__io_cqring_events(ctx) < ctx->cq_entries) {
+	if (io_cqring_queued(ctx) < ctx->cq_entries) {
 		struct io_uring_cqe *cqe = &ctx->rings->cqes[off];
 
 		cqe->user_data = 0;
@@ -707,7 +721,7 @@ bool io_cqe_cache_refill(struct io_ring_ctx *ctx, bool overflow, bool cqe32)
 {
 	struct io_rings *rings = ctx->rings;
 	unsigned int off = ctx->cached_cq_tail & (ctx->cq_entries - 1);
-	unsigned int free, queued, len;
+	unsigned int free, len;
 
 	/*
 	 * Posting into the CQ when there are pending overflowed CQEs may break
@@ -727,9 +741,7 @@ bool io_cqe_cache_refill(struct io_ring_ctx *ctx, bool overflow, bool cqe32)
 		off = 0;
 	}
 
-	/* userspace may cheat modifying the tail, be safe and do min */
-	queued = min(__io_cqring_events(ctx), ctx->cq_entries);
-	free = ctx->cq_entries - queued;
+	free = ctx->cq_entries - io_cqring_queued(ctx);
 	/* we need a contiguous range, limit based on the current array offset */
 	len = min(free, ctx->cq_entries - off);
 	if (len < (cqe32 + 1))
-- 
2.53.0