From: Ming Lei <tom.leiming@gmail.com>
To: Keith Busch <kbusch@meta.com>
Cc: axboe@kernel.dk, hch@lst.de, linux-block@vger.kernel.org,
Keith Busch <kbusch@kernel.org>
Subject: Re: [PATCHv2] blk-mq: check for stale cached request in blk_mq_submit_bio
Date: Tue, 19 May 2026 09:24:50 +0800 [thread overview]
Message-ID: <agu74lMg10kxPXrl@fedora> (raw)
In-Reply-To: <20260501174120.403960-1-kbusch@meta.com>
On Fri, May 01, 2026 at 10:41:19AM -0700, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
>
> When submitting a bio to blk-mq, if the task should sleep after peeking
> a cached request, but before it pops it, the plug flushes and calls
> blk_mq_free_plug_rqs, freeing the cached_rqs. This creates a
> use-after-free bug. Fix this by ensuring the cached_rqs still contains
> our peeked request, and retry the bio submission without it if the
> request had been freed.
Yeah, it is one real bug.
>
> The code had already warned of this possibility, and specifically popped
> the request before other known blocking calls, but it didn't handle a
> blocking GFP_NOIO alloc. Under memory pressure, allocating the split bio
> or the integrity payload are two such cases that can block. The blk-mq
> submit_bio function continues using the peeked request that was already
> freed and re-initialized, so the driver receives that request with a
> NULL'ed mq_hctx, and inevitably panics.
>
> Relevant kernel messages if you should encounter this condition, where
> the "WARNING" is the harbinger of the panic about to happen:
>
> ------------[ cut here ]------------
> WARNING: CPU: 4 PID: 80820 at block/blk-mq.c:3071 blk_mq_submit_bio+0x2cf/0x5b0
> ...
> BUG: kernel NULL pointer dereference, address: 0000000000000100
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> PGD 6b367b067 P4D 6b367b067 PUD 6bb5eb067 PMD 0
> Oops: Oops: 0000 [#1] SMP
> ...
> Call Trace:
> <TASK>
> blk_mq_dispatch_queue_requests+0x46/0x120
> blk_mq_flush_plug_list+0x38/0x130
> blk_add_rq_to_plug+0xa2/0x160
> blk_mq_submit_bio+0x3ab/0x5b0
> __submit_bio+0x3a/0x260
> submit_bio_noacct_nocheck+0xc6/0x2b0
> btrfs_submit_bbio+0x14d/0x520
> ? btrfs_get_extent+0x43f/0x640
> submit_extent_folio+0x31f/0x340
> btrfs_do_readpage+0x2d7/0xac0
> btrfs_readahead+0x142/0x200
> ? clear_state_bit+0x520/0x520
> read_pages+0x57/0x200
> ? folio_alloc_noprof+0x10c/0x310
> page_cache_ra_unbounded+0x28c/0x480
> ? asm_sysvec_call_function+0x16/0x20
> ? blk_cgroup_congested+0xa/0x50
> ? page_cache_sync_ra+0x41/0x2d0
> filemap_get_pages+0x347/0xd50
> filemap_read+0xd3/0x500
> ? 0xffffffff81000000
> __io_read+0x111/0x440
> io_read+0x23/0x90
> __io_issue_sqe+0x40/0x120
> io_issue_sqe+0x3f/0x3a0
> io_submit_sqes+0x2bd/0x790
> __se_sys_io_uring_enter+0x100/0xc10
> ? eventfd_read+0x100/0x1f0
> ? futex_wake+0x1b9/0x260
> ? syscall_trace_enter+0x34/0x1d0
> do_syscall_64+0x6a/0x250
> entry_SYSCALL_64_after_hwframe+0x4b/0x53
>
> Fixes: b0077e269f6c1 ("blk-mq: make sure active queue usage is held for bio_integrity_prep()")
> Fixes: 7b4f36cd22a65 ("block: ensure we hold a queue reference when using queue limits")
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
> v1->v2:
>
> Warn if the cached_rqs is not NULL when the peeked request isn't the
> top. This should never happen, but such a bug would be difficult to
> figure out was happening without the warning. The previous warn was
> essential to figure out the bug this patch is addressing.
>
> If the peeked request was freed, rerun the entire bio setup. The first
> version potentially performed operations outside the queue usage
> counter protection, so may have resulted in an invalid bio if it was
> racing against a driver updating queue limites. This retry also
> required clearing the integrity allocation if it was done since
> updated limits may make any previous setup invalid.
>
> block/blk-mq.c | 29 +++++++++++++++++++++--------
> 1 file changed, 21 insertions(+), 8 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 4c5c16cce4f8f..73ef3e4be5123 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3096,22 +3096,27 @@ static struct request *blk_mq_peek_cached_request(struct blk_plug *plug,
> return rq;
> }
>
> -static void blk_mq_use_cached_rq(struct request *rq, struct blk_plug *plug,
> +static bool blk_mq_use_cached_rq(struct request *rq, struct blk_plug *plug,
> struct bio *bio)
> {
> - if (rq_list_pop(&plug->cached_rqs) != rq)
> - WARN_ON_ONCE(1);
> -
> /*
> - * If any qos ->throttle() end up blocking, we will have flushed the
> - * plug and hence killed the cached_rq list as well. Pop this entry
> - * before we throttle.
> + * We will have flushed the plug and hence killed the cached_rq list as
> + * well if anything had scheduled. Pop this entry before we throttle if
> + * the entry is still valid.
> */
> + struct request *popped = rq_list_pop(&plug->cached_rqs);
> +
> + if (popped != rq) {
> + WARN_ON_ONCE(popped);
> + return false;
> + }
> +
> rq_qos_throttle(rq->q, bio);
>
> blk_mq_rq_time_init(rq, blk_time_get_ns());
> rq->cmd_flags = bio->bi_opf;
> INIT_LIST_HEAD(&rq->queuelist);
> + return true;
> }
>
> static bool bio_unaligned(const struct bio *bio, struct request_queue *q)
> @@ -3154,6 +3159,7 @@ void blk_mq_submit_bio(struct bio *bio)
> */
> rq = blk_mq_peek_cached_request(plug, q, bio->bi_opf);
>
> +retry:
> /*
> * A BIO that was released from a zone write plug has already been
> * through the preparation in this function, already holds a reference
> @@ -3211,7 +3217,14 @@ void blk_mq_submit_bio(struct bio *bio)
>
> new_request:
> if (rq) {
> - blk_mq_use_cached_rq(rq, plug, bio);
> + if (!blk_mq_use_cached_rq(rq, plug, bio)) {
> + struct bio_integrity_payload *bip = bio_integrity(bio);
> +
> + if (bip && (bip->bip_flags & BIP_BLOCK_INTEGRITY))
> + bio_integrity_free(bio);
> + rq = NULL;
> + goto retry;
> + }
This way still looks fragile, given rq isn't guaranteed to be live.
Just wondering why not pop it from plug->cached_rqs from beginning, and
move it back in case of `queue_exit:`?
Thanks,
Ming
next prev parent reply other threads:[~2026-05-19 1:24 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-01 17:41 [PATCHv2] blk-mq: check for stale cached request in blk_mq_submit_bio Keith Busch
2026-05-18 14:44 ` Keith Busch
2026-05-19 6:47 ` Christoph Hellwig
2026-05-19 1:24 ` Ming Lei [this message]
2026-05-19 3:07 ` Keith Busch
2026-05-19 3:36 ` Ming Lei
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=agu74lMg10kxPXrl@fedora \
--to=tom.leiming@gmail.com \
--cc=axboe@kernel.dk \
--cc=hch@lst.de \
--cc=kbusch@kernel.org \
--cc=kbusch@meta.com \
--cc=linux-block@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox