Re: [PATCHv2] blk-mq: check for stale cached request in blk_mq_submit_bio

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Ming Lei <tom.leiming@gmail.com>
To: Keith Busch <kbusch@meta.com>
Cc: axboe@kernel.dk, hch@lst.de, linux-block@vger.kernel.org,
	Keith Busch <kbusch@kernel.org>
Subject: Re: [PATCHv2] blk-mq: check for stale cached request in blk_mq_submit_bio
Date: Tue, 19 May 2026 09:24:50 +0800	[thread overview]
Message-ID: <agu74lMg10kxPXrl@fedora> (raw)
In-Reply-To: <20260501174120.403960-1-kbusch@meta.com>

On Fri, May 01, 2026 at 10:41:19AM -0700, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
> 
> When submitting a bio to blk-mq, if the task should sleep after peeking
> a cached request, but before it pops it, the plug flushes and calls
> blk_mq_free_plug_rqs, freeing the cached_rqs. This creates a
> use-after-free bug. Fix this by ensuring the cached_rqs still contains
> our peeked request, and retry the bio submission without it if the
> request had been freed.

Yeah, it is one real bug.

> 
> The code had already warned of this possibility, and specifically popped
> the request before other known blocking calls, but it didn't handle a
> blocking GFP_NOIO alloc. Under memory pressure, allocating the split bio
> or the integrity payload are two such cases that can block. The blk-mq
> submit_bio function continues using the peeked request that was already
> freed and re-initialized, so the driver receives that request with a
> NULL'ed mq_hctx, and inevitably panics.
> 
> Relevant kernel messages if you should encounter this condition, where
> the "WARNING" is the harbinger of the panic about to happen:
> 
> ------------[ cut here ]------------
>  WARNING: CPU: 4 PID: 80820 at block/blk-mq.c:3071 blk_mq_submit_bio+0x2cf/0x5b0
> ...
>  BUG: kernel NULL pointer dereference, address: 0000000000000100
>  #PF: supervisor read access in kernel mode
>  #PF: error_code(0x0000) - not-present page
>  PGD 6b367b067 P4D 6b367b067 PUD 6bb5eb067 PMD 0
>  Oops: Oops: 0000 [#1] SMP
> ...
>  Call Trace:
>   <TASK>
>   blk_mq_dispatch_queue_requests+0x46/0x120
>   blk_mq_flush_plug_list+0x38/0x130
>   blk_add_rq_to_plug+0xa2/0x160
>   blk_mq_submit_bio+0x3ab/0x5b0
>   __submit_bio+0x3a/0x260
>   submit_bio_noacct_nocheck+0xc6/0x2b0
>   btrfs_submit_bbio+0x14d/0x520
>   ? btrfs_get_extent+0x43f/0x640
>   submit_extent_folio+0x31f/0x340
>   btrfs_do_readpage+0x2d7/0xac0
>   btrfs_readahead+0x142/0x200
>   ? clear_state_bit+0x520/0x520
>   read_pages+0x57/0x200
>   ? folio_alloc_noprof+0x10c/0x310
>   page_cache_ra_unbounded+0x28c/0x480
>   ? asm_sysvec_call_function+0x16/0x20
>   ? blk_cgroup_congested+0xa/0x50
>   ? page_cache_sync_ra+0x41/0x2d0
>   filemap_get_pages+0x347/0xd50
>   filemap_read+0xd3/0x500
>   ? 0xffffffff81000000
>   __io_read+0x111/0x440
>   io_read+0x23/0x90
>   __io_issue_sqe+0x40/0x120
>   io_issue_sqe+0x3f/0x3a0
>   io_submit_sqes+0x2bd/0x790
>   __se_sys_io_uring_enter+0x100/0xc10
>   ? eventfd_read+0x100/0x1f0
>   ? futex_wake+0x1b9/0x260
>   ? syscall_trace_enter+0x34/0x1d0
>   do_syscall_64+0x6a/0x250
>   entry_SYSCALL_64_after_hwframe+0x4b/0x53
> 
> Fixes: b0077e269f6c1 ("blk-mq: make sure active queue usage is held for bio_integrity_prep()")
> Fixes: 7b4f36cd22a65 ("block: ensure we hold a queue reference when using queue limits")
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
> v1->v2:
> 
>   Warn if the cached_rqs is not NULL when the peeked request isn't the
>   top. This should never happen, but such a bug would be difficult to
>   figure out was happening without the warning. The previous warn was
>   essential to figure out the bug this patch is addressing.
> 
>   If the peeked request was freed, rerun the entire bio setup. The first
>   version potentially performed operations outside the queue usage
>   counter protection, so may have resulted in an invalid bio if it was
>   racing against a driver updating queue limites. This retry also
>   required clearing the integrity allocation if it was done since
>   updated limits may make any previous setup invalid.
> 
>  block/blk-mq.c | 29 +++++++++++++++++++++--------
>  1 file changed, 21 insertions(+), 8 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 4c5c16cce4f8f..73ef3e4be5123 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3096,22 +3096,27 @@ static struct request *blk_mq_peek_cached_request(struct blk_plug *plug,
>  	return rq;
>  }
>  
> -static void blk_mq_use_cached_rq(struct request *rq, struct blk_plug *plug,
> +static bool blk_mq_use_cached_rq(struct request *rq, struct blk_plug *plug,
>  		struct bio *bio)
>  {
> -	if (rq_list_pop(&plug->cached_rqs) != rq)
> -		WARN_ON_ONCE(1);
> -
>  	/*
> -	 * If any qos ->throttle() end up blocking, we will have flushed the
> -	 * plug and hence killed the cached_rq list as well. Pop this entry
> -	 * before we throttle.
> +	 * We will have flushed the plug and hence killed the cached_rq list as
> +	 * well if anything had scheduled. Pop this entry before we throttle if
> +	 * the entry is still valid.
>  	 */
> +	struct request *popped = rq_list_pop(&plug->cached_rqs);
> +
> +	if (popped != rq) {
> +		WARN_ON_ONCE(popped);
> +		return false;
> +	}
> +
>  	rq_qos_throttle(rq->q, bio);
>  
>  	blk_mq_rq_time_init(rq, blk_time_get_ns());
>  	rq->cmd_flags = bio->bi_opf;
>  	INIT_LIST_HEAD(&rq->queuelist);
> +	return true;
>  }
>  
>  static bool bio_unaligned(const struct bio *bio, struct request_queue *q)
> @@ -3154,6 +3159,7 @@ void blk_mq_submit_bio(struct bio *bio)
>  	 */
>  	rq = blk_mq_peek_cached_request(plug, q, bio->bi_opf);
>  
> +retry:
>  	/*
>  	 * A BIO that was released from a zone write plug has already been
>  	 * through the preparation in this function, already holds a reference
> @@ -3211,7 +3217,14 @@ void blk_mq_submit_bio(struct bio *bio)
>  
>  new_request:
>  	if (rq) {
> -		blk_mq_use_cached_rq(rq, plug, bio);
> +		if (!blk_mq_use_cached_rq(rq, plug, bio)) {
> +			struct bio_integrity_payload *bip = bio_integrity(bio);
> +
> +			if (bip && (bip->bip_flags & BIP_BLOCK_INTEGRITY))
> +				bio_integrity_free(bio);
> +			rq = NULL;
> +			goto retry;
> +		}

This way still looks fragile, given rq isn't guaranteed to be live.

Just wondering why not pop it from plug->cached_rqs from beginning, and
move it back in case of `queue_exit:`?



Thanks,
Ming

next prev parent reply	other threads:[~2026-05-19  1:24 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-01 17:41 [PATCHv2] blk-mq: check for stale cached request in blk_mq_submit_bio Keith Busch
2026-05-18 14:44 ` Keith Busch
2026-05-19  6:47   ` Christoph Hellwig
2026-05-19  1:24 ` Ming Lei [this message]
2026-05-19  3:07   ` Keith Busch
2026-05-19  3:36     ` Ming Lei

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=agu74lMg10kxPXrl@fedora \
    --to=tom.leiming@gmail.com \
    --cc=axboe@kernel.dk \
    --cc=hch@lst.de \
    --cc=kbusch@kernel.org \
    --cc=kbusch@meta.com \
    --cc=linux-block@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.