Re: [PATCH] blk-mq: avoid stall during boot due to synchronize_rcu_expedited


From: Uladzislau Rezki <urezki@gmail.com>
To: Mikulas Patocka <mpatocka@redhat.com>
Cc: Uladzislau Rezki <urezki@gmail.com>,
	Fengnan Chang <fengnanchang@gmail.com>,
	Yu Kuai <yukuai3@huawei.com>,
	Fengnan Chang <changfengnan@bytedance.com>,
	Jens Axboe <axboe@kernel.dk>,
	"Paul E. McKenney" <paulmck@kernel.org>,
	Frederic Weisbecker <frederic@kernel.org>,
	Neeraj Upadhyay <neeraj.upadhyay@kernel.org>,
	Joel Fernandes <joelagnelf@nvidia.com>,
	Josh Triplett <josh@joshtriplett.org>,
	Boqun Feng <boqun.feng@gmail.com>,
	rcu@vger.kernel.org, linux-block@vger.kernel.org
Subject: Re: [PATCH] blk-mq: avoid stall during boot due to synchronize_rcu_expedited
Date: Wed, 7 Jan 2026 13:22:30 +0100
Message-ID: <aV5QBv02mcAkzCL7@milan>
In-Reply-To: <7f873eba-7f74-6e74-10c8-dfc178cc2a72@redhat.com>

On Wed, Jan 07, 2026 at 01:05:14PM +0100, Mikulas Patocka wrote:
> 
> 
> On Wed, 7 Jan 2026, Uladzislau Rezki wrote:
> 
> > On Tue, Jan 06, 2026 at 05:59:16PM +0100, Mikulas Patocka wrote:
> > > 
> > > 
> > > On Tue, 6 Jan 2026, Uladzislau Rezki wrote:
> > > 
> > > > On Tue, Jan 06, 2026 at 04:56:07PM +0100, Mikulas Patocka wrote:
> > > > > On kernel 6.19-rc, I am experiencing a 15-second boot stall in a
> > > > > virtual machine when probing a virtio-scsi disk:
> > > > > [    1.011641] SCSI subsystem initialized
> > > > > [    1.013972] virtio_scsi virtio6: 16/0/0 default/read/poll queues
> > > > > [    1.015983] scsi host0: Virtio SCSI HBA
> > > > > [    1.019578] ACPI: \_SB_.GSIA: Enabled at IRQ 16
> > > > > [    1.020225] ahci 0000:00:1f.2: AHCI vers 0001.0000, 32 command slots, 1.5 Gbps, SATA mode
> > > > > [    1.020228] ahci 0000:00:1f.2: 6/6 ports implemented (port mask 0x3f)
> > > > > [    1.020230] ahci 0000:00:1f.2: flags: 64bit ncq only
> > > > > [    1.024688] scsi host1: ahci
> > > > > [    1.025432] scsi host2: ahci
> > > > > [    1.025966] scsi host3: ahci
> > > > > [    1.026511] scsi host4: ahci
> > > > > [    1.028371] scsi host5: ahci
> > > > > [    1.028918] scsi host6: ahci
> > > > > [    1.029266] ata1: SATA max UDMA/133 abar m4096@0xfea23000 port 0xfea23100 irq 16 lpm-pol 1
> > > > > [    1.029305] ata2: SATA max UDMA/133 abar m4096@0xfea23000 port 0xfea23180 irq 16 lpm-pol 1
> > > > > [    1.029316] ata3: SATA max UDMA/133 abar m4096@0xfea23000 port 0xfea23200 irq 16 lpm-pol 1
> > > > > [    1.029327] ata4: SATA max UDMA/133 abar m4096@0xfea23000 port 0xfea23280 irq 16 lpm-pol 1
> > > > > [    1.029341] ata5: SATA max UDMA/133 abar m4096@0xfea23000 port 0xfea23300 irq 16 lpm-pol 1
> > > > > [    1.029356] ata6: SATA max UDMA/133 abar m4096@0xfea23000 port 0xfea23380 irq 16 lpm-pol 1
> > > > > [    1.118111] scsi 0:0:0:0: Direct-Access     QEMU     QEMU HARDDISK 2.5+ PQ: 0 ANSI: 5
> > > > > [    1.348916] ata1: SATA link down (SStatus 0 SControl 300)
> > > > > [    1.350713] ata2: SATA link down (SStatus 0 SControl 300)
> > > > > [    1.351025] ata6: SATA link down (SStatus 0 SControl 300)
> > > > > [    1.351160] ata5: SATA link down (SStatus 0 SControl 300)
> > > > > [    1.351326] ata3: SATA link down (SStatus 0 SControl 300)
> > > > > [    1.351536] ata4: SATA link down (SStatus 0 SControl 300)
> > > > > [    1.449153] input: ImExPS/2 Generic Explorer Mouse as /devices/platform/i8042/serio1/input/input2
> > > > > [   16.483477] sd 0:0:0:0: Power-on or device reset occurred
> > > > > [   16.483691] sd 0:0:0:0: [sda] 2097152 512-byte logical blocks: (1.07 GB/1.00 GiB)
> > > > > [   16.483762] sd 0:0:0:0: [sda] Write Protect is off
> > > > > [   16.483877] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> > > > > [   16.569225] sd 0:0:0:0: [sda] Attached SCSI disk
> > > > > 
> > > > > I bisected it and it is caused by the commit 89e1fb7ceffd which
> > > > > introduces calls to synchronize_rcu_expedited.
> > > > > 
> > > > > This commit replaces synchronize_rcu_expedited and kfree with a call to 
> > > > > kfree_rcu_mightsleep, avoiding the 15-second delay.
> > > > > 
> > > > > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> > > > > Fixes: 89e1fb7ceffd ("blk-mq: fix potential uaf for 'queue_hw_ctx'")
> > > > > 
> > > > > ---
> > > > >  block/blk-mq.c |    3 +--
> > > > >  1 file changed, 1 insertion(+), 2 deletions(-)
> > > > > 
> > > > > Index: linux-2.6/block/blk-mq.c
> > > > > ===================================================================
> > > > > --- linux-2.6.orig/block/blk-mq.c	2026-01-06 16:45:11.000000000 +0100
> > > > > +++ linux-2.6/block/blk-mq.c	2026-01-06 16:48:00.000000000 +0100
> > > > > @@ -4553,8 +4553,7 @@ static void __blk_mq_realloc_hw_ctxs(str
> > > > >  		 * Make sure reading the old queue_hw_ctx from other
> > > > >  		 * context concurrently won't trigger uaf.
> > > > >  		 */
> > > > > -		synchronize_rcu_expedited();
> > > > > -		kfree(hctxs);
> > > > > +		kfree_rcu_mightsleep(hctxs);
> > > > >
> > > > I agree, freeing that way is not optimal. But kfree_rcu_mightsleep()
> > > > might not help either. It has a fallback: if the object cannot be placed
> > > > into a "page" because of a memory allocation failure, it frees inline:
> > > > 
> > > > <snip>
> > > > synchronize_rcu();
> > > > free().
> > > > <snip>
> > > > 
> > > > Please note that synchronize_rcu() can easily be converted into the
> > > > expedited version. See rcu_gp_is_expedited().
> > > > 
> > > > --
> > > > Uladzislau Rezki
> > > 
> > > Would this patch be better? It does a GFP_KERNEL allocation, which
> > > doesn't fail in practice.
> > > 
> > > > Inlining is a corner case but it can happen. The best way is to add
> > > > rcu_head to the blk_mq_hw_ctx structure and use kfree_rcu(). It never
> > > > blocks.
> > > 
> > > We are not protecting the blk_mq_hw_ctx structure with RCU, we are 
> > > protecting the q->queue_hw_ctx array. So, rcu_head cannot be added to an 
> > > array. We could cast the array to rcu_head (and make sure that the initial 
> > > allocation is at least sizeof(struct rcu_head)), but that is hacky.
> > > 
> > > Mikulas
> > > 
> > > ---
> > >  block/blk-mq.c |   23 +++++++++++++++++++++--
> > >  1 file changed, 21 insertions(+), 2 deletions(-)
> > > 
> > > Index: linux-2.6/block/blk-mq.c
> > > ===================================================================
> > > --- linux-2.6.orig/block/blk-mq.c	2026-01-06 15:55:41.000000000 +0100
> > > +++ linux-2.6/block/blk-mq.c	2026-01-06 16:22:40.000000000 +0100
> > > @@ -4531,6 +4531,18 @@ static struct blk_mq_hw_ctx *blk_mq_allo
> > >  	return NULL;
> > >  }
> > >  
> > > +struct rcu_free_hctxs {
> > > +	struct rcu_head head;
> > > +	struct blk_mq_hw_ctx **hctxs;
> > > +};
> > > +
> > > +static void rcu_free_hctxs(struct rcu_head *head)
> > > +{
> > > +	struct rcu_free_hctxs *r = container_of(head, struct rcu_free_hctxs, head);
> > > +	kfree(r->hctxs);
> > > +	kfree(r);
> > > +}
> > > +
> > >  static void __blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
> > >  				     struct request_queue *q)
> > >  {
> > > @@ -4539,6 +4551,7 @@ static void __blk_mq_realloc_hw_ctxs(str
> > >  
> > >  	if (q->nr_hw_queues < set->nr_hw_queues) {
> > >  		struct blk_mq_hw_ctx **new_hctxs;
> > > +		struct rcu_free_hctxs *r;
> > >  
> > >  		new_hctxs = kcalloc_node(set->nr_hw_queues,
> > >  				       sizeof(*new_hctxs), GFP_KERNEL,
> > > @@ -4553,8 +4566,14 @@ static void __blk_mq_realloc_hw_ctxs(str
> > >  		 * Make sure reading the old queue_hw_ctx from other
> > >  		 * context concurrently won't trigger uaf.
> > >  		 */
> > > -		synchronize_rcu_expedited();
> > > -		kfree(hctxs);
> > > +		r = kmalloc(sizeof(struct rcu_free_hctxs), GFP_KERNEL);
> > > +		if (!r) {
> > > +			synchronize_rcu_expedited();
> > > +			kfree(hctxs);
> > > +		} else {
> > > +			r->hctxs = hctxs;
> > > +			call_rcu(&r->head, rcu_free_hctxs);
> > > +		}
> > >  		hctxs = new_hctxs;
> > >  	}
> > >  
> > > > 
> > > 
> > I see. That will work but this looks like a temporary fix. It would be
> > great to understand why synchronize_rcu_expedited() is blocked for so long.
> > 16 seconds is way too long.
> 
> synchronize_rcu_expedited is called 257 times from the block layer. Each
> call takes approximately 50 ms.
> 
OK. I thought a _single_ call to synchronize_rcu_expedited() was stuck for
~15 seconds, whereas you just have many of them.

So you can simply go back to your original patch and use
kfree_rcu_mightsleep(hctxs)!

--
Uladzislau Rezki


Thread overview: 10+ messages
2026-01-06 15:56 [PATCH] blk-mq: avoid stall during boot due to synchronize_rcu_expedited Mikulas Patocka
2026-01-06 16:29 ` Uladzislau Rezki
2026-01-06 16:59   ` Mikulas Patocka
2026-01-07 11:01     ` Uladzislau Rezki
2026-01-07 12:05       ` Mikulas Patocka
2026-01-07 12:22         ` Uladzislau Rezki [this message]
2026-01-07 15:10     ` Jens Axboe
2026-01-07 12:23 ` Uladzislau Rezki
2026-01-07 16:49 ` Bart Van Assche
2026-01-07 16:50 ` Jens Axboe
