public inbox for stable@vger.kernel.org
* [PATCH v2 0/1] block/blk-mq: fix RT kernel regression with dedicated quiesce_sync_lock
@ 2026-02-10 20:49 Ionut Nechita (Wind River)
  2026-02-10 20:49 ` [PATCH v2 1/1] " Ionut Nechita (Wind River)
  0 siblings, 1 reply; 5+ messages in thread
From: Ionut Nechita (Wind River) @ 2026-02-10 20:49 UTC (permalink / raw)
  To: axboe
  Cc: linux-block, linux-kernel, linux-rt-users, ming.lei, muchun.song,
	mkhalfella, sunlightlinux, chris.friesen, stable, ionut_n2001,
	bigeasy, ionut.nechita

Hi Jens,

This is v2 of the fix for the RT kernel performance regression caused by
commit 679b1874eba7 ("block: fix ordering between checking
QUEUE_FLAG_QUIESCED request adding").

Changes since v1 (RESEND, Jan 9):
- Rebased on top of axboe/for-7.0/block
- No code changes

The problem: on PREEMPT_RT kernels, the spinlock_t queue_lock added in
blk_mq_run_hw_queue() converts to a sleeping rt_mutex, causing all IRQ
threads (one per MSI-X vector) to serialize. On megaraid_sas with 8
MSI-X vectors, throughput drops from 640 MB/s to 153 MB/s.

The fix introduces a dedicated raw_spinlock_t quiesce_sync_lock that
does not convert to rt_mutex on RT kernels. The critical section is
provably short (only flag and counter checks), making raw_spinlock safe.

A previous version of this fix used memory barriers, but that approach
was rejected due to barrier-pairing complexity across multiple call
sites (as noted by Muchun Song).

Ionut Nechita (1):
  block/blk-mq: fix RT kernel regression with dedicated
    quiesce_sync_lock

 block/blk-core.c       |  1 +
 block/blk-mq.c         | 27 ++++++++++++++++-----------
 include/linux/blkdev.h |  6 ++++++
 3 files changed, 23 insertions(+), 11 deletions(-)

-- 
2.52.0


^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH v2 1/1] block/blk-mq: fix RT kernel regression with dedicated quiesce_sync_lock
  2026-02-10 20:49 [PATCH v2 0/1] block/blk-mq: fix RT kernel regression with dedicated quiesce_sync_lock Ionut Nechita (Wind River)
@ 2026-02-10 20:49 ` Ionut Nechita (Wind River)
  2026-02-10 21:39   ` Keith Busch
  2026-02-11 11:31   ` Sebastian Andrzej Siewior
  0 siblings, 2 replies; 5+ messages in thread
From: Ionut Nechita (Wind River) @ 2026-02-10 20:49 UTC (permalink / raw)
  To: axboe
  Cc: linux-block, linux-kernel, linux-rt-users, ming.lei, muchun.song,
	mkhalfella, sunlightlinux, chris.friesen, stable, ionut_n2001,
	bigeasy, ionut.nechita

From: Ionut Nechita <ionut.nechita@windriver.com>

In RT kernel (PREEMPT_RT), commit 679b1874eba7 ("block: fix ordering
between checking QUEUE_FLAG_QUIESCED request adding") causes severe
performance regression on systems with multiple MSI-X interrupt vectors.

The above change added spinlock_t queue_lock to blk_mq_run_hw_queue()
to synchronize QUEUE_FLAG_QUIESCED checks with blk_mq_unquiesce_queue().
While this works correctly in standard kernel, it causes catastrophic
serialization in RT kernel where spinlock_t converts to sleeping
rt_mutex.

Problem in RT kernel:
- blk_mq_run_hw_queue() is called from IRQ thread context (I/O completion)
- With 8 MSI-X vectors, all 8 IRQ threads contend on the same queue_lock
- queue_lock becomes rt_mutex (sleeping) in RT kernel
- IRQ threads serialize and enter D-state waiting for lock
- Throughput drops from 640 MB/s to 153 MB/s

The original commit message noted that memory barriers were considered
but rejected because "memory barrier is not easy to be maintained" -
barriers would need to be added at multiple call sites throughout the
block layer where work is added before calling blk_mq_run_hw_queue().

Solution:
Instead of using the general-purpose queue_lock or attempting complex
memory barrier pairing across many call sites, introduce a dedicated
raw_spinlock_t quiesce_sync_lock specifically for synchronizing the
quiesce state between:
- blk_mq_quiesce_queue_nowait()
- blk_mq_unquiesce_queue()
- blk_mq_run_hw_queue()

Why raw_spinlock is safe:
- Critical section is provably short (only flag and counter checks)
- No sleeping operations under lock
- raw_spinlock does not convert to rt_mutex in RT kernel
- Provides same ordering guarantees as original queue_lock approach

This approach:
- Maintains correctness of original synchronization
- Avoids sleeping in RT kernel's IRQ thread context
- Limits scope to only quiesce-related synchronization
- Simpler than auditing all call sites for memory barrier pairing

Additionally, change __blk_freeze_queue_start() to use async=true for
better performance on RT kernels by avoiding synchronous queue runs
during freeze.

Test results on RT kernel (megaraid_sas with 8 MSI-X vectors):
- Before: 153 MB/s, 6-8 IRQ threads in D-state
- After:  640 MB/s, 0 IRQ threads blocked

Fixes: 679b1874eba7 ("block: fix ordering between checking QUEUE_FLAG_QUIESCED request adding")
Cc: stable@vger.kernel.org
Signed-off-by: Ionut Nechita <ionut.nechita@windriver.com>
---
 block/blk-core.c       |  1 +
 block/blk-mq.c         | 27 ++++++++++++++++-----------
 include/linux/blkdev.h |  6 ++++++
 3 files changed, 23 insertions(+), 11 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 474700ffaa1c8..fd615aeb5c463 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -434,6 +434,7 @@ struct request_queue *blk_alloc_queue(struct queue_limits *lim, int node_id)
 	mutex_init(&q->limits_lock);
 	mutex_init(&q->rq_qos_mutex);
 	spin_lock_init(&q->queue_lock);
+	raw_spin_lock_init(&q->quiesce_sync_lock);
 
 	init_waitqueue_head(&q->mq_freeze_wq);
 	mutex_init(&q->mq_freeze_lock);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 0ad3dd3329db7..888718a782f88 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -171,7 +171,7 @@ bool __blk_freeze_queue_start(struct request_queue *q,
 		percpu_ref_kill(&q->q_usage_counter);
 		mutex_unlock(&q->mq_freeze_lock);
 		if (queue_is_mq(q))
-			blk_mq_run_hw_queues(q, false);
+			blk_mq_run_hw_queues(q, true);
 	} else {
 		mutex_unlock(&q->mq_freeze_lock);
 	}
@@ -262,10 +262,10 @@ void blk_mq_quiesce_queue_nowait(struct request_queue *q)
 {
 	unsigned long flags;
 
-	spin_lock_irqsave(&q->queue_lock, flags);
+	raw_spin_lock_irqsave(&q->quiesce_sync_lock, flags);
 	if (!q->quiesce_depth++)
 		blk_queue_flag_set(QUEUE_FLAG_QUIESCED, q);
-	spin_unlock_irqrestore(&q->queue_lock, flags);
+	raw_spin_unlock_irqrestore(&q->quiesce_sync_lock, flags);
 }
 EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue_nowait);
 
@@ -317,14 +317,14 @@ void blk_mq_unquiesce_queue(struct request_queue *q)
 	unsigned long flags;
 	bool run_queue = false;
 
-	spin_lock_irqsave(&q->queue_lock, flags);
+	raw_spin_lock_irqsave(&q->quiesce_sync_lock, flags);
 	if (WARN_ON_ONCE(q->quiesce_depth <= 0)) {
 		;
 	} else if (!--q->quiesce_depth) {
 		blk_queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
 		run_queue = true;
 	}
-	spin_unlock_irqrestore(&q->queue_lock, flags);
+	raw_spin_unlock_irqrestore(&q->quiesce_sync_lock, flags);
 
 	/* dispatch requests which are inserted during quiescing */
 	if (run_queue)
@@ -2368,14 +2368,19 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 		unsigned long flags;
 
 		/*
-		 * Synchronize with blk_mq_unquiesce_queue(), because we check
-		 * if hw queue is quiesced locklessly above, we need the use
-		 * ->queue_lock to make sure we see the up-to-date status to
-		 * not miss rerunning the hw queue.
+		 * Synchronize with blk_mq_unquiesce_queue(). We check if hw
+		 * queue is quiesced locklessly above, so we need to use
+		 * quiesce_sync_lock to ensure we see the up-to-date status
+		 * and don't miss rerunning the hw queue.
+		 *
+		 * Uses raw_spinlock to avoid sleeping in RT kernel's IRQ
+		 * thread context during I/O completion. Critical section is
+		 * short (only flag and counter checks), making raw_spinlock
+		 * safe.
 		 */
-		spin_lock_irqsave(&hctx->queue->queue_lock, flags);
+		raw_spin_lock_irqsave(&hctx->queue->quiesce_sync_lock, flags);
 		need_run = blk_mq_hw_queue_need_run(hctx);
-		spin_unlock_irqrestore(&hctx->queue->queue_lock, flags);
+		raw_spin_unlock_irqrestore(&hctx->queue->quiesce_sync_lock, flags);
 
 		if (!need_run)
 			return;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 99ef8cd7673c2..ab9e62aa3ae42 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -514,6 +514,12 @@ struct request_queue {
 	struct request		*last_merge;
 
 	spinlock_t		queue_lock;
+	/*
+	 * Synchronizes quiesce state checks between blk_mq_run_hw_queue()
+	 * and blk_mq_unquiesce_queue(). Uses raw_spinlock to avoid sleeping
+	 * in RT kernel's IRQ thread context during I/O completion.
+	 */
+	raw_spinlock_t		quiesce_sync_lock;
 
 	int			quiesce_depth;
 
-- 
2.52.0



* Re: [PATCH v2 1/1] block/blk-mq: fix RT kernel regression with dedicated quiesce_sync_lock
  2026-02-10 20:49 ` [PATCH v2 1/1] " Ionut Nechita (Wind River)
@ 2026-02-10 21:39   ` Keith Busch
  2026-02-11  7:28     ` Ionut Nechita (Wind River)
  2026-02-11 11:31   ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 5+ messages in thread
From: Keith Busch @ 2026-02-10 21:39 UTC (permalink / raw)
  To: Ionut Nechita (Wind River)
  Cc: axboe, linux-block, linux-kernel, linux-rt-users, ming.lei,
	muchun.song, mkhalfella, sunlightlinux, chris.friesen, stable,
	ionut_n2001, bigeasy

On Tue, Feb 10, 2026 at 10:49:46PM +0200, Ionut Nechita (Wind River) wrote:
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 0ad3dd3329db7..888718a782f88 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -171,7 +171,7 @@ bool __blk_freeze_queue_start(struct request_queue *q,
>  		percpu_ref_kill(&q->q_usage_counter);
>  		mutex_unlock(&q->mq_freeze_lock);
>  		if (queue_is_mq(q))
> -			blk_mq_run_hw_queues(q, false);
> +			blk_mq_run_hw_queues(q, true);

I didn't see the reasoning for this path to run in async mode. Is this
change related to this patch?


* Re: [PATCH v2 1/1] block/blk-mq: fix RT kernel regression with dedicated quiesce_sync_lock
  2026-02-10 21:39   ` Keith Busch
@ 2026-02-11  7:28     ` Ionut Nechita (Wind River)
  0 siblings, 0 replies; 5+ messages in thread
From: Ionut Nechita (Wind River) @ 2026-02-11  7:28 UTC (permalink / raw)
  To: Keith Busch
  Cc: axboe, linux-block, linux-kernel, linux-rt-users, ming.lei,
	muchun.song, mkhalfella, sunlightlinux, chris.friesen, stable,
	ionut_n2001, bigeasy

On Tue, Feb 10, 2026 at 02:39:36PM -0700, Keith Busch wrote:
> I didn't see the reasoning for this path to run in async mode. Is this
> change related to this patch?

Yes, this change is related. In __blk_freeze_queue_start(), the caller
will wait for freeze completion via blk_mq_freeze_queue_wait() regardless,
so synchronous dispatch is not required here.

On RT kernels, running the hw queues synchronously means the calling
thread directly enters the dispatch path and hits the lock contention
this patch addresses. Switching to async delegates the work to kblockd
workers that can run in parallel across CPUs, complementing the
quiesce_sync_lock fix by not introducing additional serialization in
the freeze path.
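
The handoff described above can be sketched in userspace C, with a
plain thread standing in for a kblockd worker; everything here is
illustrative, not the kernel API:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int dispatched;

/* stands in for one hw-queue dispatch pass */
static void dispatch_hw_queues(void)
{
	atomic_fetch_add(&dispatched, 1);
}

static void *kblockd_worker(void *arg)
{
	(void)arg;
	dispatch_hw_queues();
	return NULL;
}

/* async=false: the caller runs the dispatch inline and eats any lock
 * contention itself; async=true: hand the work off to a worker thread,
 * the way blk_mq_run_hw_queues(q, true) defers to kblockd */
static pthread_t run_hw_queues(bool async)
{
	pthread_t t = 0;

	if (async)
		pthread_create(&t, NULL, kblockd_worker, NULL);
	else
		dispatch_hw_queues();
	return t;
}
```

With async=true the freezing thread returns immediately and then blocks
in blk_mq_freeze_queue_wait() anyway, so nothing is lost by deferring.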

Thanks,
Ionut


* Re: [PATCH v2 1/1] block/blk-mq: fix RT kernel regression with dedicated quiesce_sync_lock
  2026-02-10 20:49 ` [PATCH v2 1/1] " Ionut Nechita (Wind River)
  2026-02-10 21:39   ` Keith Busch
@ 2026-02-11 11:31   ` Sebastian Andrzej Siewior
  1 sibling, 0 replies; 5+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-02-11 11:31 UTC (permalink / raw)
  To: Ionut Nechita (Wind River)
  Cc: axboe, linux-block, linux-kernel, linux-rt-users, ming.lei,
	muchun.song, mkhalfella, sunlightlinux, chris.friesen, stable,
	ionut_n2001

On 2026-02-10 22:49:46 [+0200], Ionut Nechita (Wind River) wrote:
> From: Ionut Nechita <ionut.nechita@windriver.com>
> 
> In RT kernel (PREEMPT_RT), commit 679b1874eba7 ("block: fix ordering
> between checking QUEUE_FLAG_QUIESCED request adding") causes severe
> performance regression on systems with multiple MSI-X interrupt vectors.
> 
> The above change added spinlock_t queue_lock to blk_mq_run_hw_queue()
> to synchronize QUEUE_FLAG_QUIESCED checks with blk_mq_unquiesce_queue().
> While this works correctly in standard kernel, it causes catastrophic
> serialization in RT kernel where spinlock_t converts to sleeping
> rt_mutex.

So !RT has the same synchronisation on the lock but spinning on the lock
makes it less dramatic?

> Problem in RT kernel:
> - blk_mq_run_hw_queue() is called from IRQ thread context (I/O completion)
> - With 8 MSI-X vectors, all 8 IRQ threads contend on the same queue_lock
> - queue_lock becomes rt_mutex (sleeping) in RT kernel
> - IRQ threads serialize and enter D-state waiting for lock
> - Throughput drops from 640 MB/s to 153 MB/s
> 
> The original commit message noted that memory barriers were considered
> but rejected because "memory barrier is not easy to be maintained" -
> barriers would need to be added at multiple call sites throughout the
> block layer where work is added before calling blk_mq_run_hw_queue().
> 
> Solution:
> Instead of using the general-purpose queue_lock or attempting complex
> memory barrier pairing across many call sites, introduce a dedicated
> raw_spinlock_t quiesce_sync_lock specifically for synchronizing the
> quiesce state between:
> - blk_mq_quiesce_queue_nowait()
> - blk_mq_unquiesce_queue()
> - blk_mq_run_hw_queue()
> 
> Why raw_spinlock is safe:
> - Critical section is provably short (only flag and counter checks)
> - No sleeping operations under lock
> - raw_spinlock does not convert to rt_mutex in RT kernel
> - Provides same ordering guarantees as original queue_lock approach

Okay.

> This approach:
> - Maintains correctness of original synchronization
> - Avoids sleeping in RT kernel's IRQ thread context
> - Limits scope to only quiesce-related synchronization
> - Simpler than auditing all call sites for memory barrier pairing
> 
> Additionally, change blk_freeze_queue_start to use async=true for better
> performance in RT kernel by avoiding synchronous queue runs during freeze.
> 
> Test results on RT kernel (megaraid_sas with 8 MSI-X vectors):
> - Before: 153 MB/s, 6-8 IRQ threads in D-state
> - After:  640 MB/s, 0 IRQ threads blocked
> 
> Fixes: 679b1874eba7 ("block: fix ordering between checking QUEUE_FLAG_QUIESCED request adding")
> Cc: stable@vger.kernel.org
> Signed-off-by: Ionut Nechita <ionut.nechita@windriver.com>
> ---
>  block/blk-core.c       |  1 +
>  block/blk-mq.c         | 27 ++++++++++++++++-----------
>  include/linux/blkdev.h |  6 ++++++
>  3 files changed, 23 insertions(+), 11 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 474700ffaa1c8..fd615aeb5c463 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -434,6 +434,7 @@ struct request_queue *blk_alloc_queue(struct queue_limits *lim, int node_id)
>  	mutex_init(&q->limits_lock);
>  	mutex_init(&q->rq_qos_mutex);
>  	spin_lock_init(&q->queue_lock);
> +	raw_spin_lock_init(&q->quiesce_sync_lock);
>  
>  	init_waitqueue_head(&q->mq_freeze_wq);
>  	mutex_init(&q->mq_freeze_lock);
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 0ad3dd3329db7..888718a782f88 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -171,7 +171,7 @@ bool __blk_freeze_queue_start(struct request_queue *q,
>  		percpu_ref_kill(&q->q_usage_counter);
>  		mutex_unlock(&q->mq_freeze_lock);
>  		if (queue_is_mq(q))
> -			blk_mq_run_hw_queues(q, false);
> +			blk_mq_run_hw_queues(q, true);

I read what you wrote to Keith here and still don't get it. If the goal
is to have the same lock contention on RT as on !RT (where we spin on
the lock) why not keep everything else as-is? Why is important to spread
it across multiple CPUs? This looks unrelated. It could be added as a
second optimisation.

Another thing: If you have multiple interrupts, why don't you have one
queue per interrupt? Wouldn't this also avoid the spinning here?

>  	} else {
>  		mutex_unlock(&q->mq_freeze_lock);
>  	}
> @@ -262,10 +262,10 @@ void blk_mq_quiesce_queue_nowait(struct request_queue *q)
>  {
>  	unsigned long flags;
>  
> -	spin_lock_irqsave(&q->queue_lock, flags);
> +	raw_spin_lock_irqsave(&q->quiesce_sync_lock, flags);
>  	if (!q->quiesce_depth++)
>  		blk_queue_flag_set(QUEUE_FLAG_QUIESCED, q);
> -	spin_unlock_irqrestore(&q->queue_lock, flags);
> +	raw_spin_unlock_irqrestore(&q->quiesce_sync_lock, flags);

Since you have only "inc and set bit if it was zero" and below "dec and
clear bit if it becomes zero", what about using atomic_t for
quiesce_depth? There is atomic_inc_return(), which mostly does the same
thing with one atomic op. The flag could be avoided if the
blk_queue_quiesced() condition is "quiesce_depth > 0". That could avoid
the lock.
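
A userspace sketch of this suggestion, with C11 atomics standing in for
the kernel's atomic_t (illustrative names only, not a tested kernel
change):

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int quiesce_depth;

static void quiesce_nowait(void)
{
	/* roughly atomic_inc(); no separate flag to set */
	atomic_fetch_add(&quiesce_depth, 1);
}

static bool unquiesce(void)
{
	/* rerun the queues only on the 1 -> 0 transition;
	 * atomic_fetch_sub() returns the value *before* the decrement */
	return atomic_fetch_sub(&quiesce_depth, 1) == 1;
}

static bool queue_quiesced(void)
{
	/* blk_queue_quiesced() becomes "depth > 0" */
	return atomic_load(&quiesce_depth) > 0;
}
```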

>  }
>  EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue_nowait);
>  
> @@ -317,14 +317,14 @@ void blk_mq_unquiesce_queue(struct request_queue *q)
>  	unsigned long flags;
>  	bool run_queue = false;
>  
> -	spin_lock_irqsave(&q->queue_lock, flags);
> +	raw_spin_lock_irqsave(&q->quiesce_sync_lock, flags);
>  	if (WARN_ON_ONCE(q->quiesce_depth <= 0)) {
>  		;
>  	} else if (!--q->quiesce_depth) {
>  		blk_queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
>  		run_queue = true;
>  	}
> -	spin_unlock_irqrestore(&q->queue_lock, flags);
> +	raw_spin_unlock_irqrestore(&q->quiesce_sync_lock, flags);
>  
>  	/* dispatch requests which are inserted during quiescing */
>  	if (run_queue)
> @@ -2368,14 +2368,19 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
>  		unsigned long flags;
>  
>  		/*
> -		 * Synchronize with blk_mq_unquiesce_queue(), because we check
> -		 * if hw queue is quiesced locklessly above, we need the use
> -		 * ->queue_lock to make sure we see the up-to-date status to
> -		 * not miss rerunning the hw queue.
> +		 * Synchronize with blk_mq_unquiesce_queue(). We check if hw
> +		 * queue is quiesced locklessly above, so we need to use
> +		 * quiesce_sync_lock to ensure we see the up-to-date status
> +		 * and don't miss rerunning the hw queue.
> +		 *
> +		 * Uses raw_spinlock to avoid sleeping in RT kernel's IRQ
> +		 * thread context during I/O completion. Critical section is
> +		 * short (only flag and counter checks), making raw_spinlock
> +		 * safe.
>  		 */
> -		spin_lock_irqsave(&hctx->queue->queue_lock, flags);
> +		raw_spin_lock_irqsave(&hctx->queue->quiesce_sync_lock, flags);
>  		need_run = blk_mq_hw_queue_need_run(hctx);
> -		spin_unlock_irqrestore(&hctx->queue->queue_lock, flags);
> +		raw_spin_unlock_irqrestore(&hctx->queue->quiesce_sync_lock, flags);

But here I am unsure. If the above operation (setting/clearing the bit)
is lockless, it might require a handshake if the counter goes back to 0
before that is visible here. Maybe not, since it could be observed
before the lock was acquired. That is why I am unsure.

>  		if (!need_run)
>  			return;

Sebastian


end of thread (newest message: 2026-02-11 11:31 UTC)

Thread overview: 5+ messages
2026-02-10 20:49 [PATCH v2 0/1] block/blk-mq: fix RT kernel regression with dedicated quiesce_sync_lock Ionut Nechita (Wind River)
2026-02-10 20:49 ` [PATCH v2 1/1] " Ionut Nechita (Wind River)
2026-02-10 21:39   ` Keith Busch
2026-02-11  7:28     ` Ionut Nechita (Wind River)
2026-02-11 11:31   ` Sebastian Andrzej Siewior
