* [PATCH v6 0/1] block/blk-mq: use atomic_t for quiesce_depth to avoid lock contention on RT
@ 2026-05-06 6:56 Ionut Nechita (Wind River)
2026-05-06 6:56 ` [PATCH v6 1/1] " Ionut Nechita (Wind River)
0 siblings, 1 reply; 7+ messages in thread
From: Ionut Nechita (Wind River) @ 2026-05-06 6:56 UTC (permalink / raw)
To: axboe, linux-block
Cc: bigeasy, bvanassche, clrkwllms, rostedt, ming.lei, muchun.song,
mkhalfella, chris.friesen, linux-kernel, linux-rt-devel,
linux-rt-users, stable, ionut_n2001, sunlightlinux
Hi Jens,
This is v6 of the fix for the RT kernel performance regression caused by
commit 6bda857bcbb86 ("block: fix ordering between checking
QUEUE_FLAG_QUIESCED request adding").
Changes since v5 (Mar 3):
- Rewrote the memory-ordering comments per Bart Van Assche's review.
The previous wording incorrectly described smp_mb__after_atomic() as
ordering against "subsequent loads in blk_mq_run_hw_queue()". The
comments now describe the actual reader/writer pairing: writer-side
smp_mb__after_atomic() in blk_mq_quiesce_queue_nowait() and
blk_mq_unquiesce_queue() pairs with reader-side smp_rmb() in
blk_mq_run_hw_queue() so the re-check observes the latest
quiesce_depth value.
- Rebased on top of linux-next (next-20260505).
- No functional / code-generation changes.
Changes since v4 (Feb 13):
- Rebased on top of linux-next (20260302)
- No code changes
Changes since v3 (Feb 11):
- Rebased on top of axboe/for-7.0/block
- Fixed Fixes tag commit hash to match upstream (6bda857bcbb86)
- Added Reviewed-by from Sebastian Andrzej Siewior
- No code changes
Changes since v2 (Feb 10):
- Replaced raw_spinlock_t quiesce_sync_lock with atomic_t for
quiesce_depth, as suggested by Sebastian Andrzej Siewior
- Eliminated QUEUE_FLAG_QUIESCED entirely; blk_queue_quiesced() now
checks atomic_read(&q->quiesce_depth) > 0
- Use atomic_dec_if_positive() in blk_mq_unquiesce_queue() to avoid
race between WARN check and decrement
- Removed the unrelated blk_mq_run_hw_queues() async=true change
- Removed blk-mq-debugfs.c QUIESCED flag entry
- Uses smp_mb__after_atomic() / smp_rmb() for memory ordering instead
of any spinlock in the hot path
Changes since v1 (RESEND, Jan 9):
- Rebased on top of axboe/for-7.0/block
- No code changes
The problem: on PREEMPT_RT kernels, the queue_lock acquisition added
to blk_mq_run_hw_queue() takes a spinlock_t, which RT maps to a
sleeping rt_mutex, so all IRQ threads (one per MSI-X vector)
serialize on it. On megaraid_sas with 128 MSI-X vectors and 120 hw
queues, throughput drops from 640 MB/s to 153 MB/s.
The fix converts quiesce_depth to atomic_t, which serves as both the
depth tracker and the quiesce indicator (depth > 0 means quiesced).
This eliminates QUEUE_FLAG_QUIESCED and removes the need for any lock
in the hot path. Memory ordering is ensured by smp_mb__after_atomic()
on the writer side (after modifying quiesce_depth) paired with
smp_rmb() on the reader side (before re-checking quiesce state in
blk_mq_run_hw_queue()).
Link: https://lore.kernel.org/linux-block/20260303073744.20585-1-ionut.nechita@windriver.com/
Ionut Nechita (1):
block/blk-mq: use atomic_t for quiesce_depth to avoid lock contention
on RT
block/blk-core.c | 1 +
block/blk-mq-debugfs.c | 1 -
block/blk-mq.c | 53 +++++++++++++++++++++---------------------
include/linux/blkdev.h | 9 ++++---
4 files changed, 34 insertions(+), 30 deletions(-)
--
2.54.0
^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH v6 1/1] block/blk-mq: use atomic_t for quiesce_depth to avoid lock contention on RT
2026-05-06 6:56 [PATCH v6 0/1] block/blk-mq: use atomic_t for quiesce_depth to avoid lock contention on RT Ionut Nechita (Wind River)
@ 2026-05-06 6:56 ` Ionut Nechita (Wind River)
2026-05-06 7:14 ` Bart Van Assche
0 siblings, 1 reply; 7+ messages in thread
From: Ionut Nechita (Wind River) @ 2026-05-06 6:56 UTC (permalink / raw)
To: axboe, linux-block
Cc: bigeasy, bvanassche, clrkwllms, rostedt, ming.lei, muchun.song,
mkhalfella, chris.friesen, linux-kernel, linux-rt-devel,
linux-rt-users, stable, ionut_n2001, sunlightlinux
On PREEMPT_RT kernels, commit 6bda857bcbb86 ("block: fix ordering
between checking QUEUE_FLAG_QUIESCED request adding") causes a severe
performance regression on systems with multiple MSI-X interrupt
vectors.
The above change introduced a spinlock_t queue_lock acquisition in
blk_mq_run_hw_queue() to synchronize QUEUE_FLAG_QUIESCED checks
with blk_mq_unquiesce_queue(). While this works correctly on a
standard kernel, it causes catastrophic serialization on RT kernels,
where spinlock_t is a sleeping rt_mutex.
Problem in RT kernel:
- blk_mq_run_hw_queue() is called from IRQ thread context
- With multiple MSI-X vectors, all IRQ threads contend on
the same queue_lock
- queue_lock becomes rt_mutex (sleeping) in RT kernel
- IRQ threads serialize and enter D-state waiting for lock
- Throughput drops from 640 MB/s to 153 MB/s
Solution:
Convert quiesce_depth to atomic_t and use it directly for quiesce
state checking, eliminating QUEUE_FLAG_QUIESCED entirely. This
removes the need for any locking in the hot path.
The atomic counter serves as both the depth tracker and the quiesce
indicator (depth > 0 means quiesced). This eliminates the race
window that existed between updating the depth and the flag.
Memory ordering is ensured by:
- smp_mb__after_atomic() after modifying quiesce_depth in
blk_mq_quiesce_queue_nowait() and blk_mq_unquiesce_queue()
- smp_rmb() in blk_mq_run_hw_queue() before re-checking the
quiesce state, paired with the writer-side barriers above
Performance impact:
- RT kernel: eliminates lock contention, restores full throughput
- Non-RT kernel: atomic ops cost about the same as the previous
spinlock acquire/release, so no regression is expected
Test results on RT kernel:
Hardware: Broadcom/LSI MegaRAID 12GSAS/PCIe Secure SAS39xx
(megaraid_sas driver, 128 MSI-X vectors, 120 hw queues)
- Before: 153 MB/s, IRQ threads in D-state
- After: 640 MB/s, no IRQ threads blocked
Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Fixes: 6bda857bcbb86 ("block: fix ordering between checking QUEUE_FLAG_QUIESCED request adding")
Cc: stable@vger.kernel.org
Signed-off-by: Ionut Nechita <ionut.nechita@windriver.com>
---
block/blk-core.c | 1 +
block/blk-mq-debugfs.c | 1 -
block/blk-mq.c | 53 +++++++++++++++++++++---------------------
include/linux/blkdev.h | 9 ++++---
4 files changed, 34 insertions(+), 30 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 17450058ea6d..1cafcca0975a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -434,6 +434,7 @@ struct request_queue *blk_alloc_queue(struct queue_limits *lim, int node_id)
mutex_init(&q->limits_lock);
mutex_init(&q->rq_qos_mutex);
spin_lock_init(&q->queue_lock);
+ atomic_set(&q->quiesce_depth, 0);
init_waitqueue_head(&q->mq_freeze_wq);
mutex_init(&q->mq_freeze_lock);
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 047ec887456b..1b0aec3036e6 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -89,7 +89,6 @@ static const char *const blk_queue_flag_name[] = {
QUEUE_FLAG_NAME(INIT_DONE),
QUEUE_FLAG_NAME(STATS),
QUEUE_FLAG_NAME(REGISTERED),
- QUEUE_FLAG_NAME(QUIESCED),
QUEUE_FLAG_NAME(RQ_ALLOC_TIME),
QUEUE_FLAG_NAME(HCTX_ACTIVE),
QUEUE_FLAG_NAME(SQ_SCHED),
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4c5c16cce4f8..c1281e4d4d83 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -260,12 +260,13 @@ EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue_non_owner);
*/
void blk_mq_quiesce_queue_nowait(struct request_queue *q)
{
- unsigned long flags;
-
- spin_lock_irqsave(&q->queue_lock, flags);
- if (!q->quiesce_depth++)
- blk_queue_flag_set(QUEUE_FLAG_QUIESCED, q);
- spin_unlock_irqrestore(&q->queue_lock, flags);
+ atomic_inc(&q->quiesce_depth);
+ /*
+ * Pairs with smp_rmb() in blk_mq_run_hw_queue(): make the
+ * incremented quiesce_depth observable to readers re-checking
+ * the quiesce state, so they don't dispatch on a quiesced queue.
+ */
+ smp_mb__after_atomic();
}
EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue_nowait);
@@ -314,21 +315,23 @@ EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue);
*/
void blk_mq_unquiesce_queue(struct request_queue *q)
{
- unsigned long flags;
- bool run_queue = false;
+ int depth;
- spin_lock_irqsave(&q->queue_lock, flags);
- if (WARN_ON_ONCE(q->quiesce_depth <= 0)) {
- ;
- } else if (!--q->quiesce_depth) {
- blk_queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
- run_queue = true;
- }
- spin_unlock_irqrestore(&q->queue_lock, flags);
+ depth = atomic_dec_if_positive(&q->quiesce_depth);
+ if (WARN_ON_ONCE(depth < 0))
+ return;
- /* dispatch requests which are inserted during quiescing */
- if (run_queue)
+ if (depth == 0) {
+ /*
+ * Pairs with smp_rmb() in blk_mq_run_hw_queue(): make the
+ * decrement of quiesce_depth observable before we kick the
+ * hw queues, so a concurrent blk_mq_run_hw_queue() that
+ * re-checks the state sees the queue as no longer quiesced.
+ */
+ smp_mb__after_atomic();
+ /* dispatch requests which are inserted during quiescing */
blk_mq_run_hw_queues(q, true);
+ }
}
EXPORT_SYMBOL_GPL(blk_mq_unquiesce_queue);
@@ -2362,17 +2365,15 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
need_run = blk_mq_hw_queue_need_run(hctx);
if (!need_run) {
- unsigned long flags;
-
/*
- * Synchronize with blk_mq_unquiesce_queue(), because we check
- * if hw queue is quiesced locklessly above, we need the use
- * ->queue_lock to make sure we see the up-to-date status to
- * not miss rerunning the hw queue.
+ * Re-check the quiesce state after a read barrier. Pairs with
+ * smp_mb__after_atomic() in blk_mq_quiesce_queue_nowait() and
+ * blk_mq_unquiesce_queue() so we don't miss rerunning the hw
+ * queue when a concurrent unquiesce has just dropped the
+ * quiesce_depth to zero.
*/
- spin_lock_irqsave(&hctx->queue->queue_lock, flags);
+ smp_rmb();
need_run = blk_mq_hw_queue_need_run(hctx);
- spin_unlock_irqrestore(&hctx->queue->queue_lock, flags);
if (!need_run)
return;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 890128cdea1c..5d582c70fb8a 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -521,7 +521,8 @@ struct request_queue {
spinlock_t queue_lock;
- int quiesce_depth;
+ /* Atomic quiesce depth - also serves as quiesced indicator (depth > 0) */
+ atomic_t quiesce_depth;
struct gendisk *disk;
@@ -666,7 +667,6 @@ enum {
QUEUE_FLAG_INIT_DONE, /* queue is initialized */
QUEUE_FLAG_STATS, /* track IO start and completion times */
QUEUE_FLAG_REGISTERED, /* queue has been registered to a disk */
- QUEUE_FLAG_QUIESCED, /* queue has been quiesced */
QUEUE_FLAG_RQ_ALLOC_TIME, /* record rq->alloc_time_ns */
QUEUE_FLAG_HCTX_ACTIVE, /* at least one blk-mq hctx is active */
QUEUE_FLAG_SQ_SCHED, /* single queue style io dispatch */
@@ -704,7 +704,10 @@ void blk_queue_flag_clear(unsigned int flag, struct request_queue *q);
#define blk_noretry_request(rq) \
((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \
REQ_FAILFAST_DRIVER))
-#define blk_queue_quiesced(q) test_bit(QUEUE_FLAG_QUIESCED, &(q)->queue_flags)
+static inline bool blk_queue_quiesced(struct request_queue *q)
+{
+ return atomic_read(&q->quiesce_depth) > 0;
+}
#define blk_queue_pm_only(q) atomic_read(&(q)->pm_only)
#define blk_queue_registered(q) test_bit(QUEUE_FLAG_REGISTERED, &(q)->queue_flags)
#define blk_queue_sq_sched(q) test_bit(QUEUE_FLAG_SQ_SCHED, &(q)->queue_flags)
--
2.54.0
* Re: [PATCH v6 1/1] block/blk-mq: use atomic_t for quiesce_depth to avoid lock contention on RT
2026-05-06 6:56 ` [PATCH v6 1/1] " Ionut Nechita (Wind River)
@ 2026-05-06 7:14 ` Bart Van Assche
2026-05-06 7:47 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 7+ messages in thread
From: Bart Van Assche @ 2026-05-06 7:14 UTC (permalink / raw)
To: Ionut Nechita (Wind River), axboe, linux-block
Cc: bigeasy, clrkwllms, rostedt, ming.lei, muchun.song, mkhalfella,
chris.friesen, linux-kernel, linux-rt-devel, linux-rt-users,
stable, ionut_n2001, sunlightlinux
On 5/6/26 8:56 AM, Ionut Nechita (Wind River) wrote:
> void blk_mq_quiesce_queue_nowait(struct request_queue *q)
> {
> - unsigned long flags;
> -
> - spin_lock_irqsave(&q->queue_lock, flags);
> - if (!q->quiesce_depth++)
> - blk_queue_flag_set(QUEUE_FLAG_QUIESCED, q);
> - spin_unlock_irqrestore(&q->queue_lock, flags);
> + atomic_inc(&q->quiesce_depth);
> + /*
> + * Pairs with smp_rmb() in blk_mq_run_hw_queue(): make the
> + * incremented quiesce_depth observable to readers re-checking
> + * the quiesce state, so they don't dispatch on a quiesced queue.
> + */
> + smp_mb__after_atomic();
> }
No, this is not sufficient to guarantee that blk_mq_run_hw_queue() sees
the latest value of q->quiesce_depth. If you want to achieve that I
think the only option is to protect the atomic_inc() above with
hctx->queue->queue_lock.
> @@ -2362,17 +2365,15 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
>
> need_run = blk_mq_hw_queue_need_run(hctx);
> if (!need_run) {
> - unsigned long flags;
> -
> /*
> - * Synchronize with blk_mq_unquiesce_queue(), because we check
> - * if hw queue is quiesced locklessly above, we need the use
> - * ->queue_lock to make sure we see the up-to-date status to
> - * not miss rerunning the hw queue.
> + * Re-check the quiesce state after a read barrier. Pairs with
> + * smp_mb__after_atomic() in blk_mq_quiesce_queue_nowait() and
> + * blk_mq_unquiesce_queue() so we don't miss rerunning the hw
> + * queue when a concurrent unquiesce has just dropped the
> + * quiesce_depth to zero.
> */
> - spin_lock_irqsave(&hctx->queue->queue_lock, flags);
> + smp_rmb();
> need_run = blk_mq_hw_queue_need_run(hctx);
> - spin_unlock_irqrestore(&hctx->queue->queue_lock, flags);
If the atomic_inc() in blk_mq_quiesce_queue_nowait() is protected by
hctx->queue->queue_lock then the above code doesn't have to be modified.
Thanks,
Bart.
* Re: [PATCH v6 1/1] block/blk-mq: use atomic_t for quiesce_depth to avoid lock contention on RT
2026-05-06 7:14 ` Bart Van Assche
@ 2026-05-06 7:47 ` Sebastian Andrzej Siewior
2026-05-06 9:43 ` Bart Van Assche
0 siblings, 1 reply; 7+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-05-06 7:47 UTC (permalink / raw)
To: Bart Van Assche
Cc: Ionut Nechita (Wind River), axboe, linux-block, clrkwllms,
rostedt, ming.lei, muchun.song, mkhalfella, chris.friesen,
linux-kernel, linux-rt-devel, linux-rt-users, stable, ionut_n2001,
sunlightlinux
On 2026-05-06 09:14:33 [+0200], Bart Van Assche wrote:
> On 5/6/26 8:56 AM, Ionut Nechita (Wind River) wrote:
> > void blk_mq_quiesce_queue_nowait(struct request_queue *q)
> > {
> > - unsigned long flags;
> > -
> > - spin_lock_irqsave(&q->queue_lock, flags);
> > - if (!q->quiesce_depth++)
> > - blk_queue_flag_set(QUEUE_FLAG_QUIESCED, q);
> > - spin_unlock_irqrestore(&q->queue_lock, flags);
> > + atomic_inc(&q->quiesce_depth);
> > + /*
> > + * Pairs with smp_rmb() in blk_mq_run_hw_queue(): make the
> > + * incremented quiesce_depth observable to readers re-checking
> > + * the quiesce state, so they don't dispatch on a quiesced queue.
> > + */
> > + smp_mb__after_atomic();
> > }
>
> No, this is not sufficient to guarantee that blk_mq_run_hw_queue() sees
> the latest value of q->quiesce_depth. If you want to achieve that I
> think the only option is to protect the atomic_inc() above with
> hctx->queue->queue_lock.
>
> > @@ -2362,17 +2365,15 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
> > need_run = blk_mq_hw_queue_need_run(hctx);
> > if (!need_run) {
> > - unsigned long flags;
> > -
> > /*
> > - * Synchronize with blk_mq_unquiesce_queue(), because we check
> > - * if hw queue is quiesced locklessly above, we need the use
> > - * ->queue_lock to make sure we see the up-to-date status to
> > - * not miss rerunning the hw queue.
> > + * Re-check the quiesce state after a read barrier. Pairs with
> > + * smp_mb__after_atomic() in blk_mq_quiesce_queue_nowait() and
> > + * blk_mq_unquiesce_queue() so we don't miss rerunning the hw
> > + * queue when a concurrent unquiesce has just dropped the
> > + * quiesce_depth to zero.
> > */
> > - spin_lock_irqsave(&hctx->queue->queue_lock, flags);
> > + smp_rmb();
> > need_run = blk_mq_hw_queue_need_run(hctx);
> > - spin_unlock_irqrestore(&hctx->queue->queue_lock, flags);
>
> If the atomic_inc() in blk_mq_quiesce_queue_nowait() is protected by
> hctx->queue->queue_lock then the above code doesn't have to be modified.
But wouldn't the atomic_inc + barrier avoid the need to have the lock?
Isn't this a normal pattern? If the lock is kept, we could use
non-atomic ops here then. But this avoids having the lock.
> Thanks,
>
> Bart.
Sebastian
* Re: [PATCH v6 1/1] block/blk-mq: use atomic_t for quiesce_depth to avoid lock contention on RT
2026-05-06 7:47 ` Sebastian Andrzej Siewior
@ 2026-05-06 9:43 ` Bart Van Assche
2026-05-07 7:45 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 7+ messages in thread
From: Bart Van Assche @ 2026-05-06 9:43 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: Ionut Nechita (Wind River), axboe, linux-block, clrkwllms,
rostedt, ming.lei, muchun.song, mkhalfella, chris.friesen,
linux-kernel, linux-rt-devel, linux-rt-users, stable, ionut_n2001,
sunlightlinux
On 5/6/26 9:47 AM, Sebastian Andrzej Siewior wrote:
> On 2026-05-06 09:14:33 [+0200], Bart Van Assche wrote:
>> If the atomic_inc() in blk_mq_quiesce_queue_nowait() is protected by
>> hctx->queue->queue_lock then the above code doesn't have to be modified.
>
> But wouldn't the atomic_inc + barrier avoid the need to have the lock?
> Isn't this a normal pattern? If the lock is kept, we could use
> non-atomic ops here then. But this avoids having the lock.
I strongly prefer a spinlock + non-atomic variables rather than using an
atomic variable and barriers because algorithms that use a spinlock are
easier to verify.
Thanks,
Bart.
* Re: [PATCH v6 1/1] block/blk-mq: use atomic_t for quiesce_depth to avoid lock contention on RT
2026-05-06 9:43 ` Bart Van Assche
@ 2026-05-07 7:45 ` Sebastian Andrzej Siewior
2026-05-07 10:41 ` Bart Van Assche
0 siblings, 1 reply; 7+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-05-07 7:45 UTC (permalink / raw)
To: Bart Van Assche
Cc: Ionut Nechita (Wind River), axboe, linux-block, clrkwllms,
rostedt, ming.lei, muchun.song, mkhalfella, chris.friesen,
linux-kernel, linux-rt-devel, linux-rt-users, stable, ionut_n2001,
sunlightlinux
On 2026-05-06 11:43:32 [+0200], Bart Van Assche wrote:
> On 5/6/26 9:47 AM, Sebastian Andrzej Siewior wrote:
> > On 2026-05-06 09:14:33 [+0200], Bart Van Assche wrote:
> > > If the atomic_inc() in blk_mq_quiesce_queue_nowait() is protected by
> > > hctx->queue->queue_lock then the above code doesn't have to be modified.
> >
> > But wouldn't the atomic_inc + barrier avoid the need to have the lock?
> > Isn't this a normal pattern? If the lock is kept, we could use
> > non-atomic ops here then. But this avoids having the lock.
>
> I strongly prefer a spinlock + non-atomic variables rather than using an
> atomic variable and barriers because algorithms that use a spinlock are
> easier to verify.
Hmmm. If we keep the lock, then there is no need for the atomic and we
keep int counter. Then we are where we are right now with the lock
synchronizing everything.
Isn't this also improving the performance for the !RT case or is it
simply not that visible here?
> Thanks,
>
> Bart.
Sebastian
* Re: [PATCH v6 1/1] block/blk-mq: use atomic_t for quiesce_depth to avoid lock contention on RT
2026-05-07 7:45 ` Sebastian Andrzej Siewior
@ 2026-05-07 10:41 ` Bart Van Assche
0 siblings, 0 replies; 7+ messages in thread
From: Bart Van Assche @ 2026-05-07 10:41 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: Ionut Nechita (Wind River), axboe, linux-block, clrkwllms,
rostedt, ming.lei, muchun.song, mkhalfella, chris.friesen,
linux-kernel, linux-rt-devel, linux-rt-users, stable, ionut_n2001,
sunlightlinux
On 5/7/26 9:45 AM, Sebastian Andrzej Siewior wrote:
> On 2026-05-06 11:43:32 [+0200], Bart Van Assche wrote:
>> On 5/6/26 9:47 AM, Sebastian Andrzej Siewior wrote:
>>> On 2026-05-06 09:14:33 [+0200], Bart Van Assche wrote:
>>>> If the atomic_inc() in blk_mq_quiesce_queue_nowait() is protected by
>>>> hctx->queue->queue_lock then the above code doesn't have to be modified.
>>>
>>> But wouldn't the atomic_inc + barrier avoid the need to have the lock?
>>> Isn't this a normal pattern? If the lock is kept, we could use
>>> non-atomic ops here then. But this avoids having the lock.
>>
>> I strongly prefer a spinlock + non-atomic variables rather than using an
>> atomic variable and barriers because algorithms that use a spinlock are
>> easier to verify.
>
> Hmmm. If we keep the lock, then there is no need for the atomic and we
> keep int counter. Then we are where we are right now with the lock
> synchronizing everything.
> Isn't this also improving the performance for the !RT case or is it
> simply not that visible here?
Agreed that not obtaining the queue_lock from blk_mq_run_hw_queue() is
an interesting improvement. But I'm not sure the new
smp_mb__after_atomic() and smp_rmb() calls are needed. Block layer calls
of blk_mq_quiesce_queue_nowait() are followed by a
blk_mq_wait_quiesce_done() call. The latter calls either
synchronize_srcu() or synchronize_rcu(). Either is sufficient to
guarantee global visibility of the change of the queue state to "quiesced".
This patch removes a spin_lock() call from
blk_mq_quiesce_queue_nowait(). That spin_lock() call guarantees that
other CPUs will observe the "quiesced" state after the store operations
that precede the blk_mq_quiesce_queue_nowait() call. I don't think that
any block layer code depends on this but I noticed that this change has
not been mentioned in the patch description. A similar comment applies
to the blk_mq_unquiesce_queue() changes: the ordering guarantees
provided by the removed spin_lock() call have not been preserved. There
is probably code in the block layer that depends on the "unquiesced"
state only being observed after prior stores performed by the same CPU
core.
Thanks,
Bart.