* [PATCH 0/2] block/blk-mq: fix RT kernel performance regressions
@ 2025-12-20 11:02 Ionut Nechita (WindRiver)
  2025-12-20 11:02 ` [PATCH 1/2] block/blk-mq: fix RT kernel regression with queue_lock in hot path Ionut Nechita (WindRiver)
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Ionut Nechita (WindRiver) @ 2025-12-20 11:02 UTC (permalink / raw)
  To: axboe, ming.lei
  Cc: gregkh, muchun.song, sashal, linux-block, linux-kernel, stable,
	Ionut Nechita

From: Ionut Nechita <ionut.nechita@windriver.com>

This series addresses two critical performance regressions in the block
layer multiqueue (blk-mq) subsystem when running on PREEMPT_RT kernels.

On RT kernels, regular spinlocks are converted to sleeping rt_mutex locks,
which can cause severe performance degradation in the I/O hot path. This
series converts two problematic locking patterns to prevent IRQ threads
from sleeping during I/O operations.

Testing on MegaRAID 12GSAS controller with 8 MSI-X vectors shows:
- v6.6.52-rt (before regression): 640 MB/s sequential read
- v6.6.64-rt (regression introduced): 153 MB/s (-76% regression)
- v6.6.68-rt with queue_lock fix only: 640 MB/s (performance restored)
- v6.6.69-rt with both fixes: expected similar or better performance

The first patch replaces queue_lock with memory barriers in the I/O
completion hot path, eliminating the contention that caused IRQ threads
to sleep. The second patch converts the global blk_mq_cpuhp_lock from
mutex to raw_spinlock to prevent sleeping during CPU hotplug operations.

Both conversions are safe because the protected code paths only perform
fast, non-blocking operations (memory barriers, list/hlist manipulation,
flag checks).

Ionut Nechita (2):
  block/blk-mq: fix RT kernel regression with queue_lock in hot path
  block/blk-mq: convert blk_mq_cpuhp_lock to raw_spinlock for RT

 block/blk-mq.c | 33 +++++++++++++++------------------
 1 file changed, 15 insertions(+), 18 deletions(-)

--
2.52.0


* [PATCH 1/2] block/blk-mq: fix RT kernel regression with queue_lock in hot path
  2025-12-20 11:02 [PATCH 0/2] block/blk-mq: fix RT kernel performance regressions Ionut Nechita (WindRiver)
@ 2025-12-20 11:02 ` Ionut Nechita (WindRiver)
  2025-12-20 11:02 ` [PATCH 2/2] block/blk-mq: convert blk_mq_cpuhp_lock to raw_spinlock for RT Ionut Nechita (WindRiver)
  2025-12-20 16:00 ` [syzbot ci] " syzbot ci
  2 siblings, 0 replies; 7+ messages in thread
From: Ionut Nechita (WindRiver) @ 2025-12-20 11:02 UTC (permalink / raw)
  To: axboe, ming.lei
  Cc: gregkh, muchun.song, sashal, linux-block, linux-kernel, stable,
	Ionut Nechita

From: Ionut Nechita <ionut.nechita@windriver.com>

Commit 679b1874eba7 ("block: fix ordering between checking
QUEUE_FLAG_QUIESCED request adding") introduced queue_lock acquisition
in blk_mq_run_hw_queue() to synchronize QUEUE_FLAG_QUIESCED checks.

On RT kernels (CONFIG_PREEMPT_RT), regular spinlocks are converted to
rt_mutex (sleeping locks). When multiple MSI-X IRQ threads process I/O
completions concurrently, they contend on queue_lock in the hot path,
causing all IRQ threads to enter D (uninterruptible sleep) state. This
serializes interrupt processing completely.

Test case (MegaRAID 12GSAS with 8 MSI-X vectors on RT kernel):
- Good (v6.6.52-rt):  640 MB/s sequential read
- Bad  (v6.6.64-rt):  153 MB/s sequential read (-76% regression)
- 6-8 out of 8 MSI-X IRQ threads stuck in D-state waiting on queue_lock

The original commit message mentioned memory barriers as an alternative
approach. Use full memory barriers (smp_mb()) instead of queue_lock to
provide the same ordering guarantees without sleeping on RT kernels.

Memory barriers ensure proper synchronization:
- CPU0 either sees QUEUE_FLAG_QUIESCED cleared, OR
- CPU1 sees dispatch list/sw queue bitmap updates

This maintains correctness while avoiding lock contention that causes
RT kernel IRQ threads to sleep in the I/O completion path.

Fixes: 679b1874eba7 ("block: fix ordering between checking QUEUE_FLAG_QUIESCED request adding")
Cc: stable@vger.kernel.org
Signed-off-by: Ionut Nechita <ionut.nechita@windriver.com>
---
 block/blk-mq.c | 19 ++++++++-----------
 1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 5da948b07058..5fb8da4958d0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2292,22 +2292,19 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 
 	might_sleep_if(!async && hctx->flags & BLK_MQ_F_BLOCKING);
 
+	/*
+	 * First lockless check to avoid unnecessary overhead.
+	 * Memory barrier below synchronizes with blk_mq_unquiesce_queue().
+	 */
 	need_run = blk_mq_hw_queue_need_run(hctx);
 	if (!need_run) {
-		unsigned long flags;
-
-		/*
-		 * Synchronize with blk_mq_unquiesce_queue(), because we check
-		 * if hw queue is quiesced locklessly above, we need the use
-		 * ->queue_lock to make sure we see the up-to-date status to
-		 * not miss rerunning the hw queue.
-		 */
-		spin_lock_irqsave(&hctx->queue->queue_lock, flags);
+		/* Synchronize with blk_mq_unquiesce_queue() */
+		smp_mb();
 		need_run = blk_mq_hw_queue_need_run(hctx);
-		spin_unlock_irqrestore(&hctx->queue->queue_lock, flags);
-
 		if (!need_run)
 			return;
+		/* Ensure dispatch list/sw queue updates visible before execution */
+		smp_mb();
 	}
 
 	if (async || !cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask)) {
-- 
2.52.0



* [PATCH 2/2] block/blk-mq: convert blk_mq_cpuhp_lock to raw_spinlock for RT
  2025-12-20 11:02 [PATCH 0/2] block/blk-mq: fix RT kernel performance regressions Ionut Nechita (WindRiver)
  2025-12-20 11:02 ` [PATCH 1/2] block/blk-mq: fix RT kernel regression with queue_lock in hot path Ionut Nechita (WindRiver)
@ 2025-12-20 11:02 ` Ionut Nechita (WindRiver)
  2025-12-20 12:47   ` Ming Lei
  2025-12-20 16:00 ` [syzbot ci] " syzbot ci
  2 siblings, 1 reply; 7+ messages in thread
From: Ionut Nechita (WindRiver) @ 2025-12-20 11:02 UTC (permalink / raw)
  To: axboe, ming.lei
  Cc: gregkh, muchun.song, sashal, linux-block, linux-kernel, stable,
	Ionut Nechita

From: Ionut Nechita <ionut.nechita@windriver.com>

Commit 58bf93580fec ("blk-mq: move cpuhp callback registering out of
q->sysfs_lock") introduced a global mutex blk_mq_cpuhp_lock to avoid
lockdep warnings between sysfs_lock and CPU hotplug lock.

On RT kernels (CONFIG_PREEMPT_RT), regular mutexes are converted to
rt_mutex (sleeping locks). When block layer operations need to acquire
blk_mq_cpuhp_lock, IRQ threads processing I/O completions may sleep,
causing additional contention on top of the queue_lock issue from
commit 679b1874eba7 ("block: fix ordering between checking
QUEUE_FLAG_QUIESCED request adding").

Test case (MegaRAID 12GSAS with 8 MSI-X vectors on RT kernel):
- v6.6.68-rt with queue_lock fix: 640 MB/s (queue_lock fixed)
- v6.6.69-rt: still exhibits contention due to cpuhp_lock mutex

The functions protected by blk_mq_cpuhp_lock only perform fast,
non-sleeping operations:
- hlist_unhashed() checks
- cpuhp_state_add_instance_nocalls() - just hlist manipulation
- cpuhp_state_remove_instance_nocalls() - just hlist manipulation
- INIT_HLIST_NODE() initialization

The _nocalls variants do not invoke state callbacks and only manipulate
data structures, making them safe to call under raw_spinlock.

Convert blk_mq_cpuhp_lock from mutex to raw_spinlock to prevent it from
becoming a sleeping lock in RT kernel. This eliminates the contention
bottleneck while maintaining the lockdep fix's original intent.

Fixes: 58bf93580fec ("blk-mq: move cpuhp callback registering out of q->sysfs_lock")
Cc: stable@vger.kernel.org
Signed-off-by: Ionut Nechita <ionut.nechita@windriver.com>
---
 block/blk-mq.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 5fb8da4958d0..3982e24b1081 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -43,7 +43,7 @@
 
 static DEFINE_PER_CPU(struct llist_head, blk_cpu_done);
 static DEFINE_PER_CPU(call_single_data_t, blk_cpu_csd);
-static DEFINE_MUTEX(blk_mq_cpuhp_lock);
+static DEFINE_RAW_SPINLOCK(blk_mq_cpuhp_lock);
 
 static void blk_mq_insert_request(struct request *rq, blk_insert_t flags);
 static void blk_mq_request_bypass_insert(struct request *rq,
@@ -3641,9 +3641,9 @@ static void __blk_mq_remove_cpuhp(struct blk_mq_hw_ctx *hctx)
 
 static void blk_mq_remove_cpuhp(struct blk_mq_hw_ctx *hctx)
 {
-	mutex_lock(&blk_mq_cpuhp_lock);
+	raw_spin_lock(&blk_mq_cpuhp_lock);
 	__blk_mq_remove_cpuhp(hctx);
-	mutex_unlock(&blk_mq_cpuhp_lock);
+	raw_spin_unlock(&blk_mq_cpuhp_lock);
 }
 
 static void __blk_mq_add_cpuhp(struct blk_mq_hw_ctx *hctx)
@@ -3683,9 +3683,9 @@ static void blk_mq_remove_hw_queues_cpuhp(struct request_queue *q)
 	list_splice_init(&q->unused_hctx_list, &hctx_list);
 	spin_unlock(&q->unused_hctx_lock);
 
-	mutex_lock(&blk_mq_cpuhp_lock);
+	raw_spin_lock(&blk_mq_cpuhp_lock);
 	__blk_mq_remove_cpuhp_list(&hctx_list);
-	mutex_unlock(&blk_mq_cpuhp_lock);
+	raw_spin_unlock(&blk_mq_cpuhp_lock);
 
 	spin_lock(&q->unused_hctx_lock);
 	list_splice(&hctx_list, &q->unused_hctx_list);
@@ -3702,10 +3702,10 @@ static void blk_mq_add_hw_queues_cpuhp(struct request_queue *q)
 	struct blk_mq_hw_ctx *hctx;
 	unsigned long i;
 
-	mutex_lock(&blk_mq_cpuhp_lock);
+	raw_spin_lock(&blk_mq_cpuhp_lock);
 	queue_for_each_hw_ctx(q, hctx, i)
 		__blk_mq_add_cpuhp(hctx);
-	mutex_unlock(&blk_mq_cpuhp_lock);
+	raw_spin_unlock(&blk_mq_cpuhp_lock);
 }
 
 /*
-- 
2.52.0



* Re: [PATCH 2/2] block/blk-mq: convert blk_mq_cpuhp_lock to raw_spinlock for RT
  2025-12-20 11:02 ` [PATCH 2/2] block/blk-mq: convert blk_mq_cpuhp_lock to raw_spinlock for RT Ionut Nechita (WindRiver)
@ 2025-12-20 12:47   ` Ming Lei
  2025-12-20 20:58     ` [PATCH 0/2] block/blk-mq: fix RT kernel performance regressions Ionut Nechita (WindRiver)
  0 siblings, 1 reply; 7+ messages in thread
From: Ming Lei @ 2025-12-20 12:47 UTC (permalink / raw)
  To: Ionut Nechita (WindRiver)
  Cc: axboe, gregkh, muchun.song, sashal, linux-block, linux-kernel,
	stable, Ionut Nechita

On Sat, Dec 20, 2025 at 01:02:41PM +0200, Ionut Nechita (WindRiver) wrote:
> From: Ionut Nechita <ionut.nechita@windriver.com>
> 
> Commit 58bf93580fec ("blk-mq: move cpuhp callback registering out of
> q->sysfs_lock") introduced a global mutex blk_mq_cpuhp_lock to avoid
> lockdep warnings between sysfs_lock and CPU hotplug lock.
> 
> On RT kernels (CONFIG_PREEMPT_RT), regular mutexes are converted to
> rt_mutex (sleeping locks). When block layer operations need to acquire
> blk_mq_cpuhp_lock, IRQ threads processing I/O completions may sleep,
> causing additional contention on top of the queue_lock issue from
> commit 679b1874eba7 ("block: fix ordering between checking
> QUEUE_FLAG_QUIESCED request adding").
> 
> Test case (MegaRAID 12GSAS with 8 MSI-X vectors on RT kernel):
> - v6.6.68-rt with queue_lock fix: 640 MB/s (queue_lock fixed)
> - v6.6.69-rt: still exhibits contention due to cpuhp_lock mutex
> 
> The functions protected by blk_mq_cpuhp_lock only perform fast,
> non-sleeping operations:
> - hlist_unhashed() checks
> - cpuhp_state_add_instance_nocalls() - just hlist manipulation
> - cpuhp_state_remove_instance_nocalls() - just hlist manipulation
> - INIT_HLIST_NODE() initialization
> 
> The _nocalls variants do not invoke state callbacks and only manipulate
> data structures, making them safe to call under raw_spinlock.
> 
> Convert blk_mq_cpuhp_lock from mutex to raw_spinlock to prevent it from
> becoming a sleeping lock in RT kernel. This eliminates the contention
> bottleneck while maintaining the lockdep fix's original intent.

What is the contention bottleneck? blk_mq_cpuhp_lock is only acquired in
the slow code path; it isn't taken in the fast I/O path.

> 
> Fixes: 58bf93580fec ("blk-mq: move cpuhp callback registering out of q->sysfs_lock")

With the 1st patch, performance is back to 640 MB/s, the same as before
the regression.

So can you share what you are trying to fix with this patch?

> Cc: stable@vger.kernel.org
> Signed-off-by: Ionut Nechita <ionut.nechita@windriver.com>
> ---
>  block/blk-mq.c | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 5fb8da4958d0..3982e24b1081 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -43,7 +43,7 @@
>  
>  static DEFINE_PER_CPU(struct llist_head, blk_cpu_done);
>  static DEFINE_PER_CPU(call_single_data_t, blk_cpu_csd);
> -static DEFINE_MUTEX(blk_mq_cpuhp_lock);
> +static DEFINE_RAW_SPINLOCK(blk_mq_cpuhp_lock);
>  
>  static void blk_mq_insert_request(struct request *rq, blk_insert_t flags);
>  static void blk_mq_request_bypass_insert(struct request *rq,
> @@ -3641,9 +3641,9 @@ static void __blk_mq_remove_cpuhp(struct blk_mq_hw_ctx *hctx)
>  
>  static void blk_mq_remove_cpuhp(struct blk_mq_hw_ctx *hctx)
>  {
> -	mutex_lock(&blk_mq_cpuhp_lock);
> +	raw_spin_lock(&blk_mq_cpuhp_lock);
>  	__blk_mq_remove_cpuhp(hctx);
> -	mutex_unlock(&blk_mq_cpuhp_lock);
> +	raw_spin_unlock(&blk_mq_cpuhp_lock);

__blk_mq_remove_cpuhp()
	->cpuhp_state_remove_instance_nocalls()
		->__cpuhp_state_remove_instance
			->cpus_read_lock
				->percpu_down_read
					->percpu_down_read_internal
						->might_sleep()


Thanks,
Ming



* [syzbot ci] Re: block/blk-mq: fix RT kernel performance regressions
  2025-12-20 11:02 [PATCH 0/2] block/blk-mq: fix RT kernel performance regressions Ionut Nechita (WindRiver)
  2025-12-20 11:02 ` [PATCH 1/2] block/blk-mq: fix RT kernel regression with queue_lock in hot path Ionut Nechita (WindRiver)
  2025-12-20 11:02 ` [PATCH 2/2] block/blk-mq: convert blk_mq_cpuhp_lock to raw_spinlock for RT Ionut Nechita (WindRiver)
@ 2025-12-20 16:00 ` syzbot ci
  2 siblings, 0 replies; 7+ messages in thread
From: syzbot ci @ 2025-12-20 16:00 UTC (permalink / raw)
  To: axboe, djiony2011, gregkh, ionut.nechita, linux-block,
	linux-kernel, ming.lei, muchun.song, sashal, stable
  Cc: syzbot, syzkaller-bugs

syzbot ci has tested the following series

[v1] block/blk-mq: fix RT kernel performance regressions
https://lore.kernel.org/all/20251220110241.8435-1-ionut.nechita@windriver.com
* [PATCH 1/2] block/blk-mq: fix RT kernel regression with queue_lock in hot path
* [PATCH 2/2] block/blk-mq: convert blk_mq_cpuhp_lock to raw_spinlock for RT

and found the following issues:
* BUG: sleeping function called from invalid context in __cpuhp_state_add_instance
* BUG: sleeping function called from invalid context in __cpuhp_state_remove_instance

Full report is available here:
https://ci.syzbot.org/series/632f4721-6256-44fd-83f5-bf439d5f33f9

***

BUG: sleeping function called from invalid context in __cpuhp_state_add_instance

tree:      torvalds
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux
base:      dd9b004b7ff3289fb7bae35130c0a5c0537266af
arch:      amd64
compiler:  Debian clang version 20.1.8 (++20250708063551+0c9f909b7976-1~exp1~20250708183702.136), Debian LLD 20.1.8
config:    https://ci.syzbot.org/builds/9ad1c682-13b1-4626-b61a-a2156384698d/config
C repro:   https://ci.syzbot.org/findings/f999a055-07f3-4d7a-acfd-8bc0be61e2ec/c_repro
syz repro: https://ci.syzbot.org/findings/f999a055-07f3-4d7a-acfd-8bc0be61e2ec/syz_repro

BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:51
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 5982, name: syz.0.17
preempt_count: 1, expected: 0
RCU nest depth: 0, expected: 0
INFO: lockdep is turned off.
Preemption disabled at:
[<0000000000000000>] 0x0
CPU: 1 UID: 0 PID: 5982 Comm: syz.0.17 Tainted: G        W           syzkaller #0 PREEMPT(full) 
Tainted: [W]=WARN
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120
 __might_resched+0x495/0x610 kernel/sched/core.c:8827
 percpu_down_read_internal include/linux/percpu-rwsem.h:51 [inline]
 percpu_down_read include/linux/percpu-rwsem.h:77 [inline]
 cpus_read_lock+0x1b/0x160 kernel/cpu.c:491
 __cpuhp_state_add_instance+0x19/0x40 kernel/cpu.c:2454
 cpuhp_state_add_instance_nocalls include/linux/cpuhotplug.h:401 [inline]
 __blk_mq_add_cpuhp block/blk-mq.c:3858 [inline]
 blk_mq_add_hw_queues_cpuhp+0x19a/0x250 block/blk-mq.c:3906
 blk_mq_realloc_hw_ctxs block/blk-mq.c:4611 [inline]
 blk_mq_init_allocated_queue+0x366/0x1350 block/blk-mq.c:4635
 blk_mq_alloc_queue block/blk-mq.c:4416 [inline]
 __blk_mq_alloc_disk+0x1f0/0x340 block/blk-mq.c:4459
 loop_add+0x411/0xad0 drivers/block/loop.c:2050
 loop_control_ioctl+0x128/0x5a0 drivers/block/loop.c:2216
 vfs_ioctl fs/ioctl.c:51 [inline]
 __do_sys_ioctl fs/ioctl.c:597 [inline]
 __se_sys_ioctl+0xfc/0x170 fs/ioctl.c:583
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xfa/0xf80 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f1dc598f7c9
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffff2134d08 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007f1dc5be5fa0 RCX: 00007f1dc598f7c9
RDX: 00000000004080f9 RSI: 0000000000004c80 RDI: 0000000000000003
RBP: 00007f1dc59f297f R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f1dc5be5fa0 R14: 00007f1dc5be5fa0 R15: 0000000000000003
 </TASK>


***

BUG: sleeping function called from invalid context in __cpuhp_state_remove_instance

tree:      torvalds
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux
base:      dd9b004b7ff3289fb7bae35130c0a5c0537266af
arch:      amd64
compiler:  Debian clang version 20.1.8 (++20250708063551+0c9f909b7976-1~exp1~20250708183702.136), Debian LLD 20.1.8
config:    https://ci.syzbot.org/builds/9ad1c682-13b1-4626-b61a-a2156384698d/config
C repro:   https://ci.syzbot.org/findings/f39691bc-570a-4163-9791-31ce10e18fb6/c_repro
syz repro: https://ci.syzbot.org/findings/f39691bc-570a-4163-9791-31ce10e18fb6/syz_repro

BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:51
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 5975, name: syz.0.17
preempt_count: 1, expected: 0
RCU nest depth: 0, expected: 0
INFO: lockdep is turned off.
Preemption disabled at:
[<0000000000000000>] 0x0
CPU: 0 UID: 0 PID: 5975 Comm: syz.0.17 Tainted: G        W           syzkaller #0 PREEMPT(full) 
Tainted: [W]=WARN
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120
 __might_resched+0x495/0x610 kernel/sched/core.c:8827
 percpu_down_read_internal include/linux/percpu-rwsem.h:51 [inline]
 percpu_down_read include/linux/percpu-rwsem.h:77 [inline]
 cpus_read_lock+0x1b/0x160 kernel/cpu.c:491
 __cpuhp_state_remove_instance+0x77/0x2e0 kernel/cpu.c:2565
 cpuhp_state_remove_instance_nocalls include/linux/cpuhotplug.h:502 [inline]
 __blk_mq_remove_cpuhp+0x140/0x1a0 block/blk-mq.c:3835
 blk_mq_remove_cpuhp block/blk-mq.c:3844 [inline]
 blk_mq_exit_hw_queues block/blk-mq.c:3974 [inline]
 blk_mq_exit_queue+0xe8/0x380 block/blk-mq.c:4670
 __del_gendisk+0x832/0x9e0 block/genhd.c:774
 del_gendisk+0xe8/0x160 block/genhd.c:823
 loop_remove+0x42/0xc0 drivers/block/loop.c:2121
 loop_control_remove drivers/block/loop.c:2180 [inline]
 loop_control_ioctl+0x4ac/0x5a0 drivers/block/loop.c:2218
 vfs_ioctl fs/ioctl.c:51 [inline]
 __do_sys_ioctl fs/ioctl.c:597 [inline]
 __se_sys_ioctl+0xfc/0x170 fs/ioctl.c:583
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xfa/0xf80 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f911d78f7c9
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffd4f0e1fb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007f911d9e5fa0 RCX: 00007f911d78f7c9
RDX: 0000000000000006 RSI: 0000000000004c81 RDI: 0000000000000003
RBP: 00007f911d7f297f R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f911d9e5fa0 R14: 00007f911d9e5fa0 R15: 0000000000000003
 </TASK>


***

If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
  Tested-by: syzbot@syzkaller.appspotmail.com

---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.


* Re: [PATCH 0/2] block/blk-mq: fix RT kernel performance regressions
  2025-12-20 12:47   ` Ming Lei
@ 2025-12-20 20:58     ` Ionut Nechita (WindRiver)
  0 siblings, 0 replies; 7+ messages in thread
From: Ionut Nechita (WindRiver) @ 2025-12-20 20:58 UTC (permalink / raw)
  To: ming.lei
  Cc: axboe, djiony2011, gregkh, ionut.nechita, linux-block,
	linux-kernel, muchun.song, sashal, stable

From: Ionut Nechita <ionut.nechita@windriver.com>

Hi Ming,

Thank you for the feedback!

You're absolutely right - blk_mq_cpuhp_lock is only acquired in the slow
path (setup/cleanup operations during queue initialization/teardown), not
in the fast I/O path.

Looking at my testing results more carefully:
- The queue_lock patch (PATCH 1/2) alone restores performance to 640 MB/s
- The cpuhp_lock conversion (PATCH 2/2) doesn't contribute to fixing the
  I/O regression

The cpuhp_lock is used in:
- blk_mq_remove_cpuhp() - queue cleanup
- blk_mq_add_hw_queues_cpuhp() - queue setup
- blk_mq_remove_hw_queues_cpuhp() - queue cleanup

These are indeed slow path operations with no contention in the I/O hot
path.

I'll drop the second patch (cpuhp_lock conversion) and send v2 with only
the queue_lock fix, which addresses the actual bottleneck: removing the
sleeping lock from blk_mq_run_hw_queue() that was causing IRQ threads to
serialize and enter D-state during I/O completion.

Best regards,
Ionut

Thread overview: 7+ messages
2025-12-20 11:02 [PATCH 0/2] block/blk-mq: fix RT kernel performance regressions Ionut Nechita (WindRiver)
2025-12-20 11:02 ` [PATCH 1/2] block/blk-mq: fix RT kernel regression with queue_lock in hot path Ionut Nechita (WindRiver)
2025-12-20 11:02 ` [PATCH 2/2] block/blk-mq: convert blk_mq_cpuhp_lock to raw_spinlock for RT Ionut Nechita (WindRiver)
2025-12-20 12:47   ` Ming Lei
2025-12-20 20:58     ` [PATCH 0/2] block/blk-mq: fix RT kernel performance regressions Ionut Nechita (WindRiver)
2025-12-20 16:00 ` [syzbot ci] " syzbot ci
     [not found] <20251220105448.8065-1-ionut.nechita@windriver.com>
2025-12-20 10:54 ` [PATCH 2/2] block/blk-mq: convert blk_mq_cpuhp_lock to raw_spinlock for RT Ionut Nechita (WindRiver)
