linux-block.vger.kernel.org archive mirror
* [PATCH v2 0/2] block/blk-mq: fix RT kernel issues and interrupt context warnings
@ 2025-12-22 20:15 Ionut Nechita (WindRiver)
  2025-12-22 20:15 ` [PATCH v2 1/2] block/blk-mq: fix RT kernel regression with queue_lock in hot path Ionut Nechita (WindRiver)
  2025-12-22 20:15 ` [PATCH v2 2/2] block: Fix WARN_ON in blk_mq_run_hw_queue when called from interrupt context Ionut Nechita (WindRiver)
  0 siblings, 2 replies; 12+ messages in thread
From: Ionut Nechita (WindRiver) @ 2025-12-22 20:15 UTC (permalink / raw)
  To: ming.lei
  Cc: axboe, gregkh, ionut.nechita, linux-block, linux-kernel,
	muchun.song, sashal, stable

From: Ionut Nechita <ionut.nechita@windriver.com>

This series addresses two critical issues in the block layer multiqueue
(blk-mq) subsystem when running on PREEMPT_RT kernels.

The first patch fixes a severe performance regression where queue_lock
contention in the I/O hot path causes IRQ threads to sleep on RT kernels.
Testing on a MegaRAID 12GSAS controller showed a 76% performance drop
(640 MB/s -> 153 MB/s). The fix replaces the spinlock with memory barriers
to maintain ordering without sleeping.

The second patch fixes a WARN_ON that triggers during SCSI device scanning
when blk_freeze_queue_start() calls blk_mq_run_hw_queues() synchronously
from interrupt context. The warning "WARN_ON_ONCE(!async && in_interrupt())"
is resolved by switching to asynchronous execution.

Changes in v2:
- Removed the blk_mq_cpuhp_lock patch (needs more investigation)
- Added fix for WARN_ON in interrupt context during queue freezing
- Updated commit messages for clarity

Ionut Nechita (2):
  block/blk-mq: fix RT kernel regression with queue_lock in hot path
  block: Fix WARN_ON in blk_mq_run_hw_queue when called from interrupt
    context

 block/blk-mq.c | 21 +++++++++------------
 1 file changed, 9 insertions(+), 12 deletions(-)

-- 
2.52.0



* [PATCH v2 1/2] block/blk-mq: fix RT kernel regression with queue_lock in hot path
  2025-12-22 20:15 [PATCH v2 0/2] block/blk-mq: fix RT kernel issues and interrupt context warnings Ionut Nechita (WindRiver)
@ 2025-12-22 20:15 ` Ionut Nechita (WindRiver)
  2025-12-23  2:15   ` Muchun Song
  2025-12-22 20:15 ` [PATCH v2 2/2] block: Fix WARN_ON in blk_mq_run_hw_queue when called from interrupt context Ionut Nechita (WindRiver)
  1 sibling, 1 reply; 12+ messages in thread
From: Ionut Nechita (WindRiver) @ 2025-12-22 20:15 UTC (permalink / raw)
  To: ming.lei
  Cc: axboe, gregkh, ionut.nechita, linux-block, linux-kernel,
	muchun.song, sashal, stable

From: Ionut Nechita <ionut.nechita@windriver.com>

Commit 679b1874eba7 ("block: fix ordering between checking
QUEUE_FLAG_QUIESCED request adding") introduced queue_lock acquisition
in blk_mq_run_hw_queue() to synchronize QUEUE_FLAG_QUIESCED checks.

On RT kernels (CONFIG_PREEMPT_RT), regular spinlocks are converted to
rt_mutex (sleeping locks). When multiple MSI-X IRQ threads process I/O
completions concurrently, they contend on queue_lock in the hot path,
causing all IRQ threads to enter D (uninterruptible sleep) state. This
serializes interrupt processing completely.

Test case (MegaRAID 12GSAS with 8 MSI-X vectors on RT kernel):
- Good (v6.6.52-rt):  640 MB/s sequential read
- Bad  (v6.6.64-rt):  153 MB/s sequential read (-76% regression)
- 6-8 out of 8 MSI-X IRQ threads stuck in D-state waiting on queue_lock

The original commit message mentioned memory barriers as an alternative
approach. Use full memory barriers (smp_mb) instead of queue_lock to
provide the same ordering guarantees without sleeping on RT kernels.

Memory barriers ensure proper synchronization:
- CPU0 either sees QUEUE_FLAG_QUIESCED cleared, OR
- CPU1 sees dispatch list/sw queue bitmap updates

This maintains correctness while avoiding lock contention that causes
RT kernel IRQ threads to sleep in the I/O completion path.

Fixes: 679b1874eba7 ("block: fix ordering between checking QUEUE_FLAG_QUIESCED request adding")
Cc: stable@vger.kernel.org
Signed-off-by: Ionut Nechita <ionut.nechita@windriver.com>
---
 block/blk-mq.c | 19 ++++++++-----------
 1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 5da948b07058..5fb8da4958d0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2292,22 +2292,19 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 
 	might_sleep_if(!async && hctx->flags & BLK_MQ_F_BLOCKING);
 
+	/*
+	 * First lockless check to avoid unnecessary overhead.
+	 * Memory barrier below synchronizes with blk_mq_unquiesce_queue().
+	 */
 	need_run = blk_mq_hw_queue_need_run(hctx);
 	if (!need_run) {
-		unsigned long flags;
-
-		/*
-		 * Synchronize with blk_mq_unquiesce_queue(), because we check
-		 * if hw queue is quiesced locklessly above, we need the use
-		 * ->queue_lock to make sure we see the up-to-date status to
-		 * not miss rerunning the hw queue.
-		 */
-		spin_lock_irqsave(&hctx->queue->queue_lock, flags);
+		/* Synchronize with blk_mq_unquiesce_queue() */
+		smp_mb();
 		need_run = blk_mq_hw_queue_need_run(hctx);
-		spin_unlock_irqrestore(&hctx->queue->queue_lock, flags);
-
 		if (!need_run)
 			return;
+		/* Ensure dispatch list/sw queue updates visible before execution */
+		smp_mb();
 	}
 
 	if (async || !cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask)) {
-- 
2.52.0



* [PATCH v2 2/2] block: Fix WARN_ON in blk_mq_run_hw_queue when called from interrupt context
  2025-12-22 20:15 [PATCH v2 0/2] block/blk-mq: fix RT kernel issues and interrupt context warnings Ionut Nechita (WindRiver)
  2025-12-22 20:15 ` [PATCH v2 1/2] block/blk-mq: fix RT kernel regression with queue_lock in hot path Ionut Nechita (WindRiver)
@ 2025-12-22 20:15 ` Ionut Nechita (WindRiver)
  2025-12-23  1:22   ` Ming Lei
  2025-12-23  2:18   ` Muchun Song
  1 sibling, 2 replies; 12+ messages in thread
From: Ionut Nechita (WindRiver) @ 2025-12-22 20:15 UTC (permalink / raw)
  To: ming.lei
  Cc: axboe, gregkh, ionut.nechita, linux-block, linux-kernel,
	muchun.song, sashal, stable

From: Ionut Nechita <ionut.nechita@windriver.com>

Fix warning "WARN_ON_ONCE(!async && in_interrupt())" that occurs during
SCSI device scanning when blk_freeze_queue_start() calls blk_mq_run_hw_queues()
synchronously from interrupt context.

The issue happens during device removal/scanning when:
1. blk_mq_destroy_queue() -> blk_queue_start_drain()
2. blk_freeze_queue_start() calls blk_mq_run_hw_queues(q, false)
3. This triggers the warning in blk_mq_run_hw_queue() when in interrupt context

Change the synchronous call to asynchronous to avoid running in interrupt context.

Fixes: Warning in blk_mq_run_hw_queue+0x1fa/0x260
Signed-off-by: Ionut Nechita <ionut.nechita@windriver.com>
---
 block/blk-mq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 5fb8da4958d0..ae152f7a6933 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -128,7 +128,7 @@ void blk_freeze_queue_start(struct request_queue *q)
 		percpu_ref_kill(&q->q_usage_counter);
 		mutex_unlock(&q->mq_freeze_lock);
 		if (queue_is_mq(q))
-			blk_mq_run_hw_queues(q, false);
+			blk_mq_run_hw_queues(q, true);
 	} else {
 		mutex_unlock(&q->mq_freeze_lock);
 	}
-- 
2.52.0



* Re: [PATCH v2 2/2] block: Fix WARN_ON in blk_mq_run_hw_queue when called from interrupt context
  2025-12-22 20:15 ` [PATCH v2 2/2] block: Fix WARN_ON in blk_mq_run_hw_queue when called from interrupt context Ionut Nechita (WindRiver)
@ 2025-12-23  1:22   ` Ming Lei
  2026-01-06 11:14     ` djiony2011
  2025-12-23  2:18   ` Muchun Song
  1 sibling, 1 reply; 12+ messages in thread
From: Ming Lei @ 2025-12-23  1:22 UTC (permalink / raw)
  To: Ionut Nechita (WindRiver)
  Cc: axboe, gregkh, ionut.nechita, linux-block, linux-kernel,
	muchun.song, sashal, stable

On Mon, Dec 22, 2025 at 10:15:41PM +0200, Ionut Nechita (WindRiver) wrote:
> From: Ionut Nechita <ionut.nechita@windriver.com>
> 
> Fix warning "WARN_ON_ONCE(!async && in_interrupt())" that occurs during
> SCSI device scanning when blk_freeze_queue_start() calls blk_mq_run_hw_queues()
> synchronously from interrupt context.

Can you show the whole stack trace in the warning? The current code doesn't
indicate that freeze queue can be called from scsi's interrupt context.


Thanks, 
Ming



* Re: [PATCH v2 1/2] block/blk-mq: fix RT kernel regression with queue_lock in hot path
  2025-12-22 20:15 ` [PATCH v2 1/2] block/blk-mq: fix RT kernel regression with queue_lock in hot path Ionut Nechita (WindRiver)
@ 2025-12-23  2:15   ` Muchun Song
  2026-01-06 11:36     ` djiony2011
  0 siblings, 1 reply; 12+ messages in thread
From: Muchun Song @ 2025-12-23  2:15 UTC (permalink / raw)
  To: Ionut Nechita (WindRiver)
  Cc: axboe, gregkh, ionut.nechita, linux-block, linux-kernel, sashal,
	stable, ming.lei



On 2025/12/23 04:15, Ionut Nechita (WindRiver) wrote:
> From: Ionut Nechita <ionut.nechita@windriver.com>
>
> Commit 679b1874eba7 ("block: fix ordering between checking
> QUEUE_FLAG_QUIESCED request adding") introduced queue_lock acquisition
> in blk_mq_run_hw_queue() to synchronize QUEUE_FLAG_QUIESCED checks.
>
> On RT kernels (CONFIG_PREEMPT_RT), regular spinlocks are converted to
> rt_mutex (sleeping locks). When multiple MSI-X IRQ threads process I/O
> completions concurrently, they contend on queue_lock in the hot path,
> causing all IRQ threads to enter D (uninterruptible sleep) state. This
> serializes interrupt processing completely.
>
> Test case (MegaRAID 12GSAS with 8 MSI-X vectors on RT kernel):
> - Good (v6.6.52-rt):  640 MB/s sequential read
> - Bad  (v6.6.64-rt):  153 MB/s sequential read (-76% regression)
> - 6-8 out of 8 MSI-X IRQ threads stuck in D-state waiting on queue_lock
>
> The original commit message mentioned memory barriers as an alternative
> approach. Use full memory barriers (smp_mb) instead of queue_lock to
> provide the same ordering guarantees without sleeping on RT kernels.
>
> Memory barriers ensure proper synchronization:
> - CPU0 either sees QUEUE_FLAG_QUIESCED cleared, OR
> - CPU1 sees dispatch list/sw queue bitmap updates
>
> This maintains correctness while avoiding lock contention that causes
> RT kernel IRQ threads to sleep in the I/O completion path.
>
> Fixes: 679b1874eba7 ("block: fix ordering between checking QUEUE_FLAG_QUIESCED request adding")
> Cc: stable@vger.kernel.org
> Signed-off-by: Ionut Nechita <ionut.nechita@windriver.com>
> ---
>   block/blk-mq.c | 19 ++++++++-----------
>   1 file changed, 8 insertions(+), 11 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 5da948b07058..5fb8da4958d0 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2292,22 +2292,19 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
>   
>   	might_sleep_if(!async && hctx->flags & BLK_MQ_F_BLOCKING);
>   
> +	/*
> +	 * First lockless check to avoid unnecessary overhead.
> +	 * Memory barrier below synchronizes with blk_mq_unquiesce_queue().
> +	 */
>   	need_run = blk_mq_hw_queue_need_run(hctx);
>   	if (!need_run) {
> -		unsigned long flags;
> -
> -		/*
> -		 * Synchronize with blk_mq_unquiesce_queue(), because we check
> -		 * if hw queue is quiesced locklessly above, we need the use
> -		 * ->queue_lock to make sure we see the up-to-date status to
> -		 * not miss rerunning the hw queue.
> -		 */
> -		spin_lock_irqsave(&hctx->queue->queue_lock, flags);
> +		/* Synchronize with blk_mq_unquiesce_queue() */

Memory barriers must be used in pairs. So how to synchronize?

> +		smp_mb();
>   		need_run = blk_mq_hw_queue_need_run(hctx);
> -		spin_unlock_irqrestore(&hctx->queue->queue_lock, flags);
> -
>   		if (!need_run)
>   			return;
> +		/* Ensure dispatch list/sw queue updates visible before execution */
> +		smp_mb();

Why do we need another barrier? What ordering does this barrier guarantee?

Thanks.
>   	}
>   
>   	if (async || !cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask)) {



* Re: [PATCH v2 2/2] block: Fix WARN_ON in blk_mq_run_hw_queue when called from interrupt context
  2025-12-22 20:15 ` [PATCH v2 2/2] block: Fix WARN_ON in blk_mq_run_hw_queue when called from interrupt context Ionut Nechita (WindRiver)
  2025-12-23  1:22   ` Ming Lei
@ 2025-12-23  2:18   ` Muchun Song
  1 sibling, 0 replies; 12+ messages in thread
From: Muchun Song @ 2025-12-23  2:18 UTC (permalink / raw)
  To: Ionut Nechita (WindRiver)
  Cc: ming.lei, axboe, gregkh, ionut.nechita, linux-block, linux-kernel,
	sashal, stable



> On Dec 23, 2025, at 04:15, Ionut Nechita (WindRiver) <djiony2011@gmail.com> wrote:
> 
> From: Ionut Nechita <ionut.nechita@windriver.com>
> 
> Fix warning "WARN_ON_ONCE(!async && in_interrupt())" that occurs during
> SCSI device scanning when blk_freeze_queue_start() calls blk_mq_run_hw_queues()
> synchronously from interrupt context.
> 
> The issue happens during device removal/scanning when:
> 1. blk_mq_destroy_queue() -> blk_queue_start_drain()
> 2. blk_freeze_queue_start() calls blk_mq_run_hw_queues(q, false)
> 3. This triggers the warning in blk_mq_run_hw_queue() when in interrupt context
> 
> Change the synchronous call to asynchronous to avoid running in interrupt context.
> 
> Fixes: Warning in blk_mq_run_hw_queue+0x1fa/0x260

You've added a Fixes tag with the wrong format.

Thanks.

> Signed-off-by: Ionut Nechita <ionut.nechita@windriver.com>


* Re: [PATCH v2 2/2] block: Fix WARN_ON in blk_mq_run_hw_queue when called from interrupt context
  2025-12-23  1:22   ` Ming Lei
@ 2026-01-06 11:14     ` djiony2011
  2026-01-06 12:29       ` Bart Van Assche
  2026-01-06 15:04       ` Ming Lei
  0 siblings, 2 replies; 12+ messages in thread
From: djiony2011 @ 2026-01-06 11:14 UTC (permalink / raw)
  To: ming.lei
  Cc: axboe, djiony2011, gregkh, ionut.nechita, linux-block,
	linux-kernel, muchun.song, sashal, stable

From: Ionut Nechita <ionut.nechita@windriver.com>

Hi Ming,

Thank you for the review. You're absolutely right to ask for clarification - I need to
correct my commit message as it's misleading about the actual call path.

> Can you show the whole stack trace in the warning? The current code doesn't
> indicate that freeze queue can be called from scsi's interrupt context.

Here's the complete stack trace from the WARNING at blk_mq_run_hw_queue:

[Mon Dec 22 10:18:18 2025] WARNING: CPU: 190 PID: 2041 at block/blk-mq.c:2291 blk_mq_run_hw_queue+0x1fa/0x260
[Mon Dec 22 10:18:18 2025] Modules linked in:
[Mon Dec 22 10:18:18 2025] CPU: 190 PID: 2041 Comm: kworker/u385:1 Tainted: G        W          6.6.0-1-rt-amd64 #1  Debian 6.6.71-1
[Mon Dec 22 10:18:18 2025] Hardware name: Dell Inc. PowerEdge R7615/09K9WP, BIOS 1.11.2 12/19/2024
[Mon Dec 22 10:18:18 2025] Workqueue: events_unbound async_run_entry_fn
[Mon Dec 22 10:18:18 2025] RIP: 0010:blk_mq_run_hw_queue+0x1fa/0x260
[Mon Dec 22 10:18:18 2025] Code: ff 75 68 44 89 f6 e8 e5 45 c0 ff e9 ac fe ff ff e8 2b 70 c0 ff 48 89 ef e8 b3 a0 00 00 5b 5d 41 5c 41 5d 41 5e e9 26 9e c0 ff <0f> 0b e9 43 fe ff ff e8 0a 70 c0 ff 48 8b 85 d0 00 00 00 48 8b 80
[Mon Dec 22 10:18:18 2025] RSP: 0018:ff630f098528fb98 EFLAGS: 00010206
[Mon Dec 22 10:18:18 2025] RAX: 0000000000ff0000 RBX: 0000000000000000 RCX: 0000000000000000
[Mon Dec 22 10:18:18 2025] RDX: 0000000000ff0000 RSI: 0000000000000000 RDI: ff3edc0247159400
[Mon Dec 22 10:18:18 2025] RBP: ff3edc0247159400 R08: ff3edc0247159400 R09: ff630f098528fb60
[Mon Dec 22 10:18:18 2025] R10: 0000000000000000 R11: 0000000045069ed3 R12: 0000000000000000
[Mon Dec 22 10:18:18 2025] R13: ff3edc024715a828 R14: 0000000000000000 R15: 0000000000000000
[Mon Dec 22 10:18:18 2025] FS:  0000000000000000(0000) GS:ff3edc10fd380000(0000) knlGS:0000000000000000
[Mon Dec 22 10:18:18 2025] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Mon Dec 22 10:18:18 2025] CR2: 0000000000000000 CR3: 000000073961a001 CR4: 0000000000771ee0
[Mon Dec 22 10:18:18 2025] PKRU: 55555554
[Mon Dec 22 10:18:18 2025] Call Trace:
[Mon Dec 22 10:18:18 2025]  <TASK>
[Mon Dec 22 10:18:18 2025]  ? __warn+0x89/0x140
[Mon Dec 22 10:18:18 2025]  ? blk_mq_run_hw_queue+0x1fa/0x260
[Mon Dec 22 10:18:18 2025]  ? report_bug+0x198/0x1b0
[Mon Dec 22 10:18:18 2025]  ? handle_bug+0x53/0x90
[Mon Dec 22 10:18:18 2025]  ? exc_invalid_op+0x18/0x70
[Mon Dec 22 10:18:18 2025]  ? asm_exc_invalid_op+0x1a/0x20
[Mon Dec 22 10:18:18 2025]  ? blk_mq_run_hw_queue+0x1fa/0x260
[Mon Dec 22 10:18:18 2025]  blk_mq_run_hw_queues+0x6c/0x130
[Mon Dec 22 10:18:18 2025]  blk_queue_start_drain+0x12/0x40
[Mon Dec 22 10:18:18 2025]  blk_mq_destroy_queue+0x37/0x70
[Mon Dec 22 10:18:18 2025]  __scsi_remove_device+0x6a/0x180
[Mon Dec 22 10:18:18 2025]  scsi_alloc_sdev+0x357/0x360
[Mon Dec 22 10:18:18 2025]  scsi_probe_and_add_lun+0x8ac/0xc00
[Mon Dec 22 10:18:18 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Mon Dec 22 10:18:18 2025]  ? dev_set_name+0x57/0x80
[Mon Dec 22 10:18:18 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Mon Dec 22 10:18:18 2025]  ? attribute_container_add_device+0x4d/0x130
[Mon Dec 22 10:18:18 2025]  __scsi_scan_target+0xf0/0x520
[Mon Dec 22 10:18:18 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Mon Dec 22 10:18:18 2025]  ? sched_clock_cpu+0x64/0x190
[Mon Dec 22 10:18:18 2025]  scsi_scan_channel+0x57/0x90
[Mon Dec 22 10:18:18 2025]  scsi_scan_host_selected+0xd4/0x110
[Mon Dec 22 10:18:18 2025]  do_scan_async+0x1c/0x190
[Mon Dec 22 10:18:18 2025]  async_run_entry_fn+0x2f/0x130
[Mon Dec 22 10:18:18 2025]  process_one_work+0x175/0x370
[Mon Dec 22 10:18:18 2025]  worker_thread+0x280/0x390
[Mon Dec 22 10:18:18 2025]  ? __pfx_worker_thread+0x10/0x10
[Mon Dec 22 10:18:18 2025]  kthread+0xdd/0x110
[Mon Dec 22 10:18:18 2025]  ? __pfx_kthread+0x10/0x10
[Mon Dec 22 10:18:18 2025]  ret_from_fork+0x31/0x50
[Mon Dec 22 10:18:18 2025]  ? __pfx_kthread+0x10/0x10
[Mon Dec 22 10:18:18 2025]  ret_from_fork_asm+0x1b/0x30
[Mon Dec 22 10:18:18 2025]  </TASK>
[Mon Dec 22 10:18:18 2025] ---[ end trace 0000000000000000 ]---

## Important clarifications:

1. **Not freeze queue, but drain during destroy**: My commit message was incorrect.
   The call path is:
   blk_mq_destroy_queue() -> blk_queue_start_drain() -> blk_mq_run_hw_queues(q, false)

   This is NOT during blk_freeze_queue_start(), but during queue destruction when a
   SCSI device probe fails and cleanup is triggered.

2. **Not true interrupt context**: You're correct that this isn't from an interrupt
   handler. The workqueue context is process context, not interrupt context.

3. **The actual problem on PREEMPT_RT**: There's a preceding "scheduling while atomic"
   error that provides the real context:

[Mon Dec 22 10:18:18 2025] BUG: scheduling while atomic: kworker/u385:1/2041/0x00000002
[Mon Dec 22 10:18:18 2025] Call Trace:
[Mon Dec 22 10:18:18 2025]  dump_stack_lvl+0x37/0x50
[Mon Dec 22 10:18:18 2025]  __schedule_bug+0x52/0x60
[Mon Dec 22 10:18:18 2025]  __schedule+0x87d/0xb10
[Mon Dec 22 10:18:18 2025]  rt_mutex_schedule+0x21/0x40
[Mon Dec 22 10:18:18 2025]  rt_mutex_slowlock_block.constprop.0+0x33/0x170
[Mon Dec 22 10:18:18 2025]  __rt_mutex_slowlock_locked.constprop.0+0xc4/0x1e0
[Mon Dec 22 10:18:18 2025]  mutex_lock+0x44/0x60
[Mon Dec 22 10:18:18 2025]  __cpuhp_state_add_instance_cpuslocked+0x41/0x110
[Mon Dec 22 10:18:18 2025]  __cpuhp_state_add_instance+0x48/0xd0
[Mon Dec 22 10:18:18 2025]  blk_mq_realloc_hw_ctxs+0x405/0x420
[Mon Dec 22 10:18:18 2025]  blk_mq_init_allocated_queue+0x10a/0x480

The context is atomic because on PREEMPT_RT, some spinlock earlier in the call chain has
been converted to an rt_mutex, and the code is holding that lock. When blk_mq_run_hw_queues()
is called with async=false, it triggers kblockd_mod_delayed_work_on(), which calls
in_interrupt(), and this returns true because preempt_count() is non-zero due to the
rt_mutex being held.

## What this means:

The issue is specific to PREEMPT_RT where:
- Spinlocks become sleeping mutexes (rt_mutex)
- Holding an rt_mutex sets preempt_count, making in_interrupt() return true
- blk_mq_run_hw_queues() with async=false hits WARN_ON_ONCE(!async && in_interrupt())

This is why the async parameter needs to be true when called in contexts that might
hold spinlocks on RT kernels.

I apologize for the confusion in my commit message. Should I:
1. Revise the commit message to accurately describe the blk_queue_start_drain() path?
2. Add details about the PREEMPT_RT context causing the atomic state?

Best regards,
Ionut


* Re: [PATCH v2 1/2] block/blk-mq: fix RT kernel regression with queue_lock in hot path
  2025-12-23  2:15   ` Muchun Song
@ 2026-01-06 11:36     ` djiony2011
  0 siblings, 0 replies; 12+ messages in thread
From: djiony2011 @ 2026-01-06 11:36 UTC (permalink / raw)
  To: muchun.song
  Cc: axboe, gregkh, ionut.nechita, linux-block, linux-kernel, ming.lei,
	sashal, stable

From: Ionut Nechita <ionut.nechita@windriver.com>

Hi Muchun,

Thank you for the detailed review. Your questions about the memory barriers are
absolutely correct and highlight fundamental issues with my approach.

> Memory barriers must be used in pairs. So how to synchronize?

> Why do we need another barrier? What ordering does this barrier guarantee?

You're right to ask these questions. After careful consideration and discussion
with Ming Lei, I've concluded that the memory barrier approach in this patch is
flawed and insufficient.

The fundamental problem is:
1. Memory barriers need proper pairing on both read and write sides
2. The write-side barriers would need to be inserted at MULTIPLE call sites
   throughout the block layer - everywhere work is added before calling
   blk_mq_run_hw_queue()
3. This is exactly why the original commit 679b1874eba7 chose the lock-based
   approach, noting that "memory barrier is not easy to be maintained"

My patch attempted to add barriers only in blk_mq_run_hw_queue(), but didn't
address the pairing barriers needed at all the call sites that add work to
dispatch lists/sw queues. This makes the synchronization incomplete.
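
To make the pairing gap concrete, here is a rough sketch of the two sides that
would both need a full barrier (illustrative pseudo-kernel code with made-up
helper names; not actual patch code):

```
/* CPU0: unquiesce side - would need a barrier here and at similar sites */
clear_queue_quiesced_flag(q);              /* store: QUIESCED cleared        */
smp_mb();                                  /* pairs with CPU1's barrier      */
rerun_hw_queues_if_work_pending(q);        /* load: dispatch list / sw queue */

/* CPU1: blk_mq_run_hw_queue() side - the barrier my patch adds */
add_request_to_dispatch(hctx, rq);         /* store: work queued             */
smp_mb();                                  /* pairs with CPU0's barrier      */
need_run = blk_mq_hw_queue_need_run(hctx); /* load: is QUIESCED still set?   */
```

Without the CPU0-side barrier at every such call site, one CPU can miss the
other's update, which is exactly the gap you pointed out.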

## New approach: dedicated raw_spinlock

I'm abandoning the memory barrier approach and preparing a new patch that uses
a dedicated raw_spinlock_t (quiesce_sync_lock) instead of the general-purpose
queue_lock.

The key differences from the current problematic code:
- Current: Uses queue_lock (spinlock_t) which becomes rt_mutex in RT kernel
- New: Uses quiesce_sync_lock (raw_spinlock_t) which stays a real spinlock

Why raw_spinlock is safe:
- Critical section is provably short (only flag and counter checks)
- No sleeping operations under the lock
- Specific to quiesce synchronization, not general queue operations

This approach:
- Maintains the correct synchronization from 679b1874eba7
- Avoids sleeping in RT kernel's IRQ thread context
- Simpler and more maintainable than memory barrier pairing across many call sites
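
As a rough illustration of the shape of that change (hypothetical naming and
placement only, not the patch I will actually send):

```
/* struct request_queue gains a dedicated lock for quiesce synchronization */
	raw_spinlock_t		quiesce_sync_lock;

/* blk_mq_run_hw_queue() takes it only for the re-check of the lockless test */
	need_run = blk_mq_hw_queue_need_run(hctx);
	if (!need_run) {
		unsigned long flags;

		raw_spin_lock_irqsave(&hctx->queue->quiesce_sync_lock, flags);
		need_run = blk_mq_hw_queue_need_run(hctx);
		raw_spin_unlock_irqrestore(&hctx->queue->quiesce_sync_lock, flags);
		if (!need_run)
			return;
	}
```

The quiesce/unquiesce paths would take the same raw_spinlock when updating
QUEUE_FLAG_QUIESCED, so the ordering from 679b1874eba7 is preserved while the
lock stays a real spinlock even on PREEMPT_RT.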

I'll send the new patch shortly. Thank you for catching these issues before
they made it into the kernel.

Best regards,
Ionut


* Re: [PATCH v2 2/2] block: Fix WARN_ON in blk_mq_run_hw_queue when called from interrupt context
  2026-01-06 11:14     ` djiony2011
@ 2026-01-06 12:29       ` Bart Van Assche
  2026-01-06 14:40         ` Ionut Nechita
  2026-01-06 15:04       ` Ming Lei
  1 sibling, 1 reply; 12+ messages in thread
From: Bart Van Assche @ 2026-01-06 12:29 UTC (permalink / raw)
  To: djiony2011, ming.lei
  Cc: axboe, gregkh, ionut.nechita, linux-block, linux-kernel,
	muchun.song, sashal, stable

On 1/6/26 3:14 AM, djiony2011@gmail.com wrote:
> [Mon Dec 22 10:18:18 2025] WARNING: CPU: 190 PID: 2041 at block/blk-mq.c:2291 blk_mq_run_hw_queue+0x1fa/0x260
> [Mon Dec 22 10:18:18 2025] Modules linked in:
> [Mon Dec 22 10:18:18 2025] CPU: 190 PID: 2041 Comm: kworker/u385:1 Tainted: G        W          6.6.0-1-rt-amd64 #1  Debian 6.6.71-1

6.6.71 is pretty far away from Jens' for-next branch. Please use Jens'
for-next branch for testing kernel patches intended for the upstream
kernel.

> [Mon Dec 22 10:18:18 2025] Call Trace:
> [Mon Dec 22 10:18:18 2025]  <TASK>
> [Mon Dec 22 10:18:18 2025]  blk_mq_run_hw_queues+0x6c/0x130
> [Mon Dec 22 10:18:18 2025]  blk_queue_start_drain+0x12/0x40
> [Mon Dec 22 10:18:18 2025]  blk_mq_destroy_queue+0x37/0x70
> [Mon Dec 22 10:18:18 2025]  __scsi_remove_device+0x6a/0x180
> [Mon Dec 22 10:18:18 2025]  scsi_alloc_sdev+0x357/0x360
> [Mon Dec 22 10:18:18 2025]  scsi_probe_and_add_lun+0x8ac/0xc00
> [Mon Dec 22 10:18:18 2025]  __scsi_scan_target+0xf0/0x520
> [Mon Dec 22 10:18:18 2025]  scsi_scan_channel+0x57/0x90
> [Mon Dec 22 10:18:18 2025]  scsi_scan_host_selected+0xd4/0x110
> [Mon Dec 22 10:18:18 2025]  do_scan_async+0x1c/0x190
> [Mon Dec 22 10:18:18 2025]  async_run_entry_fn+0x2f/0x130
> [Mon Dec 22 10:18:18 2025]  process_one_work+0x175/0x370
> [Mon Dec 22 10:18:18 2025]  worker_thread+0x280/0x390
> [Mon Dec 22 10:18:18 2025]  kthread+0xdd/0x110
> [Mon Dec 22 10:18:18 2025]  ret_from_fork+0x31/0x50
> [Mon Dec 22 10:18:18 2025]  ret_from_fork_asm+0x1b/0x30

Where in the above call stack is the code that disables interrupts?

> 3. **The actual problem on PREEMPT_RT**: There's a preceding "scheduling while atomic"
>     error that provides the real context:
> 
> [Mon Dec 22 10:18:18 2025] BUG: scheduling while atomic: kworker/u385:1/2041/0x00000002
> [Mon Dec 22 10:18:18 2025] Call Trace:
> [Mon Dec 22 10:18:18 2025]  dump_stack_lvl+0x37/0x50
> [Mon Dec 22 10:18:18 2025]  __schedule_bug+0x52/0x60
> [Mon Dec 22 10:18:18 2025]  __schedule+0x87d/0xb10
> [Mon Dec 22 10:18:18 2025]  rt_mutex_schedule+0x21/0x40
> [Mon Dec 22 10:18:18 2025]  rt_mutex_slowlock_block.constprop.0+0x33/0x170
> [Mon Dec 22 10:18:18 2025]  __rt_mutex_slowlock_locked.constprop.0+0xc4/0x1e0
> [Mon Dec 22 10:18:18 2025]  mutex_lock+0x44/0x60
> [Mon Dec 22 10:18:18 2025]  __cpuhp_state_add_instance_cpuslocked+0x41/0x110
> [Mon Dec 22 10:18:18 2025]  __cpuhp_state_add_instance+0x48/0xd0
> [Mon Dec 22 10:18:18 2025]  blk_mq_realloc_hw_ctxs+0x405/0x420
> [Mon Dec 22 10:18:18 2025]  blk_mq_init_allocated_queue+0x10a/0x480

How is the above call stack related to the reported problem? The above
call stack is about request queue allocation while the reported problem
happens during request queue destruction.

> I apologize for the confusion in my commit message. Should I:
> 1. Revise the commit message to accurately describe the blk_queue_start_drain() path?
> 2. Add details about the PREEMPT_RT context causing the atomic state?

The answer to both questions is yes.

Thanks,

Bart.


* Re: [PATCH v2 2/2] block: Fix WARN_ON in blk_mq_run_hw_queue when called from interrupt context
  2026-01-06 12:29       ` Bart Van Assche
@ 2026-01-06 14:40         ` Ionut Nechita
  0 siblings, 0 replies; 12+ messages in thread
From: Ionut Nechita @ 2026-01-06 14:40 UTC (permalink / raw)
  To: bvanassche
  Cc: axboe, gregkh, ionut.nechita, linux-block, linux-kernel, ming.lei,
	muchun.song, sashal, stable

Hi Bart,

Thank you for the thorough and insightful review. You've identified several critical issues with my submission that I need to address.

> 6.6.71 is pretty far away from Jens' for-next branch. Please use Jens'
> for-next branch for testing kernel patches intended for the upstream kernel.

You're absolutely right. I was testing on the stable Debian kernel (6.6.71-rt) which was where the issue was originally reported. I will now fetch and test on Jens' for-next branch and ensure the issue reproduces there before resubmitting.

> Where in the above call stack is the code that disables interrupts?

This was poorly worded on my part, and I apologize for the confusion. The issue is NOT "interrupt context" in the hardirq sense.

What's actually happening:
- **Context:** kworker thread (async SCSI device scan)
- **State:** Running with preemption disabled (atomic context, not hardirq)
- **Path:** Queue destruction during device probe error cleanup
- **Trigger:** On PREEMPT_RT, in_interrupt() returns true when preemption is disabled, even in process context

The WARN_ON in blk_mq_run_hw_queue() at line 2291 is:
  WARN_ON_ONCE(!async && in_interrupt());

On PREEMPT_RT, this check fires because:
1. blk_freeze_queue_start() calls blk_mq_run_hw_queues(q, false) ← async=false
2. This eventually calls blk_mq_run_hw_queue() with async=false
3. in_interrupt() returns true (because preempt_count indicates atomic state)
4. WARN_ON triggers

So it's not "interrupt context" - it's atomic context (preemption disabled) being detected by in_interrupt() on RT kernel.

> How is the above call stack related to the reported problem? The above
> call stack is about request queue allocation while the reported problem
> happens during request queue destruction.

You're absolutely correct, and I apologize for the confusion. I mistakenly included two different call stacks in my commit message:

1. **"scheduling while atomic" during blk_mq_realloc_hw_ctxs** - This was from queue allocation and is a DIFFERENT issue. It should NOT have been included.

2. **WARN_ON during blk_queue_start_drain** - This is the ACTUAL issue that my patch addresses (queue destruction path).

I will revise the commit message to remove the unrelated allocation stack trace and focus solely on the queue destruction path.

> I apologize for the confusion in my commit message. Should I:
> 1. Revise the commit message to accurately describe the blk_queue_start_drain() path?
> 2. Add details about the PREEMPT_RT context causing the atomic state?
>
> The answer to both questions is yes.

Understood. I will prepare v3->v5 with the following corrections:

1. **Test on Jens' for-next branch** - Fetch, reproduce, and validate the fix on the upstream development tree

2. **Accurate context description** - Replace "IRQ thread context" with "kworker context with preemption disabled (atomic context on RT)"

3. **Single, clear call stack** - Remove the confusing allocation stack trace, focus only on the destruction path:
   ```
   scsi_alloc_sdev (error path)
   → __scsi_remove_device
   → blk_mq_destroy_queue
   → blk_queue_start_drain
   → blk_freeze_queue_start
   → blk_mq_run_hw_queues(q, false)  ← Problem: async=false
   ```

4. **Explain PREEMPT_RT specifics** - Clearly describe why in_interrupt() returns true in atomic context on RT kernel, and how changing to async=true avoids the problem

5. **Accurate problem statement** - This is about avoiding synchronous queue runs in atomic context on RT, not about MSI-X IRQ thread contention (that was a misunderstanding on my part)

I'll respond again once I've validated on for-next and have a corrected v3->v5 ready.

Thank you again for the detailed feedback.

Best regards,
Ionut
--
2.52.0


* Re: [PATCH v2 2/2] block: Fix WARN_ON in blk_mq_run_hw_queue when called from interrupt context
  2026-01-06 11:14     ` djiony2011
  2026-01-06 12:29       ` Bart Van Assche
@ 2026-01-06 15:04       ` Ming Lei
  2026-01-06 16:35         ` Ionut Nechita (WindRiver)
  1 sibling, 1 reply; 12+ messages in thread
From: Ming Lei @ 2026-01-06 15:04 UTC (permalink / raw)
  To: djiony2011
  Cc: axboe, gregkh, ionut.nechita, linux-block, linux-kernel,
	muchun.song, sashal, stable

On Tue, Jan 06, 2026 at 01:14:11PM +0200, djiony2011@gmail.com wrote:
> From: Ionut Nechita <ionut.nechita@windriver.com>
> 
> Hi Ming,
> 
> Thank you for the review. You're absolutely right to ask for clarification - I need to
> correct my commit message as it's misleading about the actual call path.
> 
> > Can you show the whole stack trace in the warning? The current code doesn't
> > indicate that freeze queue can be called from scsi's interrupt context.
> 
> Here's the complete stack trace from the WARNING at blk_mq_run_hw_queue:
> 
> [Mon Dec 22 10:18:18 2025] WARNING: CPU: 190 PID: 2041 at block/blk-mq.c:2291 blk_mq_run_hw_queue+0x1fa/0x260
> [Mon Dec 22 10:18:18 2025] Modules linked in:
> [Mon Dec 22 10:18:18 2025] CPU: 190 PID: 2041 Comm: kworker/u385:1 Tainted: G        W          6.6.0-1-rt-amd64 #1  Debian 6.6.71-1

There is a big difference between 6.6.0-1-rt and 6.19, because Real-Time "PREEMPT_RT" support was merged in Linux 6.12:

https://www.phoronix.com/news/Linux-6.12-Does-Real-Time


> [Mon Dec 22 10:18:18 2025] Hardware name: Dell Inc. PowerEdge R7615/09K9WP, BIOS 1.11.2 12/19/2024
> [Mon Dec 22 10:18:18 2025] Workqueue: events_unbound async_run_entry_fn
> [Mon Dec 22 10:18:18 2025] RIP: 0010:blk_mq_run_hw_queue+0x1fa/0x260
> [Mon Dec 22 10:18:18 2025] Code: ff 75 68 44 89 f6 e8 e5 45 c0 ff e9 ac fe ff ff e8 2b 70 c0 ff 48 89 ef e8 b3 a0 00 00 5b 5d 41 5c 41 5d 41 5e e9 26 9e c0 ff <0f> 0b e9 43 fe ff ff e8 0a 70 c0 ff 48 8b 85 d0 00 00 00 48 8b 80
> [Mon Dec 22 10:18:18 2025] RSP: 0018:ff630f098528fb98 EFLAGS: 00010206
> [Mon Dec 22 10:18:18 2025] RAX: 0000000000ff0000 RBX: 0000000000000000 RCX: 0000000000000000
> [Mon Dec 22 10:18:18 2025] RDX: 0000000000ff0000 RSI: 0000000000000000 RDI: ff3edc0247159400
> [Mon Dec 22 10:18:18 2025] RBP: ff3edc0247159400 R08: ff3edc0247159400 R09: ff630f098528fb60
> [Mon Dec 22 10:18:18 2025] R10: 0000000000000000 R11: 0000000045069ed3 R12: 0000000000000000
> [Mon Dec 22 10:18:18 2025] R13: ff3edc024715a828 R14: 0000000000000000 R15: 0000000000000000
> [Mon Dec 22 10:18:18 2025] FS:  0000000000000000(0000) GS:ff3edc10fd380000(0000) knlGS:0000000000000000
> [Mon Dec 22 10:18:18 2025] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [Mon Dec 22 10:18:18 2025] CR2: 0000000000000000 CR3: 000000073961a001 CR4: 0000000000771ee0
> [Mon Dec 22 10:18:18 2025] PKRU: 55555554
> [Mon Dec 22 10:18:18 2025] Call Trace:
> [Mon Dec 22 10:18:18 2025]  <TASK>
> [Mon Dec 22 10:18:18 2025]  ? __warn+0x89/0x140
> [Mon Dec 22 10:18:18 2025]  ? blk_mq_run_hw_queue+0x1fa/0x260
> [Mon Dec 22 10:18:18 2025]  ? report_bug+0x198/0x1b0
> [Mon Dec 22 10:18:18 2025]  ? handle_bug+0x53/0x90
> [Mon Dec 22 10:18:18 2025]  ? exc_invalid_op+0x18/0x70
> [Mon Dec 22 10:18:18 2025]  ? asm_exc_invalid_op+0x1a/0x20
> [Mon Dec 22 10:18:18 2025]  ? blk_mq_run_hw_queue+0x1fa/0x260
> [Mon Dec 22 10:18:18 2025]  blk_mq_run_hw_queues+0x6c/0x130
> [Mon Dec 22 10:18:18 2025]  blk_queue_start_drain+0x12/0x40
> [Mon Dec 22 10:18:18 2025]  blk_mq_destroy_queue+0x37/0x70
> [Mon Dec 22 10:18:18 2025]  __scsi_remove_device+0x6a/0x180
> [Mon Dec 22 10:18:18 2025]  scsi_alloc_sdev+0x357/0x360
> [Mon Dec 22 10:18:18 2025]  scsi_probe_and_add_lun+0x8ac/0xc00
> [Mon Dec 22 10:18:18 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
> [Mon Dec 22 10:18:18 2025]  ? dev_set_name+0x57/0x80
> [Mon Dec 22 10:18:18 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
> [Mon Dec 22 10:18:18 2025]  ? attribute_container_add_device+0x4d/0x130
> [Mon Dec 22 10:18:18 2025]  __scsi_scan_target+0xf0/0x520
> [Mon Dec 22 10:18:18 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
> [Mon Dec 22 10:18:18 2025]  ? sched_clock_cpu+0x64/0x190
> [Mon Dec 22 10:18:18 2025]  scsi_scan_channel+0x57/0x90
> [Mon Dec 22 10:18:18 2025]  scsi_scan_host_selected+0xd4/0x110
> [Mon Dec 22 10:18:18 2025]  do_scan_async+0x1c/0x190
> [Mon Dec 22 10:18:18 2025]  async_run_entry_fn+0x2f/0x130
> [Mon Dec 22 10:18:18 2025]  process_one_work+0x175/0x370
> [Mon Dec 22 10:18:18 2025]  worker_thread+0x280/0x390
> [Mon Dec 22 10:18:18 2025]  ? __pfx_worker_thread+0x10/0x10
> [Mon Dec 22 10:18:18 2025]  kthread+0xdd/0x110
> [Mon Dec 22 10:18:18 2025]  ? __pfx_kthread+0x10/0x10
> [Mon Dec 22 10:18:18 2025]  ret_from_fork+0x31/0x50
> [Mon Dec 22 10:18:18 2025]  ? __pfx_kthread+0x10/0x10
> [Mon Dec 22 10:18:18 2025]  ret_from_fork_asm+0x1b/0x30
> [Mon Dec 22 10:18:18 2025]  </TASK>
> [Mon Dec 22 10:18:18 2025] ---[ end trace 0000000000000000 ]---
> 
> ## Important clarifications:
> 
> 1. **Not freeze queue, but drain during destroy**: My commit message was incorrect.
>    The call path is:
>    blk_mq_destroy_queue() -> blk_queue_start_drain() -> blk_mq_run_hw_queues(q, false)
> 
>    This is NOT during blk_freeze_queue_start(), but during queue destruction when a
>    SCSI device probe fails and cleanup is triggered.
> 
> 2. **Not true interrupt context**: You're correct that this isn't from an interrupt
>    handler. The workqueue context is process context, not interrupt context.
> 
> 3. **The actual problem on PREEMPT_RT**: There's a preceding "scheduling while atomic"
>    error that provides the real context:
> 
> [Mon Dec 22 10:18:18 2025] BUG: scheduling while atomic: kworker/u385:1/2041/0x00000002
> [Mon Dec 22 10:18:18 2025] Call Trace:
> [Mon Dec 22 10:18:18 2025]  dump_stack_lvl+0x37/0x50
> [Mon Dec 22 10:18:18 2025]  __schedule_bug+0x52/0x60
> [Mon Dec 22 10:18:18 2025]  __schedule+0x87d/0xb10
> [Mon Dec 22 10:18:18 2025]  rt_mutex_schedule+0x21/0x40
> [Mon Dec 22 10:18:18 2025]  rt_mutex_slowlock_block.constprop.0+0x33/0x170
> [Mon Dec 22 10:18:18 2025]  __rt_mutex_slowlock_locked.constprop.0+0xc4/0x1e0
> [Mon Dec 22 10:18:18 2025]  mutex_lock+0x44/0x60
> [Mon Dec 22 10:18:18 2025]  __cpuhp_state_add_instance_cpuslocked+0x41/0x110
> [Mon Dec 22 10:18:18 2025]  __cpuhp_state_add_instance+0x48/0xd0
> [Mon Dec 22 10:18:18 2025]  blk_mq_realloc_hw_ctxs+0x405/0x420

Why is the above warning related to your patch?

> [Mon Dec 22 10:18:18 2025]  blk_mq_init_allocated_queue+0x10a/0x480
> 
> The context is atomic because on PREEMPT_RT, some spinlock earlier in the call chain has
> been converted to an rt_mutex, and the code is holding that lock. When blk_mq_run_hw_queues()
> is called with async=false, it triggers kblockd_mod_delayed_work_on(), which calls
> in_interrupt(), and this returns true because preempt_count() is non-zero due to the
> rt_mutex being held.
> 
> ## What this means:
> 
> The issue is specific to PREEMPT_RT where:
> - Spinlocks become sleeping mutexes (rt_mutex)
> - Holding an rt_mutex sets preempt_count, making in_interrupt() return true
> - blk_mq_run_hw_queues() with async=false hits WARN_ON_ONCE(!async && in_interrupt())

If you think the same issue exists on a recent kernel, show the stack trace.

Or please share how preempt is disabled in the above blk_mq_run_hw_queues code
path.


Thanks,
Ming



* Re: [PATCH v2 2/2] block: Fix WARN_ON in blk_mq_run_hw_queue when called from interrupt context
  2026-01-06 15:04       ` Ming Lei
@ 2026-01-06 16:35         ` Ionut Nechita (WindRiver)
  0 siblings, 0 replies; 12+ messages in thread
From: Ionut Nechita (WindRiver) @ 2026-01-06 16:35 UTC (permalink / raw)
  To: ming.lei
  Cc: axboe, gregkh, ionut.nechita, linux-block, linux-kernel,
	muchun.song, sashal, stable

From: Ionut Nechita <ionut.nechita@windriver.com>

Hi Ming,

Thank you for the thorough review. You've identified critical issues with my analysis.

> There is a big difference between 6.6.0-1-rt and 6.19, because Real-Time
> "PREEMPT_RT" support was merged in Linux 6.12:

You're absolutely right. I tested on Debian's 6.6.71-rt which uses the out-of-tree
RT patches. I will retest on Friday with both 6.12 (first kernel with merged RT
support) and linux-next to confirm whether this issue still exists in current upstream.

> Why is the above warning related to your patch?

After reviewing the complete dmesg log, I now see there are TWO separate errors
from the same process (PID 2041):

**Error #1** - Root cause (the one you highlighted):
```
BUG: scheduling while atomic: kworker/u385:1/2041/0x00000002
  mutex_lock
  → __cpuhp_state_add_instance
  → blk_mq_realloc_hw_ctxs
  → blk_mq_init_queue
  → scsi_alloc_sdev          ← Queue ALLOCATION
```

**Error #2** - Symptom (the one my patch addresses):
```
WARNING at blk_mq_run_hw_queue+0x1fa
  blk_mq_run_hw_queue
  → blk_mq_run_hw_queues
  → blk_queue_start_drain
  → blk_mq_destroy_queue
  → __scsi_remove_device
  → scsi_alloc_sdev          ← Queue DESTRUCTION (cleanup)
```

The sequence is:
1. Queue allocation in scsi_alloc_sdev() hits Error #1 (mutex in atomic context)
2. Allocation fails, enters cleanup path
3. Cleanup calls blk_mq_destroy_queue() while STILL in atomic context
4. blk_queue_start_drain() → blk_mq_run_hw_queues(q, false)
5. WARN_ON(!async && in_interrupt()) triggers → Error #2

> Or please share how preempt is disabled in the above blk_mq_run_hw_queues
> code path.

The atomic context (preempt_count = 0x00000002) is inherited from Error #1.
The code is already in atomic state when it enters the cleanup path.

> If you think the same issue exists on a recent kernel, show the stack trace.

I don't have current data from upstream kernels. I will test on Friday and provide:
1. Results from 6.12-rt (first kernel with merged RT support)
2. Results from linux-next
3. Complete stack traces if the issue reproduces

If the issue still exists on current upstream, I need to address Error #1 (the
root cause) rather than Error #2 (the symptom). My current patch only suppresses
the warning during cleanup but doesn't fix the underlying atomic context problem.

I will report back with test results on Friday.

For reference, here is the full dmesg excerpt from the boot described above:

BUG: scheduling while atomic: kworker/u385:1/2041/0x00000002
Modules linked in:
CPU: 190 PID: 2041 Comm: kworker/u385:1 Not tainted 6.6.0-1-rt-amd64 #1  Debian 6.6.71-1
Hardware name: Dell Inc. PowerEdge R7615/09K9WP, BIOS 1.11.2 12/19/2024
Workqueue: events_unbound async_run_entry_fn
Call Trace:
 <TASK>
 dump_stack_lvl+0x37/0x50
 __schedule_bug+0x52/0x60
 __schedule+0x87d/0xb10
 rt_mutex_schedule+0x21/0x40
 rt_mutex_slowlock_block.constprop.0+0x33/0x170
 __rt_mutex_slowlock_locked.constprop.0+0xc4/0x1e0
 mutex_lock+0x44/0x60
 __cpuhp_state_add_instance_cpuslocked+0x41/0x110
 __cpuhp_state_add_instance+0x48/0xd0
 blk_mq_realloc_hw_ctxs+0x405/0x420
 blk_mq_init_allocated_queue+0x10a/0x480
intel_rapl_common: Found RAPL domain package
 ? srso_alias_return_thunk+0x5/0xfbef5
intel_rapl_common: Found RAPL domain core
 ? percpu_ref_init+0x6e/0x130
 blk_mq_init_queue+0x3c/0x70
 scsi_alloc_sdev+0x225/0x360
 scsi_probe_and_add_lun+0x8ac/0xc00
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? dev_set_name+0x57/0x80
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? attribute_container_add_device+0x4d/0x130
 __scsi_scan_target+0xf0/0x520
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? sched_clock_cpu+0x64/0x190
 scsi_scan_channel+0x57/0x90
 scsi_scan_host_selected+0xd4/0x110
 do_scan_async+0x1c/0x190
 async_run_entry_fn+0x2f/0x130
 process_one_work+0x175/0x370
 worker_thread+0x280/0x390
 ? __pfx_worker_thread+0x10/0x10
 kthread+0xdd/0x110
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x31/0x50
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1b/0x30
 </TASK>
gnss: GNSS driver registered with major 241
------------[ cut here ]------------
WARNING: CPU: 190 PID: 2041 at block/blk-mq.c:2291 blk_mq_run_hw_queue+0x1fa/0x260
Modules linked in:
CPU: 190 PID: 2041 Comm: kworker/u385:1 Tainted: G        W          6.6.0-1-rt-amd64 #1  Debian 6.6.71-1
Hardware name: Dell Inc. PowerEdge R7615/09K9WP, BIOS 1.11.2 12/19/2024
Workqueue: events_unbound async_run_entry_fn
RIP: 0010:blk_mq_run_hw_queue+0x1fa/0x260
Code: ff 75 68 44 89 f6 e8 e5 45 c0 ff e9 ac fe ff ff e8 2b 70 c0 ff 48 89 ef e8 b3 a0 00 00 5b 5d 41 5c 41 5d 41 5e e9 26 9e c0 ff <0f> 0b e9 43 fe ff ff e8 0a 70 c0 ff 48 8b 85 d0 00 00 00 48 8b 80
RSP: 0018:ff630f098528fb98 EFLAGS: 00010206
RAX: 0000000000ff0000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000ff0000 RSI: 0000000000000000 RDI: ff3edc0247159400
RBP: ff3edc0247159400 R08: ff3edc0247159400 R09: ff630f098528fb60
R10: 0000000000000000 R11: 0000000045069ed3 R12: 0000000000000000
R13: ff3edc024715a828 R14: 0000000000000000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ff3edc10fd380000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000073961a001 CR4: 0000000000771ee0
PKRU: 55555554
Call Trace:
 <TASK>
 ? __warn+0x89/0x140
 ? blk_mq_run_hw_queue+0x1fa/0x260
 ? report_bug+0x198/0x1b0
 ? handle_bug+0x53/0x90
 ? exc_invalid_op+0x18/0x70
 ? asm_exc_invalid_op+0x1a/0x20
 ? blk_mq_run_hw_queue+0x1fa/0x260
 blk_mq_run_hw_queues+0x6c/0x130
 blk_queue_start_drain+0x12/0x40
 blk_mq_destroy_queue+0x37/0x70
 __scsi_remove_device+0x6a/0x180
 scsi_alloc_sdev+0x357/0x360
 scsi_probe_and_add_lun+0x8ac/0xc00
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? dev_set_name+0x57/0x80
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? attribute_container_add_device+0x4d/0x130
 __scsi_scan_target+0xf0/0x520
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? sched_clock_cpu+0x64/0x190
 scsi_scan_channel+0x57/0x90
 scsi_scan_host_selected+0xd4/0x110
 do_scan_async+0x1c/0x190
 async_run_entry_fn+0x2f/0x130
 process_one_work+0x175/0x370
 worker_thread+0x280/0x390
 ? __pfx_worker_thread+0x10/0x10
 kthread+0xdd/0x110
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x31/0x50
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1b/0x30
 </TASK>
---[ end trace 0000000000000000 ]---
------------[ cut here ]------------
WARNING: CPU: 190 PID: 2041 at kernel/time/timer.c:1570 __timer_delete_sync+0x152/0x170
Modules linked in:
CPU: 190 PID: 2041 Comm: kworker/u385:1 Tainted: G        W          6.6.0-1-rt-amd64 #1  Debian 6.6.71-1
Hardware name: Dell Inc. PowerEdge R7615/09K9WP, BIOS 1.11.2 12/19/2024
Workqueue: events_unbound async_run_entry_fn
RIP: 0010:__timer_delete_sync+0x152/0x170
Code: 8b 04 24 4c 89 c7 e8 ad 11 b9 00 f0 ff 4d 30 4c 8b 04 24 4c 89 c7 e8 8d 03 b9 00 be 00 02 00 00 4c 89 ff e8 e0 83 f3 ff eb 93 <0f> 0b e9 e8 fe ff ff 49 8d 2c 16 eb a8 e8 5c 49 b8 00 66 66 2e 0f
RSP: 0018:ff630f098528fba8 EFLAGS: 00010246
RAX: 000000007fffffff RBX: ff3edc02829426d0 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ff3edc02829426d0
RBP: ff3edc02829425b0 R08: ff3edc0282942938 R09: ff630f098528fba0
R10: 0000000000000000 R11: 0000000045069ed3 R12: 0000000000000000
R13: ff3edc024715a828 R14: 0000000000000000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ff3edc10fd380000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000073961a001 CR4: 0000000000771ee0
PKRU: 55555554
Call Trace:
 <TASK>
 ? __warn+0x89/0x140
 ? __timer_delete_sync+0x152/0x170
 ? report_bug+0x198/0x1b0
 ? handle_bug+0x53/0x90
 ? exc_invalid_op+0x18/0x70
 ? asm_exc_invalid_op+0x1a/0x20
 ? __timer_delete_sync+0x152/0x170
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? percpu_ref_is_zero+0x3b/0x50
 ? srso_alias_return_thunk+0x5/0xfbef5
 blk_sync_queue+0x19/0x30
 blk_mq_destroy_queue+0x47/0x70
 __scsi_remove_device+0x6a/0x180
 scsi_alloc_sdev+0x357/0x360
 scsi_probe_and_add_lun+0x8ac/0xc00
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? dev_set_name+0x57/0x80
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? attribute_container_add_device+0x4d/0x130
 __scsi_scan_target+0xf0/0x520
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? sched_clock_cpu+0x64/0x190
 scsi_scan_channel+0x57/0x90
 scsi_scan_host_selected+0xd4/0x110
 do_scan_async+0x1c/0x190
 async_run_entry_fn+0x2f/0x130
 process_one_work+0x175/0x370
 worker_thread+0x280/0x390
 ? __pfx_worker_thread+0x10/0x10
 kthread+0xdd/0x110
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x31/0x50
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1b/0x30
 </TASK>
---[ end trace 0000000000000000 ]---
drop_monitor: Initializing network drop monitor service
------------[ cut here ]------------
WARNING: CPU: 190 PID: 2041 at kernel/time/timer.c:1570 __timer_delete_sync+0x152/0x170
Modules linked in:
CPU: 190 PID: 2041 Comm: kworker/u385:1 Tainted: G        W          6.6.0-1-rt-amd64 #1  Debian 6.6.71-1
Hardware name: Dell Inc. PowerEdge R7615/09K9WP, BIOS 1.11.2 12/19/2024
Workqueue: events_unbound async_run_entry_fn
RIP: 0010:__timer_delete_sync+0x152/0x170
Code: 8b 04 24 4c 89 c7 e8 ad 11 b9 00 f0 ff 4d 30 4c 8b 04 24 4c 89 c7 e8 8d 03 b9 00 be 00 02 00 00 4c 89 ff e8 e0 83 f3 ff eb 93 <0f> 0b e9 e8 fe ff ff 49 8d 2c 16 eb a8 e8 5c 49 b8 00 66 66 2e 0f
RSP: 0018:ff630f098528fba8 EFLAGS: 00010246
RAX: 000000007fffffff RBX: ff3edc0282943790 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ff3edc0282943790
RBP: ff3edc0282943670 R08: ff3edc02829439f8 R09: ff630f098528fba0
R10: 0000000000000000 R11: 00000000b3b80e06 R12: 0000000000000000
R13: ff3edc02828dc428 R14: 0000000000000000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ff3edc10fd380000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000073961a001 CR4: 0000000000771ee0
PKRU: 55555554
Call Trace:
 <TASK>
 ? __warn+0x89/0x140
 ? __timer_delete_sync+0x152/0x170
 ? report_bug+0x198/0x1b0
 ? handle_bug+0x53/0x90
 ? exc_invalid_op+0x18/0x70
 ? asm_exc_invalid_op+0x1a/0x20
 ? __timer_delete_sync+0x152/0x170
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? percpu_ref_is_zero+0x3b/0x50
 ? srso_alias_return_thunk+0x5/0xfbef5
 blk_sync_queue+0x19/0x30
 blk_mq_destroy_queue+0x47/0x70
 __scsi_remove_device+0x6a/0x180
 scsi_alloc_sdev+0x357/0x360
 scsi_probe_and_add_lun+0x8ac/0xc00
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? dev_set_name+0x57/0x80
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? attribute_container_add_device+0x4d/0x130
 __scsi_scan_target+0xf0/0x520
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? sched_clock_cpu+0x64/0x190
 scsi_scan_channel+0x57/0x90
 scsi_scan_host_selected+0xd4/0x110
 do_scan_async+0x1c/0x190
 async_run_entry_fn+0x2f/0x130
 process_one_work+0x175/0x370
 worker_thread+0x280/0x390
 ? __pfx_worker_thread+0x10/0x10
 kthread+0xdd/0x110
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x31/0x50
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1b/0x30
 </TASK>
---[ end trace 0000000000000000 ]---
------------[ cut here ]------------
WARNING: CPU: 190 PID: 2041 at kernel/time/timer.c:1570 __timer_delete_sync+0x152/0x170
Modules linked in:
CPU: 190 PID: 2041 Comm: kworker/u385:1 Tainted: G        W          6.6.0-1-rt-amd64 #1  Debian 6.6.71-1
Hardware name: Dell Inc. PowerEdge R7615/09K9WP, BIOS 1.11.2 12/19/2024
Workqueue: events_unbound async_run_entry_fn
RIP: 0010:__timer_delete_sync+0x152/0x170
Code: 8b 04 24 4c 89 c7 e8 ad 11 b9 00 f0 ff 4d 30 4c 8b 04 24 4c 89 c7 e8 8d 03 b9 00 be 00 02 00 00 4c 89 ff e8 e0 83 f3 ff eb 93 <0f> 0b e9 e8 fe ff ff 49 8d 2c 16 eb a8 e8 5c 49 b8 00 66 66 2e 0f
RSP: 0018:ff630f098528fba8 EFLAGS: 00010246
RAX: 000000007fffffff RBX: ff3edc0282944420 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ff3edc0282944420
RBP: ff3edc0282944300 R08: ff3edc0282944688 R09: ff630f098528fba0
R10: 0000000000000000 R11: 0000000043ba156d R12: 0000000000000000
R13: ff3edc02829ec028 R14: 0000000000000000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ff3edc10fd380000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000073961a001 CR4: 0000000000771ee0
PKRU: 55555554
Call Trace:
 <TASK>
 ? __warn+0x89/0x140
 ? __timer_delete_sync+0x152/0x170
 ? report_bug+0x198/0x1b0
 ? handle_bug+0x53/0x90
 ? exc_invalid_op+0x18/0x70
 ? asm_exc_invalid_op+0x1a/0x20
 ? __timer_delete_sync+0x152/0x170
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? percpu_ref_is_zero+0x3b/0x50
 ? srso_alias_return_thunk+0x5/0xfbef5
 blk_sync_queue+0x19/0x30
 blk_mq_destroy_queue+0x47/0x70
 __scsi_remove_device+0x6a/0x180
 scsi_alloc_sdev+0x357/0x360
 scsi_probe_and_add_lun+0x8ac/0xc00
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? dev_set_name+0x57/0x80
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? attribute_container_add_device+0x4d/0x130
 __scsi_scan_target+0xf0/0x520
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? sched_clock_cpu+0x64/0x190
 scsi_scan_channel+0x57/0x90
 scsi_scan_host_selected+0xd4/0x110
 do_scan_async+0x1c/0x190
 async_run_entry_fn+0x2f/0x130
 process_one_work+0x175/0x370
 worker_thread+0x280/0x390
 ? __pfx_worker_thread+0x10/0x10
 kthread+0xdd/0x110
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x31/0x50
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1b/0x30
 </TASK>
---[ end trace 0000000000000000 ]---
Initializing XFRM netlink socket


Thank you for the careful review,
Ionut

