public inbox for stable@vger.kernel.org
 help / color / mirror / Atom feed
From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: Justin Sanders <jsanders.devel@gmail.com>,
	Valentin Kleibel <valentin@vrvis.at>,
	Jens Axboe <axboe@kernel.dk>, Sasha Levin <sashal@kernel.org>,
	justin@coraid.com
Subject: [PATCH AUTOSEL 6.15 15/20] aoe: defer rexmit timer downdev work to workqueue
Date: Tue, 24 Jun 2025 00:11:14 -0400	[thread overview]
Message-ID: <20250624041120.83191-15-sashal@kernel.org> (raw)
In-Reply-To: <20250624041120.83191-1-sashal@kernel.org>

From: Justin Sanders <jsanders.devel@gmail.com>

[ Upstream commit cffc873d68ab09a0432b8212008c5613f8a70a2c ]

When aoe's rexmit_timer() notices that an aoe target fails to respond to
commands for more than aoe_deadsecs, it calls aoedev_downdev() which
cleans the outstanding aoe and block queues. This can involve sleeping,
such as in blk_mq_freeze_queue(), which should not occur in irq context.

This patch defers that aoedev_downdev() call to the aoe device's
workqueue.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=212665
Signed-off-by: Justin Sanders <jsanders.devel@gmail.com>
Link: https://lore.kernel.org/r/20250610170600.869-2-jsanders.devel@gmail.com
Tested-By: Valentin Kleibel <valentin@vrvis.at>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

Based on my analysis of the commit and the kernel repository context,
here is my determination:

**YES**

This commit should be backported to stable kernel trees. Here's my
extensive explanation:

## 1. **Critical Bug Fix for Kernel Crash/Hang**

The commit fixes a serious "scheduling while atomic" bug that can cause
kernel crashes. The issue occurs when:
- The `rexmit_timer()` function (timer callback, runs in softirq
  context) detects a device timeout
- It directly calls `aoedev_downdev()` which contains
  `blk_mq_freeze_queue()` - a sleeping operation
- Calling a sleeping function from atomic context violates kernel rules
  and causes crashes

## 2. **Clear Root Cause and Minimal Fix**

The fix is minimal and targeted:
- Adds a new flag `DEVFL_DEAD` to mark devices that need to be downed
- Instead of calling `aoedev_downdev()` directly from timer context, it
  sets the flag and queues work
- The actual `aoedev_downdev()` call happens in workqueue context where
  sleeping is allowed

## 3. **Part of a Series Fixing Related Issues**

This is part of a series addressing bugzilla #212665, including:
- This commit: Fixes the timer context sleeping issue
- Commit 7f90d45e57cb: Fixes the hang in `blk_mq_freeze_queue()` by
  properly cleaning request lists

Both commits are needed to fully resolve the reported issue.

## 4. **Regression from blk-mq Conversion**

The issue was introduced by commit 3582dd291788 ("aoe: convert aoeblk to
blk-mq"), making this a regression fix. The blk-mq conversion added
sleeping operations that weren't safe to call from existing timer
contexts.

## 5. **Similar to Previously Backported Fixes**

Looking at the similar commits:
- Commit 430380b4637a (fixing atomic sleep in flush) was marked as
  suitable for backporting
- It addressed the same class of bug in a different code path
- This establishes precedent for backporting such fixes

## 6. **Minimal Risk of Regression**

The changes are:
- Well-contained to the aoe driver
- Follow established patterns (deferring to workqueue)
- Don't change any user-visible behavior
- Only change the execution context of an existing operation

## 7. **Meets Stable Criteria**

According to stable tree rules, this commit:
- ✓ Fixes a real bug (kernel crash/scheduling while atomic)
- ✓ Has been tested (per commit message)
- ✓ Is minimal and targeted
- ✓ Fixes a regression from previous changes
- ✓ Has clear explanation and reproducer (bugzilla link)

The only consideration is that it should be backported together with
commit 7f90d45e57cb to ensure the complete fix for bugzilla #212665 is
applied.

 drivers/block/aoe/aoe.h    | 1 +
 drivers/block/aoe/aoecmd.c | 8 ++++++--
 drivers/block/aoe/aoedev.c | 5 ++++-
 3 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/drivers/block/aoe/aoe.h b/drivers/block/aoe/aoe.h
index 749ae1246f4cf..d35caa3c69e15 100644
--- a/drivers/block/aoe/aoe.h
+++ b/drivers/block/aoe/aoe.h
@@ -80,6 +80,7 @@ enum {
 	DEVFL_NEWSIZE = (1<<6),	/* need to update dev size in block layer */
 	DEVFL_FREEING = (1<<7),	/* set when device is being cleaned up */
 	DEVFL_FREED = (1<<8),	/* device has been cleaned up */
+	DEVFL_DEAD = (1<<9),	/* device has timed out of aoe_deadsecs */
 };
 
 enum {
diff --git a/drivers/block/aoe/aoecmd.c b/drivers/block/aoe/aoecmd.c
index 92b06d1de4cc7..6c94cfd1c480e 100644
--- a/drivers/block/aoe/aoecmd.c
+++ b/drivers/block/aoe/aoecmd.c
@@ -754,7 +754,7 @@ rexmit_timer(struct timer_list *timer)
 
 	utgts = count_targets(d, NULL);
 
-	if (d->flags & DEVFL_TKILL) {
+	if (d->flags & (DEVFL_TKILL | DEVFL_DEAD)) {
 		spin_unlock_irqrestore(&d->lock, flags);
 		return;
 	}
@@ -786,7 +786,8 @@ rexmit_timer(struct timer_list *timer)
 			 * to clean up.
 			 */
 			list_splice(&flist, &d->factive[0]);
-			aoedev_downdev(d);
+			d->flags |= DEVFL_DEAD;
+			queue_work(aoe_wq, &d->work);
 			goto out;
 		}
 
@@ -898,6 +899,9 @@ aoecmd_sleepwork(struct work_struct *work)
 {
 	struct aoedev *d = container_of(work, struct aoedev, work);
 
+	if (d->flags & DEVFL_DEAD)
+		aoedev_downdev(d);
+
 	if (d->flags & DEVFL_GDALLOC)
 		aoeblk_gdalloc(d);
 
diff --git a/drivers/block/aoe/aoedev.c b/drivers/block/aoe/aoedev.c
index 141b2a0e03f2c..0189fc3cbbefe 100644
--- a/drivers/block/aoe/aoedev.c
+++ b/drivers/block/aoe/aoedev.c
@@ -199,8 +199,11 @@ aoedev_downdev(struct aoedev *d)
 	struct aoetgt *t, **tt, **te;
 	struct list_head *head, *pos, *nx;
 	int i;
+	unsigned long flags;
 
-	d->flags &= ~DEVFL_UP;
+	spin_lock_irqsave(&d->lock, flags);
+	d->flags &= ~(DEVFL_UP | DEVFL_DEAD);
+	spin_unlock_irqrestore(&d->lock, flags);
 
 	/* clean out active and to-be-retransmitted buffers */
 	for (i = 0; i < NFACTIVE; i++) {
-- 
2.39.5


  parent reply	other threads:[~2025-06-24  4:11 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-06-24  4:11 [PATCH AUTOSEL 6.15 01/20] x86/platform/amd: move final timeout check to after final sleep Sasha Levin
2025-06-24  4:11 ` [PATCH AUTOSEL 6.15 02/20] drm/msm: Fix a fence leak in submit error path Sasha Levin
2025-06-24  4:11 ` [PATCH AUTOSEL 6.15 03/20] drm/msm: Fix another leak in the " Sasha Levin
2025-06-24  4:11 ` [PATCH AUTOSEL 6.15 04/20] ALSA: sb: Don't allow changing the DMA mode during operations Sasha Levin
2025-06-24  4:11 ` [PATCH AUTOSEL 6.15 05/20] ALSA: sb: Force to disable DMAs once when DMA mode is changed Sasha Levin
2025-06-24  4:11 ` [PATCH AUTOSEL 6.15 06/20] ata: libata-acpi: Do not assume 40 wire cable if no devices are enabled Sasha Levin
2025-06-24  4:11 ` [PATCH AUTOSEL 6.15 07/20] ata: pata_cs5536: fix build on 32-bit UML Sasha Levin
2025-06-24  4:11 ` [PATCH AUTOSEL 6.15 08/20] ASoC: amd: yc: Add quirk for MSI Bravo 17 D7VF internal mic Sasha Levin
2025-06-24  4:11 ` [PATCH AUTOSEL 6.15 09/20] platform/x86/amd/pmc: Add PCSpecialist Lafite Pro V 14M to 8042 quirks list Sasha Levin
2025-06-24  4:11 ` [PATCH AUTOSEL 6.15 10/20] genirq/irq_sim: Initialize work context pointers properly Sasha Levin
2025-06-24  4:11 ` [PATCH AUTOSEL 6.15 11/20] powerpc: Fix struct termio related ioctl macros Sasha Levin
2025-06-24  4:11 ` [PATCH AUTOSEL 6.15 12/20] ASoC: amd: yc: update quirk data for HP Victus Sasha Levin
2025-06-24  4:11 ` [PATCH AUTOSEL 6.15 13/20] regulator: fan53555: add enable_time support and soft-start times Sasha Levin
2025-06-24  4:11 ` [PATCH AUTOSEL 6.15 14/20] scsi: target: Fix NULL pointer dereference in core_scsi3_decode_spec_i_port() Sasha Levin
2025-06-24  4:11 ` Sasha Levin [this message]
2025-06-24  4:11 ` [PATCH AUTOSEL 6.15 16/20] wifi: mac80211: drop invalid source address OCB frames Sasha Levin
2025-06-24  4:11 ` [PATCH AUTOSEL 6.15 17/20] wifi: ath6kl: remove WARN on bad firmware input Sasha Levin
2025-06-24  4:11 ` [PATCH AUTOSEL 6.15 18/20] ACPICA: Refuse to evaluate a method if arguments are missing Sasha Levin
2025-06-24  4:11 ` [PATCH AUTOSEL 6.15 19/20] mtd: spinand: fix memory leak of ECC engine conf Sasha Levin
2025-06-24  4:11 ` [PATCH AUTOSEL 6.15 20/20] rcu: Return early if callback is not specified Sasha Levin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250624041120.83191-15-sashal@kernel.org \
    --to=sashal@kernel.org \
    --cc=axboe@kernel.dk \
    --cc=jsanders.devel@gmail.com \
    --cc=justin@coraid.com \
    --cc=patches@lists.linux.dev \
    --cc=stable@vger.kernel.org \
    --cc=valentin@vrvis.at \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox