From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: Justin Sanders <jsanders.devel@gmail.com>,
Valentin Kleibel <valentin@vrvis.at>,
Jens Axboe <axboe@kernel.dk>, Sasha Levin <sashal@kernel.org>,
justin@coraid.com
Subject: [PATCH AUTOSEL 6.15 15/20] aoe: defer rexmit timer downdev work to workqueue
Date: Tue, 24 Jun 2025 00:11:14 -0400
Message-ID: <20250624041120.83191-15-sashal@kernel.org>
In-Reply-To: <20250624041120.83191-1-sashal@kernel.org>
From: Justin Sanders <jsanders.devel@gmail.com>
[ Upstream commit cffc873d68ab09a0432b8212008c5613f8a70a2c ]
When aoe's rexmit_timer() notices that an aoe target fails to respond to
commands for more than aoe_deadsecs, it calls aoedev_downdev() which
cleans the outstanding aoe and block queues. This can involve sleeping,
such as in blk_mq_freeze_queue(), which should not occur in irq context.
This patch defers that aoedev_downdev() call to the aoe device's
workqueue.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=212665
Signed-off-by: Justin Sanders <jsanders.devel@gmail.com>
Link: https://lore.kernel.org/r/20250610170600.869-2-jsanders.devel@gmail.com
Tested-By: Valentin Kleibel <valentin@vrvis.at>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
Based on my analysis of the commit and the kernel repository context,
here is my determination:
**YES**
This commit should be backported to stable kernel trees. The detailed
reasoning follows:
## 1. **Critical Bug Fix for Kernel Crash/Hang**
The commit fixes a serious "scheduling while atomic" bug that can cause
kernel crashes. The issue occurs when:
- The `rexmit_timer()` function (a timer callback that runs in softirq
context) detects a device timeout
- It directly calls `aoedev_downdev()`, which in turn calls
`blk_mq_freeze_queue()`, a function that may sleep
- Sleeping in atomic context violates kernel rules and can crash or
hang the kernel
## 2. **Clear Root Cause and Minimal Fix**
The fix is minimal and targeted:
- Adds a new flag `DEVFL_DEAD` to mark devices that need to be downed
- Instead of calling `aoedev_downdev()` directly from timer context, it
sets the flag and queues work
- The actual `aoedev_downdev()` call happens in workqueue context where
sleeping is allowed
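
The deferral pattern described above can be sketched in plain userspace
C. This is only an illustration under stated assumptions, not the driver
code: every `sim_*` name is hypothetical, the "atomic context" is a
simple boolean, and the workqueue is reduced to a pending flag that a
caller drains by hand. The point it shows is that the timer path only
marks the device dead and requests work, while the teardown that may
sleep runs later from the work handler.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the driver's flags; not kernel definitions. */
enum {
    SIM_DEVFL_UP   = 1 << 0,
    SIM_DEVFL_DEAD = 1 << 1,
};

struct sim_dev {
    unsigned int flags;
    bool work_pending;      /* models a queued work item */
    int downdev_calls;      /* how often the teardown ran */
    int atomic_sleep_bugs;  /* teardown attempts made in "atomic" context */
};

/* Models "we are inside a timer callback and must not sleep". */
static bool sim_in_atomic;

/* Stand-in for the teardown that may sleep. */
static void sim_downdev(struct sim_dev *d)
{
    if (sim_in_atomic)
        d->atomic_sleep_bugs++; /* the class of bug the patch avoids */
    d->flags &= ~(SIM_DEVFL_UP | SIM_DEVFL_DEAD);
    d->downdev_calls++;
}

/* Timer path: mark the device dead and request work, nothing more. */
static void sim_timer_expired(struct sim_dev *d)
{
    sim_in_atomic = true;
    if (!(d->flags & SIM_DEVFL_DEAD)) {
        d->flags |= SIM_DEVFL_DEAD;
        d->work_pending = true;
    }
    sim_in_atomic = false;
}

/* Work handler: runs outside the timer path, so sleeping is allowed. */
static void sim_run_work(struct sim_dev *d)
{
    assert(!sim_in_atomic);
    if (!d->work_pending)
        return;
    d->work_pending = false;
    if (d->flags & SIM_DEVFL_DEAD)
        sim_downdev(d);
}
```

In this sketch a second timer expiry before the work has run is a no-op,
which loosely mirrors the early return the patch adds for `DEVFL_DEAD`
at the top of `rexmit_timer()`.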
## 3. **Part of a Series Fixing Related Issues**
This is part of a series addressing bugzilla #212665, including:
- This commit: Fixes the timer context sleeping issue
- Commit 7f90d45e57cb: Fixes the hang in `blk_mq_freeze_queue()` by
properly cleaning request lists
Both commits are needed to fully resolve the reported issue.
## 4. **Regression from blk-mq Conversion**
The issue was introduced by commit 3582dd291788 ("aoe: convert aoeblk to
blk-mq"), making this a regression fix. The blk-mq conversion added
sleeping operations that weren't safe to call from existing timer
contexts.
## 5. **Similar to Previously Backported Fixes**
Looking at similar commits:
- Commit 430380b4637a (fixing atomic sleep in flush) was marked as
suitable for backporting
- It addressed the same class of bug in a different code path
- This establishes precedent for backporting such fixes
## 6. **Minimal Risk of Regression**
The changes are:
- Well-contained to the aoe driver
- Follow established patterns (deferring to workqueue)
- Don't change any user-visible behavior
- Only change the execution context of an existing operation
## 7. **Meets Stable Criteria**
According to stable tree rules, this commit:
- ✓ Fixes a real bug (kernel crash/scheduling while atomic)
- ✓ Has been tested (per commit message)
- ✓ Is minimal and targeted
- ✓ Fixes a regression from previous changes
- ✓ Has a clear explanation and a linked bug report (bugzilla)
The only consideration is that it should be backported together with
commit 7f90d45e57cb to ensure the complete fix for bugzilla #212665 is
applied.
drivers/block/aoe/aoe.h | 1 +
drivers/block/aoe/aoecmd.c | 8 ++++++--
drivers/block/aoe/aoedev.c | 5 ++++-
3 files changed, 11 insertions(+), 3 deletions(-)
diff --git a/drivers/block/aoe/aoe.h b/drivers/block/aoe/aoe.h
index 749ae1246f4cf..d35caa3c69e15 100644
--- a/drivers/block/aoe/aoe.h
+++ b/drivers/block/aoe/aoe.h
@@ -80,6 +80,7 @@ enum {
DEVFL_NEWSIZE = (1<<6), /* need to update dev size in block layer */
DEVFL_FREEING = (1<<7), /* set when device is being cleaned up */
DEVFL_FREED = (1<<8), /* device has been cleaned up */
+ DEVFL_DEAD = (1<<9), /* device has timed out of aoe_deadsecs */
};
enum {
diff --git a/drivers/block/aoe/aoecmd.c b/drivers/block/aoe/aoecmd.c
index 92b06d1de4cc7..6c94cfd1c480e 100644
--- a/drivers/block/aoe/aoecmd.c
+++ b/drivers/block/aoe/aoecmd.c
@@ -754,7 +754,7 @@ rexmit_timer(struct timer_list *timer)
utgts = count_targets(d, NULL);
- if (d->flags & DEVFL_TKILL) {
+ if (d->flags & (DEVFL_TKILL | DEVFL_DEAD)) {
spin_unlock_irqrestore(&d->lock, flags);
return;
}
@@ -786,7 +786,8 @@ rexmit_timer(struct timer_list *timer)
* to clean up.
*/
list_splice(&flist, &d->factive[0]);
- aoedev_downdev(d);
+ d->flags |= DEVFL_DEAD;
+ queue_work(aoe_wq, &d->work);
goto out;
}
@@ -898,6 +899,9 @@ aoecmd_sleepwork(struct work_struct *work)
{
struct aoedev *d = container_of(work, struct aoedev, work);
+ if (d->flags & DEVFL_DEAD)
+ aoedev_downdev(d);
+
if (d->flags & DEVFL_GDALLOC)
aoeblk_gdalloc(d);
diff --git a/drivers/block/aoe/aoedev.c b/drivers/block/aoe/aoedev.c
index 141b2a0e03f2c..0189fc3cbbefe 100644
--- a/drivers/block/aoe/aoedev.c
+++ b/drivers/block/aoe/aoedev.c
@@ -199,8 +199,11 @@ aoedev_downdev(struct aoedev *d)
struct aoetgt *t, **tt, **te;
struct list_head *head, *pos, *nx;
int i;
+ unsigned long flags;
- d->flags &= ~DEVFL_UP;
+ spin_lock_irqsave(&d->lock, flags);
+ d->flags &= ~(DEVFL_UP | DEVFL_DEAD);
+ spin_unlock_irqrestore(&d->lock, flags);
/* clean out active and to-be-retransmitted buffers */
for (i = 0; i < NFACTIVE; i++) {
--
2.39.5