From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: Brian Kao <powenkao@google.com>,
Bart Van Assche <bvanassche@acm.org>,
"Martin K. Petersen" <martin.petersen@oracle.com>,
Sasha Levin <sashal@kernel.org>,
peter.wang@mediatek.com, avri.altman@sandisk.com,
beanhuo@micron.com, adrian.hunter@intel.com,
quic_nguyenb@quicinc.com, linux-scsi@vger.kernel.org
Subject: [PATCH AUTOSEL 6.18-6.1] scsi: ufs: core: Fix EH failure after W-LUN resume error
Date: Mon, 15 Dec 2025 03:55:26 -0500 [thread overview]
Message-ID: <20251215085533.2931615-3-sashal@kernel.org> (raw)
In-Reply-To: <20251215085533.2931615-1-sashal@kernel.org>
From: Brian Kao <powenkao@google.com>
[ Upstream commit b4bb6daf4ac4d4560044ecdd81e93aa2f6acbb06 ]
When a W-LUN resume fails, its parent devices in the SCSI hierarchy,
including the scsi_target, may be runtime suspended. Subsequently, the
error handler in ufshcd_recover_pm_error() fails to set the W-LUN device
back to active because the parent target is not active. This results in
the following errors:
google-ufshcd 3c2d0000.ufs: ufshcd_err_handler started; HBA state eh_fatal; ...
ufs_device_wlun 0:0:0:49488: START_STOP failed for power mode: 1, result 40000
ufs_device_wlun 0:0:0:49488: ufshcd_wl_runtime_resume failed: -5
...
ufs_device_wlun 0:0:0:49488: runtime PM trying to activate child device 0:0:0:49488 but parent (target0:0:0) is not active
Address this by:
1. Ensuring the W-LUN's parent scsi_target is runtime resumed before
attempting to set the W-LUN to active within
ufshcd_recover_pm_error().
2. Explicitly checking for power.runtime_error on the HBA and W-LUN
devices before calling pm_runtime_set_active() to clear the error
state.
3. Adding pm_runtime_get_sync(hba->dev) in
ufshcd_err_handling_prepare() to ensure the HBA itself is active
during error recovery, even if a child device resume failed.
These changes ensure the device power states are managed correctly
during error recovery.
Signed-off-by: Brian Kao <powenkao@google.com>
Tested-by: Brian Kao <powenkao@google.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://patch.msgid.link/20251112063214.1195761-1-powenkao@google.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
## Analysis Summary
### Problem Description
This commit fixes a bug where UFS error recovery fails after a W-LUN
(Well Known Logical Unit Number) resume error. When a W-LUN resume
fails:
1. The parent devices in the SCSI hierarchy (including `scsi_target`)
remain runtime suspended
2. The error handler's `ufshcd_recover_pm_error()` attempts to set the
W-LUN device active
3. The PM core rejects this because the parent target is not active,
producing the error:
```
runtime PM trying to activate child device 0:0:0:49488 but parent
(target0:0:0) is not active
```
### Technical Analysis of the Fix
**Original (buggy) code:**
```c
ret = pm_runtime_set_active(&hba->ufs_device_wlun->sdev_gendev);
if (ret)
ret = pm_runtime_set_active(hba->dev);
```
This incorrectly tries to activate a child device while its parent is
suspended - violating the PM hierarchy requirement.
**The fix does three things:**
1. **Adds `pm_runtime_get_sync(hba->dev)` in
`ufshcd_err_handling_prepare()`** - Ensures HBA is active during
error recovery even if a child device resume failed
2. **Fixes PM hierarchy in `ufshcd_recover_pm_error()`** - Explicitly
resumes the parent `scsi_target` before trying to activate the child
W-LUN
3. **Checks `power.runtime_error`** - Only clears error state on devices
that actually have an error, rather than relying on return values
### Stable Tree Criteria Assessment
| Criterion | Assessment |
|-----------|------------|
| Fixes real bug | ✅ YES - Error handling completely fails without this
|
| Obviously correct | ✅ YES - Follows standard PM parent-before-child
rules |
| Small and contained | ✅ YES - ~35 lines in one file, error handling
path only |
| No new features | ✅ YES - Pure bugfix |
| Tested | ✅ YES - Has Tested-by and Reviewed-by: Bart Van Assche |
| Exists in stable | ✅ YES - `ufshcd_recover_pm_error()` introduced in
v5.10 |
| Dependencies | ✅ NONE - Uses standard, long-existing PM APIs |
### User Impact
- **Affected systems**: Android devices, embedded systems with UFS
storage
- **Severity**: Error handler failure can lead to I/O failures and
potential data loss
- **Frequency**: Occurs when W-LUN resume fails and error handler tries
to recover
### Risk Assessment
- **Low regression risk**: Changes are in error handling code path only
- **Well-understood fix**: Standard PM hierarchy handling pattern
- **Good review**: Bart Van Assche (experienced SCSI/block maintainer)
reviewed it
### Conclusion
This is a clear-cut bug fix for UFS error handling. The bug causes the
error handler to fail completely when a W-LUN resume error occurs, which
can leave the storage subsystem in an unrecoverable state. The fix
correctly implements PM hierarchy requirements (parent must be active
before child), is well-tested, has expert review, and uses standard APIs
that exist in all stable trees since v5.10. The changes are contained to
the error handling path with minimal regression risk.
**YES**
drivers/ufs/core/ufshcd.c | 36 ++++++++++++++++++++++++++++--------
1 file changed, 28 insertions(+), 8 deletions(-)
diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
index d6a060a724618..ce52c3bafbe8f 100644
--- a/drivers/ufs/core/ufshcd.c
+++ b/drivers/ufs/core/ufshcd.c
@@ -6498,6 +6498,11 @@ static void ufshcd_clk_scaling_suspend(struct ufs_hba *hba, bool suspend)
static void ufshcd_err_handling_prepare(struct ufs_hba *hba)
{
+ /*
+ * A WLUN resume failure could potentially lead to the HBA being
+ * runtime suspended, so take an extra reference on hba->dev.
+ */
+ pm_runtime_get_sync(hba->dev);
ufshcd_rpm_get_sync(hba);
if (pm_runtime_status_suspended(&hba->ufs_device_wlun->sdev_gendev) ||
hba->is_sys_suspended) {
@@ -6537,6 +6542,7 @@ static void ufshcd_err_handling_unprepare(struct ufs_hba *hba)
if (ufshcd_is_clkscaling_supported(hba))
ufshcd_clk_scaling_suspend(hba, false);
ufshcd_rpm_put(hba);
+ pm_runtime_put(hba->dev);
}
static inline bool ufshcd_err_handling_should_stop(struct ufs_hba *hba)
@@ -6551,28 +6557,42 @@ static inline bool ufshcd_err_handling_should_stop(struct ufs_hba *hba)
#ifdef CONFIG_PM
static void ufshcd_recover_pm_error(struct ufs_hba *hba)
{
+ struct scsi_target *starget = hba->ufs_device_wlun->sdev_target;
struct Scsi_Host *shost = hba->host;
struct scsi_device *sdev;
struct request_queue *q;
- int ret;
+ bool resume_sdev_queues = false;
hba->is_sys_suspended = false;
+
/*
- * Set RPM status of wlun device to RPM_ACTIVE,
- * this also clears its runtime error.
+ * Ensure the parent's error status is cleared before proceeding
+ * to the child, as the parent must be active to activate the child.
*/
- ret = pm_runtime_set_active(&hba->ufs_device_wlun->sdev_gendev);
+ if (hba->dev->power.runtime_error) {
+ /* hba->dev has no functional parent thus simplily set RPM_ACTIVE */
+ pm_runtime_set_active(hba->dev);
+ resume_sdev_queues = true;
+ }
+
+ if (hba->ufs_device_wlun->sdev_gendev.power.runtime_error) {
+ /*
+ * starget, parent of wlun, might be suspended if wlun resume failed.
+ * Make sure parent is resumed before set child (wlun) active.
+ */
+ pm_runtime_get_sync(&starget->dev);
+ pm_runtime_set_active(&hba->ufs_device_wlun->sdev_gendev);
+ pm_runtime_put_sync(&starget->dev);
+ resume_sdev_queues = true;
+ }
- /* hba device might have a runtime error otherwise */
- if (ret)
- ret = pm_runtime_set_active(hba->dev);
/*
* If wlun device had runtime error, we also need to resume those
* consumer scsi devices in case any of them has failed to be
* resumed due to supplier runtime resume failure. This is to unblock
* blk_queue_enter in case there are bios waiting inside it.
*/
- if (!ret) {
+ if (resume_sdev_queues) {
shost_for_each_device(sdev, shost) {
q = sdev->request_queue;
if (q->dev && (q->rpm_status == RPM_SUSPENDED ||
--
2.51.0
next prev parent reply other threads:[~2025-12-15 8:55 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-12-15 8:55 [PATCH AUTOSEL 6.18] scsi: mpi3mr: Prevent duplicate SAS/SATA device entries in channel 1 Sasha Levin
2025-12-15 8:55 ` [PATCH AUTOSEL 6.18-5.10] scsi: Revert "scsi: libsas: Fix exp-attached device scan after probe failure scanned in again after probe failed" Sasha Levin
2025-12-15 8:55 ` Sasha Levin [this message]
2025-12-15 8:55 ` [PATCH AUTOSEL 6.18-5.10] scsi: ipr: Enable/disable IRQD_NO_BALANCING during reset Sasha Levin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20251215085533.2931615-3-sashal@kernel.org \
--to=sashal@kernel.org \
--cc=adrian.hunter@intel.com \
--cc=avri.altman@sandisk.com \
--cc=beanhuo@micron.com \
--cc=bvanassche@acm.org \
--cc=linux-scsi@vger.kernel.org \
--cc=martin.petersen@oracle.com \
--cc=patches@lists.linux.dev \
--cc=peter.wang@mediatek.com \
--cc=powenkao@google.com \
--cc=quic_nguyenb@quicinc.com \
--cc=stable@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox