[PATCH AUTOSEL 6.18-6.1] scsi: ufs: core: Fix EH failure after W-LUN resume error

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: Brian Kao <powenkao@google.com>,
	Bart Van Assche <bvanassche@acm.org>,
	"Martin K. Petersen" <martin.petersen@oracle.com>,
	Sasha Levin <sashal@kernel.org>,
	peter.wang@mediatek.com, avri.altman@sandisk.com,
	beanhuo@micron.com, adrian.hunter@intel.com,
	quic_nguyenb@quicinc.com, linux-scsi@vger.kernel.org
Subject: [PATCH AUTOSEL 6.18-6.1] scsi: ufs: core: Fix EH failure after W-LUN resume error
Date: Mon, 15 Dec 2025 03:55:26 -0500	[thread overview]
Message-ID: <20251215085533.2931615-3-sashal@kernel.org> (raw)
In-Reply-To: <20251215085533.2931615-1-sashal@kernel.org>

From: Brian Kao <powenkao@google.com>

[ Upstream commit b4bb6daf4ac4d4560044ecdd81e93aa2f6acbb06 ]

When a W-LUN resume fails, its parent devices in the SCSI hierarchy,
including the scsi_target, may be runtime suspended. Subsequently, the
error handler in ufshcd_recover_pm_error() fails to set the W-LUN device
back to active because the parent target is not active.  This results in
the following errors:

  google-ufshcd 3c2d0000.ufs: ufshcd_err_handler started; HBA state eh_fatal; ...
  ufs_device_wlun 0:0:0:49488: START_STOP failed for power mode: 1, result 40000
  ufs_device_wlun 0:0:0:49488: ufshcd_wl_runtime_resume failed: -5
  ...
  ufs_device_wlun 0:0:0:49488: runtime PM trying to activate child device 0:0:0:49488 but parent (target0:0:0) is not active

Address this by:

 1. Ensuring the W-LUN's parent scsi_target is runtime resumed before
    attempting to set the W-LUN to active within
    ufshcd_recover_pm_error().

 2. Explicitly checking for power.runtime_error on the HBA and W-LUN
    devices before calling pm_runtime_set_active() to clear the error
    state.

 3. Adding pm_runtime_get_sync(hba->dev) in
    ufshcd_err_handling_prepare() to ensure the HBA itself is active
    during error recovery, even if a child device resume failed.

These changes ensure the device power states are managed correctly
during error recovery.

Signed-off-by: Brian Kao <powenkao@google.com>
Tested-by: Brian Kao <powenkao@google.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://patch.msgid.link/20251112063214.1195761-1-powenkao@google.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

## Analysis Summary

### Problem Description
This commit fixes a bug where UFS error recovery fails after a W-LUN
(Well Known Logical Unit Number) resume error. When a W-LUN resume
fails:

1. The parent devices in the SCSI hierarchy (including `scsi_target`)
   remain runtime suspended
2. The error handler's `ufshcd_recover_pm_error()` attempts to set the
   W-LUN device active
3. The PM core rejects this because the parent target is not active,
   producing the error:
  ```
  runtime PM trying to activate child device 0:0:0:49488 but parent
  (target0:0:0) is not active
  ```

### Technical Analysis of the Fix

**Original (buggy) code:**
```c
ret = pm_runtime_set_active(&hba->ufs_device_wlun->sdev_gendev);
if (ret)
    ret = pm_runtime_set_active(hba->dev);
```

This incorrectly tries to activate a child device while its parent is
suspended - violating the PM hierarchy requirement.

**The fix does three things:**
1. **Adds `pm_runtime_get_sync(hba->dev)` in
   `ufshcd_err_handling_prepare()`** - Ensures HBA is active during
   error recovery even if a child device resume failed
2. **Fixes PM hierarchy in `ufshcd_recover_pm_error()`** - Explicitly
   resumes the parent `scsi_target` before trying to activate the child
   W-LUN
3. **Checks `power.runtime_error`** - Only clears error state on devices
   that actually have an error, rather than relying on return values

### Stable Tree Criteria Assessment

| Criterion | Assessment |
|-----------|------------|
| Fixes real bug | ✅ YES - Error handling completely fails without this
|
| Obviously correct | ✅ YES - Follows standard PM parent-before-child
rules |
| Small and contained | ✅ YES - ~35 lines in one file, error handling
path only |
| No new features | ✅ YES - Pure bugfix |
| Tested | ✅ YES - Has Tested-by and Reviewed-by: Bart Van Assche |
| Exists in stable | ✅ YES - `ufshcd_recover_pm_error()` introduced in
v5.10 |
| Dependencies | ✅ NONE - Uses standard, long-existing PM APIs |

### User Impact
- **Affected systems**: Android devices, embedded systems with UFS
  storage
- **Severity**: Error handler failure can lead to I/O failures and
  potential data loss
- **Frequency**: Occurs when W-LUN resume fails and error handler tries
  to recover

### Risk Assessment
- **Low regression risk**: Changes are in error handling code path only
- **Well-understood fix**: Standard PM hierarchy handling pattern
- **Good review**: Bart Van Assche (experienced SCSI/block maintainer)
  reviewed it

### Conclusion

This is a clear-cut bug fix for UFS error handling. The bug causes the
error handler to fail completely when a W-LUN resume error occurs, which
can leave the storage subsystem in an unrecoverable state. The fix
correctly implements PM hierarchy requirements (parent must be active
before child), is well-tested, has expert review, and uses standard APIs
that exist in all stable trees since v5.10. The changes are contained to
the error handling path with minimal regression risk.

**YES**

 drivers/ufs/core/ufshcd.c | 36 ++++++++++++++++++++++++++++--------
 1 file changed, 28 insertions(+), 8 deletions(-)

diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
index d6a060a724618..ce52c3bafbe8f 100644
--- a/drivers/ufs/core/ufshcd.c
+++ b/drivers/ufs/core/ufshcd.c
@@ -6498,6 +6498,11 @@ static void ufshcd_clk_scaling_suspend(struct ufs_hba *hba, bool suspend)
 
 static void ufshcd_err_handling_prepare(struct ufs_hba *hba)
 {
+	/*
+	 * A WLUN resume failure could potentially lead to the HBA being
+	 * runtime suspended, so take an extra reference on hba->dev.
+	 */
+	pm_runtime_get_sync(hba->dev);
 	ufshcd_rpm_get_sync(hba);
 	if (pm_runtime_status_suspended(&hba->ufs_device_wlun->sdev_gendev) ||
 	    hba->is_sys_suspended) {
@@ -6537,6 +6542,7 @@ static void ufshcd_err_handling_unprepare(struct ufs_hba *hba)
 	if (ufshcd_is_clkscaling_supported(hba))
 		ufshcd_clk_scaling_suspend(hba, false);
 	ufshcd_rpm_put(hba);
+	pm_runtime_put(hba->dev);
 }
 
 static inline bool ufshcd_err_handling_should_stop(struct ufs_hba *hba)
@@ -6551,28 +6557,42 @@ static inline bool ufshcd_err_handling_should_stop(struct ufs_hba *hba)
 #ifdef CONFIG_PM
 static void ufshcd_recover_pm_error(struct ufs_hba *hba)
 {
+	struct scsi_target *starget = hba->ufs_device_wlun->sdev_target;
 	struct Scsi_Host *shost = hba->host;
 	struct scsi_device *sdev;
 	struct request_queue *q;
-	int ret;
+	bool resume_sdev_queues = false;
 
 	hba->is_sys_suspended = false;
+
 	/*
-	 * Set RPM status of wlun device to RPM_ACTIVE,
-	 * this also clears its runtime error.
+	 * Ensure the parent's error status is cleared before proceeding
+	 * to the child, as the parent must be active to activate the child.
 	 */
-	ret = pm_runtime_set_active(&hba->ufs_device_wlun->sdev_gendev);
+	if (hba->dev->power.runtime_error) {
+		/* hba->dev has no functional parent thus simplily set RPM_ACTIVE */
+		pm_runtime_set_active(hba->dev);
+		resume_sdev_queues = true;
+	}
+
+	if (hba->ufs_device_wlun->sdev_gendev.power.runtime_error) {
+		/*
+		 * starget, parent of wlun, might be suspended if wlun resume failed.
+		 * Make sure parent is resumed before set child (wlun) active.
+		 */
+		pm_runtime_get_sync(&starget->dev);
+		pm_runtime_set_active(&hba->ufs_device_wlun->sdev_gendev);
+		pm_runtime_put_sync(&starget->dev);
+		resume_sdev_queues = true;
+	}
 
-	/* hba device might have a runtime error otherwise */
-	if (ret)
-		ret = pm_runtime_set_active(hba->dev);
 	/*
 	 * If wlun device had runtime error, we also need to resume those
 	 * consumer scsi devices in case any of them has failed to be
 	 * resumed due to supplier runtime resume failure. This is to unblock
 	 * blk_queue_enter in case there are bios waiting inside it.
 	 */
-	if (!ret) {
+	if (resume_sdev_queues) {
 		shost_for_each_device(sdev, shost) {
 			q = sdev->request_queue;
 			if (q->dev && (q->rpm_status == RPM_SUSPENDED ||
-- 
2.51.0

next prev parent reply	other threads:[~2025-12-15  8:55 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-12-15  8:55 [PATCH AUTOSEL 6.18] scsi: mpi3mr: Prevent duplicate SAS/SATA device entries in channel 1 Sasha Levin
2025-12-15  8:55 ` [PATCH AUTOSEL 6.18-5.10] scsi: Revert "scsi: libsas: Fix exp-attached device scan after probe failure scanned in again after probe failed" Sasha Levin
2025-12-15  8:55 ` Sasha Levin [this message]
2025-12-15  8:55 ` [PATCH AUTOSEL 6.18-5.10] scsi: ipr: Enable/disable IRQD_NO_BALANCING during reset Sasha Levin

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:d6a060a72461 dfblob:ce52c3bafbe8 )
 OR (
bs:"[PATCH AUTOSEL 6.18-6.1] scsi: ufs: core: Fix EH failure after W-LUN resume error" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20251215085533.2931615-3-sashal@kernel.org \
    --to=sashal@kernel.org \
    --cc=adrian.hunter@intel.com \
    --cc=avri.altman@sandisk.com \
    --cc=beanhuo@micron.com \
    --cc=bvanassche@acm.org \
    --cc=linux-scsi@vger.kernel.org \
    --cc=martin.petersen@oracle.com \
    --cc=patches@lists.linux.dev \
    --cc=peter.wang@mediatek.com \
    --cc=powenkao@google.com \
    --cc=quic_nguyenb@quicinc.com \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.