[PATCH 0/2] bus: mhi: host: pci_generic: Couple of recovery fixes

public inbox for stable@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/2] bus: mhi: host: pci_generic: Couple of recovery fixes
@ 2025-01-08 13:39 Manivannan Sadhasivam via B4 Relay
  2025-01-08 13:39 ` [PATCH 1/2] bus: mhi: host: pci_generic: Use pci_try_reset_function() to avoid deadlock Manivannan Sadhasivam via B4 Relay
  2025-01-08 13:39 ` [PATCH 2/2] bus: mhi: host: pci_generic: Recover the device synchronously from mhi_pci_runtime_resume() Manivannan Sadhasivam via B4 Relay
  0 siblings, 2 replies; 14+ messages in thread
From: Manivannan Sadhasivam via B4 Relay @ 2025-01-08 13:39 UTC (permalink / raw)
  To: mhi, Loic Poulain
  Cc: Johan Hovold, linux-arm-msm, linux-kernel, Manivannan Sadhasivam,
	stable

Hi,

This series fixes a couple of issues reported by Johan in [1]. First one fixes
a deadlock that happens during shutdown and suspend. Second one fixes the
driver's PM behavior.

[1] https://lore.kernel.org/mhi/Z1me8iaK7cwgjL92@hovoldconsulting.com

Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
---
Manivannan Sadhasivam (2):
      bus: mhi: host: pci_generic: Use pci_try_reset_function() to avoid deadlock
      bus: mhi: host: pci_generic: Recover the device synchronously from mhi_pci_runtime_resume()

 drivers/bus/mhi/host/pci_generic.c | 29 +++++++++++++++++------------
 1 file changed, 17 insertions(+), 12 deletions(-)
---
base-commit: fc033cf25e612e840e545f8d5ad2edd6ba613ed5
change-id: 20250108-mhi_recovery_fix-a8f37168f91c

Best regards,
-- 
Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>



^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/2] bus: mhi: host: pci_generic: Use pci_try_reset_function() to avoid deadlock
  2025-01-08 13:39 [PATCH 0/2] bus: mhi: host: pci_generic: Couple of recovery fixes Manivannan Sadhasivam via B4 Relay
@ 2025-01-08 13:39 ` Manivannan Sadhasivam via B4 Relay
  2025-01-08 14:46   ` Loic Poulain
  2025-01-22 15:11   ` Johan Hovold
  2025-01-08 13:39 ` [PATCH 2/2] bus: mhi: host: pci_generic: Recover the device synchronously from mhi_pci_runtime_resume() Manivannan Sadhasivam via B4 Relay
  1 sibling, 2 replies; 14+ messages in thread
From: Manivannan Sadhasivam via B4 Relay @ 2025-01-08 13:39 UTC (permalink / raw)
  To: mhi, Loic Poulain
  Cc: Johan Hovold, linux-arm-msm, linux-kernel, Manivannan Sadhasivam,
	stable

From: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>

There are multiple places from where the recovery work gets scheduled
asynchronously. Also, there are multiple places where the caller waits
synchronously for the recovery to be completed. One such place is during
the PM shutdown() callback.

If the device is not alive during recovery_work, it will try to reset the
device using pci_reset_function(). This function internally will take the
device_lock() first before resetting the device. By this time, if the lock
has already been acquired, then recovery_work will get stalled while
waiting for the lock. And if the lock was already acquired by the caller
which waits for the recovery_work to be completed, it will lead to
deadlock.

This is what happened on the X1E80100 CRD device when the device died
before shutdown() callback. Driver core calls the driver's shutdown()
callback while holding the device_lock() leading to deadlock.

And this deadlock scenario can occur on other paths as well, like during
the PM suspend() callback, where the driver core would hold the
device_lock() before calling driver's suspend() callback. And if the
recovery_work was already started, it could lead to deadlock. This is also
observed on the X1E80100 CRD.

So to fix both issues, use pci_try_reset_function() in recovery_work. This
function first checks for the availability of the device_lock() before
trying to reset the device. If the lock is available, it will acquire it
and reset the device. Otherwise, it will return -EAGAIN. If that happens,
recovery_work will fail with the error message "Recovery failed" as not
much could be done.

Cc: stable@vger.kernel.org # 5.12
Reported-by: Johan Hovold <johan@kernel.org>
Closes: https://lore.kernel.org/mhi/Z1me8iaK7cwgjL92@hovoldconsulting.com
Fixes: 7389337f0a78 ("mhi: pci_generic: Add suspend/resume/recovery procedure")
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
---
 drivers/bus/mhi/host/pci_generic.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/bus/mhi/host/pci_generic.c b/drivers/bus/mhi/host/pci_generic.c
index 07645ce2119a..e92df380c785 100644
--- a/drivers/bus/mhi/host/pci_generic.c
+++ b/drivers/bus/mhi/host/pci_generic.c
@@ -1040,7 +1040,7 @@ static void mhi_pci_recovery_work(struct work_struct *work)
 err_unprepare:
 	mhi_unprepare_after_power_down(mhi_cntrl);
 err_try_reset:
-	if (pci_reset_function(pdev))
+	if (pci_try_reset_function(pdev))
 		dev_err(&pdev->dev, "Recovery failed\n");
 }

-- 
2.25.1

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 2/2] bus: mhi: host: pci_generic: Recover the device synchronously from mhi_pci_runtime_resume()
  2025-01-08 13:39 [PATCH 0/2] bus: mhi: host: pci_generic: Couple of recovery fixes Manivannan Sadhasivam via B4 Relay
  2025-01-08 13:39 ` [PATCH 1/2] bus: mhi: host: pci_generic: Use pci_try_reset_function() to avoid deadlock Manivannan Sadhasivam via B4 Relay
@ 2025-01-08 13:39 ` Manivannan Sadhasivam via B4 Relay
  2025-01-08 15:19   ` Loic Poulain
  2025-01-22 15:24   ` Johan Hovold
  1 sibling, 2 replies; 14+ messages in thread
From: Manivannan Sadhasivam via B4 Relay @ 2025-01-08 13:39 UTC (permalink / raw)
  To: mhi, Loic Poulain
  Cc: Johan Hovold, linux-arm-msm, linux-kernel, Manivannan Sadhasivam,
	stable

From: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>

Currently, in mhi_pci_runtime_resume(), if the resume fails, recovery_work
is started asynchronously and success is returned. But this doesn't align
with what PM core expects as documented in
Documentation/power/runtime_pm.rst:

"Once the subsystem-level resume callback (or the driver resume callback,
if invoked directly) has completed successfully, the PM core regards the
device as fully operational, which means that the device _must_ be able to
complete I/O operations as needed.  The runtime PM status of the device is
then 'active'."

So the PM core ends up marking the runtime PM status of the device as
'active', even though the device is not able to handle the I/O operations.
This same condition more or less applies to system resume as well.

So to avoid this ambiguity, try to recover the device synchronously from
mhi_pci_runtime_resume() and return the actual error code in the case of
recovery failure.

For doing so, move the recovery code to __mhi_pci_recovery_work() helper
and call that from both mhi_pci_recovery_work() and
mhi_pci_runtime_resume(). Former still ignores the return value, while the
latter passes it to PM core.

Cc: stable@vger.kernel.org # 5.13
Reported-by: Johan Hovold <johan@kernel.org>
Closes: https://lore.kernel.org/mhi/Z2PbEPYpqFfrLSJi@hovoldconsulting.com
Fixes: d3800c1dce24 ("bus: mhi: pci_generic: Add support for runtime PM")
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
---
 drivers/bus/mhi/host/pci_generic.c | 29 +++++++++++++++++------------
 1 file changed, 17 insertions(+), 12 deletions(-)

diff --git a/drivers/bus/mhi/host/pci_generic.c b/drivers/bus/mhi/host/pci_generic.c
index e92df380c785..f6de407e077e 100644
--- a/drivers/bus/mhi/host/pci_generic.c
+++ b/drivers/bus/mhi/host/pci_generic.c
@@ -997,10 +997,8 @@ static void mhi_pci_runtime_put(struct mhi_controller *mhi_cntrl)
 	pm_runtime_put(mhi_cntrl->cntrl_dev);
 }
 
-static void mhi_pci_recovery_work(struct work_struct *work)
+static int __mhi_pci_recovery_work(struct mhi_pci_device *mhi_pdev)
 {
-	struct mhi_pci_device *mhi_pdev = container_of(work, struct mhi_pci_device,
-						       recovery_work);
 	struct mhi_controller *mhi_cntrl = &mhi_pdev->mhi_cntrl;
 	struct pci_dev *pdev = to_pci_dev(mhi_cntrl->cntrl_dev);
 	int err;
@@ -1035,13 +1033,25 @@ static void mhi_pci_recovery_work(struct work_struct *work)
 
 	set_bit(MHI_PCI_DEV_STARTED, &mhi_pdev->status);
 	mod_timer(&mhi_pdev->health_check_timer, jiffies + HEALTH_CHECK_PERIOD);
-	return;
+
+	return 0;
 
 err_unprepare:
 	mhi_unprepare_after_power_down(mhi_cntrl);
 err_try_reset:
-	if (pci_try_reset_function(pdev))
+	err = pci_try_reset_function(pdev);
+	if (err)
 		dev_err(&pdev->dev, "Recovery failed\n");
+
+	return err;
+}
+
+static void mhi_pci_recovery_work(struct work_struct *work)
+{
+	struct mhi_pci_device *mhi_pdev = container_of(work, struct mhi_pci_device,
+						       recovery_work);
+
+	__mhi_pci_recovery_work(mhi_pdev);
 }
 
 static void health_check(struct timer_list *t)
@@ -1400,15 +1410,10 @@ static int __maybe_unused mhi_pci_runtime_resume(struct device *dev)
 	return 0;
 
 err_recovery:
-	/* Do not fail to not mess up our PCI device state, the device likely
-	 * lost power (d3cold) and we simply need to reset it from the recovery
-	 * procedure, trigger the recovery asynchronously to prevent system
-	 * suspend exit delaying.
-	 */
-	queue_work(system_long_wq, &mhi_pdev->recovery_work);
+	err = __mhi_pci_recovery_work(mhi_pdev);
 	pm_runtime_mark_last_busy(dev);
 
-	return 0;
+	return err;
 }
 
 static int  __maybe_unused mhi_pci_suspend(struct device *dev)

-- 
2.25.1



^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] bus: mhi: host: pci_generic: Use pci_try_reset_function() to avoid deadlock
  2025-01-08 13:39 ` [PATCH 1/2] bus: mhi: host: pci_generic: Use pci_try_reset_function() to avoid deadlock Manivannan Sadhasivam via B4 Relay
@ 2025-01-08 14:46   ` Loic Poulain
  2025-01-22 15:11   ` Johan Hovold
  1 sibling, 0 replies; 14+ messages in thread
From: Loic Poulain @ 2025-01-08 14:46 UTC (permalink / raw)
  To: manivannan.sadhasivam
  Cc: mhi, Johan Hovold, linux-arm-msm, linux-kernel, stable

On Wed, 8 Jan 2025 at 14:39, Manivannan Sadhasivam via B4 Relay
<devnull+manivannan.sadhasivam.linaro.org@kernel.org> wrote:
>
> From: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
>
> There are multiple places from where the recovery work gets scheduled
> asynchronously. Also, there are multiple places where the caller waits
> synchronously for the recovery to be completed. One such place is during
> the PM shutdown() callback.
>
> If the device is not alive during recovery_work, it will try to reset the
> device using pci_reset_function(). This function internally will take the
> device_lock() first before resetting the device. By this time, if the lock
> has already been acquired, then recovery_work will get stalled while
> waiting for the lock. And if the lock was already acquired by the caller
> which waits for the recovery_work to be completed, it will lead to
> deadlock.
>
> This is what happened on the X1E80100 CRD device when the device died
> before shutdown() callback. Driver core calls the driver's shutdown()
> callback while holding the device_lock() leading to deadlock.
>
> And this deadlock scenario can occur on other paths as well, like during
> the PM suspend() callback, where the driver core would hold the
> device_lock() before calling driver's suspend() callback. And if the
> recovery_work was already started, it could lead to deadlock. This is also
> observed on the X1E80100 CRD.
>
> So to fix both issues, use pci_try_reset_function() in recovery_work. This
> function first checks for the availability of the device_lock() before
> trying to reset the device. If the lock is available, it will acquire it
> and reset the device. Otherwise, it will return -EAGAIN. If that happens,
> recovery_work will fail with the error message "Recovery failed" as not
> much could be done.
>
> Cc: stable@vger.kernel.org # 5.12
> Reported-by: Johan Hovold <johan@kernel.org>
> Closes: https://lore.kernel.org/mhi/Z1me8iaK7cwgjL92@hovoldconsulting.com
> Fixes: 7389337f0a78 ("mhi: pci_generic: Add suspend/resume/recovery procedure")
> Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>

Reviewed-by: Loic Poulain <loic.poulain@linaro.org>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] bus: mhi: host: pci_generic: Recover the device synchronously from mhi_pci_runtime_resume()
  2025-01-08 13:39 ` [PATCH 2/2] bus: mhi: host: pci_generic: Recover the device synchronously from mhi_pci_runtime_resume() Manivannan Sadhasivam via B4 Relay
@ 2025-01-08 15:19   ` Loic Poulain
  2025-01-08 16:02     ` Manivannan Sadhasivam
  2025-01-22 15:24   ` Johan Hovold
  1 sibling, 1 reply; 14+ messages in thread
From: Loic Poulain @ 2025-01-08 15:19 UTC (permalink / raw)
  To: manivannan.sadhasivam
  Cc: mhi, Johan Hovold, linux-arm-msm, linux-kernel, stable

On Wed, 8 Jan 2025 at 14:39, Manivannan Sadhasivam via B4 Relay
<devnull+manivannan.sadhasivam.linaro.org@kernel.org> wrote:
>
> From: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
>
> Currently, in mhi_pci_runtime_resume(), if the resume fails, recovery_work
> is started asynchronously and success is returned. But this doesn't align
> with what PM core expects as documented in
> Documentation/power/runtime_pm.rst:
>
> "Once the subsystem-level resume callback (or the driver resume callback,
> if invoked directly) has completed successfully, the PM core regards the
> device as fully operational, which means that the device _must_ be able to
> complete I/O operations as needed.  The runtime PM status of the device is
> then 'active'."
>
> So the PM core ends up marking the runtime PM status of the device as
> 'active', even though the device is not able to handle the I/O operations.
> This same condition more or less applies to system resume as well.
>
> So to avoid this ambiguity, try to recover the device synchronously from
> mhi_pci_runtime_resume() and return the actual error code in the case of
> recovery failure.
>
> For doing so, move the recovery code to __mhi_pci_recovery_work() helper
> and call that from both mhi_pci_recovery_work() and
> mhi_pci_runtime_resume(). Former still ignores the return value, while the
> latter passes it to PM core.
>
> Cc: stable@vger.kernel.org # 5.13
> Reported-by: Johan Hovold <johan@kernel.org>
> Closes: https://lore.kernel.org/mhi/Z2PbEPYpqFfrLSJi@hovoldconsulting.com
> Fixes: d3800c1dce24 ("bus: mhi: pci_generic: Add support for runtime PM")
> Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>

Note that it will noticeably impact the user experience on system-wide
resume (mhi_pci_resume), because MHI devices usually take a while (a
few seconds) to cold boot and reach a ready state (or time out in the
worst case). So we may have people complaining about delayed resume
regression on their laptop even if they are not using the MHI
device/modem function. Are we ok with that?

Regards,
Loic

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] bus: mhi: host: pci_generic: Recover the device synchronously from mhi_pci_runtime_resume()
  2025-01-08 15:19   ` Loic Poulain
@ 2025-01-08 16:02     ` Manivannan Sadhasivam
  2025-01-09 20:50       ` Loic Poulain
  0 siblings, 1 reply; 14+ messages in thread
From: Manivannan Sadhasivam @ 2025-01-08 16:02 UTC (permalink / raw)
  To: Loic Poulain; +Cc: mhi, Johan Hovold, linux-arm-msm, linux-kernel, stable

On Wed, Jan 08, 2025 at 04:19:06PM +0100, Loic Poulain wrote:
> On Wed, 8 Jan 2025 at 14:39, Manivannan Sadhasivam via B4 Relay
> <devnull+manivannan.sadhasivam.linaro.org@kernel.org> wrote:
> >
> > From: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
> >
> > Currently, in mhi_pci_runtime_resume(), if the resume fails, recovery_work
> > is started asynchronously and success is returned. But this doesn't align
> > with what PM core expects as documented in
> > Documentation/power/runtime_pm.rst:
> >
> > "Once the subsystem-level resume callback (or the driver resume callback,
> > if invoked directly) has completed successfully, the PM core regards the
> > device as fully operational, which means that the device _must_ be able to
> > complete I/O operations as needed.  The runtime PM status of the device is
> > then 'active'."
> >
> > So the PM core ends up marking the runtime PM status of the device as
> > 'active', even though the device is not able to handle the I/O operations.
> > This same condition more or less applies to system resume as well.
> >
> > So to avoid this ambiguity, try to recover the device synchronously from
> > mhi_pci_runtime_resume() and return the actual error code in the case of
> > recovery failure.
> >
> > For doing so, move the recovery code to __mhi_pci_recovery_work() helper
> > and call that from both mhi_pci_recovery_work() and
> > mhi_pci_runtime_resume(). Former still ignores the return value, while the
> > latter passes it to PM core.
> >
> > Cc: stable@vger.kernel.org # 5.13
> > Reported-by: Johan Hovold <johan@kernel.org>
> > Closes: https://lore.kernel.org/mhi/Z2PbEPYpqFfrLSJi@hovoldconsulting.com
> > Fixes: d3800c1dce24 ("bus: mhi: pci_generic: Add support for runtime PM")
> > Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
> 
> Note that it will noticeably impact the user experience on system-wide
> resume (mhi_pci_resume), because MHI devices usually take a while (a
> few seconds) to cold boot and reach a ready state (or time out in the
> worst case). So we may have people complaining about delayed resume
> regression on their laptop even if they are not using the MHI
> device/modem function. Are we ok with that?
> 

Are you saying that the modem will enter D3Cold all the time during system
suspend? I think you are referring to x86 host machines here.

If that is the case, we should not be using mhi_pci_runtime_*() calls in
mhi_pci_suspend/resume(). Rather the MHI stack should be powered down during
suspend and powered ON during resume.

- Mani

-- 
மணிவண்ணன் சதாசிவம்

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] bus: mhi: host: pci_generic: Recover the device synchronously from mhi_pci_runtime_resume()
  2025-01-08 16:02     ` Manivannan Sadhasivam
@ 2025-01-09 20:50       ` Loic Poulain
  2025-01-12  4:23         ` Manivannan Sadhasivam
  0 siblings, 1 reply; 14+ messages in thread
From: Loic Poulain @ 2025-01-09 20:50 UTC (permalink / raw)
  To: Manivannan Sadhasivam
  Cc: mhi, Johan Hovold, linux-arm-msm, linux-kernel, stable

On Wed, 8 Jan 2025 at 17:02, Manivannan Sadhasivam
<manivannan.sadhasivam@linaro.org> wrote:
>
> On Wed, Jan 08, 2025 at 04:19:06PM +0100, Loic Poulain wrote:
> > On Wed, 8 Jan 2025 at 14:39, Manivannan Sadhasivam via B4 Relay
> > <devnull+manivannan.sadhasivam.linaro.org@kernel.org> wrote:
> > >
> > > From: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
> > >
> > > Currently, in mhi_pci_runtime_resume(), if the resume fails, recovery_work
> > > is started asynchronously and success is returned. But this doesn't align
> > > with what PM core expects as documented in
> > > Documentation/power/runtime_pm.rst:
> > >
> > > "Once the subsystem-level resume callback (or the driver resume callback,
> > > if invoked directly) has completed successfully, the PM core regards the
> > > device as fully operational, which means that the device _must_ be able to
> > > complete I/O operations as needed.  The runtime PM status of the device is
> > > then 'active'."
> > >
> > > So the PM core ends up marking the runtime PM status of the device as
> > > 'active', even though the device is not able to handle the I/O operations.
> > > This same condition more or less applies to system resume as well.
> > >
> > > So to avoid this ambiguity, try to recover the device synchronously from
> > > mhi_pci_runtime_resume() and return the actual error code in the case of
> > > recovery failure.
> > >
> > > For doing so, move the recovery code to __mhi_pci_recovery_work() helper
> > > and call that from both mhi_pci_recovery_work() and
> > > mhi_pci_runtime_resume(). Former still ignores the return value, while the
> > > latter passes it to PM core.
> > >
> > > Cc: stable@vger.kernel.org # 5.13
> > > Reported-by: Johan Hovold <johan@kernel.org>
> > > Closes: https://lore.kernel.org/mhi/Z2PbEPYpqFfrLSJi@hovoldconsulting.com
> > > Fixes: d3800c1dce24 ("bus: mhi: pci_generic: Add support for runtime PM")
> > > Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
> >
> > Note that it will noticeably impact the user experience on system-wide
> > resume (mhi_pci_resume), because MHI devices usually take a while (a
> > few seconds) to cold boot and reach a ready state (or time out in the
> > worst case). So we may have people complaining about delayed resume
> > regression on their laptop even if they are not using the MHI
> > device/modem function. Are we ok with that?
> >
>
> Are you saying that the modem will enter D3Cold all the time during system
> suspend? I think you are referring to x86 host machines here.

It depends on the host and its firmware implementation, but yes I
observed that x86_64 based laptops are powering off the mPCIe slot
while suspended.

> If that is the case, we should not be using mhi_pci_runtime_*() calls in
> mhi_pci_suspend/resume(). Rather the MHI stack should be powered down during
> suspend and powered ON during resume.

Yes, but what about the hosts keeping power in suspend state? we can
not really know that programmatically AFAIK.

Regards,
Loic

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] bus: mhi: host: pci_generic: Recover the device synchronously from mhi_pci_runtime_resume()
  2025-01-09 20:50       ` Loic Poulain
@ 2025-01-12  4:23         ` Manivannan Sadhasivam
  0 siblings, 0 replies; 14+ messages in thread
From: Manivannan Sadhasivam @ 2025-01-12  4:23 UTC (permalink / raw)
  To: Loic Poulain; +Cc: mhi, Johan Hovold, linux-arm-msm, linux-kernel, stable

On Thu, Jan 09, 2025 at 09:50:55PM +0100, Loic Poulain wrote:
> On Wed, 8 Jan 2025 at 17:02, Manivannan Sadhasivam
> <manivannan.sadhasivam@linaro.org> wrote:
> >
> > On Wed, Jan 08, 2025 at 04:19:06PM +0100, Loic Poulain wrote:
> > > On Wed, 8 Jan 2025 at 14:39, Manivannan Sadhasivam via B4 Relay
> > > <devnull+manivannan.sadhasivam.linaro.org@kernel.org> wrote:
> > > >
> > > > From: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
> > > >
> > > > Currently, in mhi_pci_runtime_resume(), if the resume fails, recovery_work
> > > > is started asynchronously and success is returned. But this doesn't align
> > > > with what PM core expects as documented in
> > > > Documentation/power/runtime_pm.rst:
> > > >
> > > > "Once the subsystem-level resume callback (or the driver resume callback,
> > > > if invoked directly) has completed successfully, the PM core regards the
> > > > device as fully operational, which means that the device _must_ be able to
> > > > complete I/O operations as needed.  The runtime PM status of the device is
> > > > then 'active'."
> > > >
> > > > So the PM core ends up marking the runtime PM status of the device as
> > > > 'active', even though the device is not able to handle the I/O operations.
> > > > This same condition more or less applies to system resume as well.
> > > >
> > > > So to avoid this ambiguity, try to recover the device synchronously from
> > > > mhi_pci_runtime_resume() and return the actual error code in the case of
> > > > recovery failure.
> > > >
> > > > For doing so, move the recovery code to __mhi_pci_recovery_work() helper
> > > > and call that from both mhi_pci_recovery_work() and
> > > > mhi_pci_runtime_resume(). Former still ignores the return value, while the
> > > > latter passes it to PM core.
> > > >
> > > > Cc: stable@vger.kernel.org # 5.13
> > > > Reported-by: Johan Hovold <johan@kernel.org>
> > > > Closes: https://lore.kernel.org/mhi/Z2PbEPYpqFfrLSJi@hovoldconsulting.com
> > > > Fixes: d3800c1dce24 ("bus: mhi: pci_generic: Add support for runtime PM")
> > > > Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
> > >
> > > Note that it will noticeably impact the user experience on system-wide
> > > resume (mhi_pci_resume), because MHI devices usually take a while (a
> > > few seconds) to cold boot and reach a ready state (or time out in the
> > > worst case). So we may have people complaining about delayed resume
> > > regression on their laptop even if they are not using the MHI
> > > device/modem function. Are we ok with that?
> > >
> >
> > Are you saying that the modem will enter D3Cold all the time during system
> > suspend? I think you are referring to x86 host machines here.
> 
> It depends on the host and its firmware implementation, but yes I
> observed that x86_64 based laptops are powering off the mPCIe slot
> while suspended.
> 

Then the default behavior should be to power down the MHI stack during suspend.

> > If that is the case, we should not be using mhi_pci_runtime_*() calls in
> > mhi_pci_suspend/resume(). Rather the MHI stack should be powered down during
> > suspend and powered ON during resume.
> 
> Yes, but what about the hosts keeping power in suspend state? we can
> not really know that programmatically AFAIK.
> 

Well, there are a few APIs we can rely on, but they are not reliable atleast on
DT platforms. However, powering down the MHI stack should be safe irrespective
of what the platform decides to do.

Regarding your comment on device taking time to resume, we can opt for async PM
to let the device come up without affecting overall system resume.

Let me know if both of the above options make sense to you. I'll submit
patches to incorporate them.

- Mani

-- 
மணிவண்ணன் சதாசிவம்

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] bus: mhi: host: pci_generic: Use pci_try_reset_function() to avoid deadlock
  2025-01-08 13:39 ` [PATCH 1/2] bus: mhi: host: pci_generic: Use pci_try_reset_function() to avoid deadlock Manivannan Sadhasivam via B4 Relay
  2025-01-08 14:46   ` Loic Poulain
@ 2025-01-22 15:11   ` Johan Hovold
  2025-02-19 13:13     ` Manivannan Sadhasivam
  1 sibling, 1 reply; 14+ messages in thread
From: Johan Hovold @ 2025-01-22 15:11 UTC (permalink / raw)
  To: manivannan.sadhasivam
  Cc: mhi, Loic Poulain, linux-arm-msm, linux-kernel, stable

On Wed, Jan 08, 2025 at 07:09:27PM +0530, Manivannan Sadhasivam via B4 Relay wrote:
> From: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
> 
> There are multiple places from where the recovery work gets scheduled
> asynchronously. Also, there are multiple places where the caller waits
> synchronously for the recovery to be completed. One such place is during
> the PM shutdown() callback.
> 
> If the device is not alive during recovery_work, it will try to reset the
> device using pci_reset_function(). This function internally will take the
> device_lock() first before resetting the device. By this time, if the lock
> has already been acquired, then recovery_work will get stalled while
> waiting for the lock. And if the lock was already acquired by the caller
> which waits for the recovery_work to be completed, it will lead to
> deadlock.
> 
> This is what happened on the X1E80100 CRD device when the device died
> before shutdown() callback. Driver core calls the driver's shutdown()
> callback while holding the device_lock() leading to deadlock.
> 
> And this deadlock scenario can occur on other paths as well, like during
> the PM suspend() callback, where the driver core would hold the
> device_lock() before calling driver's suspend() callback. And if the
> recovery_work was already started, it could lead to deadlock. This is also
> observed on the X1E80100 CRD.
> 
> So to fix both issues, use pci_try_reset_function() in recovery_work. This
> function first checks for the availability of the device_lock() before
> trying to reset the device. If the lock is available, it will acquire it
> and reset the device. Otherwise, it will return -EAGAIN. If that happens,
> recovery_work will fail with the error message "Recovery failed" as not
> much could be done.

I can confirm that this patch (alone) fixes the deadlock on shutdown
and suspend as expected, but it does leave the system state that blocks
further suspend (this is with patches that tear down the PCI link).
Looks like the modem is also hosed.

[  267.454616] mhi-pci-generic 0005:01:00.0: mhi_pci_runtime_resume
[  267.461165] mhi mhi0: Resuming from non M3 state (SYS ERROR)
[  267.467102] mhi-pci-generic 0005:01:00.0: failed to resume device: -22
[  267.473968] mhi-pci-generic 0005:01:00.0: device recovery started
[  267.477460] mhi-pci-generic 0005:01:00.0: mhi_pci_suspend
[  267.480331] mhi-pci-generic 0005:01:00.0: __mhi_power_down
[  267.485917] mhi-pci-generic 0005:01:00.0: mhi_pci_runtime_suspend
[  267.498339] mhi-pci-generic 0005:01:00.0: __mhi_power_down - pm mutex taken
[  267.505618] mhi-pci-generic 0005:01:00.0: __mhi_power_down - pm lock taken
[  267.513372] wwan wwan0: port wwan0qcdm0 disconnected
[  267.519556] wwan wwan0: port wwan0mbim0 disconnected
[  267.525544] wwan wwan0: port wwan0qmi0 disconnected
[  267.573773] mhi-pci-generic 0005:01:00.0: __mhi_power_down - returns
[  267.591199] mhi mhi0: Requested to power ON
[  267.914688] mhi mhi0: Power on setup success
[  267.914875] mhi mhi0: Wait for device to enter SBL or Mission mode
[  267.919179] mhi-pci-generic 0005:01:00.0: mhi_sync_power_up - wait event timeout_ms = 8000
[  276.189399] mhi-pci-generic 0005:01:00.0: mhi_sync_power_up - wait event returns, ret = -110
[  276.198240] mhi-pci-generic 0005:01:00.0: __mhi_power_down
[  276.203990] mhi-pci-generic 0005:01:00.0: __mhi_power_down - pm mutex taken
[  276.211269] mhi-pci-generic 0005:01:00.0: __mhi_power_down - pm lock taken
[  276.220024] mhi-pci-generic 0005:01:00.0: __mhi_power_down - returns
[  276.226680] mhi-pci-generic 0005:01:00.0: mhi_sync_power_up - returns
[  276.233417] mhi-pci-generic 0005:01:00.0: mhi_pci_recovery_work - mhi unprepare after power down
[  276.242604] mhi-pci-generic 0005:01:00.0: mhi_pci_recovery_work - pci reset
[  276.249881] mhi-pci-generic 0005:01:00.0: Recovery failed
[  276.255536] mhi-pci-generic 0005:01:00.0: mhi_pci_recovery_work - returns
[  276.328878] qcom-pcie 1bf8000.pci: qcom_pcie_suspend_noirq
[  276.368851] qcom-pcie 1c00000.pci: qcom_pcie_suspend_noirq
[  276.477900] qcom-pcie 1c00000.pci: Failed to enter L2/L3
[  276.483535] qcom-pcie 1c00000.pci: PM: dpm_run_callback(): genpd_suspend_noirq returns -110
[  276.492292] qcom-pcie 1c00000.pci: PM: failed to suspend noirq: error -110
[  276.500218] qcom-pcie 1bf8000.pci: qcom_pcie_resume_noirq
[  276.721401] qcom-pcie 1bf8000.pci: PCIe Gen.4 x4 link up
[  276.730639] PM: noirq suspend of devices failed
[  276.818943] mhi-pci-generic 0005:01:00.0: mhi_pci_resume
[  276.824582] mhi-pci-generic 0005:01:00.0: mhi_pci_runtime_resume

Still better than hanging the machine, but perhaps not much point in
having recovery code that can't recover the device.

We obviously need to track down what is causing the modem to fail to
resume since 6.13-rc1 too.

> Cc: stable@vger.kernel.org # 5.12
> Reported-by: Johan Hovold <johan@kernel.org>
> Closes: https://lore.kernel.org/mhi/Z1me8iaK7cwgjL92@hovoldconsulting.com

Reviewed-by: Johan Hovold <johan+linaro@kernel.org>
Tested-by: Johan Hovold <johan+linaro@kernel.org>

And since I've spent way to much time debugging this and provided the
analysis of the deadlock:

Analyzed-by: Johan Hovold <johan@kernel.org>
Link: https://lore.kernel.org/mhi/Z2KKjWY2mPen6GPL@hovoldconsulting.com/

> Fixes: 7389337f0a78 ("mhi: pci_generic: Add suspend/resume/recovery procedure")
> Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
> ---
>  drivers/bus/mhi/host/pci_generic.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/bus/mhi/host/pci_generic.c b/drivers/bus/mhi/host/pci_generic.c
> index 07645ce2119a..e92df380c785 100644
> --- a/drivers/bus/mhi/host/pci_generic.c
> +++ b/drivers/bus/mhi/host/pci_generic.c
> @@ -1040,7 +1040,7 @@ static void mhi_pci_recovery_work(struct work_struct *work)
>  err_unprepare:
>  	mhi_unprepare_after_power_down(mhi_cntrl);
>  err_try_reset:
> -	if (pci_reset_function(pdev))
> +	if (pci_try_reset_function(pdev))
>  		dev_err(&pdev->dev, "Recovery failed\n");

Perhaps you should log the returned error here as a part of this patch
so we can tell when the recovery code failed due to the device lock
being held.

>  }

Johan

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] bus: mhi: host: pci_generic: Recover the device synchronously from mhi_pci_runtime_resume()
  2025-01-08 13:39 ` [PATCH 2/2] bus: mhi: host: pci_generic: Recover the device synchronously from mhi_pci_runtime_resume() Manivannan Sadhasivam via B4 Relay
  2025-01-08 15:19   ` Loic Poulain
@ 2025-01-22 15:24   ` Johan Hovold
  2025-01-22 17:25     ` Johan Hovold
  1 sibling, 1 reply; 14+ messages in thread
From: Johan Hovold @ 2025-01-22 15:24 UTC (permalink / raw)
  To: manivannan.sadhasivam
  Cc: mhi, Loic Poulain, linux-arm-msm, linux-kernel, stable

On Wed, Jan 08, 2025 at 07:09:28PM +0530, Manivannan Sadhasivam via B4 Relay wrote:
> From: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
> 
> Currently, in mhi_pci_runtime_resume(), if the resume fails, recovery_work
> is started asynchronously and success is returned. But this doesn't align
> with what PM core expects as documented in
> Documentation/power/runtime_pm.rst:
> 
> "Once the subsystem-level resume callback (or the driver resume callback,
> if invoked directly) has completed successfully, the PM core regards the
> device as fully operational, which means that the device _must_ be able to
> complete I/O operations as needed.  The runtime PM status of the device is
> then 'active'."
> 
> So the PM core ends up marking the runtime PM status of the device as
> 'active', even though the device is not able to handle the I/O operations.
> This same condition more or less applies to system resume as well.
> 
> So to avoid this ambiguity, try to recover the device synchronously from
> mhi_pci_runtime_resume() and return the actual error code in the case of
> recovery failure.
> 
> For doing so, move the recovery code to __mhi_pci_recovery_work() helper
> and call that from both mhi_pci_recovery_work() and
> mhi_pci_runtime_resume(). Former still ignores the return value, while the
> latter passes it to PM core.
> 
> Cc: stable@vger.kernel.org # 5.13
> Reported-by: Johan Hovold <johan@kernel.org>
> Closes: https://lore.kernel.org/mhi/Z2PbEPYpqFfrLSJi@hovoldconsulting.com
> Fixes: d3800c1dce24 ("bus: mhi: pci_generic: Add support for runtime PM")
> Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>

Reasoning above makes sense, and I do indeed see resume taking five
seconds longer with this patch as Loic suggested it would.

Unfortunately, something else is broken as the recovery code now
deadlocks again when the modem fails to resume (with both patches
applied):

[  729.833701] PM: suspend entry (deep)
[  729.841377] Filesystems sync: 0.000 seconds
[  729.867672] Freezing user space processes
[  729.869494] Freezing user space processes completed (elapsed 0.001 seconds)
[  729.869499] OOM killer disabled.
[  729.869501] Freezing remaining freezable tasks
[  729.870882] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
[  730.184254] mhi-pci-generic 0005:01:00.0: mhi_pci_runtime_resume
[  730.190643] mhi mhi0: Resuming from non M3 state (SYS ERROR)
[  730.196587] mhi-pci-generic 0005:01:00.0: failed to resume device: -22
[  730.203412] mhi-pci-generic 0005:01:00.0: device recovery started

I've reproduced this three times in three different paths (runtime
resume before suspend; runtime resume during suspend; and during system
resume).

I didn't try to figure what causes the deadlock this time (and lockdep
does not trigger), but you should be able to reproduce this by
instrumenting a resume failure.

Johan

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] bus: mhi: host: pci_generic: Recover the device synchronously from mhi_pci_runtime_resume()
  2025-01-22 15:24   ` Johan Hovold
@ 2025-01-22 17:25     ` Johan Hovold
  0 siblings, 0 replies; 14+ messages in thread
From: Johan Hovold @ 2025-01-22 17:25 UTC (permalink / raw)
  To: manivannan.sadhasivam
  Cc: mhi, Loic Poulain, linux-arm-msm, linux-kernel, stable

On Wed, Jan 22, 2025 at 04:24:27PM +0100, Johan Hovold wrote:
> On Wed, Jan 08, 2025 at 07:09:28PM +0530, Manivannan Sadhasivam via B4 Relay wrote:
> > From: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
> > 
> > Currently, in mhi_pci_runtime_resume(), if the resume fails, recovery_work
> > is started asynchronously and success is returned. But this doesn't align
> > with what PM core expects as documented in
> > Documentation/power/runtime_pm.rst:

> > Cc: stable@vger.kernel.org # 5.13
> > Reported-by: Johan Hovold <johan@kernel.org>
> > Closes: https://lore.kernel.org/mhi/Z2PbEPYpqFfrLSJi@hovoldconsulting.com
> > Fixes: d3800c1dce24 ("bus: mhi: pci_generic: Add support for runtime PM")
> > Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
> 
> Reasoning above makes sense, and I do indeed see resume taking five
> seconds longer with this patch as Loic suggested it would.

I forgot to mention the following warnings that now show up when system
resume succeeds. Recovery was run also before this patch but the "parent
mhi0 should not be sleeping" warnings are new:

[   68.753288] qcom_mhi_qrtr mhi0_IPCR: failed to prepare for autoqueue transfer -22
[   68.761109] qcom_mhi_qrtr mhi0_IPCR: PM: dpm_run_callback(): qcom_mhi_qrtr_pm_resume_early [qrtr_mhi] returns -22
[   68.771804] qcom_mhi_qrtr mhi0_IPCR: PM: failed to resume early: error -22
[   68.795053] mhi-pci-generic 0005:01:00.0: mhi_pci_resume
[   68.800709] mhi-pci-generic 0005:01:00.0: mhi_pci_runtime_resume
[   68.800794] mhi mhi0: Resuming from non M3 state (RESET)
[   68.800804] mhi-pci-generic 0005:01:00.0: failed to resume device: -22
[   68.819517] mhi-pci-generic 0005:01:00.0: device recovery started
[   68.819532] mhi-pci-generic 0005:01:00.0: __mhi_power_down
[   68.819543] mhi-pci-generic 0005:01:00.0: __mhi_power_down - pm mutex taken
[   68.819554] mhi-pci-generic 0005:01:00.0: __mhi_power_down - pm lock taken
[   68.820060] wwan wwan0: port wwan0qcdm0 disconnected
[   68.824839] nvme nvme0: 12/0/0 default/read/poll queues
[   68.857908] wwan wwan0: port wwan0mbim0 disconnected
[   68.864012] wwan wwan0: port wwan0qmi0 disconnected
[   68.943307] mhi-pci-generic 0005:01:00.0: __mhi_power_down - returns
[   68.956253] mhi mhi0: Requested to power ON
[   68.960753] mhi mhi0: Power on setup success
[   68.965262] mhi-pci-generic 0005:01:00.0: mhi_sync_power_up - wait event timeout_ms = 8000
[   73.183086] mhi mhi0: Wait for device to enter SBL or Mission mode
[   73.653462] mhi-pci-generic 0005:01:00.0: mhi_sync_power_up - wait event returns, ret = 0
[   73.653752] mhi mhi0_DIAG: PM: parent mhi0 should not be sleeping
[   73.661955] mhi-pci-generic 0005:01:00.0: mhi_sync_power_up - returns
[   73.668461] mhi mhi0_MBIM: PM: parent mhi0 should not be sleeping
[   73.674950] mhi-pci-generic 0005:01:00.0: Recovery completed
[   73.681428] mhi mhi0_QMI: PM: parent mhi0 should not be sleeping
[   74.315919] OOM killer enabled.
[   74.316475] wwan wwan0: port wwan0qcdm0 attached
[   74.319206] Restarting tasks ...
[   74.322825] done.
[   74.322870] random: crng reseeded on system resumption
[   74.325956] wwan wwan0: port wwan0mbim0 attached
[   74.334467] wwan wwan0: port wwan0qmi0 attached

> Unfortunately, something else is broken as the recovery code now
> deadlocks again when the modem fails to resume (with both patches
> applied):
> 
> [  729.833701] PM: suspend entry (deep)
> [  729.841377] Filesystems sync: 0.000 seconds
> [  729.867672] Freezing user space processes
> [  729.869494] Freezing user space processes completed (elapsed 0.001 seconds)
> [  729.869499] OOM killer disabled.
> [  729.869501] Freezing remaining freezable tasks
> [  729.870882] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
> [  730.184254] mhi-pci-generic 0005:01:00.0: mhi_pci_runtime_resume
> [  730.190643] mhi mhi0: Resuming from non M3 state (SYS ERROR)
> [  730.196587] mhi-pci-generic 0005:01:00.0: failed to resume device: -22
> [  730.203412] mhi-pci-generic 0005:01:00.0: device recovery started
> 
> I've reproduced this three times in three different paths (runtime
> resume before suspend; runtime resume during suspend; and during system
> resume).
> 
> I didn't try to figure what causes the deadlock this time (and lockdep
> does not trigger), but you should be able to reproduce this by
> instrumenting a resume failure.

Johan

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] bus: mhi: host: pci_generic: Use pci_try_reset_function() to avoid deadlock
  2025-01-22 15:11   ` Johan Hovold
@ 2025-02-19 13:13     ` Manivannan Sadhasivam
  2025-02-19 13:52       ` Johan Hovold
  0 siblings, 1 reply; 14+ messages in thread
From: Manivannan Sadhasivam @ 2025-02-19 13:13 UTC (permalink / raw)
  To: Johan Hovold; +Cc: mhi, Loic Poulain, linux-arm-msm, linux-kernel, stable

On Wed, Jan 22, 2025 at 04:11:41PM +0100, Johan Hovold wrote:
> On Wed, Jan 08, 2025 at 07:09:27PM +0530, Manivannan Sadhasivam via B4 Relay wrote:
> > From: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
> > 
> > There are multiple places from where the recovery work gets scheduled
> > asynchronously. Also, there are multiple places where the caller waits
> > synchronously for the recovery to be completed. One such place is during
> > the PM shutdown() callback.
> > 
> > If the device is not alive during recovery_work, it will try to reset the
> > device using pci_reset_function(). This function internally will take the
> > device_lock() first before resetting the device. By this time, if the lock
> > has already been acquired, then recovery_work will get stalled while
> > waiting for the lock. And if the lock was already acquired by the caller
> > which waits for the recovery_work to be completed, it will lead to
> > deadlock.
> > 
> > This is what happened on the X1E80100 CRD device when the device died
> > before shutdown() callback. Driver core calls the driver's shutdown()
> > callback while holding the device_lock() leading to deadlock.
> > 
> > And this deadlock scenario can occur on other paths as well, like during
> > the PM suspend() callback, where the driver core would hold the
> > device_lock() before calling driver's suspend() callback. And if the
> > recovery_work was already started, it could lead to deadlock. This is also
> > observed on the X1E80100 CRD.
> > 
> > So to fix both issues, use pci_try_reset_function() in recovery_work. This
> > function first checks for the availability of the device_lock() before
> > trying to reset the device. If the lock is available, it will acquire it
> > and reset the device. Otherwise, it will return -EAGAIN. If that happens,
> > recovery_work will fail with the error message "Recovery failed" as not
> > much could be done.
> 
> I can confirm that this patch (alone) fixes the deadlock on shutdown
> and suspend as expected, but it does leave the system state that blocks
> further suspend (this is with patches that tear down the PCI link).

Yeah, we wouldn't know when not to return an error to unblock the suspend. So
this is acceptable.

> Looks like the modem is also hosed.
> 
> [  267.454616] mhi-pci-generic 0005:01:00.0: mhi_pci_runtime_resume
> [  267.461165] mhi mhi0: Resuming from non M3 state (SYS ERROR)
> [  267.467102] mhi-pci-generic 0005:01:00.0: failed to resume device: -22
> [  267.473968] mhi-pci-generic 0005:01:00.0: device recovery started
> [  267.477460] mhi-pci-generic 0005:01:00.0: mhi_pci_suspend
> [  267.480331] mhi-pci-generic 0005:01:00.0: __mhi_power_down
> [  267.485917] mhi-pci-generic 0005:01:00.0: mhi_pci_runtime_suspend
> [  267.498339] mhi-pci-generic 0005:01:00.0: __mhi_power_down - pm mutex taken
> [  267.505618] mhi-pci-generic 0005:01:00.0: __mhi_power_down - pm lock taken
> [  267.513372] wwan wwan0: port wwan0qcdm0 disconnected
> [  267.519556] wwan wwan0: port wwan0mbim0 disconnected
> [  267.525544] wwan wwan0: port wwan0qmi0 disconnected
> [  267.573773] mhi-pci-generic 0005:01:00.0: __mhi_power_down - returns
> [  267.591199] mhi mhi0: Requested to power ON
> [  267.914688] mhi mhi0: Power on setup success
> [  267.914875] mhi mhi0: Wait for device to enter SBL or Mission mode
> [  267.919179] mhi-pci-generic 0005:01:00.0: mhi_sync_power_up - wait event timeout_ms = 8000
> [  276.189399] mhi-pci-generic 0005:01:00.0: mhi_sync_power_up - wait event returns, ret = -110
> [  276.198240] mhi-pci-generic 0005:01:00.0: __mhi_power_down
> [  276.203990] mhi-pci-generic 0005:01:00.0: __mhi_power_down - pm mutex taken
> [  276.211269] mhi-pci-generic 0005:01:00.0: __mhi_power_down - pm lock taken
> [  276.220024] mhi-pci-generic 0005:01:00.0: __mhi_power_down - returns
> [  276.226680] mhi-pci-generic 0005:01:00.0: mhi_sync_power_up - returns
> [  276.233417] mhi-pci-generic 0005:01:00.0: mhi_pci_recovery_work - mhi unprepare after power down
> [  276.242604] mhi-pci-generic 0005:01:00.0: mhi_pci_recovery_work - pci reset
> [  276.249881] mhi-pci-generic 0005:01:00.0: Recovery failed
> [  276.255536] mhi-pci-generic 0005:01:00.0: mhi_pci_recovery_work - returns
> [  276.328878] qcom-pcie 1bf8000.pci: qcom_pcie_suspend_noirq
> [  276.368851] qcom-pcie 1c00000.pci: qcom_pcie_suspend_noirq
> [  276.477900] qcom-pcie 1c00000.pci: Failed to enter L2/L3
> [  276.483535] qcom-pcie 1c00000.pci: PM: dpm_run_callback(): genpd_suspend_noirq returns -110
> [  276.492292] qcom-pcie 1c00000.pci: PM: failed to suspend noirq: error -110
> [  276.500218] qcom-pcie 1bf8000.pci: qcom_pcie_resume_noirq
> [  276.721401] qcom-pcie 1bf8000.pci: PCIe Gen.4 x4 link up
> [  276.730639] PM: noirq suspend of devices failed
> [  276.818943] mhi-pci-generic 0005:01:00.0: mhi_pci_resume
> [  276.824582] mhi-pci-generic 0005:01:00.0: mhi_pci_runtime_resume
> 
> Still better than hanging the machine, but perhaps not much point in
> having recovery code that can't recover the device.
> 

Unfortunately, we cannot know if we could not recover the device.

> We obviously need to track down what is causing the modem to fail to
> resume since 6.13-rc1 too.
> 

Yeah, this is worth tracing down.

> > Cc: stable@vger.kernel.org # 5.12
> > Reported-by: Johan Hovold <johan@kernel.org>
> > Closes: https://lore.kernel.org/mhi/Z1me8iaK7cwgjL92@hovoldconsulting.com
> 
> Reviewed-by: Johan Hovold <johan+linaro@kernel.org>
> Tested-by: Johan Hovold <johan+linaro@kernel.org>
> 
> And since I've spent way to much time debugging this and provided the
> analysis of the deadlock:
> 
> Analyzed-by: Johan Hovold <johan@kernel.org>
> Link: https://lore.kernel.org/mhi/Z2KKjWY2mPen6GPL@hovoldconsulting.com/
> 

Sure.

> > Fixes: 7389337f0a78 ("mhi: pci_generic: Add suspend/resume/recovery procedure")
> > Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
> > ---
> >  drivers/bus/mhi/host/pci_generic.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/bus/mhi/host/pci_generic.c b/drivers/bus/mhi/host/pci_generic.c
> > index 07645ce2119a..e92df380c785 100644
> > --- a/drivers/bus/mhi/host/pci_generic.c
> > +++ b/drivers/bus/mhi/host/pci_generic.c
> > @@ -1040,7 +1040,7 @@ static void mhi_pci_recovery_work(struct work_struct *work)
> >  err_unprepare:
> >  	mhi_unprepare_after_power_down(mhi_cntrl);
> >  err_try_reset:
> > -	if (pci_reset_function(pdev))
> > +	if (pci_try_reset_function(pdev))
> >  		dev_err(&pdev->dev, "Recovery failed\n");
> 
> Perhaps you should log the returned error here as a part of this patch
> so we can tell when the recovery code failed due to the device lock
> being held.
> 

Makes sense. Added the errno to the log and applied to patch to mhi/next with
your tags. Thanks a lot!

- Mani

-- 
மணிவண்ணன் சதாசிவம்

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] bus: mhi: host: pci_generic: Use pci_try_reset_function() to avoid deadlock
  2025-02-19 13:13     ` Manivannan Sadhasivam
@ 2025-02-19 13:52       ` Johan Hovold
  2025-02-19 14:14         ` Manivannan Sadhasivam
  0 siblings, 1 reply; 14+ messages in thread
From: Johan Hovold @ 2025-02-19 13:52 UTC (permalink / raw)
  To: Manivannan Sadhasivam
  Cc: mhi, Loic Poulain, linux-arm-msm, linux-kernel, stable

On Wed, Feb 19, 2025 at 06:43:24PM +0530, Manivannan Sadhasivam wrote:
> On Wed, Jan 22, 2025 at 04:11:41PM +0100, Johan Hovold wrote:

> > I can confirm that this patch (alone) fixes the deadlock on shutdown
> > and suspend as expected, but it does leave the system state that blocks
> > further suspend (this is with patches that tear down the PCI link).

> > > Cc: stable@vger.kernel.org # 5.12
> > > Reported-by: Johan Hovold <johan@kernel.org>
> > > Closes: https://lore.kernel.org/mhi/Z1me8iaK7cwgjL92@hovoldconsulting.com

> > > Fixes: 7389337f0a78 ("mhi: pci_generic: Add suspend/resume/recovery procedure")
> > > Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>

> Makes sense. Added the errno to the log and applied to patch to mhi/next with
> your tags. Thanks a lot!

Since this fixes a severe issue that hangs the machine on suspend and
shutdown, please try to get this fixed already in 6.14-rc.

Johan

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] bus: mhi: host: pci_generic: Use pci_try_reset_function() to avoid deadlock
  2025-02-19 13:52       ` Johan Hovold
@ 2025-02-19 14:14         ` Manivannan Sadhasivam
  0 siblings, 0 replies; 14+ messages in thread
From: Manivannan Sadhasivam @ 2025-02-19 14:14 UTC (permalink / raw)
  To: Johan Hovold; +Cc: mhi, Loic Poulain, linux-arm-msm, linux-kernel, stable

On Wed, Feb 19, 2025 at 02:52:40PM +0100, Johan Hovold wrote:
> On Wed, Feb 19, 2025 at 06:43:24PM +0530, Manivannan Sadhasivam wrote:
> > On Wed, Jan 22, 2025 at 04:11:41PM +0100, Johan Hovold wrote:
> 
> > > I can confirm that this patch (alone) fixes the deadlock on shutdown
> > > and suspend as expected, but it does leave the system state that blocks
> > > further suspend (this is with patches that tear down the PCI link).
> 
> > > > Cc: stable@vger.kernel.org # 5.12
> > > > Reported-by: Johan Hovold <johan@kernel.org>
> > > > Closes: https://lore.kernel.org/mhi/Z1me8iaK7cwgjL92@hovoldconsulting.com
> 
> > > > Fixes: 7389337f0a78 ("mhi: pci_generic: Add suspend/resume/recovery procedure")
> > > > Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
> 
> > Makes sense. Added the errno to the log and applied to patch to mhi/next with
> > your tags. Thanks a lot!
> 
> Since this fixes a severe issue that hangs the machine on suspend and
> shutdown, please try to get this fixed already in 6.14-rc.
> 

I usually send fixes PR for bugs introduced in the current cycle. But yeah,
since this is a blocker, I will push it to current rcS.

- Mani

-- 
மணிவண்ணன் சதாசிவம்

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2025-02-19 14:14 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-01-08 13:39 [PATCH 0/2] bus: mhi: host: pci_generic: Couple of recovery fixes Manivannan Sadhasivam via B4 Relay
2025-01-08 13:39 ` [PATCH 1/2] bus: mhi: host: pci_generic: Use pci_try_reset_function() to avoid deadlock Manivannan Sadhasivam via B4 Relay
2025-01-08 14:46   ` Loic Poulain
2025-01-22 15:11   ` Johan Hovold
2025-02-19 13:13     ` Manivannan Sadhasivam
2025-02-19 13:52       ` Johan Hovold
2025-02-19 14:14         ` Manivannan Sadhasivam
2025-01-08 13:39 ` [PATCH 2/2] bus: mhi: host: pci_generic: Recover the device synchronously from mhi_pci_runtime_resume() Manivannan Sadhasivam via B4 Relay
2025-01-08 15:19   ` Loic Poulain
2025-01-08 16:02     ` Manivannan Sadhasivam
2025-01-09 20:50       ` Loic Poulain
2025-01-12  4:23         ` Manivannan Sadhasivam
2025-01-22 15:24   ` Johan Hovold
2025-01-22 17:25     ` Johan Hovold

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox