linux-kernel.vger.kernel.org archive mirror
* mhi resume failure on reboot with 6.13-rc2
@ 2024-12-11 14:17 Johan Hovold
  2024-12-11 14:53 ` Manivannan Sadhasivam
  0 siblings, 1 reply; 20+ messages in thread
From: Johan Hovold @ 2024-12-11 14:17 UTC (permalink / raw)
  To: Manivannan Sadhasivam; +Cc: mhi, linux-arm-msm, linux-kernel

Hi Mani,

I just hit the following modem related error on reboot of the x1e80100
CRD for the second time with 6.13-rc2:

	[  138.348724] shutdown[1]: Rebooting.
        [  138.545683] arm-smmu 3da0000.iommu: disabling translation
        [  138.582505] mhi mhi0: Resuming from non M3 state (SYS ERROR)
        [  138.588516] mhi-pci-generic 0005:01:00.0: failed to resume device: -22
        [  138.595375] mhi-pci-generic 0005:01:00.0: device recovery started
        [  138.603841] wwan wwan0: port wwan0qcdm0 disconnected
        [  138.609508] wwan wwan0: port wwan0mbim0 disconnected
        [  138.615137] wwan wwan0: port wwan0qmi0 disconnected
        [  138.702604] mhi mhi0: Requested to power ON
        [  139.027494] mhi mhi0: Power on setup success
        [  139.027640] mhi mhi0: Wait for device to enter SBL or Mission mode

and then the machine hangs.

Do you know if there are any changes since 6.12 that could cause this?

Johan

* Re: mhi resume failure on reboot with 6.13-rc2
  2024-12-11 14:17 mhi resume failure on reboot with 6.13-rc2 Johan Hovold
@ 2024-12-11 14:53 ` Manivannan Sadhasivam
  2024-12-11 15:03   ` Johan Hovold
  0 siblings, 1 reply; 20+ messages in thread
From: Manivannan Sadhasivam @ 2024-12-11 14:53 UTC (permalink / raw)
  To: Johan Hovold; +Cc: mhi, linux-arm-msm, linux-kernel

Hi,

On Wed, Dec 11, 2024 at 03:17:22PM +0100, Johan Hovold wrote:
> Hi Mani,
> 
> I just hit the following modem related error on reboot of the x1e80100
> CRD for the second time with 6.13-rc2:
> 
> 	[  138.348724] shutdown[1]: Rebooting.
>         [  138.545683] arm-smmu 3da0000.iommu: disabling translation
>         [  138.582505] mhi mhi0: Resuming from non M3 state (SYS ERROR)
>         [  138.588516] mhi-pci-generic 0005:01:00.0: failed to resume device: -22
>         [  138.595375] mhi-pci-generic 0005:01:00.0: device recovery started
>         [  138.603841] wwan wwan0: port wwan0qcdm0 disconnected
>         [  138.609508] wwan wwan0: port wwan0mbim0 disconnected
>         [  138.615137] wwan wwan0: port wwan0qmi0 disconnected
>         [  138.702604] mhi mhi0: Requested to power ON
>         [  139.027494] mhi mhi0: Power on setup success
>         [  139.027640] mhi mhi0: Wait for device to enter SBL or Mission mode
> 
> and then the machine hangs.
> 
> Do you know if there are any changes since 6.12 that could cause this?
> 

Only 3 changes went in for 6.13-rc1 and they shouldn't cause any issues. One
caused the regression with pcim_iomap_region(), but you submitted a fix for
that, and the other two were trivial.

From the log, 'mhi mhi0: Resuming from non M3 state (SYS ERROR)' indicates that
the firmware crashed while resuming. So maybe you should check with the ath12k
folks.

- Mani

-- 
மணிவண்ணன் சதாசிவம்

* Re: mhi resume failure on reboot with 6.13-rc2
  2024-12-11 14:53 ` Manivannan Sadhasivam
@ 2024-12-11 15:03   ` Johan Hovold
  2024-12-16  7:40     ` Manivannan Sadhasivam
  0 siblings, 1 reply; 20+ messages in thread
From: Johan Hovold @ 2024-12-11 15:03 UTC (permalink / raw)
  To: Manivannan Sadhasivam; +Cc: mhi, linux-arm-msm, linux-kernel

On Wed, Dec 11, 2024 at 08:23:15PM +0530, Manivannan Sadhasivam wrote:
> On Wed, Dec 11, 2024 at 03:17:22PM +0100, Johan Hovold wrote:

> > I just hit the following modem related error on reboot of the x1e80100
> > CRD for the second time with 6.13-rc2:
> > 
> > 	[  138.348724] shutdown[1]: Rebooting.
> >         [  138.545683] arm-smmu 3da0000.iommu: disabling translation
> >         [  138.582505] mhi mhi0: Resuming from non M3 state (SYS ERROR)
> >         [  138.588516] mhi-pci-generic 0005:01:00.0: failed to resume device: -22
> >         [  138.595375] mhi-pci-generic 0005:01:00.0: device recovery started
> >         [  138.603841] wwan wwan0: port wwan0qcdm0 disconnected
> >         [  138.609508] wwan wwan0: port wwan0mbim0 disconnected
> >         [  138.615137] wwan wwan0: port wwan0qmi0 disconnected
> >         [  138.702604] mhi mhi0: Requested to power ON
> >         [  139.027494] mhi mhi0: Power on setup success
> >         [  139.027640] mhi mhi0: Wait for device to enter SBL or Mission mode
> > 
> > and then the machine hangs.
> > 
> > Do you know if there are any changes since 6.12 that could cause this?
> 
> Only 3 changes went in for 6.13-rc1 and they shouldn't cause any issues. One
> caused the regression with pcim_iomap_region(), but you submitted a fix for
> that, and the other two were trivial.

Ok, thanks.

> From the log, 'mhi mhi0: Resuming from non M3 state (SYS ERROR)' indicates that
> the firmware crashed while resuming. So maybe you should check with the ath12k
> folks.

This is the modem so I don't think the ath12k wifi folks are to blame
here.

It may be an older, existing issue that started triggering due to
changes in timing or something.

Is there anything you can do on the mhi side to prevent it from blocking
reboot/power off?

I'm guessing the mhi timeout, which I've hit in other paths like resume,
may trigger after a minute or two even if I never waited that long
before hitting reset during reboot.

Johan

* Re: mhi resume failure on reboot with 6.13-rc2
  2024-12-11 15:03   ` Johan Hovold
@ 2024-12-16  7:40     ` Manivannan Sadhasivam
  2024-12-16  7:43       ` Manivannan Sadhasivam
  2024-12-16 13:20       ` Johan Hovold
  0 siblings, 2 replies; 20+ messages in thread
From: Manivannan Sadhasivam @ 2024-12-16  7:40 UTC (permalink / raw)
  To: Johan Hovold; +Cc: mhi, linux-arm-msm, linux-kernel

On Wed, Dec 11, 2024 at 04:03:59PM +0100, Johan Hovold wrote:
> On Wed, Dec 11, 2024 at 08:23:15PM +0530, Manivannan Sadhasivam wrote:
> > On Wed, Dec 11, 2024 at 03:17:22PM +0100, Johan Hovold wrote:
> 
> > > I just hit the following modem related error on reboot of the x1e80100
> > > CRD for the second time with 6.13-rc2:
> > > 
> > > 	[  138.348724] shutdown[1]: Rebooting.
> > >         [  138.545683] arm-smmu 3da0000.iommu: disabling translation
> > >         [  138.582505] mhi mhi0: Resuming from non M3 state (SYS ERROR)
> > >         [  138.588516] mhi-pci-generic 0005:01:00.0: failed to resume device: -22
> > >         [  138.595375] mhi-pci-generic 0005:01:00.0: device recovery started
> > >         [  138.603841] wwan wwan0: port wwan0qcdm0 disconnected
> > >         [  138.609508] wwan wwan0: port wwan0mbim0 disconnected
> > >         [  138.615137] wwan wwan0: port wwan0qmi0 disconnected
> > >         [  138.702604] mhi mhi0: Requested to power ON
> > >         [  139.027494] mhi mhi0: Power on setup success
> > >         [  139.027640] mhi mhi0: Wait for device to enter SBL or Mission mode
> > > 
> > > and then the machine hangs.
> > > 
> > > Do you know if there are any changes since 6.12 that could cause this?
> > 
> > Only 3 changes went in for 6.13-rc1 and they shouldn't cause any issues. One
> > caused the regression with pcim_iomap_region(), but you submitted a fix for
> > that, and the other two were trivial.
> 
> Ok, thanks.
> 
> > From the log, 'mhi mhi0: Resuming from non M3 state (SYS ERROR)' indicates that
> > the firmware crashed while resuming. So maybe you should check with the ath12k
> > folks.
> 
> This is the modem so I don't think the ath12k wifi folks are to blame
> here.
> 

Ah, even after all these years I still confuse WLAN and WWAN :)

> It may be an older, existing issue that started triggering due to
> changes in timing or something.
>

Could be. But the issue seems to be stemming from the modem crash while exiting
M3. You can try removing the modem autosuspend by skipping the if condition
block:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/bus/mhi/host/pci_generic.c?h=v6.13-rc1#n1184

If you no longer see the crash, then the issue might be the modem not coping
with autosuspend. If you still see the crash, then something else is going wrong
during reboot/power off.

> Is there anything you can do on the mhi side to prevent it from blocking
> reboot/power off?
> 

It should not block the reboot/power off forever. There is a timeout waiting for
SBL/Mission mode and the max time is 24s (depending on the modem). Can you share
the modem VID:PID?

- Mani

-- 
மணிவண்ணன் சதாசிவம்

* Re: mhi resume failure on reboot with 6.13-rc2
  2024-12-16  7:40     ` Manivannan Sadhasivam
@ 2024-12-16  7:43       ` Manivannan Sadhasivam
  2024-12-16 13:20       ` Johan Hovold
  1 sibling, 0 replies; 20+ messages in thread
From: Manivannan Sadhasivam @ 2024-12-16  7:43 UTC (permalink / raw)
  To: Johan Hovold; +Cc: mhi, linux-arm-msm, linux-kernel, Loic Poulain

Adding Loic, since he authored the pci_generic driver and knows more about
modems than me.

- Mani

On Mon, Dec 16, 2024 at 01:10:34PM +0530, Manivannan Sadhasivam wrote:
> On Wed, Dec 11, 2024 at 04:03:59PM +0100, Johan Hovold wrote:
> > On Wed, Dec 11, 2024 at 08:23:15PM +0530, Manivannan Sadhasivam wrote:
> > > On Wed, Dec 11, 2024 at 03:17:22PM +0100, Johan Hovold wrote:
> > 
> > > > I just hit the following modem related error on reboot of the x1e80100
> > > > CRD for the second time with 6.13-rc2:
> > > > 
> > > > 	[  138.348724] shutdown[1]: Rebooting.
> > > >         [  138.545683] arm-smmu 3da0000.iommu: disabling translation
> > > >         [  138.582505] mhi mhi0: Resuming from non M3 state (SYS ERROR)
> > > >         [  138.588516] mhi-pci-generic 0005:01:00.0: failed to resume device: -22
> > > >         [  138.595375] mhi-pci-generic 0005:01:00.0: device recovery started
> > > >         [  138.603841] wwan wwan0: port wwan0qcdm0 disconnected
> > > >         [  138.609508] wwan wwan0: port wwan0mbim0 disconnected
> > > >         [  138.615137] wwan wwan0: port wwan0qmi0 disconnected
> > > >         [  138.702604] mhi mhi0: Requested to power ON
> > > >         [  139.027494] mhi mhi0: Power on setup success
> > > >         [  139.027640] mhi mhi0: Wait for device to enter SBL or Mission mode
> > > > 
> > > > and then the machine hangs.
> > > > 
> > > > Do you know if there are any changes since 6.12 that could cause this?
> > > 
> > > Only 3 changes went in for 6.13-rc1 and they shouldn't cause any issues. One
> > > caused the regression with pcim_iomap_region(), but you submitted a fix for
> > > that, and the other two were trivial.
> > 
> > Ok, thanks.
> > 
> > > From the log, 'mhi mhi0: Resuming from non M3 state (SYS ERROR)' indicates that
> > > the firmware crashed while resuming. So maybe you should check with the ath12k
> > > folks.
> > 
> > This is the modem so I don't think the ath12k wifi folks are to blame
> > here.
> > 
> 
> Ah, even after all these years I still confuse WLAN and WWAN :)
> 
> > It may be an older, existing issue that started triggering due to
> > changes in timing or something.
> >
> 
> Could be. But the issue seems to be stemming from the modem crash while exiting
> M3. You can try removing the modem autosuspend by skipping the if condition
> block:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/bus/mhi/host/pci_generic.c?h=v6.13-rc1#n1184
> 
> If you no longer see the crash, then the issue might be the modem not coping
> with autosuspend. If you still see the crash, then something else is going wrong
> during reboot/power off.
> 
> > Is there anything you can do on the mhi side to prevent it from blocking
> > reboot/power off?
> > 
> 
> It should not block the reboot/power off forever. There is a timeout waiting for
> SBL/Mission mode and the max time is 24s (depending on the modem). Can you share
> the modem VID:PID?
> 
> - Mani
> 
> -- 
> மணிவண்ணன் சதாசிவம்

-- 
மணிவண்ணன் சதாசிவம்

* Re: mhi resume failure on reboot with 6.13-rc2
  2024-12-16  7:40     ` Manivannan Sadhasivam
  2024-12-16  7:43       ` Manivannan Sadhasivam
@ 2024-12-16 13:20       ` Johan Hovold
  2024-12-16 14:13         ` Manivannan Sadhasivam
  1 sibling, 1 reply; 20+ messages in thread
From: Johan Hovold @ 2024-12-16 13:20 UTC (permalink / raw)
  To: Manivannan Sadhasivam; +Cc: mhi, linux-arm-msm, linux-kernel, Loic Poulain

On Mon, Dec 16, 2024 at 01:10:21PM +0530, Manivannan Sadhasivam wrote:
> On Wed, Dec 11, 2024 at 04:03:59PM +0100, Johan Hovold wrote:
> > On Wed, Dec 11, 2024 at 08:23:15PM +0530, Manivannan Sadhasivam wrote:
> > > On Wed, Dec 11, 2024 at 03:17:22PM +0100, Johan Hovold wrote:
> > 
> > > > I just hit the following modem related error on reboot of the x1e80100
> > > > CRD for the second time with 6.13-rc2:
> > > > 
> > > > 	[  138.348724] shutdown[1]: Rebooting.
> > > >         [  138.545683] arm-smmu 3da0000.iommu: disabling translation
> > > >         [  138.582505] mhi mhi0: Resuming from non M3 state (SYS ERROR)
> > > >         [  138.588516] mhi-pci-generic 0005:01:00.0: failed to resume device: -22
> > > >         [  138.595375] mhi-pci-generic 0005:01:00.0: device recovery started
> > > >         [  138.603841] wwan wwan0: port wwan0qcdm0 disconnected
> > > >         [  138.609508] wwan wwan0: port wwan0mbim0 disconnected
> > > >         [  138.615137] wwan wwan0: port wwan0qmi0 disconnected
> > > >         [  138.702604] mhi mhi0: Requested to power ON
> > > >         [  139.027494] mhi mhi0: Power on setup success
> > > >         [  139.027640] mhi mhi0: Wait for device to enter SBL or Mission mode
> > > > 
> > > > and then the machine hangs.

> Could be. But the issue seems to be stemming from the modem crash while exiting
> M3. You can try removing the modem autosuspend by skipping the if condition
> block:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/bus/mhi/host/pci_generic.c?h=v6.13-rc1#n1184
> 
> If you no longer see the crash, then the issue might be the modem not coping
> with autosuspend. If you still see the crash, then something else is going wrong
> during reboot/power off.

I've only hit this issue three times and only since 6.13-rc2. So not
sure how useful that sort of experiment would be.

> > Is there anything you can do on the mhi side to prevent it from blocking
> > reboot/power off?
> 
> It should not block the reboot/power off forever. There is a timeout waiting for
> SBL/Mission mode and the max time is 24s (depending on the modem). Can you share
> the modem VID:PID?

I just hit the issue again and can confirm that it does block
reboot/shutdown forever (I've been waiting for 20 minutes now).

Judging from a quick look at the code, "Wait for device to enter SBL or
Mission mode" is printed by mhi_fw_load_handler(), which in turn is only
called from the mhi_pm_st_worker() state machine.

I can't seem to find anything that makes sure that the next state is
ever reached, so regardless of the cause of the modem fw crash (if
that's what it is) the hung reboot appears to be a bug in mhi.

This is with the SDX65 modem in the x1e80100 CRD:
	
	17cb:0308

Johan

* Re: mhi resume failure on reboot with 6.13-rc2
  2024-12-16 13:20       ` Johan Hovold
@ 2024-12-16 14:13         ` Manivannan Sadhasivam
  2024-12-16 16:25           ` Loic Poulain
  2024-12-18  8:40           ` Johan Hovold
  0 siblings, 2 replies; 20+ messages in thread
From: Manivannan Sadhasivam @ 2024-12-16 14:13 UTC (permalink / raw)
  To: Johan Hovold; +Cc: mhi, linux-arm-msm, linux-kernel, Loic Poulain

On Mon, Dec 16, 2024 at 02:20:09PM +0100, Johan Hovold wrote:
> On Mon, Dec 16, 2024 at 01:10:21PM +0530, Manivannan Sadhasivam wrote:
> > On Wed, Dec 11, 2024 at 04:03:59PM +0100, Johan Hovold wrote:
> > > On Wed, Dec 11, 2024 at 08:23:15PM +0530, Manivannan Sadhasivam wrote:
> > > > On Wed, Dec 11, 2024 at 03:17:22PM +0100, Johan Hovold wrote:
> > > 
> > > > > I just hit the following modem related error on reboot of the x1e80100
> > > > > CRD for the second time with 6.13-rc2:
> > > > > 
> > > > > 	[  138.348724] shutdown[1]: Rebooting.
> > > > >         [  138.545683] arm-smmu 3da0000.iommu: disabling translation
> > > > >         [  138.582505] mhi mhi0: Resuming from non M3 state (SYS ERROR)
> > > > >         [  138.588516] mhi-pci-generic 0005:01:00.0: failed to resume device: -22
> > > > >         [  138.595375] mhi-pci-generic 0005:01:00.0: device recovery started
> > > > >         [  138.603841] wwan wwan0: port wwan0qcdm0 disconnected
> > > > >         [  138.609508] wwan wwan0: port wwan0mbim0 disconnected
> > > > >         [  138.615137] wwan wwan0: port wwan0qmi0 disconnected
> > > > >         [  138.702604] mhi mhi0: Requested to power ON
> > > > >         [  139.027494] mhi mhi0: Power on setup success
> > > > >         [  139.027640] mhi mhi0: Wait for device to enter SBL or Mission mode
> > > > > 
> > > > > and then the machine hangs.
> 
> > Could be. But the issue seems to be stemming from the modem crash while exiting
> > M3. You can try removing the modem autosuspend by skipping the if condition
> > block:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/bus/mhi/host/pci_generic.c?h=v6.13-rc1#n1184
> > 
> > If you no longer see the crash, then the issue might be the modem not coping
> > with autosuspend. If you still see the crash, then something else is going wrong
> > during reboot/power off.
> 
> I've only hit this issue three times and only since 6.13-rc2. So not
> sure how useful that sort of experiment would be.
> 

I do not have access to the device. So if you cannot spend time debugging the
reason for the crash, then I'll have to rely on Qcom to do it (which I've asked
anyway).

> > > Is there anything you can do on the mhi side to prevent it from blocking
> > > reboot/power off?
> > 
> > It should not block the reboot/power off forever. There is a timeout waiting for
> > SBL/Mission mode and the max time is 24s (depending on the modem). Can you share
> > the modem VID:PID?
> 
> I just hit the issue again and can confirm that it does block
> reboot/shutdown forever (I've been waiting for 20 minutes now).
> 

Ah, that's bad.

> Judging from a quick look at the code, "Wait for device to enter SBL or
> Mission mode" is printed by mhi_fw_load_handler(), which in turn is only
> called from the mhi_pm_st_worker() state machine.
> 
> I can't seem to find anything that makes sure that the next state is
> ever reached, so regardless of the cause of the modem fw crash

This code will make sure:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/bus/mhi/host/pm.c?h=v6.13-rc1#n1264

But then it doesn't print the error and returns -ETIMEDOUT to the caller after
powering down MHI. The caller (mhi_pci_recovery_work), in the case of failure,
unprepares MHI and starts function level recovery.
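
The bounded wait described above can be modelled in a few lines of userspace C.
This is only a sketch, not the code in pm.c: wait_for_mission_mode(), ready_fn
and the two predicate helpers are hypothetical names, and the loop counts
simulated milliseconds instead of sleeping. What it shows is the contract: the
wait either observes the expected state or gives up with -ETIMEDOUT so that the
caller can clean up.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical predicate: has the device reached the expected state
 * after elapsed_ms simulated milliseconds? */
typedef bool (*ready_fn)(unsigned int elapsed_ms);

/* Poll ready() once per simulated millisecond.
 * Returns 0 on success, -ETIMEDOUT if timeout_ms elapses first. */
int wait_for_mission_mode(ready_fn ready, unsigned int timeout_ms)
{
	for (unsigned int t = 0; t < timeout_ms; t++) {
		if (ready(t))
			return 0;
	}
	return -ETIMEDOUT;
}

/* Models a modem that never leaves the error state. */
bool never_ready(unsigned int elapsed_ms)
{
	(void)elapsed_ms;
	return false;
}

/* Models a modem that becomes ready after 100 ms. */
bool ready_after_100ms(unsigned int elapsed_ms)
{
	return elapsed_ms >= 100;
}
```

On Linux ETIMEDOUT is 110, so a caller that gives up this way sees a return
value of -110.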

> (if
> that's what it is) the hung reboot appears to be a bug in mhi.
> 

I'm not sure where exactly it got stuck. I've asked Qcom folks to reproduce this
issue. We will investigate and hopefully get back with a fix asap.

> This is with the SDX65 modem in the x1e80100 CRD:
> 	
> 	17cb:0308

Okay thanks!

- Mani

-- 
மணிவண்ணன் சதாசிவம்

* Re: mhi resume failure on reboot with 6.13-rc2
  2024-12-16 14:13         ` Manivannan Sadhasivam
@ 2024-12-16 16:25           ` Loic Poulain
  2024-12-17  9:57             ` Johan Hovold
  2024-12-18  8:40           ` Johan Hovold
  1 sibling, 1 reply; 20+ messages in thread
From: Loic Poulain @ 2024-12-16 16:25 UTC (permalink / raw)
  To: Johan Hovold, Manivannan Sadhasivam; +Cc: mhi, linux-arm-msm, linux-kernel

On Mon, 16 Dec 2024 at 15:13, Manivannan Sadhasivam
<manivannan.sadhasivam@linaro.org> wrote:
>
> On Mon, Dec 16, 2024 at 02:20:09PM +0100, Johan Hovold wrote:
> > On Mon, Dec 16, 2024 at 01:10:21PM +0530, Manivannan Sadhasivam wrote:
> > > On Wed, Dec 11, 2024 at 04:03:59PM +0100, Johan Hovold wrote:
> > > > On Wed, Dec 11, 2024 at 08:23:15PM +0530, Manivannan Sadhasivam wrote:
> > > > > On Wed, Dec 11, 2024 at 03:17:22PM +0100, Johan Hovold wrote:
> > > >
> > > > > > I just hit the following modem related error on reboot of the x1e80100
> > > > > > CRD for the second time with 6.13-rc2:
> > > > > >
> > > > > >       [  138.348724] shutdown[1]: Rebooting.
> > > > > >         [  138.545683] arm-smmu 3da0000.iommu: disabling translation
> > > > > >         [  138.582505] mhi mhi0: Resuming from non M3 state (SYS ERROR)
> > > > > >         [  138.588516] mhi-pci-generic 0005:01:00.0: failed to resume device: -22
> > > > > >         [  138.595375] mhi-pci-generic 0005:01:00.0: device recovery started
> > > > > >         [  138.603841] wwan wwan0: port wwan0qcdm0 disconnected
> > > > > >         [  138.609508] wwan wwan0: port wwan0mbim0 disconnected
> > > > > >         [  138.615137] wwan wwan0: port wwan0qmi0 disconnected
> > > > > >         [  138.702604] mhi mhi0: Requested to power ON
> > > > > >         [  139.027494] mhi mhi0: Power on setup success
> > > > > >         [  139.027640] mhi mhi0: Wait for device to enter SBL or Mission mode
> > > > > >
> > > > > > and then the machine hangs.
> >
> > > Could be. But the issue seems to be stemming from the modem crash while exiting
> > > M3. You can try removing the modem autosuspend by skipping the if condition
> > > block:
> > >
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/bus/mhi/host/pci_generic.c?h=v6.13-rc1#n1184
> > >
> > > If you no longer see the crash, then the issue might be the modem not coping
> > > with autosuspend. If you still see the crash, then something else is going wrong
> > > during reboot/power off.
> >
> > I've only hit this issue three times and only since 6.13-rc2. So not
> > sure how useful that sort of experiment would be.
> >
>
> I do not have access to the device. So if you cannot spend time debugging the
> reason for the crash, then I'll have to rely on Qcom to do it (which I've asked
> anyway).
>
> > > > Is there anything you can do on the mhi side to prevent it from blocking
> > > > reboot/power off?
> > >
> > > It should not block the reboot/power off forever. There is a timeout waiting for
> > > SBL/Mission mode and the max time is 24s (depending on the modem). Can you share
> > > the modem VID:PID?
> >
> > I just hit the issue again and can confirm that it does block
> > reboot/shutdown forever (I've been waiting for 20 minutes now).
> >
>
> Ah, that's bad.
>
> > Judging from a quick look at the code, "Wait for device to enter SBL or
> > Mission mode" is printed by mhi_fw_load_handler(), which in turn is only
> > called from the mhi_pm_st_worker() state machine.
> >
> > I can't seem to find anything that makes sure that the next state is
> > ever reached, so regardless of the cause of the modem fw crash
>
> This code will make sure:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/bus/mhi/host/pm.c?h=v6.13-rc1#n1264
>
> But then it doesn't print the error and returns -ETIMEDOUT to the caller after
> powering down MHI. The caller (mhi_pci_recovery_work), in the case of failure,
> unprepares MHI and starts function level recovery.
>
> > (if
> > that's what it is) the hung reboot appears to be a bug in mhi.
> >
>
> I'm not sure where exactly it got stuck. I've asked Qcom folks to reproduce this
> issue. We will investigate and hopefully get back with a fix asap.
>
> > This is with the SDX65 modem in the x1e80100 CRD:
> >
> >       17cb:0308

I have another MHI modem model, but will try to reproduce during the
week. Any idea of the failure rate?

Regards,
Loic

* Re: mhi resume failure on reboot with 6.13-rc2
  2024-12-16 16:25           ` Loic Poulain
@ 2024-12-17  9:57             ` Johan Hovold
  2024-12-18  8:48               ` Johan Hovold
  0 siblings, 1 reply; 20+ messages in thread
From: Johan Hovold @ 2024-12-17  9:57 UTC (permalink / raw)
  To: Loic Poulain; +Cc: Manivannan Sadhasivam, mhi, linux-arm-msm, linux-kernel

On Mon, Dec 16, 2024 at 05:25:23PM +0100, Loic Poulain wrote:
> On Mon, 16 Dec 2024 at 15:13, Manivannan Sadhasivam
> <manivannan.sadhasivam@linaro.org> wrote:
> > On Mon, Dec 16, 2024 at 02:20:09PM +0100, Johan Hovold wrote:
> > > On Mon, Dec 16, 2024 at 01:10:21PM +0530, Manivannan Sadhasivam wrote:
> > > > On Wed, Dec 11, 2024 at 04:03:59PM +0100, Johan Hovold wrote:
> > > > > On Wed, Dec 11, 2024 at 08:23:15PM +0530, Manivannan Sadhasivam wrote:
> > > > > > On Wed, Dec 11, 2024 at 03:17:22PM +0100, Johan Hovold wrote:
> > > > >
> > > > > > > I just hit the following modem related error on reboot of the x1e80100
> > > > > > > CRD for the second time with 6.13-rc2:
> > > > > > >
> > > > > > >       [  138.348724] shutdown[1]: Rebooting.
> > > > > > >         [  138.545683] arm-smmu 3da0000.iommu: disabling translation
> > > > > > >         [  138.582505] mhi mhi0: Resuming from non M3 state (SYS ERROR)
> > > > > > >         [  138.588516] mhi-pci-generic 0005:01:00.0: failed to resume device: -22
> > > > > > >         [  138.595375] mhi-pci-generic 0005:01:00.0: device recovery started
> > > > > > >         [  138.603841] wwan wwan0: port wwan0qcdm0 disconnected
> > > > > > >         [  138.609508] wwan wwan0: port wwan0mbim0 disconnected
> > > > > > >         [  138.615137] wwan wwan0: port wwan0qmi0 disconnected
> > > > > > >         [  138.702604] mhi mhi0: Requested to power ON
> > > > > > >         [  139.027494] mhi mhi0: Power on setup success
> > > > > > >         [  139.027640] mhi mhi0: Wait for device to enter SBL or Mission mode
> > > > > > >
> > > > > > > and then the machine hangs.

> > > I've only hit this issue three times and only since 6.13-rc2. So not
> > > sure how useful that sort of experiment would be.

> > I'm not sure where exactly it got stuck. I've asked Qcom folks to reproduce this
> > issue. We will investigate and hopefully get back with a fix asap.
> >
> > > This is with the SDX65 modem in the x1e80100 CRD:
> > >
> > >       17cb:0308
> 
> I have another MHI modem model, but will try to reproduce during the
> week. Any idea of the failure rate?

I've now hit this four times. And only since rc2. So I guess that's
something like four times in a hundred reboots or so.

I added some printks to the pci_generic driver this morning and have
been running a boot loop for one hundred iterations without hitting the
issue even once, however. Perhaps the printks alter the timing enough
to avoid the fw crash or race.

Johan

* Re: mhi resume failure on reboot with 6.13-rc2
  2024-12-16 14:13         ` Manivannan Sadhasivam
  2024-12-16 16:25           ` Loic Poulain
@ 2024-12-18  8:40           ` Johan Hovold
  2024-12-18 11:38             ` Manivannan Sadhasivam
  1 sibling, 1 reply; 20+ messages in thread
From: Johan Hovold @ 2024-12-18  8:40 UTC (permalink / raw)
  To: Manivannan Sadhasivam; +Cc: mhi, linux-arm-msm, linux-kernel, Loic Poulain

On Mon, Dec 16, 2024 at 07:43:03PM +0530, Manivannan Sadhasivam wrote:
> On Mon, Dec 16, 2024 at 02:20:09PM +0100, Johan Hovold wrote:
> > On Mon, Dec 16, 2024 at 01:10:21PM +0530, Manivannan Sadhasivam wrote:
> > > On Wed, Dec 11, 2024 at 04:03:59PM +0100, Johan Hovold wrote:

> > I just hit the issue again and can confirm that it does block
> > reboot/shutdown forever (I've been waiting for 20 minutes now).
> 
> Ah, that's bad.
> 
> > Judging from a quick look at the code, "Wait for device to enter SBL or
> > Mission mode" is printed by mhi_fw_load_handler(), which in turn is only
> > called from the mhi_pm_st_worker() state machine.
> > 
> > I can't seem to find anything that makes sure that the next state is
> > ever reached, so regardless of the cause of the modem fw crash
> 
> This code will make sure:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/bus/mhi/host/pm.c?h=v6.13-rc1#n1264
> 
> But then it doesn't print the error and returns -ETIMEDOUT to the caller after
> powering down MHI. The caller (mhi_pci_recovery_work), in the case of failure,
> unprepares MHI and starts function level recovery.
> 
> > (if
> > that's what it is) the hung reboot appears to be a bug in mhi.

I've tracked down the hang to a deadlock on the parent device lock.

Driver core takes the parent device lock before calling shutdown(), and
then mhi_pci_shutdown() waits indefinitely for the recovery thread to
finish.

But the mhi recovery thread ends up trying to take the same parent
device lock in pci_reset_function() when recovery fails:

[  339.351915] shutdown[1]: Rebooting.
[  339.724498] arm-smmu 3da0000.iommu: disabling translation
[  339.760134] mhi mhi0: Resuming from non M3 state (SYS ERROR)
[  339.766211] mhi-pci-generic 0005:01:00.0: failed to resume device: -22
[  339.773158] mhi-pci-generic 0005:01:00.0: device recovery started

The recovery thread is running before shutdown() is called.

[  339.779638] mhi-pci-generic 0005:01:00.0: __mhi_power_down
[  339.779650] mhi-pci-generic 0005:01:00.0: mhi_pci_shutdown
[  339.785422] wwan wwan0: port wwan0qcdm0 disconnected
[  339.791001] mhi-pci-generic 0005:01:00.0: mhi_pci_remove
[  339.791006] mhi-pci-generic 0005:01:00.0: mhi_pci_remove - cancel work sync

shutdown() waits for the recovery thread to finish.

[  339.825892] wwan wwan0: port wwan0mbim0 disconnected
[  339.831320] wwan wwan0: port wwan0qmi0 disconnected
[  339.904249] mhi-pci-generic 0005:01:00.0: __mhi_power_down - returns
[  340.025390] mhi mhi0: Requested to power ON
[  340.233771] mhi mhi0: Power on setup success
[  340.233954] mhi mhi0: Wait for device to enter SBL or Mission mode
[  340.238272] mhi-pci-generic 0005:01:00.0: mhi_sync_power_up - wait event timeout_ms = 8000
[  348.400082] mhi-pci-generic 0005:01:00.0: mhi_sync_power_up - wait event returns, ret = -110

The recovery thread fails to power up the device.

[  348.419967] mhi-pci-generic 0005:01:00.0: __mhi_power_down
[  348.472665] mhi-pci-generic 0005:01:00.0: __mhi_power_down - returns
[  348.725069] mhi-pci-generic 0005:01:00.0: mhi_sync_power_up - returns
[  348.742644] mhi-pci-generic 0005:01:00.0: mhi_pci_recovery_work - mhi unprepare after power down
[  348.762737] mhi-pci-generic 0005:01:00.0: mhi_pci_recovery_work - pci reset
[  348.780904] mhi-pci-generic 0005:01:00.0: pci_reset_function

It then tries to reset the device, which triggers the deadlock when
pci_reset_function() tries to take the already held parent (bridge) device lock.
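
The lock ordering above can be reproduced in a small userspace model. This is a
sketch under stated assumptions, not kernel code: parent_lock stands in for the
parent device lock, recovery_work() for the recovery thread, and
shutdown_would_deadlock() for the shutdown path; all three names are
hypothetical. The worker uses pthread_mutex_trylock() so that the model reports
the deadlock instead of hanging.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Stand-in for the parent (bridge) device lock. */
static pthread_mutex_t parent_lock = PTHREAD_MUTEX_INITIALIZER;

/* Set by the worker: true if it found the parent lock already held. */
static bool reset_would_block;

/* Models the recovery thread after power-up has failed: the reset step
 * needs the parent lock. A blocking lock here would never return while
 * the waiter holds it, so the model only probes with trylock. */
void *recovery_work(void *unused)
{
	(void)unused;
	if (pthread_mutex_trylock(&parent_lock) == 0) {
		pthread_mutex_unlock(&parent_lock);
		reset_would_block = false;
	} else {
		reset_would_block = true;
	}
	return NULL;
}

/* Models the shutdown path: optionally take the parent lock, then wait
 * for the worker to finish. Returns true if the worker would have
 * blocked on the lock, i.e. if the real sequence would deadlock. */
bool shutdown_would_deadlock(bool hold_parent_lock)
{
	pthread_t worker;
	int rc;

	if (hold_parent_lock)
		pthread_mutex_lock(&parent_lock);

	rc = pthread_create(&worker, NULL, recovery_work, NULL);
	assert(rc == 0);
	pthread_join(worker, NULL);	/* wait for the worker to finish */

	if (hold_parent_lock)
		pthread_mutex_unlock(&parent_lock);

	return reset_would_block;
}
```

shutdown_would_deadlock(true) reports that the worker found the lock held, which
is the situation in the trace. With the lock not held by the waiter, the worker
can take it and finish. Build with -pthread.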

Johan

* Re: mhi resume failure on reboot with 6.13-rc2
  2024-12-17  9:57             ` Johan Hovold
@ 2024-12-18  8:48               ` Johan Hovold
  0 siblings, 0 replies; 20+ messages in thread
From: Johan Hovold @ 2024-12-18  8:48 UTC (permalink / raw)
  To: Loic Poulain; +Cc: Manivannan Sadhasivam, mhi, linux-arm-msm, linux-kernel

On Tue, Dec 17, 2024 at 10:57:39AM +0100, Johan Hovold wrote:

> I've now hit this four times. And only since rc2. So I guess that's
> something like four times in a hundred reboots or so.
> 
> I added some printks to the pci_generic driver this morning and have
> been running a boot loop for one hundred iterations without hitting the
> issue even once, however. Perhaps the printks alter the timing enough
> to avoid the fw crash or race.

The printks were not preventing the bug from triggering, but I've
only hit this after the machine has been up for a few minutes (i.e. the
delay before rebooting in my boot loop may have been too short).

Johan

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: mhi resume failure on reboot with 6.13-rc2
  2024-12-18  8:40           ` Johan Hovold
@ 2024-12-18 11:38             ` Manivannan Sadhasivam
  2024-12-18 12:02               ` Johan Hovold
  0 siblings, 1 reply; 20+ messages in thread
From: Manivannan Sadhasivam @ 2024-12-18 11:38 UTC (permalink / raw)
  To: Johan Hovold; +Cc: mhi, linux-arm-msm, linux-kernel, Loic Poulain

On Wed, Dec 18, 2024 at 09:40:45AM +0100, Johan Hovold wrote:
> On Mon, Dec 16, 2024 at 07:43:03PM +0530, Manivannan Sadhasivam wrote:
> > On Mon, Dec 16, 2024 at 02:20:09PM +0100, Johan Hovold wrote:
> > > On Mon, Dec 16, 2024 at 01:10:21PM +0530, Manivannan Sadhasivam wrote:
> > > > On Wed, Dec 11, 2024 at 04:03:59PM +0100, Johan Hovold wrote:
> 
> > > I just hit the issue again and can confirm that it does block
> > > reboot/shutdown forever (I've been waiting for 20 minutes now).
> > 
> > Ah, that's bad.
> > 
> > > Judging from a quick look at the code, "Wait for device to enter SBL or
> > > Mission mode" is printed by mhi_fw_load_handler(), which in turn is only
> > > called from the mhi_pm_st_worker() state machine.
> > > 
> > > I can't seem to find anything that makes sure that the next state is
> > > ever reached, so regardless of the cause of the modem fw crash
> > 
> > This code will make sure:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/bus/mhi/host/pm.c?h=v6.13-rc1#n1264
> > 
> > But then it doesn't print the error and returns -ETIMEDOUT to the caller after
> > powering down MHI. The caller (mhi_pci_recovery_work), in the case of failure,
> > unprepares MHI and starts function level recovery.
> > 
> > > (if
> > > that's what it is) the hung reboot appears to be a bug in mhi.
> 
> I've tracked down the hang to a deadlock on the parent device lock.
> 
> Driver core takes the parent device lock before calling shutdown(), and
> then mhi_pci_shutdown() waits indefinitely for the recovery thread to
> finish.
> 
> But the mhi recovery thread ends up trying to take the same parent
> device lock in pci_reset_function() when recovery fails:
> 
> [  339.351915] shutdown[1]: Rebooting.
> [  339.724498] arm-smmu 3da0000.iommu: disabling translation
> [  339.760134] mhi mhi0: Resuming from non M3 state (SYS ERROR)
> [  339.766211] mhi-pci-generic 0005:01:00.0: failed to resume device: -22
> [  339.773158] mhi-pci-generic 0005:01:00.0: device recovery started
> 
> The recovery thread is running before shutdown() is called.
> 
> [  339.779638] mhi-pci-generic 0005:01:00.0: __mhi_power_down
> [  339.779650] mhi-pci-generic 0005:01:00.0: mhi_pci_shutdown
> [  339.785422] wwan wwan0: port wwan0qcdm0 disconnected
> [  339.791001] mhi-pci-generic 0005:01:00.0: mhi_pci_remove
> [  339.791006] mhi-pci-generic 0005:01:00.0: mhi_pci_remove - cancel work sync
> 
> shutdown() waits for the recovery thread to finish
> 
> [  339.825892] wwan wwan0: port wwan0mbim0 disconnected
> [  339.831320] wwan wwan0: port wwan0qmi0 disconnected
> [  339.904249] mhi-pci-generic 0005:01:00.0: __mhi_power_down - returns
> [  340.025390] mhi mhi0: Requested to power ON
> [  340.233771] mhi mhi0: Power on setup success
> [  340.233954] mhi mhi0: Wait for device to enter SBL or Mission mode
> [  340.238272] mhi-pci-generic 0005:01:00.0: mhi_sync_power_up - wait event timeout_ms = 8000
> [  348.400082] mhi-pci-generic 0005:01:00.0: mhi_sync_power_up - wait event returns, ret = -110
> 
> The recovery thread fails to power up the device.
> 
> [  348.419967] mhi-pci-generic 0005:01:00.0: __mhi_power_down
> [  348.472665] mhi-pci-generic 0005:01:00.0: __mhi_power_down - returns
> [  348.725069] mhi-pci-generic 0005:01:00.0: mhi_sync_power_up - returns
> [  348.742644] mhi-pci-generic 0005:01:00.0: mhi_pci_recovery_work - mhi unprepare after power down
> [  348.762737] mhi-pci-generic 0005:01:00.0: mhi_pci_recovery_work - pci reset
> [  348.780904] mhi-pci-generic 0005:01:00.0: pci_reset_function
> 
> And tries to reset the device, which triggers the deadlock when
> trying to take the already held parent (bridge) device lock.
> 

Thanks for tracking the deadlock. I think we should use pci_try_reset_function()
instead of pci_reset_function() in mhi_pci_recovery_work().

If the pci_dev_lock() is already taken, it will return with -EAGAIN and we do
not need to worry in that case since the host is going to be powered off anyway
(and so the device).

- Mani

-- 
மணிவண்ணன் சதாசிவம்

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: mhi resume failure on reboot with 6.13-rc2
  2024-12-18 11:38             ` Manivannan Sadhasivam
@ 2024-12-18 12:02               ` Johan Hovold
  2024-12-18 12:30                 ` Manivannan Sadhasivam
  0 siblings, 1 reply; 20+ messages in thread
From: Johan Hovold @ 2024-12-18 12:02 UTC (permalink / raw)
  To: Manivannan Sadhasivam; +Cc: mhi, linux-arm-msm, linux-kernel, Loic Poulain

On Wed, Dec 18, 2024 at 05:08:30PM +0530, Manivannan Sadhasivam wrote:
> On Wed, Dec 18, 2024 at 09:40:45AM +0100, Johan Hovold wrote:

> > I've tracked down the hang to a deadlock on the parent device lock.
> > 
> > Driver core takes the parent device lock before calling shutdown(), and
> > then mhi_pci_shutdown() waits indefinitely for the recovery thread to
> > finish.

> Thanks for tracking the deadlock. I think we should use pci_try_reset_function()
> instead of pci_reset_function() in mhi_pci_recovery_work().
> 
> If the pci_dev_lock() is already taken, it will return with -EAGAIN and we do
> not need to worry in that case since the host is going to be powered off anyway
> (and so the device).

That may work. But note that I've now also seen this deadlock during
suspend (i.e. when the device is not going away). The
pci_try_reset_function() should avoid the deadlock here too, but we'll
end up in a funny state.

Now I'd also like to know why I'm suddenly seeing these runtime PM
resume errors of this modem. Haven't seen them before 6.13-rc, and I'm
not sure that it's really the firmware that is crashing left and right
all of a sudden.

Johan

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: mhi resume failure on reboot with 6.13-rc2
  2024-12-18 12:02               ` Johan Hovold
@ 2024-12-18 12:30                 ` Manivannan Sadhasivam
  2024-12-18 13:55                   ` Johan Hovold
  0 siblings, 1 reply; 20+ messages in thread
From: Manivannan Sadhasivam @ 2024-12-18 12:30 UTC (permalink / raw)
  To: Johan Hovold; +Cc: mhi, linux-arm-msm, linux-kernel, Loic Poulain

On Wed, Dec 18, 2024 at 01:02:39PM +0100, Johan Hovold wrote:
> On Wed, Dec 18, 2024 at 05:08:30PM +0530, Manivannan Sadhasivam wrote:
> > On Wed, Dec 18, 2024 at 09:40:45AM +0100, Johan Hovold wrote:
> 
> > > I've tracked down the hang to a deadlock on the parent device lock.
> > > 
> > > Driver core takes the parent device lock before calling shutdown(), and
> > > then mhi_pci_shutdown() waits indefinitely for the recovery thread to
> > > finish.
> 
> > Thanks for tracking the deadlock. I think we should use pci_try_reset_function()
> > instead of pci_reset_function() in mhi_pci_recovery_work().
> > 
> > If the pci_dev_lock() is already taken, it will return with -EAGAIN and we do
> > not need to worry in that case since the host is going to be powered off anyway
> > (and so the device).
> 
> That may work. But note that I've now also seen this deadlock during
> suspend (i.e. when the device is not going away). The
> pci_try_reset_function() should avoid the deadlock here too, but we'll
> end up in funny state.
> 

Hopefully, recovery_work() started by mhi_pci_runtime_resume() would be able to
reset the device.

> Now I'd also like to know why I'm suddenly seeing these runtime PM
> resume errors of this modem. Haven't seen them before 6.13-rc, and I'm
> not sure that it's really the firmware that is crashing left and right
> all of a sudden.
> 

Yeah, that's worth the effort. I'll go ahead with the patch since the issue is
present anyway.

- Mani

-- 
மணிவண்ணன் சதாசிவம்

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: mhi resume failure on reboot with 6.13-rc2
  2024-12-18 12:30                 ` Manivannan Sadhasivam
@ 2024-12-18 13:55                   ` Johan Hovold
  2024-12-18 14:09                     ` Manivannan Sadhasivam
  0 siblings, 1 reply; 20+ messages in thread
From: Johan Hovold @ 2024-12-18 13:55 UTC (permalink / raw)
  To: Manivannan Sadhasivam; +Cc: mhi, linux-arm-msm, linux-kernel, Loic Poulain

On Wed, Dec 18, 2024 at 06:00:19PM +0530, Manivannan Sadhasivam wrote:
> On Wed, Dec 18, 2024 at 01:02:39PM +0100, Johan Hovold wrote:
> > On Wed, Dec 18, 2024 at 05:08:30PM +0530, Manivannan Sadhasivam wrote:
> > > On Wed, Dec 18, 2024 at 09:40:45AM +0100, Johan Hovold wrote:
> > 
> > > > I've tracked down the hang to a deadlock on the parent device lock.
> > > > 
> > > > Driver core takes the parent device lock before calling shutdown(), and
> > > > then mhi_pci_shutdown() waits indefinitely for the recovery thread to
> > > > finish.
> > 
> > > Thanks for tracking the deadlock. I think we should use pci_try_reset_function()
> > > instead of pci_reset_function() in mhi_pci_recovery_work().
> > > 
> > > If the pci_dev_lock() is already taken, it will return with -EAGAIN and we do
> > > not need to worry in that case since the host is going to be powered off anyway
> > > (and so the device).
> > 
> > That may work. But note that I've now also seen this deadlock during
> > suspend (i.e. when the device is not going away). The
> > pci_try_reset_function() should avoid the deadlock here too, but we'll
> > end up in funny state.
> 
> Hopefully, recovery_work() started by mhi_pci_runtime_resume() would be able to
> reset the device.

But that's not going to happen as that reset is what is currently
causing the deadlock and which would simply be skipped if you switch to
pci_try_reset_function().

Johan

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: mhi resume failure on reboot with 6.13-rc2
  2024-12-18 13:55                   ` Johan Hovold
@ 2024-12-18 14:09                     ` Manivannan Sadhasivam
  2024-12-18 14:26                       ` Johan Hovold
  0 siblings, 1 reply; 20+ messages in thread
From: Manivannan Sadhasivam @ 2024-12-18 14:09 UTC (permalink / raw)
  To: Johan Hovold; +Cc: mhi, linux-arm-msm, linux-kernel, Loic Poulain

On Wed, Dec 18, 2024 at 02:55:02PM +0100, Johan Hovold wrote:
> On Wed, Dec 18, 2024 at 06:00:19PM +0530, Manivannan Sadhasivam wrote:
> > On Wed, Dec 18, 2024 at 01:02:39PM +0100, Johan Hovold wrote:
> > > On Wed, Dec 18, 2024 at 05:08:30PM +0530, Manivannan Sadhasivam wrote:
> > > > On Wed, Dec 18, 2024 at 09:40:45AM +0100, Johan Hovold wrote:
> > > 
> > > > > I've tracked down the hang to a deadlock on the parent device lock.
> > > > > 
> > > > > Driver core takes the parent device lock before calling shutdown(), and
> > > > > then mhi_pci_shutdown() waits indefinitely for the recovery thread to
> > > > > finish.
> > > 
> > > > Thanks for tracking the deadlock. I think we should use pci_try_reset_function()
> > > > instead of pci_reset_function() in mhi_pci_recovery_work().
> > > > 
> > > > If the pci_dev_lock() is already taken, it will return with -EAGAIN and we do
> > > > not need to worry in that case since the host is going to be powered off anyway
> > > > (and so the device).
> > > 
> > > That may work. But note that I've now also seen this deadlock during
> > > suspend (i.e. when the device is not going away). The
> > > pci_try_reset_function() should avoid the deadlock here too, but we'll
> > > end up in funny state.
> > 
> > Hopefully, recovery_work() started by mhi_pci_runtime_resume() would be able to
> > reset the device.
> 
> But that's not going to happen as that reset is what is currently
> causing the deadlock and which would simply be skipped if you switch to
> pci_try_reset_function().
> 

mhi_pci_runtime_resume() will queue the recovery_work() and return. So I was
hoping that by the time pci_try_reset_function() is called, the lock would be
available.

- Mani

-- 
மணிவண்ணன் சதாசிவம்

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: mhi resume failure on reboot with 6.13-rc2
  2024-12-18 14:09                     ` Manivannan Sadhasivam
@ 2024-12-18 14:26                       ` Johan Hovold
  2024-12-18 18:35                         ` Manivannan Sadhasivam
  0 siblings, 1 reply; 20+ messages in thread
From: Johan Hovold @ 2024-12-18 14:26 UTC (permalink / raw)
  To: Manivannan Sadhasivam; +Cc: mhi, linux-arm-msm, linux-kernel, Loic Poulain

On Wed, Dec 18, 2024 at 07:39:10PM +0530, Manivannan Sadhasivam wrote:
> On Wed, Dec 18, 2024 at 02:55:02PM +0100, Johan Hovold wrote:
> > On Wed, Dec 18, 2024 at 06:00:19PM +0530, Manivannan Sadhasivam wrote:
> > > On Wed, Dec 18, 2024 at 01:02:39PM +0100, Johan Hovold wrote:
> > > > On Wed, Dec 18, 2024 at 05:08:30PM +0530, Manivannan Sadhasivam wrote:
> > > > > On Wed, Dec 18, 2024 at 09:40:45AM +0100, Johan Hovold wrote:
> > > > 
> > > > > > I've tracked down the hang to a deadlock on the parent device lock.
> > > > > > 
> > > > > > Driver core takes the parent device lock before calling shutdown(), and
> > > > > > then mhi_pci_shutdown() waits indefinitely for the recovery thread to
> > > > > > finish.
> > > > 
> > > > > Thanks for tracking the deadlock. I think we should use pci_try_reset_function()
> > > > > instead of pci_reset_function() in mhi_pci_recovery_work().
> > > > > 
> > > > > If the pci_dev_lock() is already taken, it will return with -EAGAIN and we do
> > > > > not need to worry in that case since the host is going to be powered off anyway
> > > > > (and so the device).
> > > > 
> > > > That may work. But note that I've now also seen this deadlock during
> > > > suspend (i.e. when the device is not going away). The
> > > > pci_try_reset_function() should avoid the deadlock here too, but we'll
> > > > end up in funny state.
> > > 
> > > Hopefully, recovery_work() started by mhi_pci_runtime_resume() would be able to
> > > reset the device.
> > 
> > But that's not going to happen as that reset is what is currently
> > causing the deadlock and which would simply be skipped if you switch to
> > pci_try_reset_function().
> > 
> 
> mhi_pci_runtime_resume() will queue the recovery_work() and return. So I was
> hoping that by the time pci_try_reset_function() is called, the lock would be
> available.

We can't rely on luck with timings, and this is the very reason for the
deadlock I'm currently seeing (i.e. the recovery thread is still running
when another thread grabs the lock and waits for the recovery thread to
finish).

Perhaps the recovery work should be done synchronously in the resume
handler to avoid such issues.

Johan

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: mhi resume failure on reboot with 6.13-rc2
  2024-12-18 14:26                       ` Johan Hovold
@ 2024-12-18 18:35                         ` Manivannan Sadhasivam
  2024-12-19  8:36                           ` Johan Hovold
  0 siblings, 1 reply; 20+ messages in thread
From: Manivannan Sadhasivam @ 2024-12-18 18:35 UTC (permalink / raw)
  To: Johan Hovold; +Cc: mhi, linux-arm-msm, linux-kernel, Loic Poulain

On Wed, Dec 18, 2024 at 03:26:38PM +0100, Johan Hovold wrote:
> On Wed, Dec 18, 2024 at 07:39:10PM +0530, Manivannan Sadhasivam wrote:
> > On Wed, Dec 18, 2024 at 02:55:02PM +0100, Johan Hovold wrote:
> > > On Wed, Dec 18, 2024 at 06:00:19PM +0530, Manivannan Sadhasivam wrote:
> > > > On Wed, Dec 18, 2024 at 01:02:39PM +0100, Johan Hovold wrote:
> > > > > On Wed, Dec 18, 2024 at 05:08:30PM +0530, Manivannan Sadhasivam wrote:
> > > > > > On Wed, Dec 18, 2024 at 09:40:45AM +0100, Johan Hovold wrote:
> > > > > 
> > > > > > > I've tracked down the hang to a deadlock on the parent device lock.
> > > > > > > 
> > > > > > > Driver core takes the parent device lock before calling shutdown(), and
> > > > > > > then mhi_pci_shutdown() waits indefinitely for the recovery thread to
> > > > > > > finish.
> > > > > 
> > > > > > Thanks for tracking the deadlock. I think we should use pci_try_reset_function()
> > > > > > instead of pci_reset_function() in mhi_pci_recovery_work().
> > > > > > 
> > > > > > If the pci_dev_lock() is already taken, it will return with -EAGAIN and we do
> > > > > > not need to worry in that case since the host is going to be powered off anyway
> > > > > > (and so the device).
> > > > > 
> > > > > That may work. But note that I've now also seen this deadlock during
> > > > > suspend (i.e. when the device is not going away). The
> > > > > pci_try_reset_function() should avoid the deadlock here too, but we'll
> > > > > end up in funny state.
> > > > 
> > > > Hopefully, recovery_work() started by mhi_pci_runtime_resume() would be able to
> > > > reset the device.
> > > 
> > > But that's not going to happen as that reset is what is currently
> > > causing the deadlock and which would simply be skipped if you switch to
> > > pci_try_reset_function().
> > > 
> > 
> > mhi_pci_runtime_resume() will queue the recovery_work() and return. So I was
> > hoping that by the time pci_try_reset_function() is called, the lock would be
> > available.
> 
> We can't rely on luck with timings, and this is the very reason for the
> deadlock I'm currently seeing (i.e. the recovery thread is still running
> when another thread grabs the lock and waits for the recovery thread to
> finish).
> 
> Perhaps the recovery work should be done synchronously in the resume
> handler to avoid such issues.
> 

Synchronously? How can that help when the recovery_work() cannot acquire the
lock?

Anyhow, even if the lock is not available during resume (worst case), PCI core
should reset the device when it tries to change the state.

I don't know if there is any better solution available.

- Mani

-- 
மணிவண்ணன் சதாசிவம்

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: mhi resume failure on reboot with 6.13-rc2
  2024-12-18 18:35                         ` Manivannan Sadhasivam
@ 2024-12-19  8:36                           ` Johan Hovold
  2025-01-08 12:49                             ` Manivannan Sadhasivam
  0 siblings, 1 reply; 20+ messages in thread
From: Johan Hovold @ 2024-12-19  8:36 UTC (permalink / raw)
  To: Manivannan Sadhasivam; +Cc: mhi, linux-arm-msm, linux-kernel, Loic Poulain

On Thu, Dec 19, 2024 at 12:05:55AM +0530, Manivannan Sadhasivam wrote:
> On Wed, Dec 18, 2024 at 03:26:38PM +0100, Johan Hovold wrote:
> > On Wed, Dec 18, 2024 at 07:39:10PM +0530, Manivannan Sadhasivam wrote:
> > > On Wed, Dec 18, 2024 at 02:55:02PM +0100, Johan Hovold wrote:

> > > > But that's not going to happen as that reset is what is currently
> > > > causing the deadlock and which would simply be skipped if you switch to
> > > > pci_try_reset_function().
> > > > 
> > > 
> > > mhi_pci_runtime_resume() will queue the recovery_work() and return. So I was
> > > hoping that by the time pci_try_reset_function() is called, the lock would be
> > > available.
> > 
> > We can't rely on luck with timings, and this is the very reason for the
> > deadlock I'm currently seeing (i.e. the recovery thread is still running
> > when another thread grabs the lock and waits for the recovery thread to
> > finish).
> > 
> > Perhaps the recovery work should be done synchronously in the resume
> > handler to avoid such issues.
> 
> Synchronously? How can that help when the recovery_work() cannot acquire the
> lock?

During system suspend, pm core waits for any on-going runtime resume
operations to complete before taking the device lock and suspending the
device.

Unfortunately, that's currently not the case during shutdown() where
those operations are reversed, so that would indeed need to be addressed
first.

But what the driver is currently doing looks highly questionable as it
returns success when it failed to resume the device (after scheduling
the asynchronous recovery work).

Johan

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: mhi resume failure on reboot with 6.13-rc2
  2024-12-19  8:36                           ` Johan Hovold
@ 2025-01-08 12:49                             ` Manivannan Sadhasivam
  0 siblings, 0 replies; 20+ messages in thread
From: Manivannan Sadhasivam @ 2025-01-08 12:49 UTC (permalink / raw)
  To: Johan Hovold; +Cc: mhi, linux-arm-msm, linux-kernel, Loic Poulain

On Thu, Dec 19, 2024 at 09:36:32AM +0100, Johan Hovold wrote:
> On Thu, Dec 19, 2024 at 12:05:55AM +0530, Manivannan Sadhasivam wrote:
> > On Wed, Dec 18, 2024 at 03:26:38PM +0100, Johan Hovold wrote:
> > > On Wed, Dec 18, 2024 at 07:39:10PM +0530, Manivannan Sadhasivam wrote:
> > > > On Wed, Dec 18, 2024 at 02:55:02PM +0100, Johan Hovold wrote:
> 
> > > > > But that's not going to happen as that reset is what is currently
> > > > > causing the deadlock and which would simply be skipped if you switch to
> > > > > pci_try_reset_function().
> > > > > 
> > > > 
> > > > mhi_pci_runtime_resume() will queue the recovery_work() and return. So I was
> > > > hoping that by the time pci_try_reset_function() is called, the lock would be
> > > > available.
> > > 
> > > We can't rely on luck with timings, and this is the very reason for the
> > > deadlock I'm currently seeing (i.e. the recovery thread is still running
> > > when another thread grabs the lock and waits for the recovery thread to
> > > finish).
> > > 
> > > Perhaps the recovery work should be done synchronously in the resume
> > > handler to avoid such issues.
> > 
> > Synchronously? How can that help when the recovery_work() cannot acquire the
> > lock?
> 
> During system suspend, pm core waits for any on-going runtime resume
> operations to complete before taking the device lock and suspending the
> device.
> 

Right, but mhi_pci_runtime_resume() is also called from mhi_pci_resume(). So we
cannot safely carry out the recovery_work() synchronously without the
pci_try_reset_function() change.

> Unfortunately, that's currently not the case during shutdown() where
> those operations are reversed, so that would indeed need to be addressed
> first.
> 
> But what the driver is currently doing looks highly questionable as it
> returns success when it failed to resume the device (after scheduling
> the asynchronous recovery work).
> 

I completely agree and this goes against what PM core expects. IMO we need
two fixes: one that uses pci_try_reset_function() and another that recovers
the device synchronously from mhi_pci_runtime_resume() and passes the return
value to PM core.

Will post the patches.

- Mani

-- 
மணிவண்ணன் சதாசிவம்

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2025-01-08 12:49 UTC | newest]

Thread overview: 20+ messages
2024-12-11 14:17 mhi resume failure on reboot with 6.13-rc2 Johan Hovold
2024-12-11 14:53 ` Manivannan Sadhasivam
2024-12-11 15:03   ` Johan Hovold
2024-12-16  7:40     ` Manivannan Sadhasivam
2024-12-16  7:43       ` Manivannan Sadhasivam
2024-12-16 13:20       ` Johan Hovold
2024-12-16 14:13         ` Manivannan Sadhasivam
2024-12-16 16:25           ` Loic Poulain
2024-12-17  9:57             ` Johan Hovold
2024-12-18  8:48               ` Johan Hovold
2024-12-18  8:40           ` Johan Hovold
2024-12-18 11:38             ` Manivannan Sadhasivam
2024-12-18 12:02               ` Johan Hovold
2024-12-18 12:30                 ` Manivannan Sadhasivam
2024-12-18 13:55                   ` Johan Hovold
2024-12-18 14:09                     ` Manivannan Sadhasivam
2024-12-18 14:26                       ` Johan Hovold
2024-12-18 18:35                         ` Manivannan Sadhasivam
2024-12-19  8:36                           ` Johan Hovold
2025-01-08 12:49                             ` Manivannan Sadhasivam
