Re: [PATCH v10 1/2] scsi: ufs: Enable power management for wlun

linux-pm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: [PATCH v10 1/2] scsi: ufs: Enable power management for wlun
       [not found]     ` <6e98724d-2e75-d1fe-188f-a7010f86c509@codeaurora.org>
@ 2021-03-06 16:16       ` Alan Stern
  2021-03-08 16:21         ` Rafael J. Wysocki
  0 siblings, 1 reply; 20+ messages in thread
From: Alan Stern @ 2021-03-06 16:16 UTC (permalink / raw)
  To: Asutosh Das (asd), Rafael J. Wysocki
  Cc: Adrian Hunter, cang, martin.petersen, linux-scsi, Bart Van Assche,
	linux-arm-msm, Alim Akhtar, Avri Altman, James E.J. Bottomley,
	Pedro Sousa, Krzysztof Kozlowski, Stanley Chu, Andy Gross,
	Bjorn Andersson, Steven Rostedt, Ingo Molnar, Matthias Brugger,
	Kiwoong Kim, Bean Huo, Lee Jones, Wei Yongjun, Dinghao Liu,
	Gustavo A. R. Silva, Tomas Winkler, Jaegeuk Kim, Satya Tangirala,
	open list,
	moderated list:ARM/SAMSUNG S3C, S5P AND EXYNOS ARM ARCHITECTURES,
	open list:ARM/SAMSUNG S3C, S5P AND EXYNOS ARM ARCHITECTURES,
	moderated list:UNIVERSAL FLASH STORAGE HOST CONTROLLER DRIVER...,
	Linux-PM mailing list

On Fri, Mar 05, 2021 at 06:54:24PM -0800, Asutosh Das (asd) wrote:

> Now during my testing I see a weird issue sometimes (1 in 7).
> Scenario - bootups
> 
> Issue:
> The supplier 'ufs_device_wlun 0:0:0:49488' goes into runtime suspend even
> when one/more of its consumers are in RPM_ACTIVE state.
> 
> *Log:
> [   10.056379][  T206] sd 0:0:0:1: [sdb] Synchronizing SCSI cache
> [   10.062497][  T113] sd 0:0:0:5: [sdf] Synchronizing SCSI cache
> [   10.356600][   T32] sd 0:0:0:7: [sdh] Synchronizing SCSI cache
> [   10.362944][  T174] sd 0:0:0:3: [sdd] Synchronizing SCSI cache
> [   10.696627][   T83] sd 0:0:0:2: [sdc] Synchronizing SCSI cache
> [   10.704562][  T170] sd 0:0:0:6: [sdg] Synchronizing SCSI cache
> [   10.980602][    T5] sd 0:0:0:0: [sda] Synchronizing SCSI cache
> 
> /** Printing all the consumer nodes of supplier **/
> [   10.987327][    T5] ufs_device_wlun 0:0:0:49488: usage-count @ suspend: 0
> <-- this is the usage_count
> [   10.994440][    T5] ufs_rpmb_wlun 0:0:0:49476: PM state - 2
> [   11.000402][    T5] scsi 0:0:0:49456: PM state - 2
> [   11.005453][    T5] sd 0:0:0:0: PM state - 2
> [   11.009958][    T5] sd 0:0:0:1: PM state - 2
> [   11.014469][    T5] sd 0:0:0:2: PM state - 2
> [   11.019072][    T5] sd 0:0:0:3: PM state - 2
> [   11.023595][    T5] sd 0:0:0:4: PM state - 0 << RPM_ACTIVE
> [   11.353298][    T5] sd 0:0:0:5: PM state - 2
> [   11.357726][    T5] sd 0:0:0:6: PM state - 2
> [   11.362155][    T5] sd 0:0:0:7: PM state - 2
> [   11.366584][    T5] ufshcd-qcom 1d84000.ufshc: __ufshcd_wl_suspend - 8709
> [   11.374366][    T5] ufs_device_wlun 0:0:0:49488: __ufshcd_wl_suspend -
> (0) has rpm_active flags
> [   11.383376][    T5] ufs_device_wlun 0:0:0:49488:
> ufshcd_wl_runtime_suspend <-- Supplier suspends fine.
> [   12.977318][  T174] sd 0:0:0:4: [sde] Synchronizing SCSI cache
> 
> And the the suspend of sde is stuck now:
> schedule+0x9c/0xe0
> schedule_timeout+0x40/0x128
> io_schedule_timeout+0x44/0x68
> wait_for_common_io+0x7c/0x100
> wait_for_completion_io+0x14/0x20
> blk_execute_rq+0x90/0xcc
> __scsi_execute+0x104/0x1c4
> sd_sync_cache+0xf8/0x2a0
> sd_suspend_common+0x74/0x11c
> sd_suspend_runtime+0x14/0x20
> scsi_runtime_suspend+0x64/0x94
> __rpm_callback+0x80/0x2a4
> rpm_suspend+0x308/0x614
> pm_runtime_work+0x98/0xa8
> 
> I added 'DL_FLAG_RPM_ACTIVE' while creating links.
>       if (hba->sdev_ufs_device) {
>               link = device_link_add(&sdev->sdev_gendev,
>                                   &hba->sdev_ufs_device->sdev_gendev,
>                                  DL_FLAG_PM_RUNTIME|DL_FLAG_RPM_ACTIVE);
> I didn't expect this to resolve the issue anyway and it didn't.
> 
> Another interesting point here is when I resume any of the above suspended
> consumers, it all goes back to normal, which is kind of expected. I tried
> resuming the consumer and the supplier is resumed and the supplier is
> suspended when all the consumers are suspended.
> 
> Any pointers on this issue please?
> 
> @Bart/@Alan - Do you've any pointers please?

It's very noticeable that although you seem to have isolated a bug in
the power management subsystem (supplier goes into runtime suspend
even when one of its consumers is still active), you did not CC the
power management maintainer or mailing list.

I have added the appropriate CC's.

Alan Stern

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v10 1/2] scsi: ufs: Enable power management for wlun
  2021-03-06 16:16       ` [PATCH v10 1/2] scsi: ufs: Enable power management for wlun Alan Stern
@ 2021-03-08 16:21         ` Rafael J. Wysocki
  2021-03-08 17:17           ` Rafael J. Wysocki
  0 siblings, 1 reply; 20+ messages in thread
From: Rafael J. Wysocki @ 2021-03-08 16:21 UTC (permalink / raw)
  To: Alan Stern, Asutosh Das (asd)
  Cc: Rafael J. Wysocki, Adrian Hunter, cang, Martin K. Petersen,
	open list:TARGET SUBSYSTEM, Bart Van Assche, linux-arm-msm,
	Alim Akhtar, Avri Altman, James E.J. Bottomley, Pedro Sousa,
	Krzysztof Kozlowski, Stanley Chu, Andy Gross, Bjorn Andersson,
	Steven Rostedt, Ingo Molnar, Matthias Brugger, Kiwoong Kim,
	Bean Huo, Lee Jones, Wei Yongjun, Dinghao Liu,
	Gustavo A. R. Silva, Tomas Winkler, Jaegeuk Kim, Satya Tangirala,
	open list,
	moderated list:ARM/SAMSUNG S3C, S5P AND EXYNOS ARM ARCHITECTURES,
	open list:ARM/SAMSUNG S3C, S5P AND EXYNOS ARM ARCHITECTURES,
	moderated list:UNIVERSAL FLASH STORAGE HOST CONTROLLER DRIVER...,
	Linux-PM mailing list

On Sat, Mar 6, 2021 at 5:17 PM Alan Stern <stern@rowland.harvard.edu> wrote:
>
> On Fri, Mar 05, 2021 at 06:54:24PM -0800, Asutosh Das (asd) wrote:
>
> > Now during my testing I see a weird issue sometimes (1 in 7).
> > Scenario - bootups
> >
> > Issue:
> > The supplier 'ufs_device_wlun 0:0:0:49488' goes into runtime suspend even
> > when one/more of its consumers are in RPM_ACTIVE state.
> >
> > *Log:
> > [   10.056379][  T206] sd 0:0:0:1: [sdb] Synchronizing SCSI cache
> > [   10.062497][  T113] sd 0:0:0:5: [sdf] Synchronizing SCSI cache
> > [   10.356600][   T32] sd 0:0:0:7: [sdh] Synchronizing SCSI cache
> > [   10.362944][  T174] sd 0:0:0:3: [sdd] Synchronizing SCSI cache
> > [   10.696627][   T83] sd 0:0:0:2: [sdc] Synchronizing SCSI cache
> > [   10.704562][  T170] sd 0:0:0:6: [sdg] Synchronizing SCSI cache
> > [   10.980602][    T5] sd 0:0:0:0: [sda] Synchronizing SCSI cache
> >
> > /** Printing all the consumer nodes of supplier **/
> > [   10.987327][    T5] ufs_device_wlun 0:0:0:49488: usage-count @ suspend: 0
> > <-- this is the usage_count
> > [   10.994440][    T5] ufs_rpmb_wlun 0:0:0:49476: PM state - 2
> > [   11.000402][    T5] scsi 0:0:0:49456: PM state - 2
> > [   11.005453][    T5] sd 0:0:0:0: PM state - 2
> > [   11.009958][    T5] sd 0:0:0:1: PM state - 2
> > [   11.014469][    T5] sd 0:0:0:2: PM state - 2
> > [   11.019072][    T5] sd 0:0:0:3: PM state - 2
> > [   11.023595][    T5] sd 0:0:0:4: PM state - 0 << RPM_ACTIVE
> > [   11.353298][    T5] sd 0:0:0:5: PM state - 2
> > [   11.357726][    T5] sd 0:0:0:6: PM state - 2
> > [   11.362155][    T5] sd 0:0:0:7: PM state - 2
> > [   11.366584][    T5] ufshcd-qcom 1d84000.ufshc: __ufshcd_wl_suspend - 8709
> > [   11.374366][    T5] ufs_device_wlun 0:0:0:49488: __ufshcd_wl_suspend -
> > (0) has rpm_active flags

Do you mean that rpm_active of the link between the consumer and the
supplier is greater than 0 at this point and the consumer is
RPM_ACTIVE, but the supplier suspends successfully nevertheless?

> > [   11.383376][    T5] ufs_device_wlun 0:0:0:49488:
> > ufshcd_wl_runtime_suspend <-- Supplier suspends fine.
> > [   12.977318][  T174] sd 0:0:0:4: [sde] Synchronizing SCSI cache
> >
> > And the the suspend of sde is stuck now:
> > schedule+0x9c/0xe0
> > schedule_timeout+0x40/0x128
> > io_schedule_timeout+0x44/0x68
> > wait_for_common_io+0x7c/0x100
> > wait_for_completion_io+0x14/0x20
> > blk_execute_rq+0x90/0xcc
> > __scsi_execute+0x104/0x1c4
> > sd_sync_cache+0xf8/0x2a0
> > sd_suspend_common+0x74/0x11c
> > sd_suspend_runtime+0x14/0x20
> > scsi_runtime_suspend+0x64/0x94
> > __rpm_callback+0x80/0x2a4
> > rpm_suspend+0x308/0x614
> > pm_runtime_work+0x98/0xa8
> >
> > I added 'DL_FLAG_RPM_ACTIVE' while creating links.
> >       if (hba->sdev_ufs_device) {
> >               link = device_link_add(&sdev->sdev_gendev,
> >                                   &hba->sdev_ufs_device->sdev_gendev,
> >                                  DL_FLAG_PM_RUNTIME|DL_FLAG_RPM_ACTIVE);
> > I didn't expect this to resolve the issue anyway and it didn't.
> >
> > Another interesting point here is when I resume any of the above suspended
> > consumers, it all goes back to normal, which is kind of expected. I tried
> > resuming the consumer and the supplier is resumed and the supplier is
> > suspended when all the consumers are suspended.
> >
> > Any pointers on this issue please?
> >
> > @Bart/@Alan - Do you've any pointers please?
>
> It's very noticeable that although you seem to have isolated a bug in
> the power management subsystem (supplier goes into runtime suspend
> even when one of its consumers is still active), you did not CC the
> power management maintainer or mailing list.
>
> I have added the appropriate CC's.

Thanks Alan!

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v10 1/2] scsi: ufs: Enable power management for wlun
  2021-03-08 16:21         ` Rafael J. Wysocki
@ 2021-03-08 17:17           ` Rafael J. Wysocki
  2021-03-09 15:56             ` Asutosh Das (asd)
  0 siblings, 1 reply; 20+ messages in thread
From: Rafael J. Wysocki @ 2021-03-08 17:17 UTC (permalink / raw)
  To: Asutosh Das (asd)
  Cc: Alan Stern, Rafael J. Wysocki, Adrian Hunter, cang,
	Martin K. Petersen, open list:TARGET SUBSYSTEM, Bart Van Assche,
	linux-arm-msm, Alim Akhtar, Avri Altman, James E.J. Bottomley,
	Krzysztof Kozlowski, Stanley Chu, Andy Gross, Bjorn Andersson,
	Steven Rostedt, Ingo Molnar, Matthias Brugger, Kiwoong Kim,
	Bean Huo, Lee Jones, Wei Yongjun, Dinghao Liu,
	Gustavo A. R. Silva, Tomas Winkler, Jaegeuk Kim, Satya Tangirala,
	open list,
	moderated list:ARM/SAMSUNG S3C, S5P AND EXYNOS ARM ARCHITECTURES,
	open list:ARM/SAMSUNG S3C, S5P AND EXYNOS ARM ARCHITECTURES,
	moderated list:UNIVERSAL FLASH STORAGE HOST CONTROLLER DRIVER...,
	Linux-PM mailing list

On Mon, Mar 8, 2021 at 5:21 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Sat, Mar 6, 2021 at 5:17 PM Alan Stern <stern@rowland.harvard.edu> wrote:
> >
> > On Fri, Mar 05, 2021 at 06:54:24PM -0800, Asutosh Das (asd) wrote:
> >
> > > Now during my testing I see a weird issue sometimes (1 in 7).
> > > Scenario - bootups
> > >
> > > Issue:
> > > The supplier 'ufs_device_wlun 0:0:0:49488' goes into runtime suspend even
> > > when one/more of its consumers are in RPM_ACTIVE state.
> > >
> > > *Log:
> > > [   10.056379][  T206] sd 0:0:0:1: [sdb] Synchronizing SCSI cache
> > > [   10.062497][  T113] sd 0:0:0:5: [sdf] Synchronizing SCSI cache
> > > [   10.356600][   T32] sd 0:0:0:7: [sdh] Synchronizing SCSI cache
> > > [   10.362944][  T174] sd 0:0:0:3: [sdd] Synchronizing SCSI cache
> > > [   10.696627][   T83] sd 0:0:0:2: [sdc] Synchronizing SCSI cache
> > > [   10.704562][  T170] sd 0:0:0:6: [sdg] Synchronizing SCSI cache
> > > [   10.980602][    T5] sd 0:0:0:0: [sda] Synchronizing SCSI cache
> > >
> > > /** Printing all the consumer nodes of supplier **/
> > > [   10.987327][    T5] ufs_device_wlun 0:0:0:49488: usage-count @ suspend: 0
> > > <-- this is the usage_count
> > > [   10.994440][    T5] ufs_rpmb_wlun 0:0:0:49476: PM state - 2
> > > [   11.000402][    T5] scsi 0:0:0:49456: PM state - 2
> > > [   11.005453][    T5] sd 0:0:0:0: PM state - 2
> > > [   11.009958][    T5] sd 0:0:0:1: PM state - 2
> > > [   11.014469][    T5] sd 0:0:0:2: PM state - 2
> > > [   11.019072][    T5] sd 0:0:0:3: PM state - 2
> > > [   11.023595][    T5] sd 0:0:0:4: PM state - 0 << RPM_ACTIVE
> > > [   11.353298][    T5] sd 0:0:0:5: PM state - 2
> > > [   11.357726][    T5] sd 0:0:0:6: PM state - 2
> > > [   11.362155][    T5] sd 0:0:0:7: PM state - 2
> > > [   11.366584][    T5] ufshcd-qcom 1d84000.ufshc: __ufshcd_wl_suspend - 8709
> > > [   11.374366][    T5] ufs_device_wlun 0:0:0:49488: __ufshcd_wl_suspend -
> > > (0) has rpm_active flags
>
> Do you mean that rpm_active of the link between the consumer and the
> supplier is greater than 0 at this point and the consumer is

I mean is rpm_active of the link greater than 1 (because 1 means "no
active references to the supplier")?

> RPM_ACTIVE, but the supplier suspends successfully nevertheless?
>
> > > [   11.383376][    T5] ufs_device_wlun 0:0:0:49488:
> > > ufshcd_wl_runtime_suspend <-- Supplier suspends fine.
> > > [   12.977318][  T174] sd 0:0:0:4: [sde] Synchronizing SCSI cache
> > >
> > > And the the suspend of sde is stuck now:
> > > schedule+0x9c/0xe0
> > > schedule_timeout+0x40/0x128
> > > io_schedule_timeout+0x44/0x68
> > > wait_for_common_io+0x7c/0x100
> > > wait_for_completion_io+0x14/0x20
> > > blk_execute_rq+0x90/0xcc
> > > __scsi_execute+0x104/0x1c4
> > > sd_sync_cache+0xf8/0x2a0
> > > sd_suspend_common+0x74/0x11c
> > > sd_suspend_runtime+0x14/0x20
> > > scsi_runtime_suspend+0x64/0x94
> > > __rpm_callback+0x80/0x2a4
> > > rpm_suspend+0x308/0x614
> > > pm_runtime_work+0x98/0xa8
> > >
> > > I added 'DL_FLAG_RPM_ACTIVE' while creating links.
> > >       if (hba->sdev_ufs_device) {
> > >               link = device_link_add(&sdev->sdev_gendev,
> > >                                   &hba->sdev_ufs_device->sdev_gendev,
> > >                                  DL_FLAG_PM_RUNTIME|DL_FLAG_RPM_ACTIVE);
> > > I didn't expect this to resolve the issue anyway and it didn't.
> > >
> > > Another interesting point here is when I resume any of the above suspended
> > > consumers, it all goes back to normal, which is kind of expected. I tried
> > > resuming the consumer and the supplier is resumed and the supplier is
> > > suspended when all the consumers are suspended.
> > >
> > > Any pointers on this issue please?
> > >
> > > @Bart/@Alan - Do you've any pointers please?
> >
> > It's very noticeable that although you seem to have isolated a bug in
> > the power management subsystem (supplier goes into runtime suspend
> > even when one of its consumers is still active), you did not CC the
> > power management maintainer or mailing list.
> >
> > I have added the appropriate CC's.
>
> Thanks Alan!

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v10 1/2] scsi: ufs: Enable power management for wlun
  2021-03-08 17:17           ` Rafael J. Wysocki
@ 2021-03-09 15:56             ` Asutosh Das (asd)
  2021-03-10  3:04               ` Asutosh Das (asd)
  0 siblings, 1 reply; 20+ messages in thread
From: Asutosh Das (asd) @ 2021-03-09 15:56 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Alan Stern, Rafael J. Wysocki, Adrian Hunter, cang,
	Martin K. Petersen, open list:TARGET SUBSYSTEM, Bart Van Assche,
	linux-arm-msm, Alim Akhtar, Avri Altman, James E.J. Bottomley,
	Krzysztof Kozlowski, Stanley Chu, Andy Gross, Bjorn Andersson,
	Steven Rostedt, Ingo Molnar, Matthias Brugger, Kiwoong Kim,
	Bean Huo, Lee Jones, Wei Yongjun, Dinghao Liu,
	Gustavo A. R. Silva, Tomas Winkler, Jaegeuk Kim, Satya Tangirala,
	open list,
	moderated list:ARM/SAMSUNG S3C, S5P AND EXYNOS ARM ARCHITECTURES,
	open list:ARM/SAMSUNG S3C, S5P AND EXYNOS ARM ARCHITECTURES,
	moderated list:UNIVERSAL FLASH STORAGE HOST CONTROLLER DRIVER...,
	Linux-PM mailing list

On 3/8/2021 9:17 AM, Rafael J. Wysocki wrote:
> On Mon, Mar 8, 2021 at 5:21 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>>
>> On Sat, Mar 6, 2021 at 5:17 PM Alan Stern <stern@rowland.harvard.edu> wrote:
>>>
>>> On Fri, Mar 05, 2021 at 06:54:24PM -0800, Asutosh Das (asd) wrote:
>>>
>>>> Now during my testing I see a weird issue sometimes (1 in 7).
>>>> Scenario - bootups
>>>>
>>>> Issue:
>>>> The supplier 'ufs_device_wlun 0:0:0:49488' goes into runtime suspend even
>>>> when one/more of its consumers are in RPM_ACTIVE state.
>>>>
>>>> *Log:
>>>> [   10.056379][  T206] sd 0:0:0:1: [sdb] Synchronizing SCSI cache
>>>> [   10.062497][  T113] sd 0:0:0:5: [sdf] Synchronizing SCSI cache
>>>> [   10.356600][   T32] sd 0:0:0:7: [sdh] Synchronizing SCSI cache
>>>> [   10.362944][  T174] sd 0:0:0:3: [sdd] Synchronizing SCSI cache
>>>> [   10.696627][   T83] sd 0:0:0:2: [sdc] Synchronizing SCSI cache
>>>> [   10.704562][  T170] sd 0:0:0:6: [sdg] Synchronizing SCSI cache
>>>> [   10.980602][    T5] sd 0:0:0:0: [sda] Synchronizing SCSI cache
>>>>
>>>> /** Printing all the consumer nodes of supplier **/
>>>> [   10.987327][    T5] ufs_device_wlun 0:0:0:49488: usage-count @ suspend: 0
>>>> <-- this is the usage_count
>>>> [   10.994440][    T5] ufs_rpmb_wlun 0:0:0:49476: PM state - 2
>>>> [   11.000402][    T5] scsi 0:0:0:49456: PM state - 2
>>>> [   11.005453][    T5] sd 0:0:0:0: PM state - 2
>>>> [   11.009958][    T5] sd 0:0:0:1: PM state - 2
>>>> [   11.014469][    T5] sd 0:0:0:2: PM state - 2
>>>> [   11.019072][    T5] sd 0:0:0:3: PM state - 2
>>>> [   11.023595][    T5] sd 0:0:0:4: PM state - 0 << RPM_ACTIVE
>>>> [   11.353298][    T5] sd 0:0:0:5: PM state - 2
>>>> [   11.357726][    T5] sd 0:0:0:6: PM state - 2
>>>> [   11.362155][    T5] sd 0:0:0:7: PM state - 2
>>>> [   11.366584][    T5] ufshcd-qcom 1d84000.ufshc: __ufshcd_wl_suspend - 8709
>>>> [   11.374366][    T5] ufs_device_wlun 0:0:0:49488: __ufshcd_wl_suspend -
>>>> (0) has rpm_active flags
>>
>> Do you mean that rpm_active of the link between the consumer and the
>> supplier is greater than 0 at this point and the consumer is
> 
> I mean is rpm_active of the link greater than 1 (because 1 means "no
> active references to the supplier")?
Hi Rafael:
No - it is not greater than 1.

I'm trying to understand what's going on in it; will update when I've 
something.

> 
>> RPM_ACTIVE, but the supplier suspends successfully nevertheless?
>>
>>>> [   11.383376][    T5] ufs_device_wlun 0:0:0:49488:
>>>> ufshcd_wl_runtime_suspend <-- Supplier suspends fine.
>>>> [   12.977318][  T174] sd 0:0:0:4: [sde] Synchronizing SCSI cache
>>>>
>>>> And the the suspend of sde is stuck now:
>>>> schedule+0x9c/0xe0
>>>> schedule_timeout+0x40/0x128
>>>> io_schedule_timeout+0x44/0x68
>>>> wait_for_common_io+0x7c/0x100
>>>> wait_for_completion_io+0x14/0x20
>>>> blk_execute_rq+0x90/0xcc
>>>> __scsi_execute+0x104/0x1c4
>>>> sd_sync_cache+0xf8/0x2a0
>>>> sd_suspend_common+0x74/0x11c
>>>> sd_suspend_runtime+0x14/0x20
>>>> scsi_runtime_suspend+0x64/0x94
>>>> __rpm_callback+0x80/0x2a4
>>>> rpm_suspend+0x308/0x614
>>>> pm_runtime_work+0x98/0xa8
>>>>
>>>> I added 'DL_FLAG_RPM_ACTIVE' while creating links.
>>>>        if (hba->sdev_ufs_device) {
>>>>                link = device_link_add(&sdev->sdev_gendev,
>>>>                                    &hba->sdev_ufs_device->sdev_gendev,
>>>>                                   DL_FLAG_PM_RUNTIME|DL_FLAG_RPM_ACTIVE);
>>>> I didn't expect this to resolve the issue anyway and it didn't.
>>>>
>>>> Another interesting point here is when I resume any of the above suspended
>>>> consumers, it all goes back to normal, which is kind of expected. I tried
>>>> resuming the consumer and the supplier is resumed and the supplier is
>>>> suspended when all the consumers are suspended.
>>>>
>>>> Any pointers on this issue please?
>>>>
>>>> @Bart/@Alan - Do you've any pointers please?
>>>
>>> It's very noticeable that although you seem to have isolated a bug in
>>> the power management subsystem (supplier goes into runtime suspend
>>> even when one of its consumers is still active), you did not CC the
>>> power management maintainer or mailing list.
>>>
>>> I have added the appropriate CC's.
>>
>> Thanks Alan!


-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
Linux Foundation Collaborative Project

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v10 1/2] scsi: ufs: Enable power management for wlun
  2021-03-09 15:56             ` Asutosh Das (asd)
@ 2021-03-10  3:04               ` Asutosh Das (asd)
  2021-03-10  3:14                 ` Alan Stern
  2021-03-14  9:11                 ` Adrian Hunter
  0 siblings, 2 replies; 20+ messages in thread
From: Asutosh Das (asd) @ 2021-03-10  3:04 UTC (permalink / raw)
  To: Alan Stern, Bart Van Assche, Adrian Hunter
  Cc: Rafael J. Wysocki, cang, Martin K. Petersen,
	open list:TARGET SUBSYSTEM, linux-arm-msm, Alim Akhtar,
	Avri Altman, James E.J. Bottomley, Krzysztof Kozlowski,
	Stanley Chu, Andy Gross, Bjorn Andersson, Steven Rostedt,
	Ingo Molnar, Matthias Brugger, Kiwoong Kim, Bean Huo, Lee Jones,
	Wei Yongjun, Dinghao Liu, Gustavo A. R. Silva, Tomas Winkler,
	Jaegeuk Kim, Satya Tangirala, open list,
	moderated list:ARM/SAMSUNG S3C, S5P AND EXYNOS ARM ARCHITECTURES,
	open list:ARM/SAMSUNG S3C, S5P AND EXYNOS ARM ARCHITECTURES,
	moderated list:UNIVERSAL FLASH STORAGE HOST CONTROLLER DRIVER...,
	Linux-PM mailing list

On 3/9/2021 7:56 AM, Asutosh Das (asd) wrote:
> On 3/8/2021 9:17 AM, Rafael J. Wysocki wrote:
>> On Mon, Mar 8, 2021 at 5:21 PM Rafael J. Wysocki <rafael@kernel.org> 
>> wrote:
>>>
>>> On Sat, Mar 6, 2021 at 5:17 PM Alan Stern <stern@rowland.harvard.edu> 
>>> wrote:
>>>>
>>>> On Fri, Mar 05, 2021 at 06:54:24PM -0800, Asutosh Das (asd) wrote:
>>>>
>>>>> Now during my testing I see a weird issue sometimes (1 in 7).
>>>>> Scenario - bootups
>>>>>
>>>>> Issue:
>>>>> The supplier 'ufs_device_wlun 0:0:0:49488' goes into runtime 
>>>>> suspend even
>>>>> when one/more of its consumers are in RPM_ACTIVE state.
>>>>>
>>>>> *Log:
>>>>> [   10.056379][  T206] sd 0:0:0:1: [sdb] Synchronizing SCSI cache
>>>>> [   10.062497][  T113] sd 0:0:0:5: [sdf] Synchronizing SCSI cache
>>>>> [   10.356600][   T32] sd 0:0:0:7: [sdh] Synchronizing SCSI cache
>>>>> [   10.362944][  T174] sd 0:0:0:3: [sdd] Synchronizing SCSI cache
>>>>> [   10.696627][   T83] sd 0:0:0:2: [sdc] Synchronizing SCSI cache
>>>>> [   10.704562][  T170] sd 0:0:0:6: [sdg] Synchronizing SCSI cache
>>>>> [   10.980602][    T5] sd 0:0:0:0: [sda] Synchronizing SCSI cache
>>>>>
>>>>> /** Printing all the consumer nodes of supplier **/
>>>>> [   10.987327][    T5] ufs_device_wlun 0:0:0:49488: usage-count @ 
>>>>> suspend: 0
>>>>> <-- this is the usage_count
>>>>> [   10.994440][    T5] ufs_rpmb_wlun 0:0:0:49476: PM state - 2
>>>>> [   11.000402][    T5] scsi 0:0:0:49456: PM state - 2
>>>>> [   11.005453][    T5] sd 0:0:0:0: PM state - 2
>>>>> [   11.009958][    T5] sd 0:0:0:1: PM state - 2
>>>>> [   11.014469][    T5] sd 0:0:0:2: PM state - 2
>>>>> [   11.019072][    T5] sd 0:0:0:3: PM state - 2
>>>>> [   11.023595][    T5] sd 0:0:0:4: PM state - 0 << RPM_ACTIVE
>>>>> [   11.353298][    T5] sd 0:0:0:5: PM state - 2
>>>>> [   11.357726][    T5] sd 0:0:0:6: PM state - 2
>>>>> [   11.362155][    T5] sd 0:0:0:7: PM state - 2
>>>>> [   11.366584][    T5] ufshcd-qcom 1d84000.ufshc: 
>>>>> __ufshcd_wl_suspend - 8709
>>>>> [   11.374366][    T5] ufs_device_wlun 0:0:0:49488: 
>>>>> __ufshcd_wl_suspend -
>>>>> (0) has rpm_active flags
>>>
>>> Do you mean that rpm_active of the link between the consumer and the
>>> supplier is greater than 0 at this point and the consumer is
>>
>> I mean is rpm_active of the link greater than 1 (because 1 means "no
>> active references to the supplier")?
> Hi Rafael:
> No - it is not greater than 1.
> 
> I'm trying to understand what's going on in it; will update when I've 
> something.
> 
>>
>>> RPM_ACTIVE, but the supplier suspends successfully nevertheless?
>>>
>>>>> [   11.383376][    T5] ufs_device_wlun 0:0:0:49488:
>>>>> ufshcd_wl_runtime_suspend <-- Supplier suspends fine.
>>>>> [   12.977318][  T174] sd 0:0:0:4: [sde] Synchronizing SCSI cache
>>>>>
>>>>> And the the suspend of sde is stuck now:
>>>>> schedule+0x9c/0xe0
>>>>> schedule_timeout+0x40/0x128
>>>>> io_schedule_timeout+0x44/0x68
>>>>> wait_for_common_io+0x7c/0x100
>>>>> wait_for_completion_io+0x14/0x20
>>>>> blk_execute_rq+0x90/0xcc
>>>>> __scsi_execute+0x104/0x1c4
>>>>> sd_sync_cache+0xf8/0x2a0
>>>>> sd_suspend_common+0x74/0x11c
>>>>> sd_suspend_runtime+0x14/0x20
>>>>> scsi_runtime_suspend+0x64/0x94
>>>>> __rpm_callback+0x80/0x2a4
>>>>> rpm_suspend+0x308/0x614
>>>>> pm_runtime_work+0x98/0xa8
>>>>>
>>>>> I added 'DL_FLAG_RPM_ACTIVE' while creating links.
>>>>>        if (hba->sdev_ufs_device) {
>>>>>                link = device_link_add(&sdev->sdev_gendev,
>>>>>                                    &hba->sdev_ufs_device->sdev_gendev,
>>>>>                                   
>>>>> DL_FLAG_PM_RUNTIME|DL_FLAG_RPM_ACTIVE);
>>>>> I didn't expect this to resolve the issue anyway and it didn't.
>>>>>
>>>>> Another interesting point here is when I resume any of the above 
>>>>> suspended
>>>>> consumers, it all goes back to normal, which is kind of expected. I 
>>>>> tried
>>>>> resuming the consumer and the supplier is resumed and the supplier is
>>>>> suspended when all the consumers are suspended.
>>>>>
>>>>> Any pointers on this issue please?
>>>>>
>>>>> @Bart/@Alan - Do you've any pointers please?
>>>>
>>>> It's very noticeable that although you seem to have isolated a bug in
>>>> the power management subsystem (supplier goes into runtime suspend
>>>> even when one of its consumers is still active), you did not CC the
>>>> power management maintainer or mailing list.
>>>>
>>>> I have added the appropriate CC's.
>>>
>>> Thanks Alan!
> 
> 

Hello
I & Can (thanks CanG) debugged this further:

Looks like this issue can occur if the sd probe is asynchronous.

Essentially, the sd_probe() is done asynchronously and 
driver_probe_device() invokes pm_runtime_get_suppliers() before invoking 
sd_probe().

But scsi_probe_and_add_lun() runs in a separate context.
So the scsi_autopm_put_device() invoked from scsi_scan_host() context 
reduces the link->rpm_active to 1. And sd_probe() invokes 
scsi_autopm_put_device() and starts a timer. And then 
driver_probe_device() invoked from __device_attach_async_helper context 
reduces the link->rpm_active to 1 thus enabling the supplier to suspend 
before the consumer suspends.

So if:
Context T1:
[1] scsi_probe_and_add_lun()
[2]	|- scsi_autopm_put_device() - reduce the link->rpm_active to 1

Context T2:
__device_attach_async_helper()
	|- driver_probe_device()
		|- sd_probe()
In between [1] and [2] say, driver_probe_device() -> sd_probe() is 
invoked in a separate context from __device_attach_async_helper().
The driver_probe_device() -> pm_runtime_get_suppliers() but [2] would 
reduce link->rpm_active to 1.
Then sd_probe() would invoke rpm_resume() and proceed as is.
When sd_probe() invokes scsi_autopm_put_device() it'd start a timer, 
dev->power.timer_autosuspends = 1.

Now then, pm_runtime_put_suppliers() is invoked from 
driver_probe_device() and that makes the link->rpm_active = 1.
But by now, the corresponding 'sd dev' (consumer) usage_count = 0, state 
= RPM_ACTIVE and link->rpm_active = 1.
At this point of time, all other 'sd dev' (consumers) _may_ be suspended 
or active but would have the link->rpm_active = 1.

Since the supplier has 0 auto-suspend delay, it now suspends!


Context [T1]
Call trace:
dump_backtrace+0x0/0x1d4
show_stack+0x18/0x24
dump_stack+0xc4/0x144
__pm_runtime_idle+0xb4/0x184
scsi_autopm_put_device+0x18/0x24
scsi_sysfs_add_sdev+0x26c/0x278
scsi_probe_and_add_lun+0xbac/0xd48
__scsi_scan_target+0x38c/0x510
scsi_scan_host_selected+0x14c/0x1e4
scsi_scan_host+0x1e0/0x228
ufshcd_async_scan+0x39c/0x408
async_run_entry_fn+0x48/0x128
process_one_work+0x1f0/0x470
worker_thread+0x26c/0x4c8
kthread+0x13c/0x320
ret_from_fork+0x10/0x18


Context [T2]
Call trace:
dump_backtrace+0x0/0x1d4
show_stack+0x18/0x24
dump_stack+0xc4/0x144
rpm_get_suppliers+0x48/0x1ac
__rpm_callback+0x58/0x12c
rpm_resume+0x3a4/0x618
__pm_runtime_resume+0x50/0x80
scsi_autopm_get_device+0x20/0x54
sd_probe+0x40/0x3d0
really_probe+0x1bc/0x4a0
driver_probe_device+0x84/0xf0
__device_attach_driver+0x114/0x138
bus_for_each_drv+0x84/0xd0
__device_attach_async_helper+0x7c/0xf0
async_run_entry_fn+0x48/0x128
process_one_work+0x1f0/0x470
worker_thread+0x26c/0x4c8
kthread+0x13c/0x320
ret_from_fork+0x10/0x18

Below prints show how link->rpm_active becomes 1 for sd 0:0:0:4
[    7.574654][  T212] Call trace:
[    7.574657][  T212]  dump_backtrace+0x0/0x1d4
[    7.574661][  T212]  show_stack+0x18/0x24
[    7.574665][  T212]  dump_stack+0xc4/0x144
[    7.574668][  T212]  __pm_runtime_idle+0xb4/0x184
[    7.574671][  T212]  scsi_autopm_put_device+0x18/0x24
[    7.574675][  T212]  sd_probe+0x314/0x3d0
[    7.574677][  T212]  really_probe+0x1bc/0x4a0
[    7.574680][  T212]  driver_probe_device+0x84/0xf0
[    7.574683][  T212]  __device_attach_driver+0x114/0x138
[    7.574686][  T212]  bus_for_each_drv+0x84/0xd0
[    7.574689][  T212]  __device_attach_async_helper+0x7c/0xf0
[    7.574692][  T212]  async_run_entry_fn+0x48/0x128
[    7.574695][  T212]  process_one_work+0x1f0/0x470
[    7.574698][  T212]  worker_thread+0x26c/0x4c8
[    7.574700][  T212]  kthread+0x13c/0x320
[    7.574703][  T212]  ret_from_fork+0x10/0x18
[    7.574706][  T212] sd 0:0:0:4: scsi_runtime_idle
[    7.574712][  T212] sd 0:0:0:4: __pm_runtime_idle: aft: [UFSDBG]: 
pwr.timer_autosuspends: 1 pwr.request_pending: 0 retval: -16 
pwr.request: 0 usage_count: 0 rpm_status: 0 link-rpm_active:2
[    7.574715][  T212] sd 0:0:0:4: sd_probe: [UFSDBG]: Exit
[    7.574738][  T212] sd 0:0:0:4: __pm_runtime_idle: b4: [UFSDBG]: 
pwr.request: 0 usage_count: 0 rpm_status: 0 link-rpm_active:2

[    7.574752][  T212] Workqueue: events_unbound async_run_entry_fn
[    7.574754][  T212] Call trace:
[    7.574758][  T212]  dump_backtrace+0x0/0x1d4
[    7.574761][  T212]  show_stack+0x18/0x24
[    7.574765][  T212]  dump_stack+0xc4/0x144
[    7.574767][  T212]  __pm_runtime_idle+0xb4/0x184
[    7.574770][  T212]  driver_probe_device+0x94/0xf0
[    7.574773][  T212]  __device_attach_driver+0x114/0x138
[    7.574775][  T212]  bus_for_each_drv+0x84/0xd0
[    7.574778][  T212]  __device_attach_async_helper+0x7c/0xf0
[    7.574781][  T212]  async_run_entry_fn+0x48/0x128
[    7.574783][  T212]  process_one_work+0x1f0/0x470
[    7.574786][  T212]  worker_thread+0x26c/0x4c8
[    7.574788][  T212]  kthread+0x13c/0x320
[    7.574791][  T212]  ret_from_fork+0x10/0x18
[    7.574848][   T80] sd 0:0:0:4: scsi_runtime_idle
[    7.574858][  T212] sd 0:0:0:4: __pm_runtime_idle: aft: [UFSDBG]: 
pwr.timer_autosuspends: 1 pwr.request_pending: 0 retval: 0 pwr.request: 
0 usage_count: 0 rpm_status: 0 link-rpm_active:2
[    7.574863][  T212] sd 0:0:0:4: pm_runtime_put_suppliers: [UFSDBG]: 
rpm_status: 0 link-rpm_active:1
[    7.574866][  T212] sd 0:0:0:4: async probe completed
[    7.574870][  T212] sd 0:0:0:4: __pm_runtime_idle: b4: [UFSDBG]: 
pwr.request: 0 usage_count: 0 rpm_status: 0 link-rpm_active:1


So, from the above it looks like when async probe is enabled this is a 
possibility.

I don't see a way around this. Please let me know if you 
(@Alan/@Bart/@Adrian) have any thoughts on this.

Thanks,
-asd

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
Linux Foundation Collaborative Project

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v10 1/2] scsi: ufs: Enable power management for wlun
  2021-03-10  3:04               ` Asutosh Das (asd)
@ 2021-03-10  3:14                 ` Alan Stern
  2021-03-10  4:04                   ` Asutosh Das (asd)
  2021-03-14  9:11                 ` Adrian Hunter
  1 sibling, 1 reply; 20+ messages in thread
From: Alan Stern @ 2021-03-10  3:14 UTC (permalink / raw)
  To: Asutosh Das (asd)
  Cc: Bart Van Assche, Adrian Hunter, Rafael J. Wysocki, cang,
	Martin K. Petersen, open list:TARGET SUBSYSTEM, linux-arm-msm,
	Alim Akhtar, Avri Altman, James E.J. Bottomley,
	Krzysztof Kozlowski, Stanley Chu, Andy Gross, Bjorn Andersson,
	Steven Rostedt, Ingo Molnar, Matthias Brugger, Kiwoong Kim,
	Bean Huo, Lee Jones, Wei Yongjun, Dinghao Liu,
	Gustavo A. R. Silva, Tomas Winkler, Jaegeuk Kim, Satya Tangirala,
	open list,
	moderated list:ARM/SAMSUNG S3C, S5P AND EXYNOS ARM ARCHITECTURES,
	open list:ARM/SAMSUNG S3C, S5P AND EXYNOS ARM ARCHITECTURES,
	moderated list:UNIVERSAL FLASH STORAGE HOST CONTROLLER DRIVER...,
	Linux-PM mailing list

On Tue, Mar 09, 2021 at 07:04:34PM -0800, Asutosh Das (asd) wrote:
> Hello
> I & Can (thanks CanG) debugged this further:
> 
> Looks like this issue can occur if the sd probe is asynchronous.
> 
> Essentially, the sd_probe() is done asynchronously and driver_probe_device()
> invokes pm_runtime_get_suppliers() before invoking sd_probe().
> 
> But scsi_probe_and_add_lun() runs in a separate context.
> So the scsi_autopm_put_device() invoked from scsi_scan_host() context
> reduces the link->rpm_active to 1. And sd_probe() invokes
> scsi_autopm_put_device() and starts a timer. And then driver_probe_device()
> invoked from __device_attach_async_helper context reduces the
> link->rpm_active to 1 thus enabling the supplier to suspend before the
> consumer suspends.

> I don't see a way around this. Please let me know if you
> (@Alan/@Bart/@Adrian) have any thoughts on this.

How about changing the SCSI core so that it does a runtime_get before
starting an async probe, and the async probe routine does a
runtime_put when it is finished?  In other words, don't allow a device
to go into runtime suspend while it is waiting to be probed.

I don't think that would be too intrusive.

Alan Stern

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v10 1/2] scsi: ufs: Enable power management for wlun
  2021-03-10  3:14                 ` Alan Stern
@ 2021-03-10  4:04                   ` Asutosh Das (asd)
  2021-03-10 16:27                     ` Alan Stern
  0 siblings, 1 reply; 20+ messages in thread
From: Asutosh Das (asd) @ 2021-03-10  4:04 UTC (permalink / raw)
  To: Alan Stern
  Cc: Bart Van Assche, Adrian Hunter, Rafael J. Wysocki, cang,
	Martin K. Petersen, open list:TARGET SUBSYSTEM, linux-arm-msm,
	Alim Akhtar, Avri Altman, James E.J. Bottomley,
	Krzysztof Kozlowski, Stanley Chu, Andy Gross, Bjorn Andersson,
	Steven Rostedt, Ingo Molnar, Matthias Brugger, Kiwoong Kim,
	Bean Huo, Lee Jones, Wei Yongjun, Dinghao Liu,
	Gustavo A. R. Silva, Tomas Winkler, Jaegeuk Kim, Satya Tangirala,
	open list,
	moderated list:ARM/SAMSUNG S3C, S5P AND EXYNOS ARM ARCHITECTURES,
	open list:ARM/SAMSUNG S3C, S5P AND EXYNOS ARM ARCHITECTURES,
	moderated list:UNIVERSAL FLASH STORAGE HOST CONTROLLER DRIVER...,
	Linux-PM mailing list

On 3/9/2021 7:14 PM, Alan Stern wrote:
> On Tue, Mar 09, 2021 at 07:04:34PM -0800, Asutosh Das (asd) wrote:
>> Hello
>> I & Can (thanks CanG) debugged this further:
>>
>> Looks like this issue can occur if the sd probe is asynchronous.
>>
>> Essentially, the sd_probe() is done asynchronously and driver_probe_device()
>> invokes pm_runtime_get_suppliers() before invoking sd_probe().
>>
>> But scsi_probe_and_add_lun() runs in a separate context.
>> So the scsi_autopm_put_device() invoked from scsi_scan_host() context
>> reduces the link->rpm_active to 1. And sd_probe() invokes
>> scsi_autopm_put_device() and starts a timer. And then driver_probe_device()
>> invoked from __device_attach_async_helper context reduces the
>> link->rpm_active to 1 thus enabling the supplier to suspend before the
>> consumer suspends.
> 
>> I don't see a way around this. Please let me know if you
>> (@Alan/@Bart/@Adrian) have any thoughts on this.
> 
> How about changing the SCSI core so that it does a runtime_get before
> starting an async probe, and the async probe routine does a
> runtime_put when it is finished?  In other words, don't allow a device
> to go into runtime suspend while it is waiting to be probed.
> 
> I don't think that would be too intrusive.
> 
> Alan Stern
> 

Hi Alan
Thanks for the suggestion.

Am trying to understand:

Do you mean something like this:

int scsi_sysfs_add_sdev(struct scsi_device *sdev)
{
	
	scsi_autopm_get_device(sdev);
	pm_runtime_get_noresume(&sdev->sdev_gendev);
	[...]
	scsi_autopm_put_device(sdev);
	[...]
}

static int sd_probe(struct device *dev)
{
	[...]
	pm_runtime_put_noidle(dev);
	scsi_autopm_put_device(sdp);
	[...]
}

This may work (I'm limited by my imagination in scsi layer :) ).

But the pm_runtime_put_noidle() would have to be added to all registered 
scsi_driver{}, perhaps? Or may be I can check for sdp->type?

-asd

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
Linux Foundation Collaborative Project

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v10 1/2] scsi: ufs: Enable power management for wlun
  2021-03-10  4:04                   ` Asutosh Das (asd)
@ 2021-03-10 16:27                     ` Alan Stern
  2021-03-10 16:39                       ` Asutosh Das (asd)
  0 siblings, 1 reply; 20+ messages in thread
From: Alan Stern @ 2021-03-10 16:27 UTC (permalink / raw)
  To: Asutosh Das (asd)
  Cc: Bart Van Assche, Adrian Hunter, Rafael J. Wysocki, cang,
	Martin K. Petersen, open list:TARGET SUBSYSTEM, linux-arm-msm,
	Alim Akhtar, Avri Altman, James E.J. Bottomley,
	Krzysztof Kozlowski, Stanley Chu, Andy Gross, Bjorn Andersson,
	Steven Rostedt, Ingo Molnar, Matthias Brugger, Kiwoong Kim,
	Bean Huo, Lee Jones, Wei Yongjun, Dinghao Liu,
	Gustavo A. R. Silva, Tomas Winkler, Jaegeuk Kim, Satya Tangirala,
	open list,
	moderated list:ARM/SAMSUNG S3C, S5P AND EXYNOS ARM ARCHITECTURES,
	open list:ARM/SAMSUNG S3C, S5P AND EXYNOS ARM ARCHITECTURES,
	moderated list:UNIVERSAL FLASH STORAGE HOST CONTROLLER DRIVER...,
	Linux-PM mailing list

On Tue, Mar 09, 2021 at 08:04:53PM -0800, Asutosh Das (asd) wrote:
> On 3/9/2021 7:14 PM, Alan Stern wrote:
> > On Tue, Mar 09, 2021 at 07:04:34PM -0800, Asutosh Das (asd) wrote:
> > > Hello
> > > I & Can (thanks CanG) debugged this further:
> > > 
> > > Looks like this issue can occur if the sd probe is asynchronous.
> > > 
> > > Essentially, the sd_probe() is done asynchronously and driver_probe_device()
> > > invokes pm_runtime_get_suppliers() before invoking sd_probe().
> > > 
> > > But scsi_probe_and_add_lun() runs in a separate context.
> > > So the scsi_autopm_put_device() invoked from scsi_scan_host() context
> > > reduces the link->rpm_active to 1. And sd_probe() invokes
> > > scsi_autopm_put_device() and starts a timer. And then driver_probe_device()
> > > invoked from __device_attach_async_helper context reduces the
> > > link->rpm_active to 1 thus enabling the supplier to suspend before the
> > > consumer suspends.
> > 
> > > I don't see a way around this. Please let me know if you
> > > (@Alan/@Bart/@Adrian) have any thoughts on this.
> > 
> > How about changing the SCSI core so that it does a runtime_get before
> > starting an async probe, and the async probe routine does a
> > runtime_put when it is finished?  In other words, don't allow a device
> > to go into runtime suspend while it is waiting to be probed.
> > 
> > I don't think that would be too intrusive.
> > 
> > Alan Stern
> > 
> 
> Hi Alan
> Thanks for the suggestion.
> 
> Am trying to understand:
> 
> Do you mean something like this:
> 
> int scsi_sysfs_add_sdev(struct scsi_device *sdev)
> {
> 	
> 	scsi_autopm_get_device(sdev);
> 	pm_runtime_get_noresume(&sdev->sdev_gendev);
> 	[...]
> 	scsi_autopm_put_device(sdev);
> 	[...]
> }
> 
> static int sd_probe(struct device *dev)
> {
> 	[...]
> 	pm_runtime_put_noidle(dev);
> 	scsi_autopm_put_device(sdp);
> 	[...]
> }
> 
> This may work (I'm limited by my imagination in scsi layer :) ).

I'm not sure about this.  To be honest, I did not read the entirety of 
your last message; it had way too much detail.  THere's a time and place 
for that, but when you're brainstorming to figure out the underlying 
cause of a problem and come up with a strategy to fix it, you want to 
concentrate on the overall picture, not the details.

As I understand the situation, you've get a SCSI target with multiple 
logical units, let's say A and B, and you need to make sure that A never 
goes into runtime suspend unless B is already suspended.  In other 
words, B always has to suspend before A and resume after A.

To do this, you register a device link with A as the supplier and B as 
the consumer.  Then the PM core takes care of the ordering for you.

But I don't understand when you set up the device link.  If the timing 
is wrong then, thanks to async SCSI probing, you may have a situation 
where A is registered before B and before the link is set up.  Then 
there's temporarily nothing to stop A from suspending before B.

You also need to prevent each device from suspending before it is 
probed.  That's the easy part I was trying to address before (although 
it may not be so easy if the drivers are in loadable modules and not 
present in the kernel).

You need to think through these issues before proposing actual changes.

> But the pm_runtime_put_noidle() would have to be added to all registered
> scsi_driver{}, perhaps? Or may be I can check for sdp->type?

Like this; it's too early to worry about this sort of thing.

Alan Stern

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v10 1/2] scsi: ufs: Enable power management for wlun
  2021-03-10 16:27                     ` Alan Stern
@ 2021-03-10 16:39                       ` Asutosh Das (asd)
  0 siblings, 0 replies; 20+ messages in thread
From: Asutosh Das (asd) @ 2021-03-10 16:39 UTC (permalink / raw)
  To: Alan Stern
  Cc: Bart Van Assche, Adrian Hunter, Rafael J. Wysocki, cang,
	Martin K. Petersen, open list:TARGET SUBSYSTEM, linux-arm-msm,
	Alim Akhtar, Avri Altman, James E.J. Bottomley,
	Krzysztof Kozlowski, Stanley Chu, Andy Gross, Bjorn Andersson,
	Steven Rostedt, Ingo Molnar, Matthias Brugger, Kiwoong Kim,
	Bean Huo, Lee Jones, Wei Yongjun, Dinghao Liu,
	Gustavo A. R. Silva, Tomas Winkler, Jaegeuk Kim, Satya Tangirala,
	open list,
	moderated list:ARM/SAMSUNG S3C, S5P AND EXYNOS ARM ARCHITECTURES,
	open list:ARM/SAMSUNG S3C, S5P AND EXYNOS ARM ARCHITECTURES,
	moderated list:UNIVERSAL FLASH STORAGE HOST CONTROLLER DRIVER...,
	Linux-PM mailing list

On 3/10/2021 8:27 AM, Alan Stern wrote:
> On Tue, Mar 09, 2021 at 08:04:53PM -0800, Asutosh Das (asd) wrote:
>> On 3/9/2021 7:14 PM, Alan Stern wrote:
>>> On Tue, Mar 09, 2021 at 07:04:34PM -0800, Asutosh Das (asd) wrote:
>>>> Hello
>>>> I & Can (thanks CanG) debugged this further:
>>>>
>>>> Looks like this issue can occur if the sd probe is asynchronous.
>>>>
>>>> Essentially, the sd_probe() is done asynchronously and driver_probe_device()
>>>> invokes pm_runtime_get_suppliers() before invoking sd_probe().
>>>>
>>>> But scsi_probe_and_add_lun() runs in a separate context.
>>>> So the scsi_autopm_put_device() invoked from scsi_scan_host() context
>>>> reduces the link->rpm_active to 1. And sd_probe() invokes
>>>> scsi_autopm_put_device() and starts a timer. And then driver_probe_device()
>>>> invoked from __device_attach_async_helper context reduces the
>>>> link->rpm_active to 1 thus enabling the supplier to suspend before the
>>>> consumer suspends.
>>>
>>>> I don't see a way around this. Please let me know if you
>>>> (@Alan/@Bart/@Adrian) have any thoughts on this.
>>>
>>> How about changing the SCSI core so that it does a runtime_get before
>>> starting an async probe, and the async probe routine does a
>>> runtime_put when it is finished?  In other words, don't allow a device
>>> to go into runtime suspend while it is waiting to be probed.
>>>
>>> I don't think that would be too intrusive.
>>>
>>> Alan Stern
>>>
>>
>> Hi Alan
>> Thanks for the suggestion.
>>
>> Am trying to understand:
>>
>> Do you mean something like this:
>>
>> int scsi_sysfs_add_sdev(struct scsi_device *sdev)
>> {
>> 	
>> 	scsi_autopm_get_device(sdev);
>> 	pm_runtime_get_noresume(&sdev->sdev_gendev);
>> 	[...]
>> 	scsi_autopm_put_device(sdev);
>> 	[...]
>> }
>>
>> static int sd_probe(struct device *dev)
>> {
>> 	[...]
>> 	pm_runtime_put_noidle(dev);
>> 	scsi_autopm_put_device(sdp);
>> 	[...]
>> }
>>
>> This may work (I'm limited by my imagination in scsi layer :) ).
> 
> I'm not sure about this.  To be honest, I did not read the entirety of
> your last message; it had way too much detail.  THere's a time and place
> for that, but when you're brainstorming to figure out the underlying
> cause of a problem and come up with a strategy to fix it, you want to
> concentrate on the overall picture, not the details.
> 
> As I understand the situation, you've get a SCSI target with multiple
> logical units, let's say A and B, and you need to make sure that A never
> goes into runtime suspend unless B is already suspended.  In other
> words, B always has to suspend before A and resume after A.
> 
> To do this, you register a device link with A as the supplier and B as
> the consumer.  Then the PM core takes care of the ordering for you.
> 
> But I don't understand when you set up the device link.  If the timing
> is wrong then, thanks to async SCSI probing, you may have a situation
> where A is registered before B and before the link is set up.  Then
> there's temporarily nothing to stop A from suspending before B.
> 
> You also need to prevent each device from suspending before it is
> probed.  That's the easy part I was trying to address before (although
> it may not be so easy if the drivers are in loadable modules and not
> present in the kernel).
> 
> You need to think through these issues before proposing actual changes.
> 
>> But the pm_runtime_put_noidle() would have to be added to all registered
>> scsi_driver{}, perhaps? Or may be I can check for sdp->type?
> 
> Like this; it's too early to worry about this sort of thing.
> 
> Alan Stern
> 
Hi Alan
Thanks. Understood.

I will check the details and see if I can come up with something.
I'll propose an alternate fix otherwise and drop this change altogether.

Thanks!
-asd

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
Linux Foundation Collaborative Project

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v10 1/2] scsi: ufs: Enable power management for wlun
  2021-03-10  3:04               ` Asutosh Das (asd)
  2021-03-10  3:14                 ` Alan Stern
@ 2021-03-14  9:11                 ` Adrian Hunter
  2021-03-15 22:22                   ` Asutosh Das (asd)
  1 sibling, 1 reply; 20+ messages in thread
From: Adrian Hunter @ 2021-03-14  9:11 UTC (permalink / raw)
  To: Asutosh Das (asd), Alan Stern, Bart Van Assche
  Cc: Rafael J. Wysocki, cang, Martin K. Petersen,
	open list:TARGET SUBSYSTEM, linux-arm-msm, Alim Akhtar,
	Avri Altman, James E.J. Bottomley, Krzysztof Kozlowski,
	Stanley Chu, Andy Gross, Bjorn Andersson, Steven Rostedt,
	Ingo Molnar, Matthias Brugger, Kiwoong Kim, Bean Huo, Lee Jones,
	Wei Yongjun, Dinghao Liu, Gustavo A. R. Silva, Tomas Winkler,
	Jaegeuk Kim, Satya Tangirala, open list,
	moderated list:ARM/SAMSUNG S3C, S5P AND EXYNOS ARM ARCHITECTURES,
	open list:ARM/SAMSUNG S3C, S5P AND EXYNOS ARM ARCHITECTURES,
	moderated list:UNIVERSAL FLASH STORAGE HOST CONTROLLER DRIVER...,
	Linux-PM mailing list

On 10/03/21 5:04 am, Asutosh Das (asd) wrote:
> On 3/9/2021 7:56 AM, Asutosh Das (asd) wrote:
>> On 3/8/2021 9:17 AM, Rafael J. Wysocki wrote:
>>> On Mon, Mar 8, 2021 at 5:21 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>>>>
>>>> On Sat, Mar 6, 2021 at 5:17 PM Alan Stern <stern@rowland.harvard.edu> wrote:
>>>>>
>>>>> On Fri, Mar 05, 2021 at 06:54:24PM -0800, Asutosh Das (asd) wrote:
>>>>>
>>>>>> Now during my testing I see a weird issue sometimes (1 in 7).
>>>>>> Scenario - bootups
>>>>>>
>>>>>> Issue:
>>>>>> The supplier 'ufs_device_wlun 0:0:0:49488' goes into runtime suspend even
>>>>>> when one/more of its consumers are in RPM_ACTIVE state.
>>>>>>
>>>>>> *Log:
>>>>>> [   10.056379][  T206] sd 0:0:0:1: [sdb] Synchronizing SCSI cache
>>>>>> [   10.062497][  T113] sd 0:0:0:5: [sdf] Synchronizing SCSI cache
>>>>>> [   10.356600][   T32] sd 0:0:0:7: [sdh] Synchronizing SCSI cache
>>>>>> [   10.362944][  T174] sd 0:0:0:3: [sdd] Synchronizing SCSI cache
>>>>>> [   10.696627][   T83] sd 0:0:0:2: [sdc] Synchronizing SCSI cache
>>>>>> [   10.704562][  T170] sd 0:0:0:6: [sdg] Synchronizing SCSI cache
>>>>>> [   10.980602][    T5] sd 0:0:0:0: [sda] Synchronizing SCSI cache
>>>>>>
>>>>>> /** Printing all the consumer nodes of supplier **/
>>>>>> [   10.987327][    T5] ufs_device_wlun 0:0:0:49488: usage-count @ suspend: 0
>>>>>> <-- this is the usage_count
>>>>>> [   10.994440][    T5] ufs_rpmb_wlun 0:0:0:49476: PM state - 2
>>>>>> [   11.000402][    T5] scsi 0:0:0:49456: PM state - 2
>>>>>> [   11.005453][    T5] sd 0:0:0:0: PM state - 2
>>>>>> [   11.009958][    T5] sd 0:0:0:1: PM state - 2
>>>>>> [   11.014469][    T5] sd 0:0:0:2: PM state - 2
>>>>>> [   11.019072][    T5] sd 0:0:0:3: PM state - 2
>>>>>> [   11.023595][    T5] sd 0:0:0:4: PM state - 0 << RPM_ACTIVE
>>>>>> [   11.353298][    T5] sd 0:0:0:5: PM state - 2
>>>>>> [   11.357726][    T5] sd 0:0:0:6: PM state - 2
>>>>>> [   11.362155][    T5] sd 0:0:0:7: PM state - 2
>>>>>> [   11.366584][    T5] ufshcd-qcom 1d84000.ufshc: __ufshcd_wl_suspend - 8709
>>>>>> [   11.374366][    T5] ufs_device_wlun 0:0:0:49488: __ufshcd_wl_suspend -
>>>>>> (0) has rpm_active flags
>>>>
>>>> Do you mean that rpm_active of the link between the consumer and the
>>>> supplier is greater than 0 at this point and the consumer is
>>>
>>> I mean is rpm_active of the link greater than 1 (because 1 means "no
>>> active references to the supplier")?
>> Hi Rafael:
>> No - it is not greater than 1.
>>
>> I'm trying to understand what's going on in it; will update when I've something.
>>
>>>
>>>> RPM_ACTIVE, but the supplier suspends successfully nevertheless?
>>>>
>>>>>> [   11.383376][    T5] ufs_device_wlun 0:0:0:49488:
>>>>>> ufshcd_wl_runtime_suspend <-- Supplier suspends fine.
>>>>>> [   12.977318][  T174] sd 0:0:0:4: [sde] Synchronizing SCSI cache
>>>>>>
>>>>>> And the the suspend of sde is stuck now:
>>>>>> schedule+0x9c/0xe0
>>>>>> schedule_timeout+0x40/0x128
>>>>>> io_schedule_timeout+0x44/0x68
>>>>>> wait_for_common_io+0x7c/0x100
>>>>>> wait_for_completion_io+0x14/0x20
>>>>>> blk_execute_rq+0x90/0xcc
>>>>>> __scsi_execute+0x104/0x1c4
>>>>>> sd_sync_cache+0xf8/0x2a0
>>>>>> sd_suspend_common+0x74/0x11c
>>>>>> sd_suspend_runtime+0x14/0x20
>>>>>> scsi_runtime_suspend+0x64/0x94
>>>>>> __rpm_callback+0x80/0x2a4
>>>>>> rpm_suspend+0x308/0x614
>>>>>> pm_runtime_work+0x98/0xa8
>>>>>>
>>>>>> I added 'DL_FLAG_RPM_ACTIVE' while creating links.
>>>>>>        if (hba->sdev_ufs_device) {
>>>>>>                link = device_link_add(&sdev->sdev_gendev,
>>>>>>                                    &hba->sdev_ufs_device->sdev_gendev,
>>>>>>                                   DL_FLAG_PM_RUNTIME|DL_FLAG_RPM_ACTIVE);
>>>>>> I didn't expect this to resolve the issue anyway and it didn't.
>>>>>>
>>>>>> Another interesting point here is when I resume any of the above suspended
>>>>>> consumers, it all goes back to normal, which is kind of expected. I tried
>>>>>> resuming the consumer and the supplier is resumed and the supplier is
>>>>>> suspended when all the consumers are suspended.
>>>>>>
>>>>>> Any pointers on this issue please?
>>>>>>
>>>>>> @Bart/@Alan - Do you've any pointers please?
>>>>>
>>>>> It's very noticeable that although you seem to have isolated a bug in
>>>>> the power management subsystem (supplier goes into runtime suspend
>>>>> even when one of its consumers is still active), you did not CC the
>>>>> power management maintainer or mailing list.
>>>>>
>>>>> I have added the appropriate CC's.
>>>>
>>>> Thanks Alan!
>>
>>
> 
> Hello
> I & Can (thanks CanG) debugged this further:
> 
> Looks like this issue can occur if the sd probe is asynchronous.
> 
> Essentially, the sd_probe() is done asynchronously and driver_probe_device() invokes pm_runtime_get_suppliers() before invoking sd_probe().
> 
> But scsi_probe_and_add_lun() runs in a separate context.
> So the scsi_autopm_put_device() invoked from scsi_scan_host() context reduces the link->rpm_active to 1. And sd_probe() invokes scsi_autopm_put_device() and starts a timer. And then driver_probe_device() invoked from __device_attach_async_helper context reduces the link->rpm_active to 1 thus enabling the supplier to suspend before the consumer suspends.
> 
> So if:
> Context T1:
> [1] scsi_probe_and_add_lun()
> [2]    |- scsi_autopm_put_device() - reduce the link->rpm_active to 1
> 
> Context T2:
> __device_attach_async_helper()
>     |- driver_probe_device()
>         |- sd_probe()
> In between [1] and [2] say, driver_probe_device() -> sd_probe() is invoked in a separate context from __device_attach_async_helper().
> The driver_probe_device() -> pm_runtime_get_suppliers() but [2] would reduce link->rpm_active to 1.
> Then sd_probe() would invoke rpm_resume() and proceed as is.
> When sd_probe() invokes scsi_autopm_put_device() it'd start a timer, dev->power.timer_autosuspends = 1.
> 
> Now then, pm_runtime_put_suppliers() is invoked from driver_probe_device() and that makes the link->rpm_active = 1.
> But by now, the corresponding 'sd dev' (consumer) usage_count = 0, state = RPM_ACTIVE and link->rpm_active = 1.
> At this point of time, all other 'sd dev' (consumers) _may_ be suspended or active but would have the link->rpm_active = 1.

Is this with DL_FLAG_RPM_ACTIVE?  In that case, wouldn't active
consumers have link->rpm_active = 2 and also have incremented
the supplier's usage_count?

Another outstanding issue that comes to mind, is to ensure
hba->sdev_ufs_device does not runtime suspend before it is probed.
I suggest changing ufshcd_slave_configure() so it does not set
sdev->rpm_autosuspend for hba->sdev_ufs_device, and instead do
pm_runtime_allow / pm_runtime_forbid() in ufshcd_wl_probe() /
ufshcd_wl_remove() respectively.

However we still want to stop hba->sdev_ufs_device runtime
suspending while consumers are being added.  With that in mind,
I would expect pm_runtime_get_noresume(&hba->sdev_ufs_device->sdev_gendev)
in ufshcd_scsi_add_wlus() to come *before*
ufshcd_blk_pm_runtime_init(hba->sdev_ufs_device).  In fact, it would
be more logical to make it, pm_runtime_get_sync() since we require
hba->sdev_ufs_device to be active at that point.


> 
> Since the supplier has 0 auto-suspend delay, it now suspends!
> 
> 
> Context [T1]
> Call trace:
> dump_backtrace+0x0/0x1d4
> show_stack+0x18/0x24
> dump_stack+0xc4/0x144
> __pm_runtime_idle+0xb4/0x184
> scsi_autopm_put_device+0x18/0x24
> scsi_sysfs_add_sdev+0x26c/0x278
> scsi_probe_and_add_lun+0xbac/0xd48
> __scsi_scan_target+0x38c/0x510
> scsi_scan_host_selected+0x14c/0x1e4
> scsi_scan_host+0x1e0/0x228
> ufshcd_async_scan+0x39c/0x408
> async_run_entry_fn+0x48/0x128
> process_one_work+0x1f0/0x470
> worker_thread+0x26c/0x4c8
> kthread+0x13c/0x320
> ret_from_fork+0x10/0x18
> 
> 
> Context [T2]
> Call trace:
> dump_backtrace+0x0/0x1d4
> show_stack+0x18/0x24
> dump_stack+0xc4/0x144
> rpm_get_suppliers+0x48/0x1ac
> __rpm_callback+0x58/0x12c
> rpm_resume+0x3a4/0x618
> __pm_runtime_resume+0x50/0x80
> scsi_autopm_get_device+0x20/0x54
> sd_probe+0x40/0x3d0
> really_probe+0x1bc/0x4a0
> driver_probe_device+0x84/0xf0
> __device_attach_driver+0x114/0x138
> bus_for_each_drv+0x84/0xd0
> __device_attach_async_helper+0x7c/0xf0
> async_run_entry_fn+0x48/0x128
> process_one_work+0x1f0/0x470
> worker_thread+0x26c/0x4c8
> kthread+0x13c/0x320
> ret_from_fork+0x10/0x18
> 
> Below prints show how link->rpm_active becomes 1 for sd 0:0:0:4
> [    7.574654][  T212] Call trace:
> [    7.574657][  T212]  dump_backtrace+0x0/0x1d4
> [    7.574661][  T212]  show_stack+0x18/0x24
> [    7.574665][  T212]  dump_stack+0xc4/0x144
> [    7.574668][  T212]  __pm_runtime_idle+0xb4/0x184
> [    7.574671][  T212]  scsi_autopm_put_device+0x18/0x24
> [    7.574675][  T212]  sd_probe+0x314/0x3d0
> [    7.574677][  T212]  really_probe+0x1bc/0x4a0
> [    7.574680][  T212]  driver_probe_device+0x84/0xf0
> [    7.574683][  T212]  __device_attach_driver+0x114/0x138
> [    7.574686][  T212]  bus_for_each_drv+0x84/0xd0
> [    7.574689][  T212]  __device_attach_async_helper+0x7c/0xf0
> [    7.574692][  T212]  async_run_entry_fn+0x48/0x128
> [    7.574695][  T212]  process_one_work+0x1f0/0x470
> [    7.574698][  T212]  worker_thread+0x26c/0x4c8
> [    7.574700][  T212]  kthread+0x13c/0x320
> [    7.574703][  T212]  ret_from_fork+0x10/0x18
> [    7.574706][  T212] sd 0:0:0:4: scsi_runtime_idle
> [    7.574712][  T212] sd 0:0:0:4: __pm_runtime_idle: aft: [UFSDBG]: pwr.timer_autosuspends: 1 pwr.request_pending: 0 retval: -16 pwr.request: 0 usage_count: 0 rpm_status: 0 link-rpm_active:2
> [    7.574715][  T212] sd 0:0:0:4: sd_probe: [UFSDBG]: Exit
> [    7.574738][  T212] sd 0:0:0:4: __pm_runtime_idle: b4: [UFSDBG]: pwr.request: 0 usage_count: 0 rpm_status: 0 link-rpm_active:2
> 
> [    7.574752][  T212] Workqueue: events_unbound async_run_entry_fn
> [    7.574754][  T212] Call trace:
> [    7.574758][  T212]  dump_backtrace+0x0/0x1d4
> [    7.574761][  T212]  show_stack+0x18/0x24
> [    7.574765][  T212]  dump_stack+0xc4/0x144
> [    7.574767][  T212]  __pm_runtime_idle+0xb4/0x184
> [    7.574770][  T212]  driver_probe_device+0x94/0xf0
> [    7.574773][  T212]  __device_attach_driver+0x114/0x138
> [    7.574775][  T212]  bus_for_each_drv+0x84/0xd0
> [    7.574778][  T212]  __device_attach_async_helper+0x7c/0xf0
> [    7.574781][  T212]  async_run_entry_fn+0x48/0x128
> [    7.574783][  T212]  process_one_work+0x1f0/0x470
> [    7.574786][  T212]  worker_thread+0x26c/0x4c8
> [    7.574788][  T212]  kthread+0x13c/0x320
> [    7.574791][  T212]  ret_from_fork+0x10/0x18
> [    7.574848][   T80] sd 0:0:0:4: scsi_runtime_idle
> [    7.574858][  T212] sd 0:0:0:4: __pm_runtime_idle: aft: [UFSDBG]: pwr.timer_autosuspends: 1 pwr.request_pending: 0 retval: 0 pwr.request: 0 usage_count: 0 rpm_status: 0 link-rpm_active:2
> [    7.574863][  T212] sd 0:0:0:4: pm_runtime_put_suppliers: [UFSDBG]: rpm_status: 0 link-rpm_active:1
> [    7.574866][  T212] sd 0:0:0:4: async probe completed
> [    7.574870][  T212] sd 0:0:0:4: __pm_runtime_idle: b4: [UFSDBG]: pwr.request: 0 usage_count: 0 rpm_status: 0 link-rpm_active:1
> 
> 
> So, from the above it looks like when async probe is enabled this is a possibility.
> 
> I don't see a way around this. Please let me know if you (@Alan/@Bart/@Adrian) have any thoughts on this.
> 
> Thanks,
> -asd
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v10 1/2] scsi: ufs: Enable power management for wlun
  2021-03-14  9:11                 ` Adrian Hunter
@ 2021-03-15 22:22                   ` Asutosh Das (asd)
  2021-03-16  7:48                     ` Adrian Hunter
  0 siblings, 1 reply; 20+ messages in thread
From: Asutosh Das (asd) @ 2021-03-15 22:22 UTC (permalink / raw)
  To: Adrian Hunter, Alan Stern, Bart Van Assche
  Cc: Rafael J. Wysocki, cang, Martin K. Petersen,
	open list:TARGET SUBSYSTEM, linux-arm-msm, Alim Akhtar,
	Avri Altman, James E.J. Bottomley, Krzysztof Kozlowski,
	Stanley Chu, Andy Gross, Bjorn Andersson, Steven Rostedt,
	Ingo Molnar, Matthias Brugger, Kiwoong Kim, Bean Huo, Lee Jones,
	Wei Yongjun, Dinghao Liu, Gustavo A. R. Silva, Tomas Winkler,
	Jaegeuk Kim, Satya Tangirala, open list,
	moderated list:ARM/SAMSUNG S3C, S5P AND EXYNOS ARM ARCHITECTURES,
	open list:ARM/SAMSUNG S3C, S5P AND EXYNOS ARM ARCHITECTURES,
	moderated list:UNIVERSAL FLASH STORAGE HOST CONTROLLER DRIVER...,
	Linux-PM mailing list

On 3/14/2021 1:11 AM, Adrian Hunter wrote:
> On 10/03/21 5:04 am, Asutosh Das (asd) wrote:
>> On 3/9/2021 7:56 AM, Asutosh Das (asd) wrote:
>>> On 3/8/2021 9:17 AM, Rafael J. Wysocki wrote:
>>>> On Mon, Mar 8, 2021 at 5:21 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>>>>>
>>>>> On Sat, Mar 6, 2021 at 5:17 PM Alan Stern <stern@rowland.harvard.edu> wrote:
>>>>>>
>>>>>> On Fri, Mar 05, 2021 at 06:54:24PM -0800, Asutosh Das (asd) wrote:
>>>>>>
>>>>>>> Now during my testing I see a weird issue sometimes (1 in 7).
>>>>>>> Scenario - bootups
>>>>>>>
>>>>>>> Issue:
>>>>>>> The supplier 'ufs_device_wlun 0:0:0:49488' goes into runtime suspend even
>>>>>>> when one/more of its consumers are in RPM_ACTIVE state.
>>>>>>>
>>>>>>> *Log:
>>>>>>> [   10.056379][  T206] sd 0:0:0:1: [sdb] Synchronizing SCSI cache
>>>>>>> [   10.062497][  T113] sd 0:0:0:5: [sdf] Synchronizing SCSI cache
>>>>>>> [   10.356600][   T32] sd 0:0:0:7: [sdh] Synchronizing SCSI cache
>>>>>>> [   10.362944][  T174] sd 0:0:0:3: [sdd] Synchronizing SCSI cache
>>>>>>> [   10.696627][   T83] sd 0:0:0:2: [sdc] Synchronizing SCSI cache
>>>>>>> [   10.704562][  T170] sd 0:0:0:6: [sdg] Synchronizing SCSI cache
>>>>>>> [   10.980602][    T5] sd 0:0:0:0: [sda] Synchronizing SCSI cache
>>>>>>>
>>>>>>> /** Printing all the consumer nodes of supplier **/
>>>>>>> [   10.987327][    T5] ufs_device_wlun 0:0:0:49488: usage-count @ suspend: 0
>>>>>>> <-- this is the usage_count
>>>>>>> [   10.994440][    T5] ufs_rpmb_wlun 0:0:0:49476: PM state - 2
>>>>>>> [   11.000402][    T5] scsi 0:0:0:49456: PM state - 2
>>>>>>> [   11.005453][    T5] sd 0:0:0:0: PM state - 2
>>>>>>> [   11.009958][    T5] sd 0:0:0:1: PM state - 2
>>>>>>> [   11.014469][    T5] sd 0:0:0:2: PM state - 2
>>>>>>> [   11.019072][    T5] sd 0:0:0:3: PM state - 2
>>>>>>> [   11.023595][    T5] sd 0:0:0:4: PM state - 0 << RPM_ACTIVE
>>>>>>> [   11.353298][    T5] sd 0:0:0:5: PM state - 2
>>>>>>> [   11.357726][    T5] sd 0:0:0:6: PM state - 2
>>>>>>> [   11.362155][    T5] sd 0:0:0:7: PM state - 2
>>>>>>> [   11.366584][    T5] ufshcd-qcom 1d84000.ufshc: __ufshcd_wl_suspend - 8709
>>>>>>> [   11.374366][    T5] ufs_device_wlun 0:0:0:49488: __ufshcd_wl_suspend -
>>>>>>> (0) has rpm_active flags
>>>>>
>>>>> Do you mean that rpm_active of the link between the consumer and the
>>>>> supplier is greater than 0 at this point and the consumer is
>>>>
>>>> I mean is rpm_active of the link greater than 1 (because 1 means "no
>>>> active references to the supplier")?
>>> Hi Rafael:
>>> No - it is not greater than 1.
>>>
>>> I'm trying to understand what's going on in it; will update when I've something.
>>>
>>>>
>>>>> RPM_ACTIVE, but the supplier suspends successfully nevertheless?
>>>>>
>>>>>>> [   11.383376][    T5] ufs_device_wlun 0:0:0:49488:
>>>>>>> ufshcd_wl_runtime_suspend <-- Supplier suspends fine.
>>>>>>> [   12.977318][  T174] sd 0:0:0:4: [sde] Synchronizing SCSI cache
>>>>>>>
>>>>>>> And the the suspend of sde is stuck now:
>>>>>>> schedule+0x9c/0xe0
>>>>>>> schedule_timeout+0x40/0x128
>>>>>>> io_schedule_timeout+0x44/0x68
>>>>>>> wait_for_common_io+0x7c/0x100
>>>>>>> wait_for_completion_io+0x14/0x20
>>>>>>> blk_execute_rq+0x90/0xcc
>>>>>>> __scsi_execute+0x104/0x1c4
>>>>>>> sd_sync_cache+0xf8/0x2a0
>>>>>>> sd_suspend_common+0x74/0x11c
>>>>>>> sd_suspend_runtime+0x14/0x20
>>>>>>> scsi_runtime_suspend+0x64/0x94
>>>>>>> __rpm_callback+0x80/0x2a4
>>>>>>> rpm_suspend+0x308/0x614
>>>>>>> pm_runtime_work+0x98/0xa8
>>>>>>>
>>>>>>> I added 'DL_FLAG_RPM_ACTIVE' while creating links.
>>>>>>>         if (hba->sdev_ufs_device) {
>>>>>>>                 link = device_link_add(&sdev->sdev_gendev,
>>>>>>>                                     &hba->sdev_ufs_device->sdev_gendev,
>>>>>>>                                    DL_FLAG_PM_RUNTIME|DL_FLAG_RPM_ACTIVE);
>>>>>>> I didn't expect this to resolve the issue anyway and it didn't.
>>>>>>>
>>>>>>> Another interesting point here is when I resume any of the above suspended
>>>>>>> consumers, it all goes back to normal, which is kind of expected. I tried
>>>>>>> resuming the consumer and the supplier is resumed and the supplier is
>>>>>>> suspended when all the consumers are suspended.
>>>>>>>
>>>>>>> Any pointers on this issue please?
>>>>>>>
>>>>>>> @Bart/@Alan - Do you've any pointers please?
>>>>>>
>>>>>> It's very noticeable that although you seem to have isolated a bug in
>>>>>> the power management subsystem (supplier goes into runtime suspend
>>>>>> even when one of its consumers is still active), you did not CC the
>>>>>> power management maintainer or mailing list.
>>>>>>
>>>>>> I have added the appropriate CC's.
>>>>>
>>>>> Thanks Alan!
>>>
>>>
>>
>> Hello
>> I & Can (thanks CanG) debugged this further:
>>
>> Looks like this issue can occur if the sd probe is asynchronous.
>>
>> Essentially, the sd_probe() is done asynchronously and driver_probe_device() invokes pm_runtime_get_suppliers() before invoking sd_probe().
>>
>> But scsi_probe_and_add_lun() runs in a separate context.
>> So the scsi_autopm_put_device() invoked from scsi_scan_host() context reduces the link->rpm_active to 1. And sd_probe() invokes scsi_autopm_put_device() and starts a timer. And then driver_probe_device() invoked from __device_attach_async_helper context reduces the link->rpm_active to 1 thus enabling the supplier to suspend before the consumer suspends.
>>
>> So if:
>> Context T1:
>> [1] scsi_probe_and_add_lun()
>> [2]    |- scsi_autopm_put_device() - reduce the link->rpm_active to 1
>>
>> Context T2:
>> __device_attach_async_helper()
>>      |- driver_probe_device()
>>          |- sd_probe()
>> In between [1] and [2] say, driver_probe_device() -> sd_probe() is invoked in a separate context from __device_attach_async_helper().
>> The driver_probe_device() -> pm_runtime_get_suppliers() but [2] would reduce link->rpm_active to 1.
>> Then sd_probe() would invoke rpm_resume() and proceed as is.
>> When sd_probe() invokes scsi_autopm_put_device() it'd start a timer, dev->power.timer_autosuspends = 1.
>>
>> Now then, pm_runtime_put_suppliers() is invoked from driver_probe_device() and that makes the link->rpm_active = 1.
>> But by now, the corresponding 'sd dev' (consumer) usage_count = 0, state = RPM_ACTIVE and link->rpm_active = 1.
>> At this point of time, all other 'sd dev' (consumers) _may_ be suspended or active but would have the link->rpm_active = 1.
> 
> Is this with DL_FLAG_RPM_ACTIVE?  In that case, wouldn't active
> consumers have link->rpm_active = 2 and also have incremented
> the supplier's usage_count?
> 
> Another outstanding issue that comes to mind, is to ensure
> hba->sdev_ufs_device does not runtime suspend before it is probed.
> I suggest changing ufshcd_slave_configure() so it does not set
> sdev->rpm_autosuspend for hba->sdev_ufs_device, and instead do
> pm_runtime_allow / pm_runtime_forbid() in ufshcd_wl_probe() /
> ufshcd_wl_remove() respectively.
> 
> However we still want to stop hba->sdev_ufs_device runtime
> suspending while consumers are being added.  With that in mind,
> I would expect pm_runtime_get_noresume(&hba->sdev_ufs_device->sdev_gendev)
> in ufshcd_scsi_add_wlus() to come *before*
> ufshcd_blk_pm_runtime_init(hba->sdev_ufs_device).  In fact, it would
> be more logical to make it, pm_runtime_get_sync() since we require
> hba->sdev_ufs_device to be active at that point.
> 
> 

Hi Adrian,
I think the v11 that I pushed can handle this.
runtime-suspend is forbidden at probe and is re-enabled after probe is 
done. Please take a look and let me know if I'm missing something.

>>
>> Since the supplier has 0 auto-suspend delay, it now suspends!
>>
>>
>> Context [T1]
>> Call trace:
>> dump_backtrace+0x0/0x1d4
>> show_stack+0x18/0x24
>> dump_stack+0xc4/0x144
>> __pm_runtime_idle+0xb4/0x184
>> scsi_autopm_put_device+0x18/0x24
>> scsi_sysfs_add_sdev+0x26c/0x278
>> scsi_probe_and_add_lun+0xbac/0xd48
>> __scsi_scan_target+0x38c/0x510
>> scsi_scan_host_selected+0x14c/0x1e4
>> scsi_scan_host+0x1e0/0x228
>> ufshcd_async_scan+0x39c/0x408
>> async_run_entry_fn+0x48/0x128
>> process_one_work+0x1f0/0x470
>> worker_thread+0x26c/0x4c8
>> kthread+0x13c/0x320
>> ret_from_fork+0x10/0x18
>>
>>
>> Context [T2]
>> Call trace:
>> dump_backtrace+0x0/0x1d4
>> show_stack+0x18/0x24
>> dump_stack+0xc4/0x144
>> rpm_get_suppliers+0x48/0x1ac
>> __rpm_callback+0x58/0x12c
>> rpm_resume+0x3a4/0x618
>> __pm_runtime_resume+0x50/0x80
>> scsi_autopm_get_device+0x20/0x54
>> sd_probe+0x40/0x3d0
>> really_probe+0x1bc/0x4a0
>> driver_probe_device+0x84/0xf0
>> __device_attach_driver+0x114/0x138
>> bus_for_each_drv+0x84/0xd0
>> __device_attach_async_helper+0x7c/0xf0
>> async_run_entry_fn+0x48/0x128
>> process_one_work+0x1f0/0x470
>> worker_thread+0x26c/0x4c8
>> kthread+0x13c/0x320
>> ret_from_fork+0x10/0x18
>>
>> Below prints show how link->rpm_active becomes 1 for sd 0:0:0:4
>> [    7.574654][  T212] Call trace:
>> [    7.574657][  T212]  dump_backtrace+0x0/0x1d4
>> [    7.574661][  T212]  show_stack+0x18/0x24
>> [    7.574665][  T212]  dump_stack+0xc4/0x144
>> [    7.574668][  T212]  __pm_runtime_idle+0xb4/0x184
>> [    7.574671][  T212]  scsi_autopm_put_device+0x18/0x24
>> [    7.574675][  T212]  sd_probe+0x314/0x3d0
>> [    7.574677][  T212]  really_probe+0x1bc/0x4a0
>> [    7.574680][  T212]  driver_probe_device+0x84/0xf0
>> [    7.574683][  T212]  __device_attach_driver+0x114/0x138
>> [    7.574686][  T212]  bus_for_each_drv+0x84/0xd0
>> [    7.574689][  T212]  __device_attach_async_helper+0x7c/0xf0
>> [    7.574692][  T212]  async_run_entry_fn+0x48/0x128
>> [    7.574695][  T212]  process_one_work+0x1f0/0x470
>> [    7.574698][  T212]  worker_thread+0x26c/0x4c8
>> [    7.574700][  T212]  kthread+0x13c/0x320
>> [    7.574703][  T212]  ret_from_fork+0x10/0x18
>> [    7.574706][  T212] sd 0:0:0:4: scsi_runtime_idle
>> [    7.574712][  T212] sd 0:0:0:4: __pm_runtime_idle: aft: [UFSDBG]: pwr.timer_autosuspends: 1 pwr.request_pending: 0 retval: -16 pwr.request: 0 usage_count: 0 rpm_status: 0 link-rpm_active:2
>> [    7.574715][  T212] sd 0:0:0:4: sd_probe: [UFSDBG]: Exit
>> [    7.574738][  T212] sd 0:0:0:4: __pm_runtime_idle: b4: [UFSDBG]: pwr.request: 0 usage_count: 0 rpm_status: 0 link-rpm_active:2
>>
>> [    7.574752][  T212] Workqueue: events_unbound async_run_entry_fn
>> [    7.574754][  T212] Call trace:
>> [    7.574758][  T212]  dump_backtrace+0x0/0x1d4
>> [    7.574761][  T212]  show_stack+0x18/0x24
>> [    7.574765][  T212]  dump_stack+0xc4/0x144
>> [    7.574767][  T212]  __pm_runtime_idle+0xb4/0x184
>> [    7.574770][  T212]  driver_probe_device+0x94/0xf0
>> [    7.574773][  T212]  __device_attach_driver+0x114/0x138
>> [    7.574775][  T212]  bus_for_each_drv+0x84/0xd0
>> [    7.574778][  T212]  __device_attach_async_helper+0x7c/0xf0
>> [    7.574781][  T212]  async_run_entry_fn+0x48/0x128
>> [    7.574783][  T212]  process_one_work+0x1f0/0x470
>> [    7.574786][  T212]  worker_thread+0x26c/0x4c8
>> [    7.574788][  T212]  kthread+0x13c/0x320
>> [    7.574791][  T212]  ret_from_fork+0x10/0x18
>> [    7.574848][   T80] sd 0:0:0:4: scsi_runtime_idle
>> [    7.574858][  T212] sd 0:0:0:4: __pm_runtime_idle: aft: [UFSDBG]: pwr.timer_autosuspends: 1 pwr.request_pending: 0 retval: 0 pwr.request: 0 usage_count: 0 rpm_status: 0 link-rpm_active:2
>> [    7.574863][  T212] sd 0:0:0:4: pm_runtime_put_suppliers: [UFSDBG]: rpm_status: 0 link-rpm_active:1
>> [    7.574866][  T212] sd 0:0:0:4: async probe completed
>> [    7.574870][  T212] sd 0:0:0:4: __pm_runtime_idle: b4: [UFSDBG]: pwr.request: 0 usage_count: 0 rpm_status: 0 link-rpm_active:1
>>
>>
>> So, from the above it looks like when async probe is enabled this is a possibility.
>>
>> I don't see a way around this. Please let me know if you (@Alan/@Bart/@Adrian) have any thoughts on this.
>>
>> Thanks,
>> -asd
>>
> 


-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
Linux Foundation Collaborative Project

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v10 1/2] scsi: ufs: Enable power management for wlun
  2021-03-15 22:22                   ` Asutosh Das (asd)
@ 2021-03-16  7:48                     ` Adrian Hunter
  2021-03-16 20:35                       ` Asutosh Das (asd)
  0 siblings, 1 reply; 20+ messages in thread
From: Adrian Hunter @ 2021-03-16  7:48 UTC (permalink / raw)
  To: Asutosh Das (asd), Alan Stern, Bart Van Assche
  Cc: Rafael J. Wysocki, cang, Martin K. Petersen,
	open list:TARGET SUBSYSTEM, linux-arm-msm, Alim Akhtar,
	Avri Altman, James E.J. Bottomley, Krzysztof Kozlowski,
	Stanley Chu, Andy Gross, Bjorn Andersson, Steven Rostedt,
	Ingo Molnar, Matthias Brugger, Kiwoong Kim, Bean Huo, Lee Jones,
	Wei Yongjun, Dinghao Liu, Gustavo A. R. Silva, Tomas Winkler,
	Jaegeuk Kim, Satya Tangirala, open list,
	moderated list:ARM/SAMSUNG S3C, S5P AND EXYNOS ARM ARCHITECTURES,
	open list:ARM/SAMSUNG S3C, S5P AND EXYNOS ARM ARCHITECTURES,
	moderated list:UNIVERSAL FLASH STORAGE HOST CONTROLLER DRIVER...,
	Linux-PM mailing list

On 16/03/21 12:22 am, Asutosh Das (asd) wrote:
> On 3/14/2021 1:11 AM, Adrian Hunter wrote:
>> On 10/03/21 5:04 am, Asutosh Das (asd) wrote:
>>> On 3/9/2021 7:56 AM, Asutosh Das (asd) wrote:
>>>> On 3/8/2021 9:17 AM, Rafael J. Wysocki wrote:
>>>>> On Mon, Mar 8, 2021 at 5:21 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>>>>>>
>>>>>> On Sat, Mar 6, 2021 at 5:17 PM Alan Stern <stern@rowland.harvard.edu> wrote:
>>>>>>>
>>>>>>> On Fri, Mar 05, 2021 at 06:54:24PM -0800, Asutosh Das (asd) wrote:
>>>>>>>
>>>>>>>> Now during my testing I see a weird issue sometimes (1 in 7).
>>>>>>>> Scenario - bootups
>>>>>>>>
>>>>>>>> Issue:
>>>>>>>> The supplier 'ufs_device_wlun 0:0:0:49488' goes into runtime suspend even
>>>>>>>> when one/more of its consumers are in RPM_ACTIVE state.
>>>>>>>>
>>>>>>>> *Log:
>>>>>>>> [   10.056379][  T206] sd 0:0:0:1: [sdb] Synchronizing SCSI cache
>>>>>>>> [   10.062497][  T113] sd 0:0:0:5: [sdf] Synchronizing SCSI cache
>>>>>>>> [   10.356600][   T32] sd 0:0:0:7: [sdh] Synchronizing SCSI cache
>>>>>>>> [   10.362944][  T174] sd 0:0:0:3: [sdd] Synchronizing SCSI cache
>>>>>>>> [   10.696627][   T83] sd 0:0:0:2: [sdc] Synchronizing SCSI cache
>>>>>>>> [   10.704562][  T170] sd 0:0:0:6: [sdg] Synchronizing SCSI cache
>>>>>>>> [   10.980602][    T5] sd 0:0:0:0: [sda] Synchronizing SCSI cache
>>>>>>>>
>>>>>>>> /** Printing all the consumer nodes of supplier **/
>>>>>>>> [   10.987327][    T5] ufs_device_wlun 0:0:0:49488: usage-count @ suspend: 0
>>>>>>>> <-- this is the usage_count
>>>>>>>> [   10.994440][    T5] ufs_rpmb_wlun 0:0:0:49476: PM state - 2
>>>>>>>> [   11.000402][    T5] scsi 0:0:0:49456: PM state - 2
>>>>>>>> [   11.005453][    T5] sd 0:0:0:0: PM state - 2
>>>>>>>> [   11.009958][    T5] sd 0:0:0:1: PM state - 2
>>>>>>>> [   11.014469][    T5] sd 0:0:0:2: PM state - 2
>>>>>>>> [   11.019072][    T5] sd 0:0:0:3: PM state - 2
>>>>>>>> [   11.023595][    T5] sd 0:0:0:4: PM state - 0 << RPM_ACTIVE
>>>>>>>> [   11.353298][    T5] sd 0:0:0:5: PM state - 2
>>>>>>>> [   11.357726][    T5] sd 0:0:0:6: PM state - 2
>>>>>>>> [   11.362155][    T5] sd 0:0:0:7: PM state - 2
>>>>>>>> [   11.366584][    T5] ufshcd-qcom 1d84000.ufshc: __ufshcd_wl_suspend - 8709
>>>>>>>> [   11.374366][    T5] ufs_device_wlun 0:0:0:49488: __ufshcd_wl_suspend -
>>>>>>>> (0) has rpm_active flags
>>>>>>
>>>>>> Do you mean that rpm_active of the link between the consumer and the
>>>>>> supplier is greater than 0 at this point and the consumer is
>>>>>
>>>>> I mean is rpm_active of the link greater than 1 (because 1 means "no
>>>>> active references to the supplier")?
>>>> Hi Rafael:
>>>> No - it is not greater than 1.
>>>>
>>>> I'm trying to understand what's going on in it; will update when I've something.
>>>>
>>>>>
>>>>>> RPM_ACTIVE, but the supplier suspends successfully nevertheless?
>>>>>>
>>>>>>>> [   11.383376][    T5] ufs_device_wlun 0:0:0:49488:
>>>>>>>> ufshcd_wl_runtime_suspend <-- Supplier suspends fine.
>>>>>>>> [   12.977318][  T174] sd 0:0:0:4: [sde] Synchronizing SCSI cache
>>>>>>>>
>>>>>>>> And the the suspend of sde is stuck now:
>>>>>>>> schedule+0x9c/0xe0
>>>>>>>> schedule_timeout+0x40/0x128
>>>>>>>> io_schedule_timeout+0x44/0x68
>>>>>>>> wait_for_common_io+0x7c/0x100
>>>>>>>> wait_for_completion_io+0x14/0x20
>>>>>>>> blk_execute_rq+0x90/0xcc
>>>>>>>> __scsi_execute+0x104/0x1c4
>>>>>>>> sd_sync_cache+0xf8/0x2a0
>>>>>>>> sd_suspend_common+0x74/0x11c
>>>>>>>> sd_suspend_runtime+0x14/0x20
>>>>>>>> scsi_runtime_suspend+0x64/0x94
>>>>>>>> __rpm_callback+0x80/0x2a4
>>>>>>>> rpm_suspend+0x308/0x614
>>>>>>>> pm_runtime_work+0x98/0xa8
>>>>>>>>
>>>>>>>> I added 'DL_FLAG_RPM_ACTIVE' while creating links.
>>>>>>>>         if (hba->sdev_ufs_device) {
>>>>>>>>                 link = device_link_add(&sdev->sdev_gendev,
>>>>>>>>                                     &hba->sdev_ufs_device->sdev_gendev,
>>>>>>>>                                    DL_FLAG_PM_RUNTIME|DL_FLAG_RPM_ACTIVE);
>>>>>>>> I didn't expect this to resolve the issue anyway and it didn't.
>>>>>>>>
>>>>>>>> Another interesting point here is when I resume any of the above suspended
>>>>>>>> consumers, it all goes back to normal, which is kind of expected. I tried
>>>>>>>> resuming the consumer and the supplier is resumed and the supplier is
>>>>>>>> suspended when all the consumers are suspended.
>>>>>>>>
>>>>>>>> Any pointers on this issue please?
>>>>>>>>
>>>>>>>> @Bart/@Alan - Do you've any pointers please?
>>>>>>>
>>>>>>> It's very noticeable that although you seem to have isolated a bug in
>>>>>>> the power management subsystem (supplier goes into runtime suspend
>>>>>>> even when one of its consumers is still active), you did not CC the
>>>>>>> power management maintainer or mailing list.
>>>>>>>
>>>>>>> I have added the appropriate CC's.
>>>>>>
>>>>>> Thanks Alan!
>>>>
>>>>
>>>
>>> Hello
>>> I & Can (thanks CanG) debugged this further:
>>>
>>> Looks like this issue can occur if the sd probe is asynchronous.
>>>
>>> Essentially, the sd_probe() is done asynchronously and driver_probe_device() invokes pm_runtime_get_suppliers() before invoking sd_probe().
>>>
>>> But scsi_probe_and_add_lun() runs in a separate context.
>>> So the scsi_autopm_put_device() invoked from scsi_scan_host() context reduces the link->rpm_active to 1. And sd_probe() invokes scsi_autopm_put_device() and starts a timer. And then driver_probe_device() invoked from __device_attach_async_helper context reduces the link->rpm_active to 1 thus enabling the supplier to suspend before the consumer suspends.
>>>
>>> So if:
>>> Context T1:
>>> [1] scsi_probe_and_add_lun()
>>> [2]    |- scsi_autopm_put_device() - reduce the link->rpm_active to 1
>>>
>>> Context T2:
>>> __device_attach_async_helper()
>>>      |- driver_probe_device()
>>>          |- sd_probe()
>>> In between [1] and [2] say, driver_probe_device() -> sd_probe() is invoked in a separate context from __device_attach_async_helper().
>>> The driver_probe_device() -> pm_runtime_get_suppliers() but [2] would reduce link->rpm_active to 1.
>>> Then sd_probe() would invoke rpm_resume() and proceed as is.
>>> When sd_probe() invokes scsi_autopm_put_device() it'd start a timer, dev->power.timer_autosuspends = 1.
>>>
>>> Now then, pm_runtime_put_suppliers() is invoked from driver_probe_device() and that makes the link->rpm_active = 1.
>>> But by now, the corresponding 'sd dev' (consumer) usage_count = 0, state = RPM_ACTIVE and link->rpm_active = 1.
>>> At this point of time, all other 'sd dev' (consumers) _may_ be suspended or active but would have the link->rpm_active = 1.
>>
>> Is this with DL_FLAG_RPM_ACTIVE?  In that case, wouldn't active
>> consumers have link->rpm_active = 2 and also have incremented
>> the supplier's usage_count?
>>
>> Another outstanding issue that comes to mind, is to ensure
>> hba->sdev_ufs_device does not runtime suspend before it is probed.
>> I suggest changing ufshcd_slave_configure() so it does not set
>> sdev->rpm_autosuspend for hba->sdev_ufs_device, and instead do
>> pm_runtime_allow / pm_runtime_forbid() in ufshcd_wl_probe() /
>> ufshcd_wl_remove() respectively.
>>
>> However we still want to stop hba->sdev_ufs_device runtime
>> suspending while consumers are being added.  With that in mind,
>> I would expect pm_runtime_get_noresume(&hba->sdev_ufs_device->sdev_gendev)
>> in ufshcd_scsi_add_wlus() to come *before*
>> ufshcd_blk_pm_runtime_init(hba->sdev_ufs_device).  In fact, it would
>> be more logical to make it, pm_runtime_get_sync() since we require
>> hba->sdev_ufs_device to be active at that point.
>>
>>
> 
> Hi Adrian,
> I think the v11 that I pushed can handle this.
> runtime-suspend is forbidden at probe and is re-enabled after probe is done. Please take a look and let me know if I'm missing something.

If the PM APIs are being used correctly, the usage and active
counts should never be wrong.  If they were, then that would
be an issue for the PM framework.

However, it is hard to tell what the issues are until all the
UFS driver changes have been completed, such as the ones I
suggested above above.

v11 could be hiding issues rather than fixing them.

> 
>>>
>>> Since the supplier has 0 auto-suspend delay, it now suspends!
>>>
>>>
>>> Context [T1]
>>> Call trace:
>>> dump_backtrace+0x0/0x1d4
>>> show_stack+0x18/0x24
>>> dump_stack+0xc4/0x144
>>> __pm_runtime_idle+0xb4/0x184
>>> scsi_autopm_put_device+0x18/0x24
>>> scsi_sysfs_add_sdev+0x26c/0x278
>>> scsi_probe_and_add_lun+0xbac/0xd48
>>> __scsi_scan_target+0x38c/0x510
>>> scsi_scan_host_selected+0x14c/0x1e4
>>> scsi_scan_host+0x1e0/0x228
>>> ufshcd_async_scan+0x39c/0x408
>>> async_run_entry_fn+0x48/0x128
>>> process_one_work+0x1f0/0x470
>>> worker_thread+0x26c/0x4c8
>>> kthread+0x13c/0x320
>>> ret_from_fork+0x10/0x18
>>>
>>>
>>> Context [T2]
>>> Call trace:
>>> dump_backtrace+0x0/0x1d4
>>> show_stack+0x18/0x24
>>> dump_stack+0xc4/0x144
>>> rpm_get_suppliers+0x48/0x1ac
>>> __rpm_callback+0x58/0x12c
>>> rpm_resume+0x3a4/0x618
>>> __pm_runtime_resume+0x50/0x80
>>> scsi_autopm_get_device+0x20/0x54
>>> sd_probe+0x40/0x3d0
>>> really_probe+0x1bc/0x4a0
>>> driver_probe_device+0x84/0xf0
>>> __device_attach_driver+0x114/0x138
>>> bus_for_each_drv+0x84/0xd0
>>> __device_attach_async_helper+0x7c/0xf0
>>> async_run_entry_fn+0x48/0x128
>>> process_one_work+0x1f0/0x470
>>> worker_thread+0x26c/0x4c8
>>> kthread+0x13c/0x320
>>> ret_from_fork+0x10/0x18
>>>
>>> Below prints show how link->rpm_active becomes 1 for sd 0:0:0:4
>>> [    7.574654][  T212] Call trace:
>>> [    7.574657][  T212]  dump_backtrace+0x0/0x1d4
>>> [    7.574661][  T212]  show_stack+0x18/0x24
>>> [    7.574665][  T212]  dump_stack+0xc4/0x144
>>> [    7.574668][  T212]  __pm_runtime_idle+0xb4/0x184
>>> [    7.574671][  T212]  scsi_autopm_put_device+0x18/0x24
>>> [    7.574675][  T212]  sd_probe+0x314/0x3d0
>>> [    7.574677][  T212]  really_probe+0x1bc/0x4a0
>>> [    7.574680][  T212]  driver_probe_device+0x84/0xf0
>>> [    7.574683][  T212]  __device_attach_driver+0x114/0x138
>>> [    7.574686][  T212]  bus_for_each_drv+0x84/0xd0
>>> [    7.574689][  T212]  __device_attach_async_helper+0x7c/0xf0
>>> [    7.574692][  T212]  async_run_entry_fn+0x48/0x128
>>> [    7.574695][  T212]  process_one_work+0x1f0/0x470
>>> [    7.574698][  T212]  worker_thread+0x26c/0x4c8
>>> [    7.574700][  T212]  kthread+0x13c/0x320
>>> [    7.574703][  T212]  ret_from_fork+0x10/0x18
>>> [    7.574706][  T212] sd 0:0:0:4: scsi_runtime_idle
>>> [    7.574712][  T212] sd 0:0:0:4: __pm_runtime_idle: aft: [UFSDBG]: pwr.timer_autosuspends: 1 pwr.request_pending: 0 retval: -16 pwr.request: 0 usage_count: 0 rpm_status: 0 link-rpm_active:2
>>> [    7.574715][  T212] sd 0:0:0:4: sd_probe: [UFSDBG]: Exit
>>> [    7.574738][  T212] sd 0:0:0:4: __pm_runtime_idle: b4: [UFSDBG]: pwr.request: 0 usage_count: 0 rpm_status: 0 link-rpm_active:2
>>>
>>> [    7.574752][  T212] Workqueue: events_unbound async_run_entry_fn
>>> [    7.574754][  T212] Call trace:
>>> [    7.574758][  T212]  dump_backtrace+0x0/0x1d4
>>> [    7.574761][  T212]  show_stack+0x18/0x24
>>> [    7.574765][  T212]  dump_stack+0xc4/0x144
>>> [    7.574767][  T212]  __pm_runtime_idle+0xb4/0x184
>>> [    7.574770][  T212]  driver_probe_device+0x94/0xf0
>>> [    7.574773][  T212]  __device_attach_driver+0x114/0x138
>>> [    7.574775][  T212]  bus_for_each_drv+0x84/0xd0
>>> [    7.574778][  T212]  __device_attach_async_helper+0x7c/0xf0
>>> [    7.574781][  T212]  async_run_entry_fn+0x48/0x128
>>> [    7.574783][  T212]  process_one_work+0x1f0/0x470
>>> [    7.574786][  T212]  worker_thread+0x26c/0x4c8
>>> [    7.574788][  T212]  kthread+0x13c/0x320
>>> [    7.574791][  T212]  ret_from_fork+0x10/0x18
>>> [    7.574848][   T80] sd 0:0:0:4: scsi_runtime_idle
>>> [    7.574858][  T212] sd 0:0:0:4: __pm_runtime_idle: aft: [UFSDBG]: pwr.timer_autosuspends: 1 pwr.request_pending: 0 retval: 0 pwr.request: 0 usage_count: 0 rpm_status: 0 link-rpm_active:2
>>> [    7.574863][  T212] sd 0:0:0:4: pm_runtime_put_suppliers: [UFSDBG]: rpm_status: 0 link-rpm_active:1
>>> [    7.574866][  T212] sd 0:0:0:4: async probe completed
>>> [    7.574870][  T212] sd 0:0:0:4: __pm_runtime_idle: b4: [UFSDBG]: pwr.request: 0 usage_count: 0 rpm_status: 0 link-rpm_active:1
>>>
>>>
>>> So, from the above it looks like when async probe is enabled this is a possibility.
>>>
>>> I don't see a way around this. Please let me know if you (@Alan/@Bart/@Adrian) have any thoughts on this.
>>>
>>> Thanks,
>>> -asd
>>>
>>
> 
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v10 1/2] scsi: ufs: Enable power management for wlun
  2021-03-16  7:48                     ` Adrian Hunter
@ 2021-03-16 20:35                       ` Asutosh Das (asd)
  2021-03-17  6:37                         ` Adrian Hunter
  0 siblings, 1 reply; 20+ messages in thread
From: Asutosh Das (asd) @ 2021-03-16 20:35 UTC (permalink / raw)
  To: Adrian Hunter, Alan Stern, Bart Van Assche
  Cc: Rafael J. Wysocki, cang, Martin K. Petersen,
	open list:TARGET SUBSYSTEM, linux-arm-msm, Alim Akhtar,
	Avri Altman, James E.J. Bottomley, Krzysztof Kozlowski,
	Stanley Chu, Andy Gross, Bjorn Andersson, Steven Rostedt,
	Ingo Molnar, Matthias Brugger, Kiwoong Kim, Bean Huo, Lee Jones,
	Wei Yongjun, Dinghao Liu, Gustavo A. R. Silva, Tomas Winkler,
	Jaegeuk Kim, Satya Tangirala, open list,
	moderated list:ARM/SAMSUNG S3C, S5P AND EXYNOS ARM ARCHITECTURES,
	open list:ARM/SAMSUNG S3C, S5P AND EXYNOS ARM ARCHITECTURES,
	moderated list:UNIVERSAL FLASH STORAGE HOST CONTROLLER DRIVER...,
	Linux-PM mailing list

On 3/16/2021 12:48 AM, Adrian Hunter wrote:
> On 16/03/21 12:22 am, Asutosh Das (asd) wrote:
>> On 3/14/2021 1:11 AM, Adrian Hunter wrote:
>>> On 10/03/21 5:04 am, Asutosh Das (asd) wrote:
>>>> On 3/9/2021 7:56 AM, Asutosh Das (asd) wrote:
>>>>> On 3/8/2021 9:17 AM, Rafael J. Wysocki wrote:
>>>>>> On Mon, Mar 8, 2021 at 5:21 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>>>>>>>
>>>>>>> On Sat, Mar 6, 2021 at 5:17 PM Alan Stern <stern@rowland.harvard.edu> wrote:
>>>>>>>>
>>>>>>>> On Fri, Mar 05, 2021 at 06:54:24PM -0800, Asutosh Das (asd) wrote:
>>>>>>>>
>>>>>>>>> Now during my testing I see a weird issue sometimes (1 in 7).
>>>>>>>>> Scenario - bootups
>>>>>>>>>
>>>>>>>>> Issue:
>>>>>>>>> The supplier 'ufs_device_wlun 0:0:0:49488' goes into runtime suspend even
>>>>>>>>> when one/more of its consumers are in RPM_ACTIVE state.
>>>>>>>>>
>>>>>>>>> *Log:
>>>>>>>>> [   10.056379][  T206] sd 0:0:0:1: [sdb] Synchronizing SCSI cache
>>>>>>>>> [   10.062497][  T113] sd 0:0:0:5: [sdf] Synchronizing SCSI cache
>>>>>>>>> [   10.356600][   T32] sd 0:0:0:7: [sdh] Synchronizing SCSI cache
>>>>>>>>> [   10.362944][  T174] sd 0:0:0:3: [sdd] Synchronizing SCSI cache
>>>>>>>>> [   10.696627][   T83] sd 0:0:0:2: [sdc] Synchronizing SCSI cache
>>>>>>>>> [   10.704562][  T170] sd 0:0:0:6: [sdg] Synchronizing SCSI cache
>>>>>>>>> [   10.980602][    T5] sd 0:0:0:0: [sda] Synchronizing SCSI cache
>>>>>>>>>
>>>>>>>>> /** Printing all the consumer nodes of supplier **/
>>>>>>>>> [   10.987327][    T5] ufs_device_wlun 0:0:0:49488: usage-count @ suspend: 0
>>>>>>>>> <-- this is the usage_count
>>>>>>>>> [   10.994440][    T5] ufs_rpmb_wlun 0:0:0:49476: PM state - 2
>>>>>>>>> [   11.000402][    T5] scsi 0:0:0:49456: PM state - 2
>>>>>>>>> [   11.005453][    T5] sd 0:0:0:0: PM state - 2
>>>>>>>>> [   11.009958][    T5] sd 0:0:0:1: PM state - 2
>>>>>>>>> [   11.014469][    T5] sd 0:0:0:2: PM state - 2
>>>>>>>>> [   11.019072][    T5] sd 0:0:0:3: PM state - 2
>>>>>>>>> [   11.023595][    T5] sd 0:0:0:4: PM state - 0 << RPM_ACTIVE
>>>>>>>>> [   11.353298][    T5] sd 0:0:0:5: PM state - 2
>>>>>>>>> [   11.357726][    T5] sd 0:0:0:6: PM state - 2
>>>>>>>>> [   11.362155][    T5] sd 0:0:0:7: PM state - 2
>>>>>>>>> [   11.366584][    T5] ufshcd-qcom 1d84000.ufshc: __ufshcd_wl_suspend - 8709
>>>>>>>>> [   11.374366][    T5] ufs_device_wlun 0:0:0:49488: __ufshcd_wl_suspend -
>>>>>>>>> (0) has rpm_active flags
>>>>>>>
>>>>>>> Do you mean that rpm_active of the link between the consumer and the
>>>>>>> supplier is greater than 0 at this point and the consumer is
>>>>>>
>>>>>> I mean is rpm_active of the link greater than 1 (because 1 means "no
>>>>>> active references to the supplier")?
>>>>> Hi Rafael:
>>>>> No - it is not greater than 1.
>>>>>
>>>>> I'm trying to understand what's going on in it; will update when I've something.
>>>>>
>>>>>>
>>>>>>> RPM_ACTIVE, but the supplier suspends successfully nevertheless?
>>>>>>>
>>>>>>>>> [   11.383376][    T5] ufs_device_wlun 0:0:0:49488:
>>>>>>>>> ufshcd_wl_runtime_suspend <-- Supplier suspends fine.
>>>>>>>>> [   12.977318][  T174] sd 0:0:0:4: [sde] Synchronizing SCSI cache
>>>>>>>>>
>>>>>>>>> And the the suspend of sde is stuck now:
>>>>>>>>> schedule+0x9c/0xe0
>>>>>>>>> schedule_timeout+0x40/0x128
>>>>>>>>> io_schedule_timeout+0x44/0x68
>>>>>>>>> wait_for_common_io+0x7c/0x100
>>>>>>>>> wait_for_completion_io+0x14/0x20
>>>>>>>>> blk_execute_rq+0x90/0xcc
>>>>>>>>> __scsi_execute+0x104/0x1c4
>>>>>>>>> sd_sync_cache+0xf8/0x2a0
>>>>>>>>> sd_suspend_common+0x74/0x11c
>>>>>>>>> sd_suspend_runtime+0x14/0x20
>>>>>>>>> scsi_runtime_suspend+0x64/0x94
>>>>>>>>> __rpm_callback+0x80/0x2a4
>>>>>>>>> rpm_suspend+0x308/0x614
>>>>>>>>> pm_runtime_work+0x98/0xa8
>>>>>>>>>
>>>>>>>>> I added 'DL_FLAG_RPM_ACTIVE' while creating links.
>>>>>>>>>          if (hba->sdev_ufs_device) {
>>>>>>>>>                  link = device_link_add(&sdev->sdev_gendev,
>>>>>>>>>                                      &hba->sdev_ufs_device->sdev_gendev,
>>>>>>>>>                                     DL_FLAG_PM_RUNTIME|DL_FLAG_RPM_ACTIVE);
>>>>>>>>> I didn't expect this to resolve the issue anyway and it didn't.
>>>>>>>>>
>>>>>>>>> Another interesting point here is when I resume any of the above suspended
>>>>>>>>> consumers, it all goes back to normal, which is kind of expected. I tried
>>>>>>>>> resuming the consumer and the supplier is resumed and the supplier is
>>>>>>>>> suspended when all the consumers are suspended.
>>>>>>>>>
>>>>>>>>> Any pointers on this issue please?
>>>>>>>>>
>>>>>>>>> @Bart/@Alan - Do you've any pointers please?
>>>>>>>>
>>>>>>>> It's very noticeable that although you seem to have isolated a bug in
>>>>>>>> the power management subsystem (supplier goes into runtime suspend
>>>>>>>> even when one of its consumers is still active), you did not CC the
>>>>>>>> power management maintainer or mailing list.
>>>>>>>>
>>>>>>>> I have added the appropriate CC's.
>>>>>>>
>>>>>>> Thanks Alan!
>>>>>
>>>>>
>>>>
>>>> Hello
>>>> I & Can (thanks CanG) debugged this further:
>>>>
>>>> Looks like this issue can occur if the sd probe is asynchronous.
>>>>
>>>> Essentially, the sd_probe() is done asynchronously and driver_probe_device() invokes pm_runtime_get_suppliers() before invoking sd_probe().
>>>>
>>>> But scsi_probe_and_add_lun() runs in a separate context.
>>>> So the scsi_autopm_put_device() invoked from scsi_scan_host() context reduces the link->rpm_active to 1. And sd_probe() invokes scsi_autopm_put_device() and starts a timer. And then driver_probe_device() invoked from __device_attach_async_helper context reduces the link->rpm_active to 1 thus enabling the supplier to suspend before the consumer suspends.
>>>>
>>>> So if:
>>>> Context T1:
>>>> [1] scsi_probe_and_add_lun()
>>>> [2]    |- scsi_autopm_put_device() - reduce the link->rpm_active to 1
>>>>
>>>> Context T2:
>>>> __device_attach_async_helper()
>>>>       |- driver_probe_device()
>>>>           |- sd_probe()
>>>> In between [1] and [2] say, driver_probe_device() -> sd_probe() is invoked in a separate context from __device_attach_async_helper().
>>>> The driver_probe_device() -> pm_runtime_get_suppliers() but [2] would reduce link->rpm_active to 1.
>>>> Then sd_probe() would invoke rpm_resume() and proceed as is.
>>>> When sd_probe() invokes scsi_autopm_put_device() it'd start a timer, dev->power.timer_autosuspends = 1.
>>>>
>>>> Now then, pm_runtime_put_suppliers() is invoked from driver_probe_device() and that makes the link->rpm_active = 1.
>>>> But by now, the corresponding 'sd dev' (consumer) usage_count = 0, state = RPM_ACTIVE and link->rpm_active = 1.
>>>> At this point of time, all other 'sd dev' (consumers) _may_ be suspended or active but would have the link->rpm_active = 1.
>>>
>>> Is this with DL_FLAG_RPM_ACTIVE?  In that case, wouldn't active
>>> consumers have link->rpm_active = 2 and also have incremented
>>> the supplier's usage_count?

Yes this is with DL_FLAG_RPM_ACTIVE.

Please let me share a log here:
BEF means - Before, AFT means After.

[    6.843445][    T7] scsi 0:0:0:4: [UFSDBG]: ufshcd_setup_links:4779: 
  supp: usage_cnt: 3 Link - 0:0:0:49488 link-rpm_active: 2 avail_luns: 5
[    6.892545][    T7] scsi 0:0:0:4: pm_runtime_get_suppliers: 
(0:0:0:49488): supp: usage_count: 5 rpm_active: 4

In the above log, T7 is the context in which this scsi device is being 
added - scsi_sysfs_add_sdev()

[    6.931846][    T7] ufs_rpmb_wlun 0:0:0:4: [UFSDBG]: 
ufshcd_rpmb_probe:9692: invoked
[    6.941246][    T7] scsi 0:0:0:4: pm_runtime_put_suppliers: rpm_active: 4

[    6.941246][    T7] scsi 0:0:0:4: pm_runtime_put_suppliers: 
(0:0:0:49488) [BEF] usage_count: 5
[    6.941247][    T7] scsi 0:0:0:4: pm_runtime_put_suppliers: 
(0:0:0:49488) [AFT] usage_count: 4 rpm_active: 3

[    6.941267][    T7] scsi 0:0:0:4: rpm_put_suppliers: [BEF] Supp 
(0:0:0:49488) usage_count: 4 rpm_active: 3

------ T196 Context comes in while T7 is running ----------
[    6.941466][  T196] scsi 0:0:0:4: pm_runtime_get_suppliers: 
(0:0:0:49488): supp: usage_count: 5 rpm_active: 4
--------------------------------------------------------------

[    7.788397][    T7] scsi 0:0:0:4: rpm_put_suppliers: [AFT] Supp 
(0:0:0:49488) usage_count: 2 rpm_active: 1

-- 

T196 is the context in which sd_probe() is invoked for this scsi device.

[    7.974410][  T196] sd 0:0:0:4: [sde] Attached SCSI disk
[    7.984188][  T196] sd 0:0:0:4: pm_runtime_put_suppliers: rpm_active: 2
[    7.998424][  T196] sd 0:0:0:4: pm_runtime_put_suppliers: 
(0:0:0:49488) [BEF] usage_count: 4
[    8.017320][  T196] sd 0:0:0:4: pm_runtime_put_suppliers: 
(0:0:0:49488) [AFT] usage_count: 1 rpm_active: 1

The reference to the link is released after sd_probe() is completed.
At this point, the rpm_active should be 2. And the rpm_active should 
become 1 when sd 0:0:0:4 actually suspends. But at the end of sd_probe() 
the suspend is only scheduled. However the supplier is now free to suspend.

In this log, the usage_count of supplier becomes 0 here:
[   11.963885][  T117] sd 0:0:0:7: rpm_put_suppliers: [BEF] Supp 
(0:0:0:49488) usage_count: 1 rpm_active: 2
[   11.973821][  T117] sd 0:0:0:7: rpm_put_suppliers: [AFT] Supp 
(0:0:0:49488) usage_count: 0 rpm_active: 1

However, the consumer sd 0:0:0:4 is still active but has released the 
reference to the supplier:
[   12.002792][  T117] scsi 0:0:0:49456: rpm_status - 2
[   12.002806][  T117] sd 0:0:0:0: rpm_status - 2
[   12.002834][  T117] sd 0:0:0:1: rpm_status - 2
[   12.017730][  T117] sd 0:0:0:2: rpm_status - 2
[   12.041317][  T117] sd 0:0:0:3: rpm_status - 2
[   12.045953][  T117] sd 0:0:0:4: rpm_status - 0

And sd 0:0:0:4 tries to suspend here:
[   15.465914][  T117] sd 0:0:0:4: [sde] Synchronizing SCSI cache

>>>
>>> Another outstanding issue that comes to mind, is to ensure
>>> hba->sdev_ufs_device does not runtime suspend before it is probed.
>>> I suggest changing ufshcd_slave_configure() so it does not set
>>> sdev->rpm_autosuspend for hba->sdev_ufs_device, and instead do
>>> pm_runtime_allow / pm_runtime_forbid() in ufshcd_wl_probe() /
>>> ufshcd_wl_remove() respectively.
>>>
If pm_runtime_allow() is invoked from ufshcd_wl_probe() it'd invoke 
runtime_suspend on hba->sdev_ufs_device before exiting 
scsi_sysfs_add_sdev(). So I think pm_runtime_allow() should be invoked 
in ufshcd_scsi_add_wlus().

>>> However we still want to stop hba->sdev_ufs_device runtime
>>> suspending while consumers are being added.  With that in mind,
>>> I would expect pm_runtime_get_noresume(&hba->sdev_ufs_device->sdev_gendev)
>>> in ufshcd_scsi_add_wlus() to come *before*
>>> ufshcd_blk_pm_runtime_init(hba->sdev_ufs_device).  In fact, it would
>>> be more logical to make it, pm_runtime_get_sync() since we require
>>> hba->sdev_ufs_device to be active at that point.
>>>
Correct, scsi_autopm_get_device(hba->sdev_ufs_device) should be invoked 
before ufshcd_blk_pm_runtime_init(hba->sdev_ufs_device).
Now a corresponding scsi_autopm_put_device(hba->sdev_ufs_device) is 
invoked after the consumers are added in ufshcd_setup_links().
Even then I think this issue would still pop up.

>>>
>>
>> Hi Adrian,
>> I think the v11 that I pushed can handle this.
>> runtime-suspend is forbidden at probe and is re-enabled after probe is done. Please take a look and let me know if I'm missing something.
> 
> If the PM APIs are being used correctly, the usage and active
> counts should never be wrong.  If they were, then that would
> be an issue for the PM framework.
>It is more likely that I'm using it incorrectly :).

> However, it is hard to tell what the issues are until all the
> UFS driver changes have been completed, such as the ones I
> suggested above above.
> 
Ok, understood.

> v11 could be hiding issues rather than fixing them.
> 
Based on the logs, in v11, I tried to forbid any runtime-suspend until 
sd_probe() is done.
I could be misunderstanding the whole thing though.

Having said that, I will make the changes as per your suggestions and 
push a v12. I will test with v12 as well and see if this issue is seen.

>>
>>>>
>>>> Since the supplier has 0 auto-suspend delay, it now suspends!
>>>>
>>>>
>>>> Context [T1]
>>>> Call trace:
>>>> dump_backtrace+0x0/0x1d4
>>>> show_stack+0x18/0x24
>>>> dump_stack+0xc4/0x144
>>>> __pm_runtime_idle+0xb4/0x184
>>>> scsi_autopm_put_device+0x18/0x24
>>>> scsi_sysfs_add_sdev+0x26c/0x278
>>>> scsi_probe_and_add_lun+0xbac/0xd48
>>>> __scsi_scan_target+0x38c/0x510
>>>> scsi_scan_host_selected+0x14c/0x1e4
>>>> scsi_scan_host+0x1e0/0x228
>>>> ufshcd_async_scan+0x39c/0x408
>>>> async_run_entry_fn+0x48/0x128
>>>> process_one_work+0x1f0/0x470
>>>> worker_thread+0x26c/0x4c8
>>>> kthread+0x13c/0x320
>>>> ret_from_fork+0x10/0x18
>>>>
>>>>
>>>> Context [T2]
>>>> Call trace:
>>>> dump_backtrace+0x0/0x1d4
>>>> show_stack+0x18/0x24
>>>> dump_stack+0xc4/0x144
>>>> rpm_get_suppliers+0x48/0x1ac
>>>> __rpm_callback+0x58/0x12c
>>>> rpm_resume+0x3a4/0x618
>>>> __pm_runtime_resume+0x50/0x80
>>>> scsi_autopm_get_device+0x20/0x54
>>>> sd_probe+0x40/0x3d0
>>>> really_probe+0x1bc/0x4a0
>>>> driver_probe_device+0x84/0xf0
>>>> __device_attach_driver+0x114/0x138
>>>> bus_for_each_drv+0x84/0xd0
>>>> __device_attach_async_helper+0x7c/0xf0
>>>> async_run_entry_fn+0x48/0x128
>>>> process_one_work+0x1f0/0x470
>>>> worker_thread+0x26c/0x4c8
>>>> kthread+0x13c/0x320
>>>> ret_from_fork+0x10/0x18
>>>>
>>>> Below prints show how link->rpm_active becomes 1 for sd 0:0:0:4
>>>> [    7.574654][  T212] Call trace:
>>>> [    7.574657][  T212]  dump_backtrace+0x0/0x1d4
>>>> [    7.574661][  T212]  show_stack+0x18/0x24
>>>> [    7.574665][  T212]  dump_stack+0xc4/0x144
>>>> [    7.574668][  T212]  __pm_runtime_idle+0xb4/0x184
>>>> [    7.574671][  T212]  scsi_autopm_put_device+0x18/0x24
>>>> [    7.574675][  T212]  sd_probe+0x314/0x3d0
>>>> [    7.574677][  T212]  really_probe+0x1bc/0x4a0
>>>> [    7.574680][  T212]  driver_probe_device+0x84/0xf0
>>>> [    7.574683][  T212]  __device_attach_driver+0x114/0x138
>>>> [    7.574686][  T212]  bus_for_each_drv+0x84/0xd0
>>>> [    7.574689][  T212]  __device_attach_async_helper+0x7c/0xf0
>>>> [    7.574692][  T212]  async_run_entry_fn+0x48/0x128
>>>> [    7.574695][  T212]  process_one_work+0x1f0/0x470
>>>> [    7.574698][  T212]  worker_thread+0x26c/0x4c8
>>>> [    7.574700][  T212]  kthread+0x13c/0x320
>>>> [    7.574703][  T212]  ret_from_fork+0x10/0x18
>>>> [    7.574706][  T212] sd 0:0:0:4: scsi_runtime_idle
>>>> [    7.574712][  T212] sd 0:0:0:4: __pm_runtime_idle: aft: [UFSDBG]: pwr.timer_autosuspends: 1 pwr.request_pending: 0 retval: -16 pwr.request: 0 usage_count: 0 rpm_status: 0 link-rpm_active:2
>>>> [    7.574715][  T212] sd 0:0:0:4: sd_probe: [UFSDBG]: Exit
>>>> [    7.574738][  T212] sd 0:0:0:4: __pm_runtime_idle: b4: [UFSDBG]: pwr.request: 0 usage_count: 0 rpm_status: 0 link-rpm_active:2
>>>>
>>>> [    7.574752][  T212] Workqueue: events_unbound async_run_entry_fn
>>>> [    7.574754][  T212] Call trace:
>>>> [    7.574758][  T212]  dump_backtrace+0x0/0x1d4
>>>> [    7.574761][  T212]  show_stack+0x18/0x24
>>>> [    7.574765][  T212]  dump_stack+0xc4/0x144
>>>> [    7.574767][  T212]  __pm_runtime_idle+0xb4/0x184
>>>> [    7.574770][  T212]  driver_probe_device+0x94/0xf0
>>>> [    7.574773][  T212]  __device_attach_driver+0x114/0x138
>>>> [    7.574775][  T212]  bus_for_each_drv+0x84/0xd0
>>>> [    7.574778][  T212]  __device_attach_async_helper+0x7c/0xf0
>>>> [    7.574781][  T212]  async_run_entry_fn+0x48/0x128
>>>> [    7.574783][  T212]  process_one_work+0x1f0/0x470
>>>> [    7.574786][  T212]  worker_thread+0x26c/0x4c8
>>>> [    7.574788][  T212]  kthread+0x13c/0x320
>>>> [    7.574791][  T212]  ret_from_fork+0x10/0x18
>>>> [    7.574848][   T80] sd 0:0:0:4: scsi_runtime_idle
>>>> [    7.574858][  T212] sd 0:0:0:4: __pm_runtime_idle: aft: [UFSDBG]: pwr.timer_autosuspends: 1 pwr.request_pending: 0 retval: 0 pwr.request: 0 usage_count: 0 rpm_status: 0 link-rpm_active:2
>>>> [    7.574863][  T212] sd 0:0:0:4: pm_runtime_put_suppliers: [UFSDBG]: rpm_status: 0 link-rpm_active:1
>>>> [    7.574866][  T212] sd 0:0:0:4: async probe completed
>>>> [    7.574870][  T212] sd 0:0:0:4: __pm_runtime_idle: b4: [UFSDBG]: pwr.request: 0 usage_count: 0 rpm_status: 0 link-rpm_active:1
>>>>
>>>>
>>>> So, from the above it looks like when async probe is enabled this is a possibility.
>>>>
>>>> I don't see a way around this. Please let me know if you (@Alan/@Bart/@Adrian) have any thoughts on this.
>>>>
>>>> Thanks,
>>>> -asd
>>>>
>>>
>>
>>
> 


-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
Linux Foundation Collaborative Project

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v10 1/2] scsi: ufs: Enable power management for wlun
  2021-03-16 20:35                       ` Asutosh Das (asd)
@ 2021-03-17  6:37                         ` Adrian Hunter
  2021-03-18 14:00                           ` Rafael J. Wysocki
  0 siblings, 1 reply; 20+ messages in thread
From: Adrian Hunter @ 2021-03-17  6:37 UTC (permalink / raw)
  To: Asutosh Das (asd), Alan Stern, Bart Van Assche
  Cc: Rafael J. Wysocki, cang, Martin K. Petersen,
	open list:TARGET SUBSYSTEM, linux-arm-msm, Alim Akhtar,
	Avri Altman, James E.J. Bottomley, Krzysztof Kozlowski,
	Stanley Chu, Andy Gross, Bjorn Andersson, Steven Rostedt,
	Ingo Molnar, Matthias Brugger, Kiwoong Kim, Bean Huo, Lee Jones,
	Wei Yongjun, Dinghao Liu, Gustavo A. R. Silva, Tomas Winkler,
	Jaegeuk Kim, Satya Tangirala, open list,
	moderated list:ARM/SAMSUNG S3C, S5P AND EXYNOS ARM ARCHITECTURES,
	open list:ARM/SAMSUNG S3C, S5P AND EXYNOS ARM ARCHITECTURES,
	moderated list:UNIVERSAL FLASH STORAGE HOST CONTROLLER DRIVER...,
	Linux-PM mailing list

On 16/03/21 10:35 pm, Asutosh Das (asd) wrote:
> On 3/16/2021 12:48 AM, Adrian Hunter wrote:
>> On 16/03/21 12:22 am, Asutosh Das (asd) wrote:
>>> On 3/14/2021 1:11 AM, Adrian Hunter wrote:
>>>> On 10/03/21 5:04 am, Asutosh Das (asd) wrote:
>>>>> On 3/9/2021 7:56 AM, Asutosh Das (asd) wrote:
>>>>>> On 3/8/2021 9:17 AM, Rafael J. Wysocki wrote:
>>>>>>> On Mon, Mar 8, 2021 at 5:21 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>>>>>>>>
>>>>>>>> On Sat, Mar 6, 2021 at 5:17 PM Alan Stern <stern@rowland.harvard.edu> wrote:
>>>>>>>>>
>>>>>>>>> On Fri, Mar 05, 2021 at 06:54:24PM -0800, Asutosh Das (asd) wrote:
>>>>>>>>>
>>>>>>>>>> Now during my testing I see a weird issue sometimes (1 in 7).
>>>>>>>>>> Scenario - bootups
>>>>>>>>>>
>>>>>>>>>> Issue:
>>>>>>>>>> The supplier 'ufs_device_wlun 0:0:0:49488' goes into runtime suspend even
>>>>>>>>>> when one/more of its consumers are in RPM_ACTIVE state.
>>>>>>>>>>
>>>>>>>>>> *Log:
>>>>>>>>>> [   10.056379][  T206] sd 0:0:0:1: [sdb] Synchronizing SCSI cache
>>>>>>>>>> [   10.062497][  T113] sd 0:0:0:5: [sdf] Synchronizing SCSI cache
>>>>>>>>>> [   10.356600][   T32] sd 0:0:0:7: [sdh] Synchronizing SCSI cache
>>>>>>>>>> [   10.362944][  T174] sd 0:0:0:3: [sdd] Synchronizing SCSI cache
>>>>>>>>>> [   10.696627][   T83] sd 0:0:0:2: [sdc] Synchronizing SCSI cache
>>>>>>>>>> [   10.704562][  T170] sd 0:0:0:6: [sdg] Synchronizing SCSI cache
>>>>>>>>>> [   10.980602][    T5] sd 0:0:0:0: [sda] Synchronizing SCSI cache
>>>>>>>>>>
>>>>>>>>>> /** Printing all the consumer nodes of supplier **/
>>>>>>>>>> [   10.987327][    T5] ufs_device_wlun 0:0:0:49488: usage-count @ suspend: 0
>>>>>>>>>> <-- this is the usage_count
>>>>>>>>>> [   10.994440][    T5] ufs_rpmb_wlun 0:0:0:49476: PM state - 2
>>>>>>>>>> [   11.000402][    T5] scsi 0:0:0:49456: PM state - 2
>>>>>>>>>> [   11.005453][    T5] sd 0:0:0:0: PM state - 2
>>>>>>>>>> [   11.009958][    T5] sd 0:0:0:1: PM state - 2
>>>>>>>>>> [   11.014469][    T5] sd 0:0:0:2: PM state - 2
>>>>>>>>>> [   11.019072][    T5] sd 0:0:0:3: PM state - 2
>>>>>>>>>> [   11.023595][    T5] sd 0:0:0:4: PM state - 0 << RPM_ACTIVE
>>>>>>>>>> [   11.353298][    T5] sd 0:0:0:5: PM state - 2
>>>>>>>>>> [   11.357726][    T5] sd 0:0:0:6: PM state - 2
>>>>>>>>>> [   11.362155][    T5] sd 0:0:0:7: PM state - 2
>>>>>>>>>> [   11.366584][    T5] ufshcd-qcom 1d84000.ufshc: __ufshcd_wl_suspend - 8709
>>>>>>>>>> [   11.374366][    T5] ufs_device_wlun 0:0:0:49488: __ufshcd_wl_suspend -
>>>>>>>>>> (0) has rpm_active flags
>>>>>>>>
>>>>>>>> Do you mean that rpm_active of the link between the consumer and the
>>>>>>>> supplier is greater than 0 at this point and the consumer is
>>>>>>>
>>>>>>> I mean is rpm_active of the link greater than 1 (because 1 means "no
>>>>>>> active references to the supplier")?
>>>>>> Hi Rafael:
>>>>>> No - it is not greater than 1.
>>>>>>
>>>>>> I'm trying to understand what's going on in it; will update when I've something.
>>>>>>
>>>>>>>
>>>>>>>> RPM_ACTIVE, but the supplier suspends successfully nevertheless?
>>>>>>>>
>>>>>>>>>> [   11.383376][    T5] ufs_device_wlun 0:0:0:49488:
>>>>>>>>>> ufshcd_wl_runtime_suspend <-- Supplier suspends fine.
>>>>>>>>>> [   12.977318][  T174] sd 0:0:0:4: [sde] Synchronizing SCSI cache
>>>>>>>>>>
>>>>>>>>>> And the the suspend of sde is stuck now:
>>>>>>>>>> schedule+0x9c/0xe0
>>>>>>>>>> schedule_timeout+0x40/0x128
>>>>>>>>>> io_schedule_timeout+0x44/0x68
>>>>>>>>>> wait_for_common_io+0x7c/0x100
>>>>>>>>>> wait_for_completion_io+0x14/0x20
>>>>>>>>>> blk_execute_rq+0x90/0xcc
>>>>>>>>>> __scsi_execute+0x104/0x1c4
>>>>>>>>>> sd_sync_cache+0xf8/0x2a0
>>>>>>>>>> sd_suspend_common+0x74/0x11c
>>>>>>>>>> sd_suspend_runtime+0x14/0x20
>>>>>>>>>> scsi_runtime_suspend+0x64/0x94
>>>>>>>>>> __rpm_callback+0x80/0x2a4
>>>>>>>>>> rpm_suspend+0x308/0x614
>>>>>>>>>> pm_runtime_work+0x98/0xa8
>>>>>>>>>>
>>>>>>>>>> I added 'DL_FLAG_RPM_ACTIVE' while creating links.
>>>>>>>>>>          if (hba->sdev_ufs_device) {
>>>>>>>>>>                  link = device_link_add(&sdev->sdev_gendev,
>>>>>>>>>>                                      &hba->sdev_ufs_device->sdev_gendev,
>>>>>>>>>>                                     DL_FLAG_PM_RUNTIME|DL_FLAG_RPM_ACTIVE);
>>>>>>>>>> I didn't expect this to resolve the issue anyway and it didn't.
>>>>>>>>>>
>>>>>>>>>> Another interesting point here is when I resume any of the above suspended
>>>>>>>>>> consumers, it all goes back to normal, which is kind of expected. I tried
>>>>>>>>>> resuming the consumer and the supplier is resumed and the supplier is
>>>>>>>>>> suspended when all the consumers are suspended.
>>>>>>>>>>
>>>>>>>>>> Any pointers on this issue please?
>>>>>>>>>>
>>>>>>>>>> @Bart/@Alan - Do you've any pointers please?
>>>>>>>>>
>>>>>>>>> It's very noticeable that although you seem to have isolated a bug in
>>>>>>>>> the power management subsystem (supplier goes into runtime suspend
>>>>>>>>> even when one of its consumers is still active), you did not CC the
>>>>>>>>> power management maintainer or mailing list.
>>>>>>>>>
>>>>>>>>> I have added the appropriate CC's.
>>>>>>>>
>>>>>>>> Thanks Alan!
>>>>>>
>>>>>>
>>>>>
>>>>> Hello
>>>>> I & Can (thanks CanG) debugged this further:
>>>>>
>>>>> Looks like this issue can occur if the sd probe is asynchronous.
>>>>>
>>>>> Essentially, the sd_probe() is done asynchronously and driver_probe_device() invokes pm_runtime_get_suppliers() before invoking sd_probe().
>>>>>
>>>>> But scsi_probe_and_add_lun() runs in a separate context.
>>>>> So the scsi_autopm_put_device() invoked from scsi_scan_host() context reduces the link->rpm_active to 1. And sd_probe() invokes scsi_autopm_put_device() and starts a timer. And then driver_probe_device() invoked from __device_attach_async_helper context reduces the link->rpm_active to 1 thus enabling the supplier to suspend before the consumer suspends.
>>>>>
>>>>> So if:
>>>>> Context T1:
>>>>> [1] scsi_probe_and_add_lun()
>>>>> [2]    |- scsi_autopm_put_device() - reduce the link->rpm_active to 1
>>>>>
>>>>> Context T2:
>>>>> __device_attach_async_helper()
>>>>>       |- driver_probe_device()
>>>>>           |- sd_probe()
>>>>> In between [1] and [2] say, driver_probe_device() -> sd_probe() is invoked in a separate context from __device_attach_async_helper().
>>>>> The driver_probe_device() -> pm_runtime_get_suppliers() but [2] would reduce link->rpm_active to 1.
>>>>> Then sd_probe() would invoke rpm_resume() and proceed as is.
>>>>> When sd_probe() invokes scsi_autopm_put_device() it'd start a timer, dev->power.timer_autosuspends = 1.
>>>>>
>>>>> Now then, pm_runtime_put_suppliers() is invoked from driver_probe_device() and that makes the link->rpm_active = 1.
>>>>> But by now, the corresponding 'sd dev' (consumer) usage_count = 0, state = RPM_ACTIVE and link->rpm_active = 1.
>>>>> At this point of time, all other 'sd dev' (consumers) _may_ be suspended or active but would have the link->rpm_active = 1.
>>>>
>>>> Is this with DL_FLAG_RPM_ACTIVE?  In that case, wouldn't active
>>>> consumers have link->rpm_active = 2 and also have incremented
>>>> the supplier's usage_count?
> 
> Yes this is with DL_FLAG_RPM_ACTIVE.
> 
> Please let me share a log here:
> BEF means - Before, AFT means After.
> 
> [    6.843445][    T7] scsi 0:0:0:4: [UFSDBG]: ufshcd_setup_links:4779:  supp: usage_cnt: 3 Link - 0:0:0:49488 link-rpm_active: 2 avail_luns: 5
> [    6.892545][    T7] scsi 0:0:0:4: pm_runtime_get_suppliers: (0:0:0:49488): supp: usage_count: 5 rpm_active: 4
> 
> In the above log, T7 is the context in which this scsi device is being added - scsi_sysfs_add_sdev()
> 
> [    6.931846][    T7] ufs_rpmb_wlun 0:0:0:4: [UFSDBG]: ufshcd_rpmb_probe:9692: invoked
> [    6.941246][    T7] scsi 0:0:0:4: pm_runtime_put_suppliers: rpm_active: 4
> 
> [    6.941246][    T7] scsi 0:0:0:4: pm_runtime_put_suppliers: (0:0:0:49488) [BEF] usage_count: 5
> [    6.941247][    T7] scsi 0:0:0:4: pm_runtime_put_suppliers: (0:0:0:49488) [AFT] usage_count: 4 rpm_active: 3
> 
> [    6.941267][    T7] scsi 0:0:0:4: rpm_put_suppliers: [BEF] Supp (0:0:0:49488) usage_count: 4 rpm_active: 3
> 
> ------ T196 Context comes in while T7 is running ----------
> [    6.941466][  T196] scsi 0:0:0:4: pm_runtime_get_suppliers: (0:0:0:49488): supp: usage_count: 5 rpm_active: 4
> --------------------------------------------------------------
> 
> [    7.788397][    T7] scsi 0:0:0:4: rpm_put_suppliers: [AFT] Supp (0:0:0:49488) usage_count: 2 rpm_active: 1



> 
> -- 
> 
> T196 is the context in which sd_probe() is invoked for this scsi device.
> 
> [    7.974410][  T196] sd 0:0:0:4: [sde] Attached SCSI disk
> [    7.984188][  T196] sd 0:0:0:4: pm_runtime_put_suppliers: rpm_active: 2
> [    7.998424][  T196] sd 0:0:0:4: pm_runtime_put_suppliers: (0:0:0:49488) [BEF] usage_count: 4
> [    8.017320][  T196] sd 0:0:0:4: pm_runtime_put_suppliers: (0:0:0:49488) [AFT] usage_count: 1 rpm_active: 1
> 
> The reference to the link is released after sd_probe() is completed.
> At this point, the rpm_active should be 2. And the rpm_active should become 1 when sd 0:0:0:4 actually suspends. But at the end of sd_probe() the suspend is only scheduled. However the supplier is now free to suspend.
> 
> In this log, the usage_count of supplier becomes 0 here:
> [   11.963885][  T117] sd 0:0:0:7: rpm_put_suppliers: [BEF] Supp (0:0:0:49488) usage_count: 1 rpm_active: 2
> [   11.973821][  T117] sd 0:0:0:7: rpm_put_suppliers: [AFT] Supp (0:0:0:49488) usage_count: 0 rpm_active: 1
> 
> However, the consumer sd 0:0:0:4 is still active but has released the reference to the supplier:

If that is the case, then it is an error in PM not UFS.

A second look at the code around rpm_put_suppliers() does look
potentially racy, since there does not appear to be anything
stopping the runtime_status changing between
spin_unlock_irq(&dev->power.lock) and device_links_read_lock().

Rafael, can you comment?

	/*
	 * If the device is suspending and the callback has returned success,
	 * drop the usage counters of the suppliers that have been reference
	 * counted on its resume.
	 *
	 * Do that if the resume fails too.
	 */
	put = dev->power.runtime_status == RPM_SUSPENDING && !retval;
	if (put)
		__update_runtime_status(dev, RPM_SUSPENDED);
	else
		put = get && retval;

	if (put) {
		spin_unlock_irq(&dev->power.lock);

		idx = device_links_read_lock();

fail:
		rpm_put_suppliers(dev);

		device_links_read_unlock(idx);

		spin_lock_irq(&dev->power.lock);
	}




> [   12.002792][  T117] scsi 0:0:0:49456: rpm_status - 2
> [   12.002806][  T117] sd 0:0:0:0: rpm_status - 2
> [   12.002834][  T117] sd 0:0:0:1: rpm_status - 2
> [   12.017730][  T117] sd 0:0:0:2: rpm_status - 2
> [   12.041317][  T117] sd 0:0:0:3: rpm_status - 2
> [   12.045953][  T117] sd 0:0:0:4: rpm_status - 0
> 
> And sd 0:0:0:4 tries to suspend here:
> [   15.465914][  T117] sd 0:0:0:4: [sde] Synchronizing SCSI cache
> 
>>>>
>>>> Another outstanding issue that comes to mind, is to ensure
>>>> hba->sdev_ufs_device does not runtime suspend before it is probed.
>>>> I suggest changing ufshcd_slave_configure() so it does not set
>>>> sdev->rpm_autosuspend for hba->sdev_ufs_device, and instead do
>>>> pm_runtime_allow / pm_runtime_forbid() in ufshcd_wl_probe() /
>>>> ufshcd_wl_remove() respectively.
>>>>
> If pm_runtime_allow() is invoked from ufshcd_wl_probe() it'd invoke runtime_suspend on hba->sdev_ufs_device before exiting scsi_sysfs_add_sdev(). So I think pm_runtime_allow() should be invoked in ufshcd_scsi_add_wlus().
> 
>>>> However we still want to stop hba->sdev_ufs_device runtime
>>>> suspending while consumers are being added.  With that in mind,
>>>> I would expect pm_runtime_get_noresume(&hba->sdev_ufs_device->sdev_gendev)
>>>> in ufshcd_scsi_add_wlus() to come *before*
>>>> ufshcd_blk_pm_runtime_init(hba->sdev_ufs_device).  In fact, it would
>>>> be more logical to make it, pm_runtime_get_sync() since we require
>>>> hba->sdev_ufs_device to be active at that point.
>>>>
> Correct, scsi_autopm_get_device(hba->sdev_ufs_device) should be invoked before ufshcd_blk_pm_runtime_init(hba->sdev_ufs_device).
> Now a corresponding scsi_autopm_put_device(hba->sdev_ufs_device) is invoked after the consumers are added in ufshcd_setup_links().
> Even then I think this issue would still pop up.
> 
>>>>
>>>
>>> Hi Adrian,
>>> I think the v11 that I pushed can handle this.
>>> runtime-suspend is forbidden at probe and is re-enabled after probe is done. Please take a look and let me know if I'm missing something.
>>
>> If the PM APIs are being used correctly, the usage and active
>> counts should never be wrong.  If they were, then that would
>> be an issue for the PM framework.
>> It is more likely that I'm using it incorrectly :).
> 
>> However, it is hard to tell what the issues are until all the
>> UFS driver changes have been completed, such as the ones I
>> suggested above above.
>>
> Ok, understood.
> 
>> v11 could be hiding issues rather than fixing them.
>>
> Based on the logs, in v11, I tried to forbid any runtime-suspend until sd_probe() is done.
> I could be misunderstanding the whole thing though.
> 
> Having said that, I will make the changes as per your suggestions and push a v12. I will test with v12 as well and see if this issue is seen.
> 
>>>
>>>>>
>>>>> Since the supplier has 0 auto-suspend delay, it now suspends!
>>>>>
>>>>>
>>>>> Context [T1]
>>>>> Call trace:
>>>>> dump_backtrace+0x0/0x1d4
>>>>> show_stack+0x18/0x24
>>>>> dump_stack+0xc4/0x144
>>>>> __pm_runtime_idle+0xb4/0x184
>>>>> scsi_autopm_put_device+0x18/0x24
>>>>> scsi_sysfs_add_sdev+0x26c/0x278
>>>>> scsi_probe_and_add_lun+0xbac/0xd48
>>>>> __scsi_scan_target+0x38c/0x510
>>>>> scsi_scan_host_selected+0x14c/0x1e4
>>>>> scsi_scan_host+0x1e0/0x228
>>>>> ufshcd_async_scan+0x39c/0x408
>>>>> async_run_entry_fn+0x48/0x128
>>>>> process_one_work+0x1f0/0x470
>>>>> worker_thread+0x26c/0x4c8
>>>>> kthread+0x13c/0x320
>>>>> ret_from_fork+0x10/0x18
>>>>>
>>>>>
>>>>> Context [T2]
>>>>> Call trace:
>>>>> dump_backtrace+0x0/0x1d4
>>>>> show_stack+0x18/0x24
>>>>> dump_stack+0xc4/0x144
>>>>> rpm_get_suppliers+0x48/0x1ac
>>>>> __rpm_callback+0x58/0x12c
>>>>> rpm_resume+0x3a4/0x618
>>>>> __pm_runtime_resume+0x50/0x80
>>>>> scsi_autopm_get_device+0x20/0x54
>>>>> sd_probe+0x40/0x3d0
>>>>> really_probe+0x1bc/0x4a0
>>>>> driver_probe_device+0x84/0xf0
>>>>> __device_attach_driver+0x114/0x138
>>>>> bus_for_each_drv+0x84/0xd0
>>>>> __device_attach_async_helper+0x7c/0xf0
>>>>> async_run_entry_fn+0x48/0x128
>>>>> process_one_work+0x1f0/0x470
>>>>> worker_thread+0x26c/0x4c8
>>>>> kthread+0x13c/0x320
>>>>> ret_from_fork+0x10/0x18
>>>>>
>>>>> Below prints show how link->rpm_active becomes 1 for sd 0:0:0:4
>>>>> [    7.574654][  T212] Call trace:
>>>>> [    7.574657][  T212]  dump_backtrace+0x0/0x1d4
>>>>> [    7.574661][  T212]  show_stack+0x18/0x24
>>>>> [    7.574665][  T212]  dump_stack+0xc4/0x144
>>>>> [    7.574668][  T212]  __pm_runtime_idle+0xb4/0x184
>>>>> [    7.574671][  T212]  scsi_autopm_put_device+0x18/0x24
>>>>> [    7.574675][  T212]  sd_probe+0x314/0x3d0
>>>>> [    7.574677][  T212]  really_probe+0x1bc/0x4a0
>>>>> [    7.574680][  T212]  driver_probe_device+0x84/0xf0
>>>>> [    7.574683][  T212]  __device_attach_driver+0x114/0x138
>>>>> [    7.574686][  T212]  bus_for_each_drv+0x84/0xd0
>>>>> [    7.574689][  T212]  __device_attach_async_helper+0x7c/0xf0
>>>>> [    7.574692][  T212]  async_run_entry_fn+0x48/0x128
>>>>> [    7.574695][  T212]  process_one_work+0x1f0/0x470
>>>>> [    7.574698][  T212]  worker_thread+0x26c/0x4c8
>>>>> [    7.574700][  T212]  kthread+0x13c/0x320
>>>>> [    7.574703][  T212]  ret_from_fork+0x10/0x18
>>>>> [    7.574706][  T212] sd 0:0:0:4: scsi_runtime_idle
>>>>> [    7.574712][  T212] sd 0:0:0:4: __pm_runtime_idle: aft: [UFSDBG]: pwr.timer_autosuspends: 1 pwr.request_pending: 0 retval: -16 pwr.request: 0 usage_count: 0 rpm_status: 0 link-rpm_active:2
>>>>> [    7.574715][  T212] sd 0:0:0:4: sd_probe: [UFSDBG]: Exit
>>>>> [    7.574738][  T212] sd 0:0:0:4: __pm_runtime_idle: b4: [UFSDBG]: pwr.request: 0 usage_count: 0 rpm_status: 0 link-rpm_active:2
>>>>>
>>>>> [    7.574752][  T212] Workqueue: events_unbound async_run_entry_fn
>>>>> [    7.574754][  T212] Call trace:
>>>>> [    7.574758][  T212]  dump_backtrace+0x0/0x1d4
>>>>> [    7.574761][  T212]  show_stack+0x18/0x24
>>>>> [    7.574765][  T212]  dump_stack+0xc4/0x144
>>>>> [    7.574767][  T212]  __pm_runtime_idle+0xb4/0x184
>>>>> [    7.574770][  T212]  driver_probe_device+0x94/0xf0
>>>>> [    7.574773][  T212]  __device_attach_driver+0x114/0x138
>>>>> [    7.574775][  T212]  bus_for_each_drv+0x84/0xd0
>>>>> [    7.574778][  T212]  __device_attach_async_helper+0x7c/0xf0
>>>>> [    7.574781][  T212]  async_run_entry_fn+0x48/0x128
>>>>> [    7.574783][  T212]  process_one_work+0x1f0/0x470
>>>>> [    7.574786][  T212]  worker_thread+0x26c/0x4c8
>>>>> [    7.574788][  T212]  kthread+0x13c/0x320
>>>>> [    7.574791][  T212]  ret_from_fork+0x10/0x18
>>>>> [    7.574848][   T80] sd 0:0:0:4: scsi_runtime_idle
>>>>> [    7.574858][  T212] sd 0:0:0:4: __pm_runtime_idle: aft: [UFSDBG]: pwr.timer_autosuspends: 1 pwr.request_pending: 0 retval: 0 pwr.request: 0 usage_count: 0 rpm_status: 0 link-rpm_active:2
>>>>> [    7.574863][  T212] sd 0:0:0:4: pm_runtime_put_suppliers: [UFSDBG]: rpm_status: 0 link-rpm_active:1
>>>>> [    7.574866][  T212] sd 0:0:0:4: async probe completed
>>>>> [    7.574870][  T212] sd 0:0:0:4: __pm_runtime_idle: b4: [UFSDBG]: pwr.request: 0 usage_count: 0 rpm_status: 0 link-rpm_active:1
>>>>>
>>>>>
>>>>> So, from the above it looks like when async probe is enabled this is a possibility.
>>>>>
>>>>> I don't see a way around this. Please let me know if you (@Alan/@Bart/@Adrian) have any thoughts on this.
>>>>>
>>>>> Thanks,
>>>>> -asd
>>>>>
>>>>
>>>
>>>
>>
> 
> 
> -- 
> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
> Linux Foundation Collaborative Project


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v10 1/2] scsi: ufs: Enable power management for wlun
  2021-03-17  6:37                         ` Adrian Hunter
@ 2021-03-18 14:00                           ` Rafael J. Wysocki
  2021-03-18 17:27                             ` Asutosh Das (asd)
  0 siblings, 1 reply; 20+ messages in thread
From: Rafael J. Wysocki @ 2021-03-18 14:00 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Asutosh Das (asd), Alan Stern, Bart Van Assche, Rafael J. Wysocki,
	cang, Martin K. Petersen, open list:TARGET SUBSYSTEM,
	linux-arm-msm, Alim Akhtar, Avri Altman, James E.J. Bottomley,
	Krzysztof Kozlowski, Stanley Chu, Andy Gross, Bjorn Andersson,
	Steven Rostedt, Ingo Molnar, Matthias Brugger, Kiwoong Kim,
	Bean Huo, Lee Jones, Wei Yongjun, Dinghao Liu,
	Gustavo A. R. Silva, Tomas Winkler, Jaegeuk Kim, Satya Tangirala,
	open list,
	moderated list:ARM/SAMSUNG S3C, S5P AND EXYNOS ARM ARCHITECTURES,
	open list:ARM/SAMSUNG S3C, S5P AND EXYNOS ARM ARCHITECTURES,
	moderated list:UNIVERSAL FLASH STORAGE HOST CONTROLLER DRIVER...,
	Linux-PM mailing list

On Wed, Mar 17, 2021 at 7:37 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
>
> On 16/03/21 10:35 pm, Asutosh Das (asd) wrote:
> > On 3/16/2021 12:48 AM, Adrian Hunter wrote:
> >> On 16/03/21 12:22 am, Asutosh Das (asd) wrote:
> >>> On 3/14/2021 1:11 AM, Adrian Hunter wrote:
> >>>> On 10/03/21 5:04 am, Asutosh Das (asd) wrote:
> >>>>> On 3/9/2021 7:56 AM, Asutosh Das (asd) wrote:
> >>>>>> On 3/8/2021 9:17 AM, Rafael J. Wysocki wrote:
> >>>>>>> On Mon, Mar 8, 2021 at 5:21 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
> >>>>>>>>
> >>>>>>>> On Sat, Mar 6, 2021 at 5:17 PM Alan Stern <stern@rowland.harvard.edu> wrote:
> >>>>>>>>>
> >>>>>>>>> On Fri, Mar 05, 2021 at 06:54:24PM -0800, Asutosh Das (asd) wrote:
> >>>>>>>>>
> >>>>>>>>>> Now during my testing I see a weird issue sometimes (1 in 7).
> >>>>>>>>>> Scenario - bootups
> >>>>>>>>>>
> >>>>>>>>>> Issue:
> >>>>>>>>>> The supplier 'ufs_device_wlun 0:0:0:49488' goes into runtime suspend even
> >>>>>>>>>> when one/more of its consumers are in RPM_ACTIVE state.
> >>>>>>>>>>
> >>>>>>>>>> *Log:
> >>>>>>>>>> [   10.056379][  T206] sd 0:0:0:1: [sdb] Synchronizing SCSI cache
> >>>>>>>>>> [   10.062497][  T113] sd 0:0:0:5: [sdf] Synchronizing SCSI cache
> >>>>>>>>>> [   10.356600][   T32] sd 0:0:0:7: [sdh] Synchronizing SCSI cache
> >>>>>>>>>> [   10.362944][  T174] sd 0:0:0:3: [sdd] Synchronizing SCSI cache
> >>>>>>>>>> [   10.696627][   T83] sd 0:0:0:2: [sdc] Synchronizing SCSI cache
> >>>>>>>>>> [   10.704562][  T170] sd 0:0:0:6: [sdg] Synchronizing SCSI cache
> >>>>>>>>>> [   10.980602][    T5] sd 0:0:0:0: [sda] Synchronizing SCSI cache
> >>>>>>>>>>
> >>>>>>>>>> /** Printing all the consumer nodes of supplier **/
> >>>>>>>>>> [   10.987327][    T5] ufs_device_wlun 0:0:0:49488: usage-count @ suspend: 0
> >>>>>>>>>> <-- this is the usage_count
> >>>>>>>>>> [   10.994440][    T5] ufs_rpmb_wlun 0:0:0:49476: PM state - 2
> >>>>>>>>>> [   11.000402][    T5] scsi 0:0:0:49456: PM state - 2
> >>>>>>>>>> [   11.005453][    T5] sd 0:0:0:0: PM state - 2
> >>>>>>>>>> [   11.009958][    T5] sd 0:0:0:1: PM state - 2
> >>>>>>>>>> [   11.014469][    T5] sd 0:0:0:2: PM state - 2
> >>>>>>>>>> [   11.019072][    T5] sd 0:0:0:3: PM state - 2
> >>>>>>>>>> [   11.023595][    T5] sd 0:0:0:4: PM state - 0 << RPM_ACTIVE
> >>>>>>>>>> [   11.353298][    T5] sd 0:0:0:5: PM state - 2
> >>>>>>>>>> [   11.357726][    T5] sd 0:0:0:6: PM state - 2
> >>>>>>>>>> [   11.362155][    T5] sd 0:0:0:7: PM state - 2
> >>>>>>>>>> [   11.366584][    T5] ufshcd-qcom 1d84000.ufshc: __ufshcd_wl_suspend - 8709
> >>>>>>>>>> [   11.374366][    T5] ufs_device_wlun 0:0:0:49488: __ufshcd_wl_suspend -
> >>>>>>>>>> (0) has rpm_active flags
> >>>>>>>>
> >>>>>>>> Do you mean that rpm_active of the link between the consumer and the
> >>>>>>>> supplier is greater than 0 at this point and the consumer is
> >>>>>>>
> >>>>>>> I mean is rpm_active of the link greater than 1 (because 1 means "no
> >>>>>>> active references to the supplier")?
> >>>>>> Hi Rafael:
> >>>>>> No - it is not greater than 1.
> >>>>>>
> >>>>>> I'm trying to understand what's going on in it; will update when I've something.
> >>>>>>
> >>>>>>>
> >>>>>>>> RPM_ACTIVE, but the supplier suspends successfully nevertheless?
> >>>>>>>>
> >>>>>>>>>> [   11.383376][    T5] ufs_device_wlun 0:0:0:49488:
> >>>>>>>>>> ufshcd_wl_runtime_suspend <-- Supplier suspends fine.
> >>>>>>>>>> [   12.977318][  T174] sd 0:0:0:4: [sde] Synchronizing SCSI cache
> >>>>>>>>>>
> >>>>>>>>>> And the the suspend of sde is stuck now:
> >>>>>>>>>> schedule+0x9c/0xe0
> >>>>>>>>>> schedule_timeout+0x40/0x128
> >>>>>>>>>> io_schedule_timeout+0x44/0x68
> >>>>>>>>>> wait_for_common_io+0x7c/0x100
> >>>>>>>>>> wait_for_completion_io+0x14/0x20
> >>>>>>>>>> blk_execute_rq+0x90/0xcc
> >>>>>>>>>> __scsi_execute+0x104/0x1c4
> >>>>>>>>>> sd_sync_cache+0xf8/0x2a0
> >>>>>>>>>> sd_suspend_common+0x74/0x11c
> >>>>>>>>>> sd_suspend_runtime+0x14/0x20
> >>>>>>>>>> scsi_runtime_suspend+0x64/0x94
> >>>>>>>>>> __rpm_callback+0x80/0x2a4
> >>>>>>>>>> rpm_suspend+0x308/0x614
> >>>>>>>>>> pm_runtime_work+0x98/0xa8
> >>>>>>>>>>
> >>>>>>>>>> I added 'DL_FLAG_RPM_ACTIVE' while creating links.
> >>>>>>>>>>          if (hba->sdev_ufs_device) {
> >>>>>>>>>>                  link = device_link_add(&sdev->sdev_gendev,
> >>>>>>>>>>                                      &hba->sdev_ufs_device->sdev_gendev,
> >>>>>>>>>>                                     DL_FLAG_PM_RUNTIME|DL_FLAG_RPM_ACTIVE);
> >>>>>>>>>> I didn't expect this to resolve the issue anyway and it didn't.
> >>>>>>>>>>
> >>>>>>>>>> Another interesting point here is when I resume any of the above suspended
> >>>>>>>>>> consumers, it all goes back to normal, which is kind of expected. I tried
> >>>>>>>>>> resuming the consumer and the supplier is resumed and the supplier is
> >>>>>>>>>> suspended when all the consumers are suspended.
> >>>>>>>>>>
> >>>>>>>>>> Any pointers on this issue please?
> >>>>>>>>>>
> >>>>>>>>>> @Bart/@Alan - Do you've any pointers please?
> >>>>>>>>>
> >>>>>>>>> It's very noticeable that although you seem to have isolated a bug in
> >>>>>>>>> the power management subsystem (supplier goes into runtime suspend
> >>>>>>>>> even when one of its consumers is still active), you did not CC the
> >>>>>>>>> power management maintainer or mailing list.
> >>>>>>>>>
> >>>>>>>>> I have added the appropriate CC's.
> >>>>>>>>
> >>>>>>>> Thanks Alan!
> >>>>>>
> >>>>>>
> >>>>>
> >>>>> Hello
> >>>>> I & Can (thanks CanG) debugged this further:
> >>>>>
> >>>>> Looks like this issue can occur if the sd probe is asynchronous.
> >>>>>
> >>>>> Essentially, the sd_probe() is done asynchronously and driver_probe_device() invokes pm_runtime_get_suppliers() before invoking sd_probe().
> >>>>>
> >>>>> But scsi_probe_and_add_lun() runs in a separate context.
> >>>>> So the scsi_autopm_put_device() invoked from scsi_scan_host() context reduces the link->rpm_active to 1. And sd_probe() invokes scsi_autopm_put_device() and starts a timer. And then driver_probe_device() invoked from __device_attach_async_helper context reduces the link->rpm_active to 1 thus enabling the supplier to suspend before the consumer suspends.
> >>>>>
> >>>>> So if:
> >>>>> Context T1:
> >>>>> [1] scsi_probe_and_add_lun()
> >>>>> [2]    |- scsi_autopm_put_device() - reduce the link->rpm_active to 1
> >>>>>
> >>>>> Context T2:
> >>>>> __device_attach_async_helper()
> >>>>>       |- driver_probe_device()
> >>>>>           |- sd_probe()
> >>>>> In between [1] and [2] say, driver_probe_device() -> sd_probe() is invoked in a separate context from __device_attach_async_helper().
> >>>>> The driver_probe_device() -> pm_runtime_get_suppliers() but [2] would reduce link->rpm_active to 1.
> >>>>> Then sd_probe() would invoke rpm_resume() and proceed as is.
> >>>>> When sd_probe() invokes scsi_autopm_put_device() it'd start a timer, dev->power.timer_autosuspends = 1.
> >>>>>
> >>>>> Now then, pm_runtime_put_suppliers() is invoked from driver_probe_device() and that makes the link->rpm_active = 1.
> >>>>> But by now, the corresponding 'sd dev' (consumer) usage_count = 0, state = RPM_ACTIVE and link->rpm_active = 1.
> >>>>> At this point of time, all other 'sd dev' (consumers) _may_ be suspended or active but would have the link->rpm_active = 1.
> >>>>
> >>>> Is this with DL_FLAG_RPM_ACTIVE?  In that case, wouldn't active
> >>>> consumers have link->rpm_active = 2 and also have incremented
> >>>> the supplier's usage_count?
> >
> > Yes this is with DL_FLAG_RPM_ACTIVE.
> >
> > Please let me share a log here:
> > BEF means - Before, AFT means After.
> >
> > [    6.843445][    T7] scsi 0:0:0:4: [UFSDBG]: ufshcd_setup_links:4779:  supp: usage_cnt: 3 Link - 0:0:0:49488 link-rpm_active: 2 avail_luns: 5
> > [    6.892545][    T7] scsi 0:0:0:4: pm_runtime_get_suppliers: (0:0:0:49488): supp: usage_count: 5 rpm_active: 4
> >
> > In the above log, T7 is the context in which this scsi device is being added - scsi_sysfs_add_sdev()
> >
> > [    6.931846][    T7] ufs_rpmb_wlun 0:0:0:4: [UFSDBG]: ufshcd_rpmb_probe:9692: invoked
> > [    6.941246][    T7] scsi 0:0:0:4: pm_runtime_put_suppliers: rpm_active: 4
> >
> > [    6.941246][    T7] scsi 0:0:0:4: pm_runtime_put_suppliers: (0:0:0:49488) [BEF] usage_count: 5
> > [    6.941247][    T7] scsi 0:0:0:4: pm_runtime_put_suppliers: (0:0:0:49488) [AFT] usage_count: 4 rpm_active: 3
> >
> > [    6.941267][    T7] scsi 0:0:0:4: rpm_put_suppliers: [BEF] Supp (0:0:0:49488) usage_count: 4 rpm_active: 3
> >
> > ------ T196 Context comes in while T7 is running ----------
> > [    6.941466][  T196] scsi 0:0:0:4: pm_runtime_get_suppliers: (0:0:0:49488): supp: usage_count: 5 rpm_active: 4
> > --------------------------------------------------------------
> >
> > [    7.788397][    T7] scsi 0:0:0:4: rpm_put_suppliers: [AFT] Supp (0:0:0:49488) usage_count: 2 rpm_active: 1
>
>
>
> >
> > --
> >
> > T196 is the context in which sd_probe() is invoked for this scsi device.
> >
> > [    7.974410][  T196] sd 0:0:0:4: [sde] Attached SCSI disk
> > [    7.984188][  T196] sd 0:0:0:4: pm_runtime_put_suppliers: rpm_active: 2
> > [    7.998424][  T196] sd 0:0:0:4: pm_runtime_put_suppliers: (0:0:0:49488) [BEF] usage_count: 4
> > [    8.017320][  T196] sd 0:0:0:4: pm_runtime_put_suppliers: (0:0:0:49488) [AFT] usage_count: 1 rpm_active: 1
> >
> > The reference to the link is released after sd_probe() is completed.
> > At this point, the rpm_active should be 2. And the rpm_active should become 1 when sd 0:0:0:4 actually suspends. But at the end of sd_probe() the suspend is only scheduled. However the supplier is now free to suspend.
> >
> > In this log, the usage_count of supplier becomes 0 here:
> > [   11.963885][  T117] sd 0:0:0:7: rpm_put_suppliers: [BEF] Supp (0:0:0:49488) usage_count: 1 rpm_active: 2
> > [   11.973821][  T117] sd 0:0:0:7: rpm_put_suppliers: [AFT] Supp (0:0:0:49488) usage_count: 0 rpm_active: 1
> >
> > However, the consumer sd 0:0:0:4 is still active but has released the reference to the supplier:
>
> If that is the case, then it is an error in PM not UFS.
>
> A second look at the code around rpm_put_suppliers() does look
> potentially racy, since there does not appear to be anything
> stopping the runtime_status changing between
> spin_unlock_irq(&dev->power.lock) and device_links_read_lock().
>
> Rafael, can you comment?

Indeed, if the device is suspending, changing its PM-runtime status to
RPM_SUSPENDED and dropping its power.lock allows a concurrent
rpm_resume() to run and resume the device before the suppliers can be
suspended.

That's incorrect and has been introduced by commit 44cc89f76464 ("PM:
runtime: Update device status before letting suppliers suspend").

It is probably better to revert that commit and address the original
issue in a different way.

>         /*
>          * If the device is suspending and the callback has returned success,
>          * drop the usage counters of the suppliers that have been reference
>          * counted on its resume.
>          *
>          * Do that if the resume fails too.
>          */
>         put = dev->power.runtime_status == RPM_SUSPENDING && !retval;
>         if (put)
>                 __update_runtime_status(dev, RPM_SUSPENDED);
>         else
>                 put = get && retval;
>
>         if (put) {
>                 spin_unlock_irq(&dev->power.lock);
>
>                 idx = device_links_read_lock();
>
> fail:
>                 rpm_put_suppliers(dev);
>
>                 device_links_read_unlock(idx);
>
>                 spin_lock_irq(&dev->power.lock);
>         }
>
>
>
>
> > [   12.002792][  T117] scsi 0:0:0:49456: rpm_status - 2
> > [   12.002806][  T117] sd 0:0:0:0: rpm_status - 2
> > [   12.002834][  T117] sd 0:0:0:1: rpm_status - 2
> > [   12.017730][  T117] sd 0:0:0:2: rpm_status - 2
> > [   12.041317][  T117] sd 0:0:0:3: rpm_status - 2
> > [   12.045953][  T117] sd 0:0:0:4: rpm_status - 0
> >
> > And sd 0:0:0:4 tries to suspend here:
> > [   15.465914][  T117] sd 0:0:0:4: [sde] Synchronizing SCSI cache
> >
> >>>>
> >>>> Another outstanding issue that comes to mind, is to ensure
> >>>> hba->sdev_ufs_device does not runtime suspend before it is probed.
> >>>> I suggest changing ufshcd_slave_configure() so it does not set
> >>>> sdev->rpm_autosuspend for hba->sdev_ufs_device, and instead do
> >>>> pm_runtime_allow / pm_runtime_forbid() in ufshcd_wl_probe() /
> >>>> ufshcd_wl_remove() respectively.
> >>>>
> > If pm_runtime_allow() is invoked from ufshcd_wl_probe() it'd invoke runtime_suspend on hba->sdev_ufs_device before exiting scsi_sysfs_add_sdev(). So I think pm_runtime_allow() should be invoked in ufshcd_scsi_add_wlus().
> >
> >>>> However we still want to stop hba->sdev_ufs_device runtime
> >>>> suspending while consumers are being added.  With that in mind,
> >>>> I would expect pm_runtime_get_noresume(&hba->sdev_ufs_device->sdev_gendev)
> >>>> in ufshcd_scsi_add_wlus() to come *before*
> >>>> ufshcd_blk_pm_runtime_init(hba->sdev_ufs_device).  In fact, it would
> >>>> be more logical to make it, pm_runtime_get_sync() since we require
> >>>> hba->sdev_ufs_device to be active at that point.
> >>>>
> > Correct, scsi_autopm_get_device(hba->sdev_ufs_device) should be invoked before ufshcd_blk_pm_runtime_init(hba->sdev_ufs_device).
> > Now a corresponding scsi_autopm_put_device(hba->sdev_ufs_device) is invoked after the consumers are added in ufshcd_setup_links().
> > Even then I think this issue would still pop up.
> >
> >>>>
> >>>
> >>> Hi Adrian,
> >>> I think the v11 that I pushed can handle this.
> >>> runtime-suspend is forbidden at probe and is re-enabled after probe is done. Please take a look and let me know if I'm missing something.
> >>
> >> If the PM APIs are being used correctly, the usage and active
> >> counts should never be wrong.  If they were, then that would
> >> be an issue for the PM framework.
> >> It is more likely that I'm using it incorrectly :).
> >
> >> However, it is hard to tell what the issues are until all the
> >> UFS driver changes have been completed, such as the ones I
> >> suggested above above.
> >>
> > Ok, understood.
> >
> >> v11 could be hiding issues rather than fixing them.
> >>
> > Based on the logs, in v11, I tried to forbid any runtime-suspend until sd_probe() is done.
> > I could be misunderstanding the whole thing though.
> >
> > Having said that, I will make the changes as per your suggestions and push a v12. I will test with v12 as well and see if this issue is seen.
> >
> >>>
> >>>>>
> >>>>> Since the supplier has 0 auto-suspend delay, it now suspends!
> >>>>>
> >>>>>
> >>>>> Context [T1]
> >>>>> Call trace:
> >>>>> dump_backtrace+0x0/0x1d4
> >>>>> show_stack+0x18/0x24
> >>>>> dump_stack+0xc4/0x144
> >>>>> __pm_runtime_idle+0xb4/0x184
> >>>>> scsi_autopm_put_device+0x18/0x24
> >>>>> scsi_sysfs_add_sdev+0x26c/0x278
> >>>>> scsi_probe_and_add_lun+0xbac/0xd48
> >>>>> __scsi_scan_target+0x38c/0x510
> >>>>> scsi_scan_host_selected+0x14c/0x1e4
> >>>>> scsi_scan_host+0x1e0/0x228
> >>>>> ufshcd_async_scan+0x39c/0x408
> >>>>> async_run_entry_fn+0x48/0x128
> >>>>> process_one_work+0x1f0/0x470
> >>>>> worker_thread+0x26c/0x4c8
> >>>>> kthread+0x13c/0x320
> >>>>> ret_from_fork+0x10/0x18
> >>>>>
> >>>>>
> >>>>> Context [T2]
> >>>>> Call trace:
> >>>>> dump_backtrace+0x0/0x1d4
> >>>>> show_stack+0x18/0x24
> >>>>> dump_stack+0xc4/0x144
> >>>>> rpm_get_suppliers+0x48/0x1ac
> >>>>> __rpm_callback+0x58/0x12c
> >>>>> rpm_resume+0x3a4/0x618
> >>>>> __pm_runtime_resume+0x50/0x80
> >>>>> scsi_autopm_get_device+0x20/0x54
> >>>>> sd_probe+0x40/0x3d0
> >>>>> really_probe+0x1bc/0x4a0
> >>>>> driver_probe_device+0x84/0xf0
> >>>>> __device_attach_driver+0x114/0x138
> >>>>> bus_for_each_drv+0x84/0xd0
> >>>>> __device_attach_async_helper+0x7c/0xf0
> >>>>> async_run_entry_fn+0x48/0x128
> >>>>> process_one_work+0x1f0/0x470
> >>>>> worker_thread+0x26c/0x4c8
> >>>>> kthread+0x13c/0x320
> >>>>> ret_from_fork+0x10/0x18
> >>>>>
> >>>>> Below prints show how link->rpm_active becomes 1 for sd 0:0:0:4
> >>>>> [    7.574654][  T212] Call trace:
> >>>>> [    7.574657][  T212]  dump_backtrace+0x0/0x1d4
> >>>>> [    7.574661][  T212]  show_stack+0x18/0x24
> >>>>> [    7.574665][  T212]  dump_stack+0xc4/0x144
> >>>>> [    7.574668][  T212]  __pm_runtime_idle+0xb4/0x184
> >>>>> [    7.574671][  T212]  scsi_autopm_put_device+0x18/0x24
> >>>>> [    7.574675][  T212]  sd_probe+0x314/0x3d0
> >>>>> [    7.574677][  T212]  really_probe+0x1bc/0x4a0
> >>>>> [    7.574680][  T212]  driver_probe_device+0x84/0xf0
> >>>>> [    7.574683][  T212]  __device_attach_driver+0x114/0x138
> >>>>> [    7.574686][  T212]  bus_for_each_drv+0x84/0xd0
> >>>>> [    7.574689][  T212]  __device_attach_async_helper+0x7c/0xf0
> >>>>> [    7.574692][  T212]  async_run_entry_fn+0x48/0x128
> >>>>> [    7.574695][  T212]  process_one_work+0x1f0/0x470
> >>>>> [    7.574698][  T212]  worker_thread+0x26c/0x4c8
> >>>>> [    7.574700][  T212]  kthread+0x13c/0x320
> >>>>> [    7.574703][  T212]  ret_from_fork+0x10/0x18
> >>>>> [    7.574706][  T212] sd 0:0:0:4: scsi_runtime_idle
> >>>>> [    7.574712][  T212] sd 0:0:0:4: __pm_runtime_idle: aft: [UFSDBG]: pwr.timer_autosuspends: 1 pwr.request_pending: 0 retval: -16 pwr.request: 0 usage_count: 0 rpm_status: 0 link-rpm_active:2
> >>>>> [    7.574715][  T212] sd 0:0:0:4: sd_probe: [UFSDBG]: Exit
> >>>>> [    7.574738][  T212] sd 0:0:0:4: __pm_runtime_idle: b4: [UFSDBG]: pwr.request: 0 usage_count: 0 rpm_status: 0 link-rpm_active:2
> >>>>>
> >>>>> [    7.574752][  T212] Workqueue: events_unbound async_run_entry_fn
> >>>>> [    7.574754][  T212] Call trace:
> >>>>> [    7.574758][  T212]  dump_backtrace+0x0/0x1d4
> >>>>> [    7.574761][  T212]  show_stack+0x18/0x24
> >>>>> [    7.574765][  T212]  dump_stack+0xc4/0x144
> >>>>> [    7.574767][  T212]  __pm_runtime_idle+0xb4/0x184
> >>>>> [    7.574770][  T212]  driver_probe_device+0x94/0xf0
> >>>>> [    7.574773][  T212]  __device_attach_driver+0x114/0x138
> >>>>> [    7.574775][  T212]  bus_for_each_drv+0x84/0xd0
> >>>>> [    7.574778][  T212]  __device_attach_async_helper+0x7c/0xf0
> >>>>> [    7.574781][  T212]  async_run_entry_fn+0x48/0x128
> >>>>> [    7.574783][  T212]  process_one_work+0x1f0/0x470
> >>>>> [    7.574786][  T212]  worker_thread+0x26c/0x4c8
> >>>>> [    7.574788][  T212]  kthread+0x13c/0x320
> >>>>> [    7.574791][  T212]  ret_from_fork+0x10/0x18
> >>>>> [    7.574848][   T80] sd 0:0:0:4: scsi_runtime_idle
> >>>>> [    7.574858][  T212] sd 0:0:0:4: __pm_runtime_idle: aft: [UFSDBG]: pwr.timer_autosuspends: 1 pwr.request_pending: 0 retval: 0 pwr.request: 0 usage_count: 0 rpm_status: 0 link-rpm_active:2
> >>>>> [    7.574863][  T212] sd 0:0:0:4: pm_runtime_put_suppliers: [UFSDBG]: rpm_status: 0 link-rpm_active:1
> >>>>> [    7.574866][  T212] sd 0:0:0:4: async probe completed
> >>>>> [    7.574870][  T212] sd 0:0:0:4: __pm_runtime_idle: b4: [UFSDBG]: pwr.request: 0 usage_count: 0 rpm_status: 0 link-rpm_active:1
> >>>>>
> >>>>>
> >>>>> So, from the above it looks like when async probe is enabled this is a possibility.
> >>>>>
> >>>>> I don't see a way around this. Please let me know if you (@Alan/@Bart/@Adrian) have any thoughts on this.
> >>>>>
> >>>>> Thanks,
> >>>>> -asd
> >>>>>
> >>>>
> >>>
> >>>
> >>
> >
> >
> > --
> > The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
> > Linux Foundation Collaborative Project
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v10 1/2] scsi: ufs: Enable power management for wlun
  2021-03-18 14:00                           ` Rafael J. Wysocki
@ 2021-03-18 17:27                             ` Asutosh Das (asd)
  2021-03-18 17:54                               ` Rafael J. Wysocki
  0 siblings, 1 reply; 20+ messages in thread
From: Asutosh Das (asd) @ 2021-03-18 17:27 UTC (permalink / raw)
  To: Rafael J. Wysocki, Adrian Hunter
  Cc: Alan Stern, Bart Van Assche, cang, Martin K. Petersen,
	open list:TARGET SUBSYSTEM, linux-arm-msm, Alim Akhtar,
	Avri Altman, James E.J. Bottomley, Krzysztof Kozlowski,
	Stanley Chu, Andy Gross, Bjorn Andersson, Steven Rostedt,
	Ingo Molnar, Matthias Brugger, Kiwoong Kim, Bean Huo, Lee Jones,
	Wei Yongjun, Dinghao Liu, Gustavo A. R. Silva, Tomas Winkler,
	Jaegeuk Kim, Satya Tangirala, open list,
	moderated list:ARM/SAMSUNG S3C, S5P AND EXYNOS ARM ARCHITECTURES,
	open list:ARM/SAMSUNG S3C, S5P AND EXYNOS ARM ARCHITECTURES,
	moderated list:UNIVERSAL FLASH STORAGE HOST CONTROLLER DRIVER...,
	Linux-PM mailing list

On 3/18/2021 7:00 AM, Rafael J. Wysocki wrote:
> On Wed, Mar 17, 2021 at 7:37 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
>>
>> On 16/03/21 10:35 pm, Asutosh Das (asd) wrote:
>>> On 3/16/2021 12:48 AM, Adrian Hunter wrote:
>>>> On 16/03/21 12:22 am, Asutosh Das (asd) wrote:
>>>>> On 3/14/2021 1:11 AM, Adrian Hunter wrote:
>>>>>> On 10/03/21 5:04 am, Asutosh Das (asd) wrote:
>>>>>>> On 3/9/2021 7:56 AM, Asutosh Das (asd) wrote:
>>>>>>>> On 3/8/2021 9:17 AM, Rafael J. Wysocki wrote:
>>>>>>>>> On Mon, Mar 8, 2021 at 5:21 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>>>>>>>>>>
>>>>>>>>>> On Sat, Mar 6, 2021 at 5:17 PM Alan Stern <stern@rowland.harvard.edu> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Mar 05, 2021 at 06:54:24PM -0800, Asutosh Das (asd) wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Now during my testing I see a weird issue sometimes (1 in 7).
>>>>>>>>>>>> Scenario - bootups
>>>>>>>>>>>>
>>>>>>>>>>>> Issue:
>>>>>>>>>>>> The supplier 'ufs_device_wlun 0:0:0:49488' goes into runtime suspend even
>>>>>>>>>>>> when one/more of its consumers are in RPM_ACTIVE state.
>>>>>>>>>>>>
>>>>>>>>>>>> *Log:
>>>>>>>>>>>> [   10.056379][  T206] sd 0:0:0:1: [sdb] Synchronizing SCSI cache
>>>>>>>>>>>> [   10.062497][  T113] sd 0:0:0:5: [sdf] Synchronizing SCSI cache
>>>>>>>>>>>> [   10.356600][   T32] sd 0:0:0:7: [sdh] Synchronizing SCSI cache
>>>>>>>>>>>> [   10.362944][  T174] sd 0:0:0:3: [sdd] Synchronizing SCSI cache
>>>>>>>>>>>> [   10.696627][   T83] sd 0:0:0:2: [sdc] Synchronizing SCSI cache
>>>>>>>>>>>> [   10.704562][  T170] sd 0:0:0:6: [sdg] Synchronizing SCSI cache
>>>>>>>>>>>> [   10.980602][    T5] sd 0:0:0:0: [sda] Synchronizing SCSI cache
>>>>>>>>>>>>
>>>>>>>>>>>> /** Printing all the consumer nodes of supplier **/
>>>>>>>>>>>> [   10.987327][    T5] ufs_device_wlun 0:0:0:49488: usage-count @ suspend: 0
>>>>>>>>>>>> <-- this is the usage_count
>>>>>>>>>>>> [   10.994440][    T5] ufs_rpmb_wlun 0:0:0:49476: PM state - 2
>>>>>>>>>>>> [   11.000402][    T5] scsi 0:0:0:49456: PM state - 2
>>>>>>>>>>>> [   11.005453][    T5] sd 0:0:0:0: PM state - 2
>>>>>>>>>>>> [   11.009958][    T5] sd 0:0:0:1: PM state - 2
>>>>>>>>>>>> [   11.014469][    T5] sd 0:0:0:2: PM state - 2
>>>>>>>>>>>> [   11.019072][    T5] sd 0:0:0:3: PM state - 2
>>>>>>>>>>>> [   11.023595][    T5] sd 0:0:0:4: PM state - 0 << RPM_ACTIVE
>>>>>>>>>>>> [   11.353298][    T5] sd 0:0:0:5: PM state - 2
>>>>>>>>>>>> [   11.357726][    T5] sd 0:0:0:6: PM state - 2
>>>>>>>>>>>> [   11.362155][    T5] sd 0:0:0:7: PM state - 2
>>>>>>>>>>>> [   11.366584][    T5] ufshcd-qcom 1d84000.ufshc: __ufshcd_wl_suspend - 8709
>>>>>>>>>>>> [   11.374366][    T5] ufs_device_wlun 0:0:0:49488: __ufshcd_wl_suspend -
>>>>>>>>>>>> (0) has rpm_active flags
>>>>>>>>>>
>>>>>>>>>> Do you mean that rpm_active of the link between the consumer and the
>>>>>>>>>> supplier is greater than 0 at this point and the consumer is
>>>>>>>>>
>>>>>>>>> I mean is rpm_active of the link greater than 1 (because 1 means "no
>>>>>>>>> active references to the supplier")?
>>>>>>>> Hi Rafael:
>>>>>>>> No - it is not greater than 1.
>>>>>>>>
>>>>>>>> I'm trying to understand what's going on in it; will update when I've something.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> RPM_ACTIVE, but the supplier suspends successfully nevertheless?
>>>>>>>>>>
>>>>>>>>>>>> [   11.383376][    T5] ufs_device_wlun 0:0:0:49488:
>>>>>>>>>>>> ufshcd_wl_runtime_suspend <-- Supplier suspends fine.
>>>>>>>>>>>> [   12.977318][  T174] sd 0:0:0:4: [sde] Synchronizing SCSI cache
>>>>>>>>>>>>
>>>>>>>>>>>> And the the suspend of sde is stuck now:
>>>>>>>>>>>> schedule+0x9c/0xe0
>>>>>>>>>>>> schedule_timeout+0x40/0x128
>>>>>>>>>>>> io_schedule_timeout+0x44/0x68
>>>>>>>>>>>> wait_for_common_io+0x7c/0x100
>>>>>>>>>>>> wait_for_completion_io+0x14/0x20
>>>>>>>>>>>> blk_execute_rq+0x90/0xcc
>>>>>>>>>>>> __scsi_execute+0x104/0x1c4
>>>>>>>>>>>> sd_sync_cache+0xf8/0x2a0
>>>>>>>>>>>> sd_suspend_common+0x74/0x11c
>>>>>>>>>>>> sd_suspend_runtime+0x14/0x20
>>>>>>>>>>>> scsi_runtime_suspend+0x64/0x94
>>>>>>>>>>>> __rpm_callback+0x80/0x2a4
>>>>>>>>>>>> rpm_suspend+0x308/0x614
>>>>>>>>>>>> pm_runtime_work+0x98/0xa8
>>>>>>>>>>>>
>>>>>>>>>>>> I added 'DL_FLAG_RPM_ACTIVE' while creating links.
>>>>>>>>>>>>           if (hba->sdev_ufs_device) {
>>>>>>>>>>>>                   link = device_link_add(&sdev->sdev_gendev,
>>>>>>>>>>>>                                       &hba->sdev_ufs_device->sdev_gendev,
>>>>>>>>>>>>                                      DL_FLAG_PM_RUNTIME|DL_FLAG_RPM_ACTIVE);
>>>>>>>>>>>> I didn't expect this to resolve the issue anyway and it didn't.
>>>>>>>>>>>>
>>>>>>>>>>>> Another interesting point here is when I resume any of the above suspended
>>>>>>>>>>>> consumers, it all goes back to normal, which is kind of expected. I tried
>>>>>>>>>>>> resuming the consumer and the supplier is resumed and the supplier is
>>>>>>>>>>>> suspended when all the consumers are suspended.
>>>>>>>>>>>>
>>>>>>>>>>>> Any pointers on this issue please?
>>>>>>>>>>>>
>>>>>>>>>>>> @Bart/@Alan - Do you've any pointers please?
>>>>>>>>>>>
>>>>>>>>>>> It's very noticeable that although you seem to have isolated a bug in
>>>>>>>>>>> the power management subsystem (supplier goes into runtime suspend
>>>>>>>>>>> even when one of its consumers is still active), you did not CC the
>>>>>>>>>>> power management maintainer or mailing list.
>>>>>>>>>>>
>>>>>>>>>>> I have added the appropriate CC's.
>>>>>>>>>>
>>>>>>>>>> Thanks Alan!
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> Hello
>>>>>>> I & Can (thanks CanG) debugged this further:
>>>>>>>
>>>>>>> Looks like this issue can occur if the sd probe is asynchronous.
>>>>>>>
>>>>>>> Essentially, the sd_probe() is done asynchronously and driver_probe_device() invokes pm_runtime_get_suppliers() before invoking sd_probe().
>>>>>>>
>>>>>>> But scsi_probe_and_add_lun() runs in a separate context.
>>>>>>> So the scsi_autopm_put_device() invoked from scsi_scan_host() context reduces the link->rpm_active to 1. And sd_probe() invokes scsi_autopm_put_device() and starts a timer. And then driver_probe_device() invoked from __device_attach_async_helper context reduces the link->rpm_active to 1 thus enabling the supplier to suspend before the consumer suspends.
>>>>>>>
>>>>>>> So if:
>>>>>>> Context T1:
>>>>>>> [1] scsi_probe_and_add_lun()
>>>>>>> [2]    |- scsi_autopm_put_device() - reduce the link->rpm_active to 1
>>>>>>>
>>>>>>> Context T2:
>>>>>>> __device_attach_async_helper()
>>>>>>>        |- driver_probe_device()
>>>>>>>            |- sd_probe()
>>>>>>> In between [1] and [2] say, driver_probe_device() -> sd_probe() is invoked in a separate context from __device_attach_async_helper().
>>>>>>> The driver_probe_device() -> pm_runtime_get_suppliers() but [2] would reduce link->rpm_active to 1.
>>>>>>> Then sd_probe() would invoke rpm_resume() and proceed as is.
>>>>>>> When sd_probe() invokes scsi_autopm_put_device() it'd start a timer, dev->power.timer_autosuspends = 1.
>>>>>>>
>>>>>>> Now then, pm_runtime_put_suppliers() is invoked from driver_probe_device() and that makes the link->rpm_active = 1.
>>>>>>> But by now, the corresponding 'sd dev' (consumer) usage_count = 0, state = RPM_ACTIVE and link->rpm_active = 1.
>>>>>>> At this point of time, all other 'sd dev' (consumers) _may_ be suspended or active but would have the link->rpm_active = 1.
>>>>>>
>>>>>> Is this with DL_FLAG_RPM_ACTIVE?  In that case, wouldn't active
>>>>>> consumers have link->rpm_active = 2 and also have incremented
>>>>>> the supplier's usage_count?
>>>
>>> Yes this is with DL_FLAG_RPM_ACTIVE.
>>>
>>> Please let me share a log here:
>>> BEF means - Before, AFT means After.
>>>
>>> [    6.843445][    T7] scsi 0:0:0:4: [UFSDBG]: ufshcd_setup_links:4779:  supp: usage_cnt: 3 Link - 0:0:0:49488 link-rpm_active: 2 avail_luns: 5
>>> [    6.892545][    T7] scsi 0:0:0:4: pm_runtime_get_suppliers: (0:0:0:49488): supp: usage_count: 5 rpm_active: 4
>>>
>>> In the above log, T7 is the context in which this scsi device is being added - scsi_sysfs_add_sdev()
>>>
>>> [    6.931846][    T7] ufs_rpmb_wlun 0:0:0:4: [UFSDBG]: ufshcd_rpmb_probe:9692: invoked
>>> [    6.941246][    T7] scsi 0:0:0:4: pm_runtime_put_suppliers: rpm_active: 4
>>>
>>> [    6.941246][    T7] scsi 0:0:0:4: pm_runtime_put_suppliers: (0:0:0:49488) [BEF] usage_count: 5
>>> [    6.941247][    T7] scsi 0:0:0:4: pm_runtime_put_suppliers: (0:0:0:49488) [AFT] usage_count: 4 rpm_active: 3
>>>
>>> [    6.941267][    T7] scsi 0:0:0:4: rpm_put_suppliers: [BEF] Supp (0:0:0:49488) usage_count: 4 rpm_active: 3
>>>
>>> ------ T196 Context comes in while T7 is running ----------
>>> [    6.941466][  T196] scsi 0:0:0:4: pm_runtime_get_suppliers: (0:0:0:49488): supp: usage_count: 5 rpm_active: 4
>>> --------------------------------------------------------------
>>>
>>> [    7.788397][    T7] scsi 0:0:0:4: rpm_put_suppliers: [AFT] Supp (0:0:0:49488) usage_count: 2 rpm_active: 1
>>
>>
>>
>>>
>>> --
>>>
>>> T196 is the context in which sd_probe() is invoked for this scsi device.
>>>
>>> [    7.974410][  T196] sd 0:0:0:4: [sde] Attached SCSI disk
>>> [    7.984188][  T196] sd 0:0:0:4: pm_runtime_put_suppliers: rpm_active: 2
>>> [    7.998424][  T196] sd 0:0:0:4: pm_runtime_put_suppliers: (0:0:0:49488) [BEF] usage_count: 4
>>> [    8.017320][  T196] sd 0:0:0:4: pm_runtime_put_suppliers: (0:0:0:49488) [AFT] usage_count: 1 rpm_active: 1
>>>
>>> The reference to the link is released after sd_probe() is completed.
>>> At this point, the rpm_active should be 2. And the rpm_active should become 1 when sd 0:0:0:4 actually suspends. But at the end of sd_probe() the suspend is only scheduled. However the supplier is now free to suspend.
>>>
>>> In this log, the usage_count of supplier becomes 0 here:
>>> [   11.963885][  T117] sd 0:0:0:7: rpm_put_suppliers: [BEF] Supp (0:0:0:49488) usage_count: 1 rpm_active: 2
>>> [   11.973821][  T117] sd 0:0:0:7: rpm_put_suppliers: [AFT] Supp (0:0:0:49488) usage_count: 0 rpm_active: 1
>>>
>>> However, the consumer sd 0:0:0:4 is still active but has released the reference to the supplier:
>>
>> If that is the case, then it is an error in PM not UFS.
>>
>> A second look at the code around rpm_put_suppliers() does look
>> potentially racy, since there does not appear to be anything
>> stopping the runtime_status changing between
>> spin_unlock_irq(&dev->power.lock) and device_links_read_lock().
>>
>> Rafael, can you comment?
> 
> Indeed, if the device is suspending, changing its PM-runtime status to
> RPM_SUSPENDED and dropping its power.lock allows a concurrent
> rpm_resume() to run and resume the device before the suppliers can be
> suspended.
> 
> That's incorrect and has been introduced by commit 44cc89f76464 ("PM:
> runtime: Update device status before letting suppliers suspend").
> 
> It is probably better to revert that commit and address the original
> issue in a different way.
> 
Hello,
One approach to address the original issue could be to prevent the scsi 
devices from suspending until the probe is completed, perhaps?

Please let me know if there's a better way to handle this.
Else I'd address all other comments from Adrian and push a change with 
the above approach.

>>          /*
>>           * If the device is suspending and the callback has returned success,
>>           * drop the usage counters of the suppliers that have been reference
>>           * counted on its resume.
>>           *
>>           * Do that if the resume fails too.
>>           */
>>          put = dev->power.runtime_status == RPM_SUSPENDING && !retval;
>>          if (put)
>>                  __update_runtime_status(dev, RPM_SUSPENDED);
>>          else
>>                  put = get && retval;
>>
>>          if (put) {
>>                  spin_unlock_irq(&dev->power.lock);
>>
>>                  idx = device_links_read_lock();
>>
>> fail:
>>                  rpm_put_suppliers(dev);
>>
>>                  device_links_read_unlock(idx);
>>
>>                  spin_lock_irq(&dev->power.lock);
>>          }
>>
>>
>>
>>
>>> [   12.002792][  T117] scsi 0:0:0:49456: rpm_status - 2
>>> [   12.002806][  T117] sd 0:0:0:0: rpm_status - 2
>>> [   12.002834][  T117] sd 0:0:0:1: rpm_status - 2
>>> [   12.017730][  T117] sd 0:0:0:2: rpm_status - 2
>>> [   12.041317][  T117] sd 0:0:0:3: rpm_status - 2
>>> [   12.045953][  T117] sd 0:0:0:4: rpm_status - 0
>>>
>>> And sd 0:0:0:4 tries to suspend here:
>>> [   15.465914][  T117] sd 0:0:0:4: [sde] Synchronizing SCSI cache
>>>
>>>>>>
>>>>>> Another outstanding issue that comes to mind, is to ensure
>>>>>> hba->sdev_ufs_device does not runtime suspend before it is probed.
>>>>>> I suggest changing ufshcd_slave_configure() so it does not set
>>>>>> sdev->rpm_autosuspend for hba->sdev_ufs_device, and instead do
>>>>>> pm_runtime_allow / pm_runtime_forbid() in ufshcd_wl_probe() /
>>>>>> ufshcd_wl_remove() respectively.
>>>>>>
>>> If pm_runtime_allow() is invoked from ufshcd_wl_probe() it'd invoke runtime_suspend on hba->sdev_ufs_device before exiting scsi_sysfs_add_sdev(). So I think pm_runtime_allow() should be invoked in ufshcd_scsi_add_wlus().
>>>
>>>>>> However we still want to stop hba->sdev_ufs_device runtime
>>>>>> suspending while consumers are being added.  With that in mind,
>>>>>> I would expect pm_runtime_get_noresume(&hba->sdev_ufs_device->sdev_gendev)
>>>>>> in ufshcd_scsi_add_wlus() to come *before*
>>>>>> ufshcd_blk_pm_runtime_init(hba->sdev_ufs_device).  In fact, it would
>>>>>> be more logical to make it, pm_runtime_get_sync() since we require
>>>>>> hba->sdev_ufs_device to be active at that point.
>>>>>>
>>> Correct, scsi_autopm_get_device(hba->sdev_ufs_device) should be invoked before ufshcd_blk_pm_runtime_init(hba->sdev_ufs_device).
>>> Now a corresponding scsi_autopm_put_device(hba->sdev_ufs_device) is invoked after the consumers are added in ufshcd_setup_links().
>>> Even then I think this issue would still pop up.
>>>
>>>>>>
>>>>>
>>>>> Hi Adrian,
>>>>> I think the v11 that I pushed can handle this.
>>>>> runtime-suspend is forbidden at probe and is re-enabled after probe is done. Please take a look and let me know if I'm missing something.
>>>>
>>>> If the PM APIs are being used correctly, the usage and active
>>>> counts should never be wrong.  If they were, then that would
>>>> be an issue for the PM framework.
>>>> It is more likely that I'm using it incorrectly :).
>>>
>>>> However, it is hard to tell what the issues are until all the
>>>> UFS driver changes have been completed, such as the ones I
>>>> suggested above above.
>>>>
>>> Ok, understood.
>>>
>>>> v11 could be hiding issues rather than fixing them.
>>>>
>>> Based on the logs, in v11, I tried to forbid any runtime-suspend until sd_probe() is done.
>>> I could be misunderstanding the whole thing though.
>>>
>>> Having said that, I will make the changes as per your suggestions and push a v12. I will test with v12 as well and see if this issue is seen.
>>>
>>>>>
>>>>>>>
>>>>>>> Since the supplier has 0 auto-suspend delay, it now suspends!
>>>>>>>
>>>>>>>
>>>>>>> Context [T1]
>>>>>>> Call trace:
>>>>>>> dump_backtrace+0x0/0x1d4
>>>>>>> show_stack+0x18/0x24
>>>>>>> dump_stack+0xc4/0x144
>>>>>>> __pm_runtime_idle+0xb4/0x184
>>>>>>> scsi_autopm_put_device+0x18/0x24
>>>>>>> scsi_sysfs_add_sdev+0x26c/0x278
>>>>>>> scsi_probe_and_add_lun+0xbac/0xd48
>>>>>>> __scsi_scan_target+0x38c/0x510
>>>>>>> scsi_scan_host_selected+0x14c/0x1e4
>>>>>>> scsi_scan_host+0x1e0/0x228
>>>>>>> ufshcd_async_scan+0x39c/0x408
>>>>>>> async_run_entry_fn+0x48/0x128
>>>>>>> process_one_work+0x1f0/0x470
>>>>>>> worker_thread+0x26c/0x4c8
>>>>>>> kthread+0x13c/0x320
>>>>>>> ret_from_fork+0x10/0x18
>>>>>>>
>>>>>>>
>>>>>>> Context [T2]
>>>>>>> Call trace:
>>>>>>> dump_backtrace+0x0/0x1d4
>>>>>>> show_stack+0x18/0x24
>>>>>>> dump_stack+0xc4/0x144
>>>>>>> rpm_get_suppliers+0x48/0x1ac
>>>>>>> __rpm_callback+0x58/0x12c
>>>>>>> rpm_resume+0x3a4/0x618
>>>>>>> __pm_runtime_resume+0x50/0x80
>>>>>>> scsi_autopm_get_device+0x20/0x54
>>>>>>> sd_probe+0x40/0x3d0
>>>>>>> really_probe+0x1bc/0x4a0
>>>>>>> driver_probe_device+0x84/0xf0
>>>>>>> __device_attach_driver+0x114/0x138
>>>>>>> bus_for_each_drv+0x84/0xd0
>>>>>>> __device_attach_async_helper+0x7c/0xf0
>>>>>>> async_run_entry_fn+0x48/0x128
>>>>>>> process_one_work+0x1f0/0x470
>>>>>>> worker_thread+0x26c/0x4c8
>>>>>>> kthread+0x13c/0x320
>>>>>>> ret_from_fork+0x10/0x18
>>>>>>>
>>>>>>> Below prints show how link->rpm_active becomes 1 for sd 0:0:0:4
>>>>>>> [    7.574654][  T212] Call trace:
>>>>>>> [    7.574657][  T212]  dump_backtrace+0x0/0x1d4
>>>>>>> [    7.574661][  T212]  show_stack+0x18/0x24
>>>>>>> [    7.574665][  T212]  dump_stack+0xc4/0x144
>>>>>>> [    7.574668][  T212]  __pm_runtime_idle+0xb4/0x184
>>>>>>> [    7.574671][  T212]  scsi_autopm_put_device+0x18/0x24
>>>>>>> [    7.574675][  T212]  sd_probe+0x314/0x3d0
>>>>>>> [    7.574677][  T212]  really_probe+0x1bc/0x4a0
>>>>>>> [    7.574680][  T212]  driver_probe_device+0x84/0xf0
>>>>>>> [    7.574683][  T212]  __device_attach_driver+0x114/0x138
>>>>>>> [    7.574686][  T212]  bus_for_each_drv+0x84/0xd0
>>>>>>> [    7.574689][  T212]  __device_attach_async_helper+0x7c/0xf0
>>>>>>> [    7.574692][  T212]  async_run_entry_fn+0x48/0x128
>>>>>>> [    7.574695][  T212]  process_one_work+0x1f0/0x470
>>>>>>> [    7.574698][  T212]  worker_thread+0x26c/0x4c8
>>>>>>> [    7.574700][  T212]  kthread+0x13c/0x320
>>>>>>> [    7.574703][  T212]  ret_from_fork+0x10/0x18
>>>>>>> [    7.574706][  T212] sd 0:0:0:4: scsi_runtime_idle
>>>>>>> [    7.574712][  T212] sd 0:0:0:4: __pm_runtime_idle: aft: [UFSDBG]: pwr.timer_autosuspends: 1 pwr.request_pending: 0 retval: -16 pwr.request: 0 usage_count: 0 rpm_status: 0 link-rpm_active:2
>>>>>>> [    7.574715][  T212] sd 0:0:0:4: sd_probe: [UFSDBG]: Exit
>>>>>>> [    7.574738][  T212] sd 0:0:0:4: __pm_runtime_idle: b4: [UFSDBG]: pwr.request: 0 usage_count: 0 rpm_status: 0 link-rpm_active:2
>>>>>>>
>>>>>>> [    7.574752][  T212] Workqueue: events_unbound async_run_entry_fn
>>>>>>> [    7.574754][  T212] Call trace:
>>>>>>> [    7.574758][  T212]  dump_backtrace+0x0/0x1d4
>>>>>>> [    7.574761][  T212]  show_stack+0x18/0x24
>>>>>>> [    7.574765][  T212]  dump_stack+0xc4/0x144
>>>>>>> [    7.574767][  T212]  __pm_runtime_idle+0xb4/0x184
>>>>>>> [    7.574770][  T212]  driver_probe_device+0x94/0xf0
>>>>>>> [    7.574773][  T212]  __device_attach_driver+0x114/0x138
>>>>>>> [    7.574775][  T212]  bus_for_each_drv+0x84/0xd0
>>>>>>> [    7.574778][  T212]  __device_attach_async_helper+0x7c/0xf0
>>>>>>> [    7.574781][  T212]  async_run_entry_fn+0x48/0x128
>>>>>>> [    7.574783][  T212]  process_one_work+0x1f0/0x470
>>>>>>> [    7.574786][  T212]  worker_thread+0x26c/0x4c8
>>>>>>> [    7.574788][  T212]  kthread+0x13c/0x320
>>>>>>> [    7.574791][  T212]  ret_from_fork+0x10/0x18
>>>>>>> [    7.574848][   T80] sd 0:0:0:4: scsi_runtime_idle
>>>>>>> [    7.574858][  T212] sd 0:0:0:4: __pm_runtime_idle: aft: [UFSDBG]: pwr.timer_autosuspends: 1 pwr.request_pending: 0 retval: 0 pwr.request: 0 usage_count: 0 rpm_status: 0 link-rpm_active:2
>>>>>>> [    7.574863][  T212] sd 0:0:0:4: pm_runtime_put_suppliers: [UFSDBG]: rpm_status: 0 link-rpm_active:1
>>>>>>> [    7.574866][  T212] sd 0:0:0:4: async probe completed
>>>>>>> [    7.574870][  T212] sd 0:0:0:4: __pm_runtime_idle: b4: [UFSDBG]: pwr.request: 0 usage_count: 0 rpm_status: 0 link-rpm_active:1
>>>>>>>
>>>>>>>
>>>>>>> So, from the above it looks like when async probe is enabled this is a possibility.
>>>>>>>
>>>>>>> I don't see a way around this. Please let me know if you (@Alan/@Bart/@Adrian) have any thoughts on this.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> -asd
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
>>> Linux Foundation Collaborative Project
>>


-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
Linux Foundation Collaborative Project

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v10 1/2] scsi: ufs: Enable power management for wlun
  2021-03-18 17:27                             ` Asutosh Das (asd)
@ 2021-03-18 17:54                               ` Rafael J. Wysocki
  2021-03-18 17:58                                 ` Asutosh Das (asd)
  0 siblings, 1 reply; 20+ messages in thread
From: Rafael J. Wysocki @ 2021-03-18 17:54 UTC (permalink / raw)
  To: Asutosh Das (asd)
  Cc: Rafael J. Wysocki, Adrian Hunter, Alan Stern, Bart Van Assche,
	cang, Martin K. Petersen, open list:TARGET SUBSYSTEM,
	linux-arm-msm, Alim Akhtar, Avri Altman, James E.J. Bottomley,
	Krzysztof Kozlowski, Stanley Chu, Andy Gross, Bjorn Andersson,
	Steven Rostedt, Ingo Molnar, Matthias Brugger, Kiwoong Kim,
	Bean Huo, Lee Jones, Wei Yongjun, Dinghao Liu,
	Gustavo A. R. Silva, Tomas Winkler, Jaegeuk Kim, Satya Tangirala,
	open list,
	moderated list:ARM/SAMSUNG S3C, S5P AND EXYNOS ARM ARCHITECTURES,
	open list:ARM/SAMSUNG S3C, S5P AND EXYNOS ARM ARCHITECTURES,
	moderated list:UNIVERSAL FLASH STORAGE HOST CONTROLLER DRIVER...,
	Linux-PM mailing list

On Thu, Mar 18, 2021 at 6:33 PM Asutosh Das (asd)
<asutoshd@codeaurora.org> wrote:
>
> On 3/18/2021 7:00 AM, Rafael J. Wysocki wrote:
> > On Wed, Mar 17, 2021 at 7:37 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
> >>
> >> On 16/03/21 10:35 pm, Asutosh Das (asd) wrote:
> >>> On 3/16/2021 12:48 AM, Adrian Hunter wrote:
> >>>> On 16/03/21 12:22 am, Asutosh Das (asd) wrote:
> >>>>> On 3/14/2021 1:11 AM, Adrian Hunter wrote:
> >>>>>> On 10/03/21 5:04 am, Asutosh Das (asd) wrote:
> >>>>>>> On 3/9/2021 7:56 AM, Asutosh Das (asd) wrote:
> >>>>>>>> On 3/8/2021 9:17 AM, Rafael J. Wysocki wrote:
> >>>>>>>>> On Mon, Mar 8, 2021 at 5:21 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
> >>>>>>>>>>
> >>>>>>>>>> On Sat, Mar 6, 2021 at 5:17 PM Alan Stern <stern@rowland.harvard.edu> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> On Fri, Mar 05, 2021 at 06:54:24PM -0800, Asutosh Das (asd) wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Now during my testing I see a weird issue sometimes (1 in 7).
> >>>>>>>>>>>> Scenario - bootups
> >>>>>>>>>>>>
> >>>>>>>>>>>> Issue:
> >>>>>>>>>>>> The supplier 'ufs_device_wlun 0:0:0:49488' goes into runtime suspend even
> >>>>>>>>>>>> when one/more of its consumers are in RPM_ACTIVE state.
> >>>>>>>>>>>>
> >>>>>>>>>>>> *Log:
> >>>>>>>>>>>> [   10.056379][  T206] sd 0:0:0:1: [sdb] Synchronizing SCSI cache
> >>>>>>>>>>>> [   10.062497][  T113] sd 0:0:0:5: [sdf] Synchronizing SCSI cache
> >>>>>>>>>>>> [   10.356600][   T32] sd 0:0:0:7: [sdh] Synchronizing SCSI cache
> >>>>>>>>>>>> [   10.362944][  T174] sd 0:0:0:3: [sdd] Synchronizing SCSI cache
> >>>>>>>>>>>> [   10.696627][   T83] sd 0:0:0:2: [sdc] Synchronizing SCSI cache
> >>>>>>>>>>>> [   10.704562][  T170] sd 0:0:0:6: [sdg] Synchronizing SCSI cache
> >>>>>>>>>>>> [   10.980602][    T5] sd 0:0:0:0: [sda] Synchronizing SCSI cache
> >>>>>>>>>>>>
> >>>>>>>>>>>> /** Printing all the consumer nodes of supplier **/
> >>>>>>>>>>>> [   10.987327][    T5] ufs_device_wlun 0:0:0:49488: usage-count @ suspend: 0
> >>>>>>>>>>>> <-- this is the usage_count
> >>>>>>>>>>>> [   10.994440][    T5] ufs_rpmb_wlun 0:0:0:49476: PM state - 2
> >>>>>>>>>>>> [   11.000402][    T5] scsi 0:0:0:49456: PM state - 2
> >>>>>>>>>>>> [   11.005453][    T5] sd 0:0:0:0: PM state - 2
> >>>>>>>>>>>> [   11.009958][    T5] sd 0:0:0:1: PM state - 2
> >>>>>>>>>>>> [   11.014469][    T5] sd 0:0:0:2: PM state - 2
> >>>>>>>>>>>> [   11.019072][    T5] sd 0:0:0:3: PM state - 2
> >>>>>>>>>>>> [   11.023595][    T5] sd 0:0:0:4: PM state - 0 << RPM_ACTIVE
> >>>>>>>>>>>> [   11.353298][    T5] sd 0:0:0:5: PM state - 2
> >>>>>>>>>>>> [   11.357726][    T5] sd 0:0:0:6: PM state - 2
> >>>>>>>>>>>> [   11.362155][    T5] sd 0:0:0:7: PM state - 2
> >>>>>>>>>>>> [   11.366584][    T5] ufshcd-qcom 1d84000.ufshc: __ufshcd_wl_suspend - 8709
> >>>>>>>>>>>> [   11.374366][    T5] ufs_device_wlun 0:0:0:49488: __ufshcd_wl_suspend -
> >>>>>>>>>>>> (0) has rpm_active flags
> >>>>>>>>>>
> >>>>>>>>>> Do you mean that rpm_active of the link between the consumer and the
> >>>>>>>>>> supplier is greater than 0 at this point and the consumer is
> >>>>>>>>>
> >>>>>>>>> I mean is rpm_active of the link greater than 1 (because 1 means "no
> >>>>>>>>> active references to the supplier")?
> >>>>>>>> Hi Rafael:
> >>>>>>>> No - it is not greater than 1.
> >>>>>>>>
> >>>>>>>> I'm trying to understand what's going on in it; will update when I've something.
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> RPM_ACTIVE, but the supplier suspends successfully nevertheless?
> >>>>>>>>>>
> >>>>>>>>>>>> [   11.383376][    T5] ufs_device_wlun 0:0:0:49488:
> >>>>>>>>>>>> ufshcd_wl_runtime_suspend <-- Supplier suspends fine.
> >>>>>>>>>>>> [   12.977318][  T174] sd 0:0:0:4: [sde] Synchronizing SCSI cache
> >>>>>>>>>>>>
> >>>>>>>>>>>> And the the suspend of sde is stuck now:
> >>>>>>>>>>>> schedule+0x9c/0xe0
> >>>>>>>>>>>> schedule_timeout+0x40/0x128
> >>>>>>>>>>>> io_schedule_timeout+0x44/0x68
> >>>>>>>>>>>> wait_for_common_io+0x7c/0x100
> >>>>>>>>>>>> wait_for_completion_io+0x14/0x20
> >>>>>>>>>>>> blk_execute_rq+0x90/0xcc
> >>>>>>>>>>>> __scsi_execute+0x104/0x1c4
> >>>>>>>>>>>> sd_sync_cache+0xf8/0x2a0
> >>>>>>>>>>>> sd_suspend_common+0x74/0x11c
> >>>>>>>>>>>> sd_suspend_runtime+0x14/0x20
> >>>>>>>>>>>> scsi_runtime_suspend+0x64/0x94
> >>>>>>>>>>>> __rpm_callback+0x80/0x2a4
> >>>>>>>>>>>> rpm_suspend+0x308/0x614
> >>>>>>>>>>>> pm_runtime_work+0x98/0xa8
> >>>>>>>>>>>>
> >>>>>>>>>>>> I added 'DL_FLAG_RPM_ACTIVE' while creating links.
> >>>>>>>>>>>>           if (hba->sdev_ufs_device) {
> >>>>>>>>>>>>                   link = device_link_add(&sdev->sdev_gendev,
> >>>>>>>>>>>>                                       &hba->sdev_ufs_device->sdev_gendev,
> >>>>>>>>>>>>                                      DL_FLAG_PM_RUNTIME|DL_FLAG_RPM_ACTIVE);
> >>>>>>>>>>>> I didn't expect this to resolve the issue anyway and it didn't.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Another interesting point here is when I resume any of the above suspended
> >>>>>>>>>>>> consumers, it all goes back to normal, which is kind of expected. I tried
> >>>>>>>>>>>> resuming the consumer and the supplier is resumed and the supplier is
> >>>>>>>>>>>> suspended when all the consumers are suspended.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Any pointers on this issue please?
> >>>>>>>>>>>>
> >>>>>>>>>>>> @Bart/@Alan - Do you've any pointers please?
> >>>>>>>>>>>
> >>>>>>>>>>> It's very noticeable that although you seem to have isolated a bug in
> >>>>>>>>>>> the power management subsystem (supplier goes into runtime suspend
> >>>>>>>>>>> even when one of its consumers is still active), you did not CC the
> >>>>>>>>>>> power management maintainer or mailing list.
> >>>>>>>>>>>
> >>>>>>>>>>> I have added the appropriate CC's.
> >>>>>>>>>>
> >>>>>>>>>> Thanks Alan!
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>> Hello
> >>>>>>> I & Can (thanks CanG) debugged this further:
> >>>>>>>
> >>>>>>> Looks like this issue can occur if the sd probe is asynchronous.
> >>>>>>>
> >>>>>>> Essentially, the sd_probe() is done asynchronously and driver_probe_device() invokes pm_runtime_get_suppliers() before invoking sd_probe().
> >>>>>>>
> >>>>>>> But scsi_probe_and_add_lun() runs in a separate context.
> >>>>>>> So the scsi_autopm_put_device() invoked from scsi_scan_host() context reduces the link->rpm_active to 1. And sd_probe() invokes scsi_autopm_put_device() and starts a timer. And then driver_probe_device() invoked from __device_attach_async_helper context reduces the link->rpm_active to 1 thus enabling the supplier to suspend before the consumer suspends.
> >>>>>>>
> >>>>>>> So if:
> >>>>>>> Context T1:
> >>>>>>> [1] scsi_probe_and_add_lun()
> >>>>>>> [2]    |- scsi_autopm_put_device() - reduce the link->rpm_active to 1
> >>>>>>>
> >>>>>>> Context T2:
> >>>>>>> __device_attach_async_helper()
> >>>>>>>        |- driver_probe_device()
> >>>>>>>            |- sd_probe()
> >>>>>>> In between [1] and [2] say, driver_probe_device() -> sd_probe() is invoked in a separate context from __device_attach_async_helper().
> >>>>>>> The driver_probe_device() -> pm_runtime_get_suppliers() but [2] would reduce link->rpm_active to 1.
> >>>>>>> Then sd_probe() would invoke rpm_resume() and proceed as is.
> >>>>>>> When sd_probe() invokes scsi_autopm_put_device() it'd start a timer, dev->power.timer_autosuspends = 1.
> >>>>>>>
> >>>>>>> Now then, pm_runtime_put_suppliers() is invoked from driver_probe_device() and that makes the link->rpm_active = 1.
> >>>>>>> But by now, the corresponding 'sd dev' (consumer) usage_count = 0, state = RPM_ACTIVE and link->rpm_active = 1.
> >>>>>>> At this point of time, all other 'sd dev' (consumers) _may_ be suspended or active but would have the link->rpm_active = 1.
> >>>>>>
> >>>>>> Is this with DL_FLAG_RPM_ACTIVE?  In that case, wouldn't active
> >>>>>> consumers have link->rpm_active = 2 and also have incremented
> >>>>>> the supplier's usage_count?
> >>>
> >>> Yes this is with DL_FLAG_RPM_ACTIVE.
> >>>
> >>> Please let me share a log here:
> >>> BEF means - Before, AFT means After.
> >>>
> >>> [    6.843445][    T7] scsi 0:0:0:4: [UFSDBG]: ufshcd_setup_links:4779:  supp: usage_cnt: 3 Link - 0:0:0:49488 link-rpm_active: 2 avail_luns: 5
> >>> [    6.892545][    T7] scsi 0:0:0:4: pm_runtime_get_suppliers: (0:0:0:49488): supp: usage_count: 5 rpm_active: 4
> >>>
> >>> In the above log, T7 is the context in which this scsi device is being added - scsi_sysfs_add_sdev()
> >>>
> >>> [    6.931846][    T7] ufs_rpmb_wlun 0:0:0:4: [UFSDBG]: ufshcd_rpmb_probe:9692: invoked
> >>> [    6.941246][    T7] scsi 0:0:0:4: pm_runtime_put_suppliers: rpm_active: 4
> >>>
> >>> [    6.941246][    T7] scsi 0:0:0:4: pm_runtime_put_suppliers: (0:0:0:49488) [BEF] usage_count: 5
> >>> [    6.941247][    T7] scsi 0:0:0:4: pm_runtime_put_suppliers: (0:0:0:49488) [AFT] usage_count: 4 rpm_active: 3
> >>>
> >>> [    6.941267][    T7] scsi 0:0:0:4: rpm_put_suppliers: [BEF] Supp (0:0:0:49488) usage_count: 4 rpm_active: 3
> >>>
> >>> ------ T196 Context comes in while T7 is running ----------
> >>> [    6.941466][  T196] scsi 0:0:0:4: pm_runtime_get_suppliers: (0:0:0:49488): supp: usage_count: 5 rpm_active: 4
> >>> --------------------------------------------------------------
> >>>
> >>> [    7.788397][    T7] scsi 0:0:0:4: rpm_put_suppliers: [AFT] Supp (0:0:0:49488) usage_count: 2 rpm_active: 1
> >>
> >>
> >>
> >>>
> >>> --
> >>>
> >>> T196 is the context in which sd_probe() is invoked for this scsi device.
> >>>
> >>> [    7.974410][  T196] sd 0:0:0:4: [sde] Attached SCSI disk
> >>> [    7.984188][  T196] sd 0:0:0:4: pm_runtime_put_suppliers: rpm_active: 2
> >>> [    7.998424][  T196] sd 0:0:0:4: pm_runtime_put_suppliers: (0:0:0:49488) [BEF] usage_count: 4
> >>> [    8.017320][  T196] sd 0:0:0:4: pm_runtime_put_suppliers: (0:0:0:49488) [AFT] usage_count: 1 rpm_active: 1
> >>>
> >>> The reference to the link is released after sd_probe() is completed.
> >>> At this point, the rpm_active should be 2. And the rpm_active should become 1 when sd 0:0:0:4 actually suspends. But at the end of sd_probe() the suspend is only scheduled. However the supplier is now free to suspend.
> >>>
> >>> In this log, the usage_count of supplier becomes 0 here:
> >>> [   11.963885][  T117] sd 0:0:0:7: rpm_put_suppliers: [BEF] Supp (0:0:0:49488) usage_count: 1 rpm_active: 2
> >>> [   11.973821][  T117] sd 0:0:0:7: rpm_put_suppliers: [AFT] Supp (0:0:0:49488) usage_count: 0 rpm_active: 1
> >>>
> >>> However, the consumer sd 0:0:0:4 is still active but has released the reference to the supplier:
> >>
> >> If that is the case, then it is an error in PM not UFS.
> >>
> >> A second look at the code around rpm_put_suppliers() does look
> >> potentially racy, since there does not appear to be anything
> >> stopping the runtime_status changing between
> >> spin_unlock_irq(&dev->power.lock) and device_links_read_lock().
> >>
> >> Rafael, can you comment?
> >
> > Indeed, if the device is suspending, changing its PM-runtime status to
> > RPM_SUSPENDED and dropping its power.lock allows a concurrent
> > rpm_resume() to run and resume the device before the suppliers can be
> > suspended.
> >
> > That's incorrect and has been introduced by commit 44cc89f76464 ("PM:
> > runtime: Update device status before letting suppliers suspend").
> >
> > It is probably better to revert that commit and address the original
> > issue in a different way.
> >
> Hello,
> One approach to address the original issue could be to prevent the scsi
> devices from suspending until the probe is completed, perhaps?

I was talking about the original issue that commit 44cc89f76464
attempted to address.

I'm not sure if and how it is related to the issue you have been debugging.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v10 1/2] scsi: ufs: Enable power management for wlun
  2021-03-18 17:54                               ` Rafael J. Wysocki
@ 2021-03-18 17:58                                 ` Asutosh Das (asd)
  2021-03-18 19:16                                   ` Adrian Hunter
  0 siblings, 1 reply; 20+ messages in thread
From: Asutosh Das (asd) @ 2021-03-18 17:58 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Adrian Hunter, Alan Stern, Bart Van Assche, cang,
	Martin K. Petersen, open list:TARGET SUBSYSTEM, linux-arm-msm,
	Alim Akhtar, Avri Altman, James E.J. Bottomley,
	Krzysztof Kozlowski, Stanley Chu, Andy Gross, Bjorn Andersson,
	Steven Rostedt, Ingo Molnar, Matthias Brugger, Kiwoong Kim,
	Bean Huo, Lee Jones, Wei Yongjun, Dinghao Liu,
	Gustavo A. R. Silva, Tomas Winkler, Jaegeuk Kim, Satya Tangirala,
	open list,
	moderated list:ARM/SAMSUNG S3C, S5P AND EXYNOS ARM ARCHITECTURES,
	open list:ARM/SAMSUNG S3C, S5P AND EXYNOS ARM ARCHITECTURES,
	moderated list:UNIVERSAL FLASH STORAGE HOST CONTROLLER DRIVER...,
	Linux-PM mailing list

On 3/18/2021 10:54 AM, Rafael J. Wysocki wrote:
> On Thu, Mar 18, 2021 at 6:33 PM Asutosh Das (asd)
> <asutoshd@codeaurora.org> wrote:
>>
>> On 3/18/2021 7:00 AM, Rafael J. Wysocki wrote:
>>> On Wed, Mar 17, 2021 at 7:37 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
>>>>
>>>> On 16/03/21 10:35 pm, Asutosh Das (asd) wrote:
>>>>> On 3/16/2021 12:48 AM, Adrian Hunter wrote:
>>>>>> On 16/03/21 12:22 am, Asutosh Das (asd) wrote:
>>>>>>> On 3/14/2021 1:11 AM, Adrian Hunter wrote:
>>>>>>>> On 10/03/21 5:04 am, Asutosh Das (asd) wrote:
>>>>>>>>> On 3/9/2021 7:56 AM, Asutosh Das (asd) wrote:
>>>>>>>>>> On 3/8/2021 9:17 AM, Rafael J. Wysocki wrote:
>>>>>>>>>>> On Mon, Mar 8, 2021 at 5:21 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Mar 6, 2021 at 5:17 PM Alan Stern <stern@rowland.harvard.edu> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Mar 05, 2021 at 06:54:24PM -0800, Asutosh Das (asd) wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Now during my testing I see a weird issue sometimes (1 in 7).
>>>>>>>>>>>>>> Scenario - bootups
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Issue:
>>>>>>>>>>>>>> The supplier 'ufs_device_wlun 0:0:0:49488' goes into runtime suspend even
>>>>>>>>>>>>>> when one/more of its consumers are in RPM_ACTIVE state.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *Log:
>>>>>>>>>>>>>> [   10.056379][  T206] sd 0:0:0:1: [sdb] Synchronizing SCSI cache
>>>>>>>>>>>>>> [   10.062497][  T113] sd 0:0:0:5: [sdf] Synchronizing SCSI cache
>>>>>>>>>>>>>> [   10.356600][   T32] sd 0:0:0:7: [sdh] Synchronizing SCSI cache
>>>>>>>>>>>>>> [   10.362944][  T174] sd 0:0:0:3: [sdd] Synchronizing SCSI cache
>>>>>>>>>>>>>> [   10.696627][   T83] sd 0:0:0:2: [sdc] Synchronizing SCSI cache
>>>>>>>>>>>>>> [   10.704562][  T170] sd 0:0:0:6: [sdg] Synchronizing SCSI cache
>>>>>>>>>>>>>> [   10.980602][    T5] sd 0:0:0:0: [sda] Synchronizing SCSI cache
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> /** Printing all the consumer nodes of supplier **/
>>>>>>>>>>>>>> [   10.987327][    T5] ufs_device_wlun 0:0:0:49488: usage-count @ suspend: 0
>>>>>>>>>>>>>> <-- this is the usage_count
>>>>>>>>>>>>>> [   10.994440][    T5] ufs_rpmb_wlun 0:0:0:49476: PM state - 2
>>>>>>>>>>>>>> [   11.000402][    T5] scsi 0:0:0:49456: PM state - 2
>>>>>>>>>>>>>> [   11.005453][    T5] sd 0:0:0:0: PM state - 2
>>>>>>>>>>>>>> [   11.009958][    T5] sd 0:0:0:1: PM state - 2
>>>>>>>>>>>>>> [   11.014469][    T5] sd 0:0:0:2: PM state - 2
>>>>>>>>>>>>>> [   11.019072][    T5] sd 0:0:0:3: PM state - 2
>>>>>>>>>>>>>> [   11.023595][    T5] sd 0:0:0:4: PM state - 0 << RPM_ACTIVE
>>>>>>>>>>>>>> [   11.353298][    T5] sd 0:0:0:5: PM state - 2
>>>>>>>>>>>>>> [   11.357726][    T5] sd 0:0:0:6: PM state - 2
>>>>>>>>>>>>>> [   11.362155][    T5] sd 0:0:0:7: PM state - 2
>>>>>>>>>>>>>> [   11.366584][    T5] ufshcd-qcom 1d84000.ufshc: __ufshcd_wl_suspend - 8709
>>>>>>>>>>>>>> [   11.374366][    T5] ufs_device_wlun 0:0:0:49488: __ufshcd_wl_suspend -
>>>>>>>>>>>>>> (0) has rpm_active flags
>>>>>>>>>>>>
>>>>>>>>>>>> Do you mean that rpm_active of the link between the consumer and the
>>>>>>>>>>>> supplier is greater than 0 at this point and the consumer is
>>>>>>>>>>>
>>>>>>>>>>> I mean is rpm_active of the link greater than 1 (because 1 means "no
>>>>>>>>>>> active references to the supplier")?
>>>>>>>>>> Hi Rafael:
>>>>>>>>>> No - it is not greater than 1.
>>>>>>>>>>
>>>>>>>>>> I'm trying to understand what's going on in it; will update when I've something.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> RPM_ACTIVE, but the supplier suspends successfully nevertheless?
>>>>>>>>>>>>
>>>>>>>>>>>>>> [   11.383376][    T5] ufs_device_wlun 0:0:0:49488:
>>>>>>>>>>>>>> ufshcd_wl_runtime_suspend <-- Supplier suspends fine.
>>>>>>>>>>>>>> [   12.977318][  T174] sd 0:0:0:4: [sde] Synchronizing SCSI cache
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> And the the suspend of sde is stuck now:
>>>>>>>>>>>>>> schedule+0x9c/0xe0
>>>>>>>>>>>>>> schedule_timeout+0x40/0x128
>>>>>>>>>>>>>> io_schedule_timeout+0x44/0x68
>>>>>>>>>>>>>> wait_for_common_io+0x7c/0x100
>>>>>>>>>>>>>> wait_for_completion_io+0x14/0x20
>>>>>>>>>>>>>> blk_execute_rq+0x90/0xcc
>>>>>>>>>>>>>> __scsi_execute+0x104/0x1c4
>>>>>>>>>>>>>> sd_sync_cache+0xf8/0x2a0
>>>>>>>>>>>>>> sd_suspend_common+0x74/0x11c
>>>>>>>>>>>>>> sd_suspend_runtime+0x14/0x20
>>>>>>>>>>>>>> scsi_runtime_suspend+0x64/0x94
>>>>>>>>>>>>>> __rpm_callback+0x80/0x2a4
>>>>>>>>>>>>>> rpm_suspend+0x308/0x614
>>>>>>>>>>>>>> pm_runtime_work+0x98/0xa8
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I added 'DL_FLAG_RPM_ACTIVE' while creating links.
>>>>>>>>>>>>>>            if (hba->sdev_ufs_device) {
>>>>>>>>>>>>>>                    link = device_link_add(&sdev->sdev_gendev,
>>>>>>>>>>>>>>                                        &hba->sdev_ufs_device->sdev_gendev,
>>>>>>>>>>>>>>                                       DL_FLAG_PM_RUNTIME|DL_FLAG_RPM_ACTIVE);
>>>>>>>>>>>>>> I didn't expect this to resolve the issue anyway and it didn't.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Another interesting point here is when I resume any of the above suspended
>>>>>>>>>>>>>> consumers, it all goes back to normal, which is kind of expected. I tried
>>>>>>>>>>>>>> resuming the consumer and the supplier is resumed and the supplier is
>>>>>>>>>>>>>> suspended when all the consumers are suspended.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Any pointers on this issue please?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> @Bart/@Alan - Do you've any pointers please?
>>>>>>>>>>>>>
>>>>>>>>>>>>> It's very noticeable that although you seem to have isolated a bug in
>>>>>>>>>>>>> the power management subsystem (supplier goes into runtime suspend
>>>>>>>>>>>>> even when one of its consumers is still active), you did not CC the
>>>>>>>>>>>>> power management maintainer or mailing list.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have added the appropriate CC's.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks Alan!
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hello
>>>>>>>>> I & Can (thanks CanG) debugged this further:
>>>>>>>>>
>>>>>>>>> Looks like this issue can occur if the sd probe is asynchronous.
>>>>>>>>>
>>>>>>>>> Essentially, the sd_probe() is done asynchronously and driver_probe_device() invokes pm_runtime_get_suppliers() before invoking sd_probe().
>>>>>>>>>
>>>>>>>>> But scsi_probe_and_add_lun() runs in a separate context.
>>>>>>>>> So the scsi_autopm_put_device() invoked from scsi_scan_host() context reduces the link->rpm_active to 1. And sd_probe() invokes scsi_autopm_put_device() and starts a timer. And then driver_probe_device() invoked from __device_attach_async_helper context reduces the link->rpm_active to 1 thus enabling the supplier to suspend before the consumer suspends.
>>>>>>>>>
>>>>>>>>> So if:
>>>>>>>>> Context T1:
>>>>>>>>> [1] scsi_probe_and_add_lun()
>>>>>>>>> [2]    |- scsi_autopm_put_device() - reduce the link->rpm_active to 1
>>>>>>>>>
>>>>>>>>> Context T2:
>>>>>>>>> __device_attach_async_helper()
>>>>>>>>>         |- driver_probe_device()
>>>>>>>>>             |- sd_probe()
>>>>>>>>> In between [1] and [2] say, driver_probe_device() -> sd_probe() is invoked in a separate context from __device_attach_async_helper().
>>>>>>>>> The driver_probe_device() -> pm_runtime_get_suppliers() but [2] would reduce link->rpm_active to 1.
>>>>>>>>> Then sd_probe() would invoke rpm_resume() and proceed as is.
>>>>>>>>> When sd_probe() invokes scsi_autopm_put_device() it'd start a timer, dev->power.timer_autosuspends = 1.
>>>>>>>>>
>>>>>>>>> Now then, pm_runtime_put_suppliers() is invoked from driver_probe_device() and that makes the link->rpm_active = 1.
>>>>>>>>> But by now, the corresponding 'sd dev' (consumer) usage_count = 0, state = RPM_ACTIVE and link->rpm_active = 1.
>>>>>>>>> At this point of time, all other 'sd dev' (consumers) _may_ be suspended or active but would have the link->rpm_active = 1.
>>>>>>>>
>>>>>>>> Is this with DL_FLAG_RPM_ACTIVE?  In that case, wouldn't active
>>>>>>>> consumers have link->rpm_active = 2 and also have incremented
>>>>>>>> the supplier's usage_count?
>>>>>
>>>>> Yes this is with DL_FLAG_RPM_ACTIVE.
>>>>>
>>>>> Please let me share a log here:
>>>>> BEF means - Before, AFT means After.
>>>>>
>>>>> [    6.843445][    T7] scsi 0:0:0:4: [UFSDBG]: ufshcd_setup_links:4779:  supp: usage_cnt: 3 Link - 0:0:0:49488 link-rpm_active: 2 avail_luns: 5
>>>>> [    6.892545][    T7] scsi 0:0:0:4: pm_runtime_get_suppliers: (0:0:0:49488): supp: usage_count: 5 rpm_active: 4
>>>>>
>>>>> In the above log, T7 is the context in which this scsi device is being added - scsi_sysfs_add_sdev()
>>>>>
>>>>> [    6.931846][    T7] ufs_rpmb_wlun 0:0:0:4: [UFSDBG]: ufshcd_rpmb_probe:9692: invoked
>>>>> [    6.941246][    T7] scsi 0:0:0:4: pm_runtime_put_suppliers: rpm_active: 4
>>>>>
>>>>> [    6.941246][    T7] scsi 0:0:0:4: pm_runtime_put_suppliers: (0:0:0:49488) [BEF] usage_count: 5
>>>>> [    6.941247][    T7] scsi 0:0:0:4: pm_runtime_put_suppliers: (0:0:0:49488) [AFT] usage_count: 4 rpm_active: 3
>>>>>
>>>>> [    6.941267][    T7] scsi 0:0:0:4: rpm_put_suppliers: [BEF] Supp (0:0:0:49488) usage_count: 4 rpm_active: 3
>>>>>
>>>>> ------ T196 Context comes in while T7 is running ----------
>>>>> [    6.941466][  T196] scsi 0:0:0:4: pm_runtime_get_suppliers: (0:0:0:49488): supp: usage_count: 5 rpm_active: 4
>>>>> --------------------------------------------------------------
>>>>>
>>>>> [    7.788397][    T7] scsi 0:0:0:4: rpm_put_suppliers: [AFT] Supp (0:0:0:49488) usage_count: 2 rpm_active: 1
>>>>
>>>>
>>>>
>>>>>
>>>>> --
>>>>>
>>>>> T196 is the context in which sd_probe() is invoked for this scsi device.
>>>>>
>>>>> [    7.974410][  T196] sd 0:0:0:4: [sde] Attached SCSI disk
>>>>> [    7.984188][  T196] sd 0:0:0:4: pm_runtime_put_suppliers: rpm_active: 2
>>>>> [    7.998424][  T196] sd 0:0:0:4: pm_runtime_put_suppliers: (0:0:0:49488) [BEF] usage_count: 4
>>>>> [    8.017320][  T196] sd 0:0:0:4: pm_runtime_put_suppliers: (0:0:0:49488) [AFT] usage_count: 1 rpm_active: 1
>>>>>
>>>>> The reference to the link is released after sd_probe() is completed.
>>>>> At this point, the rpm_active should be 2. And the rpm_active should become 1 when sd 0:0:0:4 actually suspends. But at the end of sd_probe() the suspend is only scheduled. However the supplier is now free to suspend.
>>>>>
>>>>> In this log, the usage_count of supplier becomes 0 here:
>>>>> [   11.963885][  T117] sd 0:0:0:7: rpm_put_suppliers: [BEF] Supp (0:0:0:49488) usage_count: 1 rpm_active: 2
>>>>> [   11.973821][  T117] sd 0:0:0:7: rpm_put_suppliers: [AFT] Supp (0:0:0:49488) usage_count: 0 rpm_active: 1
>>>>>
>>>>> However, the consumer sd 0:0:0:4 is still active but has released the reference to the supplier:
>>>>
>>>> If that is the case, then it is an error in PM not UFS.
>>>>
>>>> A second look at the code around rpm_put_suppliers() does look
>>>> potentially racy, since there does not appear to be anything
>>>> stopping the runtime_status changing between
>>>> spin_unlock_irq(&dev->power.lock) and device_links_read_lock().
>>>>
>>>> Rafael, can you comment?
>>>
>>> Indeed, if the device is suspending, changing its PM-runtime status to
>>> RPM_SUSPENDED and dropping its power.lock allows a concurrent
>>> rpm_resume() to run and resume the device before the suppliers can be
>>> suspended.
>>>
>>> That's incorrect and has been introduced by commit 44cc89f76464 ("PM:
>>> runtime: Update device status before letting suppliers suspend").
>>>
>>> It is probably better to revert that commit and address the original
>>> issue in a different way.
>>>
>> Hello,
>> One approach to address the original issue could be to prevent the scsi
>> devices from suspending until the probe is completed, perhaps?
> 
> I was talking about the original issue that commit 44cc89f76464
> attempted to address.
> 
> I'm not sure if and how it is related to the issue you have been debugging.
> 
Hi Rafael
Thanks for clarifying that.
Understood.
I was referring to the issue that I've been discussing with Adrian.

-asd

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
Linux Foundation Collaborative Project

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v10 1/2] scsi: ufs: Enable power management for wlun
  2021-03-18 17:58                                 ` Asutosh Das (asd)
@ 2021-03-18 19:16                                   ` Adrian Hunter
  2021-03-19  0:40                                     ` Asutosh Das (asd)
  0 siblings, 1 reply; 20+ messages in thread
From: Adrian Hunter @ 2021-03-18 19:16 UTC (permalink / raw)
  To: Asutosh Das (asd), Rafael J. Wysocki
  Cc: Alan Stern, Bart Van Assche, cang, Martin K. Petersen,
	open list:TARGET SUBSYSTEM, linux-arm-msm, Alim Akhtar,
	Avri Altman, James E.J. Bottomley, Krzysztof Kozlowski,
	Stanley Chu, Andy Gross, Bjorn Andersson, Steven Rostedt,
	Ingo Molnar, Matthias Brugger, Kiwoong Kim, Bean Huo, Lee Jones,
	Wei Yongjun, Dinghao Liu, Gustavo A. R. Silva, Tomas Winkler,
	Jaegeuk Kim, Satya Tangirala, open list,
	moderated list:ARM/SAMSUNG S3C, S5P AND EXYNOS ARM ARCHITECTURES,
	open list:ARM/SAMSUNG S3C, S5P AND EXYNOS ARM ARCHITECTURES,
	moderated list:UNIVERSAL FLASH STORAGE HOST CONTROLLER DRIVER...,
	Linux-PM mailing list

On 18/03/21 7:58 pm, Asutosh Das (asd) wrote:
> On 3/18/2021 10:54 AM, Rafael J. Wysocki wrote:
>> On Thu, Mar 18, 2021 at 6:33 PM Asutosh Das (asd)
>> <asutoshd@codeaurora.org> wrote:
>>>
>>> On 3/18/2021 7:00 AM, Rafael J. Wysocki wrote:
>>>> On Wed, Mar 17, 2021 at 7:37 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
>>>>>
>>>>> On 16/03/21 10:35 pm, Asutosh Das (asd) wrote:
>>>>>> On 3/16/2021 12:48 AM, Adrian Hunter wrote:
>>>>>>> On 16/03/21 12:22 am, Asutosh Das (asd) wrote:
>>>>>>>> On 3/14/2021 1:11 AM, Adrian Hunter wrote:
>>>>>>>>> On 10/03/21 5:04 am, Asutosh Das (asd) wrote:
>>>>>>>>>> On 3/9/2021 7:56 AM, Asutosh Das (asd) wrote:
>>>>>>>>>>> On 3/8/2021 9:17 AM, Rafael J. Wysocki wrote:
>>>>>>>>>>>> On Mon, Mar 8, 2021 at 5:21 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, Mar 6, 2021 at 5:17 PM Alan Stern <stern@rowland.harvard.edu> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Mar 05, 2021 at 06:54:24PM -0800, Asutosh Das (asd) wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Now during my testing I see a weird issue sometimes (1 in 7).
>>>>>>>>>>>>>>> Scenario - bootups
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Issue:
>>>>>>>>>>>>>>> The supplier 'ufs_device_wlun 0:0:0:49488' goes into runtime suspend even
>>>>>>>>>>>>>>> when one/more of its consumers are in RPM_ACTIVE state.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> *Log:
>>>>>>>>>>>>>>> [   10.056379][  T206] sd 0:0:0:1: [sdb] Synchronizing SCSI cache
>>>>>>>>>>>>>>> [   10.062497][  T113] sd 0:0:0:5: [sdf] Synchronizing SCSI cache
>>>>>>>>>>>>>>> [   10.356600][   T32] sd 0:0:0:7: [sdh] Synchronizing SCSI cache
>>>>>>>>>>>>>>> [   10.362944][  T174] sd 0:0:0:3: [sdd] Synchronizing SCSI cache
>>>>>>>>>>>>>>> [   10.696627][   T83] sd 0:0:0:2: [sdc] Synchronizing SCSI cache
>>>>>>>>>>>>>>> [   10.704562][  T170] sd 0:0:0:6: [sdg] Synchronizing SCSI cache
>>>>>>>>>>>>>>> [   10.980602][    T5] sd 0:0:0:0: [sda] Synchronizing SCSI cache
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> /** Printing all the consumer nodes of supplier **/
>>>>>>>>>>>>>>> [   10.987327][    T5] ufs_device_wlun 0:0:0:49488: usage-count @ suspend: 0
>>>>>>>>>>>>>>> <-- this is the usage_count
>>>>>>>>>>>>>>> [   10.994440][    T5] ufs_rpmb_wlun 0:0:0:49476: PM state - 2
>>>>>>>>>>>>>>> [   11.000402][    T5] scsi 0:0:0:49456: PM state - 2
>>>>>>>>>>>>>>> [   11.005453][    T5] sd 0:0:0:0: PM state - 2
>>>>>>>>>>>>>>> [   11.009958][    T5] sd 0:0:0:1: PM state - 2
>>>>>>>>>>>>>>> [   11.014469][    T5] sd 0:0:0:2: PM state - 2
>>>>>>>>>>>>>>> [   11.019072][    T5] sd 0:0:0:3: PM state - 2
>>>>>>>>>>>>>>> [   11.023595][    T5] sd 0:0:0:4: PM state - 0 << RPM_ACTIVE
>>>>>>>>>>>>>>> [   11.353298][    T5] sd 0:0:0:5: PM state - 2
>>>>>>>>>>>>>>> [   11.357726][    T5] sd 0:0:0:6: PM state - 2
>>>>>>>>>>>>>>> [   11.362155][    T5] sd 0:0:0:7: PM state - 2
>>>>>>>>>>>>>>> [   11.366584][    T5] ufshcd-qcom 1d84000.ufshc: __ufshcd_wl_suspend - 8709
>>>>>>>>>>>>>>> [   11.374366][    T5] ufs_device_wlun 0:0:0:49488: __ufshcd_wl_suspend -
>>>>>>>>>>>>>>> (0) has rpm_active flags
>>>>>>>>>>>>>
>>>>>>>>>>>>> Do you mean that rpm_active of the link between the consumer and the
>>>>>>>>>>>>> supplier is greater than 0 at this point and the consumer is
>>>>>>>>>>>>
>>>>>>>>>>>> I mean is rpm_active of the link greater than 1 (because 1 means "no
>>>>>>>>>>>> active references to the supplier")?
>>>>>>>>>>> Hi Rafael:
>>>>>>>>>>> No - it is not greater than 1.
>>>>>>>>>>>
>>>>>>>>>>> I'm trying to understand what's going on in it; will update when I've something.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> RPM_ACTIVE, but the supplier suspends successfully nevertheless?
>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [   11.383376][    T5] ufs_device_wlun 0:0:0:49488:
>>>>>>>>>>>>>>> ufshcd_wl_runtime_suspend <-- Supplier suspends fine.
>>>>>>>>>>>>>>> [   12.977318][  T174] sd 0:0:0:4: [sde] Synchronizing SCSI cache
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> And the the suspend of sde is stuck now:
>>>>>>>>>>>>>>> schedule+0x9c/0xe0
>>>>>>>>>>>>>>> schedule_timeout+0x40/0x128
>>>>>>>>>>>>>>> io_schedule_timeout+0x44/0x68
>>>>>>>>>>>>>>> wait_for_common_io+0x7c/0x100
>>>>>>>>>>>>>>> wait_for_completion_io+0x14/0x20
>>>>>>>>>>>>>>> blk_execute_rq+0x90/0xcc
>>>>>>>>>>>>>>> __scsi_execute+0x104/0x1c4
>>>>>>>>>>>>>>> sd_sync_cache+0xf8/0x2a0
>>>>>>>>>>>>>>> sd_suspend_common+0x74/0x11c
>>>>>>>>>>>>>>> sd_suspend_runtime+0x14/0x20
>>>>>>>>>>>>>>> scsi_runtime_suspend+0x64/0x94
>>>>>>>>>>>>>>> __rpm_callback+0x80/0x2a4
>>>>>>>>>>>>>>> rpm_suspend+0x308/0x614
>>>>>>>>>>>>>>> pm_runtime_work+0x98/0xa8
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I added 'DL_FLAG_RPM_ACTIVE' while creating links.
>>>>>>>>>>>>>>>            if (hba->sdev_ufs_device) {
>>>>>>>>>>>>>>>                    link = device_link_add(&sdev->sdev_gendev,
>>>>>>>>>>>>>>>                                        &hba->sdev_ufs_device->sdev_gendev,
>>>>>>>>>>>>>>>                                       DL_FLAG_PM_RUNTIME|DL_FLAG_RPM_ACTIVE);
>>>>>>>>>>>>>>> I didn't expect this to resolve the issue anyway and it didn't.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Another interesting point here is when I resume any of the above suspended
>>>>>>>>>>>>>>> consumers, it all goes back to normal, which is kind of expected. I tried
>>>>>>>>>>>>>>> resuming the consumer and the supplier is resumed and the supplier is
>>>>>>>>>>>>>>> suspended when all the consumers are suspended.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Any pointers on this issue please?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> @Bart/@Alan - Do you've any pointers please?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It's very noticeable that although you seem to have isolated a bug in
>>>>>>>>>>>>>> the power management subsystem (supplier goes into runtime suspend
>>>>>>>>>>>>>> even when one of its consumers is still active), you did not CC the
>>>>>>>>>>>>>> power management maintainer or mailing list.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have added the appropriate CC's.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks Alan!
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hello
>>>>>>>>>> I & Can (thanks CanG) debugged this further:
>>>>>>>>>>
>>>>>>>>>> Looks like this issue can occur if the sd probe is asynchronous.
>>>>>>>>>>
>>>>>>>>>> Essentially, the sd_probe() is done asynchronously and driver_probe_device() invokes pm_runtime_get_suppliers() before invoking sd_probe().
>>>>>>>>>>
>>>>>>>>>> But scsi_probe_and_add_lun() runs in a separate context.
>>>>>>>>>> So the scsi_autopm_put_device() invoked from scsi_scan_host() context reduces the link->rpm_active to 1. And sd_probe() invokes scsi_autopm_put_device() and starts a timer. And then driver_probe_device() invoked from __device_attach_async_helper context reduces the link->rpm_active to 1 thus enabling the supplier to suspend before the consumer suspends.
>>>>>>>>>>
>>>>>>>>>> So if:
>>>>>>>>>> Context T1:
>>>>>>>>>> [1] scsi_probe_and_add_lun()
>>>>>>>>>> [2]    |- scsi_autopm_put_device() - reduce the link->rpm_active to 1
>>>>>>>>>>
>>>>>>>>>> Context T2:
>>>>>>>>>> __device_attach_async_helper()
>>>>>>>>>>         |- driver_probe_device()
>>>>>>>>>>             |- sd_probe()
>>>>>>>>>> In between [1] and [2] say, driver_probe_device() -> sd_probe() is invoked in a separate context from __device_attach_async_helper().
>>>>>>>>>> The driver_probe_device() -> pm_runtime_get_suppliers() but [2] would reduce link->rpm_active to 1.
>>>>>>>>>> Then sd_probe() would invoke rpm_resume() and proceed as is.
>>>>>>>>>> When sd_probe() invokes scsi_autopm_put_device() it'd start a timer, dev->power.timer_autosuspends = 1.
>>>>>>>>>>
>>>>>>>>>> Now then, pm_runtime_put_suppliers() is invoked from driver_probe_device() and that makes the link->rpm_active = 1.
>>>>>>>>>> But by now, the corresponding 'sd dev' (consumer) usage_count = 0, state = RPM_ACTIVE and link->rpm_active = 1.
>>>>>>>>>> At this point of time, all other 'sd dev' (consumers) _may_ be suspended or active but would have the link->rpm_active = 1.
>>>>>>>>>
>>>>>>>>> Is this with DL_FLAG_RPM_ACTIVE?  In that case, wouldn't active
>>>>>>>>> consumers have link->rpm_active = 2 and also have incremented
>>>>>>>>> the supplier's usage_count?
>>>>>>
>>>>>> Yes this is with DL_FLAG_RPM_ACTIVE.
>>>>>>
>>>>>> Please let me share a log here:
>>>>>> BEF means - Before, AFT means After.
>>>>>>
>>>>>> [    6.843445][    T7] scsi 0:0:0:4: [UFSDBG]: ufshcd_setup_links:4779:  supp: usage_cnt: 3 Link - 0:0:0:49488 link-rpm_active: 2 avail_luns: 5
>>>>>> [    6.892545][    T7] scsi 0:0:0:4: pm_runtime_get_suppliers: (0:0:0:49488): supp: usage_count: 5 rpm_active: 4
>>>>>>
>>>>>> In the above log, T7 is the context in which this scsi device is being added - scsi_sysfs_add_sdev()
>>>>>>
>>>>>> [    6.931846][    T7] ufs_rpmb_wlun 0:0:0:4: [UFSDBG]: ufshcd_rpmb_probe:9692: invoked
>>>>>> [    6.941246][    T7] scsi 0:0:0:4: pm_runtime_put_suppliers: rpm_active: 4
>>>>>>
>>>>>> [    6.941246][    T7] scsi 0:0:0:4: pm_runtime_put_suppliers: (0:0:0:49488) [BEF] usage_count: 5
>>>>>> [    6.941247][    T7] scsi 0:0:0:4: pm_runtime_put_suppliers: (0:0:0:49488) [AFT] usage_count: 4 rpm_active: 3
>>>>>>
>>>>>> [    6.941267][    T7] scsi 0:0:0:4: rpm_put_suppliers: [BEF] Supp (0:0:0:49488) usage_count: 4 rpm_active: 3
>>>>>>
>>>>>> ------ T196 Context comes in while T7 is running ----------
>>>>>> [    6.941466][  T196] scsi 0:0:0:4: pm_runtime_get_suppliers: (0:0:0:49488): supp: usage_count: 5 rpm_active: 4
>>>>>> --------------------------------------------------------------
>>>>>>
>>>>>> [    7.788397][    T7] scsi 0:0:0:4: rpm_put_suppliers: [AFT] Supp (0:0:0:49488) usage_count: 2 rpm_active: 1
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> -- 
>>>>>>
>>>>>> T196 is the context in which sd_probe() is invoked for this scsi device.
>>>>>>
>>>>>> [    7.974410][  T196] sd 0:0:0:4: [sde] Attached SCSI disk
>>>>>> [    7.984188][  T196] sd 0:0:0:4: pm_runtime_put_suppliers: rpm_active: 2
>>>>>> [    7.998424][  T196] sd 0:0:0:4: pm_runtime_put_suppliers: (0:0:0:49488) [BEF] usage_count: 4
>>>>>> [    8.017320][  T196] sd 0:0:0:4: pm_runtime_put_suppliers: (0:0:0:49488) [AFT] usage_count: 1 rpm_active: 1
>>>>>>
>>>>>> The reference to the link is released after sd_probe() is completed.
>>>>>> At this point, the rpm_active should be 2. And the rpm_active should become 1 when sd 0:0:0:4 actually suspends. But at the end of sd_probe() the suspend is only scheduled. However the supplier is now free to suspend.
>>>>>>
>>>>>> In this log, the usage_count of supplier becomes 0 here:
>>>>>> [   11.963885][  T117] sd 0:0:0:7: rpm_put_suppliers: [BEF] Supp (0:0:0:49488) usage_count: 1 rpm_active: 2
>>>>>> [   11.973821][  T117] sd 0:0:0:7: rpm_put_suppliers: [AFT] Supp (0:0:0:49488) usage_count: 0 rpm_active: 1
>>>>>>
>>>>>> However, the consumer sd 0:0:0:4 is still active but has released the reference to the supplier:
>>>>>
>>>>> If that is the case, then it is an error in PM not UFS.
>>>>>
>>>>> A second look at the code around rpm_put_suppliers() does look
>>>>> potentially racy, since there does not appear to be anything
>>>>> stopping the runtime_status changing between
>>>>> spin_unlock_irq(&dev->power.lock) and device_links_read_lock().
>>>>>
>>>>> Rafael, can you comment?
>>>>
>>>> Indeed, if the device is suspending, changing its PM-runtime status to
>>>> RPM_SUSPENDED and dropping its power.lock allows a concurrent
>>>> rpm_resume() to run and resume the device before the suppliers can be
>>>> suspended.
>>>>
>>>> That's incorrect and has been introduced by commit 44cc89f76464 ("PM:
>>>> runtime: Update device status before letting suppliers suspend").
>>>>
>>>> It is probably better to revert that commit and address the original
>>>> issue in a different way.
>>>>
>>> Hello,
>>> One approach to address the original issue could be to prevent the scsi
>>> devices from suspending until the probe is completed, perhaps?
>>
>> I was talking about the original issue that commit 44cc89f76464
>> attempted to address.
>>
>> I'm not sure if and how it is related to the issue you have been debugging.
>>
> Hi Rafael
> Thanks for clarifying that.
> Understood.
> I was referring to the issue that I've been discussing with Adrian.

For test purposes, you could try reverting 44cc89f76464, making the
other changes to the UFS driver, and see if the device_links issue
goes away.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v10 1/2] scsi: ufs: Enable power management for wlun
  2021-03-18 19:16                                   ` Adrian Hunter
@ 2021-03-19  0:40                                     ` Asutosh Das (asd)
  0 siblings, 0 replies; 20+ messages in thread
From: Asutosh Das (asd) @ 2021-03-19  0:40 UTC (permalink / raw)
  To: Adrian Hunter, Rafael J. Wysocki
  Cc: Alan Stern, Bart Van Assche, cang, Martin K. Petersen,
	open list:TARGET SUBSYSTEM, linux-arm-msm, Alim Akhtar,
	Avri Altman, James E.J. Bottomley, Krzysztof Kozlowski,
	Stanley Chu, Andy Gross, Bjorn Andersson, Steven Rostedt,
	Ingo Molnar, Matthias Brugger, Kiwoong Kim, Bean Huo, Lee Jones,
	Wei Yongjun, Dinghao Liu, Gustavo A. R. Silva, Tomas Winkler,
	Jaegeuk Kim, Satya Tangirala, open list,
	moderated list:ARM/SAMSUNG S3C, S5P AND EXYNOS ARM ARCHITECTURES,
	open list:ARM/SAMSUNG S3C, S5P AND EXYNOS ARM ARCHITECTURES,
	moderated list:UNIVERSAL FLASH STORAGE HOST CONTROLLER DRIVER...,
	Linux-PM mailing list

On 3/18/2021 12:16 PM, Adrian Hunter wrote:
> On 18/03/21 7:58 pm, Asutosh Das (asd) wrote:
>> On 3/18/2021 10:54 AM, Rafael J. Wysocki wrote:
>>> On Thu, Mar 18, 2021 at 6:33 PM Asutosh Das (asd)
>>> <asutoshd@codeaurora.org> wrote:
>>>>
>>>> On 3/18/2021 7:00 AM, Rafael J. Wysocki wrote:
>>>>> On Wed, Mar 17, 2021 at 7:37 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
>>>>>>
>>>>>> On 16/03/21 10:35 pm, Asutosh Das (asd) wrote:
>>>>>>> On 3/16/2021 12:48 AM, Adrian Hunter wrote:
>>>>>>>> On 16/03/21 12:22 am, Asutosh Das (asd) wrote:
>>>>>>>>> On 3/14/2021 1:11 AM, Adrian Hunter wrote:
>>>>>>>>>> On 10/03/21 5:04 am, Asutosh Das (asd) wrote:
>>>>>>>>>>> On 3/9/2021 7:56 AM, Asutosh Das (asd) wrote:
>>>>>>>>>>>> On 3/8/2021 9:17 AM, Rafael J. Wysocki wrote:
>>>>>>>>>>>>> On Mon, Mar 8, 2021 at 5:21 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, Mar 6, 2021 at 5:17 PM Alan Stern <stern@rowland.harvard.edu> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Mar 05, 2021 at 06:54:24PM -0800, Asutosh Das (asd) wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Now during my testing I see a weird issue sometimes (1 in 7).
>>>>>>>>>>>>>>>> Scenario - bootups
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Issue:
>>>>>>>>>>>>>>>> The supplier 'ufs_device_wlun 0:0:0:49488' goes into runtime suspend even
>>>>>>>>>>>>>>>> when one/more of its consumers are in RPM_ACTIVE state.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> *Log:
>>>>>>>>>>>>>>>> [   10.056379][  T206] sd 0:0:0:1: [sdb] Synchronizing SCSI cache
>>>>>>>>>>>>>>>> [   10.062497][  T113] sd 0:0:0:5: [sdf] Synchronizing SCSI cache
>>>>>>>>>>>>>>>> [   10.356600][   T32] sd 0:0:0:7: [sdh] Synchronizing SCSI cache
>>>>>>>>>>>>>>>> [   10.362944][  T174] sd 0:0:0:3: [sdd] Synchronizing SCSI cache
>>>>>>>>>>>>>>>> [   10.696627][   T83] sd 0:0:0:2: [sdc] Synchronizing SCSI cache
>>>>>>>>>>>>>>>> [   10.704562][  T170] sd 0:0:0:6: [sdg] Synchronizing SCSI cache
>>>>>>>>>>>>>>>> [   10.980602][    T5] sd 0:0:0:0: [sda] Synchronizing SCSI cache
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> /** Printing all the consumer nodes of supplier **/
>>>>>>>>>>>>>>>> [   10.987327][    T5] ufs_device_wlun 0:0:0:49488: usage-count @ suspend: 0
>>>>>>>>>>>>>>>> <-- this is the usage_count
>>>>>>>>>>>>>>>> [   10.994440][    T5] ufs_rpmb_wlun 0:0:0:49476: PM state - 2
>>>>>>>>>>>>>>>> [   11.000402][    T5] scsi 0:0:0:49456: PM state - 2
>>>>>>>>>>>>>>>> [   11.005453][    T5] sd 0:0:0:0: PM state - 2
>>>>>>>>>>>>>>>> [   11.009958][    T5] sd 0:0:0:1: PM state - 2
>>>>>>>>>>>>>>>> [   11.014469][    T5] sd 0:0:0:2: PM state - 2
>>>>>>>>>>>>>>>> [   11.019072][    T5] sd 0:0:0:3: PM state - 2
>>>>>>>>>>>>>>>> [   11.023595][    T5] sd 0:0:0:4: PM state - 0 << RPM_ACTIVE
>>>>>>>>>>>>>>>> [   11.353298][    T5] sd 0:0:0:5: PM state - 2
>>>>>>>>>>>>>>>> [   11.357726][    T5] sd 0:0:0:6: PM state - 2
>>>>>>>>>>>>>>>> [   11.362155][    T5] sd 0:0:0:7: PM state - 2
>>>>>>>>>>>>>>>> [   11.366584][    T5] ufshcd-qcom 1d84000.ufshc: __ufshcd_wl_suspend - 8709
>>>>>>>>>>>>>>>> [   11.374366][    T5] ufs_device_wlun 0:0:0:49488: __ufshcd_wl_suspend -
>>>>>>>>>>>>>>>> (0) has rpm_active flags
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Do you mean that rpm_active of the link between the consumer and the
>>>>>>>>>>>>>> supplier is greater than 0 at this point and the consumer is
>>>>>>>>>>>>>
>>>>>>>>>>>>> I mean is rpm_active of the link greater than 1 (because 1 means "no
>>>>>>>>>>>>> active references to the supplier")?
>>>>>>>>>>>> Hi Rafael:
>>>>>>>>>>>> No - it is not greater than 1.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm trying to understand what's going on in it; will update when I've something.
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> RPM_ACTIVE, but the supplier suspends successfully nevertheless?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [   11.383376][    T5] ufs_device_wlun 0:0:0:49488:
>>>>>>>>>>>>>>>> ufshcd_wl_runtime_suspend <-- Supplier suspends fine.
>>>>>>>>>>>>>>>> [   12.977318][  T174] sd 0:0:0:4: [sde] Synchronizing SCSI cache
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> And the the suspend of sde is stuck now:
>>>>>>>>>>>>>>>> schedule+0x9c/0xe0
>>>>>>>>>>>>>>>> schedule_timeout+0x40/0x128
>>>>>>>>>>>>>>>> io_schedule_timeout+0x44/0x68
>>>>>>>>>>>>>>>> wait_for_common_io+0x7c/0x100
>>>>>>>>>>>>>>>> wait_for_completion_io+0x14/0x20
>>>>>>>>>>>>>>>> blk_execute_rq+0x90/0xcc
>>>>>>>>>>>>>>>> __scsi_execute+0x104/0x1c4
>>>>>>>>>>>>>>>> sd_sync_cache+0xf8/0x2a0
>>>>>>>>>>>>>>>> sd_suspend_common+0x74/0x11c
>>>>>>>>>>>>>>>> sd_suspend_runtime+0x14/0x20
>>>>>>>>>>>>>>>> scsi_runtime_suspend+0x64/0x94
>>>>>>>>>>>>>>>> __rpm_callback+0x80/0x2a4
>>>>>>>>>>>>>>>> rpm_suspend+0x308/0x614
>>>>>>>>>>>>>>>> pm_runtime_work+0x98/0xa8
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I added 'DL_FLAG_RPM_ACTIVE' while creating links.
>>>>>>>>>>>>>>>>             if (hba->sdev_ufs_device) {
>>>>>>>>>>>>>>>>                     link = device_link_add(&sdev->sdev_gendev,
>>>>>>>>>>>>>>>>                                         &hba->sdev_ufs_device->sdev_gendev,
>>>>>>>>>>>>>>>>                                        DL_FLAG_PM_RUNTIME|DL_FLAG_RPM_ACTIVE);
>>>>>>>>>>>>>>>> I didn't expect this to resolve the issue anyway and it didn't.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Another interesting point here is when I resume any of the above suspended
>>>>>>>>>>>>>>>> consumers, it all goes back to normal, which is kind of expected. I tried
>>>>>>>>>>>>>>>> resuming the consumer and the supplier is resumed and the supplier is
>>>>>>>>>>>>>>>> suspended when all the consumers are suspended.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Any pointers on this issue please?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> @Bart/@Alan - Do you've any pointers please?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It's very noticeable that although you seem to have isolated a bug in
>>>>>>>>>>>>>>> the power management subsystem (supplier goes into runtime suspend
>>>>>>>>>>>>>>> even when one of its consumers is still active), you did not CC the
>>>>>>>>>>>>>>> power management maintainer or mailing list.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have added the appropriate CC's.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks Alan!
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hello
>>>>>>>>>>> I & Can (thanks CanG) debugged this further:
>>>>>>>>>>>
>>>>>>>>>>> Looks like this issue can occur if the sd probe is asynchronous.
>>>>>>>>>>>
>>>>>>>>>>> Essentially, the sd_probe() is done asynchronously and driver_probe_device() invokes pm_runtime_get_suppliers() before invoking sd_probe().
>>>>>>>>>>>
>>>>>>>>>>> But scsi_probe_and_add_lun() runs in a separate context.
>>>>>>>>>>> So the scsi_autopm_put_device() invoked from scsi_scan_host() context reduces the link->rpm_active to 1. And sd_probe() invokes scsi_autopm_put_device() and starts a timer. And then driver_probe_device() invoked from __device_attach_async_helper context reduces the link->rpm_active to 1 thus enabling the supplier to suspend before the consumer suspends.
>>>>>>>>>>>
>>>>>>>>>>> So if:
>>>>>>>>>>> Context T1:
>>>>>>>>>>> [1] scsi_probe_and_add_lun()
>>>>>>>>>>> [2]    |- scsi_autopm_put_device() - reduce the link->rpm_active to 1
>>>>>>>>>>>
>>>>>>>>>>> Context T2:
>>>>>>>>>>> __device_attach_async_helper()
>>>>>>>>>>>          |- driver_probe_device()
>>>>>>>>>>>              |- sd_probe()
>>>>>>>>>>> In between [1] and [2] say, driver_probe_device() -> sd_probe() is invoked in a separate context from __device_attach_async_helper().
>>>>>>>>>>> The driver_probe_device() -> pm_runtime_get_suppliers() but [2] would reduce link->rpm_active to 1.
>>>>>>>>>>> Then sd_probe() would invoke rpm_resume() and proceed as is.
>>>>>>>>>>> When sd_probe() invokes scsi_autopm_put_device() it'd start a timer, dev->power.timer_autosuspends = 1.
>>>>>>>>>>>
>>>>>>>>>>> Now then, pm_runtime_put_suppliers() is invoked from driver_probe_device() and that makes the link->rpm_active = 1.
>>>>>>>>>>> But by now, the corresponding 'sd dev' (consumer) usage_count = 0, state = RPM_ACTIVE and link->rpm_active = 1.
>>>>>>>>>>> At this point of time, all other 'sd dev' (consumers) _may_ be suspended or active but would have the link->rpm_active = 1.
>>>>>>>>>>
>>>>>>>>>> Is this with DL_FLAG_RPM_ACTIVE?  In that case, wouldn't active
>>>>>>>>>> consumers have link->rpm_active = 2 and also have incremented
>>>>>>>>>> the supplier's usage_count?
>>>>>>>
>>>>>>> Yes this is with DL_FLAG_RPM_ACTIVE.
>>>>>>>
>>>>>>> Please let me share a log here:
>>>>>>> BEF means - Before, AFT means After.
>>>>>>>
>>>>>>> [    6.843445][    T7] scsi 0:0:0:4: [UFSDBG]: ufshcd_setup_links:4779:  supp: usage_cnt: 3 Link - 0:0:0:49488 link-rpm_active: 2 avail_luns: 5
>>>>>>> [    6.892545][    T7] scsi 0:0:0:4: pm_runtime_get_suppliers: (0:0:0:49488): supp: usage_count: 5 rpm_active: 4
>>>>>>>
>>>>>>> In the above log, T7 is the context in which this scsi device is being added - scsi_sysfs_add_sdev()
>>>>>>>
>>>>>>> [    6.931846][    T7] ufs_rpmb_wlun 0:0:0:4: [UFSDBG]: ufshcd_rpmb_probe:9692: invoked
>>>>>>> [    6.941246][    T7] scsi 0:0:0:4: pm_runtime_put_suppliers: rpm_active: 4
>>>>>>>
>>>>>>> [    6.941246][    T7] scsi 0:0:0:4: pm_runtime_put_suppliers: (0:0:0:49488) [BEF] usage_count: 5
>>>>>>> [    6.941247][    T7] scsi 0:0:0:4: pm_runtime_put_suppliers: (0:0:0:49488) [AFT] usage_count: 4 rpm_active: 3
>>>>>>>
>>>>>>> [    6.941267][    T7] scsi 0:0:0:4: rpm_put_suppliers: [BEF] Supp (0:0:0:49488) usage_count: 4 rpm_active: 3
>>>>>>>
>>>>>>> ------ T196 Context comes in while T7 is running ----------
>>>>>>> [    6.941466][  T196] scsi 0:0:0:4: pm_runtime_get_suppliers: (0:0:0:49488): supp: usage_count: 5 rpm_active: 4
>>>>>>> --------------------------------------------------------------
>>>>>>>
>>>>>>> [    7.788397][    T7] scsi 0:0:0:4: rpm_put_suppliers: [AFT] Supp (0:0:0:49488) usage_count: 2 rpm_active: 1
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> -- 
>>>>>>>
>>>>>>> T196 is the context in which sd_probe() is invoked for this scsi device.
>>>>>>>
>>>>>>> [    7.974410][  T196] sd 0:0:0:4: [sde] Attached SCSI disk
>>>>>>> [    7.984188][  T196] sd 0:0:0:4: pm_runtime_put_suppliers: rpm_active: 2
>>>>>>> [    7.998424][  T196] sd 0:0:0:4: pm_runtime_put_suppliers: (0:0:0:49488) [BEF] usage_count: 4
>>>>>>> [    8.017320][  T196] sd 0:0:0:4: pm_runtime_put_suppliers: (0:0:0:49488) [AFT] usage_count: 1 rpm_active: 1
>>>>>>>
>>>>>>> The reference to the link is released after sd_probe() is completed.
>>>>>>> At this point, the rpm_active should be 2. And the rpm_active should become 1 when sd 0:0:0:4 actually suspends. But at the end of sd_probe() the suspend is only scheduled. However the supplier is now free to suspend.
>>>>>>>
>>>>>>> In this log, the usage_count of supplier becomes 0 here:
>>>>>>> [   11.963885][  T117] sd 0:0:0:7: rpm_put_suppliers: [BEF] Supp (0:0:0:49488) usage_count: 1 rpm_active: 2
>>>>>>> [   11.973821][  T117] sd 0:0:0:7: rpm_put_suppliers: [AFT] Supp (0:0:0:49488) usage_count: 0 rpm_active: 1
>>>>>>>
>>>>>>> However, the consumer sd 0:0:0:4 is still active but has released the reference to the supplier:
>>>>>>
>>>>>> If that is the case, then it is an error in PM not UFS.
>>>>>>
>>>>>> A second look at the code around rpm_put_suppliers() does look
>>>>>> potentially racy, since there does not appear to be anything
>>>>>> stopping the runtime_status changing between
>>>>>> spin_unlock_irq(&dev->power.lock) and device_links_read_lock().
>>>>>>
>>>>>> Rafael, can you comment?
>>>>>
>>>>> Indeed, if the device is suspending, changing its PM-runtime status to
>>>>> RPM_SUSPENDED and dropping its power.lock allows a concurrent
>>>>> rpm_resume() to run and resume the device before the suppliers can be
>>>>> suspended.
>>>>>
>>>>> That's incorrect and has been introduced by commit 44cc89f76464 ("PM:
>>>>> runtime: Update device status before letting suppliers suspend").
>>>>>
>>>>> It is probably better to revert that commit and address the original
>>>>> issue in a different way.
>>>>>
>>>> Hello,
>>>> One approach to address the original issue could be to prevent the scsi
>>>> devices from suspending until the probe is completed, perhaps?
>>>
>>> I was talking about the original issue that commit 44cc89f76464
>>> attempted to address.
>>>
>>> I'm not sure if and how it is related to the issue you have been debugging.
>>>
>> Hi Rafael
>> Thanks for clarifying that.
>> Understood.
>> I was referring to the issue that I've been discussing with Adrian.
> 
> For test purposes, you could try reverting 44cc89f76464, making the
> other changes to the UFS driver, and see if the device_links issue
> goes away.
> 
Hi Adrian

Thanks for the review and comments.

I addressed the comments and tested it. I still see the device_links 
issue. I've pushed the change that I tested as v12.

# grep -H . /sys/bus/scsi/devices/*/power/runtime_status 
          <
/sys/bus/scsi/devices/0:0:0:0/power/runtime_status:suspending <
/sys/bus/scsi/devices/0:0:0:1/power/runtime_status:suspended
/sys/bus/scsi/devices/0:0:0:2/power/runtime_status:suspended
/sys/bus/scsi/devices/0:0:0:3/power/runtime_status:suspending <
/sys/bus/scsi/devices/0:0:0:4/power/runtime_status:suspending <
/sys/bus/scsi/devices/0:0:0:49456/power/runtime_status:suspended
/sys/bus/scsi/devices/0:0:0:49476/power/runtime_status:suspended
/sys/bus/scsi/devices/0:0:0:49488/power/runtime_status:suspended <

I think it couldn't resolve the issue because we're not stopping the 
scsi devices from suspending after the async probe is scheduled.

-asd

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
Linux Foundation Collaborative Project

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2021-03-19  0:42 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <cover.1614725302.git.asutoshd@codeaurora.org>
     [not found] ` <0576d6eae15486740c25767e2d8805f7e94eb79d.1614725302.git.asutoshd@codeaurora.org>
     [not found]   ` <85086647-7292-b0a2-d842-290818bd2858@intel.com>
     [not found]     ` <6e98724d-2e75-d1fe-188f-a7010f86c509@codeaurora.org>
2021-03-06 16:16       ` [PATCH v10 1/2] scsi: ufs: Enable power management for wlun Alan Stern
2021-03-08 16:21         ` Rafael J. Wysocki
2021-03-08 17:17           ` Rafael J. Wysocki
2021-03-09 15:56             ` Asutosh Das (asd)
2021-03-10  3:04               ` Asutosh Das (asd)
2021-03-10  3:14                 ` Alan Stern
2021-03-10  4:04                   ` Asutosh Das (asd)
2021-03-10 16:27                     ` Alan Stern
2021-03-10 16:39                       ` Asutosh Das (asd)
2021-03-14  9:11                 ` Adrian Hunter
2021-03-15 22:22                   ` Asutosh Das (asd)
2021-03-16  7:48                     ` Adrian Hunter
2021-03-16 20:35                       ` Asutosh Das (asd)
2021-03-17  6:37                         ` Adrian Hunter
2021-03-18 14:00                           ` Rafael J. Wysocki
2021-03-18 17:27                             ` Asutosh Das (asd)
2021-03-18 17:54                               ` Rafael J. Wysocki
2021-03-18 17:58                                 ` Asutosh Das (asd)
2021-03-18 19:16                                   ` Adrian Hunter
2021-03-19  0:40                                     ` Asutosh Das (asd)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).