* [PATCH 0/2] PM: runtime: Fix potential I/O hang
@ 2025-11-26 10:16 Yang Yang
  2025-11-26 10:16 ` [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable Yang Yang
                   ` (2 more replies)
  0 siblings, 3 replies; 44+ messages in thread
From: Yang Yang @ 2025-11-26 10:16 UTC (permalink / raw)
To: Jens Axboe, Rafael J. Wysocki, Pavel Machek, Len Brown,
Greg Kroah-Hartman, Danilo Krummrich, linux-block, linux-kernel,
linux-pm
Cc: Yang Yang
Yang Yang (2):
PM: runtime: Fix I/O hang due to race between resume and runtime
disable
blk-mq: Fix I/O hang caused by incomplete device resume
block/blk-pm.c | 1 +
drivers/base/power/runtime.c | 3 ++-
include/linux/pm.h | 1 +
3 files changed, 4 insertions(+), 1 deletion(-)
--
2.34.1
^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable
  2025-11-26 10:16 [PATCH 0/2] PM: runtime: Fix potential I/O hang Yang Yang
@ 2025-11-26 10:16 ` Yang Yang
  2025-11-26 11:30   ` Rafael J. Wysocki
  2025-11-26 10:16 ` [PATCH 2/2] blk-mq: Fix I/O hang caused by incomplete device resume Yang Yang
  2025-11-26 11:31 ` [PATCH 0/2] PM: runtime: Fix potential I/O hang Rafael J. Wysocki
  2 siblings, 1 reply; 44+ messages in thread
From: Yang Yang @ 2025-11-26 10:16 UTC (permalink / raw)
To: Jens Axboe, Rafael J. Wysocki, Pavel Machek, Len Brown,
	Greg Kroah-Hartman, Danilo Krummrich, linux-block, linux-kernel,
	linux-pm
Cc: Yang Yang

We observed the following hung task during our test:

[ 3987.095999] INFO: task "kworker/u32:7":239 blocked for more than 188 seconds.
[ 3987.096017] task:kworker/u32:7 state:D stack:0 pid:239 tgid:239 ppid:2 flags:0x00000408
[ 3987.096042] Workqueue: writeback wb_workfn (flush-254:59)
[ 3987.096069] Call trace:
[ 3987.096073]  __switch_to+0x1a0/0x318
[ 3987.096089]  __schedule+0xa38/0xf9c
[ 3987.096104]  schedule+0x74/0x10c
[ 3987.096118]  __bio_queue_enter+0xb8/0x178
[ 3987.096132]  blk_mq_submit_bio+0x104/0x728
[ 3987.096145]  __submit_bio+0xa0/0x23c
[ 3987.096159]  submit_bio_noacct_nocheck+0x164/0x330
[ 3987.096173]  submit_bio_noacct+0x348/0x468
[ 3987.096186]  submit_bio+0x17c/0x198
[ 3987.096199]  f2fs_submit_write_bio+0x44/0xe8
[ 3987.096211]  __submit_merged_bio+0x40/0x11c
[ 3987.096222]  __submit_merged_write_cond+0xcc/0x1f8
[ 3987.096233]  f2fs_write_data_pages+0xbb8/0xd0c
[ 3987.096246]  do_writepages+0xe0/0x2f4
[ 3987.096255]  __writeback_single_inode+0x44/0x4ac
[ 3987.096272]  writeback_sb_inodes+0x30c/0x538
[ 3987.096289]  __writeback_inodes_wb+0x9c/0xec
[ 3987.096305]  wb_writeback+0x158/0x440
[ 3987.096321]  wb_workfn+0x388/0x5d4
[ 3987.096335]  process_scheduled_works+0x1c4/0x45c
[ 3987.096346]  worker_thread+0x32c/0x3e8
[ 3987.096356]  kthread+0x11c/0x1b0
[ 3987.096372]  ret_from_fork+0x10/0x20

T1:                                            T2:
blk_queue_enter
  blk_pm_resume_queue
    pm_request_resume
      __pm_runtime_resume(dev, RPM_ASYNC)
        rpm_resume                             __pm_runtime_disable
          dev->power.request_pending = true      dev->power.disable_depth++
          queue_work(pm_wq, &dev->power.work)    __pm_runtime_barrier
  wait_event                                       cancel_work_sync(&dev->power.work)

T1 queues the work item, which is then cancelled by T2 before it starts
execution. As a result, q->dev cannot be resumed, and T1 waits here for
a long time.

Signed-off-by: Yang Yang <yang.yang@vivo.com>
---
 drivers/base/power/runtime.c | 3 ++-
 include/linux/pm.h           | 1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/base/power/runtime.c b/drivers/base/power/runtime.c
index 1b11a3cd4acc..fc9bf3fb3bb7 100644
--- a/drivers/base/power/runtime.c
+++ b/drivers/base/power/runtime.c
@@ -1533,7 +1533,8 @@ void __pm_runtime_disable(struct device *dev, bool check_resume)
 	 * means there probably is some I/O to process and disabling runtime PM
 	 * shouldn't prevent the device from processing the I/O.
 	 */
-	if (check_resume && dev->power.request_pending &&
+	if ((check_resume || dev->power.force_check_resume) &&
+	    dev->power.request_pending &&
 	    dev->power.request == RPM_REQ_RESUME) {
 		/*
 		 * Prevent suspends and idle notifications from being carried
diff --git a/include/linux/pm.h b/include/linux/pm.h
index cc7b2dc28574..4eb20569cdbc 100644
--- a/include/linux/pm.h
+++ b/include/linux/pm.h
@@ -708,6 +708,7 @@ struct dev_pm_info {
 	bool			use_autosuspend:1;
 	bool			timer_autosuspends:1;
 	bool			memalloc_noio:1;
+	bool			force_check_resume:1;
 	unsigned int		links_count;
 	enum rpm_request	request;
 	enum rpm_status		runtime_status;
--
2.34.1
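[Editor's note: the lost-wakeup pattern described above can be modeled in a few lines of userspace C. This is an illustrative sketch only — the field and function names mirror the kernel's `dev_pm_info` fields, but the "work queue" is reduced to a flag and nothing here is kernel code.]

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal model of the device's runtime-PM bookkeeping. */
struct fake_dev {
	bool request_pending;	/* dev->power.request_pending */
	bool work_queued;	/* &dev->power.work sits on pm_wq */
	int disable_depth;	/* dev->power.disable_depth */
	bool resumed;
};

/* T1: pm_request_resume() queues an asynchronous resume. */
static void fake_request_resume(struct fake_dev *d)
{
	d->request_pending = true;
	d->work_queued = true;		/* queue_work(pm_wq, ...) */
}

/* T2: __pm_runtime_disable(dev, check_resume). */
static void fake_runtime_disable(struct fake_dev *d, bool check_resume)
{
	if (check_resume && d->request_pending) {
		d->resumed = true;	/* carry out the pending resume */
		d->request_pending = false;
	}
	d->disable_depth++;
	d->work_queued = false;		/* cancel_work_sync(...) */
}

/* Runs the T1/T2 interleaving; returns true if the device got resumed. */
bool race_outcome(bool check_resume)
{
	struct fake_dev d = { 0 };

	fake_request_resume(&d);		/* T1 */
	fake_runtime_disable(&d, check_resume);	/* T2 */
	/*
	 * With check_resume == false, the queued work is cancelled before
	 * it runs and nothing ever resumes the device, so T1's
	 * wait_event() in blk_queue_enter() would never finish.
	 */
	return d.resumed;
}
```

With `check_resume` false (as passed from device_suspend_late()) the pending resume is silently thrown away, which is exactly the hang window the patch targets; with it true (or with the proposed `force_check_resume` bit set) the pending request is honoured before runtime PM is disabled.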
* Re: [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable
  2025-11-26 10:16 ` [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable Yang Yang
@ 2025-11-26 11:30   ` Rafael J. Wysocki
  2025-11-26 11:59     ` YangYang
  2025-11-26 18:06     ` Bart Van Assche
  0 siblings, 2 replies; 44+ messages in thread
From: Rafael J. Wysocki @ 2025-11-26 11:30 UTC (permalink / raw)
To: Yang Yang
Cc: Jens Axboe, Rafael J. Wysocki, Pavel Machek, Len Brown,
	Greg Kroah-Hartman, Danilo Krummrich, linux-block, linux-kernel,
	linux-pm

On Wed, Nov 26, 2025 at 11:17 AM Yang Yang <yang.yang@vivo.com> wrote:
>
> We observed the following hung task during our test:
>
> [...]
>
> T1:                                            T2:
> blk_queue_enter
>   blk_pm_resume_queue
>     pm_request_resume

Shouldn't this be pm_runtime_resume() rather?

>       __pm_runtime_resume(dev, RPM_ASYNC)
>         rpm_resume                             __pm_runtime_disable
>           dev->power.request_pending = true      dev->power.disable_depth++
>           queue_work(pm_wq, &dev->power.work)    __pm_runtime_barrier
>   wait_event                                       cancel_work_sync(&dev->power.work)
>
> T1 queues the work item, which is then cancelled by T2 before it starts
> execution. As a result, q->dev cannot be resumed, and T1 waits here for
> a long time.
>
> [...]
> -	if (check_resume && dev->power.request_pending &&
> +	if ((check_resume || dev->power.force_check_resume) &&
> +	    dev->power.request_pending &&
> 	    dev->power.request == RPM_REQ_RESUME) {

There are only two cases in which false is passed to
__pm_runtime_disable(): one is in device_suspend_late(), and I don't
think that's relevant here, and the other is in pm_runtime_remove(),
which gets called when the device is going away.

So apparently, blk_pm_resume_queue() races with the device going away.
Is this even expected to happen?

If so, wouldn't it be better to modify pm_runtime_remove() to pass
true to __pm_runtime_disable() instead of making these ad hoc changes?
* Re: [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable
  2025-11-26 11:30   ` Rafael J. Wysocki
@ 2025-11-26 11:59     ` YangYang
  2025-11-26 12:36       ` Rafael J. Wysocki
  1 sibling, 1 reply; 44+ messages in thread
From: YangYang @ 2025-11-26 11:59 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Jens Axboe, Pavel Machek, Len Brown, Greg Kroah-Hartman,
	Danilo Krummrich, linux-block, linux-kernel, linux-pm

On 2025/11/26 19:30, Rafael J. Wysocki wrote:
> On Wed, Nov 26, 2025 at 11:17 AM Yang Yang <yang.yang@vivo.com> wrote:
>>
>> We observed the following hung task during our test:
>>
>> [...]
>>
>> T1:                                            T2:
>> blk_queue_enter
>>   blk_pm_resume_queue
>>     pm_request_resume
>
> Shouldn't this be pm_runtime_resume() rather?

I'm not sure about that, I'll check if pm_runtime_resume() should be
used here instead.

>>       __pm_runtime_resume(dev, RPM_ASYNC)
>>         rpm_resume                             __pm_runtime_disable
>>           dev->power.request_pending = true      dev->power.disable_depth++
>>           queue_work(pm_wq, &dev->power.work)    __pm_runtime_barrier
>>   wait_event                                       cancel_work_sync(&dev->power.work)
>>
>> T1 queues the work item, which is then cancelled by T2 before it starts
>> execution. As a result, q->dev cannot be resumed, and T1 waits here for
>> a long time.
>>
>> [...]
>> -	if (check_resume && dev->power.request_pending &&
>> +	if ((check_resume || dev->power.force_check_resume) &&
>> +	    dev->power.request_pending &&
>> 	    dev->power.request == RPM_REQ_RESUME) {
>
> There are only two cases in which false is passed to
> __pm_runtime_disable(): one is in device_suspend_late(), and I don't
> think that's relevant here, and the other is in pm_runtime_remove(),
> which gets called when the device is going away.
>
> So apparently, blk_pm_resume_queue() races with the device going away.
> Is this even expected to happen?
>
> If so, wouldn't it be better to modify pm_runtime_remove() to pass
> true to __pm_runtime_disable() instead of making these ad hoc changes?

Sorry, I didn't make it clear in my previous message.
I can confirm that __pm_runtime_disable() is called from
device_suspend_late(), and this issue occurs during system suspend.
* Re: [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable
  2025-11-26 11:59     ` YangYang
@ 2025-11-26 12:36       ` Rafael J. Wysocki
  2025-11-26 15:33         ` Bart Van Assche
  0 siblings, 1 reply; 44+ messages in thread
From: Rafael J. Wysocki @ 2025-11-26 12:36 UTC (permalink / raw)
To: YangYang
Cc: Rafael J. Wysocki, Jens Axboe, Pavel Machek, Len Brown,
	Greg Kroah-Hartman, Danilo Krummrich, linux-block, linux-kernel,
	linux-pm

On Wed, Nov 26, 2025 at 12:59 PM YangYang <yang.yang@vivo.com> wrote:
>
> On 2025/11/26 19:30, Rafael J. Wysocki wrote:
> > On Wed, Nov 26, 2025 at 11:17 AM Yang Yang <yang.yang@vivo.com> wrote:
> >>
> >> We observed the following hung task during our test:
> >>
> >> [...]
> >>
> >> T1:                                            T2:
> >> blk_queue_enter
> >>   blk_pm_resume_queue
> >>     pm_request_resume
> >
> > Shouldn't this be pm_runtime_resume() rather?
>
> I'm not sure about that, I'll check if pm_runtime_resume() should be
> used here instead.

Well, the code as is now schedules an async resume of the device and
then waits for it to complete. It would be more straightforward to
resume the device synchronously IMV.

> >>       __pm_runtime_resume(dev, RPM_ASYNC)
> >>         rpm_resume                             __pm_runtime_disable
> >>           dev->power.request_pending = true      dev->power.disable_depth++
> >>           queue_work(pm_wq, &dev->power.work)    __pm_runtime_barrier
> >>   wait_event                                       cancel_work_sync(&dev->power.work)
> >>
> >> T1 queues the work item, which is then cancelled by T2 before it starts
> >> execution. As a result, q->dev cannot be resumed, and T1 waits here for
> >> a long time.
> >>
> >> [...]
> >> -	if (check_resume && dev->power.request_pending &&
> >> +	if ((check_resume || dev->power.force_check_resume) &&
> >> +	    dev->power.request_pending &&
> >> 	    dev->power.request == RPM_REQ_RESUME) {
> >
> > There are only two cases in which false is passed to
> > __pm_runtime_disable(): one is in device_suspend_late(), and I don't
> > think that's relevant here, and the other is in pm_runtime_remove(),
> > which gets called when the device is going away.
> >
> > So apparently, blk_pm_resume_queue() races with the device going away.
> > Is this even expected to happen?
> >
> > If so, wouldn't it be better to modify pm_runtime_remove() to pass
> > true to __pm_runtime_disable() instead of making these ad hoc changes?
>
> Sorry, I didn't make it clear in my previous message.
> I can confirm that __pm_runtime_disable() is called from
> device_suspend_late(), and this issue occurs during system suspend.

Interesting, because the runtime PM workqueue is frozen at this point,
so waiting for a work item in it to complete is pointless.

What the patch does is to declare that the device can be
runtime-resumed in device_suspend_late(), but this is kind of a hack
IMV as it potentially affects the device's parent etc.

If the device cannot stay in runtime suspend across the entire system
suspend transition, it should be resumed (synchronously) earlier, in
device_suspend() or in device_prepare() even.
* Re: [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable
  2025-11-26 12:36       ` Rafael J. Wysocki
@ 2025-11-26 15:33         ` Bart Van Assche
  2025-11-26 15:41           ` Rafael J. Wysocki
  0 siblings, 1 reply; 44+ messages in thread
From: Bart Van Assche @ 2025-11-26 15:33 UTC (permalink / raw)
To: Rafael J. Wysocki, YangYang
Cc: Jens Axboe, Pavel Machek, Len Brown, Greg Kroah-Hartman,
	Danilo Krummrich, linux-block, linux-kernel, linux-pm

On 11/26/25 4:36 AM, Rafael J. Wysocki wrote:
> Well, the code as is now schedules an async resume of the device and
> then waits for it to complete. It would be more straightforward to
> resume the device synchronously IMV.

That would increase the depth of the call stack significantly. I'm not
sure that's safe in this context.

Thanks,

Bart.
* Re: [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable
  2025-11-26 15:33         ` Bart Van Assche
@ 2025-11-26 15:41           ` Rafael J. Wysocki
  2025-11-26 18:40             ` Bart Van Assche
  0 siblings, 1 reply; 44+ messages in thread
From: Rafael J. Wysocki @ 2025-11-26 15:41 UTC (permalink / raw)
To: Bart Van Assche
Cc: Rafael J. Wysocki, YangYang, Jens Axboe, Pavel Machek, Len Brown,
	Greg Kroah-Hartman, Danilo Krummrich, linux-block, linux-kernel,
	linux-pm

On Wed, Nov 26, 2025 at 4:34 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 11/26/25 4:36 AM, Rafael J. Wysocki wrote:
> > Well, the code as is now schedules an async resume of the device and
> > then waits for it to complete. It would be more straightforward to
> > resume the device synchronously IMV.
>
> That would increase the depth of the call stack significantly. I'm not
> sure that's safe in this context.

As it stands, you have a basic problem with respect to system
suspend/hibernation. As I said before, the PM workqueue is frozen
during system suspend/hibernation transitions, so waiting for an async
resume request to complete then is pointless.
* Re: [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable
  2025-11-26 15:41           ` Rafael J. Wysocki
@ 2025-11-26 18:40             ` Bart Van Assche
  2025-11-27 11:29               ` YangYang
  0 siblings, 1 reply; 44+ messages in thread
From: Bart Van Assche @ 2025-11-26 18:40 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: YangYang, Jens Axboe, Pavel Machek, Len Brown, Greg Kroah-Hartman,
	Danilo Krummrich, linux-block, linux-kernel, linux-pm

On 11/26/25 7:41 AM, Rafael J. Wysocki wrote:
> As it stands, you have a basic problem with respect to system
> suspend/hibernation. As I said before, the PM workqueue is frozen
> during system suspend/hibernation transitions, so waiting for an async
> resume request to complete then is pointless.

Agreed. I noticed that any attempt to call request_firmware() from
driver system resume callback functions causes a deadlock if these
calls happen before the block device has been resumed.

Thanks,

Bart.
* Re: [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable
  2025-11-26 18:40             ` Bart Van Assche
@ 2025-11-27 11:29               ` YangYang
  2025-11-27 12:44                 ` Rafael J. Wysocki
  2025-12-01 16:40                 ` Bart Van Assche
  1 sibling, 2 replies; 44+ messages in thread
From: YangYang @ 2025-11-27 11:29 UTC (permalink / raw)
To: Bart Van Assche, Rafael J. Wysocki
Cc: Jens Axboe, Pavel Machek, Len Brown, Greg Kroah-Hartman,
	Danilo Krummrich, linux-block, linux-kernel, linux-pm

On 2025/11/27 2:40, Bart Van Assche wrote:
> On 11/26/25 7:41 AM, Rafael J. Wysocki wrote:
>> As it stands, you have a basic problem with respect to system
>> suspend/hibernation. As I said before, the PM workqueue is frozen
>> during system suspend/hibernation transitions, so waiting for an async
>> resume request to complete then is pointless.
>
> Agreed. I noticed that any attempt to call request_firmware() from
> driver system resume callback functions causes a deadlock if these
> calls happen before the block device has been resumed.
>
> Thanks,
>
> Bart.

Does this patch look reasonable to you? It hasn't been fully tested
yet, but the resume is now performed synchronously.

diff --git a/block/blk-core.c b/block/blk-core.c
index 66fb2071d..041d29ba4 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -323,12 +323,15 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 	 * reordered.
 	 */
 	smp_rmb();
-	wait_event(q->mq_freeze_wq,
-		   (!q->mq_freeze_depth &&
-		    blk_pm_resume_queue(pm, q)) ||
-		   blk_queue_dying(q));
+check:
+	wait_event(q->mq_freeze_wq, !q->mq_freeze_depth);
+
 	if (blk_queue_dying(q))
 		return -ENODEV;
+	if (!blk_pm_resume_queue(pm, q)) {
+		pm_runtime_resume(q->dev);
+		goto check;
+	}
 }

 rwsem_acquire_read(&q->q_lockdep_map, 0, 0, _RET_IP_);
@@ -356,12 +359,15 @@ int __bio_queue_enter(struct request_queue *q, struct bio *bio)
 	 * reordered.
 	 */
 	smp_rmb();
-	wait_event(q->mq_freeze_wq,
-		   (!q->mq_freeze_depth &&
-		    blk_pm_resume_queue(false, q)) ||
-		   test_bit(GD_DEAD, &disk->state));
+check:
+	wait_event(q->mq_freeze_wq, !q->mq_freeze_depth);
+
 	if (test_bit(GD_DEAD, &disk->state))
 		goto dead;
+	if (!blk_pm_resume_queue(false, q)) {
+		pm_runtime_resume(q->dev);
+		goto check;
+	}
 }

 rwsem_acquire_read(&q->io_lockdep_map, 0, 0, _RET_IP_);
diff --git a/block/blk-pm.h b/block/blk-pm.h
index 8a5a0d4b3..c28fad105 100644
--- a/block/blk-pm.h
+++ b/block/blk-pm.h
@@ -12,7 +12,6 @@ static inline int blk_pm_resume_queue(const bool pm, struct request_queue *q)
 		return 1;	/* Nothing to do */
 	if (pm && q->rpm_status != RPM_SUSPENDED)
 		return 1;	/* Request allowed */
-	pm_request_resume(q->dev);
 	return 0;
 }
* Re: [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable
  2025-11-27 11:29               ` YangYang
@ 2025-11-27 12:44                 ` Rafael J. Wysocki
  2025-11-28  7:20                   ` YangYang
  0 siblings, 1 reply; 44+ messages in thread
From: Rafael J. Wysocki @ 2025-11-27 12:44 UTC (permalink / raw)
To: YangYang
Cc: Bart Van Assche, Rafael J. Wysocki, Jens Axboe, Pavel Machek,
	Len Brown, Greg Kroah-Hartman, Danilo Krummrich, linux-block,
	linux-kernel, linux-pm

On Thu, Nov 27, 2025 at 12:29 PM YangYang <yang.yang@vivo.com> wrote:
>
> On 2025/11/27 2:40, Bart Van Assche wrote:
> > On 11/26/25 7:41 AM, Rafael J. Wysocki wrote:
> >> As it stands, you have a basic problem with respect to system
> >> suspend/hibernation. As I said before, the PM workqueue is frozen
> >> during system suspend/hibernation transitions, so waiting for an async
> >> resume request to complete then is pointless.
> >
> > Agreed. I noticed that any attempt to call request_firmware() from
> > driver system resume callback functions causes a deadlock if these
> > calls happen before the block device has been resumed.
>
> Does this patch look reasonable to you? It hasn't been fully tested
> yet, but the resume is now performed synchronously.
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 66fb2071d..041d29ba4 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -323,12 +323,15 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
>  	 * reordered.
>  	 */
>  	smp_rmb();
> -	wait_event(q->mq_freeze_wq,
> -		   (!q->mq_freeze_depth &&
> -		    blk_pm_resume_queue(pm, q)) ||
> -		   blk_queue_dying(q));
> +check:
> +	wait_event(q->mq_freeze_wq, !q->mq_freeze_depth);

I think that you still need to check blk_queue_dying(q) under
wait_event() or you may not stop waiting when this happens.

> +
>  	if (blk_queue_dying(q))
>  		return -ENODEV;
> +	if (!blk_pm_resume_queue(pm, q)) {
> +		pm_runtime_resume(q->dev);
> +		goto check;
> +	}
>  }
>
>  rwsem_acquire_read(&q->q_lockdep_map, 0, 0, _RET_IP_);
> @@ -356,12 +359,15 @@ int __bio_queue_enter(struct request_queue *q, struct bio *bio)
>  	 * reordered.
>  	 */
>  	smp_rmb();
> -	wait_event(q->mq_freeze_wq,
> -		   (!q->mq_freeze_depth &&
> -		    blk_pm_resume_queue(false, q)) ||
> -		   test_bit(GD_DEAD, &disk->state));
> +check:
> +	wait_event(q->mq_freeze_wq, !q->mq_freeze_depth);

Analogously here, you may not stop waiting when test_bit(GD_DEAD,
&disk->state) is true.

> +
>  	if (test_bit(GD_DEAD, &disk->state))
>  		goto dead;
> +	if (!blk_pm_resume_queue(false, q)) {
> +		pm_runtime_resume(q->dev);
> +		goto check;
> +	}
>  }
>
>  rwsem_acquire_read(&q->io_lockdep_map, 0, 0, _RET_IP_);
> diff --git a/block/blk-pm.h b/block/blk-pm.h
> index 8a5a0d4b3..c28fad105 100644
> --- a/block/blk-pm.h
> +++ b/block/blk-pm.h
> @@ -12,7 +12,6 @@ static inline int blk_pm_resume_queue(const bool pm, struct request_queue *q)
>  		return 1;	/* Nothing to do */
>  	if (pm && q->rpm_status != RPM_SUSPENDED)
>  		return 1;	/* Request allowed */
> -	pm_request_resume(q->dev);
>  	return 0;
>  }

And I would rename blk_pm_resume_queue() to something like
blk_pm_queue_active() because it is a bit confusing as it stands.

Apart from the above remarks this makes sense to me FWIW.
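[Editor's note: the loop being discussed — dying check kept inside the wait condition, synchronous resume, then recheck from the top — can be sketched as a small userspace model. All names here are placeholders mirroring the block-layer ones; this is an illustrative sketch of the control flow, not the actual blk-core code.]

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal stand-in for the request queue state consulted by the loop. */
struct fake_queue {
	int freeze_depth;	/* q->mq_freeze_depth */
	bool dying;		/* blk_queue_dying(q) */
	bool rpm_suspended;	/* q->rpm_status == RPM_SUSPENDED */
};

/* Models a synchronous pm_runtime_resume(q->dev). */
static void fake_pm_runtime_resume(struct fake_queue *q)
{
	q->rpm_suspended = false;
}

/* Returns 0 on success, -1 (standing in for -ENODEV) if the queue dies. */
int fake_queue_enter(struct fake_queue *q)
{
	for (;;) {
		/*
		 * Models wait_event(q->mq_freeze_wq,
		 *                   !q->mq_freeze_depth || blk_queue_dying(q)):
		 * the dying check must stay in the wait condition, or the
		 * waiter never wakes if the queue dies while frozen.
		 */
		if (q->freeze_depth && !q->dying)
			continue;	/* would sleep on mq_freeze_wq */
		if (q->dying)
			return -1;	/* -ENODEV */
		if (q->rpm_suspended) {
			fake_pm_runtime_resume(q);  /* synchronous resume */
			continue;	/* recheck freeze/dying from the top */
		}
		return 0;
	}
}
```

The recheck after the resume matters: the queue may have been re-frozen or marked dying while the synchronous resume ran, so returning success without looping back would reopen the original window.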
* Re: [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable
  2025-11-27 12:44                 ` Rafael J. Wysocki
@ 2025-11-28  7:20                   ` YangYang
  0 siblings, 0 replies; 44+ messages in thread
From: YangYang @ 2025-11-28 7:20 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Bart Van Assche, Jens Axboe, Pavel Machek, Len Brown,
	Greg Kroah-Hartman, Danilo Krummrich, linux-block, linux-kernel,
	linux-pm

On 2025/11/27 20:44, Rafael J. Wysocki wrote:
> On Thu, Nov 27, 2025 at 12:29 PM YangYang <yang.yang@vivo.com> wrote:
>>
>> On 2025/11/27 2:40, Bart Van Assche wrote:
>>> On 11/26/25 7:41 AM, Rafael J. Wysocki wrote:
>>>> As it stands, you have a basic problem with respect to system
>>>> suspend/hibernation. As I said before, the PM workqueue is frozen
>>>> during system suspend/hibernation transitions, so waiting for an async
>>>> resume request to complete then is pointless.
>>>
>>> Agreed. I noticed that any attempt to call request_firmware() from
>>> driver system resume callback functions causes a deadlock if these
>>> calls happen before the block device has been resumed.
>>
>> Does this patch look reasonable to you? It hasn't been fully tested
>> yet, but the resume is now performed synchronously.
>>
>> [...]
>> +check:
>> +	wait_event(q->mq_freeze_wq, !q->mq_freeze_depth);
>
> I think that you still need to check blk_queue_dying(q) under
> wait_event() or you may not stop waiting when this happens.

Got it.

>> [...]
>> +check:
>> +	wait_event(q->mq_freeze_wq, !q->mq_freeze_depth);
>
> Analogously here, you may not stop waiting when test_bit(GD_DEAD,
> &disk->state) is true.

Got it.

>> [...]
>> -	pm_request_resume(q->dev);
>> 	return 0;
>> }
>
> And I would rename blk_pm_resume_queue() to something like
> blk_pm_queue_active() because it is a bit confusing as it stands.
>
> Apart from the above remarks this makes sense to me FWIW.

Got it. I'll fix these in the next version and run some tests before
sending it out. Thanks for the review.
* Re: [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable
  2025-11-27 11:29               ` YangYang
  2025-11-27 12:44                 ` Rafael J. Wysocki
@ 2025-12-01 16:40                 ` Bart Van Assche
  1 sibling, 0 replies; 44+ messages in thread
From: Bart Van Assche @ 2025-12-01 16:40 UTC (permalink / raw)
To: YangYang, Rafael J. Wysocki
Cc: Jens Axboe, Pavel Machek, Len Brown, Greg Kroah-Hartman,
	Danilo Krummrich, linux-block, linux-kernel, linux-pm

On 11/27/25 3:29 AM, YangYang wrote:
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 66fb2071d..041d29ba4 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -323,12 +323,15 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
>  	 * reordered.
>  	 */
>  	smp_rmb();
> -	wait_event(q->mq_freeze_wq,
> -		   (!q->mq_freeze_depth &&
> -		    blk_pm_resume_queue(pm, q)) ||
> -		   blk_queue_dying(q));
> +check:
> +	wait_event(q->mq_freeze_wq, !q->mq_freeze_depth);
> +
>  	if (blk_queue_dying(q))
>  		return -ENODEV;

This can't work. blk_mq_destroy_queue() freezes a request queue without
unfreezing it, so the above code will introduce a deadlock and/or a
use-after-free if it executes concurrently with blk_mq_destroy_queue().

Bart.
* Re: [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable 2025-11-26 11:30 ` Rafael J. Wysocki 2025-11-26 11:59 ` YangYang @ 2025-11-26 18:06 ` Bart Van Assche 2025-11-26 19:16 ` Rafael J. Wysocki 1 sibling, 1 reply; 44+ messages in thread From: Bart Van Assche @ 2025-11-26 18:06 UTC (permalink / raw) To: Rafael J. Wysocki, Yang Yang Cc: Jens Axboe, Pavel Machek, Len Brown, Greg Kroah-Hartman, Danilo Krummrich, linux-block, linux-kernel, linux-pm On 11/26/25 3:30 AM, Rafael J. Wysocki wrote: > On Wed, Nov 26, 2025 at 11:17 AM Yang Yang <yang.yang@vivo.com> wrote: >> T1: T2: >> blk_queue_enter >> blk_pm_resume_queue >> pm_request_resume > > Shouldn't this be pm_runtime_resume() rather? I tried to make that change on an Android device. As a result, the kernel complaint shown below appeared. My understanding is that sleeping in atomic context can trigger a deadlock and hence is not allowed. [ 13.728890][ T1] WARNING: CPU: 6 PID: 1 at kernel/sched/core.c:9714 __might_sleep+0x78/0x84 [ 13.758800][ T1] Call trace: [ 13.759027][ T1] __might_sleep+0x78/0x84 [ 13.759340][ T1] __pm_runtime_resume+0x40/0xb8 [ 13.759781][ T1] __bio_queue_enter+0xc0/0x1cc [ 13.760153][ T1] blk_mq_submit_bio+0x884/0xadc [ 13.760548][ T1] __submit_bio+0x2c8/0x49c [ 13.760879][ T1] __submit_bio_noacct_mq+0x38/0x88 [ 13.761242][ T1] submit_bio_noacct_nocheck+0x4fc/0x7b8 [ 13.761631][ T1] submit_bio+0x214/0x4c0 [ 13.761941][ T1] mpage_readahead+0x1b8/0x1fc [ 13.762284][ T1] blkdev_readahead+0x18/0x28 [ 13.762660][ T1] page_cache_ra_unbounded+0x310/0x4d8 [ 13.763072][ T1] page_cache_ra_order+0xc0/0x5b0 [ 13.763434][ T1] page_cache_sync_ra+0x17c/0x268 [ 13.763782][ T1] filemap_read+0x4c4/0x12f4 [ 13.764125][ T1] blkdev_read_iter+0x100/0x164 [ 13.764475][ T1] vfs_read+0x188/0x348 [ 13.764789][ T1] __se_sys_pread64+0x84/0xc8 [ 13.765180][ T1] __arm64_sys_pread64+0x1c/0x2c [ 13.765556][ T1] invoke_syscall+0x58/0xf0 [ 13.765876][ T1] do_el0_svc+0x8c/0xe0 [ 13.766172][ T1] 
el0_svc+0x50/0xd4 [ 13.766583][ T1] el0t_64_sync_handler+0x20/0xf4 [ 13.766932][ T1] el0t_64_sync+0x1bc/0x1c0 [ 13.767294][ T1] irq event stamp: 2589614 [ 13.767592][ T1] hardirqs last enabled at (2589613): [<ffffffc0800eaf24>] finish_lock_switch+0x70/0x108 [ 13.768283][ T1] hardirqs last disabled at (2589614): [<ffffffc0814b66f4>] el1_dbg+0x24/0x80 [ 13.768875][ T1] softirqs last enabled at (2589370): [<ffffffc080082a7c>] ____do_softirq+0x10/0x20 [ 13.769529][ T1] softirqs last disabled at (2589349): [<ffffffc080082a7c>] ____do_softirq+0x10/0x20 I think that the filemap_invalidate_lock_shared() call in page_cache_ra_unbounded() forbids sleeping in submit_bio(). Thanks, Bart. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable 2025-11-26 18:06 ` Bart Van Assche @ 2025-11-26 19:16 ` Rafael J. Wysocki 2025-11-26 19:34 ` Rafael J. Wysocki 0 siblings, 1 reply; 44+ messages in thread From: Rafael J. Wysocki @ 2025-11-26 19:16 UTC (permalink / raw) To: Bart Van Assche Cc: Rafael J. Wysocki, Yang Yang, Jens Axboe, Pavel Machek, Len Brown, Greg Kroah-Hartman, Danilo Krummrich, linux-block, linux-kernel, linux-pm On Wed, Nov 26, 2025 at 7:06 PM Bart Van Assche <bvanassche@acm.org> wrote: > > On 11/26/25 3:30 AM, Rafael J. Wysocki wrote: > > On Wed, Nov 26, 2025 at 11:17 AM Yang Yang <yang.yang@vivo.com> wrote: > >> T1: T2: > >> blk_queue_enter > >> blk_pm_resume_queue > >> pm_request_resume > > > > Shouldn't this be pm_runtime_resume() rather? > > I tried to make that change on an Android device. As a result, the > kernel complaint shown below appeared. My understanding is that sleeping > in atomic context can trigger a deadlock and hence is not allowed. 
> > [ 13.728890][ T1] WARNING: CPU: 6 PID: 1 at > kernel/sched/core.c:9714 __might_sleep+0x78/0x84 > [ 13.758800][ T1] Call trace: > [ 13.759027][ T1] __might_sleep+0x78/0x84 > [ 13.759340][ T1] __pm_runtime_resume+0x40/0xb8 > [ 13.759781][ T1] __bio_queue_enter+0xc0/0x1cc > [ 13.760153][ T1] blk_mq_submit_bio+0x884/0xadc > [ 13.760548][ T1] __submit_bio+0x2c8/0x49c > [ 13.760879][ T1] __submit_bio_noacct_mq+0x38/0x88 > [ 13.761242][ T1] submit_bio_noacct_nocheck+0x4fc/0x7b8 > [ 13.761631][ T1] submit_bio+0x214/0x4c0 > [ 13.761941][ T1] mpage_readahead+0x1b8/0x1fc > [ 13.762284][ T1] blkdev_readahead+0x18/0x28 > [ 13.762660][ T1] page_cache_ra_unbounded+0x310/0x4d8 > [ 13.763072][ T1] page_cache_ra_order+0xc0/0x5b0 > [ 13.763434][ T1] page_cache_sync_ra+0x17c/0x268 > [ 13.763782][ T1] filemap_read+0x4c4/0x12f4 > [ 13.764125][ T1] blkdev_read_iter+0x100/0x164 > [ 13.764475][ T1] vfs_read+0x188/0x348 > [ 13.764789][ T1] __se_sys_pread64+0x84/0xc8 > [ 13.765180][ T1] __arm64_sys_pread64+0x1c/0x2c > [ 13.765556][ T1] invoke_syscall+0x58/0xf0 > [ 13.765876][ T1] do_el0_svc+0x8c/0xe0 > [ 13.766172][ T1] el0_svc+0x50/0xd4 > [ 13.766583][ T1] el0t_64_sync_handler+0x20/0xf4 > [ 13.766932][ T1] el0t_64_sync+0x1bc/0x1c0 > [ 13.767294][ T1] irq event stamp: 2589614 > [ 13.767592][ T1] hardirqs last enabled at (2589613): > [<ffffffc0800eaf24>] finish_lock_switch+0x70/0x108 > [ 13.768283][ T1] hardirqs last disabled at (2589614): > [<ffffffc0814b66f4>] el1_dbg+0x24/0x80 > [ 13.768875][ T1] softirqs last enabled at (2589370): > [<ffffffc080082a7c>] ____do_softirq+0x10/0x20 > [ 13.769529][ T1] softirqs last disabled at (2589349): > [<ffffffc080082a7c>] ____do_softirq+0x10/0x20 > > I think that the filemap_invalidate_lock_shared() call in > page_cache_ra_unbounded() forbids sleeping in submit_bio(). The wait_event() macro in __bio_queue_enter() calls might_sleep() at the very beginning, so why would it not complain? 
IIUC, this is the WARN_ONCE() in __might_sleep() about the task state being different from TASK_RUNNING, which triggers because prepare_to_wait_event() changes the task state to TASK_UNINTERRUPTIBLE. This means that calling pm_runtime_resume() cannot be part of the wait_event() condition, so blk_pm_resume_queue() and the wait_event() macros involving it would need some rewriting. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable 2025-11-26 19:16 ` Rafael J. Wysocki @ 2025-11-26 19:34 ` Rafael J. Wysocki 2025-11-26 20:17 ` Rafael J. Wysocki 0 siblings, 1 reply; 44+ messages in thread From: Rafael J. Wysocki @ 2025-11-26 19:34 UTC (permalink / raw) To: Bart Van Assche Cc: Yang Yang, Jens Axboe, Pavel Machek, Len Brown, Greg Kroah-Hartman, Danilo Krummrich, linux-block, linux-kernel, linux-pm On Wed, Nov 26, 2025 at 8:16 PM Rafael J. Wysocki <rafael@kernel.org> wrote: > > On Wed, Nov 26, 2025 at 7:06 PM Bart Van Assche <bvanassche@acm.org> wrote: > > > > On 11/26/25 3:30 AM, Rafael J. Wysocki wrote: > > > On Wed, Nov 26, 2025 at 11:17 AM Yang Yang <yang.yang@vivo.com> wrote: > > >> T1: T2: > > >> blk_queue_enter > > >> blk_pm_resume_queue > > >> pm_request_resume > > > > > > Shouldn't this be pm_runtime_resume() rather? > > > > I tried to make that change on an Android device. As a result, the > > kernel complaint shown below appeared. My understanding is that sleeping > > in atomic context can trigger a deadlock and hence is not allowed. 
> > > > [ 13.728890][ T1] WARNING: CPU: 6 PID: 1 at > > kernel/sched/core.c:9714 __might_sleep+0x78/0x84 > > [ 13.758800][ T1] Call trace: > > [ 13.759027][ T1] __might_sleep+0x78/0x84 > > [ 13.759340][ T1] __pm_runtime_resume+0x40/0xb8 > > [ 13.759781][ T1] __bio_queue_enter+0xc0/0x1cc > > [ 13.760153][ T1] blk_mq_submit_bio+0x884/0xadc > > [ 13.760548][ T1] __submit_bio+0x2c8/0x49c > > [ 13.760879][ T1] __submit_bio_noacct_mq+0x38/0x88 > > [ 13.761242][ T1] submit_bio_noacct_nocheck+0x4fc/0x7b8 > > [ 13.761631][ T1] submit_bio+0x214/0x4c0 > > [ 13.761941][ T1] mpage_readahead+0x1b8/0x1fc > > [ 13.762284][ T1] blkdev_readahead+0x18/0x28 > > [ 13.762660][ T1] page_cache_ra_unbounded+0x310/0x4d8 > > [ 13.763072][ T1] page_cache_ra_order+0xc0/0x5b0 > > [ 13.763434][ T1] page_cache_sync_ra+0x17c/0x268 > > [ 13.763782][ T1] filemap_read+0x4c4/0x12f4 > > [ 13.764125][ T1] blkdev_read_iter+0x100/0x164 > > [ 13.764475][ T1] vfs_read+0x188/0x348 > > [ 13.764789][ T1] __se_sys_pread64+0x84/0xc8 > > [ 13.765180][ T1] __arm64_sys_pread64+0x1c/0x2c > > [ 13.765556][ T1] invoke_syscall+0x58/0xf0 > > [ 13.765876][ T1] do_el0_svc+0x8c/0xe0 > > [ 13.766172][ T1] el0_svc+0x50/0xd4 > > [ 13.766583][ T1] el0t_64_sync_handler+0x20/0xf4 > > [ 13.766932][ T1] el0t_64_sync+0x1bc/0x1c0 > > [ 13.767294][ T1] irq event stamp: 2589614 > > [ 13.767592][ T1] hardirqs last enabled at (2589613): > > [<ffffffc0800eaf24>] finish_lock_switch+0x70/0x108 > > [ 13.768283][ T1] hardirqs last disabled at (2589614): > > [<ffffffc0814b66f4>] el1_dbg+0x24/0x80 > > [ 13.768875][ T1] softirqs last enabled at (2589370): > > [<ffffffc080082a7c>] ____do_softirq+0x10/0x20 > > [ 13.769529][ T1] softirqs last disabled at (2589349): > > [<ffffffc080082a7c>] ____do_softirq+0x10/0x20 > > > > I think that the filemap_invalidate_lock_shared() call in > > page_cache_ra_unbounded() forbids sleeping in submit_bio(). 
> > The wait_event() macro in __bio_queue_enter() calls might_sleep() at > the very beginning, so why would it not complain? > > IIUC, this is the WARN_ONCE() in __might_sleep() about the task state > being different from TASK_RUNNING, which triggers because > prepare_to_wait_event() changes the task state to > TASK_UNINTERRUPTIBLE. > > This means that calling pm_runtime_resume() cannot be part of the > wait_event() condition, so blk_pm_resume_queue() and the wait_event() > macros involving it would need some rewriting. Interestingly enough, the pm_request_resume() call in blk_pm_resume_queue() is not even necessary in the __bio_queue_enter() case because pm is false there and it doesn't even check q->rpm_status. So in fact the resume is only necessary in blk_queue_enter() if pm is nonzero. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable 2025-11-26 19:34 ` Rafael J. Wysocki @ 2025-11-26 20:17 ` Rafael J. Wysocki 2025-11-26 21:10 ` Bart Van Assche 0 siblings, 1 reply; 44+ messages in thread From: Rafael J. Wysocki @ 2025-11-26 20:17 UTC (permalink / raw) To: Bart Van Assche Cc: Yang Yang, Jens Axboe, Pavel Machek, Len Brown, Greg Kroah-Hartman, Danilo Krummrich, linux-block, linux-kernel, linux-pm On Wednesday, November 26, 2025 8:34:54 PM CET Rafael J. Wysocki wrote: > On Wed, Nov 26, 2025 at 8:16 PM Rafael J. Wysocki <rafael@kernel.org> wrote: > > > > On Wed, Nov 26, 2025 at 7:06 PM Bart Van Assche <bvanassche@acm.org> wrote: > > > > > > On 11/26/25 3:30 AM, Rafael J. Wysocki wrote: > > > > On Wed, Nov 26, 2025 at 11:17 AM Yang Yang <yang.yang@vivo.com> wrote: > > > >> T1: T2: > > > >> blk_queue_enter > > > >> blk_pm_resume_queue > > > >> pm_request_resume > > > > > > > > Shouldn't this be pm_runtime_resume() rather? > > > > > > I tried to make that change on an Android device. As a result, the > > > kernel complaint shown below appeared. My understanding is that sleeping > > > in atomic context can trigger a deadlock and hence is not allowed. 
> > > > > > [ 13.728890][ T1] WARNING: CPU: 6 PID: 1 at > > > kernel/sched/core.c:9714 __might_sleep+0x78/0x84 > > > [ 13.758800][ T1] Call trace: > > > [ 13.759027][ T1] __might_sleep+0x78/0x84 > > > [ 13.759340][ T1] __pm_runtime_resume+0x40/0xb8 > > > [ 13.759781][ T1] __bio_queue_enter+0xc0/0x1cc > > > [ 13.760153][ T1] blk_mq_submit_bio+0x884/0xadc > > > [ 13.760548][ T1] __submit_bio+0x2c8/0x49c > > > [ 13.760879][ T1] __submit_bio_noacct_mq+0x38/0x88 > > > [ 13.761242][ T1] submit_bio_noacct_nocheck+0x4fc/0x7b8 > > > [ 13.761631][ T1] submit_bio+0x214/0x4c0 > > > [ 13.761941][ T1] mpage_readahead+0x1b8/0x1fc > > > [ 13.762284][ T1] blkdev_readahead+0x18/0x28 > > > [ 13.762660][ T1] page_cache_ra_unbounded+0x310/0x4d8 > > > [ 13.763072][ T1] page_cache_ra_order+0xc0/0x5b0 > > > [ 13.763434][ T1] page_cache_sync_ra+0x17c/0x268 > > > [ 13.763782][ T1] filemap_read+0x4c4/0x12f4 > > > [ 13.764125][ T1] blkdev_read_iter+0x100/0x164 > > > [ 13.764475][ T1] vfs_read+0x188/0x348 > > > [ 13.764789][ T1] __se_sys_pread64+0x84/0xc8 > > > [ 13.765180][ T1] __arm64_sys_pread64+0x1c/0x2c > > > [ 13.765556][ T1] invoke_syscall+0x58/0xf0 > > > [ 13.765876][ T1] do_el0_svc+0x8c/0xe0 > > > [ 13.766172][ T1] el0_svc+0x50/0xd4 > > > [ 13.766583][ T1] el0t_64_sync_handler+0x20/0xf4 > > > [ 13.766932][ T1] el0t_64_sync+0x1bc/0x1c0 > > > [ 13.767294][ T1] irq event stamp: 2589614 > > > [ 13.767592][ T1] hardirqs last enabled at (2589613): > > > [<ffffffc0800eaf24>] finish_lock_switch+0x70/0x108 > > > [ 13.768283][ T1] hardirqs last disabled at (2589614): > > > [<ffffffc0814b66f4>] el1_dbg+0x24/0x80 > > > [ 13.768875][ T1] softirqs last enabled at (2589370): > > > [<ffffffc080082a7c>] ____do_softirq+0x10/0x20 > > > [ 13.769529][ T1] softirqs last disabled at (2589349): > > > [<ffffffc080082a7c>] ____do_softirq+0x10/0x20 > > > > > > I think that the filemap_invalidate_lock_shared() call in > > > page_cache_ra_unbounded() forbids sleeping in submit_bio(). 
> > > > The wait_event() macro in __bio_queue_enter() calls might_sleep() at > > the very beginning, so why would it not complain? > > > > IIUC, this is the WARN_ONCE() in __might_sleep() about the task state > > being different from TASK_RUNNING, which triggers because > > prepare_to_wait_event() changes the task state to > > TASK_UNINTERRUPTIBLE. > > > > This means that calling pm_runtime_resume() cannot be part of the > > wait_event() condition, so blk_pm_resume_queue() and the wait_event() > > macros involving it would need some rewriting. > > Interestingly enough, the pm_request_resume() call in > blk_pm_resume_queue() is not even necessary in the __bio_queue_enter() > case because pm is false there and it doesn't even check > q->rpm_status. > > So in fact the resume is only necessary in blk_queue_enter() if pm is nonzero. If I'm not completely in the weeds, something like the patch below should be doable. Also, I'd consider using pm_runtime_get_noresume() and pm_runtime_put_noidle() in blk_queue_enter() and blk_queue_exit(), respectively, in the "pm != 0" case to prevent the device from suspending while the .q_usage_counter ref is held. --- block/blk-core.c | 6 +++--- block/blk-pm.h | 7 ++++--- 2 files changed, 7 insertions(+), 6 deletions(-) --- a/block/blk-core.c +++ b/block/blk-core.c @@ -309,6 +309,8 @@ int blk_queue_enter(struct request_queue if (flags & BLK_MQ_REQ_NOWAIT) return -EAGAIN; + /* if necessary, resume .dev (assume success). 
*/ + blk_pm_resume_queue(pm, q); /* * read pair of barrier in blk_freeze_queue_start(), we need to * order reading __PERCPU_REF_DEAD flag of .q_usage_counter and @@ -318,9 +320,7 @@ int blk_queue_enter(struct request_queue */ smp_rmb(); wait_event(q->mq_freeze_wq, - (!q->mq_freeze_depth && - blk_pm_resume_queue(pm, q)) || - blk_queue_dying(q)); + !q->mq_freeze_depth || blk_queue_dying(q)); if (blk_queue_dying(q)) return -ENODEV; } --- a/block/blk-pm.h +++ b/block/blk-pm.h @@ -10,9 +10,10 @@ static inline int blk_pm_resume_queue(co { if (!q->dev || !blk_queue_pm_only(q)) return 1; /* Nothing to do */ - if (pm && q->rpm_status != RPM_SUSPENDED) - return 1; /* Request allowed */ - pm_request_resume(q->dev); + + if (pm) + pm_runtime_resume(q->dev); + return 0; } ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable 2025-11-26 20:17 ` Rafael J. Wysocki @ 2025-11-26 21:10 ` Bart Van Assche 2025-11-26 21:30 ` Rafael J. Wysocki 0 siblings, 1 reply; 44+ messages in thread From: Bart Van Assche @ 2025-11-26 21:10 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Yang Yang, Jens Axboe, Pavel Machek, Len Brown, Greg Kroah-Hartman, Danilo Krummrich, linux-block, linux-kernel, linux-pm On 11/26/25 12:17 PM, Rafael J. Wysocki wrote: > --- a/block/blk-core.c > +++ b/block/blk-core.c > @@ -309,6 +309,8 @@ int blk_queue_enter(struct request_queue > if (flags & BLK_MQ_REQ_NOWAIT) > return -EAGAIN; > > + /* if necessary, resume .dev (assume success). */ > + blk_pm_resume_queue(pm, q); > /* > * read pair of barrier in blk_freeze_queue_start(), we need to > * order reading __PERCPU_REF_DEAD flag of .q_usage_counter and blk_queue_enter() may be called from the suspend path so I don't think that the above change will work. As an example, the UFS driver submits a SCSI START STOP UNIT command from its runtime suspend callback. The call chain is as follows: ufshcd_wl_runtime_suspend() __ufshcd_wl_suspend() ufshcd_set_dev_pwr_mode() ufshcd_execute_start_stop() scsi_execute_cmd() scsi_alloc_request() blk_queue_enter() blk_execute_rq() blk_mq_free_request() blk_queue_exit() Thanks, Bart. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable 2025-11-26 21:10 ` Bart Van Assche @ 2025-11-26 21:30 ` Rafael J. Wysocki 2025-11-26 22:47 ` Bart Van Assche 0 siblings, 1 reply; 44+ messages in thread From: Rafael J. Wysocki @ 2025-11-26 21:30 UTC (permalink / raw) To: Bart Van Assche Cc: Rafael J. Wysocki, Yang Yang, Jens Axboe, Pavel Machek, Len Brown, Greg Kroah-Hartman, Danilo Krummrich, linux-block, linux-kernel, linux-pm On Wed, Nov 26, 2025 at 10:11 PM Bart Van Assche <bvanassche@acm.org> wrote: > > On 11/26/25 12:17 PM, Rafael J. Wysocki wrote: > > --- a/block/blk-core.c > > +++ b/block/blk-core.c > > @@ -309,6 +309,8 @@ int blk_queue_enter(struct request_queue > > if (flags & BLK_MQ_REQ_NOWAIT) > > return -EAGAIN; > > > > + /* if necessary, resume .dev (assume success). */ > > + blk_pm_resume_queue(pm, q); > > /* > > * read pair of barrier in blk_freeze_queue_start(), we need to > > * order reading __PERCPU_REF_DEAD flag of .q_usage_counter and > > blk_queue_enter() may be called from the suspend path so I don't think > that the above change will work. Why would the existing code work then? Are you suggesting that q->rpm_status should still be checked before calling pm_runtime_resume() or do you mean something else? > As an example, the UFS driver submits a > SCSI START STOP UNIT command from its runtime suspend callback. The call > chain is as follows: > > ufshcd_wl_runtime_suspend() > __ufshcd_wl_suspend() > ufshcd_set_dev_pwr_mode() > ufshcd_execute_start_stop() > scsi_execute_cmd() > scsi_alloc_request() > blk_queue_enter() > blk_execute_rq() > blk_mq_free_request() > blk_queue_exit() In any case, calling pm_request_resume() from blk_pm_resume_queue() in the !pm case is a mistake. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable 2025-11-26 21:30 ` Rafael J. Wysocki @ 2025-11-26 22:47 ` Bart Van Assche 2025-11-27 12:34 ` Rafael J. Wysocki 0 siblings, 1 reply; 44+ messages in thread From: Bart Van Assche @ 2025-11-26 22:47 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Yang Yang, Jens Axboe, Pavel Machek, Len Brown, Greg Kroah-Hartman, Danilo Krummrich, linux-block, linux-kernel, linux-pm On 11/26/25 1:30 PM, Rafael J. Wysocki wrote: > On Wed, Nov 26, 2025 at 10:11 PM Bart Van Assche <bvanassche@acm.org> wrote: >> >> On 11/26/25 12:17 PM, Rafael J. Wysocki wrote: >>> --- a/block/blk-core.c >>> +++ b/block/blk-core.c >>> @@ -309,6 +309,8 @@ int blk_queue_enter(struct request_queue >>> if (flags & BLK_MQ_REQ_NOWAIT) >>> return -EAGAIN; >>> >>> + /* if necessary, resume .dev (assume success). */ >>> + blk_pm_resume_queue(pm, q); >>> /* >>> * read pair of barrier in blk_freeze_queue_start(), we need to >>> * order reading __PERCPU_REF_DEAD flag of .q_usage_counter and >> >> blk_queue_enter() may be called from the suspend path so I don't think >> that the above change will work. > > Why would the existing code work then? The existing code works reliably on a very large number of devices. Maybe there is a misunderstanding? RQF_PM / BLK_MQ_REQ_PM are set for requests that should be processed even if the power status is changing (RPM_SUSPENDING or RPM_RESUMING). The meaning of the 'pm' variable is as follows: process this request even if a power state change is ongoing. > Are you suggesting that q->rpm_status should still be checked before > calling pm_runtime_resume() or do you mean something else? The purpose of the code changes from a previous email is not entirely clear to me so I'm not sure what the code should look like. But to answer your question, calling blk_pm_resume_queue() if the runtime status is RPM_SUSPENDED should be safe. 
>> As an example, the UFS driver submits a >> SCSI START STOP UNIT command from its runtime suspend callback. The call >> chain is as follows: >> >> ufshcd_wl_runtime_suspend() >> __ufshcd_wl_suspend() >> ufshcd_set_dev_pwr_mode() >> ufshcd_execute_start_stop() >> scsi_execute_cmd() >> scsi_alloc_request() >> blk_queue_enter() >> blk_execute_rq() >> blk_mq_free_request() >> blk_queue_exit() > > In any case, calling pm_request_resume() from blk_pm_resume_queue() in > the !pm case is a mistake. Hmm ... we may disagree about this. Does what I wrote above make clear why blk_pm_resume_queue() is called if pm == false? Thanks, Bart. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable 2025-11-26 22:47 ` Bart Van Assche @ 2025-11-27 12:34 ` Rafael J. Wysocki 2025-12-01 9:46 ` YangYang 0 siblings, 1 reply; 44+ messages in thread From: Rafael J. Wysocki @ 2025-11-27 12:34 UTC (permalink / raw) To: Bart Van Assche Cc: Rafael J. Wysocki, Yang Yang, Jens Axboe, Pavel Machek, Len Brown, Greg Kroah-Hartman, Danilo Krummrich, linux-block, linux-kernel, linux-pm On Wed, Nov 26, 2025 at 11:47 PM Bart Van Assche <bvanassche@acm.org> wrote: > > On 11/26/25 1:30 PM, Rafael J. Wysocki wrote: > > On Wed, Nov 26, 2025 at 10:11 PM Bart Van Assche <bvanassche@acm.org> wrote: > >> > >> On 11/26/25 12:17 PM, Rafael J. Wysocki wrote: > >>> --- a/block/blk-core.c > >>> +++ b/block/blk-core.c > >>> @@ -309,6 +309,8 @@ int blk_queue_enter(struct request_queue > >>> if (flags & BLK_MQ_REQ_NOWAIT) > >>> return -EAGAIN; > >>> > >>> + /* if necessary, resume .dev (assume success). */ > >>> + blk_pm_resume_queue(pm, q); > >>> /* > >>> * read pair of barrier in blk_freeze_queue_start(), we need to > >>> * order reading __PERCPU_REF_DEAD flag of .q_usage_counter and > >> > >> blk_queue_enter() may be called from the suspend path so I don't think > >> that the above change will work. > > > > Why would the existing code work then? > > The existing code works reliably on a very large number of devices. Well, except that it doesn't work during system suspend and hibernation when the PM workqueue is frozen. I think that we agree here. This needs to be addressed because it may very well cause system suspend to deadlock. There are two possible ways to address it I can think of: 1. Changing blk_pm_resume_queue() and its users to carry out a synchronous resume of q->dev instead of calling pm_request_resume() and (effectively) waiting for the queued-up runtime resume of q->dev to take effect. This would be my preferred option, but at this point I'm not sure if it's viable. 2. 
Stop freezing the PM workqueue before system suspend/hibernation and adapt device_suspend_late() to that. This should be doable, even though it is a bit risky because it may uncover some latent bugs (the freezing of the PM workqueue has been there forever), but it wouldn't address the problem entirely because device_suspend_late() would still need to disable runtime PM for the device (and for some devices it is disabled earlier), so pm_request_resume() would just start to fail at that point and if blk_queue_enter() were called after that point for a device supporting runtime PM, it might deadlock. > Maybe there is a misunderstanding? RQF_PM / BLK_MQ_REQ_PM are set for > requests that should be processed even if the power status is changing > (RPM_SUSPENDING or RPM_RESUMING). The meaning of the 'pm' variable is > as follows: process this request even if a power state change is > ongoing. I see. The behavior depends on whether or not q->pm_only is set. If it is not set, both blk_queue_enter() and __bio_queue_enter() will allow the request to be processed. If q->pm_only is set, __bio_queue_enter() will wait until it gets cleared and in that case pm_request_resume(q->dev) is called to make that happen (did I get it right?). This is a bit fragile because what if the async resume of q->dev fails for some reason? You deadlock instead of failing the request. Unlike __bio_queue_enter(), blk_queue_enter() additionally checks the runtime PM status of the queue if q->pm_only is set and it will allow the request to be processed in that case so long as q->rpm_status is not RPM_SUSPENDED. However, if the queue status is RPM_SUSPENDED, pm_request_resume(q->dev) will be called like in the __bio_queue_enter() case. I'm not sure why pm_request_resume(q->dev) needs to be called from within blk_pm_resume_queue(). Arguably, it should be sufficient to call it once before using the wait_event() macro, if the conditions checked by blk_pm_resume_queue() are not met. 
> > Are you suggesting that q->rpm_status should still be checked before > > calling pm_runtime_resume() or do you mean something else? > The purpose of the code changes from a previous email is not entirely > clear to me so I'm not sure what the code should look like. But to > answer your question, calling blk_pm_resume_queue() if the runtime > status is RPM_SUSPENDED should be safe. > >> As an example, the UFS driver submits a > >> SCSI START STOP UNIT command from its runtime suspend callback. The call > >> chain is as follows: > >> > >> ufshcd_wl_runtime_suspend() > >> __ufshcd_wl_suspend() > >> ufshcd_set_dev_pwr_mode() > >> ufshcd_execute_start_stop() > >> scsi_execute_cmd() > >> scsi_alloc_request() > >> blk_queue_enter() > >> blk_execute_rq() > >> blk_mq_free_request() > >> blk_queue_exit() > > > > In any case, calling pm_request_resume() from blk_pm_resume_queue() in > > the !pm case is a mistake. > Hmm ... we may disagree about this. Does what I wrote above make clear > why blk_pm_resume_queue() is called if pm == false? Yes, it does, thanks! ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable 2025-11-27 12:34 ` Rafael J. Wysocki @ 2025-12-01 9:46 ` YangYang 2025-12-01 12:56 ` YangYang 2025-12-01 18:47 ` Rafael J. Wysocki 0 siblings, 2 replies; 44+ messages in thread From: YangYang @ 2025-12-01 9:46 UTC (permalink / raw) To: Rafael J. Wysocki, Bart Van Assche Cc: Jens Axboe, Pavel Machek, Len Brown, Greg Kroah-Hartman, Danilo Krummrich, linux-block, linux-kernel, linux-pm On 2025/11/27 20:34, Rafael J. Wysocki wrote: > On Wed, Nov 26, 2025 at 11:47 PM Bart Van Assche <bvanassche@acm.org> wrote: >> >> On 11/26/25 1:30 PM, Rafael J. Wysocki wrote: >>> On Wed, Nov 26, 2025 at 10:11 PM Bart Van Assche <bvanassche@acm.org> wrote: >>>> >>>> On 11/26/25 12:17 PM, Rafael J. Wysocki wrote: >>>>> --- a/block/blk-core.c >>>>> +++ b/block/blk-core.c >>>>> @@ -309,6 +309,8 @@ int blk_queue_enter(struct request_queue >>>>> if (flags & BLK_MQ_REQ_NOWAIT) >>>>> return -EAGAIN; >>>>> >>>>> + /* if necessary, resume .dev (assume success). */ >>>>> + blk_pm_resume_queue(pm, q); >>>>> /* >>>>> * read pair of barrier in blk_freeze_queue_start(), we need to >>>>> * order reading __PERCPU_REF_DEAD flag of .q_usage_counter and >>>> >>>> blk_queue_enter() may be called from the suspend path so I don't think >>>> that the above change will work. >>> >>> Why would the existing code work then? >> >> The existing code works reliably on a very large number of devices. > > Well, except that it doesn't work during system suspend and > hibernation when the PM workqueue is frozen. I think that we agree > here. > > This needs to be addressed because it may very well cause system > suspend to deadlock. > > There are two possible ways to address it I can think of: > > 1. Changing blk_pm_resume_queue() and its users to carry out a > synchronous resume of q->dev instead of calling pm_request_resume() > and (effectively) waiting for the queued-up runtime resume of q->dev > to take effect. 
>
> This would be my preferred option, but at this point I'm not sure if
> it's viable.
>

After __pm_runtime_disable() is called from device_suspend_late(), dev->power.disable_depth is set, preventing
rpm_resume() from making progress until the system resume completes, regardless of whether rpm_resume() is invoked
synchronously or asynchronously.
Performing a synchronous resume of q->dev seems to have a similar effect to removing the following code block from
__pm_runtime_barrier(), which is invoked by __pm_runtime_disable():

1428         if (dev->power.request_pending) {
1429                 dev->power.request = RPM_REQ_NONE;
1430                 spin_unlock_irq(&dev->power.lock);
1431
1432                 cancel_work_sync(&dev->power.work);
1433
1434                 spin_lock_irq(&dev->power.lock);
1435                 dev->power.request_pending = false;
1436         }

> 2. Stop freezing the PM workqueue before system suspend/hibernation
> and adapt device_suspend_late() to that.
>
> This should be doable, even though it is a bit risky because it may
> uncover some latent bugs (the freezing of the PM workqueue has been
> there forever), but it wouldn't address the problem entirely because
> device_suspend_late() would still need to disable runtime PM for the
> device (and for some devices it is disabled earlier), so
> pm_request_resume() would just start to fail at that point and if
> blk_queue_enter() were called after that point for a device supporting
> runtime PM, it might deadlock.
>
>> Maybe there is a misunderstanding? RQF_PM / BLK_MQ_REQ_PM are set for
>> requests that should be processed even if the power status is changing
>> (RPM_SUSPENDING or RPM_RESUMING). The meaning of the 'pm' variable is
>> as follows: process this request even if a power state change is
>> ongoing.
>
> I see.
>
> The behavior depends on whether or not q->pm_only is set. If it is
> not set, both blk_queue_enter() and __bio_queue_enter() will allow the
> request to be processed.
>
> If q->pm_only is set, __bio_queue_enter() will wait until it gets
> cleared and in that case pm_request_resume(q->dev) is called to make
> that happen (did I get it right?). This is a bit fragile because what
> if the async resume of q->dev fails for some reason? You deadlock
> instead of failing the request.
>
> Unlike __bio_queue_enter(), blk_queue_enter() additionally checks the
> runtime PM status of the queue if q->pm_only is set and it will allow
> the request to be processed in that case so long as q->rpm_status is
> not RPM_SUSPENDED. However, if the queue status is RPM_SUSPENDED,
> pm_request_resume(q->dev) will be called like in the
> __bio_queue_enter() case.
>
> I'm not sure why pm_request_resume(q->dev) needs to be called from
> within blk_pm_resume_queue(). Arguably, it should be sufficient to
> call it once before using the wait_event() macro, if the conditions
> checked by blk_pm_resume_queue() are not met.
>
>>> Are you suggesting that q->rpm_status should still be checked before
>>> calling pm_runtime_resume() or do you mean something else?
>> The purpose of the code changes from a previous email is not entirely
>> clear to me so I'm not sure what the code should look like. But to
>> answer your question, calling blk_pm_resume_queue() if the runtime
>> status is RPM_SUSPENDED should be safe.
>>>> As an example, the UFS driver submits a
>>>> SCSI START STOP UNIT command from its runtime suspend callback. The call
>>>> chain is as follows:
>>>>
>>>> ufshcd_wl_runtime_suspend()
>>>>   __ufshcd_wl_suspend()
>>>>     ufshcd_set_dev_pwr_mode()
>>>>       ufshcd_execute_start_stop()
>>>>         scsi_execute_cmd()
>>>>           scsi_alloc_request()
>>>>             blk_queue_enter()
>>>>           blk_execute_rq()
>>>>           blk_mq_free_request()
>>>>             blk_queue_exit()
>>>
>>> In any case, calling pm_request_resume() from blk_pm_resume_queue() in
>>> the !pm case is a mistake.
>> Hmm ... we may disagree about this. Does what I wrote above make clear
>> why blk_pm_resume_queue() is called if pm == false?
>
> Yes, it does, thanks!

^ permalink raw reply	[flat|nested] 44+ messages in thread
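The entry rules discussed above (pm_only, rpm_status, and the 'pm' flag) can be sketched as a small user-space model. The struct, enum values, and the may_enter_queue() helper below are illustrative stand-ins, not the actual kernel definitions:

```c
/* Simplified model of the queue-entry logic described above: a request
 * marked as a PM request may proceed while a power transition is in
 * flight, but every caller waits (after kicking off an async resume)
 * once the device is fully suspended. Hypothetical names throughout. */
#include <assert.h>
#include <stdbool.h>

enum rpm_status { RPM_ACTIVE, RPM_RESUMING, RPM_SUSPENDING, RPM_SUSPENDED };

struct queue_model {
	bool pm_only;               /* models q->pm_only */
	enum rpm_status rpm_status; /* models q->rpm_status */
	int resume_requests;        /* counts modeled pm_request_resume() calls */
};

/* Returns true when the caller may enter the queue. */
static bool may_enter_queue(struct queue_model *q, bool pm)
{
	if (!q->pm_only)
		return true;
	if (pm && q->rpm_status != RPM_SUSPENDED)
		return true;
	q->resume_requests++;	/* models the async pm_request_resume(q->dev) */
	return false;		/* caller goes to sleep in wait_event() */
}
```

As the discussion notes, the wait in the last branch is only safe if the asynchronous resume actually runs to completion at some point.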
* Re: [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable
  2025-12-01  9:46         ` YangYang
@ 2025-12-01 12:56           ` YangYang
  2025-12-01 18:55             ` Rafael J. Wysocki
  2025-12-01 18:47           ` Rafael J. Wysocki
  1 sibling, 1 reply; 44+ messages in thread
From: YangYang @ 2025-12-01 12:56 UTC (permalink / raw)
To: Rafael J. Wysocki, Bart Van Assche
Cc: Jens Axboe, Pavel Machek, Len Brown, Greg Kroah-Hartman,
	Danilo Krummrich, linux-block, linux-kernel, linux-pm

On 2025/12/1 17:46, YangYang wrote:
> On 2025/11/27 20:34, Rafael J. Wysocki wrote:
>> On Wed, Nov 26, 2025 at 11:47 PM Bart Van Assche <bvanassche@acm.org> wrote:
>>>
>>> On 11/26/25 1:30 PM, Rafael J. Wysocki wrote:
>>>> On Wed, Nov 26, 2025 at 10:11 PM Bart Van Assche <bvanassche@acm.org> wrote:
>>>>>
>>>>> On 11/26/25 12:17 PM, Rafael J. Wysocki wrote:
>>>>>> --- a/block/blk-core.c
>>>>>> +++ b/block/blk-core.c
>>>>>> @@ -309,6 +309,8 @@ int blk_queue_enter(struct request_queue
>>>>>>           if (flags & BLK_MQ_REQ_NOWAIT)
>>>>>>                   return -EAGAIN;
>>>>>>
>>>>>> +         /* if necessary, resume .dev (assume success). */
>>>>>> +         blk_pm_resume_queue(pm, q);
>>>>>>           /*
>>>>>>            * read pair of barrier in blk_freeze_queue_start(), we need to
>>>>>>            * order reading __PERCPU_REF_DEAD flag of .q_usage_counter and
>>>>>
>>>>> blk_queue_enter() may be called from the suspend path so I don't think
>>>>> that the above change will work.
>>>>
>>>> Why would the existing code work then?
>>>
>>> The existing code works reliably on a very large number of devices.
>>
>> Well, except that it doesn't work during system suspend and
>> hibernation when the PM workqueue is frozen. I think that we agree
>> here.
>>
>> This needs to be addressed because it may very well cause system
>> suspend to deadlock.
>>
>> There are two possible ways to address it I can think of:
>>
>> 1. Changing blk_pm_resume_queue() and its users to carry out a
>> synchronous resume of q->dev instead of calling pm_request_resume()
>> and (effectively) waiting for the queued-up runtime resume of q->dev
>> to take effect.
>>
>> This would be my preferred option, but at this point I'm not sure if
>> it's viable.
>>
>
> After __pm_runtime_disable() is called from device_suspend_late(), dev->power.disable_depth is set, preventing
> rpm_resume() from making progress until the system resume completes, regardless of whether rpm_resume() is invoked
> synchronously or asynchronously.
> Performing a synchronous resume of q->dev seems to have a similar effect to removing the following code block from
> __pm_runtime_barrier(), which is invoked by __pm_runtime_disable():
>
> 1428         if (dev->power.request_pending) {
> 1429                 dev->power.request = RPM_REQ_NONE;
> 1430                 spin_unlock_irq(&dev->power.lock);
> 1431
> 1432                 cancel_work_sync(&dev->power.work);
> 1433
> 1434                 spin_lock_irq(&dev->power.lock);
> 1435                 dev->power.request_pending = false;
> 1436         }
>

Since both synchronous and asynchronous resumes face similar issues,
it may be sufficient to keep using the asynchronous resume path as long as
pending work items are not canceled while the PM workqueue is frozen.
This allows the pending work to proceed normally once the PM workqueue
is unfrozen.

---
 drivers/base/power/main.c    |  2 +-
 drivers/base/power/runtime.c | 17 +++++++++++------
 include/linux/pm_runtime.h   |  6 +++---
 3 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c
index 1de1cd72b616..d5c3d7a6777e 100644
--- a/drivers/base/power/main.c
+++ b/drivers/base/power/main.c
@@ -1635,7 +1635,7 @@ static void device_suspend_late(struct device *dev, pm_message_t state, bool asy
 	 * Disable runtime PM for the device without checking if there is a
 	 * pending resume request for it.
 	 */
-	__pm_runtime_disable(dev, false);
+	__pm_runtime_disable(dev, false, true);
 
 	if (dev->power.syscore)
 		goto Skip;
diff --git a/drivers/base/power/runtime.c b/drivers/base/power/runtime.c
index 1b11a3cd4acc..ff3fdfba2dc8 100644
--- a/drivers/base/power/runtime.c
+++ b/drivers/base/power/runtime.c
@@ -1421,11 +1421,16 @@ EXPORT_SYMBOL_GPL(__pm_runtime_set_status);
  *
  * Should be called under dev->power.lock with interrupts disabled.
  */
-static void __pm_runtime_barrier(struct device *dev)
+static void __pm_runtime_barrier(struct device *dev, bool frozen)
 {
 	pm_runtime_deactivate_timer(dev);
 
-	if (dev->power.request_pending) {
+	/*
+	 * If the PM workqueue has already been frozen, the following
+	 * operations are unnecessary. This allows any pending work to
+	 * continue execution once the PM workqueue is unfrozen.
+	 */
+	if (!frozen && dev->power.request_pending) {
 		dev->power.request = RPM_REQ_NONE;
 		spin_unlock_irq(&dev->power.lock);
 
@@ -1485,7 +1490,7 @@ int pm_runtime_barrier(struct device *dev)
 		retval = 1;
 	}
 
-	__pm_runtime_barrier(dev);
+	__pm_runtime_barrier(dev, false);
 
 	spin_unlock_irq(&dev->power.lock);
 	pm_runtime_put_noidle(dev);
@@ -1519,7 +1524,7 @@ void pm_runtime_unblock(struct device *dev)
 	spin_unlock_irq(&dev->power.lock);
 }
 
-void __pm_runtime_disable(struct device *dev, bool check_resume)
+void __pm_runtime_disable(struct device *dev, bool check_resume, bool frozen)
 {
 	spin_lock_irq(&dev->power.lock);
 
@@ -1550,7 +1555,7 @@ void __pm_runtime_disable(struct device *dev, bool check_resume)
 	update_pm_runtime_accounting(dev);
 
 	if (!dev->power.disable_depth++) {
-		__pm_runtime_barrier(dev);
+		__pm_runtime_barrier(dev, frozen);
 		dev->power.last_status = dev->power.runtime_status;
 	}
 
@@ -1893,7 +1898,7 @@ void pm_runtime_reinit(struct device *dev)
  */
 void pm_runtime_remove(struct device *dev)
 {
-	__pm_runtime_disable(dev, false);
+	__pm_runtime_disable(dev, false, false);
 	pm_runtime_reinit(dev);
 }
 
diff --git a/include/linux/pm_runtime.h b/include/linux/pm_runtime.h
index 0b436e15f4cd..102060a9ebc7 100644
--- a/include/linux/pm_runtime.h
+++ b/include/linux/pm_runtime.h
@@ -80,7 +80,7 @@ extern int pm_runtime_barrier(struct device *dev);
 extern bool pm_runtime_block_if_disabled(struct device *dev);
 extern void pm_runtime_unblock(struct device *dev);
 extern void pm_runtime_enable(struct device *dev);
-extern void __pm_runtime_disable(struct device *dev, bool check_resume);
+extern void __pm_runtime_disable(struct device *dev, bool check_resume, bool frozen);
 extern void pm_runtime_allow(struct device *dev);
 extern void pm_runtime_forbid(struct device *dev);
 extern void pm_runtime_no_callbacks(struct device *dev);
@@ -288,7 +288,7 @@ static inline int pm_runtime_barrier(struct device *dev) { return 0; }
 static inline bool pm_runtime_block_if_disabled(struct device *dev) { return true; }
 static inline void pm_runtime_unblock(struct device *dev) {}
 static inline void pm_runtime_enable(struct device *dev) {}
-static inline void __pm_runtime_disable(struct device *dev, bool c) {}
+static inline void __pm_runtime_disable(struct device *dev, bool c, bool f) {}
 static inline bool pm_runtime_blocked(struct device *dev) { return true; }
 static inline void pm_runtime_allow(struct device *dev) {}
 static inline void pm_runtime_forbid(struct device *dev) {}
@@ -775,7 +775,7 @@ static inline int pm_runtime_set_suspended(struct device *dev)
  */
 static inline void pm_runtime_disable(struct device *dev)
 {
-	__pm_runtime_disable(dev, true);
+	__pm_runtime_disable(dev, true, false);
 }
 
 /**
-- 
* Re: [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable
  2025-12-01 12:56           ` YangYang
@ 2025-12-01 18:55             ` Rafael J. Wysocki
  2025-12-02 10:33               ` YangYang
  0 siblings, 1 reply; 44+ messages in thread
From: Rafael J. Wysocki @ 2025-12-01 18:55 UTC (permalink / raw)
To: YangYang
Cc: Rafael J. Wysocki, Bart Van Assche, Jens Axboe, Pavel Machek,
	Len Brown, Greg Kroah-Hartman, Danilo Krummrich, linux-block,
	linux-kernel, linux-pm

On Mon, Dec 1, 2025 at 1:56 PM YangYang <yang.yang@vivo.com> wrote:
>
> On 2025/12/1 17:46, YangYang wrote:
> > On 2025/11/27 20:34, Rafael J. Wysocki wrote:
> >> On Wed, Nov 26, 2025 at 11:47 PM Bart Van Assche <bvanassche@acm.org> wrote:
> >>>
> >>> On 11/26/25 1:30 PM, Rafael J. Wysocki wrote:
> >>>> On Wed, Nov 26, 2025 at 10:11 PM Bart Van Assche <bvanassche@acm.org> wrote:
> >>>>>
> >>>>> On 11/26/25 12:17 PM, Rafael J. Wysocki wrote:
> >>>>>> --- a/block/blk-core.c
> >>>>>> +++ b/block/blk-core.c
> >>>>>> @@ -309,6 +309,8 @@ int blk_queue_enter(struct request_queue
> >>>>>>           if (flags & BLK_MQ_REQ_NOWAIT)
> >>>>>>                   return -EAGAIN;
> >>>>>>
> >>>>>> +         /* if necessary, resume .dev (assume success). */
> >>>>>> +         blk_pm_resume_queue(pm, q);
> >>>>>>           /*
> >>>>>>            * read pair of barrier in blk_freeze_queue_start(), we need to
> >>>>>>            * order reading __PERCPU_REF_DEAD flag of .q_usage_counter and
> >>>>>
> >>>>> blk_queue_enter() may be called from the suspend path so I don't think
> >>>>> that the above change will work.
> >>>>
> >>>> Why would the existing code work then?
> >>>
> >>> The existing code works reliably on a very large number of devices.
> >>
> >> Well, except that it doesn't work during system suspend and
> >> hibernation when the PM workqueue is frozen. I think that we agree
> >> here.
> >>
> >> This needs to be addressed because it may very well cause system
> >> suspend to deadlock.
> >>
> >> There are two possible ways to address it I can think of:
> >>
> >> 1. Changing blk_pm_resume_queue() and its users to carry out a
> >> synchronous resume of q->dev instead of calling pm_request_resume()
> >> and (effectively) waiting for the queued-up runtime resume of q->dev
> >> to take effect.
> >>
> >> This would be my preferred option, but at this point I'm not sure if
> >> it's viable.
> >>
> >
> > After __pm_runtime_disable() is called from device_suspend_late(), dev->power.disable_depth is set, preventing
> > rpm_resume() from making progress until the system resume completes, regardless of whether rpm_resume() is invoked
> > synchronously or asynchronously.
> > Performing a synchronous resume of q->dev seems to have a similar effect to removing the following code block from
> > __pm_runtime_barrier(), which is invoked by __pm_runtime_disable():
> >
> > 1428         if (dev->power.request_pending) {
> > 1429                 dev->power.request = RPM_REQ_NONE;
> > 1430                 spin_unlock_irq(&dev->power.lock);
> > 1431
> > 1432                 cancel_work_sync(&dev->power.work);
> > 1433
> > 1434                 spin_lock_irq(&dev->power.lock);
> > 1435                 dev->power.request_pending = false;
> > 1436         }
> >
>
> Since both synchronous and asynchronous resumes face similar issues,

No, they don't.

> it may be sufficient to keep using the asynchronous resume path as long as
> pending work items are not canceled while the PM workqueue is frozen.

Except for two things:

1. If blk_queue_enter() or __bio_queue_enter() is allowed to race with
disabling runtime PM, queuing up the resume work item may fail in the
first place.

2. If a device runtime resume work item is queued up before the whole
system is suspended, it may not make sense to run that work item after
resuming the whole system because the state of the system as a whole
is generally different at that point.

> This allows the pending work to proceed normally once the PM workqueue
> is unfrozen.

Not really.
* Re: [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable
  2025-12-01 18:55             ` Rafael J. Wysocki
@ 2025-12-02 10:33               ` YangYang
  2025-12-02 12:18                 ` Rafael J. Wysocki
  0 siblings, 1 reply; 44+ messages in thread
From: YangYang @ 2025-12-02 10:33 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Bart Van Assche, Jens Axboe, Pavel Machek, Len Brown,
	Greg Kroah-Hartman, Danilo Krummrich, linux-block, linux-kernel,
	linux-pm

On 2025/12/2 2:55, Rafael J. Wysocki wrote:
> On Mon, Dec 1, 2025 at 1:56 PM YangYang <yang.yang@vivo.com> wrote:
>>
>> On 2025/12/1 17:46, YangYang wrote:
>>> On 2025/11/27 20:34, Rafael J. Wysocki wrote:
>>>> On Wed, Nov 26, 2025 at 11:47 PM Bart Van Assche <bvanassche@acm.org> wrote:
>>>>>
>>>>> On 11/26/25 1:30 PM, Rafael J. Wysocki wrote:
>>>>>> On Wed, Nov 26, 2025 at 10:11 PM Bart Van Assche <bvanassche@acm.org> wrote:
>>>>>>>
>>>>>>> On 11/26/25 12:17 PM, Rafael J. Wysocki wrote:
>>>>>>>> --- a/block/blk-core.c
>>>>>>>> +++ b/block/blk-core.c
>>>>>>>> @@ -309,6 +309,8 @@ int blk_queue_enter(struct request_queue
>>>>>>>>           if (flags & BLK_MQ_REQ_NOWAIT)
>>>>>>>>                   return -EAGAIN;
>>>>>>>>
>>>>>>>> +         /* if necessary, resume .dev (assume success). */
>>>>>>>> +         blk_pm_resume_queue(pm, q);
>>>>>>>>           /*
>>>>>>>>            * read pair of barrier in blk_freeze_queue_start(), we need to
>>>>>>>>            * order reading __PERCPU_REF_DEAD flag of .q_usage_counter and
>>>>>>>
>>>>>>> blk_queue_enter() may be called from the suspend path so I don't think
>>>>>>> that the above change will work.
>>>>>>
>>>>>> Why would the existing code work then?
>>>>>
>>>>> The existing code works reliably on a very large number of devices.
>>>>
>>>> Well, except that it doesn't work during system suspend and
>>>> hibernation when the PM workqueue is frozen. I think that we agree
>>>> here.
>>>>
>>>> This needs to be addressed because it may very well cause system
>>>> suspend to deadlock.
>>>>
>>>> There are two possible ways to address it I can think of:
>>>>
>>>> 1. Changing blk_pm_resume_queue() and its users to carry out a
>>>> synchronous resume of q->dev instead of calling pm_request_resume()
>>>> and (effectively) waiting for the queued-up runtime resume of q->dev
>>>> to take effect.
>>>>
>>>> This would be my preferred option, but at this point I'm not sure if
>>>> it's viable.
>>>>
>>>
>>> After __pm_runtime_disable() is called from device_suspend_late(), dev->power.disable_depth is set, preventing
>>> rpm_resume() from making progress until the system resume completes, regardless of whether rpm_resume() is invoked
>>> synchronously or asynchronously.
>>> Performing a synchronous resume of q->dev seems to have a similar effect to removing the following code block from
>>> __pm_runtime_barrier(), which is invoked by __pm_runtime_disable():
>>>
>>> 1428         if (dev->power.request_pending) {
>>> 1429                 dev->power.request = RPM_REQ_NONE;
>>> 1430                 spin_unlock_irq(&dev->power.lock);
>>> 1431
>>> 1432                 cancel_work_sync(&dev->power.work);
>>> 1433
>>> 1434                 spin_lock_irq(&dev->power.lock);
>>> 1435                 dev->power.request_pending = false;
>>> 1436         }
>>>
>>
>> Since both synchronous and asynchronous resumes face similar issues,
>
> No, they don't.
>
>> it may be sufficient to keep using the asynchronous resume path as long as
>> pending work items are not canceled while the PM workqueue is frozen.
>
> Except for two things:
>
> 1. If blk_queue_enter() or __bio_queue_enter() is allowed to race with
> disabling runtime PM, queuing up the resume work item may fail in the
> first place.
>

Perhaps my understanding is incorrect, but during the execution of
device_suspend_late(), the PM workqueue should already be frozen.
In that case, queuing a resume work item would not fail; it would
simply not be executed until the workqueue is unfrozen, as long as
it is not canceled.

> 2. If a device runtime resume work item is queued up before the whole
> system is suspended, it may not make sense to run that work item after
> resuming the whole system because the state of the system as a whole
> is generally different at that point.
>
>> This allows the pending work to proceed normally once the PM workqueue
>> is unfrozen.
>
> Not really.
* Re: [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable
  2025-12-02 10:33               ` YangYang
@ 2025-12-02 12:18                 ` Rafael J. Wysocki
  0 siblings, 0 replies; 44+ messages in thread
From: Rafael J. Wysocki @ 2025-12-02 12:18 UTC (permalink / raw)
To: YangYang
Cc: Rafael J. Wysocki, Bart Van Assche, Jens Axboe, Pavel Machek,
	Len Brown, Greg Kroah-Hartman, Danilo Krummrich, linux-block,
	linux-kernel, linux-pm

On Tue, Dec 2, 2025 at 11:33 AM YangYang <yang.yang@vivo.com> wrote:
>
> On 2025/12/2 2:55, Rafael J. Wysocki wrote:
> > On Mon, Dec 1, 2025 at 1:56 PM YangYang <yang.yang@vivo.com> wrote:
> >>
> >> On 2025/12/1 17:46, YangYang wrote:
> >>> On 2025/11/27 20:34, Rafael J. Wysocki wrote:
> >>>> On Wed, Nov 26, 2025 at 11:47 PM Bart Van Assche <bvanassche@acm.org> wrote:
> >>>>>
> >>>>> On 11/26/25 1:30 PM, Rafael J. Wysocki wrote:
> >>>>>> On Wed, Nov 26, 2025 at 10:11 PM Bart Van Assche <bvanassche@acm.org> wrote:
> >>>>>>>
> >>>>>>> On 11/26/25 12:17 PM, Rafael J. Wysocki wrote:
> >>>>>>>> --- a/block/blk-core.c
> >>>>>>>> +++ b/block/blk-core.c
> >>>>>>>> @@ -309,6 +309,8 @@ int blk_queue_enter(struct request_queue
> >>>>>>>>           if (flags & BLK_MQ_REQ_NOWAIT)
> >>>>>>>>                   return -EAGAIN;
> >>>>>>>>
> >>>>>>>> +         /* if necessary, resume .dev (assume success). */
> >>>>>>>> +         blk_pm_resume_queue(pm, q);
> >>>>>>>>           /*
> >>>>>>>>            * read pair of barrier in blk_freeze_queue_start(), we need to
> >>>>>>>>            * order reading __PERCPU_REF_DEAD flag of .q_usage_counter and
> >>>>>>>
> >>>>>>> blk_queue_enter() may be called from the suspend path so I don't think
> >>>>>>> that the above change will work.
> >>>>>>
> >>>>>> Why would the existing code work then?
> >>>>>
> >>>>> The existing code works reliably on a very large number of devices.
> >>>>
> >>>> Well, except that it doesn't work during system suspend and
> >>>> hibernation when the PM workqueue is frozen. I think that we agree
> >>>> here.
> >>>>
> >>>> This needs to be addressed because it may very well cause system
> >>>> suspend to deadlock.
> >>>>
> >>>> There are two possible ways to address it I can think of:
> >>>>
> >>>> 1. Changing blk_pm_resume_queue() and its users to carry out a
> >>>> synchronous resume of q->dev instead of calling pm_request_resume()
> >>>> and (effectively) waiting for the queued-up runtime resume of q->dev
> >>>> to take effect.
> >>>>
> >>>> This would be my preferred option, but at this point I'm not sure if
> >>>> it's viable.
> >>>>
> >>>
> >>> After __pm_runtime_disable() is called from device_suspend_late(), dev->power.disable_depth is set, preventing
> >>> rpm_resume() from making progress until the system resume completes, regardless of whether rpm_resume() is invoked
> >>> synchronously or asynchronously.
> >>> Performing a synchronous resume of q->dev seems to have a similar effect to removing the following code block from
> >>> __pm_runtime_barrier(), which is invoked by __pm_runtime_disable():
> >>>
> >>> 1428         if (dev->power.request_pending) {
> >>> 1429                 dev->power.request = RPM_REQ_NONE;
> >>> 1430                 spin_unlock_irq(&dev->power.lock);
> >>> 1431
> >>> 1432                 cancel_work_sync(&dev->power.work);
> >>> 1433
> >>> 1434                 spin_lock_irq(&dev->power.lock);
> >>> 1435                 dev->power.request_pending = false;
> >>> 1436         }
> >>>
> >>
> >> Since both synchronous and asynchronous resumes face similar issues,
> >
> > No, they don't.
> >
> >> it may be sufficient to keep using the asynchronous resume path as long as
> >> pending work items are not canceled while the PM workqueue is frozen.
> >
> > Except for two things:
> >
> > 1. If blk_queue_enter() or __bio_queue_enter() is allowed to race with
> > disabling runtime PM, queuing up the resume work item may fail in the
> > first place.
> >
>
> Perhaps my understanding is incorrect, but during the execution of
> device_suspend_late(), the PM workqueue should already be frozen.
> In that case, queuing a resume work item would not fail; it would
> simply not be executed until the workqueue is unfrozen, as long as
> it is not canceled.

rpm_resume() returns an error if runtime PM is disabled for the given
device and the device status is RPM_SUSPENDED even if it is called
with RPM_ASYNC or RPM_NOWAIT in the flags.
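The bailout being described, where a disabled device never even gets a resume work item queued, can be modelled in a few lines of user-space C. The struct, helper name, and error value are illustrative stand-ins for the real kernel code:

```c
/* Minimal model of the behaviour described above: once disable_depth is
 * non-zero and the status is RPM_SUSPENDED, the call fails immediately,
 * even for RPM_ASYNC/RPM_NOWAIT callers, so no work item is queued at
 * all. Hypothetical names; -13 stands in for -EACCES. */
#include <assert.h>
#include <stdbool.h>

enum rpm_status { RPM_ACTIVE, RPM_SUSPENDED };

struct dev_power_model {
	int disable_depth;
	enum rpm_status runtime_status;
	bool request_pending;	/* models a queued-up resume work item */
};

static int rpm_resume_model(struct dev_power_model *p, bool async)
{
	if (p->disable_depth > 0)
		return p->runtime_status == RPM_ACTIVE ? 1 : -13;
	if (async) {
		p->request_pending = true;	/* queue the resume work item */
		return 0;
	}
	p->runtime_status = RPM_ACTIVE;		/* synchronous resume */
	return 1;
}
```

The point of the model is the first branch: whether the PM workqueue is frozen or not never comes into play for a device whose runtime PM is already disabled.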
* Re: [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable
  2025-12-01  9:46         ` YangYang
  2025-12-01 12:56           ` YangYang
@ 2025-12-01 18:47           ` Rafael J. Wysocki
  2025-12-01 19:58             ` [PATCH v1] PM: sleep: Do not flag runtime PM workqueue as freezable Rafael J. Wysocki
  ` (2 more replies)
  1 sibling, 3 replies; 44+ messages in thread
From: Rafael J. Wysocki @ 2025-12-01 18:47 UTC (permalink / raw)
To: YangYang
Cc: Rafael J. Wysocki, Bart Van Assche, Jens Axboe, Pavel Machek,
	Len Brown, Greg Kroah-Hartman, Danilo Krummrich, linux-block,
	linux-kernel, linux-pm

On Mon, Dec 1, 2025 at 10:46 AM YangYang <yang.yang@vivo.com> wrote:
>
> On 2025/11/27 20:34, Rafael J. Wysocki wrote:
> > On Wed, Nov 26, 2025 at 11:47 PM Bart Van Assche <bvanassche@acm.org> wrote:
> >>
> >> On 11/26/25 1:30 PM, Rafael J. Wysocki wrote:
> >>> On Wed, Nov 26, 2025 at 10:11 PM Bart Van Assche <bvanassche@acm.org> wrote:
> >>>>
> >>>> On 11/26/25 12:17 PM, Rafael J. Wysocki wrote:
> >>>>> --- a/block/blk-core.c
> >>>>> +++ b/block/blk-core.c
> >>>>> @@ -309,6 +309,8 @@ int blk_queue_enter(struct request_queue
> >>>>>           if (flags & BLK_MQ_REQ_NOWAIT)
> >>>>>                   return -EAGAIN;
> >>>>>
> >>>>> +         /* if necessary, resume .dev (assume success). */
> >>>>> +         blk_pm_resume_queue(pm, q);
> >>>>>           /*
> >>>>>            * read pair of barrier in blk_freeze_queue_start(), we need to
> >>>>>            * order reading __PERCPU_REF_DEAD flag of .q_usage_counter and
> >>>>
> >>>> blk_queue_enter() may be called from the suspend path so I don't think
> >>>> that the above change will work.
> >>>
> >>> Why would the existing code work then?
> >>
> >> The existing code works reliably on a very large number of devices.
> >
> > Well, except that it doesn't work during system suspend and
> > hibernation when the PM workqueue is frozen. I think that we agree
> > here.
> >
> > This needs to be addressed because it may very well cause system
> > suspend to deadlock.
> >
> > There are two possible ways to address it I can think of:
> >
> > 1. Changing blk_pm_resume_queue() and its users to carry out a
> > synchronous resume of q->dev instead of calling pm_request_resume()
> > and (effectively) waiting for the queued-up runtime resume of q->dev
> > to take effect.
> >
> > This would be my preferred option, but at this point I'm not sure if
> > it's viable.
> >
>
> After __pm_runtime_disable() is called from device_suspend_late(),
> dev->power.disable_depth is set, preventing rpm_resume() from making
> progress until the system resume completes, regardless of whether
> rpm_resume() is invoked synchronously or asynchronously.

This isn't factually correct.

rpm_resume() will make progress when runtime PM is disabled, but it
will not resume the target device. That's what disabling runtime PM
means.

Of course, when runtime PM is disabled for the given device,
rpm_resume() will return an error code that can be checked. However,
if pm_request_resume() is called before disabling runtime PM for the
device and runtime PM is disabled for it before the work item queued
by pm_request_resume() runs, the failure will be silent from the
caller's perspective.

> Performing a synchronous resume of q->dev seems to have a similar
> effect to removing the following code block from
> __pm_runtime_barrier(), which is invoked by __pm_runtime_disable():
>
> 1428         if (dev->power.request_pending) {
> 1429                 dev->power.request = RPM_REQ_NONE;
> 1430                 spin_unlock_irq(&dev->power.lock);
> 1431
> 1432                 cancel_work_sync(&dev->power.work);
> 1433
> 1434                 spin_lock_irq(&dev->power.lock);
> 1435                 dev->power.request_pending = false;
> 1436         }

It is different.

First of all, synchronous runtime resume is not affected by the
freezing of the runtime PM workqueue.

Next, see the remark above regarding returning an error code.

Finally, so long as __pm_runtime_resume() acquires power.lock before
__pm_runtime_disable(), the synchronous resume will be waited for by
the latter.

Generally speaking, if blk_queue_enter() or __bio_queue_enter() may
run in parallel with device_suspend_late() for q->dev, the driver of
that device is defective, because it is responsible for preventing
this situation from happening. The most straightforward way to
achieve that is to provide a .suspend() callback for q->dev that will
runtime-resume it (and, of course, q->dev will need to be prepared for
system suspend as appropriate after that).

If blk_queue_enter() or __bio_queue_enter() is allowed to race with
disabling runtime PM for q->dev, failure to resume q->dev is always
possible and there are no changes that can be made to
pm_runtime_disable() to prevent that from happening. If
__pm_runtime_disable() wins the race, it will increment
power.disable_depth and rpm_resume() will bail out when it sees that
no matter what.

You should not conflate "runtime PM doesn't work when it is disabled"
with "asynchronous runtime PM doesn't work after freezing the PM
workqueue". They are both true, but they are not the same.
* [PATCH v1] PM: sleep: Do not flag runtime PM workqueue as freezable
  2025-12-01 18:47           ` Rafael J. Wysocki
@ 2025-12-01 19:58             ` Rafael J. Wysocki
  2025-12-02  1:06               ` Bart Van Assche
  ` (2 more replies)
  2025-12-02  0:40             ` [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable Bart Van Assche
  2025-12-05 15:24             ` [PATCH v2] PM: sleep: Do not flag runtime PM workqueue as freezable Rafael J. Wysocki
  2 siblings, 3 replies; 44+ messages in thread
From: Rafael J. Wysocki @ 2025-12-01 19:58 UTC (permalink / raw)
To: YangYang
Cc: Bart Van Assche, Jens Axboe, Greg Kroah-Hartman, Danilo Krummrich,
	linux-block, linux-kernel, linux-pm, Ulf Hansson

On Monday, December 1, 2025 7:47:46 PM CET Rafael J. Wysocki wrote:
> On Mon, Dec 1, 2025 at 10:46 AM YangYang <yang.yang@vivo.com> wrote:

[cut]

> If blk_queue_enter() or __bio_queue_enter() is allowed to race with
> disabling runtime PM for q->dev, failure to resume q->dev is always
> possible and there are no changes that can be made to
> pm_runtime_disable() to prevent that from happening. If
> __pm_runtime_disable() wins the race, it will increment
> power.disable_depth and rpm_resume() will bail out when it sees that
> no matter what.
>
> You should not conflate "runtime PM doesn't work when it is disabled"
> with "asynchronous runtime PM doesn't work after freezing the PM
> workqueue". They are both true, but they are not the same.

So I've been testing the patch below for a few days and it will eliminate
the latter, but even after this patch runtime PM will be disabled in
device_suspend_late() and if the problem you are facing is still there
after this patch, it will need to be dealt with at the driver level.

Generally speaking, driver involvement is needed to make runtime PM and
system suspend/resume work together in the majority of cases.

---
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Subject: PM: sleep: Do not flag runtime PM workqueue as freezable

Till now, the runtime PM workqueue has been flagged as freezable, so
it does not process work items during system-wide PM transitions like
system suspend and resume.

The original reason to do that was to reduce the likelihood of runtime
PM getting in the way of system-wide PM processing, but now it is mostly
an optimization because (1) runtime suspend of devices is prevented by
bumping up their runtime PM usage counters in device_prepare() and (2)
device drivers are expected to disable runtime PM for the devices
handled by them before they embark on system-wide PM activities that
may change the state of the hardware or otherwise interfere with
runtime PM.

However, it prevents asynchronous runtime resume of devices from working
during system-wide PM transitions, which is confusing because synchronous
runtime resume is not prevented at the same time, and it also sometimes
turns out to be problematic.

For example, it has been reported that blk_queue_enter() may deadlock
during a system suspend transition because of the pm_request_resume()
usage in it [1]. That happens because the asynchronous runtime resume
of the given device is not processed due to the freezing of the runtime
PM workqueue.

While it may be better to address this particular issue in the block
layer, the very presence of it means that similar problems may be
expected to occur elsewhere.

For this reason, remove the WQ_FREEZABLE flag from the runtime PM
workqueue and make device_suspend_late() use the generic variant of
pm_runtime_disable() that will carry out a runtime resume of the device
synchronously if there is a pending resume request for it.

Also update the comment before the pm_runtime_disable() call in
device_suspend_late() to document the fact that runtime PM should not
be expected to work for the device until the end of
device_resume_early().

This change may, even though it is not expected to, uncover some latent
issues related to queuing up asynchronous runtime resume work items
during system suspend or hibernation. However, they should be limited
to the interference between runtime resume and system-wide PM callbacks
in the cases when device drivers start to handle system-wide PM before
disabling runtime PM as described above.

Link: https://lore.kernel.org/linux-pm/20251126101636.205505-2-yang.yang@vivo.com/
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/base/power/main.c | 7 ++++---
 kernel/power/main.c       | 2 +-
 2 files changed, 5 insertions(+), 4 deletions(-)

--- a/drivers/base/power/main.c
+++ b/drivers/base/power/main.c
@@ -1647,10 +1647,11 @@ static void device_suspend_late(struct d
 		goto Complete;
 
 	/*
-	 * Disable runtime PM for the device without checking if there is a
-	 * pending resume request for it.
+	 * After this point, any runtime PM operations targeting the device
+	 * will fail until the corresponding pm_runtime_enable() call in
+	 * device_resume_early().
 	 */
-	__pm_runtime_disable(dev, false);
+	pm_runtime_disable(dev);
 
 	if (dev->power.syscore)
 		goto Skip;
--- a/kernel/power/main.c
+++ b/kernel/power/main.c
@@ -1125,7 +1125,7 @@ EXPORT_SYMBOL_GPL(pm_wq);
 
 static int __init pm_start_workqueues(void)
 {
-	pm_wq = alloc_workqueue("pm", WQ_FREEZABLE | WQ_UNBOUND, 0);
+	pm_wq = alloc_workqueue("pm", WQ_UNBOUND, 0);
 
 	if (!pm_wq)
 		return -ENOMEM;
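The difference the patch leans on, between the check_resume and no-check variants of disabling runtime PM, can be sketched as a tiny user-space model. The struct and helper below are purely illustrative, not the real __pm_runtime_disable() implementation:

```c
/* Sketch of the check_resume distinction: with check_resume set (as in
 * pm_runtime_disable()), a pending resume request is carried out
 * synchronously before runtime PM is disabled, whereas the old
 * __pm_runtime_disable(dev, false) call simply cancelled it.
 * Hypothetical names throughout. */
#include <assert.h>
#include <stdbool.h>

struct dev_model {
	int disable_depth;
	bool resume_pending;	/* a queued-up asynchronous resume */
	bool resumed;		/* did the device actually resume? */
};

static void pm_runtime_disable_model(struct dev_model *d, bool check_resume)
{
	if (check_resume && d->resume_pending)
		d->resumed = true;	/* resume carried out synchronously */
	d->resume_pending = false;	/* any leftover request is cancelled */
	d->disable_depth++;		/* further resumes will now fail */
}
```

In both variants the request is gone afterwards; the difference is whether the device was actually brought back to full power first.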
* Re: [PATCH v1] PM: sleep: Do not flag runtime PM workqueue as freezable
  2025-12-01 19:58             ` [PATCH v1] PM: sleep: Do not flag runtime PM workqueue as freezable Rafael J. Wysocki
@ 2025-12-02  1:06               ` Bart Van Assche
  2025-12-02 11:53                 ` Rafael J. Wysocki
  2025-12-02 10:36               ` YangYang
  2025-12-02 14:58               ` Ulf Hansson
  2 siblings, 1 reply; 44+ messages in thread
From: Bart Van Assche @ 2025-12-02  1:06 UTC (permalink / raw)
To: Rafael J. Wysocki, YangYang
Cc: Jens Axboe, Greg Kroah-Hartman, Danilo Krummrich, linux-block,
	linux-kernel, linux-pm, Ulf Hansson

On 12/1/25 11:58 AM, Rafael J. Wysocki wrote:
> So I've been testing the patch below for a few days and it will eliminate
> the latter, but even after this patch runtime PM will be disabled in
> device_suspend_late() and if the problem you are facing is still there
> after this patch, it will need to be dealt with at the driver level.
>
> Generally speaking, driver involvement is needed to make runtime PM and
> system suspend/resume work together in the majority of cases.

Thank you for having developed and shared this patch. Is the following
quote from the Linux kernel documentation still correct with this patch
applied or should an update for Documentation/power/runtime_pm.rst
perhaps be included in this patch?

"The power management workqueue pm_wq in which bus types and device
drivers can put their PM-related work items. It is strongly recommended
that pm_wq be used for queuing all work items related to runtime PM,
because this allows them to be synchronized with system-wide power
transitions (suspend to RAM, hibernation and resume from system sleep
states). pm_wq is declared in include/linux/pm_runtime.h and defined in
kernel/power/main.c."

Bart.
* Re: [PATCH v1] PM: sleep: Do not flag runtime PM workqueue as freezable
  2025-12-02  1:06               ` Bart Van Assche
@ 2025-12-02 11:53                 ` Rafael J. Wysocki
  2025-12-02 13:29                   ` Rafael J. Wysocki
  0 siblings, 1 reply; 44+ messages in thread
From: Rafael J. Wysocki @ 2025-12-02 11:53 UTC (permalink / raw)
To: Bart Van Assche
Cc: Rafael J. Wysocki, YangYang, Jens Axboe, Greg Kroah-Hartman,
	Danilo Krummrich, linux-block, linux-kernel, linux-pm, Ulf Hansson

On Tue, Dec 2, 2025 at 2:06 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 12/1/25 11:58 AM, Rafael J. Wysocki wrote:
> > So I've been testing the patch below for a few days and it will eliminate
> > the latter, but even after this patch runtime PM will be disabled in
> > device_suspend_late() and if the problem you are facing is still there
> > after this patch, it will need to be dealt with at the driver level.
> >
> > Generally speaking, driver involvement is needed to make runtime PM and
> > system suspend/resume work together in the majority of cases.
>
> Thank you for having developed and shared this patch. Is the following
> quote from the Linux kernel documentation still correct with this patch
> applied or should an update for Documentation/power/runtime_pm.rst
> perhaps be included in this patch?
>
> "The power management workqueue pm_wq in which bus types and device
> drivers can put their PM-related work items. It is strongly recommended
> that pm_wq be used for queuing all work items related to runtime PM,
> because this allows them to be synchronized with system-wide power
> transitions (suspend to RAM, hibernation and resume from system sleep
> states). pm_wq is declared in include/linux/pm_runtime.h and defined in
> kernel/power/main.c."

It doesn't say what the synchronization mechanism is in particular and
some synchronization is still provided after this patch, via the
pm_runtime_barrier() in device_suspend(), for example.
* Re: [PATCH v1] PM: sleep: Do not flag runtime PM workqueue as freezable 2025-12-02 11:53 ` Rafael J. Wysocki @ 2025-12-02 13:29 ` Rafael J. Wysocki 0 siblings, 0 replies; 44+ messages in thread From: Rafael J. Wysocki @ 2025-12-02 13:29 UTC (permalink / raw) To: Bart Van Assche Cc: YangYang, Jens Axboe, Greg Kroah-Hartman, Danilo Krummrich, linux-block, linux-kernel, linux-pm, Ulf Hansson On Tue, Dec 2, 2025 at 12:53 PM Rafael J. Wysocki <rafael@kernel.org> wrote: > > On Tue, Dec 2, 2025 at 2:06 AM Bart Van Assche <bvanassche@acm.org> wrote: > > > > On 12/1/25 11:58 AM, Rafael J. Wysocki wrote: > > > So I've been testing the patch below for a few days and it will eliminate > > > the latter, but even after this patch runtime PM will be disabled in > > > device_suspend_late() and if the problem you are facing is still there > > > after this patch, it will need to dealt with at the driver level. > > > > > > Generally speaking, driver involvement is needed to make runtime PM and > > > system suspend/resume work together in the majority of cases. > > > > Thank you for having developed and shared this patch. Is the following > > quote from the Linux kernel documentation still correct with this patch > > applied or should an update for Documentation/power/runtime_pm.rst > > perhaps be included in this patch? > > > > "The power management workqueue pm_wq in which bus types and device > > drivers can > > put their PM-related work items. It is strongly recommended that > > pm_wq be > > used for queuing all work items related to runtime PM, because this > > allows > > them to be synchronized with system-wide power transitions (suspend > > to RAM, > > hibernation and resume from system sleep states). pm_wq is declared in > > include/linux/pm_runtime.h and defined in kernel/power/main.c." 
> > It doesn't say what the synchronization mechanism is in particular and > some synchronization is still provided after this patch, via the > pm_runtime_barrier() in device_suspend(), for example. Though there is another piece of documentation that needs updating to reflect the changes in this patch, so I'll send a v2 at one point. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v1] PM: sleep: Do not flag runtime PM workqueue as freezable 2025-12-01 19:58 ` [PATCH v1] PM: sleep: Do not flag runtime PM workqueue as freezable Rafael J. Wysocki 2025-12-02 1:06 ` Bart Van Assche @ 2025-12-02 10:36 ` YangYang 2025-12-02 14:58 ` Ulf Hansson 2 siblings, 0 replies; 44+ messages in thread From: YangYang @ 2025-12-02 10:36 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Bart Van Assche, Jens Axboe, Greg Kroah-Hartman, Danilo Krummrich, linux-block, linux-kernel, linux-pm, Ulf Hansson On 2025/12/2 3:58, Rafael J. Wysocki wrote: > On Monday, December 1, 2025 7:47:46 PM CET Rafael J. Wysocki wrote: >> On Mon, Dec 1, 2025 at 10:46 AM YangYang <yang.yang@vivo.com> wrote: > > [cut] > >> If blk_queue_enter() or __bio_queue_enter() is allowed to race with >> disabling runtime PM for q->dev, failure to resume q->dev is alway >> possible and there are no changes that can be made to >> pm_runtime_disable() to prevent that from happening. If >> __pm_runtime_disable() wins the race, it will increment >> power.disable_depth and rpm_resume() will bail out when it sees that >> no matter what. >> >> You should not conflate "runtime PM doesn't work when it is disabled" >> with "asynchronous runtime PM doesn't work after freezing the PM >> workqueue". They are both true, but they are not the same. > > So I've been testing the patch below for a few days and it will eliminate > the latter, but even after this patch runtime PM will be disabled in > device_suspend_late() and if the problem you are facing is still there > after this patch, it will need to dealt with at the driver level. > > Generally speaking, driver involvement is needed to make runtime PM and > system suspend/resume work together in the majority of cases. > Thank you. I'll perform some tests with this patch applied. ^ permalink raw reply [flat|nested] 44+ messages in thread
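[Editorial note: the ordering argument quoted above — "If __pm_runtime_disable() wins the race, it will increment power.disable_depth and rpm_resume() will bail out" — can be sketched with a minimal userspace model. This is NOT kernel code: struct dev_power is a two-field stand-in and the model_* names are invented for illustration. It only captures the race outcome; the real rpm_resume() in drivers/base/power/runtime.c bails out with -EACCES when power.disable_depth is nonzero and the device is not already active.]

```c
#include <assert.h>
#include <errno.h>

/* Illustrative stand-in for the relevant fields of struct dev_pm_info. */
struct dev_power {
	int disable_depth;	/* > 0 means runtime PM is disabled */
	int runtime_status;	/* 1 = suspended, 0 = active */
};

/* Models __pm_runtime_disable() winning the race: bump the disable count. */
static void model_runtime_disable(struct dev_power *power)
{
	power->disable_depth++;
}

/* Models rpm_resume(): refuse to resume while runtime PM is disabled. */
static int model_rpm_resume(struct dev_power *power)
{
	if (power->disable_depth > 0)
		return -EACCES;		/* bail out, device stays suspended */
	power->runtime_status = 0;	/* resumed to active */
	return 0;
}
```

If the disable happens first, no later change to the disable path can make the resume succeed — which is the point being made in the quoted text.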
* Re: [PATCH v1] PM: sleep: Do not flag runtime PM workqueue as freezable 2025-12-01 19:58 ` [PATCH v1] PM: sleep: Do not flag runtime PM workqueue as freezable Rafael J. Wysocki 2025-12-02 1:06 ` Bart Van Assche 2025-12-02 10:36 ` YangYang @ 2025-12-02 14:58 ` Ulf Hansson 2 siblings, 0 replies; 44+ messages in thread From: Ulf Hansson @ 2025-12-02 14:58 UTC (permalink / raw) To: Rafael J. Wysocki Cc: YangYang, Bart Van Assche, Jens Axboe, Greg Kroah-Hartman, Danilo Krummrich, linux-block, linux-kernel, linux-pm On Mon, 1 Dec 2025 at 20:58, Rafael J. Wysocki <rafael@kernel.org> wrote: > > On Monday, December 1, 2025 7:47:46 PM CET Rafael J. Wysocki wrote: > > On Mon, Dec 1, 2025 at 10:46 AM YangYang <yang.yang@vivo.com> wrote: > > [cut] > > > If blk_queue_enter() or __bio_queue_enter() is allowed to race with > > disabling runtime PM for q->dev, failure to resume q->dev is alway > > possible and there are no changes that can be made to > > pm_runtime_disable() to prevent that from happening. If > > __pm_runtime_disable() wins the race, it will increment > > power.disable_depth and rpm_resume() will bail out when it sees that > > no matter what. > > > > You should not conflate "runtime PM doesn't work when it is disabled" > > with "asynchronous runtime PM doesn't work after freezing the PM > > workqueue". They are both true, but they are not the same. > > So I've been testing the patch below for a few days and it will eliminate > the latter, but even after this patch runtime PM will be disabled in > device_suspend_late() and if the problem you are facing is still there > after this patch, it will need to dealt with at the driver level. > > Generally speaking, driver involvement is needed to make runtime PM and > system suspend/resume work together in the majority of cases. > > --- > From: Rafael J. 
Wysocki <rafael.j.wysocki@intel.com> > Subject: > > Till now, the runtime PM workqueue has been flagged as freezable, so it > does not process work items during system-wide PM transitions like > system suspend and resume. The original reason to do that was to > reduce the likelihood of runtime PM getting in the way of system-wide > PM processing, but now it is mostly an optimization because (1) runtime > suspend of devices is prevented by bumping up their runtime PM usage > counters in device_prepare() and (2) device drivers are expected to > disable runtime PM for the devices handled by them before they embark > on system-wide PM activities that may change the state of the hardware > or otherwise interfere with runtime PM. However, it prevents > asynchronous runtime resume of devices from working during system-wide > PM transitions, which is confusing because synchronous runtime resume > is not prevented at the same time, and it also sometimes turns out to > be problematic. > > For example, it has been reported that blk_queue_enter() may deadlock > during a system suspend transition because of the pm_request_resume() > usage in it [1]. That happens because the asynchronous runtime resume > of the given device is not processed due to the freezing of the runtime > PM workqueue. While it may be better to address this particular issue > in the block layer, the very presence of it means that similar problems > may be expected to occur elsewhere. > > For this reason, remove the WQ_FREEZABLE flag from the runtime PM > workqueue and make device_suspend_late() use the generic variant of > pm_runtime_disable() that will carry out runtime PM of the device > synchronously if there is pending resume work for it. > > Also update the comment before the pm_runtime_disable() call in > device_suspend_late() to document the fact that the runtime PM > should not be expected to work for the device until the end of > device_resume_early(). 
> > This change may, even though it is not expected to, uncover some > latent issues related to queuing up asynchronous runtime resume > work items during system suspend or hibernation. However, they > should be limited to the interference between runtime resume and > system-wide PM callbacks in the cases when device drivers start > to handle system-wide PM before disabling runtime PM as described > above. > > Link: https://lore.kernel.org/linux-pm/20251126101636.205505-2-yang.yang@vivo.com/ > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> I agree with the above and this seems like a reasonable change to me. Yep, it's not entirely easy to know whether all users of pm_request_resume() (and similar) are fine with this too, but in general I think they should. So, feel free to add: Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> Kind regards Uffe > --- > drivers/base/power/main.c | 7 ++++--- > kernel/power/main.c | 2 +- > 2 files changed, 5 insertions(+), 4 deletions(-) > > --- a/drivers/base/power/main.c > +++ b/drivers/base/power/main.c > @@ -1647,10 +1647,11 @@ static void device_suspend_late(struct d > goto Complete; > > /* > - * Disable runtime PM for the device without checking if there is a > - * pending resume request for it. > + * After this point, any runtime PM operations targeting the device > + * will fail until the corresponding pm_runtime_enable() call in > + * device_resume_early(). > */ > - __pm_runtime_disable(dev, false); > + pm_runtime_disable(dev); > > if (dev->power.syscore) > goto Skip; > --- a/kernel/power/main.c > +++ b/kernel/power/main.c > @@ -1125,7 +1125,7 @@ EXPORT_SYMBOL_GPL(pm_wq); > > static int __init pm_start_workqueues(void) > { > - pm_wq = alloc_workqueue("pm", WQ_FREEZABLE | WQ_UNBOUND, 0); > + pm_wq = alloc_workqueue("pm", WQ_UNBOUND, 0); > if (!pm_wq) > return -ENOMEM; > > > > ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable 2025-12-01 18:47 ` Rafael J. Wysocki 2025-12-01 19:58 ` [PATCH v1] PM: sleep: Do not flag runtime PM workqueue as freezable Rafael J. Wysocki @ 2025-12-02 0:40 ` Bart Van Assche 2025-12-02 12:14 ` Rafael J. Wysocki 2025-12-05 15:24 ` [PATCH v2] PM: sleep: Do not flag runtime PM workqueue as freezable Rafael J. Wysocki 2 siblings, 1 reply; 44+ messages in thread From: Bart Van Assche @ 2025-12-02 0:40 UTC (permalink / raw) To: Rafael J. Wysocki, YangYang Cc: Jens Axboe, Pavel Machek, Len Brown, Greg Kroah-Hartman, Danilo Krummrich, linux-block, linux-kernel, linux-pm On 12/1/25 10:47 AM, Rafael J. Wysocki wrote: > Generally speaking, if blk_queue_enter() or __bio_queue_enter() may > run in parallel with device_suspend_late() for q->dev, the driver of > that device is defective, because it is responsible for preventing > this situation from happening. The most straightforward way to > achieve that is to provide a .suspend() callback for q->dev that will > runtime-resume it (and, of course, q->dev will need to be prepared for > system suspend as appropriate after that). Isn't the suspend / hibernation order such that no block I/O is submitted while block devices transition to a lower power state? I'm surprised to read that individual drivers are responsible for preventing blk_queue_enter() or __bio_queue_enter() from running concurrently with device_suspend_late(). Regarding the UFSHCI driver: if a UFS controller is already runtime suspended, we want it to remain suspended during system suspend. Thanks, Bart. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable 2025-12-02 0:40 ` [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable Bart Van Assche @ 2025-12-02 12:14 ` Rafael J. Wysocki 2025-12-02 13:37 ` Rafael J. Wysocki 0 siblings, 1 reply; 44+ messages in thread From: Rafael J. Wysocki @ 2025-12-02 12:14 UTC (permalink / raw) To: Bart Van Assche Cc: Rafael J. Wysocki, YangYang, Jens Axboe, Pavel Machek, Len Brown, Greg Kroah-Hartman, Danilo Krummrich, linux-block, linux-kernel, linux-pm On Tue, Dec 2, 2025 at 1:41 AM Bart Van Assche <bvanassche@acm.org> wrote: > > On 12/1/25 10:47 AM, Rafael J. Wysocki wrote: > > Generally speaking, if blk_queue_enter() or __bio_queue_enter() may > > run in parallel with device_suspend_late() for q->dev, the driver of > > that device is defective, because it is responsible for preventing > > this situation from happening. The most straightforward way to > > achieve that is to provide a .suspend() callback for q->dev that will > > runtime-resume it (and, of course, q->dev will need to be prepared for > > system suspend as appropriate after that). > > Isn't the suspend / hibernation order such that no block I/O is > submitted while block devices transition to a lower power state? I'm > surprised to read that individual drivers are responsible for preventing > that blk_queue_enter() or __bio_queue_enter() run concurrently with > device_suspend_late(). To be more precise, they don't need to be prevented from running concurrently with device_suspend_late() in general. The driver needs to ensure though that q->dev is not runtime-suspended in device_suspend_late() if blk_queue_enter() or __bio_queue_enter() are expected to run in parallel with it or later. > Regarding the UFSHCI driver: if a UFS controller is already runtime > suspended, we want it to remain suspended during system suspend. 
That can be done, but still the driver is responsible for preparing the device for system suspend. The most popular strategy is to use pm_runtime_force_suspend/resume() as driver suspend callbacks for the device, either as .suspend()/.resume() or as .suspend_late()/resume_early(), respectively. In both cases, runtime PM will be disabled and runtime PM callbacks will be used for stopping the device - or not, if it is suspended already - but after that it must not be accessed in any way until the resume part runs. ^ permalink raw reply [flat|nested] 44+ messages in thread
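[Editorial note: the strategy described above can be sketched as a hypothetical driver's dev_pm_ops table — the foo_* names are illustrative, not from the thread, and this is a sketch rather than a buildable driver. pm_runtime_force_suspend() disables runtime PM and runs the runtime-suspend callback only if the device is still active; an already runtime-suspended device is left in suspend, which matches the UFS behavior asked about.]

```c
#include <linux/pm_runtime.h>

static int foo_runtime_suspend(struct device *dev)
{
	/* Put the hardware into a low-power state. */
	return 0;
}

static int foo_runtime_resume(struct device *dev)
{
	/* Bring the hardware back up. */
	return 0;
}

static const struct dev_pm_ops foo_pm_ops = {
	/*
	 * Reuse the runtime PM callbacks for system sleep: the device is
	 * stopped (or left suspended) with runtime PM disabled, and must
	 * not be touched until the resume part runs.
	 */
	SYSTEM_SLEEP_PM_OPS(pm_runtime_force_suspend, pm_runtime_force_resume)
	RUNTIME_PM_OPS(foo_runtime_suspend, foo_runtime_resume, NULL)
};
```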
* Re: [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable 2025-12-02 12:14 ` Rafael J. Wysocki @ 2025-12-02 13:37 ` Rafael J. Wysocki 0 siblings, 0 replies; 44+ messages in thread From: Rafael J. Wysocki @ 2025-12-02 13:37 UTC (permalink / raw) To: Bart Van Assche Cc: YangYang, Jens Axboe, Pavel Machek, Len Brown, Greg Kroah-Hartman, Danilo Krummrich, linux-block, linux-kernel, linux-pm On Tue, Dec 2, 2025 at 1:14 PM Rafael J. Wysocki <rafael@kernel.org> wrote: > > On Tue, Dec 2, 2025 at 1:41 AM Bart Van Assche <bvanassche@acm.org> wrote: > > > > On 12/1/25 10:47 AM, Rafael J. Wysocki wrote: > > > Generally speaking, if blk_queue_enter() or __bio_queue_enter() may > > > run in parallel with device_suspend_late() for q->dev, the driver of > > > that device is defective, because it is responsible for preventing > > > this situation from happening. The most straightforward way to > > > achieve that is to provide a .suspend() callback for q->dev that will > > > runtime-resume it (and, of course, q->dev will need to be prepared for > > > system suspend as appropriate after that). > > > > Isn't the suspend / hibernation order such that no block I/O is > > submitted while block devices transition to a lower power state? I'm > > surprised to read that individual drivers are responsible for preventing > > that blk_queue_enter() or __bio_queue_enter() run concurrently with > > device_suspend_late(). > > To be more precise, they don't need to be prevented from running > concurrently with device_suspend_late() in general. The driver needs > to ensure though that q->dev is not runtime-suspended in > device_suspend_late() if blk_queue_enter() or __bio_queue_enter() are > expected to run in parallel with it or later. > > > Regarding the UFSHCI driver: if a UFS controller is already runtime > > suspended, we want it to remain suspended during system suspend. 
> > That can be done, but still the driver is responsible for preparing > the device for system suspend. > > The most popular strategy is to use pm_runtime_force_suspend/resume() > as driver suspend callbacks for the device, either as > .suspend()/.resume() or as .suspend_late()/resume_early(), > respectively. In both cases, runtime PM will be disabled and runtime > PM callbacks will be used for stopping the device - or not, if it is > suspended already - but after that it must not be accessed in any way > until the resume part runs. One more thing that needs to be said here: The PM core expects the decision on whether or not to leave a runtime-suspended device in suspend across system-wide suspend-resume to be made before device_suspend_late() is called for that device. If the device is suspended at that point, the expectation is that it will be left in suspend. Otherwise, the expectation is that it will be taken care of by the .suspend_late() and .suspend_noirq() callbacks (and this goes beyond runtime PM, quite obviously). ^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH v2] PM: sleep: Do not flag runtime PM workqueue as freezable 2025-12-01 18:47 ` Rafael J. Wysocki 2025-12-01 19:58 ` [PATCH v1] PM: sleep: Do not flag runtime PM workqueue as freezable Rafael J. Wysocki 2025-12-02 0:40 ` [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable Bart Van Assche @ 2025-12-05 15:24 ` Rafael J. Wysocki 2025-12-05 19:10 ` Bart Van Assche 2 siblings, 1 reply; 44+ messages in thread From: Rafael J. Wysocki @ 2025-12-05 15:24 UTC (permalink / raw) To: linux-pm Cc: YangYang, Bart Van Assche, Jens Axboe, linux-block, linux-kernel, Ulf Hansson From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Till now, the runtime PM workqueue has been flagged as freezable, so it does not process work items during system-wide PM transitions like system suspend and resume. The original reason to do that was to reduce the likelihood of runtime PM getting in the way of system-wide PM processing, but now it is mostly an optimization because (1) runtime suspend of devices is prevented by bumping up their runtime PM usage counters in device_prepare() and (2) device drivers are expected to disable runtime PM for the devices handled by them before they embark on system-wide PM activities that may change the state of the hardware or otherwise interfere with runtime PM. However, it prevents asynchronous runtime resume of devices from working during system-wide PM transitions, which is confusing because synchronous runtime resume is not prevented at the same time, and it also sometimes turns out to be problematic. For example, it has been reported that blk_queue_enter() may deadlock during a system suspend transition because of the pm_request_resume() usage in it [1]. That happens because the asynchronous runtime resume of the given device is not processed due to the freezing of the runtime PM workqueue. 
While it may be better to address this particular issue in the block layer, the very presence of it means that similar problems may be expected to occur elsewhere. For this reason, remove the WQ_FREEZABLE flag from the runtime PM workqueue and make device_suspend_late() use the generic variant of pm_runtime_disable() that will carry out runtime PM of the device synchronously if there is pending resume work for it. Also update the comment before the pm_runtime_disable() call in device_suspend_late(), to document the fact that the runtime PM should not be expected to work for the device until the end of device_resume_early(), and update the related documentation. This change may, even though it is not expected to, uncover some latent issues related to queuing up asynchronous runtime resume work items during system suspend or hibernation. However, they should be limited to the interference between runtime resume and system-wide PM callbacks in the cases when device drivers start to handle system-wide PM before disabling runtime PM as described above. Link: https://lore.kernel.org/linux-pm/20251126101636.205505-2-yang.yang@vivo.com/ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> --- v1 -> v2: * Update documentation in runtime_pm.rst. * Add R-by from Ulf. --- Documentation/power/runtime_pm.rst | 7 +++---- drivers/base/power/main.c | 7 ++++--- kernel/power/main.c | 2 +- 3 files changed, 8 insertions(+), 8 deletions(-) --- a/Documentation/power/runtime_pm.rst +++ b/Documentation/power/runtime_pm.rst @@ -714,10 +714,9 @@ out the following operations: * During system suspend pm_runtime_get_noresume() is called for every device right before executing the subsystem-level .prepare() callback for it and pm_runtime_barrier() is called for every device right before executing the - subsystem-level .suspend() callback for it. 
In addition to that the PM core - calls __pm_runtime_disable() with 'false' as the second argument for every - device right before executing the subsystem-level .suspend_late() callback - for it. + subsystem-level .suspend() callback for it. In addition to that, the PM + core disables runtime PM for every device right before executing the + subsystem-level .suspend_late() callback for it. * During system resume pm_runtime_enable() and pm_runtime_put() are called for every device right after executing the subsystem-level .resume_early() --- a/drivers/base/power/main.c +++ b/drivers/base/power/main.c @@ -1647,10 +1647,11 @@ static void device_suspend_late(struct d goto Complete; /* - * Disable runtime PM for the device without checking if there is a - * pending resume request for it. + * After this point, any runtime PM operations targeting the device + * will fail until the corresponding pm_runtime_enable() call in + * device_resume_early(). */ - __pm_runtime_disable(dev, false); + pm_runtime_disable(dev); if (dev->power.syscore) goto Skip; --- a/kernel/power/main.c +++ b/kernel/power/main.c @@ -1125,7 +1125,7 @@ EXPORT_SYMBOL_GPL(pm_wq); static int __init pm_start_workqueues(void) { - pm_wq = alloc_workqueue("pm", WQ_FREEZABLE | WQ_UNBOUND, 0); + pm_wq = alloc_workqueue("pm", WQ_UNBOUND, 0); if (!pm_wq) return -ENOMEM; ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v2] PM: sleep: Do not flag runtime PM workqueue as freezable 2025-12-05 15:24 ` [PATCH v2] PM: sleep: Do not flag runtime PM workqueue as freezable Rafael J. Wysocki @ 2025-12-05 19:10 ` Bart Van Assche 2025-12-07 11:23 ` Rafael J. Wysocki 0 siblings, 1 reply; 44+ messages in thread From: Bart Van Assche @ 2025-12-05 19:10 UTC (permalink / raw) To: Rafael J. Wysocki, linux-pm Cc: YangYang, Jens Axboe, linux-block, linux-kernel, Ulf Hansson On 12/5/25 5:24 AM, Rafael J. Wysocki wrote: > For example, it has been reported that blk_queue_enter() may deadlock > during a system suspend transition because of the pm_request_resume() > usage in it [1]. System resume is also affected. If pm_request_resume() is called before the device it applies to is resumed by the system resume code then the pm_request_resume() call also hangs. Otherwise this patch looks good to me. Thanks, Bart. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v2] PM: sleep: Do not flag runtime PM workqueue as freezable 2025-12-05 19:10 ` Bart Van Assche @ 2025-12-07 11:23 ` Rafael J. Wysocki 0 siblings, 0 replies; 44+ messages in thread From: Rafael J. Wysocki @ 2025-12-07 11:23 UTC (permalink / raw) To: Bart Van Assche Cc: Rafael J. Wysocki, linux-pm, YangYang, Jens Axboe, linux-block, linux-kernel, Ulf Hansson On Fri, Dec 5, 2025 at 8:11 PM Bart Van Assche <bvanassche@acm.org> wrote: > > On 12/5/25 5:24 AM, Rafael J. Wysocki wrote: > > For example, it has been reported that blk_queue_enter() may deadlock > > during a system suspend transition because of the pm_request_resume() > > usage in it [1]. > > System resume is also affected. If pm_request_resume() is called before > the device it applies to is resumed by the system resume code then the > pm_request_resume() call also hangs. Rather, the work item queued by it will not make progress. OK, I'll add this information to the patch changelog while applying it. > Otherwise this patch looks good to me. Thank you! ^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH 2/2] blk-mq: Fix I/O hang caused by incomplete device resume 2025-11-26 10:16 [PATCH 0/2] PM: runtime: Fix potential I/O hang Yang Yang 2025-11-26 10:16 ` [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable Yang Yang @ 2025-11-26 10:16 ` Yang Yang 2025-11-26 11:31 ` [PATCH 0/2] PM: runtime: Fix potential I/O hang Rafael J. Wysocki 2 siblings, 0 replies; 44+ messages in thread From: Yang Yang @ 2025-11-26 10:16 UTC (permalink / raw) To: Jens Axboe, Rafael J. Wysocki, Pavel Machek, Len Brown, Greg Kroah-Hartman, Danilo Krummrich, linux-block, linux-kernel, linux-pm Cc: Yang Yang Setting the force_check_resume flag ensures the device is resumed properly. Signed-off-by: Yang Yang <yang.yang@vivo.com> --- block/blk-pm.c | 1 + 1 file changed, 1 insertion(+) diff --git a/block/blk-pm.c b/block/blk-pm.c index 8d3e052f91da..d23918fbd59f 100644 --- a/block/blk-pm.c +++ b/block/blk-pm.c @@ -28,6 +28,7 @@ */ void blk_pm_runtime_init(struct request_queue *q, struct device *dev) { + dev->power.force_check_resume = true; q->dev = dev; q->rpm_status = RPM_ACTIVE; pm_runtime_set_autosuspend_delay(q->dev, -1); -- 2.34.1 ^ permalink raw reply related [flat|nested] 44+ messages in thread
* Re: [PATCH 0/2] PM: runtime: Fix potential I/O hang 2025-11-26 10:16 [PATCH 0/2] PM: runtime: Fix potential I/O hang Yang Yang 2025-11-26 10:16 ` [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable Yang Yang 2025-11-26 10:16 ` [PATCH 2/2] blk-mq: Fix I/O hang caused by incomplete device resume Yang Yang @ 2025-11-26 11:31 ` Rafael J. Wysocki 2025-11-26 15:48 ` Bart Van Assche 2 siblings, 1 reply; 44+ messages in thread From: Rafael J. Wysocki @ 2025-11-26 11:31 UTC (permalink / raw) To: Yang Yang Cc: Jens Axboe, Rafael J. Wysocki, Pavel Machek, Len Brown, Greg Kroah-Hartman, Danilo Krummrich, linux-block, linux-kernel, linux-pm On Wed, Nov 26, 2025 at 11:17 AM Yang Yang <yang.yang@vivo.com> wrote: > > > Yang Yang (2): > PM: runtime: Fix I/O hang due to race between resume and runtime > disable > blk-mq: Fix I/O hang caused by incomplete device resume This is a no-go as far as I'm concerned. Please address the issue differently. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 0/2] PM: runtime: Fix potential I/O hang 2025-11-26 11:31 ` [PATCH 0/2] PM: runtime: Fix potential I/O hang Rafael J. Wysocki @ 2025-11-26 15:48 ` Bart Van Assche 2025-11-26 16:59 ` Rafael J. Wysocki 0 siblings, 1 reply; 44+ messages in thread From: Bart Van Assche @ 2025-11-26 15:48 UTC (permalink / raw) To: Rafael J. Wysocki, Yang Yang Cc: Jens Axboe, Pavel Machek, Len Brown, Greg Kroah-Hartman, Danilo Krummrich, linux-block, linux-kernel, linux-pm On 11/26/25 3:31 AM, Rafael J. Wysocki wrote: > Please address the issue differently. It seems unfortunate to me that __pm_runtime_barrier() can cause pm_request_resume() to hang. Would it be safe to remove the cancel_work_sync() call from __pm_runtime_barrier() since pm_runtime_work() calls functions that check disable_depth when processing RPM_REQ_SUSPEND and RPM_REQ_AUTOSUSPEND? Would this be sufficient to fix the reported deadlock? Thanks, Bart. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 0/2] PM: runtime: Fix potential I/O hang 2025-11-26 15:48 ` Bart Van Assche @ 2025-11-26 16:59 ` Rafael J. Wysocki 2025-11-26 17:21 ` Rafael J. Wysocki 0 siblings, 1 reply; 44+ messages in thread From: Rafael J. Wysocki @ 2025-11-26 16:59 UTC (permalink / raw) To: Bart Van Assche Cc: Rafael J. Wysocki, Yang Yang, Jens Axboe, Pavel Machek, Len Brown, Greg Kroah-Hartman, Danilo Krummrich, linux-block, linux-kernel, linux-pm On Wed, Nov 26, 2025 at 4:48 PM Bart Van Assche <bvanassche@acm.org> wrote: > > On 11/26/25 3:31 AM, Rafael J. Wysocki wrote: > > Please address the issue differently. > > It seems unfortunate to me that __pm_runtime_barrier() can cause pm_request_resume() to hang. I wouldn't call it a hang. __pm_runtime_barrier() removes the work item queued by pm_request_resume(), but at the time when it is called, which is device_suspend_late(), the work item queued by pm_request_resume() cannot make progress anyway. It will only be able to make progress when the PM workqueue is unfrozen at the end of the system resume transition. > Would it be safe to remove the > cancel_work_sync() call from __pm_runtime_barrier() since > pm_runtime_work() calls functions that check disable_depth > when processing RPM_REQ_SUSPEND and RPM_REQ_AUTOSUSPEND? Would > this be sufficient to fix the reported deadlock? If you want the resume work item to survive the system suspend/resume cycle, __pm_runtime_disable() may be changed to make that happen, but this still will not allow the work to make progress until the system resume ends. I'm not sure if this would help to address the issue at hand though. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 0/2] PM: runtime: Fix potential I/O hang 2025-11-26 16:59 ` Rafael J. Wysocki @ 2025-11-26 17:21 ` Rafael J. Wysocki 2025-11-26 17:34 ` Rafael J. Wysocki 0 siblings, 1 reply; 44+ messages in thread From: Rafael J. Wysocki @ 2025-11-26 17:21 UTC (permalink / raw) To: Bart Van Assche Cc: Yang Yang, Jens Axboe, Pavel Machek, Len Brown, Greg Kroah-Hartman, Danilo Krummrich, linux-block, linux-kernel, linux-pm On Wed, Nov 26, 2025 at 5:59 PM Rafael J. Wysocki <rafael@kernel.org> wrote: > > On Wed, Nov 26, 2025 at 4:48 PM Bart Van Assche <bvanassche@acm.org> wrote: > > > > On 11/26/25 3:31 AM, Rafael J. Wysocki wrote: > > > Please address the issue differently. > > > > It seems unfortunate to me that __pm_runtime_barrier() can cause pm_request_resume() to hang. > > I wouldn't call it a hang. > > __pm_runtime_barrier() removes the work item queued by > pm_request_resume(), but at the time when it is called, which is > device_suspend_late(), the work item queued by pm_request_resume() > cannot make progress anyway. It will only be able to make progress > when the PM workqueue is unfrozen at the end of the system resume > transition. > > > Would it be safe to remove the > > cancel_work_sync() call from __pm_runtime_barrier() since > > pm_runtime_work() calls functions that check disable_depth > > when processing RPM_REQ_SUSPEND and RPM_REQ_AUTOSUSPEND? Would > > this be sufficient to fix the reported deadlock? > > If you want the resume work item to survive the system suspend/resume > cycle, __pm_runtime_disable() may be changed to make that happen, but > this still will not allow the work to make progress until the system > resume ends. > > I'm not sure if this would help to address the issue at hand though. I actually have a better idea: Why don't we resume all devices that have runtime resume work items pending at the time when device_suspend() is called? 
Arguably, somebody wanted them to runtime-resume, so they should be resumed before being prepared for system suspend and that will eliminate the issue at hand (because devices cannot suspend during system suspend/resume). ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 0/2] PM: runtime: Fix potential I/O hang 2025-11-26 17:21 ` Rafael J. Wysocki @ 2025-11-26 17:34 ` Rafael J. Wysocki 0 siblings, 0 replies; 44+ messages in thread From: Rafael J. Wysocki @ 2025-11-26 17:34 UTC (permalink / raw) To: Bart Van Assche, Yang Yang Cc: Jens Axboe, Pavel Machek, Len Brown, Greg Kroah-Hartman, Danilo Krummrich, linux-block, linux-kernel, linux-pm On Wed, Nov 26, 2025 at 6:21 PM Rafael J. Wysocki <rafael@kernel.org> wrote: > > On Wed, Nov 26, 2025 at 5:59 PM Rafael J. Wysocki <rafael@kernel.org> wrote: > > > > On Wed, Nov 26, 2025 at 4:48 PM Bart Van Assche <bvanassche@acm.org> wrote: > > > > > > On 11/26/25 3:31 AM, Rafael J. Wysocki wrote: > > > > Please address the issue differently. > > > > > > It seems unfortunate to me that __pm_runtime_barrier() can cause pm_request_resume() to hang. > > > > I wouldn't call it a hang. > > > > __pm_runtime_barrier() removes the work item queued by > > pm_request_resume(), but at the time when it is called, which is > > device_suspend_late(), the work item queued by pm_request_resume() > > cannot make progress anyway. It will only be able to make progress > > when the PM workqueue is unfrozen at the end of the system resume > > transition. > > > > > Would it be safe to remove the > > > cancel_work_sync() call from __pm_runtime_barrier() since > > > pm_runtime_work() calls functions that check disable_depth > > > when processing RPM_REQ_SUSPEND and RPM_REQ_AUTOSUSPEND? Would > > > this be sufficient to fix the reported deadlock? > > > > If you want the resume work item to survive the system suspend/resume > > cycle, __pm_runtime_disable() may be changed to make that happen, but > > this still will not allow the work to make progress until the system > > resume ends. > > > > I'm not sure if this would help to address the issue at hand though. 
>
> I actually have a better idea: Why don't we resume all devices that
> have runtime resume work items pending at the time when
> device_suspend() is called?
>
> Arguably, somebody wanted them to runtime-resume, so they should be
> resumed before being prepared for system suspend and that will
> eliminate the issue at hand (because devices cannot suspend during
> system suspend/resume).

Wait, there is a pm_runtime_barrier() call in device_suspend() that
does just that and additionally it calls __pm_runtime_barrier(), so
all of the pending runtime PM work items should be cancelled by it.

So it looks like the device in question is runtime-suspended at that
point and only later blk_pm_resume_queue() is called to resume it.
I'm wondering where it is called from.

And maybe pm_runtime_resume() should be called for it from its
->suspend() callback?
end of thread, other threads:[~2025-12-07 11:23 UTC | newest]

Thread overview: 44+ messages
2025-11-26 10:16 [PATCH 0/2] PM: runtime: Fix potential I/O hang Yang Yang
2025-11-26 10:16 ` [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable Yang Yang
2025-11-26 11:30 ` Rafael J. Wysocki
2025-11-26 11:59 ` YangYang
2025-11-26 12:36 ` Rafael J. Wysocki
2025-11-26 15:33 ` Bart Van Assche
2025-11-26 15:41 ` Rafael J. Wysocki
2025-11-26 18:40 ` Bart Van Assche
2025-11-27 11:29 ` YangYang
2025-11-27 12:44 ` Rafael J. Wysocki
2025-11-28  7:20 ` YangYang
2025-12-01 16:40 ` Bart Van Assche
2025-11-26 18:06 ` Bart Van Assche
2025-11-26 19:16 ` Rafael J. Wysocki
2025-11-26 19:34 ` Rafael J. Wysocki
2025-11-26 20:17 ` Rafael J. Wysocki
2025-11-26 21:10 ` Bart Van Assche
2025-11-26 21:30 ` Rafael J. Wysocki
2025-11-26 22:47 ` Bart Van Assche
2025-11-27 12:34 ` Rafael J. Wysocki
2025-12-01  9:46 ` YangYang
2025-12-01 12:56 ` YangYang
2025-12-01 18:55 ` Rafael J. Wysocki
2025-12-02 10:33 ` YangYang
2025-12-02 12:18 ` Rafael J. Wysocki
2025-12-01 18:47 ` Rafael J. Wysocki
2025-12-01 19:58 ` [PATCH v1] PM: sleep: Do not flag runtime PM workqueue as freezable Rafael J. Wysocki
2025-12-02  1:06 ` Bart Van Assche
2025-12-02 11:53 ` Rafael J. Wysocki
2025-12-02 13:29 ` Rafael J. Wysocki
2025-12-02 10:36 ` YangYang
2025-12-02 14:58 ` Ulf Hansson
2025-12-02  0:40 ` [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable Bart Van Assche
2025-12-02 12:14 ` Rafael J. Wysocki
2025-12-02 13:37 ` Rafael J. Wysocki
2025-12-05 15:24 ` [PATCH v2] PM: sleep: Do not flag runtime PM workqueue as freezable Rafael J. Wysocki
2025-12-05 19:10 ` Bart Van Assche
2025-12-07 11:23 ` Rafael J. Wysocki
2025-11-26 10:16 ` [PATCH 2/2] blk-mq: Fix I/O hang caused by incomplete device resume Yang Yang
2025-11-26 11:31 ` [PATCH 0/2] PM: runtime: Fix potential I/O hang Rafael J. Wysocki
2025-11-26 15:48 ` Bart Van Assche
2025-11-26 16:59 ` Rafael J. Wysocki
2025-11-26 17:21 ` Rafael J. Wysocki
2025-11-26 17:34 ` Rafael J. Wysocki
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox