* Re: Thinkpad E595 system deadlock on resume from S3
2023-10-03 15:18 ` Petr Tesařík
@ 2023-10-03 16:16 ` Petr Tesařík
2023-10-03 16:48 ` Rafael J. Wysocki
2023-10-04 6:13 ` Hannes Reinecke
2 siblings, 0 replies; 21+ messages in thread
From: Petr Tesařík @ 2023-10-03 16:16 UTC (permalink / raw)
To: Rafael J. Wysocki, Damien Le Moal; +Cc: linux-pm, linux-ide
On Tue, 3 Oct 2023 17:18:24 +0200
Petr Tesařík <petr@tesarici.cz> wrote:
> On Tue, 3 Oct 2023 14:57:46 +0200
> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
>
> > On Tue, Oct 3, 2023 at 2:51 PM Petr Tesařík <petr@tesarici.cz> wrote:
> > >
> > > On Tue, 3 Oct 2023 14:48:13 +0200
> > > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> > >
> > > > On Tue, Oct 3, 2023 at 2:40 PM Petr Tesařík <petr@tesarici.cz> wrote:
> > > > >
> > > > > On Tue, 3 Oct 2023 14:34:56 +0200
> > > > > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> > > > >
> > > > > > On Tue, Oct 3, 2023 at 1:02 PM Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > > >
> > > > > > > On Tue, 3 Oct 2023 12:15:10 +0200
> > > > > > > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> > > > > > >
> > > > > > > > On Tue, Oct 3, 2023 at 11:31 AM Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > > > > >
> > > > > > > > > Hi again (adding more recipients),
> > > > > > > > >
> > > > > > > > > On Sat, 30 Sep 2023 12:20:54 +0200
> > > > > > > > > Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > > > > >
> > > > > > > > > > Hi all,
> > > > > > > > > >
> > > > > > > > > > this time no patch (yet). In short, my Thinkpad running v6.6-rc3 fails
> > > > > > > > > > to resume from S3. It also fails the same way with Tumbleweed v6.5
> > > > > > > > > > kernel. I was able to capture a crash dump of the v6.5 kernel, and
> > > > > > > > > > here's my analysis:
> > > > > > > > > >
> > > > > > > > > > The system never gets to waking up my SATA SSD disk:
> > > > > > > > > >
> > > > > > > > > > [0:0:0:0] disk ATA KINGSTON SEDC600 H5.1 /dev/sda
> > > > > > > > > >
> > > > > > > > > > There is a pending resume work for kworker/u32:12 (PID 11032), but this
> > > > > > > > > > worker is stuck in 'D' state:
> > > > > > > > > >
> > > > > > > > > > >>> prog.stack_trace(11032)
> > > > > > > > > > #0 context_switch (../kernel/sched/core.c:5381:2)
> > > > > > > > > > #1 __schedule (../kernel/sched/core.c:6710:8)
> > > > > > > > > > #2 schedule (../kernel/sched/core.c:6786:3)
> > > > > > > > > > #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> > > > > > > > > > #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> > > > > > > > > > #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> > > > > > > > > > #6 acpi_device_hotplug (../drivers/acpi/scan.c:382:2)
> > > > > > > > > > #7 acpi_hotplug_work_fn (../drivers/acpi/osl.c:1162:2)
> > > > > > > > > > #8 process_one_work (../kernel/workqueue.c:2600:2)
> > > > > > > > > > #9 worker_thread (../kernel/workqueue.c:2751:4)
> > > > > > > > > > #10 kthread (../kernel/kthread.c:389:9)
> > > > > > > > > > #11 ret_from_fork (../arch/x86/kernel/process.c:145:3)
> > > > > > > > > > #12 ret_from_fork_asm+0x1b/0x20 (../arch/x86/entry/entry_64.S:304)
> > > > > > > > > >
> > > > > > > > > > acpi_device_hotplug() tries to acquire acpi_scan_lock, which is held by
> > > > > > > > > > systemd-sleep (PID 11002). This task is also in 'D' state:
> > > > > > > > > >
> > > > > > > > > > >>> prog.stack_trace(11002)
> > > > > > > > > > #0 context_switch (../kernel/sched/core.c:5381:2)
> > > > > > > > > > #1 __schedule (../kernel/sched/core.c:6710:8)
> > > > > > > > > > #2 schedule (../kernel/sched/core.c:6786:3)
> > > > > > > > > > #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> > > > > > > > > > #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> > > > > > > > > > #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> > > > > > > > > > #6 device_lock (../include/linux/device.h:958:2)
> > > > > > > > > > #7 device_complete (../drivers/base/power/main.c:1063:2)
> > > > > > > > > > #8 dpm_complete (../drivers/base/power/main.c:1121:3)
> > > > > > > > > > #9 suspend_devices_and_enter (../kernel/power/suspend.c:516:2)
> > > > > > > > >
> > > > > > > > > I believe the issue must be somewhere here. The whole suspend and
> > > > > > > > > resume logic in suspend_devices_and_enter() is framed by
> > > > > > > > > platform_suspend_begin() and platform_resume_end().
> > > > > > > > >
> > > > > > > > > My system is an ACPI system, so suspend_ops contains:
> > > > > > > > >
> > > > > > > > > .begin = acpi_suspend_begin,
> > > > > > > > > .end = acpi_pm_end,
> > > > > > > > >
> > > > > > > > > Now, acpi_suspend_begin() acquires acpi_scan_lock through
> > > > > > > > > acpi_pm_start(), and the lock is not released until acpi_pm_end().
> > > > > > > > > Since dpm_complete() waits for the completion of a work that tries to
> > > > > > > > > acquire acpi_scan_lock, the system will deadlock.
> > > > > > > >
> > > > > > > > So acpi_scan_lock is held across suspend-resume basically to
> > > > > > > > prevent hotplug from taking place in the meantime, IIRC.
> > > > > > > >
> > > > > > > > > AFAICS either:
> > > > > > > > >
> > > > > > > > > a. the ACPI lock cannot be held while dpm_complete() runs, or
> > > > > > > > > b. ata_scsi_dev_rescan() must not be scheduled before the system is
> > > > > > > > > resumed, or
> > > > > > > > > c. acpi_device_hotplug() must be implemented without taking dev->mutex.
> > > > > > > > >
> > > > > > > > > My gut feeling is that b. is the right answer.
> > > > > > > >
> > > > > > > > It's been a while since I last looked at that code, but then it
> > > > > > > > hasn't changed for quite some time either.
> > > > > > > >
> > > > > > > > It looks like the acpi_device_hotplug() path attempts to acquire
> > > > > > > > acpi_scan_lock() while holding dev->mutex which is kind of silly. I
> > > > > > > > need to check that, though.
> > > > > > >
> > > > > > > Thanks for being willing to look into this. Well, it's not quite
> > > > > > > what you describe. If it were a simple ABBA deadlock, lockdep
> > > > > > > would have reported it. No, it's more complicated:
> > > > > > >
> > > > > > > 1. suspend_devices_and_enter() holds acpi_scan_lock,
> > > > > > > 2. an ACPI hotplug work runs, but acpi_device_hotplug() goes to sleep
> > > > > > > when it gets to acquiring acpi_scan_lock,
> > > > > > > 3. ata_scsi_dev_rescan() submits a SCSI command and waits for its
> > > > > > > completion while holding dev->mutex,
> > > > > > > 4. the SCSI completion work happens to be put on the same workqueue as
> > > > > > > the ACPI hotplug work in step 2,
> > > > > > > ^^^--- THIS is how the two events are serialized!
> > > > > >
> > > > > > Which is unexpected.
> > > > > >
> > > > > > And quite honestly I'm not sure how this can happen, because
> > > > > > acpi_hotplug_schedule() uses a dedicated workqueue and it is called
> > > > > > from (a) the "eject" sysfs attribute (which cannot happen while system
> > > > > > suspend-resume is in progress) and (b) acpi_bus_notify() which has
> > > > > > nothing to do with SCSI.
> > > > >
> > > > > Oh, you're right, and I was too quick. They cannot be on the same
> > > > > queue...
> > > > >
> > > > > > Maybe the workqueue used for the SCSI completion is freezable?
> > > > >
> > > > > Yes, that's it:
> > > > >
> > > > > *(struct workqueue_struct *)0xffff97d240b2fe00 = {
> > > > > /* ... */
> > > > > .flags = (unsigned int)4,
> > > > > /* WQ_FREEZABLE = 1 << 2 */
> > > > >
> > > > > Good. But if this workqueue is frozen, the system still cannot make
> > > > > progress.
> > > >
> > > > The problem seems to be that dev->mutex is held while the work item
> > > > goes to a freezable workqueue and is waited for, which is an almost
> > > > guaranteed deadlock scenario.
> > >
> > > Ah. Thanks for the explanation and direction! I'm going to dive into the
> > > block layer and/or SCSI code and bug other people with my findings.
> >
> > Please feel free to CC me on that in case I can help.
>
> And here I am again... The frozen workqueue is in fact pm_wq, and the
> work item that is waited for is pm_runtime_work. The block layer calls
> pm_request_resume() on the device to resume the queue.
>
> I bet the queue should not be resumed this early. In fact, it seems
> that this is somewhat known to the ATA developers, because
> ata_scsi_dev_rescan() contains this beautiful comment and code:
>
> /*
> * If the rescan work was scheduled because of a resume
> * event, the port is already fully resumed, but the
> * SCSI device may not yet be fully resumed. In such
> * case, executing scsi_rescan_device() may cause a
> * deadlock with the PM code on device_lock(). Prevent
> * this by giving up and retrying rescan after a short
> * delay.
> */
> delay_rescan = sdev->sdev_gendev.power.is_suspended;
> if (delay_rescan) {
> scsi_device_put(sdev);
> break;
> }
>
> It just doesn't seem to work as expected, at least not in my case.
Erm. This was v6.5.2, where I originally encountered the issue. I can
see that this snippet has just been changed by commit 8b4d9469d0b0.
@Damien: I can still reproduce a deadlock on resume with v6.6-rc4.
Petr T
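For anyone replaying this analysis from a crash dump, the .flags value in
the drgn output quoted above can be decoded against the WQ_* constants. The
flag values below are taken from include/linux/workqueue.h as of v6.5/v6.6;
verify them against the tree you are actually debugging:

```python
# Decode struct workqueue_struct .flags as seen in the drgn dump above.
# Flag values from include/linux/workqueue.h (v6.5/v6.6).
WQ_UNBOUND       = 1 << 1
WQ_FREEZABLE     = 1 << 2
WQ_MEM_RECLAIM   = 1 << 3
WQ_HIGHPRI       = 1 << 4
WQ_CPU_INTENSIVE = 1 << 5

def decode_wq_flags(flags):
    names = {
        WQ_UNBOUND: "WQ_UNBOUND",
        WQ_FREEZABLE: "WQ_FREEZABLE",
        WQ_MEM_RECLAIM: "WQ_MEM_RECLAIM",
        WQ_HIGHPRI: "WQ_HIGHPRI",
        WQ_CPU_INTENSIVE: "WQ_CPU_INTENSIVE",
    }
    return [name for bit, name in names.items() if flags & bit]

print(decode_wq_flags(0x4))  # ['WQ_FREEZABLE']
```

The dumped value 4 decodes to exactly WQ_FREEZABLE, which matches pm_wq
being created with alloc_workqueue("pm", WQ_FREEZABLE, 0).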
^ permalink raw reply [flat|nested] 21+ messages in thread

* Re: Thinkpad E595 system deadlock on resume from S3
2023-10-03 15:18 ` Petr Tesařík
2023-10-03 16:16 ` Petr Tesařík
@ 2023-10-03 16:48 ` Rafael J. Wysocki
2023-10-03 17:19 ` Petr Tesařík
2023-10-03 20:07 ` Petr Tesařík
2023-10-04 6:13 ` Hannes Reinecke
2 siblings, 2 replies; 21+ messages in thread
From: Rafael J. Wysocki @ 2023-10-03 16:48 UTC (permalink / raw)
To: Petr Tesařík
Cc: Rafael J. Wysocki, Damien Le Moal, linux-pm, linux-ide
On Tue, Oct 3, 2023 at 5:18 PM Petr Tesařík <petr@tesarici.cz> wrote:
>
> On Tue, 3 Oct 2023 14:57:46 +0200
> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
>
> > On Tue, Oct 3, 2023 at 2:51 PM Petr Tesařík <petr@tesarici.cz> wrote:
> > >
> > > On Tue, 3 Oct 2023 14:48:13 +0200
> > > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> > >
> > > > On Tue, Oct 3, 2023 at 2:40 PM Petr Tesařík <petr@tesarici.cz> wrote:
> > > > >
> > > > > On Tue, 3 Oct 2023 14:34:56 +0200
> > > > > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> > > > >
> > > > > > On Tue, Oct 3, 2023 at 1:02 PM Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > > >
> > > > > > > On Tue, 3 Oct 2023 12:15:10 +0200
> > > > > > > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> > > > > > >
> > > > > > > > On Tue, Oct 3, 2023 at 11:31 AM Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > > > > >
> > > > > > > > > Hi again (adding more recipients),
> > > > > > > > >
> > > > > > > > > On Sat, 30 Sep 2023 12:20:54 +0200
> > > > > > > > > Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > > > > >
> > > > > > > > > > Hi all,
> > > > > > > > > >
> > > > > > > > > > this time no patch (yet). In short, my Thinkpad running v6.6-rc3 fails
> > > > > > > > > > to resume from S3. It also fails the same way with Tumbleweed v6.5
> > > > > > > > > > kernel. I was able to capture a crash dump of the v6.5 kernel, and
> > > > > > > > > > here's my analysis:
> > > > > > > > > >
> > > > > > > > > > The system never gets to waking up my SATA SSD disk:
> > > > > > > > > >
> > > > > > > > > > [0:0:0:0] disk ATA KINGSTON SEDC600 H5.1 /dev/sda
> > > > > > > > > >
> > > > > > > > > > There is a pending resume work for kworker/u32:12 (PID 11032), but this
> > > > > > > > > > worker is stuck in 'D' state:
> > > > > > > > > >
> > > > > > > > > > >>> prog.stack_trace(11032)
> > > > > > > > > > #0 context_switch (../kernel/sched/core.c:5381:2)
> > > > > > > > > > #1 __schedule (../kernel/sched/core.c:6710:8)
> > > > > > > > > > #2 schedule (../kernel/sched/core.c:6786:3)
> > > > > > > > > > #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> > > > > > > > > > #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> > > > > > > > > > #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> > > > > > > > > > #6 acpi_device_hotplug (../drivers/acpi/scan.c:382:2)
> > > > > > > > > > #7 acpi_hotplug_work_fn (../drivers/acpi/osl.c:1162:2)
> > > > > > > > > > #8 process_one_work (../kernel/workqueue.c:2600:2)
> > > > > > > > > > #9 worker_thread (../kernel/workqueue.c:2751:4)
> > > > > > > > > > #10 kthread (../kernel/kthread.c:389:9)
> > > > > > > > > > #11 ret_from_fork (../arch/x86/kernel/process.c:145:3)
> > > > > > > > > > #12 ret_from_fork_asm+0x1b/0x20 (../arch/x86/entry/entry_64.S:304)
> > > > > > > > > >
> > > > > > > > > > acpi_device_hotplug() tries to acquire acpi_scan_lock, which is held by
> > > > > > > > > > systemd-sleep (PID 11002). This task is also in 'D' state:
> > > > > > > > > >
> > > > > > > > > > >>> prog.stack_trace(11002)
> > > > > > > > > > #0 context_switch (../kernel/sched/core.c:5381:2)
> > > > > > > > > > #1 __schedule (../kernel/sched/core.c:6710:8)
> > > > > > > > > > #2 schedule (../kernel/sched/core.c:6786:3)
> > > > > > > > > > #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> > > > > > > > > > #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> > > > > > > > > > #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> > > > > > > > > > #6 device_lock (../include/linux/device.h:958:2)
> > > > > > > > > > #7 device_complete (../drivers/base/power/main.c:1063:2)
> > > > > > > > > > #8 dpm_complete (../drivers/base/power/main.c:1121:3)
> > > > > > > > > > #9 suspend_devices_and_enter (../kernel/power/suspend.c:516:2)
> > > > > > > > >
> > > > > > > > > I believe the issue must be somewhere here. The whole suspend and
> > > > > > > > > resume logic in suspend_devices_and_enter() is framed by
> > > > > > > > > platform_suspend_begin() and platform_resume_end().
> > > > > > > > >
> > > > > > > > > My system is an ACPI system, so suspend_ops contains:
> > > > > > > > >
> > > > > > > > > .begin = acpi_suspend_begin,
> > > > > > > > > .end = acpi_pm_end,
> > > > > > > > >
> > > > > > > > > Now, acpi_suspend_begin() acquires acpi_scan_lock through
> > > > > > > > > acpi_pm_start(), and the lock is not released until acpi_pm_end().
> > > > > > > > > Since dpm_complete() waits for the completion of a work that tries to
> > > > > > > > > acquire acpi_scan_lock, the system will deadlock.
> > > > > > > >
> > > > > > > > So acpi_scan_lock is held across suspend-resume basically to
> > > > > > > > prevent hotplug from taking place in the meantime, IIRC.
> > > > > > > >
> > > > > > > > > AFAICS either:
> > > > > > > > >
> > > > > > > > > a. the ACPI lock cannot be held while dpm_complete() runs, or
> > > > > > > > > b. ata_scsi_dev_rescan() must not be scheduled before the system is
> > > > > > > > > resumed, or
> > > > > > > > > c. acpi_device_hotplug() must be implemented without taking dev->mutex.
> > > > > > > > >
> > > > > > > > > My gut feeling is that b. is the right answer.
> > > > > > > >
> > > > > > > > It's been a while since I last looked at that code, but then it
> > > > > > > > hasn't changed for quite some time either.
> > > > > > > >
> > > > > > > > It looks like the acpi_device_hotplug() path attempts to acquire
> > > > > > > > acpi_scan_lock() while holding dev->mutex which is kind of silly. I
> > > > > > > > need to check that, though.
> > > > > > >
> > > > > > > Thanks for being willing to look into this. Well, it's not quite
> > > > > > > what you describe. If it were a simple ABBA deadlock, lockdep
> > > > > > > would have reported it. No, it's more complicated:
> > > > > > >
> > > > > > > 1. suspend_devices_and_enter() holds acpi_scan_lock,
> > > > > > > 2. an ACPI hotplug work runs, but acpi_device_hotplug() goes to sleep
> > > > > > > when it gets to acquiring acpi_scan_lock,
> > > > > > > 3. ata_scsi_dev_rescan() submits a SCSI command and waits for its
> > > > > > > completion while holding dev->mutex,
> > > > > > > 4. the SCSI completion work happens to be put on the same workqueue as
> > > > > > > the ACPI hotplug work in step 2,
> > > > > > > ^^^--- THIS is how the two events are serialized!
> > > > > >
> > > > > > Which is unexpected.
> > > > > >
> > > > > > And quite honestly I'm not sure how this can happen, because
> > > > > > acpi_hotplug_schedule() uses a dedicated workqueue and it is called
> > > > > > from (a) the "eject" sysfs attribute (which cannot happen while system
> > > > > > suspend-resume is in progress) and (b) acpi_bus_notify() which has
> > > > > > nothing to do with SCSI.
> > > > >
> > > > > Oh, you're right, and I was too quick. They cannot be on the same
> > > > > queue...
> > > > >
> > > > > > Maybe the workqueue used for the SCSI completion is freezable?
> > > > >
> > > > > Yes, that's it:
> > > > >
> > > > > *(struct workqueue_struct *)0xffff97d240b2fe00 = {
> > > > > /* ... */
> > > > > .flags = (unsigned int)4,
> > > > > /* WQ_FREEZABLE = 1 << 2 */
> > > > >
> > > > > Good. But if this workqueue is frozen, the system still cannot make
> > > > > progress.
> > > >
> > > > The problem seems to be that dev->mutex is held while the work item
> > > > goes to a freezable workqueue and is waited for, which is an almost
> > > > guaranteed deadlock scenario.
> > >
> > > Ah. Thanks for the explanation and direction! I'm going to dive into the
> > > block layer and/or SCSI code and bug other people with my findings.
> >
> > Please feel free to CC me on that in case I can help.
>
> And here I am again... The frozen workqueue is in fact pm_wq, and the
> work item that is waited for is pm_runtime_work. The block layer calls
> pm_request_resume() on the device to resume the queue.
If it called pm_runtime_resume() instead, this might work.
> I bet the queue should not be resumed this early. In fact, it seems
> that this is somewhat known to the ATA developers, because
> ata_scsi_dev_rescan() contains this beautiful comment and code:
>
> /*
> * If the rescan work was scheduled because of a resume
> * event, the port is already fully resumed, but the
> * SCSI device may not yet be fully resumed. In such
> * case, executing scsi_rescan_device() may cause a
> * deadlock with the PM code on device_lock(). Prevent
> * this by giving up and retrying rescan after a short
> * delay.
> */
> delay_rescan = sdev->sdev_gendev.power.is_suspended;
> if (delay_rescan) {
> scsi_device_put(sdev);
> break;
> }
>
> It just doesn't seem to work as expected, at least not in my case.
Well, calling pm_request_resume() and waiting for the resume to
actually happen is problematic regardless. It is just better to call
pm_runtime_resume() to synchronously resume the device instead.
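The failure mode can be mimicked in userspace with a toy model: an
asynchronous "resume request" lands on a frozen workqueue, so anyone who
blocks waiting for it will wait forever, while a synchronous call runs in
the caller's context and does not depend on the queue at all. This is
purely an analogy, not kernel code; FreezableWorkqueue and resume_device
are illustrative names, not real APIs:

```python
import queue
import threading
import time

class FreezableWorkqueue:
    """Toy stand-in for a WQ_FREEZABLE workqueue like pm_wq."""
    def __init__(self):
        self._items = queue.Queue()
        self._frozen = threading.Event()
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            fn = self._items.get()
            while self._frozen.is_set():  # frozen: sit on the item
                time.sleep(0.01)
            fn()

    def queue_work(self, fn):
        self._items.put(fn)

    def freeze(self):
        self._frozen.set()

def resume_device(done_event):
    done_event.set()

pm_wq = FreezableWorkqueue()
pm_wq.freeze()                                 # suspend in progress

# Async path (pm_request_resume analogue): the work sits on the frozen
# queue, so waiting for it here can never succeed.
resumed = threading.Event()
pm_wq.queue_work(lambda: resume_device(resumed))
deadlocked = not resumed.wait(timeout=0.2)

# Sync path (pm_runtime_resume analogue): runs in the caller's context,
# independent of the frozen workqueue.
resumed_sync = threading.Event()
resume_device(resumed_sync)

print(deadlocked, resumed_sync.is_set())  # True True
```

In the real kernel the waiter also holds dev->mutex while stuck, which is
what turns this stall into the system-wide deadlock analyzed above.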
^ permalink raw reply [flat|nested] 21+ messages in thread

* Re: Thinkpad E595 system deadlock on resume from S3
2023-10-03 16:48 ` Rafael J. Wysocki
@ 2023-10-03 17:19 ` Petr Tesařík
2023-10-03 20:07 ` Petr Tesařík
1 sibling, 0 replies; 21+ messages in thread
From: Petr Tesařík @ 2023-10-03 17:19 UTC (permalink / raw)
To: Rafael J. Wysocki; +Cc: Damien Le Moal, linux-pm, linux-ide
On Tue, 3 Oct 2023 18:48:27 +0200
"Rafael J. Wysocki" <rafael@kernel.org> wrote:
> On Tue, Oct 3, 2023 at 5:18 PM Petr Tesařík <petr@tesarici.cz> wrote:
> >
> > On Tue, 3 Oct 2023 14:57:46 +0200
> > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> >
> > > On Tue, Oct 3, 2023 at 2:51 PM Petr Tesařík <petr@tesarici.cz> wrote:
> > > >
> > > > On Tue, 3 Oct 2023 14:48:13 +0200
> > > > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> > > >
> > > > > On Tue, Oct 3, 2023 at 2:40 PM Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > >
> > > > > > On Tue, 3 Oct 2023 14:34:56 +0200
> > > > > > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> > > > > >
> > > > > > > On Tue, Oct 3, 2023 at 1:02 PM Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > > > >
> > > > > > > > On Tue, 3 Oct 2023 12:15:10 +0200
> > > > > > > > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> > > > > > > >
> > > > > > > > > On Tue, Oct 3, 2023 at 11:31 AM Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > > > > > >
> > > > > > > > > > Hi again (adding more recipients),
> > > > > > > > > >
> > > > > > > > > > On Sat, 30 Sep 2023 12:20:54 +0200
> > > > > > > > > > Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi all,
> > > > > > > > > > >
> > > > > > > > > > > this time no patch (yet). In short, my Thinkpad running v6.6-rc3 fails
> > > > > > > > > > > to resume from S3. It also fails the same way with Tumbleweed v6.5
> > > > > > > > > > > kernel. I was able to capture a crash dump of the v6.5 kernel, and
> > > > > > > > > > > here's my analysis:
> > > > > > > > > > >
> > > > > > > > > > > The system never gets to waking up my SATA SSD disk:
> > > > > > > > > > >
> > > > > > > > > > > [0:0:0:0] disk ATA KINGSTON SEDC600 H5.1 /dev/sda
> > > > > > > > > > >
> > > > > > > > > > > There is a pending resume work for kworker/u32:12 (PID 11032), but this
> > > > > > > > > > > worker is stuck in 'D' state:
> > > > > > > > > > >
> > > > > > > > > > > >>> prog.stack_trace(11032)
> > > > > > > > > > > #0 context_switch (../kernel/sched/core.c:5381:2)
> > > > > > > > > > > #1 __schedule (../kernel/sched/core.c:6710:8)
> > > > > > > > > > > #2 schedule (../kernel/sched/core.c:6786:3)
> > > > > > > > > > > #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> > > > > > > > > > > #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> > > > > > > > > > > #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> > > > > > > > > > > #6 acpi_device_hotplug (../drivers/acpi/scan.c:382:2)
> > > > > > > > > > > #7 acpi_hotplug_work_fn (../drivers/acpi/osl.c:1162:2)
> > > > > > > > > > > #8 process_one_work (../kernel/workqueue.c:2600:2)
> > > > > > > > > > > #9 worker_thread (../kernel/workqueue.c:2751:4)
> > > > > > > > > > > #10 kthread (../kernel/kthread.c:389:9)
> > > > > > > > > > > #11 ret_from_fork (../arch/x86/kernel/process.c:145:3)
> > > > > > > > > > > #12 ret_from_fork_asm+0x1b/0x20 (../arch/x86/entry/entry_64.S:304)
> > > > > > > > > > >
> > > > > > > > > > > acpi_device_hotplug() tries to acquire acpi_scan_lock, which is held by
> > > > > > > > > > > systemd-sleep (PID 11002). This task is also in 'D' state:
> > > > > > > > > > >
> > > > > > > > > > > >>> prog.stack_trace(11002)
> > > > > > > > > > > #0 context_switch (../kernel/sched/core.c:5381:2)
> > > > > > > > > > > #1 __schedule (../kernel/sched/core.c:6710:8)
> > > > > > > > > > > #2 schedule (../kernel/sched/core.c:6786:3)
> > > > > > > > > > > #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> > > > > > > > > > > #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> > > > > > > > > > > #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> > > > > > > > > > > #6 device_lock (../include/linux/device.h:958:2)
> > > > > > > > > > > #7 device_complete (../drivers/base/power/main.c:1063:2)
> > > > > > > > > > > #8 dpm_complete (../drivers/base/power/main.c:1121:3)
> > > > > > > > > > > #9 suspend_devices_and_enter (../kernel/power/suspend.c:516:2)
> > > > > > > > > >
> > > > > > > > > > I believe the issue must be somewhere here. The whole suspend and
> > > > > > > > > > resume logic in suspend_devices_and_enter() is framed by
> > > > > > > > > > platform_suspend_begin() and platform_resume_end().
> > > > > > > > > >
> > > > > > > > > > My system is an ACPI system, so suspend_ops contains:
> > > > > > > > > >
> > > > > > > > > > .begin = acpi_suspend_begin,
> > > > > > > > > > .end = acpi_pm_end,
> > > > > > > > > >
> > > > > > > > > > Now, acpi_suspend_begin() acquires acpi_scan_lock through
> > > > > > > > > > acpi_pm_start(), and the lock is not released until acpi_pm_end().
> > > > > > > > > > Since dpm_complete() waits for the completion of a work that tries to
> > > > > > > > > > acquire acpi_scan_lock, the system will deadlock.
> > > > > > > > >
> > > > > > > > > So acpi_scan_lock is held across suspend-resume basically to
> > > > > > > > > prevent hotplug from taking place in the meantime, IIRC.
> > > > > > > > >
> > > > > > > > > > AFAICS either:
> > > > > > > > > >
> > > > > > > > > > a. the ACPI lock cannot be held while dpm_complete() runs, or
> > > > > > > > > > b. ata_scsi_dev_rescan() must not be scheduled before the system is
> > > > > > > > > > resumed, or
> > > > > > > > > > c. acpi_device_hotplug() must be implemented without taking dev->mutex.
> > > > > > > > > >
> > > > > > > > > > My gut feeling is that b. is the right answer.
> > > > > > > > >
> > > > > > > > > It's been a while since I last looked at that code, but then it
> > > > > > > > > hasn't changed for quite some time either.
> > > > > > > > >
> > > > > > > > > It looks like the acpi_device_hotplug() path attempts to acquire
> > > > > > > > > acpi_scan_lock() while holding dev->mutex which is kind of silly. I
> > > > > > > > > need to check that, though.
> > > > > > > >
> > > > > > > > Thanks for being willing to look into this. Well, it's not quite
> > > > > > > > what you describe. If it were a simple ABBA deadlock, lockdep
> > > > > > > > would have reported it. No, it's more complicated:
> > > > > > > >
> > > > > > > > 1. suspend_devices_and_enter() holds acpi_scan_lock,
> > > > > > > > 2. an ACPI hotplug work runs, but acpi_device_hotplug() goes to sleep
> > > > > > > > when it gets to acquiring acpi_scan_lock,
> > > > > > > > 3. ata_scsi_dev_rescan() submits a SCSI command and waits for its
> > > > > > > > completion while holding dev->mutex,
> > > > > > > > 4. the SCSI completion work happens to be put on the same workqueue as
> > > > > > > > the ACPI hotplug work in step 2,
> > > > > > > > ^^^--- THIS is how the two events are serialized!
> > > > > > >
> > > > > > > Which is unexpected.
> > > > > > >
> > > > > > > And quite honestly I'm not sure how this can happen, because
> > > > > > > acpi_hotplug_schedule() uses a dedicated workqueue and it is called
> > > > > > > from (a) the "eject" sysfs attribute (which cannot happen while system
> > > > > > > suspend-resume is in progress) and (b) acpi_bus_notify() which has
> > > > > > > nothing to do with SCSI.
> > > > > >
> > > > > > Oh, you're right, and I was too quick. They cannot be on the same
> > > > > > queue...
> > > > > >
> > > > > > > Maybe the workqueue used for the SCSI completion is freezable?
> > > > > >
> > > > > > Yes, that's it:
> > > > > >
> > > > > > *(struct workqueue_struct *)0xffff97d240b2fe00 = {
> > > > > > /* ... */
> > > > > > .flags = (unsigned int)4,
> > > > > > /* WQ_FREEZABLE = 1 << 2 */
> > > > > >
> > > > > > Good. But if this workqueue is frozen, the system still cannot make
> > > > > > progress.
> > > > >
> > > > > The problem seems to be that dev->mutex is held while the work item
> > > > > goes to a freezable workqueue and is waited for, which is an almost
> > > > > guaranteed deadlock scenario.
> > > >
> > > > Ah. Thanks for the explanation and direction! I'm going to dive into the
> > > > block layer and/or SCSI code and bug other people with my findings.
> > >
> > > Please feel free to CC me on that in case I can help.
> >
> > And here I am again... The frozen workqueue is in fact pm_wq, and the
> > work item that is waited for is pm_runtime_work. The block layer calls
> > pm_request_resume() on the device to resume the queue.
>
> If it called pm_runtime_resume() instead, this might work.
>
> > I bet the queue should not be resumed this early. In fact, it seems
> > that this is somewhat known to the ATA developers, because
> > ata_scsi_dev_rescan() contains this beautiful comment and code:
> >
> > /*
> > * If the rescan work was scheduled because of a resume
> > * event, the port is already fully resumed, but the
> > * SCSI device may not yet be fully resumed. In such
> > * case, executing scsi_rescan_device() may cause a
> > * deadlock with the PM code on device_lock(). Prevent
> > * this by giving up and retrying rescan after a short
> > * delay.
> > */
> > delay_rescan = sdev->sdev_gendev.power.is_suspended;
> > if (delay_rescan) {
> > scsi_device_put(sdev);
> > break;
> > }
> >
> > It just doesn't seem to work as expected, at least not in my case.
>
> Well, calling pm_request_resume() and waiting for the resume to
> actually happen is problematic regardless. It is just better to call
> pm_runtime_resume() to synchronously resume the device instead.
That code is in blk_pm_resume_queue(), which is pretty deep in the
block layer. I can imagine that pm_runtime_resume() cannot always be
used there. We may have to add an alternative API just for resume.
OTOH this may in fact be the correct fix.
But I want to try out a few other things before proposing a solution.
Petr T
^ permalink raw reply [flat|nested] 21+ messages in thread

* Re: Thinkpad E595 system deadlock on resume from S3
2023-10-03 16:48 ` Rafael J. Wysocki
2023-10-03 17:19 ` Petr Tesařík
@ 2023-10-03 20:07 ` Petr Tesařík
2023-10-04 1:25 ` Damien Le Moal
1 sibling, 1 reply; 21+ messages in thread
From: Petr Tesařík @ 2023-10-03 20:07 UTC (permalink / raw)
To: Rafael J. Wysocki; +Cc: Damien Le Moal, linux-pm, linux-ide
On Tue, 3 Oct 2023 18:48:27 +0200
"Rafael J. Wysocki" <rafael@kernel.org> wrote:
> On Tue, Oct 3, 2023 at 5:18 PM Petr Tesařík <petr@tesarici.cz> wrote:
> >
> > On Tue, 3 Oct 2023 14:57:46 +0200
> > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> >
> > > On Tue, Oct 3, 2023 at 2:51 PM Petr Tesařík <petr@tesarici.cz> wrote:
> > > >
> > > > On Tue, 3 Oct 2023 14:48:13 +0200
> > > > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> > > >
> > > > > On Tue, Oct 3, 2023 at 2:40 PM Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > >
> > > > > > On Tue, 3 Oct 2023 14:34:56 +0200
> > > > > > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> > > > > >
> > > > > > > On Tue, Oct 3, 2023 at 1:02 PM Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > > > >
> > > > > > > > On Tue, 3 Oct 2023 12:15:10 +0200
> > > > > > > > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> > > > > > > >
> > > > > > > > > On Tue, Oct 3, 2023 at 11:31 AM Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > > > > > >
> > > > > > > > > > Hi again (adding more recipients),
> > > > > > > > > >
> > > > > > > > > > On Sat, 30 Sep 2023 12:20:54 +0200
> > > > > > > > > > Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi all,
> > > > > > > > > > >
> > > > > > > > > > > this time no patch (yet). In short, my Thinkpad running v6.6-rc3 fails
> > > > > > > > > > > to resume from S3. It also fails the same way with Tumbleweed v6.5
> > > > > > > > > > > kernel. I was able to capture a crash dump of the v6.5 kernel, and
> > > > > > > > > > > here's my analysis:
> > > > > > > > > > >
> > > > > > > > > > > The system never gets to waking up my SATA SSD disk:
> > > > > > > > > > >
> > > > > > > > > > > [0:0:0:0] disk ATA KINGSTON SEDC600 H5.1 /dev/sda
> > > > > > > > > > >
> > > > > > > > > > > There is a pending resume work for kworker/u32:12 (PID 11032), but this
> > > > > > > > > > > worker is stuck in 'D' state:
> > > > > > > > > > >
> > > > > > > > > > > >>> prog.stack_trace(11032)
> > > > > > > > > > > #0 context_switch (../kernel/sched/core.c:5381:2)
> > > > > > > > > > > #1 __schedule (../kernel/sched/core.c:6710:8)
> > > > > > > > > > > #2 schedule (../kernel/sched/core.c:6786:3)
> > > > > > > > > > > #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> > > > > > > > > > > #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> > > > > > > > > > > #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> > > > > > > > > > > #6 acpi_device_hotplug (../drivers/acpi/scan.c:382:2)
> > > > > > > > > > > #7 acpi_hotplug_work_fn (../drivers/acpi/osl.c:1162:2)
> > > > > > > > > > > #8 process_one_work (../kernel/workqueue.c:2600:2)
> > > > > > > > > > > #9 worker_thread (../kernel/workqueue.c:2751:4)
> > > > > > > > > > > #10 kthread (../kernel/kthread.c:389:9)
> > > > > > > > > > > #11 ret_from_fork (../arch/x86/kernel/process.c:145:3)
> > > > > > > > > > > #12 ret_from_fork_asm+0x1b/0x20 (../arch/x86/entry/entry_64.S:304)
> > > > > > > > > > >
> > > > > > > > > > > acpi_device_hotplug() tries to acquire acpi_scan_lock, which is held by
> > > > > > > > > > > systemd-sleep (PID 11002). This task is also in 'D' state:
> > > > > > > > > > >
> > > > > > > > > > > >>> prog.stack_trace(11002)
> > > > > > > > > > > #0 context_switch (../kernel/sched/core.c:5381:2)
> > > > > > > > > > > #1 __schedule (../kernel/sched/core.c:6710:8)
> > > > > > > > > > > #2 schedule (../kernel/sched/core.c:6786:3)
> > > > > > > > > > > #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> > > > > > > > > > > #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> > > > > > > > > > > #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> > > > > > > > > > > #6 device_lock (../include/linux/device.h:958:2)
> > > > > > > > > > > #7 device_complete (../drivers/base/power/main.c:1063:2)
> > > > > > > > > > > #8 dpm_complete (../drivers/base/power/main.c:1121:3)
> > > > > > > > > > > #9 suspend_devices_and_enter (../kernel/power/suspend.c:516:2)
> > > > > > > > > >
> > > > > > > > > > I believe the issue must be somewhere here. The whole suspend and
> > > > > > > > > > resume logic in suspend_devices_and_enter() is framed by
> > > > > > > > > > platform_suspend_begin() and platform_resume_end().
> > > > > > > > > >
> > > > > > > > > > My system is an ACPI system, so suspend_ops contains:
> > > > > > > > > >
> > > > > > > > > > .begin = acpi_suspend_begin,
> > > > > > > > > > .end = acpi_pm_end,
> > > > > > > > > >
> > > > > > > > > > Now, acpi_suspend_begin() acquires acpi_scan_lock through
> > > > > > > > > > acpi_pm_start(), and the lock is not released until acpi_pm_end().
> > > > > > > > > > Since dpm_complete() waits for the completion of a work that tries to
> > > > > > > > > > acquire acpi_scan_lock, the system will deadlock.
> > > > > > > > >
> > > > > > > > > So holding acpi_scan_lock across suspend-resume is basically to
> > > > > > > > > prevent the hotplug from taking place then IIRC.
> > > > > > > > >
> > > > > > > > > > AFAICS either:
> > > > > > > > > >
> > > > > > > > > > a. the ACPI lock cannot be held while dpm_complete() runs, or
> > > > > > > > > > b. ata_scsi_dev_rescan() must not be scheduled before the system is
> > > > > > > > > > resumed, or
> > > > > > > > > > c. acpi_device_hotplug() must be implemented without taking dev->mutex.
> > > > > > > > > >
> > > > > > > > > > My gut feeling is that b. is the right answer.
> > > > > > > > >
> > > > > > > > > It's been a while since I looked at that code last time, but then it
> > > > > > > > > has not changed for quite some time too.
> > > > > > > > >
> > > > > > > > > It looks like the acpi_device_hotplug() path attempts to acquire
> > > > > > > > > acpi_scan_lock() while holding dev->mutex which is kind of silly. I
> > > > > > > > > need to check that, though.
> > > > > > > >
> > > > > > > > Thanks for your willingness. Well, it's not quite what you describe. If
> > > > > > > > it was a simple ABBA deadlock, then it would be reported by lockdep.
> > > > > > > > No, it's more complicated:
> > > > > > > >
> > > > > > > > 1. suspend_devices_and_enter() holds acpi_scan_lock,
> > > > > > > > 2. an ACPI hotplug work runs, but acpi_device_hotplug() goes to sleep
> > > > > > > > when it gets to acquiring acpi_scan_lock,
> > > > > > > > 3. ata_scsi_dev_rescan() submits a SCSI command and waits for its
> > > > > > > > completion while holding dev->mutex,
> > > > > > > > 4. the SCSI completion work happens to be put on the same workqueue as
> > > > > > > > the ACPI hotplug work in step 2,
> > > > > > > > ^^^--- THIS is how the two events are serialized!
> > > > > > >
> > > > > > > Which is unexpected.
> > > > > > >
> > > > > > > And quite honestly I'm not sure how this can happen, because
> > > > > > > acpi_hotplug_schedule() uses a dedicated workqueue and it is called
> > > > > > > from (a) the "eject" sysfs attribute (which cannot happen while system
> > > > > > > suspend-resume is in progress) and (b) acpi_bus_notify() which has
> > > > > > > nothing to do with SCSI.
> > > > > >
> > > > > > Oh, you're right, and I was too quick. They cannot be on the same
> > > > > > queue...
> > > > > >
> > > > > > > Maybe the workqueue used for the SCSI completion is freezable?
> > > > > >
> > > > > > Yes, that's it:
> > > > > >
> > > > > > *(struct workqueue_struct *)0xffff97d240b2fe00 = {
> > > > > > /* ... */
> > > > > > .flags = (unsigned int)4,
> > > > > > /* WQ_FREEZABLE = 1 << 2 */
> > > > > >
> > > > > > Good. But if this workqueue is frozen, the system still cannot make
> > > > > > progress.
> > > > >
> > > > > The problem seems to be that dev->mutex is held while the work item
> > > > > goes to a freezable workqueue and is waited for, which is an almost
> > > > > guaranteed deadlock scenario.
> > > >
> > > > Ah. Thanks for explanation and direction! I'm going to dive into the
> > > > block layer and/or SCSI code and bug other people with my findings.
> > >
> > > Please feel free to CC me on that in case I can help.
> >
> > And here I am again... The frozen workqueue is in fact pm_wq, and the
> > work item that is waited for is pm_runtime_work. The block layer calls
> > pm_request_resume() on the device to resume the queue.
>
> If it called pm_runtime_resume() instead, this might work.
>
> > I bet the queue should not be resumed this early. In fact, it seems
> > that this is somewhat known to the ATA developers, because
> > ata_scsi_dev_rescan() contains this beautiful comment and code:
> >
> > /*
> > * If the rescan work was scheduled because of a resume
> > * event, the port is already fully resumed, but the
> > * SCSI device may not yet be fully resumed. In such
> > * case, executing scsi_rescan_device() may cause a
> > * deadlock with the PM code on device_lock(). Prevent
> > * this by giving up and retrying rescan after a short
> > * delay.
> > */
> > delay_rescan = sdev->sdev_gendev.power.is_suspended;
> > if (delay_rescan) {
> > scsi_device_put(sdev);
> > break;
> > }
> >
> > It just doesn't seem to work as expected, at least not in my case.
>
> Well, calling pm_request_resume() and waiting for the resume to
> actually happen is problematic regardless. It is just better to call
> pm_runtime_resume() to synchronously resume the device instead.
I just want to confirm that my understanding of the issue is correct
now. After applying the patch below, my laptop has just survived half a
dozen suspend/resume cycles with autosuspend enabled for the SATA SSD
disk. Without the patch, it usually freezes on first attempt.
diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index a371b497035e..87000f89319e 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -4768,6 +4768,14 @@ void ata_scsi_dev_rescan(struct work_struct *work)
ata_for_each_dev(dev, link, ENABLED) {
struct scsi_device *sdev = dev->sdev;
+ /*
+ * If the queue accepts only PM, bail out.
+ */
+ if (blk_queue_pm_only(sdev->request_queue)) {
+ ret = -EAGAIN;
+ goto unlock;
+ }
+
/*
* If the port was suspended before this was scheduled,
* bail out.
Petr T
^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: Thinkpad E595 system deadlock on resume from S3
2023-10-03 20:07 ` Petr Tesařík
@ 2023-10-04 1:25 ` Damien Le Moal
2023-10-04 6:43 ` Petr Tesařík
0 siblings, 1 reply; 21+ messages in thread
From: Damien Le Moal @ 2023-10-04 1:25 UTC (permalink / raw)
To: Petr Tesařík, Rafael J. Wysocki; +Cc: linux-pm, linux-ide
On 10/4/23 05:07, Petr Tesařík wrote:
> I just want to confirm that my understanding of the issue is correct
> now. After applying the patch below, my laptop has just survived half a
> dozen suspend/resume cycles with autosuspend enabled for the SATA SSD
> disk. Without the patch, it usually freezes on first attempt.
>
> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
> index a371b497035e..87000f89319e 100644
> --- a/drivers/ata/libata-scsi.c
> +++ b/drivers/ata/libata-scsi.c
> @@ -4768,6 +4768,14 @@ void ata_scsi_dev_rescan(struct work_struct *work)
> ata_for_each_dev(dev, link, ENABLED) {
> struct scsi_device *sdev = dev->sdev;
>
> + /*
> + * If the queue accepts only PM, bail out.
> + */
> + if (blk_queue_pm_only(sdev->request_queue)) {
> + ret = -EAGAIN;
> + goto unlock;
> + }
> +
> /*
> * If the port was suspended before this was scheduled,
> * bail out.
This seems sensible to me: scsi_rescan_device() only checks that the device is
running, without checking that the device queue is also resumed. So the right
place for this check is scsi_rescan_device():
diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index 3db4d31a03a1..b05a55f498a2 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -1627,12 +1627,13 @@ int scsi_rescan_device(struct scsi_device *sdev)
device_lock(dev);
/*
- * Bail out if the device is not running. Otherwise, the rescan may
- * block waiting for commands to be executed, with us holding the
- * device lock. This can result in a potential deadlock in the power
- * management core code when system resume is on-going.
+ * Bail out if the device or its queue are not running. Otherwise,
+ * the rescan may block waiting for commands to be executed, with us
+ * holding the device lock. This can result in a potential deadlock
+ * in the power management core code when system resume is on-going.
*/
- if (sdev->sdev_state != SDEV_RUNNING) {
+ if (sdev->sdev_state != SDEV_RUNNING ||
+ blk_queue_pm_only(sdev->request_queue)) {
ret = -EWOULDBLOCK;
goto unlock;
}
Can you try the above to confirm it works for you? It should, as that is
essentially the same check as yours.
--
Damien Le Moal
Western Digital Research
^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: Thinkpad E595 system deadlock on resume from S3
2023-10-04 1:25 ` Damien Le Moal
@ 2023-10-04 6:43 ` Petr Tesařík
2023-10-04 7:38 ` Damien Le Moal
0 siblings, 1 reply; 21+ messages in thread
From: Petr Tesařík @ 2023-10-04 6:43 UTC (permalink / raw)
To: Damien Le Moal; +Cc: Rafael J. Wysocki, linux-pm, linux-ide, Hannes Reinecke
On Wed, 4 Oct 2023 10:25:47 +0900
Damien Le Moal <dlemoal@kernel.org> wrote:
> On 10/4/23 05:07, Petr Tesařík wrote:
> > I just want to confirm that my understanding of the issue is correct
> > now. After applying the patch below, my laptop has just survived half a
> > dozen suspend/resume cycles with autosuspend enabled for the SATA SSD
> > disk. Without the patch, it usually freezes on first attempt.
> >
> > diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
> > index a371b497035e..87000f89319e 100644
> > --- a/drivers/ata/libata-scsi.c
> > +++ b/drivers/ata/libata-scsi.c
> > @@ -4768,6 +4768,14 @@ void ata_scsi_dev_rescan(struct work_struct *work)
> > ata_for_each_dev(dev, link, ENABLED) {
> > struct scsi_device *sdev = dev->sdev;
> >
> > + /*
> > + * If the queue accepts only PM, bail out.
> > + */
> > + if (blk_queue_pm_only(sdev->request_queue)) {
> > + ret = -EAGAIN;
> > + goto unlock;
> > + }
> > +
> > /*
> > * If the port was suspended before this was scheduled,
> > * bail out.
>
> This seems sensible to me: scsi_rescan_device() only checks that the device is
> running, without checking that the device queue is also resumed. So the right
> place for this check is scsi_rescan_device():
>
> diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
> index 3db4d31a03a1..b05a55f498a2 100644
> --- a/drivers/scsi/scsi_scan.c
> +++ b/drivers/scsi/scsi_scan.c
> @@ -1627,12 +1627,13 @@ int scsi_rescan_device(struct scsi_device *sdev)
> device_lock(dev);
>
> /*
> - * Bail out if the device is not running. Otherwise, the rescan may
> - * block waiting for commands to be executed, with us holding the
> - * device lock. This can result in a potential deadlock in the power
> - * management core code when system resume is on-going.
> + * Bail out if the device or its queue are not running. Otherwise,
> + * the rescan may block waiting for commands to be executed, with us
> + * holding the device lock. This can result in a potential deadlock
> + * in the power management core code when system resume is on-going.
> */
> - if (sdev->sdev_state != SDEV_RUNNING) {
> + if (sdev->sdev_state != SDEV_RUNNING ||
> + blk_queue_pm_only(sdev->request_queue)) {
> ret = -EWOULDBLOCK;
> goto unlock;
> }
>
> Can you try the above to confirm it works for you ? It should, as that is
> essentially the same as you did.
After reverting my patch and applying yours, my system can still resume
from S3. Congrats!
Tested-by: Petr Tesarik <petr@tesarici.cz>
Petr T
^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Thinkpad E595 system deadlock on resume from S3
2023-10-04 6:43 ` Petr Tesařík
@ 2023-10-04 7:38 ` Damien Le Moal
0 siblings, 0 replies; 21+ messages in thread
From: Damien Le Moal @ 2023-10-04 7:38 UTC (permalink / raw)
To: Petr Tesařík
Cc: Rafael J. Wysocki, linux-pm, linux-ide, Hannes Reinecke
On 10/4/23 15:43, Petr Tesařík wrote:
> On Wed, 4 Oct 2023 10:25:47 +0900
> Damien Le Moal <dlemoal@kernel.org> wrote:
>
>> On 10/4/23 05:07, Petr Tesařík wrote:
>>> I just want to confirm that my understanding of the issue is correct
>>> now. After applying the patch below, my laptop has just survived half a
>>> dozen suspend/resume cycles with autosuspend enabled for the SATA SSD
>>> disk. Without the patch, it usually freezes on first attempt.
>>>
>>> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
>>> index a371b497035e..87000f89319e 100644
>>> --- a/drivers/ata/libata-scsi.c
>>> +++ b/drivers/ata/libata-scsi.c
>>> @@ -4768,6 +4768,14 @@ void ata_scsi_dev_rescan(struct work_struct *work)
>>> ata_for_each_dev(dev, link, ENABLED) {
>>> struct scsi_device *sdev = dev->sdev;
>>>
>>> + /*
>>> + * If the queue accepts only PM, bail out.
>>> + */
>>> + if (blk_queue_pm_only(sdev->request_queue)) {
>>> + ret = -EAGAIN;
>>> + goto unlock;
>>> + }
>>> +
>>> /*
>>> * If the port was suspended before this was scheduled,
>>> * bail out.
>>
>> This seems sensible to me: scsi_rescan_device() only checks that the device is
>> running, without checking that the device queue is also resumed. So the right
>> place for this check is scsi_rescan_device():
>>
>> diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
>> index 3db4d31a03a1..b05a55f498a2 100644
>> --- a/drivers/scsi/scsi_scan.c
>> +++ b/drivers/scsi/scsi_scan.c
>> @@ -1627,12 +1627,13 @@ int scsi_rescan_device(struct scsi_device *sdev)
>> device_lock(dev);
>>
>> /*
>> - * Bail out if the device is not running. Otherwise, the rescan may
>> - * block waiting for commands to be executed, with us holding the
>> - * device lock. This can result in a potential deadlock in the power
>> - * management core code when system resume is on-going.
>> + * Bail out if the device or its queue are not running. Otherwise,
>> + * the rescan may block waiting for commands to be executed, with us
>> + * holding the device lock. This can result in a potential deadlock
>> + * in the power management core code when system resume is on-going.
>> */
>> - if (sdev->sdev_state != SDEV_RUNNING) {
>> + if (sdev->sdev_state != SDEV_RUNNING ||
>> + blk_queue_pm_only(sdev->request_queue)) {
>> ret = -EWOULDBLOCK;
>> goto unlock;
>> }
>>
>> Can you try the above to confirm it works for you ? It should, as that is
>> essentially the same as you did.
>
> After revertung my patch and applying yours, my system can still resume
> from S3. Congrats!
>
> Tested-by: Petr Tesarik <petr@tesarici.cz>
Thanks. I will send a proper patch asap.
>
> Petr T
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Thinkpad E595 system deadlock on resume from S3
2023-10-03 15:18 ` Petr Tesařík
2023-10-03 16:16 ` Petr Tesařík
2023-10-03 16:48 ` Rafael J. Wysocki
@ 2023-10-04 6:13 ` Hannes Reinecke
2023-10-04 6:18 ` Damien Le Moal
2 siblings, 1 reply; 21+ messages in thread
From: Hannes Reinecke @ 2023-10-04 6:13 UTC (permalink / raw)
To: Petr Tesařík, Rafael J. Wysocki, Damien Le Moal
Cc: linux-pm, linux-ide
On 10/3/23 17:18, Petr Tesařík wrote:
> On Tue, 3 Oct 2023 14:57:46 +0200
> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
>
>> On Tue, Oct 3, 2023 at 2:51 PM Petr Tesařík <petr@tesarici.cz> wrote:
>>>
>>> On Tue, 3 Oct 2023 14:48:13 +0200
>>> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
>>>
>>>> On Tue, Oct 3, 2023 at 2:40 PM Petr Tesařík <petr@tesarici.cz> wrote:
>>>>>
>>>>> On Tue, 3 Oct 2023 14:34:56 +0200
>>>>> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
>>>>>
>>>>>> On Tue, Oct 3, 2023 at 1:02 PM Petr Tesařík <petr@tesarici.cz> wrote:
>>>>>>>
>>>>>>> On Tue, 3 Oct 2023 12:15:10 +0200
>>>>>>> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
>>>>>>>
>>>>>>>> On Tue, Oct 3, 2023 at 11:31 AM Petr Tesařík <petr@tesarici.cz> wrote:
>>>>>>>>>
>>>>>>>>> Hi again (adding more recipients),
>>>>>>>>>
>>>>>>>>> On Sat, 30 Sep 2023 12:20:54 +0200
>>>>>>>>> Petr Tesařík <petr@tesarici.cz> wrote:
>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> this time no patch (yet). In short, my Thinkpad running v6.6-rc3 fails
>>>>>>>>>> to resume from S3. It also fails the same way with Tumbleweed v6.5
>>>>>>>>>> kernel. I was able to capture a crash dump of the v6.5 kernel, and
>>>>>>>>>> here's my analysis:
>>>>>>>>>>
>>>>>>>>>> The system never gets to waking up my SATA SSD disk:
>>>>>>>>>>
>>>>>>>>>> [0:0:0:0] disk ATA KINGSTON SEDC600 H5.1 /dev/sda
>>>>>>>>>>
>>>>>>>>>> There is a pending resume work for kworker/u32:12 (PID 11032), but this
>>>>>>>>>> worker is stuck in 'D' state:
>>>>>>>>>>
>>>>>>>>>>>>> prog.stack_trace(11032)
>>>>>>>>>> #0 context_switch (../kernel/sched/core.c:5381:2)
>>>>>>>>>> #1 __schedule (../kernel/sched/core.c:6710:8)
>>>>>>>>>> #2 schedule (../kernel/sched/core.c:6786:3)
>>>>>>>>>> #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
>>>>>>>>>> #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
>>>>>>>>>> #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
>>>>>>>>>> #6 acpi_device_hotplug (../drivers/acpi/scan.c:382:2)
>>>>>>>>>> #7 acpi_hotplug_work_fn (../drivers/acpi/osl.c:1162:2)
>>>>>>>>>> #8 process_one_work (../kernel/workqueue.c:2600:2)
>>>>>>>>>> #9 worker_thread (../kernel/workqueue.c:2751:4)
>>>>>>>>>> #10 kthread (../kernel/kthread.c:389:9)
>>>>>>>>>> #11 ret_from_fork (../arch/x86/kernel/process.c:145:3)
>>>>>>>>>> #12 ret_from_fork_asm+0x1b/0x20 (../arch/x86/entry/entry_64.S:304)
>>>>>>>>>>
>>>>>>>>>> acpi_device_hotplug() tries to acquire acpi_scan_lock, which is held by
>>>>>>>>>> systemd-sleep (PID 11002). This task is also in 'D' state:
>>>>>>>>>>
>>>>>>>>>>>>> prog.stack_trace(11002)
>>>>>>>>>> #0 context_switch (../kernel/sched/core.c:5381:2)
>>>>>>>>>> #1 __schedule (../kernel/sched/core.c:6710:8)
>>>>>>>>>> #2 schedule (../kernel/sched/core.c:6786:3)
>>>>>>>>>> #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
>>>>>>>>>> #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
>>>>>>>>>> #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
>>>>>>>>>> #6 device_lock (../include/linux/device.h:958:2)
>>>>>>>>>> #7 device_complete (../drivers/base/power/main.c:1063:2)
>>>>>>>>>> #8 dpm_complete (../drivers/base/power/main.c:1121:3)
>>>>>>>>>> #9 suspend_devices_and_enter (../kernel/power/suspend.c:516:2)
>>>>>>>>>
>>>>>>>>> I believe the issue must be somewhere here. The whole suspend and
>>>>>>>>> resume logic in suspend_devices_and_enter() is framed by
>>>>>>>>> platform_suspend_begin() and platform_resume_end().
>>>>>>>>>
>>>>>>>>> My system is an ACPI system, so suspend_ops contains:
>>>>>>>>>
>>>>>>>>> .begin = acpi_suspend_begin,
>>>>>>>>> .end = acpi_pm_end,
>>>>>>>>>
>>>>>>>>> Now, acpi_suspend_begin() acquires acpi_scan_lock through
>>>>>>>>> acpi_pm_start(), and the lock is not released until acpi_pm_end().
>>>>>>>>> Since dpm_complete() waits for the completion of a work that tries to
>>>>>>>>> acquire acpi_scan_lock, the system will deadlock.
>>>>>>>>
>>>>>>>> So holding acpi_scan_lock across suspend-resume is basically to
>>>>>>>> prevent the hotplug from taking place then IIRC.
>>>>>>>>
>>>>>>>>> AFAICS either:
>>>>>>>>>
>>>>>>>>> a. the ACPI lock cannot be held while dpm_complete() runs, or
>>>>>>>>> b. ata_scsi_dev_rescan() must not be scheduled before the system is
>>>>>>>>> resumed, or
>>>>>>>>> c. acpi_device_hotplug() must be implemented without taking dev->mutex.
>>>>>>>>>
>>>>>>>>> My gut feeling is that b. is the right answer.
>>>>>>>>
>>>>>>>> It's been a while since I looked at that code last time, but then it
>>>>>>>> has not changed for quite some time too.
>>>>>>>>
>>>>>>>> It looks like the acpi_device_hotplug() path attempts to acquire
>>>>>>>> acpi_scan_lock() while holding dev->mutex which is kind of silly. I
>>>>>>>> need to check that, though.
>>>>>>>
>>>>>>> Thanks for your willingness. Well, it's not quite what you describe. If
>>>>>>> it was a simple ABBA deadlock, then it would be reported by lockdep.
>>>>>>> No, it's more complicated:
>>>>>>>
>>>>>>> 1. suspend_devices_and_enter() holds acpi_scan_lock,
>>>>>>> 2. an ACPI hotplug work runs, but acpi_device_hotplug() goes to sleep
>>>>>>> when it gets to acquiring acpi_scan_lock,
>>>>>>> 3. ata_scsi_dev_rescan() submits a SCSI command and waits for its
>>>>>>> completion while holding dev->mutex,
>>>>>>> 4. the SCSI completion work happens to be put on the same workqueue as
>>>>>>> the ACPI hotplug work in step 2,
>>>>>>> ^^^--- THIS is how the two events are serialized!
>>>>>>
>>>>>> Which is unexpected.
>>>>>>
>>>>>> And quite honestly I'm not sure how this can happen, because
>>>>>> acpi_hotplug_schedule() uses a dedicated workqueue and it is called
>>>>>> from (a) the "eject" sysfs attribute (which cannot happen while system
>>>>>> suspend-resume is in progress) and (b) acpi_bus_notify() which has
>>>>>> nothing to do with SCSI.
>>>>>
>>>>> Oh, you're right, and I was too quick. They cannot be on the same
>>>>> queue...
>>>>>
>>>>>> Maybe the workqueue used for the SCSI completion is freezable?
>>>>>
>>>>> Yes, that's it:
>>>>>
>>>>> *(struct workqueue_struct *)0xffff97d240b2fe00 = {
>>>>> /* ... */
>>>>> .flags = (unsigned int)4,
>>>>> /* WQ_FREEZABLE = 1 << 2 */
>>>>>
>>>>> Good. But if this workqueue is frozen, the system still cannot make
>>>>> progress.
>>>>
>>>> The problem seems to be that dev->mutex is held while the work item
>>>> goes to a freezable workqueue and is waited for, which is an almost
>>>> guaranteed deadlock scenario.
>>>
>>> Ah. Thanks for explanation and direction! I'm going to dive into the
>>> block layer and/or SCSI code and bug other people with my findings.
>>
>> Please feel free to CC me on that in case I can help.
>
> And here I am again... The frozen workqueue is in fact pm_wq, and the
> work item that is waited for is pm_runtime_work. The block layer calls
> pm_request_resume() on the device to resume the queue.
>
> I bet the queue should not be resumed this early. In fact, it seems
> that this is somewhat known to the ATA developers, because
> ata_scsi_dev_rescan() contains this beautiful comment and code:
>
> /*
> * If the rescan work was scheduled because of a resume
> * event, the port is already fully resumed, but the
> * SCSI device may not yet be fully resumed. In such
> * case, executing scsi_rescan_device() may cause a
> * deadlock with the PM code on device_lock(). Prevent
> * this by giving up and retrying rescan after a short
> * delay.
> */
> delay_rescan = sdev->sdev_gendev.power.is_suspended;
> if (delay_rescan) {
> scsi_device_put(sdev);
> break;
> }
>
> It just doesn't seem to work as expected, at least not in my case.
>
Did you test the libata suspend/resume patches from Damien?
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman
^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Thinkpad E595 system deadlock on resume from S3
2023-10-04 6:13 ` Hannes Reinecke
@ 2023-10-04 6:18 ` Damien Le Moal
2023-10-04 6:42 ` Petr Tesařík
0 siblings, 1 reply; 21+ messages in thread
From: Damien Le Moal @ 2023-10-04 6:18 UTC (permalink / raw)
To: Hannes Reinecke, Petr Tesařík, Rafael J. Wysocki
Cc: linux-pm, linux-ide
On 10/4/23 15:13, Hannes Reinecke wrote:
> On 10/3/23 17:18, Petr Tesařík wrote:
>> On Tue, 3 Oct 2023 14:57:46 +0200
>> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
>>
>>> On Tue, Oct 3, 2023 at 2:51 PM Petr Tesařík <petr@tesarici.cz> wrote:
>>>>
>>>> On Tue, 3 Oct 2023 14:48:13 +0200
>>>> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
>>>>
>>>>> On Tue, Oct 3, 2023 at 2:40 PM Petr Tesařík <petr@tesarici.cz> wrote:
>>>>>>
>>>>>> On Tue, 3 Oct 2023 14:34:56 +0200
>>>>>> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
>>>>>>
>>>>>>> On Tue, Oct 3, 2023 at 1:02 PM Petr Tesařík <petr@tesarici.cz> wrote:
>>>>>>>>
>>>>>>>> On Tue, 3 Oct 2023 12:15:10 +0200
>>>>>>>> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
>>>>>>>>
>>>>>>>>> On Tue, Oct 3, 2023 at 11:31 AM Petr Tesařík <petr@tesarici.cz> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi again (adding more recipients),
>>>>>>>>>>
>>>>>>>>>> On Sat, 30 Sep 2023 12:20:54 +0200
>>>>>>>>>> Petr Tesařík <petr@tesarici.cz> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> this time no patch (yet). In short, my Thinkpad running v6.6-rc3 fails
>>>>>>>>>>> to resume from S3. It also fails the same way with Tumbleweed v6.5
>>>>>>>>>>> kernel. I was able to capture a crash dump of the v6.5 kernel, and
>>>>>>>>>>> here's my analysis:
>>>>>>>>>>>
>>>>>>>>>>> The system never gets to waking up my SATA SSD disk:
>>>>>>>>>>>
>>>>>>>>>>> [0:0:0:0] disk ATA KINGSTON SEDC600 H5.1 /dev/sda
>>>>>>>>>>>
>>>>>>>>>>> There is a pending resume work for kworker/u32:12 (PID 11032), but this
>>>>>>>>>>> worker is stuck in 'D' state:
>>>>>>>>>>>
>>>>>>>>>>>>>> prog.stack_trace(11032)
>>>>>>>>>>> #0 context_switch (../kernel/sched/core.c:5381:2)
>>>>>>>>>>> #1 __schedule (../kernel/sched/core.c:6710:8)
>>>>>>>>>>> #2 schedule (../kernel/sched/core.c:6786:3)
>>>>>>>>>>> #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
>>>>>>>>>>> #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
>>>>>>>>>>> #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
>>>>>>>>>>> #6 acpi_device_hotplug (../drivers/acpi/scan.c:382:2)
>>>>>>>>>>> #7 acpi_hotplug_work_fn (../drivers/acpi/osl.c:1162:2)
>>>>>>>>>>> #8 process_one_work (../kernel/workqueue.c:2600:2)
>>>>>>>>>>> #9 worker_thread (../kernel/workqueue.c:2751:4)
>>>>>>>>>>> #10 kthread (../kernel/kthread.c:389:9)
>>>>>>>>>>> #11 ret_from_fork (../arch/x86/kernel/process.c:145:3)
>>>>>>>>>>> #12 ret_from_fork_asm+0x1b/0x20 (../arch/x86/entry/entry_64.S:304)
>>>>>>>>>>>
>>>>>>>>>>> acpi_device_hotplug() tries to acquire acpi_scan_lock, which is held by
>>>>>>>>>>> systemd-sleep (PID 11002). This task is also in 'D' state:
>>>>>>>>>>>
>>>>>>>>>>>>>> prog.stack_trace(11002)
>>>>>>>>>>> #0 context_switch (../kernel/sched/core.c:5381:2)
>>>>>>>>>>> #1 __schedule (../kernel/sched/core.c:6710:8)
>>>>>>>>>>> #2 schedule (../kernel/sched/core.c:6786:3)
>>>>>>>>>>> #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
>>>>>>>>>>> #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
>>>>>>>>>>> #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
>>>>>>>>>>> #6 device_lock (../include/linux/device.h:958:2)
>>>>>>>>>>> #7 device_complete (../drivers/base/power/main.c:1063:2)
>>>>>>>>>>> #8 dpm_complete (../drivers/base/power/main.c:1121:3)
>>>>>>>>>>> #9 suspend_devices_and_enter (../kernel/power/suspend.c:516:2)
>>>>>>>>>>
>>>>>>>>>> I believe the issue must be somewhere here. The whole suspend and
>>>>>>>>>> resume logic in suspend_devices_and_enter() is framed by
>>>>>>>>>> platform_suspend_begin() and platform_resume_end().
>>>>>>>>>>
>>>>>>>>>> My system is an ACPI system, so suspend_ops contains:
>>>>>>>>>>
>>>>>>>>>> .begin = acpi_suspend_begin,
>>>>>>>>>> .end = acpi_pm_end,
>>>>>>>>>>
>>>>>>>>>> Now, acpi_suspend_begin() acquires acpi_scan_lock through
>>>>>>>>>> acpi_pm_start(), and the lock is not released until acpi_pm_end().
>>>>>>>>>> Since dpm_complete() waits for the completion of a work that tries to
>>>>>>>>>> acquire acpi_scan_lock, the system will deadlock.
>>>>>>>>>
>>>>>>>>> So holding acpi_scan_lock across suspend-resume is basically to
>>>>>>>>> prevent the hotplug from taking place then IIRC.
>>>>>>>>>
>>>>>>>>>> AFAICS either:
>>>>>>>>>>
>>>>>>>>>> a. the ACPI lock cannot be held while dpm_complete() runs, or
>>>>>>>>>> b. ata_scsi_dev_rescan() must not be scheduled before the system is
>>>>>>>>>> resumed, or
>>>>>>>>>> c. acpi_device_hotplug() must be implemented without taking dev->mutex.
>>>>>>>>>>
>>>>>>>>>> My gut feeling is that b. is the right answer.
>>>>>>>>>
>>>>>>>>> It's been a while since I looked at that code last time, but then it
>>>>>>>>> has not changed for quite some time too.
>>>>>>>>>
>>>>>>>>> It looks like the acpi_device_hotplug() path attempts to acquire
>>>>>>>>> acpi_scan_lock() while holding dev->mutex which is kind of silly. I
>>>>>>>>> need to check that, though.
>>>>>>>>
>>>>>>>> Thanks for your willingness. Well, it's not quite what you describe. If
>>>>>>>> it was a simple ABBA deadlock, then it would be reported by lockdep.
>>>>>>>> No, it's more complicated:
>>>>>>>>
>>>>>>>> 1. suspend_devices_and_enter() holds acpi_scan_lock,
>>>>>>>> 2. an ACPI hotplug work runs, but acpi_device_hotplug() goes to sleep
>>>>>>>> when it gets to acquiring acpi_scan_lock,
>>>>>>>> 3. ata_scsi_dev_rescan() submits a SCSI command and waits for its
>>>>>>>> completion while holding dev->mutex,
>>>>>>>> 4. the SCSI completion work happens to be put on the same workqueue as
>>>>>>>> the ACPI hotplug work in step 2,
>>>>>>>> ^^^--- THIS is how the two events are serialized!
>>>>>>>
>>>>>>> Which is unexpected.
>>>>>>>
>>>>>>> And quite honestly I'm not sure how this can happen, because
>>>>>>> acpi_hotplug_schedule() uses a dedicated workqueue and it is called
>>>>>>> from (a) the "eject" sysfs attribute (which cannot happen while system
>>>>>>> suspend-resume is in progress) and (b) acpi_bus_notify() which has
>>>>>>> nothing to do with SCSI.
>>>>>>
>>>>>> Oh, you're right, and I was too quick. They cannot be on the same
>>>>>> queue...
>>>>>>
>>>>>>> Maybe the workqueue used for the SCSI completion is freezable?
>>>>>>
>>>>>> Yes, that's it:
>>>>>>
>>>>>> *(struct workqueue_struct *)0xffff97d240b2fe00 = {
>>>>>> /* ... */
>>>>>> .flags = (unsigned int)4,
>>>>>> /* WQ_FREEZABLE = 1 << 2 */
>>>>>>
>>>>>> Good. But if this workqueue is frozen, the system still cannot make
>>>>>> progress.
>>>>>
>>>>> The problem seems to be that dev->mutex is held while the work item
>>>>> goes to a freezable workqueue and is waited for, which is an almost
>>>>> guaranteed deadlock scenario.
>>>>
>>>> Ah. Thanks for explanation and direction! I'm going to dive into the
>>>> block layer and/or SCSI code and bug other people with my findings.
>>>
>>> Please feel free to CC me on that in case I can help.
>>
>> And here I am again... The frozen workqueue is in fact pm_wq, and the
>> work item that is waited for is pm_runtime_work. The block layer calls
>> pm_request_resume() on the device to resume the queue.
>>
>> I bet the queue should not be resumed this early. In fact, it seems
>> that this is somewhat known to the ATA developers, because
>> ata_scsi_dev_rescan() contains this beautiful comment and code:
>>
>> /*
>> * If the rescan work was scheduled because of a resume
>> * event, the port is already fully resumed, but the
>> * SCSI device may not yet be fully resumed. In such
>> * case, executing scsi_rescan_device() may cause a
>> * deadlock with the PM code on device_lock(). Prevent
>> * this by giving up and retrying rescan after a short
>> * delay.
>> */
>> delay_rescan = sdev->sdev_gendev.power.is_suspended;
>> if (delay_rescan) {
>> scsi_device_put(sdev);
>> break;
>> }
>>
>> It just doesn't seem to work as expected, at least not in my case.
>>
> Did you test the libata suspend/resume patches from Damien?
I think so, since the hang was reproduced on v6.6-rc4, as mentioned in another
email in this thread.
>
> Cheers,
>
> Hannes
--
Damien Le Moal
Western Digital Research
^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Thinkpad E595 system deadlock on resume from S3
2023-10-04 6:18 ` Damien Le Moal
@ 2023-10-04 6:42 ` Petr Tesařík
0 siblings, 0 replies; 21+ messages in thread
From: Petr Tesařík @ 2023-10-04 6:42 UTC (permalink / raw)
To: Damien Le Moal, Hannes Reinecke; +Cc: Rafael J. Wysocki, linux-pm, linux-ide
On Wed, 4 Oct 2023 15:18:34 +0900
Damien Le Moal <dlemoal@kernel.org> wrote:
> On 10/4/23 15:13, Hannes Reinecke wrote:
> > On 10/3/23 17:18, Petr Tesařík wrote:
> >> On Tue, 3 Oct 2023 14:57:46 +0200
> >> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> >>
> >>> On Tue, Oct 3, 2023 at 2:51 PM Petr Tesařík <petr@tesarici.cz> wrote:
> >>>>
> >>>> On Tue, 3 Oct 2023 14:48:13 +0200
> >>>> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> >>>>
> >>>>> On Tue, Oct 3, 2023 at 2:40 PM Petr Tesařík <petr@tesarici.cz> wrote:
> >>>>>>
> >>>>>> On Tue, 3 Oct 2023 14:34:56 +0200
> >>>>>> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> >>>>>>
> >>>>>>> On Tue, Oct 3, 2023 at 1:02 PM Petr Tesařík <petr@tesarici.cz> wrote:
> >>>>>>>>
> >>>>>>>> On Tue, 3 Oct 2023 12:15:10 +0200
> >>>>>>>> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> >>>>>>>>
> >>>>>>>>> On Tue, Oct 3, 2023 at 11:31 AM Petr Tesařík <petr@tesarici.cz> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi again (adding more recipients),
> >>>>>>>>>>
> >>>>>>>>>> On Sat, 30 Sep 2023 12:20:54 +0200
> >>>>>>>>>> Petr Tesařík <petr@tesarici.cz> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi all,
> >>>>>>>>>>>
> >>>>>>>>>>> this time no patch (yet). In short, my Thinkpad running v6.6-rc3 fails
> >>>>>>>>>>> to resume from S3. It also fails the same way with Tumbleweed v6.5
> >>>>>>>>>>> kernel. I was able to capture a crash dump of the v6.5 kernel, and
> >>>>>>>>>>> here's my analysis:
> >>>>>>>>>>>
> >>>>>>>>>>> The system never gets to waking up my SATA SSD disk:
> >>>>>>>>>>>
> >>>>>>>>>>> [0:0:0:0] disk ATA KINGSTON SEDC600 H5.1 /dev/sda
> >>>>>>>>>>>
> >>>>>>>>>>> There is a pending resume work for kworker/u32:12 (PID 11032), but this
> >>>>>>>>>>> worker is stuck in 'D' state:
> >>>>>>>>>>>
> >>>>>>>>>>>>>> prog.stack_trace(11032)
> >>>>>>>>>>> #0 context_switch (../kernel/sched/core.c:5381:2)
> >>>>>>>>>>> #1 __schedule (../kernel/sched/core.c:6710:8)
> >>>>>>>>>>> #2 schedule (../kernel/sched/core.c:6786:3)
> >>>>>>>>>>> #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> >>>>>>>>>>> #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> >>>>>>>>>>> #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> >>>>>>>>>>> #6 acpi_device_hotplug (../drivers/acpi/scan.c:382:2)
> >>>>>>>>>>> #7 acpi_hotplug_work_fn (../drivers/acpi/osl.c:1162:2)
> >>>>>>>>>>> #8 process_one_work (../kernel/workqueue.c:2600:2)
> >>>>>>>>>>> #9 worker_thread (../kernel/workqueue.c:2751:4)
> >>>>>>>>>>> #10 kthread (../kernel/kthread.c:389:9)
> >>>>>>>>>>> #11 ret_from_fork (../arch/x86/kernel/process.c:145:3)
> >>>>>>>>>>> #12 ret_from_fork_asm+0x1b/0x20 (../arch/x86/entry/entry_64.S:304)
> >>>>>>>>>>>
> >>>>>>>>>>> acpi_device_hotplug() tries to acquire acpi_scan_lock, which is held by
> >>>>>>>>>>> systemd-sleep (PID 11002). This task is also in 'D' state:
> >>>>>>>>>>>
> >>>>>>>>>>>>>> prog.stack_trace(11002)
> >>>>>>>>>>> #0 context_switch (../kernel/sched/core.c:5381:2)
> >>>>>>>>>>> #1 __schedule (../kernel/sched/core.c:6710:8)
> >>>>>>>>>>> #2 schedule (../kernel/sched/core.c:6786:3)
> >>>>>>>>>>> #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> >>>>>>>>>>> #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> >>>>>>>>>>> #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> >>>>>>>>>>> #6 device_lock (../include/linux/device.h:958:2)
> >>>>>>>>>>> #7 device_complete (../drivers/base/power/main.c:1063:2)
> >>>>>>>>>>> #8 dpm_complete (../drivers/base/power/main.c:1121:3)
> >>>>>>>>>>> #9 suspend_devices_and_enter (../kernel/power/suspend.c:516:2)
> >>>>>>>>>>
> >>>>>>>>>> I believe the issue must be somewhere here. The whole suspend and
> >>>>>>>>>> resume logic in suspend_devices_and_enter() is framed by
> >>>>>>>>>> platform_suspend_begin() and platform_resume_end().
> >>>>>>>>>>
> >>>>>>>>>> My system is an ACPI system, so suspend_ops contains:
> >>>>>>>>>>
> >>>>>>>>>> .begin = acpi_suspend_begin,
> >>>>>>>>>> .end = acpi_pm_end,
> >>>>>>>>>>
> >>>>>>>>>> Now, acpi_suspend_begin() acquires acpi_scan_lock through
> >>>>>>>>>> acpi_pm_start(), and the lock is not released until acpi_pm_end().
> >>>>>>>>>> Since dpm_complete() waits for the completion of a work that tries to
> >>>>>>>>>> acquire acpi_scan_lock, the system will deadlock.
> >>>>>>>>>
> >>>>>>>>> So holding acpi_scan_lock across suspend-resume is basically to
> >>>>>>>>> prevent the hotplug from taking place then IIRC.
> >>>>>>>>>
> >>>>>>>>>> AFAICS either:
> >>>>>>>>>>
> >>>>>>>>>> a. the ACPI lock cannot be held while dpm_complete() runs, or
> >>>>>>>>>> b. ata_scsi_dev_rescan() must not be scheduled before the system is
> >>>>>>>>>> resumed, or
> >>>>>>>>>> c. acpi_device_hotplug() must be implemented without taking dev->mutex.
> >>>>>>>>>>
> >>>>>>>>>> My gut feeling is that b. is the right answer.
> >>>>>>>>>
> >>>>>>>>> It's been a while since I looked at that code last time, but then it
> >>>>>>>>> has not changed for quite some time too.
> >>>>>>>>>
> >>>>>>>>> It looks like the acpi_device_hotplug() path attempts to acquire
> >>>>>>>>> acpi_scan_lock() while holding dev->mutex which is kind of silly. I
> >>>>>>>>> need to check that, though.
> >>>>>>>>
> >>>>>>>> Thanks for your willingness to help. Well, it's not quite what you describe. If
> >>>>>>>> it were a simple ABBA deadlock, it would be reported by lockdep.
> >>>>>>>> No, it's more complicated:
> >>>>>>>>
> >>>>>>>> 1. suspend_devices_and_enter() holds acpi_scan_lock,
> >>>>>>>> 2. an ACPI hotplug work runs, but acpi_device_hotplug() goes to sleep
> >>>>>>>> when it gets to acquiring acpi_scan_lock,
> >>>>>>>> 3. ata_scsi_dev_rescan() submits a SCSI command and waits for its
> >>>>>>>> completion while holding dev->mutex,
> >>>>>>>> 4. the SCSI completion work happens to be put on the same workqueue as
> >>>>>>>> the ACPI hotplug work in step 2,
> >>>>>>>> ^^^--- THIS is how the two events are serialized!
> >>>>>>>
> >>>>>>> Which is unexpected.
> >>>>>>>
> >>>>>>> And quite honestly I'm not sure how this can happen, because
> >>>>>>> acpi_hotplug_schedule() uses a dedicated workqueue and it is called
> >>>>>>> from (a) the "eject" sysfs attribute (which cannot happen while system
> >>>>>>> suspend-resume is in progress) and (b) acpi_bus_notify() which has
> >>>>>>> nothing to do with SCSI.
> >>>>>>
> >>>>>> Oh, you're right, and I was too quick. They cannot be on the same
> >>>>>> queue...
> >>>>>>
> >>>>>>> Maybe the workqueue used for the SCSI completion is freezable?
> >>>>>>
> >>>>>> Yes, that's it:
> >>>>>>
> >>>>>> *(struct workqueue_struct *)0xffff97d240b2fe00 = {
> >>>>>> /* ... */
> >>>>>> .flags = (unsigned int)4,
> >>>>>> /* WQ_FREEZABLE = 1 << 2 */
> >>>>>>
> >>>>>> Good. But if this workqueue is frozen, the system still cannot make
> >>>>>> progress.
> >>>>>
> >>>>> The problem seems to be that dev->mutex is held while the work item
> >>>>> goes to a freezable workqueue and is waited for, which is an almost
> >>>>> guaranteed deadlock scenario.
> >>>>
> >>>> Ah. Thanks for the explanation and direction! I'm going to dive into the
> >>>> block layer and/or SCSI code and bug other people with my findings.
> >>>
> >>> Please feel free to CC me on that in case I can help.
> >>
> >> And here I am again... The frozen workqueue is in fact pm_wq, and the
> >> work item that is waited for is pm_runtime_work. The block layer calls
> >> pm_request_resume() on the device to resume the queue.
> >>
> >> I bet the queue should not be resumed this early. In fact, it seems
> >> that this is somewhat known to the ATA developers, because
> >> ata_scsi_dev_rescan() contains this beautiful comment and code:
> >>
> >>		/*
> >>		 * If the rescan work was scheduled because of a resume
> >>		 * event, the port is already fully resumed, but the
> >>		 * SCSI device may not yet be fully resumed. In such
> >>		 * case, executing scsi_rescan_device() may cause a
> >>		 * deadlock with the PM code on device_lock(). Prevent
> >>		 * this by giving up and retrying rescan after a short
> >>		 * delay.
> >>		 */
> >>		delay_rescan = sdev->sdev_gendev.power.is_suspended;
> >>		if (delay_rescan) {
> >>			scsi_device_put(sdev);
> >>			break;
> >>		}
> >>
> >> It just doesn't seem to work as expected, at least not in my case.
> >>
> > Did you test the libata suspend/resume patches from Damien?
>
> I think so, since the hang was reproduced on 6.6-rc4, as mentioned in another email
> in this thread.
Indeed. I originally hit the hang on 6.5. I captured a coredump and did
most of my analysis on that version (hence the quoted code). But I
tested again with 6.6-rc4 just before sending my email, and the hang is
still there. The rescan work runs after dpm_resume(), but before
dpm_complete(), and I believe that's yet another issue.
Petr T
^ permalink raw reply [flat|nested] 21+ messages in thread