* Thinkpad E595 system deadlock on resume from S3
@ 2023-09-30 10:20 Petr Tesařík
2023-10-03 9:31 ` Petr Tesařík
0 siblings, 1 reply; 21+ messages in thread
From: Petr Tesařík @ 2023-09-30 10:20 UTC (permalink / raw)
To: linux-pm
Hi all,
this time no patch (yet). In short, my Thinkpad running v6.6-rc3 fails
to resume from S3. It also fails the same way with Tumbleweed v6.5
kernel. I was able to capture a crash dump of the v6.5 kernel, and
here's my analysis:
The system never gets to waking up my SATA SSD disk:
[0:0:0:0] disk ATA KINGSTON SEDC600 H5.1 /dev/sda
There is a pending resume work for kworker/u32:12 (PID 11032), but this
worker is stuck in 'D' state:
>>> prog.stack_trace(11032)
#0 context_switch (../kernel/sched/core.c:5381:2)
#1 __schedule (../kernel/sched/core.c:6710:8)
#2 schedule (../kernel/sched/core.c:6786:3)
#3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
#4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
#5 __mutex_lock (../kernel/locking/mutex.c:747:9)
#6 acpi_device_hotplug (../drivers/acpi/scan.c:382:2)
#7 acpi_hotplug_work_fn (../drivers/acpi/osl.c:1162:2)
#8 process_one_work (../kernel/workqueue.c:2600:2)
#9 worker_thread (../kernel/workqueue.c:2751:4)
#10 kthread (../kernel/kthread.c:389:9)
#11 ret_from_fork (../arch/x86/kernel/process.c:145:3)
#12 ret_from_fork_asm+0x1b/0x20 (../arch/x86/entry/entry_64.S:304)
acpi_device_hotplug() tries to acquire acpi_scan_lock, which is held by
systemd-sleep (PID 11002). This task is also in 'D' state:
>>> prog.stack_trace(11002)
#0 context_switch (../kernel/sched/core.c:5381:2)
#1 __schedule (../kernel/sched/core.c:6710:8)
#2 schedule (../kernel/sched/core.c:6786:3)
#3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
#4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
#5 __mutex_lock (../kernel/locking/mutex.c:747:9)
#6 device_lock (../include/linux/device.h:958:2)
#7 device_complete (../drivers/base/power/main.c:1063:2)
#8 dpm_complete (../drivers/base/power/main.c:1121:3)
#9 suspend_devices_and_enter (../kernel/power/suspend.c:516:2)
#10 enter_state (../kernel/power/suspend.c:592:10)
#11 pm_suspend (../kernel/power/suspend.c:619:10)
#12 state_store (../kernel/power/main.c:707:11)
#13 kernfs_fop_write_iter (../fs/kernfs/file.c:334:9)
#14 call_write_iter (../include/linux/fs.h:1877:9)
#15 new_sync_write (../fs/read_write.c:491:8)
#16 vfs_write (../fs/read_write.c:584:9)
#17 ksys_write (../fs/read_write.c:637:9)
#18 do_syscall_x64 (../arch/x86/entry/common.c:50:14)
#19 do_syscall_64 (../arch/x86/entry/common.c:80:7)
#20 entry_SYSCALL_64+0xaa/0x1a6 (../arch/x86/entry/entry_64.S:120)
It is trying to acquire dev->mutex, which is in turn held by
kworker/0:0 (PID 10830), also in 'D' state:
>>> prog.stack_trace(10830)
#0 context_switch (../kernel/sched/core.c:5381:2)
#1 __schedule (../kernel/sched/core.c:6710:8)
#2 schedule (../kernel/sched/core.c:6786:3)
#3 blk_queue_enter (../block/blk-core.c:326:3)
#4 blk_mq_alloc_request (../block/blk-mq.c:592:9)
#5 scsi_alloc_request (../drivers/scsi/scsi_lib.c:1139:7)
#6 scsi_execute_cmd (../drivers/scsi/scsi_lib.c:217:8)
#7 scsi_vpd_inquiry (../drivers/scsi/scsi.c:312:11)
#8 scsi_get_vpd_size (../drivers/scsi/scsi.c:345:11)
#9 scsi_get_vpd_size (../drivers/scsi/scsi.c:336:5)
#10 scsi_get_vpd_buf (../drivers/scsi/scsi.c:415:12)
#11 scsi_attach_vpd (../drivers/scsi/scsi.c:483:12)
#12 scsi_rescan_device (../drivers/scsi/scsi_scan.c:1628:2)
#13 ata_scsi_dev_rescan (../drivers/ata/libata-scsi.c:4894:4)
#14 process_one_work (../kernel/workqueue.c:2600:2)
#15 worker_thread (../kernel/workqueue.c:2751:4)
#16 kthread (../kernel/kthread.c:389:9)
#17 ret_from_fork (../arch/x86/kernel/process.c:145:3)
#18 ret_from_fork_asm+0x1b/0x20 (../arch/x86/entry/entry_64.S:304)
And here we are, waiting for the completion of the resume work that is
scheduled on the blocked kworker/u32:12 (PID 11032), see above.
FWIW the deadlock does not happen if autosuspend is turned off for this
disk.
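For anyone reproducing this analysis from a dump: the lock owners above can be chased with drgn by masking the flag bits out of a mutex's owner word. Below is a minimal sketch of just the decoding logic (the drgn lookup in the trailing comment is illustrative, not taken from my actual session); the 0x07 mask follows the MUTEX_FLAG_* bits in kernel/locking/mutex.h.

```python
# Decode the task_struct pointer from a struct mutex's owner field.
# In the kernel, the low 3 bits of mutex.owner are flag bits
# (WAITERS, HANDOFF, PICKUP); the remaining bits are the task pointer.
MUTEX_FLAGS = 0x07

def mutex_owner_task(owner_value: int) -> int:
    """Return the task_struct address encoded in a mutex owner word,
    or 0 if the mutex is unlocked."""
    return owner_value & ~MUTEX_FLAGS

# With a crash dump loaded in drgn, the lookup would be roughly
# (variable names hypothetical):
#   mutex = prog["acpi_scan_lock"]
#   addr = mutex_owner_task(mutex.owner.counter.value_())
#   task = Object(prog, "struct task_struct *", addr)
#   print(task.pid, task.comm)
```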
I'm willing to invest more time into debugging, but I'm not well-versed
in power management, so I may need some help from you as to what else I
should look at. Please, include me in Cc of all potential replies; I'm
not subscribed to linux-pm.
Petr T
* Re: Thinkpad E595 system deadlock on resume from S3
2023-09-30 10:20 Thinkpad E595 system deadlock on resume from S3 Petr Tesařík
@ 2023-10-03 9:31 ` Petr Tesařík
2023-10-03 10:15 ` Rafael J. Wysocki
0 siblings, 1 reply; 21+ messages in thread
From: Petr Tesařík @ 2023-10-03 9:31 UTC (permalink / raw)
To: linux-pm, linux-acpi, Rafael J. Wysocki
Hi again (adding more recipients),
On Sat, 30 Sep 2023 12:20:54 +0200
Petr Tesařík <petr@tesarici.cz> wrote:
> Hi all,
>
> this time no patch (yet). In short, my Thinkpad running v6.6-rc3 fails
> to resume from S3. It also fails the same way with Tumbleweed v6.5
> kernel. I was able to capture a crash dump of the v6.5 kernel, and
> here's my analysis:
>
> The system never gets to waking up my SATA SSD disk:
>
> [0:0:0:0] disk ATA KINGSTON SEDC600 H5.1 /dev/sda
>
> There is a pending resume work for kworker/u32:12 (PID 11032), but this
> worker is stuck in 'D' state:
>
> >>> prog.stack_trace(11032)
> #0 context_switch (../kernel/sched/core.c:5381:2)
> #1 __schedule (../kernel/sched/core.c:6710:8)
> #2 schedule (../kernel/sched/core.c:6786:3)
> #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> #6 acpi_device_hotplug (../drivers/acpi/scan.c:382:2)
> #7 acpi_hotplug_work_fn (../drivers/acpi/osl.c:1162:2)
> #8 process_one_work (../kernel/workqueue.c:2600:2)
> #9 worker_thread (../kernel/workqueue.c:2751:4)
> #10 kthread (../kernel/kthread.c:389:9)
> #11 ret_from_fork (../arch/x86/kernel/process.c:145:3)
> #12 ret_from_fork_asm+0x1b/0x20 (../arch/x86/entry/entry_64.S:304)
>
> acpi_device_hotplug() tries to acquire acpi_scan_lock, which is held by
> systemd-sleep (PID 11002). This task is also in 'D' state:
>
> >>> prog.stack_trace(11002)
> #0 context_switch (../kernel/sched/core.c:5381:2)
> #1 __schedule (../kernel/sched/core.c:6710:8)
> #2 schedule (../kernel/sched/core.c:6786:3)
> #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> #6 device_lock (../include/linux/device.h:958:2)
> #7 device_complete (../drivers/base/power/main.c:1063:2)
> #8 dpm_complete (../drivers/base/power/main.c:1121:3)
> #9 suspend_devices_and_enter (../kernel/power/suspend.c:516:2)
I believe the issue must be somewhere here. The whole suspend and
resume logic in suspend_devices_and_enter() is framed by
platform_suspend_begin() and platform_resume_end().
My system is an ACPI system, so suspend_ops contains:
.begin = acpi_suspend_begin,
.end = acpi_pm_end,
Now, acpi_suspend_begin() acquires acpi_scan_lock through
acpi_pm_start(), and the lock is not released until acpi_pm_end().
Since dpm_complete() waits for the completion of a work that tries to
acquire acpi_scan_lock, the system will deadlock.
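Put together, the three blocked tasks from the dump form a cycle in the wait-for graph, which is why nothing can make progress. A small sketch (task names as in the dump; the edge labels in comments summarize the stack traces above):

```python
# Wait-for graph of the three blocked tasks: each task waits on a
# resource held (or an event to be delivered) by the next one.
waits_on = {
    "systemd-sleep(11002)": "kworker/0:0(10830)",     # dpm_complete() -> dev->mutex
    "kworker/0:0(10830)": "kworker/u32:12(11032)",    # SCSI completion queued behind hotplug work
    "kworker/u32:12(11032)": "systemd-sleep(11002)",  # acpi_device_hotplug() -> acpi_scan_lock
}

def find_cycle(graph, start):
    """Follow wait-for edges from start; return the cycle once a node repeats."""
    seen, node = [], start
    while node not in seen:
        seen.append(node)
        node = graph.get(node)
        if node is None:
            return None          # chain ends: no deadlock via this path
    return seen[seen.index(node):] + [node]

# The walk returns to systemd-sleep(11002): a three-way deadlock.
print(find_cycle(waits_on, "systemd-sleep(11002)"))
```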
AFAICS either:
a. the ACPI lock cannot be held while dpm_complete() runs, or
b. ata_scsi_dev_rescan() must not be scheduled before the system is
resumed, or
c. acpi_device_hotplug() must be implemented without taking dev->mutex.
My gut feeling is that b. is the right answer.
I'm afraid I can't contribute too much more ATM, at least not before I
find out what broke my USB hub between rc3 and rc4.
Regards
Petr T
> #10 enter_state (../kernel/power/suspend.c:592:10)
> #11 pm_suspend (../kernel/power/suspend.c:619:10)
> #12 state_store (../kernel/power/main.c:707:11)
> #13 kernfs_fop_write_iter (../fs/kernfs/file.c:334:9)
> #14 call_write_iter (../include/linux/fs.h:1877:9)
> #15 new_sync_write (../fs/read_write.c:491:8)
> #16 vfs_write (../fs/read_write.c:584:9)
> #17 ksys_write (../fs/read_write.c:637:9)
> #18 do_syscall_x64 (../arch/x86/entry/common.c:50:14)
> #19 do_syscall_64 (../arch/x86/entry/common.c:80:7)
> #20 entry_SYSCALL_64+0xaa/0x1a6 (../arch/x86/entry/entry_64.S:120)
>
> It is trying to acquire dev->mutex, which is in turn held by
> kworker/0:0 (PID 10830), also in 'D' state:
>
> >>> prog.stack_trace(10830)
> #0 context_switch (../kernel/sched/core.c:5381:2)
> #1 __schedule (../kernel/sched/core.c:6710:8)
> #2 schedule (../kernel/sched/core.c:6786:3)
> #3 blk_queue_enter (../block/blk-core.c:326:3)
> #4 blk_mq_alloc_request (../block/blk-mq.c:592:9)
> #5 scsi_alloc_request (../drivers/scsi/scsi_lib.c:1139:7)
> #6 scsi_execute_cmd (../drivers/scsi/scsi_lib.c:217:8)
> #7 scsi_vpd_inquiry (../drivers/scsi/scsi.c:312:11)
> #8 scsi_get_vpd_size (../drivers/scsi/scsi.c:345:11)
> #9 scsi_get_vpd_size (../drivers/scsi/scsi.c:336:5)
> #10 scsi_get_vpd_buf (../drivers/scsi/scsi.c:415:12)
> #11 scsi_attach_vpd (../drivers/scsi/scsi.c:483:12)
> #12 scsi_rescan_device (../drivers/scsi/scsi_scan.c:1628:2)
> #13 ata_scsi_dev_rescan (../drivers/ata/libata-scsi.c:4894:4)
> #14 process_one_work (../kernel/workqueue.c:2600:2)
> #15 worker_thread (../kernel/workqueue.c:2751:4)
> #16 kthread (../kernel/kthread.c:389:9)
> #17 ret_from_fork (../arch/x86/kernel/process.c:145:3)
> #18 ret_from_fork_asm+0x1b/0x20 (../arch/x86/entry/entry_64.S:304)
>
> And here we are, waiting for the completion of the resume work that is
> scheduled on the blocked kworker/u32:12 (PID 11032), see above.
>
> FWIW the deadlock does not happen if autosuspend is turned off for this
> disk.
>
> I'm willing to invest more time into debugging, but I'm not well-versed
> in power management, so I may need some help from you as to what else I
> should look at. Please, include me in Cc of all potential replies; I'm
> not subscribed to linux-pm.
>
> Petr T
* Re: Thinkpad E595 system deadlock on resume from S3
2023-10-03 9:31 ` Petr Tesařík
@ 2023-10-03 10:15 ` Rafael J. Wysocki
2023-10-03 11:02 ` Petr Tesařík
0 siblings, 1 reply; 21+ messages in thread
From: Rafael J. Wysocki @ 2023-10-03 10:15 UTC (permalink / raw)
To: Petr Tesařík; +Cc: linux-pm, linux-acpi, Rafael J. Wysocki
On Tue, Oct 3, 2023 at 11:31 AM Petr Tesařík <petr@tesarici.cz> wrote:
>
> Hi again (adding more recipients),
>
> On Sat, 30 Sep 2023 12:20:54 +0200
> Petr Tesařík <petr@tesarici.cz> wrote:
>
> > Hi all,
> >
> > this time no patch (yet). In short, my Thinkpad running v6.6-rc3 fails
> > to resume from S3. It also fails the same way with Tumbleweed v6.5
> > kernel. I was able to capture a crash dump of the v6.5 kernel, and
> > here's my analysis:
> >
> > The system never gets to waking up my SATA SSD disk:
> >
> > [0:0:0:0] disk ATA KINGSTON SEDC600 H5.1 /dev/sda
> >
> > There is a pending resume work for kworker/u32:12 (PID 11032), but this
> > worker is stuck in 'D' state:
> >
> > >>> prog.stack_trace(11032)
> > #0 context_switch (../kernel/sched/core.c:5381:2)
> > #1 __schedule (../kernel/sched/core.c:6710:8)
> > #2 schedule (../kernel/sched/core.c:6786:3)
> > #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> > #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> > #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> > #6 acpi_device_hotplug (../drivers/acpi/scan.c:382:2)
> > #7 acpi_hotplug_work_fn (../drivers/acpi/osl.c:1162:2)
> > #8 process_one_work (../kernel/workqueue.c:2600:2)
> > #9 worker_thread (../kernel/workqueue.c:2751:4)
> > #10 kthread (../kernel/kthread.c:389:9)
> > #11 ret_from_fork (../arch/x86/kernel/process.c:145:3)
> > #12 ret_from_fork_asm+0x1b/0x20 (../arch/x86/entry/entry_64.S:304)
> >
> > acpi_device_hotplug() tries to acquire acpi_scan_lock, which is held by
> > systemd-sleep (PID 11002). This task is also in 'D' state:
> >
> > >>> prog.stack_trace(11002)
> > #0 context_switch (../kernel/sched/core.c:5381:2)
> > #1 __schedule (../kernel/sched/core.c:6710:8)
> > #2 schedule (../kernel/sched/core.c:6786:3)
> > #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> > #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> > #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> > #6 device_lock (../include/linux/device.h:958:2)
> > #7 device_complete (../drivers/base/power/main.c:1063:2)
> > #8 dpm_complete (../drivers/base/power/main.c:1121:3)
> > #9 suspend_devices_and_enter (../kernel/power/suspend.c:516:2)
>
> I believe the issue must be somewhere here. The whole suspend and
> resume logic in suspend_devices_and_enter() is framed by
> platform_suspend_begin() and platform_resume_end().
>
> My system is an ACPI system, so suspend_ops contains:
>
> .begin = acpi_suspend_begin,
> .end = acpi_pm_end,
>
> Now, acpi_suspend_begin() acquires acpi_scan_lock through
> acpi_pm_start(), and the lock is not released until acpi_pm_end().
> Since dpm_complete() waits for the completion of a work that tries to
> acquire acpi_scan_lock, the system will deadlock.
So holding acpi_scan_lock across suspend-resume is basically to
prevent the hotplug from taking place then IIRC.
> AFAICS either:
>
> a. the ACPI lock cannot be held while dpm_complete() runs, or
> b. ata_scsi_dev_rescan() must not be scheduled before the system is
> resumed, or
> c. acpi_device_hotplug() must be implemented without taking dev->mutex.
>
> My gut feeling is that b. is the right answer.
It's been a while since I last looked at that code, but then it has
not changed for quite some time either.
It looks like the acpi_device_hotplug() path attempts to acquire
acpi_scan_lock while holding dev->mutex, which is kind of silly. I
need to check that, though.
* Re: Thinkpad E595 system deadlock on resume from S3
2023-10-03 10:15 ` Rafael J. Wysocki
@ 2023-10-03 11:02 ` Petr Tesařík
2023-10-03 12:34 ` Rafael J. Wysocki
2023-10-03 12:35 ` Petr Tesařík
0 siblings, 2 replies; 21+ messages in thread
From: Petr Tesařík @ 2023-10-03 11:02 UTC (permalink / raw)
To: Rafael J. Wysocki; +Cc: linux-pm, linux-acpi
On Tue, 3 Oct 2023 12:15:10 +0200
"Rafael J. Wysocki" <rafael@kernel.org> wrote:
> On Tue, Oct 3, 2023 at 11:31 AM Petr Tesařík <petr@tesarici.cz> wrote:
> >
> > Hi again (adding more recipients),
> >
> > On Sat, 30 Sep 2023 12:20:54 +0200
> > Petr Tesařík <petr@tesarici.cz> wrote:
> >
> > > Hi all,
> > >
> > > this time no patch (yet). In short, my Thinkpad running v6.6-rc3 fails
> > > to resume from S3. It also fails the same way with Tumbleweed v6.5
> > > kernel. I was able to capture a crash dump of the v6.5 kernel, and
> > > here's my analysis:
> > >
> > > The system never gets to waking up my SATA SSD disk:
> > >
> > > [0:0:0:0] disk ATA KINGSTON SEDC600 H5.1 /dev/sda
> > >
> > > There is a pending resume work for kworker/u32:12 (PID 11032), but this
> > > worker is stuck in 'D' state:
> > >
> > > >>> prog.stack_trace(11032)
> > > #0 context_switch (../kernel/sched/core.c:5381:2)
> > > #1 __schedule (../kernel/sched/core.c:6710:8)
> > > #2 schedule (../kernel/sched/core.c:6786:3)
> > > #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> > > #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> > > #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> > > #6 acpi_device_hotplug (../drivers/acpi/scan.c:382:2)
> > > #7 acpi_hotplug_work_fn (../drivers/acpi/osl.c:1162:2)
> > > #8 process_one_work (../kernel/workqueue.c:2600:2)
> > > #9 worker_thread (../kernel/workqueue.c:2751:4)
> > > #10 kthread (../kernel/kthread.c:389:9)
> > > #11 ret_from_fork (../arch/x86/kernel/process.c:145:3)
> > > #12 ret_from_fork_asm+0x1b/0x20 (../arch/x86/entry/entry_64.S:304)
> > >
> > > acpi_device_hotplug() tries to acquire acpi_scan_lock, which is held by
> > > systemd-sleep (PID 11002). This task is also in 'D' state:
> > >
> > > >>> prog.stack_trace(11002)
> > > #0 context_switch (../kernel/sched/core.c:5381:2)
> > > #1 __schedule (../kernel/sched/core.c:6710:8)
> > > #2 schedule (../kernel/sched/core.c:6786:3)
> > > #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> > > #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> > > #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> > > #6 device_lock (../include/linux/device.h:958:2)
> > > #7 device_complete (../drivers/base/power/main.c:1063:2)
> > > #8 dpm_complete (../drivers/base/power/main.c:1121:3)
> > > #9 suspend_devices_and_enter (../kernel/power/suspend.c:516:2)
> >
> > I believe the issue must be somewhere here. The whole suspend and
> > resume logic in suspend_devices_and_enter() is framed by
> > platform_suspend_begin() and platform_resume_end().
> >
> > My system is an ACPI system, so suspend_ops contains:
> >
> > .begin = acpi_suspend_begin,
> > .end = acpi_pm_end,
> >
> > Now, acpi_suspend_begin() acquires acpi_scan_lock through
> > acpi_pm_start(), and the lock is not released until acpi_pm_end().
> > Since dpm_complete() waits for the completion of a work that tries to
> > acquire acpi_scan_lock, the system will deadlock.
>
> So holding acpi_scan_lock across suspend-resume is basically to
> prevent the hotplug from taking place then IIRC.
>
> > AFAICS either:
> >
> > a. the ACPI lock cannot be held while dpm_complete() runs, or
> > b. ata_scsi_dev_rescan() must not be scheduled before the system is
> > resumed, or
> > c. acpi_device_hotplug() must be implemented without taking dev->mutex.
> >
> > My gut feeling is that b. is the right answer.
>
> It's been a while since I looked at that code last time, but then it
> has not changed for quite some time too.
>
> It looks like the acpi_device_hotplug() path attempts to acquire
> acpi_scan_lock() while holding dev->mutex which is kind of silly. I
> need to check that, though.
Thanks for your willingness to look into it. Well, it's not quite what
you describe. If it were a simple ABBA deadlock, it would be reported
by lockdep. No, it's more complicated:
1. suspend_devices_and_enter() holds acpi_scan_lock,
2. an ACPI hotplug work runs, but acpi_device_hotplug() goes to sleep
when it gets to acquiring acpi_scan_lock,
3. ata_scsi_dev_rescan() submits a SCSI command and waits for its
completion while holding dev->mutex,
4. the SCSI completion work happens to be put on the same workqueue as
the ACPI hotplug work in step 2,
^^^--- THIS is how the two events are serialized!
5. suspend_devices_and_enter() calls dpm_complete() ->
device_complete() -> device_lock() and cannot acquire dev->mutex.
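The serialization in step 4 can be reproduced outside the kernel with a toy model (not kernel code): a pool with a single worker executes submitted works strictly one at a time, so a work queued behind one that is sleeping on a lock can never run until that lock drops, assuming no other worker is available.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# One worker = works run strictly in submission order, like a
# workqueue whose only available worker thread is occupied.
wq = ThreadPoolExecutor(max_workers=1)
acpi_scan_lock = threading.Lock()   # stands in for the real acpi_scan_lock
acpi_scan_lock.acquire()            # step 1: the suspend path holds it

def hotplug_work():                 # step 2: sleeps on acpi_scan_lock
    with acpi_scan_lock:
        pass

completion = threading.Event()
def scsi_completion_work():         # step 4: queued behind hotplug_work
    completion.set()

wq.submit(hotplug_work)
wq.submit(scsi_completion_work)

# While the lock is held, the completion work cannot run: the only
# worker is stuck inside hotplug_work (the analogue of step 5's hang).
ran_while_held = completion.wait(timeout=0.2)
acpi_scan_lock.release()            # break the cycle
ran_after_release = completion.wait(timeout=5)
wq.shutdown()
print(ran_while_held, ran_after_release)   # False True
```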
HTH,
Petr T
* Re: Thinkpad E595 system deadlock on resume from S3
2023-10-03 11:02 ` Petr Tesařík
@ 2023-10-03 12:34 ` Rafael J. Wysocki
2023-10-03 12:40 ` Petr Tesařík
2023-10-03 12:35 ` Petr Tesařík
1 sibling, 1 reply; 21+ messages in thread
From: Rafael J. Wysocki @ 2023-10-03 12:34 UTC (permalink / raw)
To: Petr Tesařík; +Cc: Rafael J. Wysocki, linux-pm, linux-acpi
On Tue, Oct 3, 2023 at 1:02 PM Petr Tesařík <petr@tesarici.cz> wrote:
>
> On Tue, 3 Oct 2023 12:15:10 +0200
> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
>
> > On Tue, Oct 3, 2023 at 11:31 AM Petr Tesařík <petr@tesarici.cz> wrote:
> > >
> > > Hi again (adding more recipients),
> > >
> > > On Sat, 30 Sep 2023 12:20:54 +0200
> > > Petr Tesařík <petr@tesarici.cz> wrote:
> > >
> > > > Hi all,
> > > >
> > > > this time no patch (yet). In short, my Thinkpad running v6.6-rc3 fails
> > > > to resume from S3. It also fails the same way with Tumbleweed v6.5
> > > > kernel. I was able to capture a crash dump of the v6.5 kernel, and
> > > > here's my analysis:
> > > >
> > > > The system never gets to waking up my SATA SSD disk:
> > > >
> > > > [0:0:0:0] disk ATA KINGSTON SEDC600 H5.1 /dev/sda
> > > >
> > > > There is a pending resume work for kworker/u32:12 (PID 11032), but this
> > > > worker is stuck in 'D' state:
> > > >
> > > > >>> prog.stack_trace(11032)
> > > > #0 context_switch (../kernel/sched/core.c:5381:2)
> > > > #1 __schedule (../kernel/sched/core.c:6710:8)
> > > > #2 schedule (../kernel/sched/core.c:6786:3)
> > > > #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> > > > #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> > > > #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> > > > #6 acpi_device_hotplug (../drivers/acpi/scan.c:382:2)
> > > > #7 acpi_hotplug_work_fn (../drivers/acpi/osl.c:1162:2)
> > > > #8 process_one_work (../kernel/workqueue.c:2600:2)
> > > > #9 worker_thread (../kernel/workqueue.c:2751:4)
> > > > #10 kthread (../kernel/kthread.c:389:9)
> > > > #11 ret_from_fork (../arch/x86/kernel/process.c:145:3)
> > > > #12 ret_from_fork_asm+0x1b/0x20 (../arch/x86/entry/entry_64.S:304)
> > > >
> > > > acpi_device_hotplug() tries to acquire acpi_scan_lock, which is held by
> > > > systemd-sleep (PID 11002). This task is also in 'D' state:
> > > >
> > > > >>> prog.stack_trace(11002)
> > > > #0 context_switch (../kernel/sched/core.c:5381:2)
> > > > #1 __schedule (../kernel/sched/core.c:6710:8)
> > > > #2 schedule (../kernel/sched/core.c:6786:3)
> > > > #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> > > > #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> > > > #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> > > > #6 device_lock (../include/linux/device.h:958:2)
> > > > #7 device_complete (../drivers/base/power/main.c:1063:2)
> > > > #8 dpm_complete (../drivers/base/power/main.c:1121:3)
> > > > #9 suspend_devices_and_enter (../kernel/power/suspend.c:516:2)
> > >
> > > I believe the issue must be somewhere here. The whole suspend and
> > > resume logic in suspend_devices_and_enter() is framed by
> > > platform_suspend_begin() and platform_resume_end().
> > >
> > > My system is an ACPI system, so suspend_ops contains:
> > >
> > > .begin = acpi_suspend_begin,
> > > .end = acpi_pm_end,
> > >
> > > Now, acpi_suspend_begin() acquires acpi_scan_lock through
> > > acpi_pm_start(), and the lock is not released until acpi_pm_end().
> > > Since dpm_complete() waits for the completion of a work that tries to
> > > acquire acpi_scan_lock, the system will deadlock.
> >
> > So holding acpi_scan_lock across suspend-resume is basically to
> > prevent the hotplug from taking place then IIRC.
> >
> > > AFAICS either:
> > >
> > > a. the ACPI lock cannot be held while dpm_complete() runs, or
> > > b. ata_scsi_dev_rescan() must not be scheduled before the system is
> > > resumed, or
> > > c. acpi_device_hotplug() must be implemented without taking dev->mutex.
> > >
> > > My gut feeling is that b. is the right answer.
> >
> > It's been a while since I last looked at that code, but then it has
> > not changed for quite some time either.
> >
> > It looks like the acpi_device_hotplug() path attempts to acquire
> > acpi_scan_lock while holding dev->mutex, which is kind of silly. I
> > need to check that, though.
>
> Thanks for your willingness to look into it. Well, it's not quite what
> you describe. If it were a simple ABBA deadlock, it would be reported
> by lockdep. No, it's more complicated:
>
> 1. suspend_devices_and_enter() holds acpi_scan_lock,
> 2. an ACPI hotplug work runs, but acpi_device_hotplug() goes to sleep
> when it gets to acquiring acpi_scan_lock,
> 3. ata_scsi_dev_rescan() submits a SCSI command and waits for its
> completion while holding dev->mutex,
> 4. the SCSI completion work happens to be put on the same workqueue as
> the ACPI hotplug work in step 2,
> ^^^--- THIS is how the two events are serialized!
Which is unexpected.
And quite honestly I'm not sure how this can happen, because
acpi_hotplug_schedule() uses a dedicated workqueue and it is called
from (a) the "eject" sysfs attribute (which cannot happen while system
suspend-resume is in progress) and (b) acpi_bus_notify() which has
nothing to do with SCSI.
Maybe the workqueue used for the SCSI completion is freezable?
> 5. suspend_devices_and_enter() calls dpm_complete() ->
> device_complete() -> device_lock() and cannot acquire dev->mutex.
* Re: Thinkpad E595 system deadlock on resume from S3
2023-10-03 11:02 ` Petr Tesařík
2023-10-03 12:34 ` Rafael J. Wysocki
@ 2023-10-03 12:35 ` Petr Tesařík
1 sibling, 0 replies; 21+ messages in thread
From: Petr Tesařík @ 2023-10-03 12:35 UTC (permalink / raw)
To: Rafael J. Wysocki; +Cc: linux-pm, linux-acpi
On Tue, 3 Oct 2023 13:02:40 +0200
Petr Tesařík <petr@tesarici.cz> wrote:
> On Tue, 3 Oct 2023 12:15:10 +0200
> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
>
> > On Tue, Oct 3, 2023 at 11:31 AM Petr Tesařík <petr@tesarici.cz> wrote:
> > >
> > > Hi again (adding more recipients),
> > >
> > > On Sat, 30 Sep 2023 12:20:54 +0200
> > > Petr Tesařík <petr@tesarici.cz> wrote:
> > >
> > > > Hi all,
> > > >
> > > > this time no patch (yet). In short, my Thinkpad running v6.6-rc3 fails
> > > > to resume from S3. It also fails the same way with Tumbleweed v6.5
> > > > kernel. I was able to capture a crash dump of the v6.5 kernel, and
> > > > here's my analysis:
> > > >
> > > > The system never gets to waking up my SATA SSD disk:
> > > >
> > > > [0:0:0:0] disk ATA KINGSTON SEDC600 H5.1 /dev/sda
> > > >
> > > > There is a pending resume work for kworker/u32:12 (PID 11032), but this
> > > > worker is stuck in 'D' state:
> > > >
> > > > >>> prog.stack_trace(11032)
> > > > #0 context_switch (../kernel/sched/core.c:5381:2)
> > > > #1 __schedule (../kernel/sched/core.c:6710:8)
> > > > #2 schedule (../kernel/sched/core.c:6786:3)
> > > > #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> > > > #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> > > > #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> > > > #6 acpi_device_hotplug (../drivers/acpi/scan.c:382:2)
> > > > #7 acpi_hotplug_work_fn (../drivers/acpi/osl.c:1162:2)
> > > > #8 process_one_work (../kernel/workqueue.c:2600:2)
> > > > #9 worker_thread (../kernel/workqueue.c:2751:4)
> > > > #10 kthread (../kernel/kthread.c:389:9)
> > > > #11 ret_from_fork (../arch/x86/kernel/process.c:145:3)
> > > > #12 ret_from_fork_asm+0x1b/0x20 (../arch/x86/entry/entry_64.S:304)
> > > >
> > > > acpi_device_hotplug() tries to acquire acpi_scan_lock, which is held by
> > > > systemd-sleep (PID 11002). This task is also in 'D' state:
> > > >
> > > > >>> prog.stack_trace(11002)
> > > > #0 context_switch (../kernel/sched/core.c:5381:2)
> > > > #1 __schedule (../kernel/sched/core.c:6710:8)
> > > > #2 schedule (../kernel/sched/core.c:6786:3)
> > > > #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> > > > #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> > > > #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> > > > #6 device_lock (../include/linux/device.h:958:2)
> > > > #7 device_complete (../drivers/base/power/main.c:1063:2)
> > > > #8 dpm_complete (../drivers/base/power/main.c:1121:3)
> > > > #9 suspend_devices_and_enter (../kernel/power/suspend.c:516:2)
> > >
> > > I believe the issue must be somewhere here. The whole suspend and
> > > resume logic in suspend_devices_and_enter() is framed by
> > > platform_suspend_begin() and platform_resume_end().
> > >
> > > My system is an ACPI system, so suspend_ops contains:
> > >
> > > .begin = acpi_suspend_begin,
> > > .end = acpi_pm_end,
> > >
> > > Now, acpi_suspend_begin() acquires acpi_scan_lock through
> > > acpi_pm_start(), and the lock is not released until acpi_pm_end().
> > > Since dpm_complete() waits for the completion of a work that tries to
> > > acquire acpi_scan_lock, the system will deadlock.
> >
> > So holding acpi_scan_lock across suspend-resume is basically to
> > prevent the hotplug from taking place then IIRC.
> >
> > > AFAICS either:
> > >
> > > a. the ACPI lock cannot be held while dpm_complete() runs, or
> > > b. ata_scsi_dev_rescan() must not be scheduled before the system is
> > > resumed, or
> > > c. acpi_device_hotplug() must be implemented without taking dev->mutex.
> > >
> > > My gut feeling is that b. is the right answer.
> >
> > It's been a while since I last looked at that code, but then it has
> > not changed for quite some time either.
> >
> > It looks like the acpi_device_hotplug() path attempts to acquire
> > acpi_scan_lock while holding dev->mutex, which is kind of silly. I
> > need to check that, though.
>
> Thanks for your willingness to look into it. Well, it's not quite what
> you describe. If it were a simple ABBA deadlock, it would be reported
> by lockdep. No, it's more complicated:
>
> 1. suspend_devices_and_enter() holds acpi_scan_lock,
> 2. an ACPI hotplug work runs, but acpi_device_hotplug() goes to sleep
> when it gets to acquiring acpi_scan_lock,
> 3. ata_scsi_dev_rescan() submits a SCSI command and waits for its
> completion while holding dev->mutex,
> 4. the SCSI completion work happens to be put on the same workqueue as
> the ACPI hotplug work in step 2,
> ^^^--- THIS is how the two events are serialized!
While wondering why the work was not taken by another worker, I noticed
that it was placed on the inactive_work list. Might be relevant.
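That deferral can be modeled with another toy sketch (not kernel code; the list is called inactive_works in the v6.5 workqueue.c the dump was taken from): once max_active works are in flight on a pool_workqueue, newly queued work is parked on the inactive list instead of being handed to a worker, so no other worker can pick it up.

```python
from collections import deque

# Toy model of workqueue deferral: when the number of active works
# reaches max_active, new work goes on an inactive list rather than
# being dispatched (cf. pwq->inactive_works in kernel/workqueue.c).
class ToyPwq:
    def __init__(self, max_active=1):
        self.max_active = max_active
        self.active = []
        self.inactive = deque()

    def queue(self, work):
        if len(self.active) < self.max_active:
            self.active.append(work)      # dispatched to a worker
        else:
            self.inactive.append(work)    # parked: no worker will see it

    def work_done(self, work):
        self.active.remove(work)
        if self.inactive:                 # activate the next deferred work
            self.active.append(self.inactive.popleft())

pwq = ToyPwq(max_active=1)
pwq.queue("hotplug")           # runs immediately
pwq.queue("scsi-completion")   # deferred while hotplug is still active
print(list(pwq.inactive))      # ['scsi-completion']
```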
Petr T
* Re: Thinkpad E595 system deadlock on resume from S3
2023-10-03 12:34 ` Rafael J. Wysocki
@ 2023-10-03 12:40 ` Petr Tesařík
2023-10-03 12:48 ` Rafael J. Wysocki
0 siblings, 1 reply; 21+ messages in thread
From: Petr Tesařík @ 2023-10-03 12:40 UTC (permalink / raw)
To: Rafael J. Wysocki; +Cc: linux-pm, linux-acpi
On Tue, 3 Oct 2023 14:34:56 +0200
"Rafael J. Wysocki" <rafael@kernel.org> wrote:
> On Tue, Oct 3, 2023 at 1:02 PM Petr Tesařík <petr@tesarici.cz> wrote:
> >
> > On Tue, 3 Oct 2023 12:15:10 +0200
> > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> >
> > > On Tue, Oct 3, 2023 at 11:31 AM Petr Tesařík <petr@tesarici.cz> wrote:
> > > >
> > > > Hi again (adding more recipients),
> > > >
> > > > On Sat, 30 Sep 2023 12:20:54 +0200
> > > > Petr Tesařík <petr@tesarici.cz> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > this time no patch (yet). In short, my Thinkpad running v6.6-rc3 fails
> > > > > to resume from S3. It also fails the same way with Tumbleweed v6.5
> > > > > kernel. I was able to capture a crash dump of the v6.5 kernel, and
> > > > > here's my analysis:
> > > > >
> > > > > The system never gets to waking up my SATA SSD disk:
> > > > >
> > > > > [0:0:0:0] disk ATA KINGSTON SEDC600 H5.1 /dev/sda
> > > > >
> > > > > There is a pending resume work for kworker/u32:12 (PID 11032), but this
> > > > > worker is stuck in 'D' state:
> > > > >
> > > > > >>> prog.stack_trace(11032)
> > > > > #0 context_switch (../kernel/sched/core.c:5381:2)
> > > > > #1 __schedule (../kernel/sched/core.c:6710:8)
> > > > > #2 schedule (../kernel/sched/core.c:6786:3)
> > > > > #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> > > > > #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> > > > > #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> > > > > #6 acpi_device_hotplug (../drivers/acpi/scan.c:382:2)
> > > > > #7 acpi_hotplug_work_fn (../drivers/acpi/osl.c:1162:2)
> > > > > #8 process_one_work (../kernel/workqueue.c:2600:2)
> > > > > #9 worker_thread (../kernel/workqueue.c:2751:4)
> > > > > #10 kthread (../kernel/kthread.c:389:9)
> > > > > #11 ret_from_fork (../arch/x86/kernel/process.c:145:3)
> > > > > #12 ret_from_fork_asm+0x1b/0x20 (../arch/x86/entry/entry_64.S:304)
> > > > >
> > > > > acpi_device_hotplug() tries to acquire acpi_scan_lock, which is held by
> > > > > systemd-sleep (PID 11002). This task is also in 'D' state:
> > > > >
> > > > > >>> prog.stack_trace(11002)
> > > > > #0 context_switch (../kernel/sched/core.c:5381:2)
> > > > > #1 __schedule (../kernel/sched/core.c:6710:8)
> > > > > #2 schedule (../kernel/sched/core.c:6786:3)
> > > > > #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> > > > > #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> > > > > #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> > > > > #6 device_lock (../include/linux/device.h:958:2)
> > > > > #7 device_complete (../drivers/base/power/main.c:1063:2)
> > > > > #8 dpm_complete (../drivers/base/power/main.c:1121:3)
> > > > > #9 suspend_devices_and_enter (../kernel/power/suspend.c:516:2)
> > > >
> > > > I believe the issue must be somewhere here. The whole suspend and
> > > > resume logic in suspend_devices_and_enter() is framed by
> > > > platform_suspend_begin() and platform_resume_end().
> > > >
> > > > My system is an ACPI system, so suspend_ops contains:
> > > >
> > > > .begin = acpi_suspend_begin,
> > > > .end = acpi_pm_end,
> > > >
> > > > Now, acpi_suspend_begin() acquires acpi_scan_lock through
> > > > acpi_pm_start(), and the lock is not released until acpi_pm_end().
> > > > Since dpm_complete() waits for the completion of a work that tries to
> > > > acquire acpi_scan_lock, the system will deadlock.
> > >
> > > So holding acpi_scan_lock across suspend-resume is basically to
> > > prevent the hotplug from taking place then IIRC.
> > >
> > > > AFAICS either:
> > > >
> > > > a. the ACPI lock cannot be held while dpm_complete() runs, or
> > > > b. ata_scsi_dev_rescan() must not be scheduled before the system is
> > > > resumed, or
> > > > c. acpi_device_hotplug() must be implemented without taking dev->mutex.
> > > >
> > > > My gut feeling is that b. is the right answer.
> > >
> > > It's been a while since I looked at that code last time, but then it
> > > has not changed for quite some time too.
> > >
> > > It looks like the acpi_device_hotplug() path attempts to acquire
> > > acpi_scan_lock() while holding dev->mutex which is kind of silly. I
> > > need to check that, though.
> >
> > Thanks for your willingness. Well, it's not quite what you describe. If
> > it was a simple ABBA deadlock, then it would be reported by lockdep.
> > No, it's more complicated:
> >
> > 1. suspend_devices_and_enter() holds acpi_scan_lock,
> > 2. an ACPI hotplug work runs, but acpi_device_hotplug() goes to sleep
> > when it gets to acquiring acpi_scan_lock,
> > 3. ata_scsi_dev_rescan() submits a SCSI command and waits for its
> > completion while holding dev->mutex,
> > 4. the SCSI completion work happens to be put on the same workqueue as
> > the ACPI hotplug work in step 2,
> > ^^^--- THIS is how the two events are serialized!
>
> Which is unexpected.
>
> And quite honestly I'm not sure how this can happen, because
> acpi_hotplug_schedule() uses a dedicated workqueue and it is called
> from (a) the "eject" sysfs attribute (which cannot happen while system
> suspend-resume is in progress) and (b) acpi_bus_notify() which has
> nothing to do with SCSI.
Oh, you're right, and I was too quick. They cannot be on the same
queue...
> Maybe the workqueue used for the SCSI completion is freezable?
Yes, that's it:
*(struct workqueue_struct *)0xffff97d240b2fe00 = {
	/* ... */
	.flags = (unsigned int)4,
	/* WQ_FREEZABLE = 1 << 2 */
Good. But if this workqueue is frozen, the system still cannot make
progress.
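For the record, the flag bit is easy to decode from the raw .flags value
in the dump (a trivial sketch in plain Python; the WQ_FREEZABLE value is
taken from include/linux/workqueue.h, nothing else is kernel-specific):

```python
# Decode the .flags value seen in the drgn dump above.
WQ_FREEZABLE = 1 << 2  # from include/linux/workqueue.h

def is_freezable(wq_flags: int) -> bool:
    """A freezable workqueue is frozen across system suspend/resume."""
    return bool(wq_flags & WQ_FREEZABLE)

print(is_freezable(0x4))  # True for the workqueue above
```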
Petr T
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Thinkpad E595 system deadlock on resume from S3
2023-10-03 12:40 ` Petr Tesařík
@ 2023-10-03 12:48 ` Rafael J. Wysocki
2023-10-03 12:51 ` Petr Tesařík
0 siblings, 1 reply; 21+ messages in thread
From: Rafael J. Wysocki @ 2023-10-03 12:48 UTC (permalink / raw)
To: Petr Tesařík; +Cc: Rafael J. Wysocki, linux-pm, linux-acpi
On Tue, Oct 3, 2023 at 2:40 PM Petr Tesařík <petr@tesarici.cz> wrote:
>
> On Tue, 3 Oct 2023 14:34:56 +0200
> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
>
> > On Tue, Oct 3, 2023 at 1:02 PM Petr Tesařík <petr@tesarici.cz> wrote:
> > >
> > > On Tue, 3 Oct 2023 12:15:10 +0200
> > > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> > >
> > > > On Tue, Oct 3, 2023 at 11:31 AM Petr Tesařík <petr@tesarici.cz> wrote:
> > > > >
> > > > > Hi again (adding more recipients),
> > > > >
> > > > > On Sat, 30 Sep 2023 12:20:54 +0200
> > > > > Petr Tesařík <petr@tesarici.cz> wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > this time no patch (yet). In short, my Thinkpad running v6.6-rc3 fails
> > > > > > to resume from S3. It also fails the same way with Tumbleweed v6.5
> > > > > > kernel. I was able to capture a crash dump of the v6.5 kernel, and
> > > > > > here's my analysis:
> > > > > >
> > > > > > The system never gets to waking up my SATA SSD disk:
> > > > > >
> > > > > > [0:0:0:0] disk ATA KINGSTON SEDC600 H5.1 /dev/sda
> > > > > >
> > > > > > There is a pending resume work for kworker/u32:12 (PID 11032), but this
> > > > > > worker is stuck in 'D' state:
> > > > > >
> > > > > > >>> prog.stack_trace(11032)
> > > > > > #0 context_switch (../kernel/sched/core.c:5381:2)
> > > > > > #1 __schedule (../kernel/sched/core.c:6710:8)
> > > > > > #2 schedule (../kernel/sched/core.c:6786:3)
> > > > > > #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> > > > > > #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> > > > > > #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> > > > > > #6 acpi_device_hotplug (../drivers/acpi/scan.c:382:2)
> > > > > > #7 acpi_hotplug_work_fn (../drivers/acpi/osl.c:1162:2)
> > > > > > #8 process_one_work (../kernel/workqueue.c:2600:2)
> > > > > > #9 worker_thread (../kernel/workqueue.c:2751:4)
> > > > > > #10 kthread (../kernel/kthread.c:389:9)
> > > > > > #11 ret_from_fork (../arch/x86/kernel/process.c:145:3)
> > > > > > #12 ret_from_fork_asm+0x1b/0x20 (../arch/x86/entry/entry_64.S:304)
> > > > > >
> > > > > > acpi_device_hotplug() tries to acquire acpi_scan_lock, which is held by
> > > > > > systemd-sleep (PID 11002). This task is also in 'D' state:
> > > > > >
> > > > > > >>> prog.stack_trace(11002)
> > > > > > #0 context_switch (../kernel/sched/core.c:5381:2)
> > > > > > #1 __schedule (../kernel/sched/core.c:6710:8)
> > > > > > #2 schedule (../kernel/sched/core.c:6786:3)
> > > > > > #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> > > > > > #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> > > > > > #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> > > > > > #6 device_lock (../include/linux/device.h:958:2)
> > > > > > #7 device_complete (../drivers/base/power/main.c:1063:2)
> > > > > > #8 dpm_complete (../drivers/base/power/main.c:1121:3)
> > > > > > #9 suspend_devices_and_enter (../kernel/power/suspend.c:516:2)
> > > > >
> > > > > I believe the issue must be somewhere here. The whole suspend and
> > > > > resume logic in suspend_devices_and_enter() is framed by
> > > > > platform_suspend_begin() and platform_resume_end().
> > > > >
> > > > > My system is an ACPI system, so suspend_ops contains:
> > > > >
> > > > > .begin = acpi_suspend_begin,
> > > > > .end = acpi_pm_end,
> > > > >
> > > > > Now, acpi_suspend_begin() acquires acpi_scan_lock through
> > > > > acpi_pm_start(), and the lock is not released until acpi_pm_end().
> > > > > Since dpm_complete() waits for the completion of a work that tries to
> > > > > acquire acpi_scan_lock, the system will deadlock.
> > > >
> > > > So holding acpi_scan_lock across suspend-resume is basically to
> > > > prevent the hotplug from taking place then IIRC.
> > > >
> > > > > AFAICS either:
> > > > >
> > > > > a. the ACPI lock cannot be held while dpm_complete() runs, or
> > > > > b. ata_scsi_dev_rescan() must not be scheduled before the system is
> > > > > resumed, or
> > > > > c. acpi_device_hotplug() must be implemented without taking dev->mutex.
> > > > >
> > > > > My gut feeling is that b. is the right answer.
> > > >
> > > > It's been a while since I looked at that code last time, but then it
> > > > has not changed for quite some time too.
> > > >
> > > > It looks like the acpi_device_hotplug() path attempts to acquire
> > > > acpi_scan_lock() while holding dev->mutex which is kind of silly. I
> > > > need to check that, though.
> > >
> > > Thanks for your willingness. Well, it's not quite what you describe. If
> > > it was a simple ABBA deadlock, then it would be reported by lockdep.
> > > No, it's more complicated:
> > >
> > > 1. suspend_devices_and_enter() holds acpi_scan_lock,
> > > 2. an ACPI hotplug work runs, but acpi_device_hotplug() goes to sleep
> > > when it gets to acquiring acpi_scan_lock,
> > > 3. ata_scsi_dev_rescan() submits a SCSI command and waits for its
> > > completion while holding dev->mutex,
> > > 4. the SCSI completion work happens to be put on the same workqueue as
> > > the ACPI hotplug work in step 2,
> > > ^^^--- THIS is how the two events are serialized!
> >
> > Which is unexpected.
> >
> > And quite honestly I'm not sure how this can happen, because
> > acpi_hotplug_schedule() uses a dedicated workqueue and it is called
> > from (a) the "eject" sysfs attribute (which cannot happen while system
> > suspend-resume is in progress) and (b) acpi_bus_notify() which has
> > nothing to do with SCSI.
>
> Oh, you're right, and I was too quick. They cannot be on the same
> queue...
>
> > Maybe the workqueue used for the SCSI completion is freezable?
>
> Yes, that's it:
>
> *(struct workqueue_struct *)0xffff97d240b2fe00 = {
> /* ... */
> .flags = (unsigned int)4,
> /* WQ_FREEZABLE = 1 << 2 */
>
> Good. But if this workqueue is frozen, the system still cannot make
> progress.
The problem seems to be that dev->mutex is held while the work item
goes to a freezable workqueue and is waited for, which is an almost
guaranteed deadlock scenario.
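To spell the cycle out, the dependency can be sketched as a small
wait-for graph (a minimal illustration of the analysis above; the node
labels are descriptive, not kernel identifiers):

```python
# Wait-for graph of the scenario described above.  An edge A -> B means
# "A cannot make progress until B does".  Any cycle means deadlock.
waits_for = {
    "systemd-sleep: dpm_complete()": "rescan worker: holds dev->mutex",
    "rescan worker: holds dev->mutex": "pm_wq: frozen work item",
    # pm_wq is only thawed after system resume (and dpm_complete) finish:
    "pm_wq: frozen work item": "systemd-sleep: dpm_complete()",
}

def find_cycle(graph, start):
    """Follow edges from start; return the cycle once a node repeats."""
    seen, node = [], start
    while node not in seen:
        seen.append(node)
        node = graph.get(node)
        if node is None:
            return None  # chain ends, no cycle reachable from start
    return seen[seen.index(node):]

cycle = find_cycle(waits_for, "systemd-sleep: dpm_complete()")
print(" -> ".join(cycle))
```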
* Re: Thinkpad E595 system deadlock on resume from S3
2023-10-03 12:48 ` Rafael J. Wysocki
@ 2023-10-03 12:51 ` Petr Tesařík
2023-10-03 12:57 ` Rafael J. Wysocki
0 siblings, 1 reply; 21+ messages in thread
From: Petr Tesařík @ 2023-10-03 12:51 UTC (permalink / raw)
To: Rafael J. Wysocki; +Cc: linux-pm, linux-acpi
On Tue, 3 Oct 2023 14:48:13 +0200
"Rafael J. Wysocki" <rafael@kernel.org> wrote:
> On Tue, Oct 3, 2023 at 2:40 PM Petr Tesařík <petr@tesarici.cz> wrote:
> >
> > On Tue, 3 Oct 2023 14:34:56 +0200
> > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> >
> > > On Tue, Oct 3, 2023 at 1:02 PM Petr Tesařík <petr@tesarici.cz> wrote:
> > > >
> > > > On Tue, 3 Oct 2023 12:15:10 +0200
> > > > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> > > >
> > > > > On Tue, Oct 3, 2023 at 11:31 AM Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > >
> > > > > > Hi again (adding more recipients),
> > > > > >
> > > > > > On Sat, 30 Sep 2023 12:20:54 +0200
> > > > > > Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > this time no patch (yet). In short, my Thinkpad running v6.6-rc3 fails
> > > > > > > to resume from S3. It also fails the same way with Tumbleweed v6.5
> > > > > > > kernel. I was able to capture a crash dump of the v6.5 kernel, and
> > > > > > > here's my analysis:
> > > > > > >
> > > > > > > The system never gets to waking up my SATA SSD disk:
> > > > > > >
> > > > > > > [0:0:0:0] disk ATA KINGSTON SEDC600 H5.1 /dev/sda
> > > > > > >
> > > > > > > There is a pending resume work for kworker/u32:12 (PID 11032), but this
> > > > > > > worker is stuck in 'D' state:
> > > > > > >
> > > > > > > >>> prog.stack_trace(11032)
> > > > > > > #0 context_switch (../kernel/sched/core.c:5381:2)
> > > > > > > #1 __schedule (../kernel/sched/core.c:6710:8)
> > > > > > > #2 schedule (../kernel/sched/core.c:6786:3)
> > > > > > > #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> > > > > > > #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> > > > > > > #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> > > > > > > #6 acpi_device_hotplug (../drivers/acpi/scan.c:382:2)
> > > > > > > #7 acpi_hotplug_work_fn (../drivers/acpi/osl.c:1162:2)
> > > > > > > #8 process_one_work (../kernel/workqueue.c:2600:2)
> > > > > > > #9 worker_thread (../kernel/workqueue.c:2751:4)
> > > > > > > #10 kthread (../kernel/kthread.c:389:9)
> > > > > > > #11 ret_from_fork (../arch/x86/kernel/process.c:145:3)
> > > > > > > #12 ret_from_fork_asm+0x1b/0x20 (../arch/x86/entry/entry_64.S:304)
> > > > > > >
> > > > > > > acpi_device_hotplug() tries to acquire acpi_scan_lock, which is held by
> > > > > > > systemd-sleep (PID 11002). This task is also in 'D' state:
> > > > > > >
> > > > > > > >>> prog.stack_trace(11002)
> > > > > > > #0 context_switch (../kernel/sched/core.c:5381:2)
> > > > > > > #1 __schedule (../kernel/sched/core.c:6710:8)
> > > > > > > #2 schedule (../kernel/sched/core.c:6786:3)
> > > > > > > #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> > > > > > > #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> > > > > > > #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> > > > > > > #6 device_lock (../include/linux/device.h:958:2)
> > > > > > > #7 device_complete (../drivers/base/power/main.c:1063:2)
> > > > > > > #8 dpm_complete (../drivers/base/power/main.c:1121:3)
> > > > > > > #9 suspend_devices_and_enter (../kernel/power/suspend.c:516:2)
> > > > > >
> > > > > > I believe the issue must be somewhere here. The whole suspend and
> > > > > > resume logic in suspend_devices_and_enter() is framed by
> > > > > > platform_suspend_begin() and platform_resume_end().
> > > > > >
> > > > > > My system is an ACPI system, so suspend_ops contains:
> > > > > >
> > > > > > .begin = acpi_suspend_begin,
> > > > > > .end = acpi_pm_end,
> > > > > >
> > > > > > Now, acpi_suspend_begin() acquires acpi_scan_lock through
> > > > > > acpi_pm_start(), and the lock is not released until acpi_pm_end().
> > > > > > Since dpm_complete() waits for the completion of a work that tries to
> > > > > > acquire acpi_scan_lock, the system will deadlock.
> > > > >
> > > > > So holding acpi_scan_lock across suspend-resume is basically to
> > > > > prevent the hotplug from taking place then IIRC.
> > > > >
> > > > > > AFAICS either:
> > > > > >
> > > > > > a. the ACPI lock cannot be held while dpm_complete() runs, or
> > > > > > b. ata_scsi_dev_rescan() must not be scheduled before the system is
> > > > > > resumed, or
> > > > > > c. acpi_device_hotplug() must be implemented without taking dev->mutex.
> > > > > >
> > > > > > My gut feeling is that b. is the right answer.
> > > > >
> > > > > It's been a while since I looked at that code last time, but then it
> > > > > has not changed for quite some time too.
> > > > >
> > > > > It looks like the acpi_device_hotplug() path attempts to acquire
> > > > > acpi_scan_lock() while holding dev->mutex which is kind of silly. I
> > > > > need to check that, though.
> > > >
> > > > Thanks for your willingness. Well, it's not quite what you describe. If
> > > > it was a simple ABBA deadlock, then it would be reported by lockdep.
> > > > No, it's more complicated:
> > > >
> > > > 1. suspend_devices_and_enter() holds acpi_scan_lock,
> > > > 2. an ACPI hotplug work runs, but acpi_device_hotplug() goes to sleep
> > > > when it gets to acquiring acpi_scan_lock,
> > > > 3. ata_scsi_dev_rescan() submits a SCSI command and waits for its
> > > > completion while holding dev->mutex,
> > > > 4. the SCSI completion work happens to be put on the same workqueue as
> > > > the ACPI hotplug work in step 2,
> > > > ^^^--- THIS is how the two events are serialized!
> > >
> > > Which is unexpected.
> > >
> > > And quite honestly I'm not sure how this can happen, because
> > > acpi_hotplug_schedule() uses a dedicated workqueue and it is called
> > > from (a) the "eject" sysfs attribute (which cannot happen while system
> > > suspend-resume is in progress) and (b) acpi_bus_notify() which has
> > > nothing to do with SCSI.
> >
> > Oh, you're right, and I was too quick. They cannot be on the same
> > queue...
> >
> > > Maybe the workqueue used for the SCSI completion is freezable?
> >
> > Yes, that's it:
> >
> > *(struct workqueue_struct *)0xffff97d240b2fe00 = {
> > /* ... */
> > .flags = (unsigned int)4,
> > /* WQ_FREEZABLE = 1 << 2 */
> >
> > Good. But if this workqueue is frozen, the system still cannot make
> > progress.
>
> The problem seems to be that dev->mutex is held while the work item
> goes to a freezable workqueue and is waited for, which is an almost
> guaranteed deadlock scenario.
Ah. Thanks for the explanation and direction! I'm going to dive into the
block layer and/or SCSI code and bug other people with my findings.
I'm sorry for taking your time; I hope it wasn't too much.
Petr T
* Re: Thinkpad E595 system deadlock on resume from S3
2023-10-03 12:51 ` Petr Tesařík
@ 2023-10-03 12:57 ` Rafael J. Wysocki
2023-10-03 15:18 ` Petr Tesařík
0 siblings, 1 reply; 21+ messages in thread
From: Rafael J. Wysocki @ 2023-10-03 12:57 UTC (permalink / raw)
To: Petr Tesařík; +Cc: Rafael J. Wysocki, linux-pm, linux-acpi
On Tue, Oct 3, 2023 at 2:51 PM Petr Tesařík <petr@tesarici.cz> wrote:
>
> On Tue, 3 Oct 2023 14:48:13 +0200
> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
>
> > On Tue, Oct 3, 2023 at 2:40 PM Petr Tesařík <petr@tesarici.cz> wrote:
> > >
> > > On Tue, 3 Oct 2023 14:34:56 +0200
> > > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> > >
> > > > On Tue, Oct 3, 2023 at 1:02 PM Petr Tesařík <petr@tesarici.cz> wrote:
> > > > >
> > > > > On Tue, 3 Oct 2023 12:15:10 +0200
> > > > > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> > > > >
> > > > > > On Tue, Oct 3, 2023 at 11:31 AM Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > > >
> > > > > > > Hi again (adding more recipients),
> > > > > > >
> > > > > > > On Sat, 30 Sep 2023 12:20:54 +0200
> > > > > > > Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > > >
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > this time no patch (yet). In short, my Thinkpad running v6.6-rc3 fails
> > > > > > > > to resume from S3. It also fails the same way with Tumbleweed v6.5
> > > > > > > > kernel. I was able to capture a crash dump of the v6.5 kernel, and
> > > > > > > > here's my analysis:
> > > > > > > >
> > > > > > > > The system never gets to waking up my SATA SSD disk:
> > > > > > > >
> > > > > > > > [0:0:0:0] disk ATA KINGSTON SEDC600 H5.1 /dev/sda
> > > > > > > >
> > > > > > > > There is a pending resume work for kworker/u32:12 (PID 11032), but this
> > > > > > > > worker is stuck in 'D' state:
> > > > > > > >
> > > > > > > > >>> prog.stack_trace(11032)
> > > > > > > > #0 context_switch (../kernel/sched/core.c:5381:2)
> > > > > > > > #1 __schedule (../kernel/sched/core.c:6710:8)
> > > > > > > > #2 schedule (../kernel/sched/core.c:6786:3)
> > > > > > > > #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> > > > > > > > #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> > > > > > > > #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> > > > > > > > #6 acpi_device_hotplug (../drivers/acpi/scan.c:382:2)
> > > > > > > > #7 acpi_hotplug_work_fn (../drivers/acpi/osl.c:1162:2)
> > > > > > > > #8 process_one_work (../kernel/workqueue.c:2600:2)
> > > > > > > > #9 worker_thread (../kernel/workqueue.c:2751:4)
> > > > > > > > #10 kthread (../kernel/kthread.c:389:9)
> > > > > > > > #11 ret_from_fork (../arch/x86/kernel/process.c:145:3)
> > > > > > > > #12 ret_from_fork_asm+0x1b/0x20 (../arch/x86/entry/entry_64.S:304)
> > > > > > > >
> > > > > > > > acpi_device_hotplug() tries to acquire acpi_scan_lock, which is held by
> > > > > > > > systemd-sleep (PID 11002). This task is also in 'D' state:
> > > > > > > >
> > > > > > > > >>> prog.stack_trace(11002)
> > > > > > > > #0 context_switch (../kernel/sched/core.c:5381:2)
> > > > > > > > #1 __schedule (../kernel/sched/core.c:6710:8)
> > > > > > > > #2 schedule (../kernel/sched/core.c:6786:3)
> > > > > > > > #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> > > > > > > > #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> > > > > > > > #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> > > > > > > > #6 device_lock (../include/linux/device.h:958:2)
> > > > > > > > #7 device_complete (../drivers/base/power/main.c:1063:2)
> > > > > > > > #8 dpm_complete (../drivers/base/power/main.c:1121:3)
> > > > > > > > #9 suspend_devices_and_enter (../kernel/power/suspend.c:516:2)
> > > > > > >
> > > > > > > I believe the issue must be somewhere here. The whole suspend and
> > > > > > > resume logic in suspend_devices_and_enter() is framed by
> > > > > > > platform_suspend_begin() and platform_resume_end().
> > > > > > >
> > > > > > > My system is an ACPI system, so suspend_ops contains:
> > > > > > >
> > > > > > > .begin = acpi_suspend_begin,
> > > > > > > .end = acpi_pm_end,
> > > > > > >
> > > > > > > Now, acpi_suspend_begin() acquires acpi_scan_lock through
> > > > > > > acpi_pm_start(), and the lock is not released until acpi_pm_end().
> > > > > > > Since dpm_complete() waits for the completion of a work that tries to
> > > > > > > acquire acpi_scan_lock, the system will deadlock.
> > > > > >
> > > > > > So holding acpi_scan_lock across suspend-resume is basically to
> > > > > > prevent the hotplug from taking place then IIRC.
> > > > > >
> > > > > > > AFAICS either:
> > > > > > >
> > > > > > > a. the ACPI lock cannot be held while dpm_complete() runs, or
> > > > > > > b. ata_scsi_dev_rescan() must not be scheduled before the system is
> > > > > > > resumed, or
> > > > > > > c. acpi_device_hotplug() must be implemented without taking dev->mutex.
> > > > > > >
> > > > > > > My gut feeling is that b. is the right answer.
> > > > > >
> > > > > > It's been a while since I looked at that code last time, but then it
> > > > > > has not changed for quite some time too.
> > > > > >
> > > > > > It looks like the acpi_device_hotplug() path attempts to acquire
> > > > > > acpi_scan_lock() while holding dev->mutex which is kind of silly. I
> > > > > > need to check that, though.
> > > > >
> > > > > Thanks for your willingness. Well, it's not quite what you describe. If
> > > > > it was a simple ABBA deadlock, then it would be reported by lockdep.
> > > > > No, it's more complicated:
> > > > >
> > > > > 1. suspend_devices_and_enter() holds acpi_scan_lock,
> > > > > 2. an ACPI hotplug work runs, but acpi_device_hotplug() goes to sleep
> > > > > when it gets to acquiring acpi_scan_lock,
> > > > > 3. ata_scsi_dev_rescan() submits a SCSI command and waits for its
> > > > > completion while holding dev->mutex,
> > > > > 4. the SCSI completion work happens to be put on the same workqueue as
> > > > > the ACPI hotplug work in step 2,
> > > > > ^^^--- THIS is how the two events are serialized!
> > > >
> > > > Which is unexpected.
> > > >
> > > > And quite honestly I'm not sure how this can happen, because
> > > > acpi_hotplug_schedule() uses a dedicated workqueue and it is called
> > > > from (a) the "eject" sysfs attribute (which cannot happen while system
> > > > suspend-resume is in progress) and (b) acpi_bus_notify() which has
> > > > nothing to do with SCSI.
> > >
> > > Oh, you're right, and I was too quick. They cannot be on the same
> > > queue...
> > >
> > > > Maybe the workqueue used for the SCSI completion is freezable?
> > >
> > > Yes, that's it:
> > >
> > > *(struct workqueue_struct *)0xffff97d240b2fe00 = {
> > > /* ... */
> > > .flags = (unsigned int)4,
> > > /* WQ_FREEZABLE = 1 << 2 */
> > >
> > > Good. But if this workqueue is frozen, the system still cannot make
> > > progress.
> >
> > The problem seems to be that dev->mutex is held while the work item
> > goes to a freezable workqueue and is waited for, which is an almost
> > guaranteed deadlock scenario.
>
> Ah. Thanks for explanation and direction! I'm going to dive into the
> block layer and/or SCSI code and bug other people with my findings.
Please feel free to CC me on that in case I can help.
> I'm sorry for taking your time;
No worries.
> I hope it wasn't too much.
Not at all. I'm generally interested in all issues related to
suspend-resume, and this one looks like it may affect a bunch of users.
Cheers!
* Re: Thinkpad E595 system deadlock on resume from S3
2023-10-03 12:57 ` Rafael J. Wysocki
@ 2023-10-03 15:18 ` Petr Tesařík
2023-10-03 16:16 ` Petr Tesařík
` (2 more replies)
0 siblings, 3 replies; 21+ messages in thread
From: Petr Tesařík @ 2023-10-03 15:18 UTC (permalink / raw)
To: Rafael J. Wysocki, Damien Le Moal; +Cc: linux-pm, linux-ide
On Tue, 3 Oct 2023 14:57:46 +0200
"Rafael J. Wysocki" <rafael@kernel.org> wrote:
> On Tue, Oct 3, 2023 at 2:51 PM Petr Tesařík <petr@tesarici.cz> wrote:
> >
> > On Tue, 3 Oct 2023 14:48:13 +0200
> > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> >
> > > On Tue, Oct 3, 2023 at 2:40 PM Petr Tesařík <petr@tesarici.cz> wrote:
> > > >
> > > > On Tue, 3 Oct 2023 14:34:56 +0200
> > > > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> > > >
> > > > > On Tue, Oct 3, 2023 at 1:02 PM Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > >
> > > > > > On Tue, 3 Oct 2023 12:15:10 +0200
> > > > > > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> > > > > >
> > > > > > > On Tue, Oct 3, 2023 at 11:31 AM Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > > > >
> > > > > > > > Hi again (adding more recipients),
> > > > > > > >
> > > > > > > > On Sat, 30 Sep 2023 12:20:54 +0200
> > > > > > > > Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > > > >
> > > > > > > > > Hi all,
> > > > > > > > >
> > > > > > > > > this time no patch (yet). In short, my Thinkpad running v6.6-rc3 fails
> > > > > > > > > to resume from S3. It also fails the same way with Tumbleweed v6.5
> > > > > > > > > kernel. I was able to capture a crash dump of the v6.5 kernel, and
> > > > > > > > > here's my analysis:
> > > > > > > > >
> > > > > > > > > The system never gets to waking up my SATA SSD disk:
> > > > > > > > >
> > > > > > > > > [0:0:0:0] disk ATA KINGSTON SEDC600 H5.1 /dev/sda
> > > > > > > > >
> > > > > > > > > There is a pending resume work for kworker/u32:12 (PID 11032), but this
> > > > > > > > > worker is stuck in 'D' state:
> > > > > > > > >
> > > > > > > > > >>> prog.stack_trace(11032)
> > > > > > > > > #0 context_switch (../kernel/sched/core.c:5381:2)
> > > > > > > > > #1 __schedule (../kernel/sched/core.c:6710:8)
> > > > > > > > > #2 schedule (../kernel/sched/core.c:6786:3)
> > > > > > > > > #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> > > > > > > > > #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> > > > > > > > > #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> > > > > > > > > #6 acpi_device_hotplug (../drivers/acpi/scan.c:382:2)
> > > > > > > > > #7 acpi_hotplug_work_fn (../drivers/acpi/osl.c:1162:2)
> > > > > > > > > #8 process_one_work (../kernel/workqueue.c:2600:2)
> > > > > > > > > #9 worker_thread (../kernel/workqueue.c:2751:4)
> > > > > > > > > #10 kthread (../kernel/kthread.c:389:9)
> > > > > > > > > #11 ret_from_fork (../arch/x86/kernel/process.c:145:3)
> > > > > > > > > #12 ret_from_fork_asm+0x1b/0x20 (../arch/x86/entry/entry_64.S:304)
> > > > > > > > >
> > > > > > > > > acpi_device_hotplug() tries to acquire acpi_scan_lock, which is held by
> > > > > > > > > systemd-sleep (PID 11002). This task is also in 'D' state:
> > > > > > > > >
> > > > > > > > > >>> prog.stack_trace(11002)
> > > > > > > > > #0 context_switch (../kernel/sched/core.c:5381:2)
> > > > > > > > > #1 __schedule (../kernel/sched/core.c:6710:8)
> > > > > > > > > #2 schedule (../kernel/sched/core.c:6786:3)
> > > > > > > > > #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> > > > > > > > > #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> > > > > > > > > #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> > > > > > > > > #6 device_lock (../include/linux/device.h:958:2)
> > > > > > > > > #7 device_complete (../drivers/base/power/main.c:1063:2)
> > > > > > > > > #8 dpm_complete (../drivers/base/power/main.c:1121:3)
> > > > > > > > > #9 suspend_devices_and_enter (../kernel/power/suspend.c:516:2)
> > > > > > > >
> > > > > > > > I believe the issue must be somewhere here. The whole suspend and
> > > > > > > > resume logic in suspend_devices_and_enter() is framed by
> > > > > > > > platform_suspend_begin() and platform_resume_end().
> > > > > > > >
> > > > > > > > My system is an ACPI system, so suspend_ops contains:
> > > > > > > >
> > > > > > > > .begin = acpi_suspend_begin,
> > > > > > > > .end = acpi_pm_end,
> > > > > > > >
> > > > > > > > Now, acpi_suspend_begin() acquires acpi_scan_lock through
> > > > > > > > acpi_pm_start(), and the lock is not released until acpi_pm_end().
> > > > > > > > Since dpm_complete() waits for the completion of a work that tries to
> > > > > > > > acquire acpi_scan_lock, the system will deadlock.
> > > > > > >
> > > > > > > So holding acpi_scan_lock across suspend-resume is basically to
> > > > > > > prevent the hotplug from taking place then IIRC.
> > > > > > >
> > > > > > > > AFAICS either:
> > > > > > > >
> > > > > > > > a. the ACPI lock cannot be held while dpm_complete() runs, or
> > > > > > > > b. ata_scsi_dev_rescan() must not be scheduled before the system is
> > > > > > > > resumed, or
> > > > > > > > c. acpi_device_hotplug() must be implemented without taking dev->mutex.
> > > > > > > >
> > > > > > > > My gut feeling is that b. is the right answer.
> > > > > > >
> > > > > > > It's been a while since I looked at that code last time, but then it
> > > > > > > has not changed for quite some time too.
> > > > > > >
> > > > > > > It looks like the acpi_device_hotplug() path attempts to acquire
> > > > > > > acpi_scan_lock() while holding dev->mutex which is kind of silly. I
> > > > > > > need to check that, though.
> > > > > >
> > > > > > Thanks for your willingness. Well, it's not quite what you describe. If
> > > > > > it was a simple ABBA deadlock, then it would be reported by lockdep.
> > > > > > No, it's more complicated:
> > > > > >
> > > > > > 1. suspend_devices_and_enter() holds acpi_scan_lock,
> > > > > > 2. an ACPI hotplug work runs, but acpi_device_hotplug() goes to sleep
> > > > > > when it gets to acquiring acpi_scan_lock,
> > > > > > 3. ata_scsi_dev_rescan() submits a SCSI command and waits for its
> > > > > > completion while holding dev->mutex,
> > > > > > 4. the SCSI completion work happens to be put on the same workqueue as
> > > > > > the ACPI hotplug work in step 2,
> > > > > > ^^^--- THIS is how the two events are serialized!
> > > > >
> > > > > Which is unexpected.
> > > > >
> > > > > And quite honestly I'm not sure how this can happen, because
> > > > > acpi_hotplug_schedule() uses a dedicated workqueue and it is called
> > > > > from (a) the "eject" sysfs attribute (which cannot happen while system
> > > > > suspend-resume is in progress) and (b) acpi_bus_notify() which has
> > > > > nothing to do with SCSI.
> > > >
> > > > Oh, you're right, and I was too quick. They cannot be on the same
> > > > queue...
> > > >
> > > > > Maybe the workqueue used for the SCSI completion is freezable?
> > > >
> > > > Yes, that's it:
> > > >
> > > > *(struct workqueue_struct *)0xffff97d240b2fe00 = {
> > > > /* ... */
> > > > .flags = (unsigned int)4,
> > > > /* WQ_FREEZABLE = 1 << 2 */
> > > >
> > > > Good. But if this workqueue is frozen, the system still cannot make
> > > > progress.
> > >
> > > The problem seems to be that dev->mutex is held while the work item
> > > goes to a freezable workqueue and is waited for, which is an almost
> > > guaranteed deadlock scenario.
> >
> > Ah. Thanks for explanation and direction! I'm going to dive into the
> > block layer and/or SCSI code and bug other people with my findings.
>
> Please feel free to CC me on that in case I can help.
And here I am again... The frozen workqueue is in fact pm_wq, and the
work item that is waited for is pm_runtime_work. The block layer calls
pm_request_resume() on the device to resume the queue.
I bet the queue should not be resumed this early. In fact, it seems
that this is somewhat known to the ATA developers, because
ata_scsi_dev_rescan() contains this beautiful comment and code:
	/*
	 * If the rescan work was scheduled because of a resume
	 * event, the port is already fully resumed, but the
	 * SCSI device may not yet be fully resumed. In such
	 * case, executing scsi_rescan_device() may cause a
	 * deadlock with the PM code on device_lock(). Prevent
	 * this by giving up and retrying rescan after a short
	 * delay.
	 */
	delay_rescan = sdev->sdev_gendev.power.is_suspended;
	if (delay_rescan) {
		scsi_device_put(sdev);
		break;
	}
It just doesn't seem to work as expected, at least not in my case.
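If I read the suspend sequence right, power.is_suspended is cleared
during dpm_resume(), but pm_wq is only thawed after dpm_complete(), so
there is a window where this guard passes although waiting on pm_wq
would still block forever. A toy model of that window (illustrative
Python, not kernel code):

```python
# Illustrative model, not kernel code: the guard in ata_scsi_dev_rescan()
# only looks at power.is_suspended.  That flag is cleared during dpm_resume(),
# while pm_wq stays frozen until after dpm_complete(), so the guard can let
# the rescan proceed while waiting on pm_wq would still block forever.
def guard_matches_reality(is_suspended: bool, pm_wq_frozen: bool) -> bool:
    guard_passes = not is_suspended                        # the actual check
    actually_safe = not is_suspended and not pm_wq_frozen  # what is needed
    return guard_passes == actually_safe

# The racy window: is_suspended already cleared, pm_wq still frozen.
print(guard_matches_reality(is_suspended=False, pm_wq_frozen=True))  # False
```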
Petr T
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Thinkpad E595 system deadlock on resume from S3
2023-10-03 15:18 ` Petr Tesařík
@ 2023-10-03 16:16 ` Petr Tesařík
2023-10-03 16:48 ` Rafael J. Wysocki
2023-10-04 6:13 ` Hannes Reinecke
2 siblings, 0 replies; 21+ messages in thread
From: Petr Tesařík @ 2023-10-03 16:16 UTC (permalink / raw)
To: Rafael J. Wysocki, Damien Le Moal; +Cc: linux-pm, linux-ide
On Tue, 3 Oct 2023 17:18:24 +0200
Petr Tesařík <petr@tesarici.cz> wrote:
> On Tue, 3 Oct 2023 14:57:46 +0200
> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
>
> > On Tue, Oct 3, 2023 at 2:51 PM Petr Tesařík <petr@tesarici.cz> wrote:
> > >
> > > On Tue, 3 Oct 2023 14:48:13 +0200
> > > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> > >
> > > > On Tue, Oct 3, 2023 at 2:40 PM Petr Tesařík <petr@tesarici.cz> wrote:
> > > > >
> > > > > On Tue, 3 Oct 2023 14:34:56 +0200
> > > > > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> > > > >
> > > > > > On Tue, Oct 3, 2023 at 1:02 PM Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > > >
> > > > > > > On Tue, 3 Oct 2023 12:15:10 +0200
> > > > > > > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> > > > > > >
> > > > > > > > On Tue, Oct 3, 2023 at 11:31 AM Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > > > > >
> > > > > > > > > Hi again (adding more recipients),
> > > > > > > > >
> > > > > > > > > On Sat, 30 Sep 2023 12:20:54 +0200
> > > > > > > > > Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > > > > >
> > > > > > > > > > Hi all,
> > > > > > > > > >
> > > > > > > > > > this time no patch (yet). In short, my Thinkpad running v6.6-rc3 fails
> > > > > > > > > > to resume from S3. It also fails the same way with Tumbleweed v6.5
> > > > > > > > > > kernel. I was able to capture a crash dump of the v6.5 kernel, and
> > > > > > > > > > here's my analysis:
> > > > > > > > > >
> > > > > > > > > > The system never gets to waking up my SATA SSD disk:
> > > > > > > > > >
> > > > > > > > > > [0:0:0:0] disk ATA KINGSTON SEDC600 H5.1 /dev/sda
> > > > > > > > > >
> > > > > > > > > > There is a pending resume work for kworker/u32:12 (PID 11032), but this
> > > > > > > > > > worker is stuck in 'D' state:
> > > > > > > > > >
> > > > > > > > > > >>> prog.stack_trace(11032)
> > > > > > > > > > #0 context_switch (../kernel/sched/core.c:5381:2)
> > > > > > > > > > #1 __schedule (../kernel/sched/core.c:6710:8)
> > > > > > > > > > #2 schedule (../kernel/sched/core.c:6786:3)
> > > > > > > > > > #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> > > > > > > > > > #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> > > > > > > > > > #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> > > > > > > > > > #6 acpi_device_hotplug (../drivers/acpi/scan.c:382:2)
> > > > > > > > > > #7 acpi_hotplug_work_fn (../drivers/acpi/osl.c:1162:2)
> > > > > > > > > > #8 process_one_work (../kernel/workqueue.c:2600:2)
> > > > > > > > > > #9 worker_thread (../kernel/workqueue.c:2751:4)
> > > > > > > > > > #10 kthread (../kernel/kthread.c:389:9)
> > > > > > > > > > #11 ret_from_fork (../arch/x86/kernel/process.c:145:3)
> > > > > > > > > > #12 ret_from_fork_asm+0x1b/0x20 (../arch/x86/entry/entry_64.S:304)
> > > > > > > > > >
> > > > > > > > > > acpi_device_hotplug() tries to acquire acpi_scan_lock, which is held by
> > > > > > > > > > systemd-sleep (PID 11002). This task is also in 'D' state:
> > > > > > > > > >
> > > > > > > > > > >>> prog.stack_trace(11002)
> > > > > > > > > > #0 context_switch (../kernel/sched/core.c:5381:2)
> > > > > > > > > > #1 __schedule (../kernel/sched/core.c:6710:8)
> > > > > > > > > > #2 schedule (../kernel/sched/core.c:6786:3)
> > > > > > > > > > #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> > > > > > > > > > #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> > > > > > > > > > #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> > > > > > > > > > #6 device_lock (../include/linux/device.h:958:2)
> > > > > > > > > > #7 device_complete (../drivers/base/power/main.c:1063:2)
> > > > > > > > > > #8 dpm_complete (../drivers/base/power/main.c:1121:3)
> > > > > > > > > > #9 suspend_devices_and_enter (../kernel/power/suspend.c:516:2)
> > > > > > > > >
> > > > > > > > > I believe the issue must be somewhere here. The whole suspend and
> > > > > > > > > resume logic in suspend_devices_and_enter() is framed by
> > > > > > > > > platform_suspend_begin() and platform_resume_end().
> > > > > > > > >
> > > > > > > > > My system is an ACPI system, so suspend_ops contains:
> > > > > > > > >
> > > > > > > > > .begin = acpi_suspend_begin,
> > > > > > > > > .end = acpi_pm_end,
> > > > > > > > >
> > > > > > > > > Now, acpi_suspend_begin() acquires acpi_scan_lock through
> > > > > > > > > acpi_pm_start(), and the lock is not released until acpi_pm_end().
> > > > > > > > > Since dpm_complete() waits for the completion of a work that tries to
> > > > > > > > > acquire acpi_scan_lock, the system will deadlock.
> > > > > > > >
> > > > > > > > So holding acpi_scan_lock across suspend-resume is basically to
> > > > > > > > prevent the hotplug from taking place then IIRC.
> > > > > > > >
> > > > > > > > > AFAICS either:
> > > > > > > > >
> > > > > > > > > a. the ACPI lock cannot be held while dpm_complete() runs, or
> > > > > > > > > b. ata_scsi_dev_rescan() must not be scheduled before the system is
> > > > > > > > > resumed, or
> > > > > > > > > c. acpi_device_hotplug() must be implemented without taking dev->mutex.
> > > > > > > > >
> > > > > > > > > My gut feeling is that b. is the right answer.
> > > > > > > >
> > > > > > > > It's been a while since I looked at that code last time, but then it
> > > > > > > > has not changed for quite some time too.
> > > > > > > >
> > > > > > > > It looks like the acpi_device_hotplug() path attempts to acquire
> > > > > > > > acpi_scan_lock() while holding dev->mutex which is kind of silly. I
> > > > > > > > need to check that, though.
> > > > > > >
> > > > > > > Thanks for your willingness. Well, it's not quite what you describe. If
> > > > > > > it was a simple ABBA deadlock, then it would be reported by lockdep.
> > > > > > > No, it's more complicated:
> > > > > > >
> > > > > > > 1. suspend_devices_and_enter() holds acpi_scan_lock,
> > > > > > > 2. an ACPI hotplug work runs, but acpi_device_hotplug() goes to sleep
> > > > > > > when it gets to acquiring acpi_scan_lock,
> > > > > > > 3. ata_scsi_dev_rescan() submits a SCSI command and waits for its
> > > > > > > completion while holding dev->mutex,
> > > > > > > 4. the SCSI completion work happens to be put on the same workqueue as
> > > > > > > the ACPI hotplug work in step 2,
> > > > > > > ^^^--- THIS is how the two events are serialized!
> > > > > >
> > > > > > Which is unexpected.
> > > > > >
> > > > > > And quite honestly I'm not sure how this can happen, because
> > > > > > acpi_hotplug_schedule() uses a dedicated workqueue and it is called
> > > > > > from (a) the "eject" sysfs attribute (which cannot happen while system
> > > > > > suspend-resume is in progress) and (b) acpi_bus_notify() which has
> > > > > > nothing to do with SCSI.
> > > > >
> > > > > Oh, you're right, and I was too quick. They cannot be on the same
> > > > > queue...
> > > > >
> > > > > > Maybe the workqueue used for the SCSI completion is freezable?
> > > > >
> > > > > Yes, that's it:
> > > > >
> > > > > *(struct workqueue_struct *)0xffff97d240b2fe00 = {
> > > > > /* ... */
> > > > > .flags = (unsigned int)4,
> > > > > /* WQ_FREEZABLE = 1 << 2 */
> > > > >
> > > > > Good. But if this workqueue is frozen, the system still cannot make
> > > > > progress.
> > > >
> > > > The problem seems to be that dev->mutex is held while the work item
> > > > goes to a freezable workqueue and is waited for, which is an almost
> > > > guaranteed deadlock scenario.
> > >
> > > Ah. Thanks for explanation and direction! I'm going to dive into the
> > > block layer and/or SCSI code and bug other people with my findings.
> >
> > Please feel free to CC me on that in case I can help.
>
> And here I am again... The frozen workqueue is in fact pm_wq, and the
> work item that is waited for is pm_runtime_work. The block layer calls
> pm_request_resume() on the device to resume the queue.
>
> I bet the queue should not be resumed this early. In fact, it seems
> that this is somewhat known to the ATA developers, because
> ata_scsi_dev_rescan() contains this beautiful comment and code:
>
> /*
> * If the rescan work was scheduled because of a resume
> * event, the port is already fully resumed, but the
> * SCSI device may not yet be fully resumed. In such
> * case, executing scsi_rescan_device() may cause a
> * deadlock with the PM code on device_lock(). Prevent
> * this by giving up and retrying rescan after a short
> * delay.
> */
> delay_rescan = sdev->sdev_gendev.power.is_suspended;
> if (delay_rescan) {
> scsi_device_put(sdev);
> break;
> }
>
> It just doesn't seem to work as expected, at least not in my case.
Erm. This was v6.5.2, where I originally encountered the issue. I can
see that this snippet has just been changed by commit 8b4d9469d0b0.
@Damien: I can still reproduce a deadlock on resume with v6.6-rc4.
Petr T
* Re: Thinkpad E595 system deadlock on resume from S3
2023-10-03 15:18 ` Petr Tesařík
2023-10-03 16:16 ` Petr Tesařík
@ 2023-10-03 16:48 ` Rafael J. Wysocki
2023-10-03 17:19 ` Petr Tesařík
2023-10-03 20:07 ` Petr Tesařík
2023-10-04 6:13 ` Hannes Reinecke
2 siblings, 2 replies; 21+ messages in thread
From: Rafael J. Wysocki @ 2023-10-03 16:48 UTC (permalink / raw)
To: Petr Tesařík
Cc: Rafael J. Wysocki, Damien Le Moal, linux-pm, linux-ide
On Tue, Oct 3, 2023 at 5:18 PM Petr Tesařík <petr@tesarici.cz> wrote:
>
> On Tue, 3 Oct 2023 14:57:46 +0200
> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
>
> > On Tue, Oct 3, 2023 at 2:51 PM Petr Tesařík <petr@tesarici.cz> wrote:
> > >
> > > On Tue, 3 Oct 2023 14:48:13 +0200
> > > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> > >
> > > > On Tue, Oct 3, 2023 at 2:40 PM Petr Tesařík <petr@tesarici.cz> wrote:
> > > > >
> > > > > On Tue, 3 Oct 2023 14:34:56 +0200
> > > > > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> > > > >
> > > > > > On Tue, Oct 3, 2023 at 1:02 PM Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > > >
> > > > > > > On Tue, 3 Oct 2023 12:15:10 +0200
> > > > > > > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> > > > > > >
> > > > > > > > On Tue, Oct 3, 2023 at 11:31 AM Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > > > > >
> > > > > > > > > Hi again (adding more recipients),
> > > > > > > > >
> > > > > > > > > On Sat, 30 Sep 2023 12:20:54 +0200
> > > > > > > > > Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > > > > >
> > > > > > > > > > Hi all,
> > > > > > > > > >
> > > > > > > > > > this time no patch (yet). In short, my Thinkpad running v6.6-rc3 fails
> > > > > > > > > > to resume from S3. It also fails the same way with Tumbleweed v6.5
> > > > > > > > > > kernel. I was able to capture a crash dump of the v6.5 kernel, and
> > > > > > > > > > here's my analysis:
> > > > > > > > > >
> > > > > > > > > > The system never gets to waking up my SATA SSD disk:
> > > > > > > > > >
> > > > > > > > > > [0:0:0:0] disk ATA KINGSTON SEDC600 H5.1 /dev/sda
> > > > > > > > > >
> > > > > > > > > > There is a pending resume work for kworker/u32:12 (PID 11032), but this
> > > > > > > > > > worker is stuck in 'D' state:
> > > > > > > > > >
> > > > > > > > > > >>> prog.stack_trace(11032)
> > > > > > > > > > #0 context_switch (../kernel/sched/core.c:5381:2)
> > > > > > > > > > #1 __schedule (../kernel/sched/core.c:6710:8)
> > > > > > > > > > #2 schedule (../kernel/sched/core.c:6786:3)
> > > > > > > > > > #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> > > > > > > > > > #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> > > > > > > > > > #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> > > > > > > > > > #6 acpi_device_hotplug (../drivers/acpi/scan.c:382:2)
> > > > > > > > > > #7 acpi_hotplug_work_fn (../drivers/acpi/osl.c:1162:2)
> > > > > > > > > > #8 process_one_work (../kernel/workqueue.c:2600:2)
> > > > > > > > > > #9 worker_thread (../kernel/workqueue.c:2751:4)
> > > > > > > > > > #10 kthread (../kernel/kthread.c:389:9)
> > > > > > > > > > #11 ret_from_fork (../arch/x86/kernel/process.c:145:3)
> > > > > > > > > > #12 ret_from_fork_asm+0x1b/0x20 (../arch/x86/entry/entry_64.S:304)
> > > > > > > > > >
> > > > > > > > > > acpi_device_hotplug() tries to acquire acpi_scan_lock, which is held by
> > > > > > > > > > systemd-sleep (PID 11002). This task is also in 'D' state:
> > > > > > > > > >
> > > > > > > > > > >>> prog.stack_trace(11002)
> > > > > > > > > > #0 context_switch (../kernel/sched/core.c:5381:2)
> > > > > > > > > > #1 __schedule (../kernel/sched/core.c:6710:8)
> > > > > > > > > > #2 schedule (../kernel/sched/core.c:6786:3)
> > > > > > > > > > #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> > > > > > > > > > #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> > > > > > > > > > #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> > > > > > > > > > #6 device_lock (../include/linux/device.h:958:2)
> > > > > > > > > > #7 device_complete (../drivers/base/power/main.c:1063:2)
> > > > > > > > > > #8 dpm_complete (../drivers/base/power/main.c:1121:3)
> > > > > > > > > > #9 suspend_devices_and_enter (../kernel/power/suspend.c:516:2)
> > > > > > > > >
> > > > > > > > > I believe the issue must be somewhere here. The whole suspend and
> > > > > > > > > resume logic in suspend_devices_and_enter() is framed by
> > > > > > > > > platform_suspend_begin() and platform_resume_end().
> > > > > > > > >
> > > > > > > > > My system is an ACPI system, so suspend_ops contains:
> > > > > > > > >
> > > > > > > > > .begin = acpi_suspend_begin,
> > > > > > > > > .end = acpi_pm_end,
> > > > > > > > >
> > > > > > > > > Now, acpi_suspend_begin() acquires acpi_scan_lock through
> > > > > > > > > acpi_pm_start(), and the lock is not released until acpi_pm_end().
> > > > > > > > > Since dpm_complete() waits for the completion of a work that tries to
> > > > > > > > > acquire acpi_scan_lock, the system will deadlock.
> > > > > > > >
> > > > > > > > So holding acpi_scan_lock across suspend-resume is basically to
> > > > > > > > prevent the hotplug from taking place then IIRC.
> > > > > > > >
> > > > > > > > > AFAICS either:
> > > > > > > > >
> > > > > > > > > a. the ACPI lock cannot be held while dpm_complete() runs, or
> > > > > > > > > b. ata_scsi_dev_rescan() must not be scheduled before the system is
> > > > > > > > > resumed, or
> > > > > > > > > c. acpi_device_hotplug() must be implemented without taking dev->mutex.
> > > > > > > > >
> > > > > > > > > My gut feeling is that b. is the right answer.
> > > > > > > >
> > > > > > > > It's been a while since I looked at that code last time, but then it
> > > > > > > > has not changed for quite some time too.
> > > > > > > >
> > > > > > > > It looks like the acpi_device_hotplug() path attempts to acquire
> > > > > > > > acpi_scan_lock() while holding dev->mutex which is kind of silly. I
> > > > > > > > need to check that, though.
> > > > > > >
> > > > > > > Thanks for your willingness. Well, it's not quite what you describe. If
> > > > > > > it was a simple ABBA deadlock, then it would be reported by lockdep.
> > > > > > > No, it's more complicated:
> > > > > > >
> > > > > > > 1. suspend_devices_and_enter() holds acpi_scan_lock,
> > > > > > > 2. an ACPI hotplug work runs, but acpi_device_hotplug() goes to sleep
> > > > > > > when it gets to acquiring acpi_scan_lock,
> > > > > > > 3. ata_scsi_dev_rescan() submits a SCSI command and waits for its
> > > > > > > completion while holding dev->mutex,
> > > > > > > 4. the SCSI completion work happens to be put on the same workqueue as
> > > > > > > the ACPI hotplug work in step 2,
> > > > > > > ^^^--- THIS is how the two events are serialized!
> > > > > >
> > > > > > Which is unexpected.
> > > > > >
> > > > > > And quite honestly I'm not sure how this can happen, because
> > > > > > acpi_hotplug_schedule() uses a dedicated workqueue and it is called
> > > > > > from (a) the "eject" sysfs attribute (which cannot happen while system
> > > > > > suspend-resume is in progress) and (b) acpi_bus_notify() which has
> > > > > > nothing to do with SCSI.
> > > > >
> > > > > Oh, you're right, and I was too quick. They cannot be on the same
> > > > > queue...
> > > > >
> > > > > > Maybe the workqueue used for the SCSI completion is freezable?
> > > > >
> > > > > Yes, that's it:
> > > > >
> > > > > *(struct workqueue_struct *)0xffff97d240b2fe00 = {
> > > > > /* ... */
> > > > > .flags = (unsigned int)4,
> > > > > /* WQ_FREEZABLE = 1 << 2 */
> > > > >
> > > > > Good. But if this workqueue is frozen, the system still cannot make
> > > > > progress.
> > > >
> > > > The problem seems to be that dev->mutex is held while the work item
> > > > goes to a freezable workqueue and is waited for, which is an almost
> > > > guaranteed deadlock scenario.
> > >
> > > Ah. Thanks for explanation and direction! I'm going to dive into the
> > > block layer and/or SCSI code and bug other people with my findings.
> >
> > Please feel free to CC me on that in case I can help.
>
> And here I am again... The frozen workqueue is in fact pm_wq, and the
> work item that is waited for is pm_runtime_work. The block layer calls
> pm_request_resume() on the device to resume the queue.
If it called pm_runtime_resume() instead, this might work.
> I bet the queue should not be resumed this early. In fact, it seems
> that this is somewhat known to the ATA developers, because
> ata_scsi_dev_rescan() contains this beautiful comment and code:
>
> /*
> * If the rescan work was scheduled because of a resume
> * event, the port is already fully resumed, but the
> * SCSI device may not yet be fully resumed. In such
> * case, executing scsi_rescan_device() may cause a
> * deadlock with the PM code on device_lock(). Prevent
> * this by giving up and retrying rescan after a short
> * delay.
> */
> delay_rescan = sdev->sdev_gendev.power.is_suspended;
> if (delay_rescan) {
> scsi_device_put(sdev);
> break;
> }
>
> It just doesn't seem to work as expected, at least not in my case.
Well, calling pm_request_resume() and waiting for the resume to
actually happen is problematic regardless. It is just better to call
pm_runtime_resume() to synchronously resume the device instead.
* Re: Thinkpad E595 system deadlock on resume from S3
2023-10-03 16:48 ` Rafael J. Wysocki
@ 2023-10-03 17:19 ` Petr Tesařík
2023-10-03 20:07 ` Petr Tesařík
1 sibling, 0 replies; 21+ messages in thread
From: Petr Tesařík @ 2023-10-03 17:19 UTC (permalink / raw)
To: Rafael J. Wysocki; +Cc: Damien Le Moal, linux-pm, linux-ide
On Tue, 3 Oct 2023 18:48:27 +0200
"Rafael J. Wysocki" <rafael@kernel.org> wrote:
> On Tue, Oct 3, 2023 at 5:18 PM Petr Tesařík <petr@tesarici.cz> wrote:
> >
> > On Tue, 3 Oct 2023 14:57:46 +0200
> > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> >
> > > On Tue, Oct 3, 2023 at 2:51 PM Petr Tesařík <petr@tesarici.cz> wrote:
> > > >
> > > > On Tue, 3 Oct 2023 14:48:13 +0200
> > > > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> > > >
> > > > > On Tue, Oct 3, 2023 at 2:40 PM Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > >
> > > > > > On Tue, 3 Oct 2023 14:34:56 +0200
> > > > > > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> > > > > >
> > > > > > > On Tue, Oct 3, 2023 at 1:02 PM Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > > > >
> > > > > > > > On Tue, 3 Oct 2023 12:15:10 +0200
> > > > > > > > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> > > > > > > >
> > > > > > > > > On Tue, Oct 3, 2023 at 11:31 AM Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > > > > > >
> > > > > > > > > > Hi again (adding more recipients),
> > > > > > > > > >
> > > > > > > > > > On Sat, 30 Sep 2023 12:20:54 +0200
> > > > > > > > > > Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi all,
> > > > > > > > > > >
> > > > > > > > > > > this time no patch (yet). In short, my Thinkpad running v6.6-rc3 fails
> > > > > > > > > > > to resume from S3. It also fails the same way with Tumbleweed v6.5
> > > > > > > > > > > kernel. I was able to capture a crash dump of the v6.5 kernel, and
> > > > > > > > > > > here's my analysis:
> > > > > > > > > > >
> > > > > > > > > > > The system never gets to waking up my SATA SSD disk:
> > > > > > > > > > >
> > > > > > > > > > > [0:0:0:0] disk ATA KINGSTON SEDC600 H5.1 /dev/sda
> > > > > > > > > > >
> > > > > > > > > > > There is a pending resume work for kworker/u32:12 (PID 11032), but this
> > > > > > > > > > > worker is stuck in 'D' state:
> > > > > > > > > > >
> > > > > > > > > > > >>> prog.stack_trace(11032)
> > > > > > > > > > > #0 context_switch (../kernel/sched/core.c:5381:2)
> > > > > > > > > > > #1 __schedule (../kernel/sched/core.c:6710:8)
> > > > > > > > > > > #2 schedule (../kernel/sched/core.c:6786:3)
> > > > > > > > > > > #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> > > > > > > > > > > #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> > > > > > > > > > > #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> > > > > > > > > > > #6 acpi_device_hotplug (../drivers/acpi/scan.c:382:2)
> > > > > > > > > > > #7 acpi_hotplug_work_fn (../drivers/acpi/osl.c:1162:2)
> > > > > > > > > > > #8 process_one_work (../kernel/workqueue.c:2600:2)
> > > > > > > > > > > #9 worker_thread (../kernel/workqueue.c:2751:4)
> > > > > > > > > > > #10 kthread (../kernel/kthread.c:389:9)
> > > > > > > > > > > #11 ret_from_fork (../arch/x86/kernel/process.c:145:3)
> > > > > > > > > > > #12 ret_from_fork_asm+0x1b/0x20 (../arch/x86/entry/entry_64.S:304)
> > > > > > > > > > >
> > > > > > > > > > > acpi_device_hotplug() tries to acquire acpi_scan_lock, which is held by
> > > > > > > > > > > systemd-sleep (PID 11002). This task is also in 'D' state:
> > > > > > > > > > >
> > > > > > > > > > > >>> prog.stack_trace(11002)
> > > > > > > > > > > #0 context_switch (../kernel/sched/core.c:5381:2)
> > > > > > > > > > > #1 __schedule (../kernel/sched/core.c:6710:8)
> > > > > > > > > > > #2 schedule (../kernel/sched/core.c:6786:3)
> > > > > > > > > > > #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> > > > > > > > > > > #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> > > > > > > > > > > #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> > > > > > > > > > > #6 device_lock (../include/linux/device.h:958:2)
> > > > > > > > > > > #7 device_complete (../drivers/base/power/main.c:1063:2)
> > > > > > > > > > > #8 dpm_complete (../drivers/base/power/main.c:1121:3)
> > > > > > > > > > > #9 suspend_devices_and_enter (../kernel/power/suspend.c:516:2)
> > > > > > > > > >
> > > > > > > > > > I believe the issue must be somewhere here. The whole suspend and
> > > > > > > > > > resume logic in suspend_devices_and_enter() is framed by
> > > > > > > > > > platform_suspend_begin() and platform_resume_end().
> > > > > > > > > >
> > > > > > > > > > My system is an ACPI system, so suspend_ops contains:
> > > > > > > > > >
> > > > > > > > > > .begin = acpi_suspend_begin,
> > > > > > > > > > .end = acpi_pm_end,
> > > > > > > > > >
> > > > > > > > > > Now, acpi_suspend_begin() acquires acpi_scan_lock through
> > > > > > > > > > acpi_pm_start(), and the lock is not released until acpi_pm_end().
> > > > > > > > > > Since dpm_complete() waits for the completion of a work that tries to
> > > > > > > > > > acquire acpi_scan_lock, the system will deadlock.
> > > > > > > > >
> > > > > > > > > So holding acpi_scan_lock across suspend-resume is basically to
> > > > > > > > > prevent the hotplug from taking place then IIRC.
> > > > > > > > >
> > > > > > > > > > AFAICS either:
> > > > > > > > > >
> > > > > > > > > > a. the ACPI lock cannot be held while dpm_complete() runs, or
> > > > > > > > > > b. ata_scsi_dev_rescan() must not be scheduled before the system is
> > > > > > > > > > resumed, or
> > > > > > > > > > c. acpi_device_hotplug() must be implemented without taking dev->mutex.
> > > > > > > > > >
> > > > > > > > > > My gut feeling is that b. is the right answer.
> > > > > > > > >
> > > > > > > > > It's been a while since I looked at that code last time, but then it
> > > > > > > > > has not changed for quite some time too.
> > > > > > > > >
> > > > > > > > > It looks like the acpi_device_hotplug() path attempts to acquire
> > > > > > > > > acpi_scan_lock() while holding dev->mutex which is kind of silly. I
> > > > > > > > > need to check that, though.
> > > > > > > >
> > > > > > > > Thanks for your willingness. Well, it's not quite what you describe. If
> > > > > > > > it was a simple ABBA deadlock, then it would be reported by lockdep.
> > > > > > > > No, it's more complicated:
> > > > > > > >
> > > > > > > > 1. suspend_devices_and_enter() holds acpi_scan_lock,
> > > > > > > > 2. an ACPI hotplug work runs, but acpi_device_hotplug() goes to sleep
> > > > > > > > when it gets to acquiring acpi_scan_lock,
> > > > > > > > 3. ata_scsi_dev_rescan() submits a SCSI command and waits for its
> > > > > > > > completion while holding dev->mutex,
> > > > > > > > 4. the SCSI completion work happens to be put on the same workqueue as
> > > > > > > > the ACPI hotplug work in step 2,
> > > > > > > > ^^^--- THIS is how the two events are serialized!
> > > > > > >
> > > > > > > Which is unexpected.
> > > > > > >
> > > > > > > And quite honestly I'm not sure how this can happen, because
> > > > > > > acpi_hotplug_schedule() uses a dedicated workqueue and it is called
> > > > > > > from (a) the "eject" sysfs attribute (which cannot happen while system
> > > > > > > suspend-resume is in progress) and (b) acpi_bus_notify() which has
> > > > > > > nothing to do with SCSI.
> > > > > >
> > > > > > Oh, you're right, and I was too quick. They cannot be on the same
> > > > > > queue...
> > > > > >
> > > > > > > Maybe the workqueue used for the SCSI completion is freezable?
> > > > > >
> > > > > > Yes, that's it:
> > > > > >
> > > > > > *(struct workqueue_struct *)0xffff97d240b2fe00 = {
> > > > > > /* ... */
> > > > > > .flags = (unsigned int)4,
> > > > > > /* WQ_FREEZABLE = 1 << 2 */
> > > > > >
> > > > > > Good. But if this workqueue is frozen, the system still cannot make
> > > > > > progress.
> > > > >
> > > > > The problem seems to be that dev->mutex is held while the work item
> > > > > goes to a freezable workqueue and is waited for, which is an almost
> > > > > guaranteed deadlock scenario.
> > > >
> > > > Ah. Thanks for explanation and direction! I'm going to dive into the
> > > > block layer and/or SCSI code and bug other people with my findings.
> > >
> > > Please feel free to CC me on that in case I can help.
> >
> > And here I am again... The frozen workqueue is in fact pm_wq, and the
> > work item that is waited for is pm_runtime_work. The block layer calls
> > pm_request_resume() on the device to resume the queue.
>
> If it called pm_runtime_resume() instead, this might work.
>
> > I bet the queue should not be resumed this early. In fact, it seems
> > that this is somewhat known to the ATA developers, because
> > ata_scsi_dev_rescan() contains this beautiful comment and code:
> >
> > /*
> > * If the rescan work was scheduled because of a resume
> > * event, the port is already fully resumed, but the
> > * SCSI device may not yet be fully resumed. In such
> > * case, executing scsi_rescan_device() may cause a
> > * deadlock with the PM code on device_lock(). Prevent
> > * this by giving up and retrying rescan after a short
> > * delay.
> > */
> > delay_rescan = sdev->sdev_gendev.power.is_suspended;
> > if (delay_rescan) {
> > scsi_device_put(sdev);
> > break;
> > }
> >
> > It just doesn't seem to work as expected, at least not in my case.
>
> Well, calling pm_request_resume() and waiting for the resume to
> actually happen is problematic regardless. It is just better to call
> pm_runtime_resume() to synchronously resume the device instead.
That code is in blk_pm_resume_queue(), which is pretty deep in the
block layer. I can imagine that pm_runtime_resume() cannot always be
used there. We may have to add an alternative API just for resume.
OTOH this may in fact be the correct fix.
But I want to try out a few other things before proposing a solution.
Petr T
* Re: Thinkpad E595 system deadlock on resume from S3
2023-10-03 16:48 ` Rafael J. Wysocki
2023-10-03 17:19 ` Petr Tesařík
@ 2023-10-03 20:07 ` Petr Tesařík
2023-10-04 1:25 ` Damien Le Moal
1 sibling, 1 reply; 21+ messages in thread
From: Petr Tesařík @ 2023-10-03 20:07 UTC (permalink / raw)
To: Rafael J. Wysocki; +Cc: Damien Le Moal, linux-pm, linux-ide
On Tue, 3 Oct 2023 18:48:27 +0200
"Rafael J. Wysocki" <rafael@kernel.org> wrote:
> On Tue, Oct 3, 2023 at 5:18 PM Petr Tesařík <petr@tesarici.cz> wrote:
> >
> > On Tue, 3 Oct 2023 14:57:46 +0200
> > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> >
> > > On Tue, Oct 3, 2023 at 2:51 PM Petr Tesařík <petr@tesarici.cz> wrote:
> > > >
> > > > On Tue, 3 Oct 2023 14:48:13 +0200
> > > > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> > > >
> > > > > On Tue, Oct 3, 2023 at 2:40 PM Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > >
> > > > > > On Tue, 3 Oct 2023 14:34:56 +0200
> > > > > > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> > > > > >
> > > > > > > On Tue, Oct 3, 2023 at 1:02 PM Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > > > >
> > > > > > > > On Tue, 3 Oct 2023 12:15:10 +0200
> > > > > > > > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> > > > > > > >
> > > > > > > > > On Tue, Oct 3, 2023 at 11:31 AM Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > > > > > >
> > > > > > > > > > Hi again (adding more recipients),
> > > > > > > > > >
> > > > > > > > > > On Sat, 30 Sep 2023 12:20:54 +0200
> > > > > > > > > > Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi all,
> > > > > > > > > > >
> > > > > > > > > > > this time no patch (yet). In short, my Thinkpad running v6.6-rc3 fails
> > > > > > > > > > > to resume from S3. It also fails the same way with Tumbleweed v6.5
> > > > > > > > > > > kernel. I was able to capture a crash dump of the v6.5 kernel, and
> > > > > > > > > > > here's my analysis:
> > > > > > > > > > >
> > > > > > > > > > > The system never gets to waking up my SATA SSD disk:
> > > > > > > > > > >
> > > > > > > > > > > [0:0:0:0] disk ATA KINGSTON SEDC600 H5.1 /dev/sda
> > > > > > > > > > >
> > > > > > > > > > > There is a pending resume work for kworker/u32:12 (PID 11032), but this
> > > > > > > > > > > worker is stuck in 'D' state:
> > > > > > > > > > >
> > > > > > > > > > > >>> prog.stack_trace(11032)
> > > > > > > > > > > #0 context_switch (../kernel/sched/core.c:5381:2)
> > > > > > > > > > > #1 __schedule (../kernel/sched/core.c:6710:8)
> > > > > > > > > > > #2 schedule (../kernel/sched/core.c:6786:3)
> > > > > > > > > > > #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> > > > > > > > > > > #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> > > > > > > > > > > #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> > > > > > > > > > > #6 acpi_device_hotplug (../drivers/acpi/scan.c:382:2)
> > > > > > > > > > > #7 acpi_hotplug_work_fn (../drivers/acpi/osl.c:1162:2)
> > > > > > > > > > > #8 process_one_work (../kernel/workqueue.c:2600:2)
> > > > > > > > > > > #9 worker_thread (../kernel/workqueue.c:2751:4)
> > > > > > > > > > > #10 kthread (../kernel/kthread.c:389:9)
> > > > > > > > > > > #11 ret_from_fork (../arch/x86/kernel/process.c:145:3)
> > > > > > > > > > > #12 ret_from_fork_asm+0x1b/0x20 (../arch/x86/entry/entry_64.S:304)
> > > > > > > > > > >
> > > > > > > > > > > acpi_device_hotplug() tries to acquire acpi_scan_lock, which is held by
> > > > > > > > > > > systemd-sleep (PID 11002). This task is also in 'D' state:
> > > > > > > > > > >
> > > > > > > > > > > >>> prog.stack_trace(11002)
> > > > > > > > > > > #0 context_switch (../kernel/sched/core.c:5381:2)
> > > > > > > > > > > #1 __schedule (../kernel/sched/core.c:6710:8)
> > > > > > > > > > > #2 schedule (../kernel/sched/core.c:6786:3)
> > > > > > > > > > > #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> > > > > > > > > > > #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> > > > > > > > > > > #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> > > > > > > > > > > #6 device_lock (../include/linux/device.h:958:2)
> > > > > > > > > > > #7 device_complete (../drivers/base/power/main.c:1063:2)
> > > > > > > > > > > #8 dpm_complete (../drivers/base/power/main.c:1121:3)
> > > > > > > > > > > #9 suspend_devices_and_enter (../kernel/power/suspend.c:516:2)
> > > > > > > > > >
> > > > > > > > > > I believe the issue must be somewhere here. The whole suspend and
> > > > > > > > > > resume logic in suspend_devices_and_enter() is framed by
> > > > > > > > > > platform_suspend_begin() and platform_resume_end().
> > > > > > > > > >
> > > > > > > > > > My system is an ACPI system, so suspend_ops contains:
> > > > > > > > > >
> > > > > > > > > > .begin = acpi_suspend_begin,
> > > > > > > > > > .end = acpi_pm_end,
> > > > > > > > > >
> > > > > > > > > > Now, acpi_suspend_begin() acquires acpi_scan_lock through
> > > > > > > > > > acpi_pm_start(), and the lock is not released until acpi_pm_end().
> > > > > > > > > > Since dpm_complete() waits for the completion of a work that tries to
> > > > > > > > > > acquire acpi_scan_lock, the system will deadlock.
> > > > > > > > >
> > > > > > > > > So holding acpi_scan_lock across suspend-resume is basically to
> > > > > > > > > prevent the hotplug from taking place then IIRC.
> > > > > > > > >
> > > > > > > > > > AFAICS either:
> > > > > > > > > >
> > > > > > > > > > a. the ACPI lock cannot be held while dpm_complete() runs, or
> > > > > > > > > > b. ata_scsi_dev_rescan() must not be scheduled before the system is
> > > > > > > > > > resumed, or
> > > > > > > > > > c. acpi_device_hotplug() must be implemented without taking dev->mutex.
> > > > > > > > > >
> > > > > > > > > > My gut feeling is that b. is the right answer.
> > > > > > > > >
> > > > > > > > > It's been a while since I looked at that code last time, but then it
> > > > > > > > > has not changed for quite some time too.
> > > > > > > > >
> > > > > > > > > It looks like the acpi_device_hotplug() path attempts to acquire
> > > > > > > > > acpi_scan_lock() while holding dev->mutex which is kind of silly. I
> > > > > > > > > need to check that, though.
> > > > > > > >
> > > > > > > > Thanks for your willingness. Well, it's not quite what you describe. If
> > > > > > > > it was a simple ABBA deadlock, then it would be reported by lockdep.
> > > > > > > > No, it's more complicated:
> > > > > > > >
> > > > > > > > 1. suspend_devices_and_enter() holds acpi_scan_lock,
> > > > > > > > 2. an ACPI hotplug work runs, but acpi_device_hotplug() goes to sleep
> > > > > > > > when it gets to acquiring acpi_scan_lock,
> > > > > > > > 3. ata_scsi_dev_rescan() submits a SCSI command and waits for its
> > > > > > > > completion while holding dev->mutex,
> > > > > > > > 4. the SCSI completion work happens to be put on the same workqueue as
> > > > > > > > the ACPI hotplug work in step 2,
> > > > > > > > ^^^--- THIS is how the two events are serialized!
> > > > > > >
> > > > > > > Which is unexpected.
> > > > > > >
> > > > > > > And quite honestly I'm not sure how this can happen, because
> > > > > > > acpi_hotplug_schedule() uses a dedicated workqueue and it is called
> > > > > > > from (a) the "eject" sysfs attribute (which cannot happen while system
> > > > > > > suspend-resume is in progress) and (b) acpi_bus_notify() which has
> > > > > > > nothing to do with SCSI.
> > > > > >
> > > > > > Oh, you're right, and I was too quick. They cannot be on the same
> > > > > > queue...
> > > > > >
> > > > > > > Maybe the workqueue used for the SCSI completion is freezable?
> > > > > >
> > > > > > Yes, that's it:
> > > > > >
> > > > > > *(struct workqueue_struct *)0xffff97d240b2fe00 = {
> > > > > > /* ... */
> > > > > > .flags = (unsigned int)4,
> > > > > > /* WQ_FREEZABLE = 1 << 2 */
> > > > > >
> > > > > > Good. But if this workqueue is frozen, the system still cannot make
> > > > > > progress.
> > > > >
> > > > > The problem seems to be that dev->mutex is held while the work item
> > > > > goes to a freezable workqueue and is waited for, which is an almost
> > > > > guaranteed deadlock scenario.
> > > >
> > > > Ah. Thanks for explanation and direction! I'm going to dive into the
> > > > block layer and/or SCSI code and bug other people with my findings.
> > >
> > > Please feel free to CC me on that in case I can help.
> >
> > And here I am again... The frozen workqueue is in fact pm_wq, and the
> > work item that is waited for is pm_runtime_work. The block layer calls
> > pm_request_resume() on the device to resume the queue.
>
> If it called pm_runtime_resume() instead, this might work.
>
> > I bet the queue should not be resumed this early. In fact, it seems
> > that this is somewhat known to the ATA developers, because
> > ata_scsi_dev_rescan() contains this beautiful comment and code:
> >
> > /*
> > * If the rescan work was scheduled because of a resume
> > * event, the port is already fully resumed, but the
> > * SCSI device may not yet be fully resumed. In such
> > * case, executing scsi_rescan_device() may cause a
> > * deadlock with the PM code on device_lock(). Prevent
> > * this by giving up and retrying rescan after a short
> > * delay.
> > */
> > delay_rescan = sdev->sdev_gendev.power.is_suspended;
> > if (delay_rescan) {
> > scsi_device_put(sdev);
> > break;
> > }
> >
> > It just doesn't seem to work as expected, at least not in my case.
>
> Well, calling pm_request_resume() and waiting for the resume to
> actually happen is problematic regardless. It is just better to call
> pm_runtime_resume() to synchronously resume the device instead.
I just want to confirm that my understanding of the issue is correct
now. After applying the patch below, my laptop has just survived half a
dozen suspend/resume cycles with autosuspend enabled for the SATA SSD
disk. Without the patch, it usually freezes on the first attempt.
diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index a371b497035e..87000f89319e 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -4768,6 +4768,14 @@ void ata_scsi_dev_rescan(struct work_struct *work)
         ata_for_each_dev(dev, link, ENABLED) {
                 struct scsi_device *sdev = dev->sdev;
 
+                /*
+                 * If the queue accepts only PM, bail out.
+                 */
+                if (blk_queue_pm_only(sdev->request_queue)) {
+                        ret = -EAGAIN;
+                        goto unlock;
+                }
+
                 /*
                  * If the port was suspended before this was scheduled,
                  * bail out.
Petr T
^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: Thinkpad E595 system deadlock on resume from S3
2023-10-03 20:07 ` Petr Tesařík
@ 2023-10-04 1:25 ` Damien Le Moal
2023-10-04 6:43 ` Petr Tesařík
0 siblings, 1 reply; 21+ messages in thread
From: Damien Le Moal @ 2023-10-04 1:25 UTC (permalink / raw)
To: Petr Tesařík, Rafael J. Wysocki; +Cc: linux-pm, linux-ide
On 10/4/23 05:07, Petr Tesařík wrote:
> I just want to confirm that my understanding of the issue is correct
> now. After applying the patch below, my laptop has just survived half a
> dozen suspend/resume cycles with autosuspend enabled for the SATA SSD
> disk. Without the patch, it usually freezes on first attempt.
>
> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
> index a371b497035e..87000f89319e 100644
> --- a/drivers/ata/libata-scsi.c
> +++ b/drivers/ata/libata-scsi.c
> @@ -4768,6 +4768,14 @@ void ata_scsi_dev_rescan(struct work_struct *work)
>          ata_for_each_dev(dev, link, ENABLED) {
>                  struct scsi_device *sdev = dev->sdev;
> 
> +                /*
> +                 * If the queue accepts only PM, bail out.
> +                 */
> +                if (blk_queue_pm_only(sdev->request_queue)) {
> +                        ret = -EAGAIN;
> +                        goto unlock;
> +                }
> +
>                  /*
>                   * If the port was suspended before this was scheduled,
>                   * bail out.
This seems sensible to me: scsi_rescan_device() only checks that the device is
running, without checking that the device queue is also resumed. So the right
place for this check is scsi_rescan_device():
diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index 3db4d31a03a1..b05a55f498a2 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -1627,12 +1627,13 @@ int scsi_rescan_device(struct scsi_device *sdev)
         device_lock(dev);
 
         /*
-         * Bail out if the device is not running. Otherwise, the rescan may
-         * block waiting for commands to be executed, with us holding the
-         * device lock. This can result in a potential deadlock in the power
-         * management core code when system resume is on-going.
+         * Bail out if the device or its queue are not running. Otherwise,
+         * the rescan may block waiting for commands to be executed, with us
+         * holding the device lock. This can result in a potential deadlock
+         * in the power management core code when system resume is on-going.
          */
-        if (sdev->sdev_state != SDEV_RUNNING) {
+        if (sdev->sdev_state != SDEV_RUNNING ||
+            blk_queue_pm_only(sdev->request_queue)) {
                 ret = -EWOULDBLOCK;
                 goto unlock;
         }
Can you try the above to confirm it works for you? It should, as it is
essentially the same change as yours.
--
Damien Le Moal
Western Digital Research
^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: Thinkpad E595 system deadlock on resume from S3
2023-10-03 15:18 ` Petr Tesařík
2023-10-03 16:16 ` Petr Tesařík
2023-10-03 16:48 ` Rafael J. Wysocki
@ 2023-10-04 6:13 ` Hannes Reinecke
2023-10-04 6:18 ` Damien Le Moal
2 siblings, 1 reply; 21+ messages in thread
From: Hannes Reinecke @ 2023-10-04 6:13 UTC (permalink / raw)
To: Petr Tesařík, Rafael J. Wysocki, Damien Le Moal
Cc: linux-pm, linux-ide
On 10/3/23 17:18, Petr Tesařík wrote:
> On Tue, 3 Oct 2023 14:57:46 +0200
> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
>
>> On Tue, Oct 3, 2023 at 2:51 PM Petr Tesařík <petr@tesarici.cz> wrote:
>>>
>>> On Tue, 3 Oct 2023 14:48:13 +0200
>>> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
>>>
>>>> On Tue, Oct 3, 2023 at 2:40 PM Petr Tesařík <petr@tesarici.cz> wrote:
>>>>>
>>>>> On Tue, 3 Oct 2023 14:34:56 +0200
>>>>> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
>>>>>
>>>>>> On Tue, Oct 3, 2023 at 1:02 PM Petr Tesařík <petr@tesarici.cz> wrote:
>>>>>>>
>>>>>>> On Tue, 3 Oct 2023 12:15:10 +0200
>>>>>>> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
>>>>>>>
>>>>>>>> On Tue, Oct 3, 2023 at 11:31 AM Petr Tesařík <petr@tesarici.cz> wrote:
>>>>>>>>>
>>>>>>>>> Hi again (adding more recipients),
>>>>>>>>>
>>>>>>>>> On Sat, 30 Sep 2023 12:20:54 +0200
>>>>>>>>> Petr Tesařík <petr@tesarici.cz> wrote:
>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> this time no patch (yet). In short, my Thinkpad running v6.6-rc3 fails
>>>>>>>>>> to resume from S3. It also fails the same way with Tumbleweed v6.5
>>>>>>>>>> kernel. I was able to capture a crash dump of the v6.5 kernel, and
>>>>>>>>>> here's my analysis:
>>>>>>>>>>
>>>>>>>>>> The system never gets to waking up my SATA SSD disk:
>>>>>>>>>>
>>>>>>>>>> [0:0:0:0] disk ATA KINGSTON SEDC600 H5.1 /dev/sda
>>>>>>>>>>
>>>>>>>>>> There is a pending resume work for kworker/u32:12 (PID 11032), but this
>>>>>>>>>> worker is stuck in 'D' state:
>>>>>>>>>>
>>>>>>>>>>>>> prog.stack_trace(11032)
>>>>>>>>>> #0 context_switch (../kernel/sched/core.c:5381:2)
>>>>>>>>>> #1 __schedule (../kernel/sched/core.c:6710:8)
>>>>>>>>>> #2 schedule (../kernel/sched/core.c:6786:3)
>>>>>>>>>> #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
>>>>>>>>>> #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
>>>>>>>>>> #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
>>>>>>>>>> #6 acpi_device_hotplug (../drivers/acpi/scan.c:382:2)
>>>>>>>>>> #7 acpi_hotplug_work_fn (../drivers/acpi/osl.c:1162:2)
>>>>>>>>>> #8 process_one_work (../kernel/workqueue.c:2600:2)
>>>>>>>>>> #9 worker_thread (../kernel/workqueue.c:2751:4)
>>>>>>>>>> #10 kthread (../kernel/kthread.c:389:9)
>>>>>>>>>> #11 ret_from_fork (../arch/x86/kernel/process.c:145:3)
>>>>>>>>>> #12 ret_from_fork_asm+0x1b/0x20 (../arch/x86/entry/entry_64.S:304)
>>>>>>>>>>
>>>>>>>>>> acpi_device_hotplug() tries to acquire acpi_scan_lock, which is held by
>>>>>>>>>> systemd-sleep (PID 11002). This task is also in 'D' state:
>>>>>>>>>>
>>>>>>>>>>>>> prog.stack_trace(11002)
>>>>>>>>>> #0 context_switch (../kernel/sched/core.c:5381:2)
>>>>>>>>>> #1 __schedule (../kernel/sched/core.c:6710:8)
>>>>>>>>>> #2 schedule (../kernel/sched/core.c:6786:3)
>>>>>>>>>> #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
>>>>>>>>>> #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
>>>>>>>>>> #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
>>>>>>>>>> #6 device_lock (../include/linux/device.h:958:2)
>>>>>>>>>> #7 device_complete (../drivers/base/power/main.c:1063:2)
>>>>>>>>>> #8 dpm_complete (../drivers/base/power/main.c:1121:3)
>>>>>>>>>> #9 suspend_devices_and_enter (../kernel/power/suspend.c:516:2)
>>>>>>>>>
>>>>>>>>> I believe the issue must be somewhere here. The whole suspend and
>>>>>>>>> resume logic in suspend_devices_and_enter() is framed by
>>>>>>>>> platform_suspend_begin() and platform_resume_end().
>>>>>>>>>
>>>>>>>>> My system is an ACPI system, so suspend_ops contains:
>>>>>>>>>
>>>>>>>>> .begin = acpi_suspend_begin,
>>>>>>>>> .end = acpi_pm_end,
>>>>>>>>>
>>>>>>>>> Now, acpi_suspend_begin() acquires acpi_scan_lock through
>>>>>>>>> acpi_pm_start(), and the lock is not released until acpi_pm_end().
>>>>>>>>> Since dpm_complete() waits for the completion of a work that tries to
>>>>>>>>> acquire acpi_scan_lock, the system will deadlock.
>>>>>>>>
>>>>>>>> So holding acpi_scan_lock across suspend-resume is basically to
>>>>>>>> prevent the hotplug from taking place then IIRC.
>>>>>>>>
>>>>>>>>> AFAICS either:
>>>>>>>>>
>>>>>>>>> a. the ACPI lock cannot be held while dpm_complete() runs, or
>>>>>>>>> b. ata_scsi_dev_rescan() must not be scheduled before the system is
>>>>>>>>> resumed, or
>>>>>>>>> c. acpi_device_hotplug() must be implemented without taking dev->mutex.
>>>>>>>>>
>>>>>>>>> My gut feeling is that b. is the right answer.
>>>>>>>>
>>>>>>>> It's been a while since I looked at that code last time, but then it
>>>>>>>> has not changed for quite some time too.
>>>>>>>>
>>>>>>>> It looks like the acpi_device_hotplug() path attempts to acquire
>>>>>>>> acpi_scan_lock() while holding dev->mutex which is kind of silly. I
>>>>>>>> need to check that, though.
>>>>>>>
>>>>>>> Thanks for your willingness. Well, it's not quite what you describe. If
>>>>>>> it was a simple ABBA deadlock, then it would be reported by lockdep.
>>>>>>> No, it's more complicated:
>>>>>>>
>>>>>>> 1. suspend_devices_and_enter() holds acpi_scan_lock,
>>>>>>> 2. an ACPI hotplug work runs, but acpi_device_hotplug() goes to sleep
>>>>>>> when it gets to acquiring acpi_scan_lock,
>>>>>>> 3. ata_scsi_dev_rescan() submits a SCSI command and waits for its
>>>>>>> completion while holding dev->mutex,
>>>>>>> 4. the SCSI completion work happens to be put on the same workqueue as
>>>>>>> the ACPI hotplug work in step 2,
>>>>>>> ^^^--- THIS is how the two events are serialized!
>>>>>>
>>>>>> Which is unexpected.
>>>>>>
>>>>>> And quite honestly I'm not sure how this can happen, because
>>>>>> acpi_hotplug_schedule() uses a dedicated workqueue and it is called
>>>>>> from (a) the "eject" sysfs attribute (which cannot happen while system
>>>>>> suspend-resume is in progress) and (b) acpi_bus_notify() which has
>>>>>> nothing to do with SCSI.
>>>>>
>>>>> Oh, you're right, and I was too quick. They cannot be on the same
>>>>> queue...
>>>>>
>>>>>> Maybe the workqueue used for the SCSI completion is freezable?
>>>>>
>>>>> Yes, that's it:
>>>>>
>>>>> *(struct workqueue_struct *)0xffff97d240b2fe00 = {
>>>>> /* ... */
>>>>> .flags = (unsigned int)4,
>>>>> /* WQ_FREEZABLE = 1 << 2 */
>>>>>
>>>>> Good. But if this workqueue is frozen, the system still cannot make
>>>>> progress.
>>>>
>>>> The problem seems to be that dev->mutex is held while the work item
>>>> goes to a freezable workqueue and is waited for, which is an almost
>>>> guaranteed deadlock scenario.
>>>
>>> Ah. Thanks for explanation and direction! I'm going to dive into the
>>> block layer and/or SCSI code and bug other people with my findings.
>>
>> Please feel free to CC me on that in case I can help.
>
> And here I am again... The frozen workqueue is in fact pm_wq, and the
> work item that is waited for is pm_runtime_work. The block layer calls
> pm_request_resume() on the device to resume the queue.
>
> I bet the queue should not be resumed this early. In fact, it seems
> that this is somewhat known to the ATA developers, because
> ata_scsi_dev_rescan() contains this beautiful comment and code:
>
> /*
> * If the rescan work was scheduled because of a resume
> * event, the port is already fully resumed, but the
> * SCSI device may not yet be fully resumed. In such
> * case, executing scsi_rescan_device() may cause a
> * deadlock with the PM code on device_lock(). Prevent
> * this by giving up and retrying rescan after a short
> * delay.
> */
> delay_rescan = sdev->sdev_gendev.power.is_suspended;
> if (delay_rescan) {
> scsi_device_put(sdev);
> break;
> }
>
> It just doesn't seem to work as expected, at least not in my case.
>
Did you test the libata suspend/resume patches from Damien?
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Thinkpad E595 system deadlock on resume from S3
2023-10-04 6:13 ` Hannes Reinecke
@ 2023-10-04 6:18 ` Damien Le Moal
2023-10-04 6:42 ` Petr Tesařík
0 siblings, 1 reply; 21+ messages in thread
From: Damien Le Moal @ 2023-10-04 6:18 UTC (permalink / raw)
To: Hannes Reinecke, Petr Tesařík, Rafael J. Wysocki
Cc: linux-pm, linux-ide
On 10/4/23 15:13, Hannes Reinecke wrote:
> On 10/3/23 17:18, Petr Tesařík wrote:
>> On Tue, 3 Oct 2023 14:57:46 +0200
>> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
>>
>>> On Tue, Oct 3, 2023 at 2:51 PM Petr Tesařík <petr@tesarici.cz> wrote:
>>>>
>>>> On Tue, 3 Oct 2023 14:48:13 +0200
>>>> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
>>>>
>>>>> On Tue, Oct 3, 2023 at 2:40 PM Petr Tesařík <petr@tesarici.cz> wrote:
>>>>>>
>>>>>> On Tue, 3 Oct 2023 14:34:56 +0200
>>>>>> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
>>>>>>
>>>>>>> On Tue, Oct 3, 2023 at 1:02 PM Petr Tesařík <petr@tesarici.cz> wrote:
>>>>>>>>
>>>>>>>> On Tue, 3 Oct 2023 12:15:10 +0200
>>>>>>>> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
>>>>>>>>
>>>>>>>>> On Tue, Oct 3, 2023 at 11:31 AM Petr Tesařík <petr@tesarici.cz> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi again (adding more recipients),
>>>>>>>>>>
>>>>>>>>>> On Sat, 30 Sep 2023 12:20:54 +0200
>>>>>>>>>> Petr Tesařík <petr@tesarici.cz> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> this time no patch (yet). In short, my Thinkpad running v6.6-rc3 fails
>>>>>>>>>>> to resume from S3. It also fails the same way with Tumbleweed v6.5
>>>>>>>>>>> kernel. I was able to capture a crash dump of the v6.5 kernel, and
>>>>>>>>>>> here's my analysis:
>>>>>>>>>>>
>>>>>>>>>>> The system never gets to waking up my SATA SSD disk:
>>>>>>>>>>>
>>>>>>>>>>> [0:0:0:0] disk ATA KINGSTON SEDC600 H5.1 /dev/sda
>>>>>>>>>>>
>>>>>>>>>>> There is a pending resume work for kworker/u32:12 (PID 11032), but this
>>>>>>>>>>> worker is stuck in 'D' state:
>>>>>>>>>>>
>>>>>>>>>>>>>> prog.stack_trace(11032)
>>>>>>>>>>> #0 context_switch (../kernel/sched/core.c:5381:2)
>>>>>>>>>>> #1 __schedule (../kernel/sched/core.c:6710:8)
>>>>>>>>>>> #2 schedule (../kernel/sched/core.c:6786:3)
>>>>>>>>>>> #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
>>>>>>>>>>> #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
>>>>>>>>>>> #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
>>>>>>>>>>> #6 acpi_device_hotplug (../drivers/acpi/scan.c:382:2)
>>>>>>>>>>> #7 acpi_hotplug_work_fn (../drivers/acpi/osl.c:1162:2)
>>>>>>>>>>> #8 process_one_work (../kernel/workqueue.c:2600:2)
>>>>>>>>>>> #9 worker_thread (../kernel/workqueue.c:2751:4)
>>>>>>>>>>> #10 kthread (../kernel/kthread.c:389:9)
>>>>>>>>>>> #11 ret_from_fork (../arch/x86/kernel/process.c:145:3)
>>>>>>>>>>> #12 ret_from_fork_asm+0x1b/0x20 (../arch/x86/entry/entry_64.S:304)
>>>>>>>>>>>
>>>>>>>>>>> acpi_device_hotplug() tries to acquire acpi_scan_lock, which is held by
>>>>>>>>>>> systemd-sleep (PID 11002). This task is also in 'D' state:
>>>>>>>>>>>
>>>>>>>>>>>>>> prog.stack_trace(11002)
>>>>>>>>>>> #0 context_switch (../kernel/sched/core.c:5381:2)
>>>>>>>>>>> #1 __schedule (../kernel/sched/core.c:6710:8)
>>>>>>>>>>> #2 schedule (../kernel/sched/core.c:6786:3)
>>>>>>>>>>> #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
>>>>>>>>>>> #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
>>>>>>>>>>> #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
>>>>>>>>>>> #6 device_lock (../include/linux/device.h:958:2)
>>>>>>>>>>> #7 device_complete (../drivers/base/power/main.c:1063:2)
>>>>>>>>>>> #8 dpm_complete (../drivers/base/power/main.c:1121:3)
>>>>>>>>>>> #9 suspend_devices_and_enter (../kernel/power/suspend.c:516:2)
>>>>>>>>>>
>>>>>>>>>> I believe the issue must be somewhere here. The whole suspend and
>>>>>>>>>> resume logic in suspend_devices_and_enter() is framed by
>>>>>>>>>> platform_suspend_begin() and platform_resume_end().
>>>>>>>>>>
>>>>>>>>>> My system is an ACPI system, so suspend_ops contains:
>>>>>>>>>>
>>>>>>>>>> .begin = acpi_suspend_begin,
>>>>>>>>>> .end = acpi_pm_end,
>>>>>>>>>>
>>>>>>>>>> Now, acpi_suspend_begin() acquires acpi_scan_lock through
>>>>>>>>>> acpi_pm_start(), and the lock is not released until acpi_pm_end().
>>>>>>>>>> Since dpm_complete() waits for the completion of a work that tries to
>>>>>>>>>> acquire acpi_scan_lock, the system will deadlock.
>>>>>>>>>
>>>>>>>>> So holding acpi_scan_lock across suspend-resume is basically to
>>>>>>>>> prevent the hotplug from taking place then IIRC.
>>>>>>>>>
>>>>>>>>>> AFAICS either:
>>>>>>>>>>
>>>>>>>>>> a. the ACPI lock cannot be held while dpm_complete() runs, or
>>>>>>>>>> b. ata_scsi_dev_rescan() must not be scheduled before the system is
>>>>>>>>>> resumed, or
>>>>>>>>>> c. acpi_device_hotplug() must be implemented without taking dev->mutex.
>>>>>>>>>>
>>>>>>>>>> My gut feeling is that b. is the right answer.
>>>>>>>>>
>>>>>>>>> It's been a while since I looked at that code last time, but then it
>>>>>>>>> has not changed for quite some time too.
>>>>>>>>>
>>>>>>>>> It looks like the acpi_device_hotplug() path attempts to acquire
>>>>>>>>> acpi_scan_lock() while holding dev->mutex which is kind of silly. I
>>>>>>>>> need to check that, though.
>>>>>>>>
>>>>>>>> Thanks for your willingness. Well, it's not quite what you describe. If
>>>>>>>> it was a simple ABBA deadlock, then it would be reported by lockdep.
>>>>>>>> No, it's more complicated:
>>>>>>>>
>>>>>>>> 1. suspend_devices_and_enter() holds acpi_scan_lock,
>>>>>>>> 2. an ACPI hotplug work runs, but acpi_device_hotplug() goes to sleep
>>>>>>>> when it gets to acquiring acpi_scan_lock,
>>>>>>>> 3. ata_scsi_dev_rescan() submits a SCSI command and waits for its
>>>>>>>> completion while holding dev->mutex,
>>>>>>>> 4. the SCSI completion work happens to be put on the same workqueue as
>>>>>>>> the ACPI hotplug work in step 2,
>>>>>>>> ^^^--- THIS is how the two events are serialized!
>>>>>>>
>>>>>>> Which is unexpected.
>>>>>>>
>>>>>>> And quite honestly I'm not sure how this can happen, because
>>>>>>> acpi_hotplug_schedule() uses a dedicated workqueue and it is called
>>>>>>> from (a) the "eject" sysfs attribute (which cannot happen while system
>>>>>>> suspend-resume is in progress) and (b) acpi_bus_notify() which has
>>>>>>> nothing to do with SCSI.
>>>>>>
>>>>>> Oh, you're right, and I was too quick. They cannot be on the same
>>>>>> queue...
>>>>>>
>>>>>>> Maybe the workqueue used for the SCSI completion is freezable?
>>>>>>
>>>>>> Yes, that's it:
>>>>>>
>>>>>> *(struct workqueue_struct *)0xffff97d240b2fe00 = {
>>>>>> /* ... */
>>>>>> .flags = (unsigned int)4,
>>>>>> /* WQ_FREEZABLE = 1 << 2 */
>>>>>>
>>>>>> Good. But if this workqueue is frozen, the system still cannot make
>>>>>> progress.
>>>>>
>>>>> The problem seems to be that dev->mutex is held while the work item
>>>>> goes to a freezable workqueue and is waited for, which is an almost
>>>>> guaranteed deadlock scenario.
>>>>
>>>> Ah. Thanks for explanation and direction! I'm going to dive into the
>>>> block layer and/or SCSI code and bug other people with my findings.
>>>
>>> Please feel free to CC me on that in case I can help.
>>
>> And here I am again... The frozen workqueue is in fact pm_wq, and the
>> work item that is waited for is pm_runtime_work. The block layer calls
>> pm_request_resume() on the device to resume the queue.
>>
>> I bet the queue should not be resumed this early. In fact, it seems
>> that this is somewhat known to the ATA developers, because
>> ata_scsi_dev_rescan() contains this beautiful comment and code:
>>
>> /*
>> * If the rescan work was scheduled because of a resume
>> * event, the port is already fully resumed, but the
>> * SCSI device may not yet be fully resumed. In such
>> * case, executing scsi_rescan_device() may cause a
>> * deadlock with the PM code on device_lock(). Prevent
>> * this by giving up and retrying rescan after a short
>> * delay.
>> */
>> delay_rescan = sdev->sdev_gendev.power.is_suspended;
>> if (delay_rescan) {
>> scsi_device_put(sdev);
>> break;
>> }
>>
>> It just doesn't seem to work as expected, at least not in my case.
>>
> Did you test the libata suspend/resume patches from Damien?
I think so, since the hang was reproduced on v6.6-rc4, as mentioned in another
email in this thread.
>
> Cheers,
>
> Hannes
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Thinkpad E595 system deadlock on resume from S3
2023-10-04 6:18 ` Damien Le Moal
@ 2023-10-04 6:42 ` Petr Tesařík
0 siblings, 0 replies; 21+ messages in thread
From: Petr Tesařík @ 2023-10-04 6:42 UTC (permalink / raw)
To: Damien Le Moal, Hannes Reinecke; +Cc: Rafael J. Wysocki, linux-pm, linux-ide
On Wed, 4 Oct 2023 15:18:34 +0900
Damien Le Moal <dlemoal@kernel.org> wrote:
> On 10/4/23 15:13, Hannes Reinecke wrote:
> > On 10/3/23 17:18, Petr Tesařík wrote:
> >> On Tue, 3 Oct 2023 14:57:46 +0200
> >> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> >>
> >>> On Tue, Oct 3, 2023 at 2:51 PM Petr Tesařík <petr@tesarici.cz> wrote:
> >>>>
> >>>> On Tue, 3 Oct 2023 14:48:13 +0200
> >>>> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> >>>>
> >>>>> On Tue, Oct 3, 2023 at 2:40 PM Petr Tesařík <petr@tesarici.cz> wrote:
> >>>>>>
> >>>>>> On Tue, 3 Oct 2023 14:34:56 +0200
> >>>>>> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> >>>>>>
> >>>>>>> On Tue, Oct 3, 2023 at 1:02 PM Petr Tesařík <petr@tesarici.cz> wrote:
> >>>>>>>>
> >>>>>>>> On Tue, 3 Oct 2023 12:15:10 +0200
> >>>>>>>> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> >>>>>>>>
> >>>>>>>>> On Tue, Oct 3, 2023 at 11:31 AM Petr Tesařík <petr@tesarici.cz> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi again (adding more recipients),
> >>>>>>>>>>
> >>>>>>>>>> On Sat, 30 Sep 2023 12:20:54 +0200
> >>>>>>>>>> Petr Tesařík <petr@tesarici.cz> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi all,
> >>>>>>>>>>>
> >>>>>>>>>>> this time no patch (yet). In short, my Thinkpad running v6.6-rc3 fails
> >>>>>>>>>>> to resume from S3. It also fails the same way with Tumbleweed v6.5
> >>>>>>>>>>> kernel. I was able to capture a crash dump of the v6.5 kernel, and
> >>>>>>>>>>> here's my analysis:
> >>>>>>>>>>>
> >>>>>>>>>>> The system never gets to waking up my SATA SSD disk:
> >>>>>>>>>>>
> >>>>>>>>>>> [0:0:0:0] disk ATA KINGSTON SEDC600 H5.1 /dev/sda
> >>>>>>>>>>>
> >>>>>>>>>>> There is a pending resume work for kworker/u32:12 (PID 11032), but this
> >>>>>>>>>>> worker is stuck in 'D' state:
> >>>>>>>>>>>
> >>>>>>>>>>>>>> prog.stack_trace(11032)
> >>>>>>>>>>> #0 context_switch (../kernel/sched/core.c:5381:2)
> >>>>>>>>>>> #1 __schedule (../kernel/sched/core.c:6710:8)
> >>>>>>>>>>> #2 schedule (../kernel/sched/core.c:6786:3)
> >>>>>>>>>>> #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> >>>>>>>>>>> #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> >>>>>>>>>>> #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> >>>>>>>>>>> #6 acpi_device_hotplug (../drivers/acpi/scan.c:382:2)
> >>>>>>>>>>> #7 acpi_hotplug_work_fn (../drivers/acpi/osl.c:1162:2)
> >>>>>>>>>>> #8 process_one_work (../kernel/workqueue.c:2600:2)
> >>>>>>>>>>> #9 worker_thread (../kernel/workqueue.c:2751:4)
> >>>>>>>>>>> #10 kthread (../kernel/kthread.c:389:9)
> >>>>>>>>>>> #11 ret_from_fork (../arch/x86/kernel/process.c:145:3)
> >>>>>>>>>>> #12 ret_from_fork_asm+0x1b/0x20 (../arch/x86/entry/entry_64.S:304)
> >>>>>>>>>>>
> >>>>>>>>>>> acpi_device_hotplug() tries to acquire acpi_scan_lock, which is held by
> >>>>>>>>>>> systemd-sleep (PID 11002). This task is also in 'D' state:
> >>>>>>>>>>>
> >>>>>>>>>>>>>> prog.stack_trace(11002)
> >>>>>>>>>>> #0 context_switch (../kernel/sched/core.c:5381:2)
> >>>>>>>>>>> #1 __schedule (../kernel/sched/core.c:6710:8)
> >>>>>>>>>>> #2 schedule (../kernel/sched/core.c:6786:3)
> >>>>>>>>>>> #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> >>>>>>>>>>> #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> >>>>>>>>>>> #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> >>>>>>>>>>> #6 device_lock (../include/linux/device.h:958:2)
> >>>>>>>>>>> #7 device_complete (../drivers/base/power/main.c:1063:2)
> >>>>>>>>>>> #8 dpm_complete (../drivers/base/power/main.c:1121:3)
> >>>>>>>>>>> #9 suspend_devices_and_enter (../kernel/power/suspend.c:516:2)
> >>>>>>>>>>
> >>>>>>>>>> I believe the issue must be somewhere here. The whole suspend and
> >>>>>>>>>> resume logic in suspend_devices_and_enter() is framed by
> >>>>>>>>>> platform_suspend_begin() and platform_resume_end().
> >>>>>>>>>>
> >>>>>>>>>> My system is an ACPI system, so suspend_ops contains:
> >>>>>>>>>>
> >>>>>>>>>> .begin = acpi_suspend_begin,
> >>>>>>>>>> .end = acpi_pm_end,
> >>>>>>>>>>
> >>>>>>>>>> Now, acpi_suspend_begin() acquires acpi_scan_lock through
> >>>>>>>>>> acpi_pm_start(), and the lock is not released until acpi_pm_end().
> >>>>>>>>>> Since dpm_complete() waits for the completion of a work that tries to
> >>>>>>>>>> acquire acpi_scan_lock, the system will deadlock.
> >>>>>>>>>
> >>>>>>>>> So holding acpi_scan_lock across suspend-resume is basically to
> >>>>>>>>> prevent the hotplug from taking place then IIRC.
> >>>>>>>>>
> >>>>>>>>>> AFAICS either:
> >>>>>>>>>>
> >>>>>>>>>> a. the ACPI lock cannot be held while dpm_complete() runs, or
> >>>>>>>>>> b. ata_scsi_dev_rescan() must not be scheduled before the system is
> >>>>>>>>>> resumed, or
> >>>>>>>>>> c. acpi_device_hotplug() must be implemented without taking dev->mutex.
> >>>>>>>>>>
> >>>>>>>>>> My gut feeling is that b. is the right answer.
> >>>>>>>>>
> >>>>>>>>> It's been a while since I looked at that code last time, but then it
> >>>>>>>>> has not changed for quite some time too.
> >>>>>>>>>
> >>>>>>>>> It looks like the acpi_device_hotplug() path attempts to acquire
> >>>>>>>>> acpi_scan_lock() while holding dev->mutex which is kind of silly. I
> >>>>>>>>> need to check that, though.
> >>>>>>>>
> >>>>>>>> Thanks for your willingness. Well, it's not quite what you describe. If
> >>>>>>>> it was a simple ABBA deadlock, then it would be reported by lockdep.
> >>>>>>>> No, it's more complicated:
> >>>>>>>>
> >>>>>>>> 1. suspend_devices_and_enter() holds acpi_scan_lock,
> >>>>>>>> 2. an ACPI hotplug work runs, but acpi_device_hotplug() goes to sleep
> >>>>>>>> when it gets to acquiring acpi_scan_lock,
> >>>>>>>> 3. ata_scsi_dev_rescan() submits a SCSI command and waits for its
> >>>>>>>> completion while holding dev->mutex,
> >>>>>>>> 4. the SCSI completion work happens to be put on the same workqueue as
> >>>>>>>> the ACPI hotplug work in step 2,
> >>>>>>>> ^^^--- THIS is how the two events are serialized!
> >>>>>>>
> >>>>>>> Which is unexpected.
> >>>>>>>
> >>>>>>> And quite honestly I'm not sure how this can happen, because
> >>>>>>> acpi_hotplug_schedule() uses a dedicated workqueue and it is called
> >>>>>>> from (a) the "eject" sysfs attribute (which cannot happen while system
> >>>>>>> suspend-resume is in progress) and (b) acpi_bus_notify() which has
> >>>>>>> nothing to do with SCSI.
> >>>>>>
> >>>>>> Oh, you're right, and I was too quick. They cannot be on the same
> >>>>>> queue...
> >>>>>>
> >>>>>>> Maybe the workqueue used for the SCSI completion is freezable?
> >>>>>>
> >>>>>> Yes, that's it:
> >>>>>>
> >>>>>> *(struct workqueue_struct *)0xffff97d240b2fe00 = {
> >>>>>> /* ... */
> >>>>>> .flags = (unsigned int)4,
> >>>>>> /* WQ_FREEZABLE = 1 << 2 */
> >>>>>>
> >>>>>> Good. But if this workqueue is frozen, the system still cannot make
> >>>>>> progress.
> >>>>>
> >>>>> The problem seems to be that dev->mutex is held while the work item
> >>>>> goes to a freezable workqueue and is waited for, which is an almost
> >>>>> guaranteed deadlock scenario.
> >>>>
> >>>> Ah. Thanks for the explanation and direction! I'm going to dive into
> >>>> the block layer and/or SCSI code and bug other people with my findings.
> >>>
> >>> Please feel free to CC me on that in case I can help.
> >>
> >> And here I am again... The frozen workqueue is in fact pm_wq, and the
> >> work item that is waited for is pm_runtime_work. The block layer calls
> >> pm_request_resume() on the device to resume the queue.
> >>
> >> I bet the queue should not be resumed this early. In fact, it seems
> >> that this is somewhat known to the ATA developers, because
> >> ata_scsi_dev_rescan() contains this beautiful comment and code:
> >>
> >> /*
> >> * If the rescan work was scheduled because of a resume
> >> * event, the port is already fully resumed, but the
> >> * SCSI device may not yet be fully resumed. In such
> >> * case, executing scsi_rescan_device() may cause a
> >> * deadlock with the PM code on device_lock(). Prevent
> >> * this by giving up and retrying rescan after a short
> >> * delay.
> >> */
> >> delay_rescan = sdev->sdev_gendev.power.is_suspended;
> >> if (delay_rescan) {
> >> scsi_device_put(sdev);
> >> break;
> >> }
> >>
> >> It just doesn't seem to work as expected, at least not in my case.
> >>
> > Did you test the libata suspend/resume patches from Damien?
>
> I think so since the hang was repeated on 6.6-rc4, as mentioned in another email
> in this thread.
Indeed. I originally hit the hang on 6.5. I captured a coredump and did
most of my analysis on that version (hence the quoted code). But I
tested again with 6.6-rc4 just before sending my email, and the hang is
still there. The rescan work runs after dpm_resume(), but before
dpm_complete(), and I believe that's yet another issue.
Petr T
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Thinkpad E595 system deadlock on resume from S3
2023-10-04 1:25 ` Damien Le Moal
@ 2023-10-04 6:43 ` Petr Tesařík
2023-10-04 7:38 ` Damien Le Moal
0 siblings, 1 reply; 21+ messages in thread
From: Petr Tesařík @ 2023-10-04 6:43 UTC (permalink / raw)
To: Damien Le Moal; +Cc: Rafael J. Wysocki, linux-pm, linux-ide, Hannes Reinecke
On Wed, 4 Oct 2023 10:25:47 +0900
Damien Le Moal <dlemoal@kernel.org> wrote:
> On 10/4/23 05:07, Petr Tesařík wrote:
> > I just want to confirm that my understanding of the issue is correct
> > now. After applying the patch below, my laptop has just survived half a
> > dozen suspend/resume cycles with autosuspend enabled for the SATA SSD
> > disk. Without the patch, it usually freezes on first attempt.
> >
> > diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
> > index a371b497035e..87000f89319e 100644
> > --- a/drivers/ata/libata-scsi.c
> > +++ b/drivers/ata/libata-scsi.c
> > @@ -4768,6 +4768,14 @@ void ata_scsi_dev_rescan(struct work_struct *work)
> > ata_for_each_dev(dev, link, ENABLED) {
> > struct scsi_device *sdev = dev->sdev;
> >
> > + /*
> > + * If the queue accepts only PM, bail out.
> > + */
> > + if (blk_queue_pm_only(sdev->request_queue)) {
> > + ret = -EAGAIN;
> > + goto unlock;
> > + }
> > +
> > /*
> > * If the port was suspended before this was scheduled,
> > * bail out.
>
> This seems sensible to me: scsi_rescan_device() only checks that the device is
> running, without checking that the device queue is also resumed. So the right
> place for this check is scsi_rescan_device():
>
> diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
> index 3db4d31a03a1..b05a55f498a2 100644
> --- a/drivers/scsi/scsi_scan.c
> +++ b/drivers/scsi/scsi_scan.c
> @@ -1627,12 +1627,13 @@ int scsi_rescan_device(struct scsi_device *sdev)
> device_lock(dev);
>
> /*
> - * Bail out if the device is not running. Otherwise, the rescan may
> - * block waiting for commands to be executed, with us holding the
> - * device lock. This can result in a potential deadlock in the power
> - * management core code when system resume is on-going.
> + * Bail out if the device or its queue are not running. Otherwise,
> + * the rescan may block waiting for commands to be executed, with us
> + * holding the device lock. This can result in a potential deadlock
> + * in the power management core code when system resume is on-going.
> */
> - if (sdev->sdev_state != SDEV_RUNNING) {
> + if (sdev->sdev_state != SDEV_RUNNING ||
> + blk_queue_pm_only(sdev->request_queue)) {
> ret = -EWOULDBLOCK;
> goto unlock;
> }
>
> Can you try the above to confirm it works for you? It should, as it is
> essentially the same check as yours.
After reverting my patch and applying yours, my system can still resume
from S3. Congrats!
Tested-by: Petr Tesarik <petr@tesarici.cz>
Petr T
* Re: Thinkpad E595 system deadlock on resume from S3
2023-10-04 6:43 ` Petr Tesařík
@ 2023-10-04 7:38 ` Damien Le Moal
0 siblings, 0 replies; 21+ messages in thread
From: Damien Le Moal @ 2023-10-04 7:38 UTC (permalink / raw)
To: Petr Tesařík
Cc: Rafael J. Wysocki, linux-pm, linux-ide, Hannes Reinecke
On 10/4/23 15:43, Petr Tesařík wrote:
> On Wed, 4 Oct 2023 10:25:47 +0900
> Damien Le Moal <dlemoal@kernel.org> wrote:
>
>> On 10/4/23 05:07, Petr Tesařík wrote:
>>> I just want to confirm that my understanding of the issue is correct
>>> now. After applying the patch below, my laptop has just survived half a
>>> dozen suspend/resume cycles with autosuspend enabled for the SATA SSD
>>> disk. Without the patch, it usually freezes on first attempt.
>>>
>>> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
>>> index a371b497035e..87000f89319e 100644
>>> --- a/drivers/ata/libata-scsi.c
>>> +++ b/drivers/ata/libata-scsi.c
>>> @@ -4768,6 +4768,14 @@ void ata_scsi_dev_rescan(struct work_struct *work)
>>> ata_for_each_dev(dev, link, ENABLED) {
>>> struct scsi_device *sdev = dev->sdev;
>>>
>>> + /*
>>> + * If the queue accepts only PM, bail out.
>>> + */
>>> + if (blk_queue_pm_only(sdev->request_queue)) {
>>> + ret = -EAGAIN;
>>> + goto unlock;
>>> + }
>>> +
>>> /*
>>> * If the port was suspended before this was scheduled,
>>> * bail out.
>>
>> This seems sensible to me: scsi_rescan_device() only checks that the device is
>> running, without checking that the device queue is also resumed. So the right
>> place for this check is scsi_rescan_device():
>>
>> diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
>> index 3db4d31a03a1..b05a55f498a2 100644
>> --- a/drivers/scsi/scsi_scan.c
>> +++ b/drivers/scsi/scsi_scan.c
>> @@ -1627,12 +1627,13 @@ int scsi_rescan_device(struct scsi_device *sdev)
>> device_lock(dev);
>>
>> /*
>> - * Bail out if the device is not running. Otherwise, the rescan may
>> - * block waiting for commands to be executed, with us holding the
>> - * device lock. This can result in a potential deadlock in the power
>> - * management core code when system resume is on-going.
>> + * Bail out if the device or its queue are not running. Otherwise,
>> + * the rescan may block waiting for commands to be executed, with us
>> + * holding the device lock. This can result in a potential deadlock
>> + * in the power management core code when system resume is on-going.
>> */
>> - if (sdev->sdev_state != SDEV_RUNNING) {
>> + if (sdev->sdev_state != SDEV_RUNNING ||
>> + blk_queue_pm_only(sdev->request_queue)) {
>> ret = -EWOULDBLOCK;
>> goto unlock;
>> }
>>
>> Can you try the above to confirm it works for you? It should, as it is
>> essentially the same check as yours.
>
> After reverting my patch and applying yours, my system can still resume
> from S3. Congrats!
>
> Tested-by: Petr Tesarik <petr@tesarici.cz>
Thanks. I will send a proper patch asap.
>
> Petr T
--
Damien Le Moal
Western Digital Research
end of thread, other threads:[~2023-10-04 7:38 UTC | newest]
Thread overview: 21+ messages
2023-09-30 10:20 Thinkpad E595 system deadlock on resume from S3 Petr Tesařík
2023-10-03 9:31 ` Petr Tesařík
2023-10-03 10:15 ` Rafael J. Wysocki
2023-10-03 11:02 ` Petr Tesařík
2023-10-03 12:34 ` Rafael J. Wysocki
2023-10-03 12:40 ` Petr Tesařík
2023-10-03 12:48 ` Rafael J. Wysocki
2023-10-03 12:51 ` Petr Tesařík
2023-10-03 12:57 ` Rafael J. Wysocki
2023-10-03 15:18 ` Petr Tesařík
2023-10-03 16:16 ` Petr Tesařík
2023-10-03 16:48 ` Rafael J. Wysocki
2023-10-03 17:19 ` Petr Tesařík
2023-10-03 20:07 ` Petr Tesařík
2023-10-04 1:25 ` Damien Le Moal
2023-10-04 6:43 ` Petr Tesařík
2023-10-04 7:38 ` Damien Le Moal
2023-10-04 6:13 ` Hannes Reinecke
2023-10-04 6:18 ` Damien Le Moal
2023-10-04 6:42 ` Petr Tesařík
2023-10-03 12:35 ` Petr Tesařík