* Re: Thinkpad E595 system deadlock on resume from S3
2023-10-03 15:18 ` Petr Tesařík
@ 2023-10-03 16:16 ` Petr Tesařík
2023-10-03 16:48 ` Rafael J. Wysocki
2023-10-04 6:13 ` Hannes Reinecke
2 siblings, 0 replies; 21+ messages in thread
From: Petr Tesařík @ 2023-10-03 16:16 UTC (permalink / raw)
To: Rafael J. Wysocki, Damien Le Moal; +Cc: linux-pm, linux-ide
On Tue, 3 Oct 2023 17:18:24 +0200
Petr Tesařík <petr@tesarici.cz> wrote:
> On Tue, 3 Oct 2023 14:57:46 +0200
> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
>
> > On Tue, Oct 3, 2023 at 2:51 PM Petr Tesařík <petr@tesarici.cz> wrote:
> > >
> > > On Tue, 3 Oct 2023 14:48:13 +0200
> > > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> > >
> > > > On Tue, Oct 3, 2023 at 2:40 PM Petr Tesařík <petr@tesarici.cz> wrote:
> > > > >
> > > > > On Tue, 3 Oct 2023 14:34:56 +0200
> > > > > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> > > > >
> > > > > > On Tue, Oct 3, 2023 at 1:02 PM Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > > >
> > > > > > > On Tue, 3 Oct 2023 12:15:10 +0200
> > > > > > > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> > > > > > >
> > > > > > > > On Tue, Oct 3, 2023 at 11:31 AM Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > > > > >
> > > > > > > > > Hi again (adding more recipients),
> > > > > > > > >
> > > > > > > > > On Sat, 30 Sep 2023 12:20:54 +0200
> > > > > > > > > Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > > > > >
> > > > > > > > > > Hi all,
> > > > > > > > > >
> > > > > > > > > > this time no patch (yet). In short, my Thinkpad running v6.6-rc3 fails
> > > > > > > > > > to resume from S3. It also fails the same way with Tumbleweed v6.5
> > > > > > > > > > kernel. I was able to capture a crash dump of the v6.5 kernel, and
> > > > > > > > > > here's my analysis:
> > > > > > > > > >
> > > > > > > > > > The system never gets to waking up my SATA SSD disk:
> > > > > > > > > >
> > > > > > > > > > [0:0:0:0] disk ATA KINGSTON SEDC600 H5.1 /dev/sda
> > > > > > > > > >
> > > > > > > > > > There is a pending resume work for kworker/u32:12 (PID 11032), but this
> > > > > > > > > > worker is stuck in 'D' state:
> > > > > > > > > >
> > > > > > > > > > >>> prog.stack_trace(11032)
> > > > > > > > > > #0 context_switch (../kernel/sched/core.c:5381:2)
> > > > > > > > > > #1 __schedule (../kernel/sched/core.c:6710:8)
> > > > > > > > > > #2 schedule (../kernel/sched/core.c:6786:3)
> > > > > > > > > > #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> > > > > > > > > > #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> > > > > > > > > > #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> > > > > > > > > > #6 acpi_device_hotplug (../drivers/acpi/scan.c:382:2)
> > > > > > > > > > #7 acpi_hotplug_work_fn (../drivers/acpi/osl.c:1162:2)
> > > > > > > > > > #8 process_one_work (../kernel/workqueue.c:2600:2)
> > > > > > > > > > #9 worker_thread (../kernel/workqueue.c:2751:4)
> > > > > > > > > > #10 kthread (../kernel/kthread.c:389:9)
> > > > > > > > > > #11 ret_from_fork (../arch/x86/kernel/process.c:145:3)
> > > > > > > > > > #12 ret_from_fork_asm+0x1b/0x20 (../arch/x86/entry/entry_64.S:304)
> > > > > > > > > >
> > > > > > > > > > acpi_device_hotplug() tries to acquire acpi_scan_lock, which is held by
> > > > > > > > > > systemd-sleep (PID 11002). This task is also in 'D' state:
> > > > > > > > > >
> > > > > > > > > > >>> prog.stack_trace(11002)
> > > > > > > > > > #0 context_switch (../kernel/sched/core.c:5381:2)
> > > > > > > > > > #1 __schedule (../kernel/sched/core.c:6710:8)
> > > > > > > > > > #2 schedule (../kernel/sched/core.c:6786:3)
> > > > > > > > > > #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> > > > > > > > > > #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> > > > > > > > > > #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> > > > > > > > > > #6 device_lock (../include/linux/device.h:958:2)
> > > > > > > > > > #7 device_complete (../drivers/base/power/main.c:1063:2)
> > > > > > > > > > #8 dpm_complete (../drivers/base/power/main.c:1121:3)
> > > > > > > > > > #9 suspend_devices_and_enter (../kernel/power/suspend.c:516:2)
> > > > > > > > >
> > > > > > > > > I believe the issue must be somewhere here. The whole suspend and
> > > > > > > > > resume logic in suspend_devices_and_enter() is framed by
> > > > > > > > > platform_suspend_begin() and platform_resume_end().
> > > > > > > > >
> > > > > > > > > My system is an ACPI system, so suspend_ops contains:
> > > > > > > > >
> > > > > > > > > .begin = acpi_suspend_begin,
> > > > > > > > > .end = acpi_pm_end,
> > > > > > > > >
> > > > > > > > > Now, acpi_suspend_begin() acquires acpi_scan_lock through
> > > > > > > > > acpi_pm_start(), and the lock is not released until acpi_pm_end().
> > > > > > > > > Since dpm_complete() waits for the completion of a work that tries to
> > > > > > > > > acquire acpi_scan_lock, the system will deadlock.
> > > > > > > >
> > > > > > > > So acpi_scan_lock is held across suspend-resume basically to
> > > > > > > > prevent hotplug from taking place in the meantime, IIRC.
> > > > > > > >
> > > > > > > > > AFAICS either:
> > > > > > > > >
> > > > > > > > > a. the ACPI lock cannot be held while dpm_complete() runs, or
> > > > > > > > > b. ata_scsi_dev_rescan() must not be scheduled before the system is
> > > > > > > > > resumed, or
> > > > > > > > > c. acpi_device_hotplug() must be implemented without taking dev->mutex.
> > > > > > > > >
> > > > > > > > > My gut feeling is that b. is the right answer.
> > > > > > > >
> > > > > > > > It's been a while since I last looked at that code, but then it
> > > > > > > > hasn't changed for quite some time either.
> > > > > > > >
> > > > > > > > It looks like the acpi_device_hotplug() path attempts to acquire
> > > > > > > > acpi_scan_lock() while holding dev->mutex which is kind of silly. I
> > > > > > > > need to check that, though.
> > > > > > >
> > > > > > > Thanks for being willing to look into this. Well, it's not quite
> > > > > > > what you describe. If it were a simple ABBA deadlock, lockdep
> > > > > > > would have reported it. No, it's more complicated:
> > > > > > >
> > > > > > > 1. suspend_devices_and_enter() holds acpi_scan_lock,
> > > > > > > 2. an ACPI hotplug work runs, but acpi_device_hotplug() goes to sleep
> > > > > > > when it gets to acquiring acpi_scan_lock,
> > > > > > > 3. ata_scsi_dev_rescan() submits a SCSI command and waits for its
> > > > > > > completion while holding dev->mutex,
> > > > > > > 4. the SCSI completion work happens to be put on the same workqueue as
> > > > > > > the ACPI hotplug work in step 2,
> > > > > > > ^^^--- THIS is how the two events are serialized!
> > > > > >
> > > > > > Which is unexpected.
> > > > > >
> > > > > > And quite honestly I'm not sure how this can happen, because
> > > > > > acpi_hotplug_schedule() uses a dedicated workqueue and it is called
> > > > > > from (a) the "eject" sysfs attribute (which cannot happen while system
> > > > > > suspend-resume is in progress) and (b) acpi_bus_notify() which has
> > > > > > nothing to do with SCSI.
> > > > >
> > > > > Oh, you're right, and I was too quick. They cannot be on the same
> > > > > queue...
> > > > >
> > > > > > Maybe the workqueue used for the SCSI completion is freezable?
> > > > >
> > > > > Yes, that's it:
> > > > >
> > > > > *(struct workqueue_struct *)0xffff97d240b2fe00 = {
> > > > > /* ... */
> > > > > .flags = (unsigned int)4,
> > > > > /* WQ_FREEZABLE = 1 << 2 */
> > > > >
> > > > > Good. But if this workqueue is frozen, the system still cannot make
> > > > > progress.
> > > >
> > > > The problem seems to be that dev->mutex is held while the work item
> > > > goes to a freezable workqueue and is waited for, which is an almost
> > > > guaranteed deadlock scenario.
> > >
> > > Ah. Thanks for the explanation and direction! I'm going to dive into the
> > > block layer and/or SCSI code and bug other people with my findings.
> >
> > Please feel free to CC me on that in case I can help.
>
> And here I am again... The frozen workqueue is in fact pm_wq, and the
> work item that is waited for is pm_runtime_work. The block layer calls
> pm_request_resume() on the device to resume the queue.
>
> I bet the queue should not be resumed this early. In fact, it seems
> that this is somewhat known to the ATA developers, because
> ata_scsi_dev_rescan() contains this beautiful comment and code:
>
> /*
> * If the rescan work was scheduled because of a resume
> * event, the port is already fully resumed, but the
> * SCSI device may not yet be fully resumed. In such
> * case, executing scsi_rescan_device() may cause a
> * deadlock with the PM code on device_lock(). Prevent
> * this by giving up and retrying rescan after a short
> * delay.
> */
> delay_rescan = sdev->sdev_gendev.power.is_suspended;
> if (delay_rescan) {
> scsi_device_put(sdev);
> break;
> }
>
> It just doesn't seem to work as expected, at least not in my case.
Erm. This was v6.5.2, where I originally encountered the issue. I can
see that this snippet has just been changed by commit 8b4d9469d0b0.
@Damien: I can still reproduce a deadlock on resume with v6.6-rc4.
Petr T
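For anyone replaying this analysis from a crash dump, the .flags value in
the drgn output quoted above can be decoded against the WQ_* constants. The
flag values below are taken from include/linux/workqueue.h as of v6.5/v6.6;
verify them against the tree you are actually debugging:

```python
# Decode struct workqueue_struct .flags as seen in the drgn dump above.
# Flag values from include/linux/workqueue.h (v6.5/v6.6).
WQ_UNBOUND       = 1 << 1
WQ_FREEZABLE     = 1 << 2
WQ_MEM_RECLAIM   = 1 << 3
WQ_HIGHPRI       = 1 << 4
WQ_CPU_INTENSIVE = 1 << 5

def decode_wq_flags(flags):
    names = {
        WQ_UNBOUND: "WQ_UNBOUND",
        WQ_FREEZABLE: "WQ_FREEZABLE",
        WQ_MEM_RECLAIM: "WQ_MEM_RECLAIM",
        WQ_HIGHPRI: "WQ_HIGHPRI",
        WQ_CPU_INTENSIVE: "WQ_CPU_INTENSIVE",
    }
    return [name for bit, name in names.items() if flags & bit]

print(decode_wq_flags(0x4))  # ['WQ_FREEZABLE']
```

The dumped value 4 decodes to exactly WQ_FREEZABLE, which matches pm_wq
being created with alloc_workqueue("pm", WQ_FREEZABLE, 0).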
^ permalink raw reply [flat|nested] 21+ messages in thread

* Re: Thinkpad E595 system deadlock on resume from S3
2023-10-03 15:18 ` Petr Tesařík
2023-10-03 16:16 ` Petr Tesařík
@ 2023-10-03 16:48 ` Rafael J. Wysocki
2023-10-03 17:19 ` Petr Tesařík
2023-10-03 20:07 ` Petr Tesařík
2023-10-04 6:13 ` Hannes Reinecke
2 siblings, 2 replies; 21+ messages in thread
From: Rafael J. Wysocki @ 2023-10-03 16:48 UTC (permalink / raw)
To: Petr Tesařík
Cc: Rafael J. Wysocki, Damien Le Moal, linux-pm, linux-ide
On Tue, Oct 3, 2023 at 5:18 PM Petr Tesařík <petr@tesarici.cz> wrote:
>
> On Tue, 3 Oct 2023 14:57:46 +0200
> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
>
> > On Tue, Oct 3, 2023 at 2:51 PM Petr Tesařík <petr@tesarici.cz> wrote:
> > >
> > > On Tue, 3 Oct 2023 14:48:13 +0200
> > > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> > >
> > > > On Tue, Oct 3, 2023 at 2:40 PM Petr Tesařík <petr@tesarici.cz> wrote:
> > > > >
> > > > > On Tue, 3 Oct 2023 14:34:56 +0200
> > > > > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> > > > >
> > > > > > On Tue, Oct 3, 2023 at 1:02 PM Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > > >
> > > > > > > On Tue, 3 Oct 2023 12:15:10 +0200
> > > > > > > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> > > > > > >
> > > > > > > > On Tue, Oct 3, 2023 at 11:31 AM Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > > > > >
> > > > > > > > > Hi again (adding more recipients),
> > > > > > > > >
> > > > > > > > > On Sat, 30 Sep 2023 12:20:54 +0200
> > > > > > > > > Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > > > > >
> > > > > > > > > > Hi all,
> > > > > > > > > >
> > > > > > > > > > this time no patch (yet). In short, my Thinkpad running v6.6-rc3 fails
> > > > > > > > > > to resume from S3. It also fails the same way with Tumbleweed v6.5
> > > > > > > > > > kernel. I was able to capture a crash dump of the v6.5 kernel, and
> > > > > > > > > > here's my analysis:
> > > > > > > > > >
> > > > > > > > > > The system never gets to waking up my SATA SSD disk:
> > > > > > > > > >
> > > > > > > > > > [0:0:0:0] disk ATA KINGSTON SEDC600 H5.1 /dev/sda
> > > > > > > > > >
> > > > > > > > > > There is a pending resume work for kworker/u32:12 (PID 11032), but this
> > > > > > > > > > worker is stuck in 'D' state:
> > > > > > > > > >
> > > > > > > > > > >>> prog.stack_trace(11032)
> > > > > > > > > > #0 context_switch (../kernel/sched/core.c:5381:2)
> > > > > > > > > > #1 __schedule (../kernel/sched/core.c:6710:8)
> > > > > > > > > > #2 schedule (../kernel/sched/core.c:6786:3)
> > > > > > > > > > #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> > > > > > > > > > #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> > > > > > > > > > #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> > > > > > > > > > #6 acpi_device_hotplug (../drivers/acpi/scan.c:382:2)
> > > > > > > > > > #7 acpi_hotplug_work_fn (../drivers/acpi/osl.c:1162:2)
> > > > > > > > > > #8 process_one_work (../kernel/workqueue.c:2600:2)
> > > > > > > > > > #9 worker_thread (../kernel/workqueue.c:2751:4)
> > > > > > > > > > #10 kthread (../kernel/kthread.c:389:9)
> > > > > > > > > > #11 ret_from_fork (../arch/x86/kernel/process.c:145:3)
> > > > > > > > > > #12 ret_from_fork_asm+0x1b/0x20 (../arch/x86/entry/entry_64.S:304)
> > > > > > > > > >
> > > > > > > > > > acpi_device_hotplug() tries to acquire acpi_scan_lock, which is held by
> > > > > > > > > > systemd-sleep (PID 11002). This task is also in 'D' state:
> > > > > > > > > >
> > > > > > > > > > >>> prog.stack_trace(11002)
> > > > > > > > > > #0 context_switch (../kernel/sched/core.c:5381:2)
> > > > > > > > > > #1 __schedule (../kernel/sched/core.c:6710:8)
> > > > > > > > > > #2 schedule (../kernel/sched/core.c:6786:3)
> > > > > > > > > > #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> > > > > > > > > > #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> > > > > > > > > > #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> > > > > > > > > > #6 device_lock (../include/linux/device.h:958:2)
> > > > > > > > > > #7 device_complete (../drivers/base/power/main.c:1063:2)
> > > > > > > > > > #8 dpm_complete (../drivers/base/power/main.c:1121:3)
> > > > > > > > > > #9 suspend_devices_and_enter (../kernel/power/suspend.c:516:2)
> > > > > > > > >
> > > > > > > > > I believe the issue must be somewhere here. The whole suspend and
> > > > > > > > > resume logic in suspend_devices_and_enter() is framed by
> > > > > > > > > platform_suspend_begin() and platform_resume_end().
> > > > > > > > >
> > > > > > > > > My system is an ACPI system, so suspend_ops contains:
> > > > > > > > >
> > > > > > > > > .begin = acpi_suspend_begin,
> > > > > > > > > .end = acpi_pm_end,
> > > > > > > > >
> > > > > > > > > Now, acpi_suspend_begin() acquires acpi_scan_lock through
> > > > > > > > > acpi_pm_start(), and the lock is not released until acpi_pm_end().
> > > > > > > > > Since dpm_complete() waits for the completion of a work that tries to
> > > > > > > > > acquire acpi_scan_lock, the system will deadlock.
> > > > > > > >
> > > > > > > > So acpi_scan_lock is held across suspend-resume basically to
> > > > > > > > prevent hotplug from taking place in the meantime, IIRC.
> > > > > > > >
> > > > > > > > > AFAICS either:
> > > > > > > > >
> > > > > > > > > a. the ACPI lock cannot be held while dpm_complete() runs, or
> > > > > > > > > b. ata_scsi_dev_rescan() must not be scheduled before the system is
> > > > > > > > > resumed, or
> > > > > > > > > c. acpi_device_hotplug() must be implemented without taking dev->mutex.
> > > > > > > > >
> > > > > > > > > My gut feeling is that b. is the right answer.
> > > > > > > >
> > > > > > > > It's been a while since I last looked at that code, but then it
> > > > > > > > hasn't changed for quite some time either.
> > > > > > > >
> > > > > > > > It looks like the acpi_device_hotplug() path attempts to acquire
> > > > > > > > acpi_scan_lock() while holding dev->mutex which is kind of silly. I
> > > > > > > > need to check that, though.
> > > > > > >
> > > > > > > Thanks for being willing to look into this. Well, it's not quite
> > > > > > > what you describe. If it were a simple ABBA deadlock, lockdep
> > > > > > > would have reported it. No, it's more complicated:
> > > > > > >
> > > > > > > 1. suspend_devices_and_enter() holds acpi_scan_lock,
> > > > > > > 2. an ACPI hotplug work runs, but acpi_device_hotplug() goes to sleep
> > > > > > > when it gets to acquiring acpi_scan_lock,
> > > > > > > 3. ata_scsi_dev_rescan() submits a SCSI command and waits for its
> > > > > > > completion while holding dev->mutex,
> > > > > > > 4. the SCSI completion work happens to be put on the same workqueue as
> > > > > > > the ACPI hotplug work in step 2,
> > > > > > > ^^^--- THIS is how the two events are serialized!
> > > > > >
> > > > > > Which is unexpected.
> > > > > >
> > > > > > And quite honestly I'm not sure how this can happen, because
> > > > > > acpi_hotplug_schedule() uses a dedicated workqueue and it is called
> > > > > > from (a) the "eject" sysfs attribute (which cannot happen while system
> > > > > > suspend-resume is in progress) and (b) acpi_bus_notify() which has
> > > > > > nothing to do with SCSI.
> > > > >
> > > > > Oh, you're right, and I was too quick. They cannot be on the same
> > > > > queue...
> > > > >
> > > > > > Maybe the workqueue used for the SCSI completion is freezable?
> > > > >
> > > > > Yes, that's it:
> > > > >
> > > > > *(struct workqueue_struct *)0xffff97d240b2fe00 = {
> > > > > /* ... */
> > > > > .flags = (unsigned int)4,
> > > > > /* WQ_FREEZABLE = 1 << 2 */
> > > > >
> > > > > Good. But if this workqueue is frozen, the system still cannot make
> > > > > progress.
> > > >
> > > > The problem seems to be that dev->mutex is held while the work item
> > > > goes to a freezable workqueue and is waited for, which is an almost
> > > > guaranteed deadlock scenario.
> > >
> > > Ah. Thanks for the explanation and direction! I'm going to dive into the
> > > block layer and/or SCSI code and bug other people with my findings.
> >
> > Please feel free to CC me on that in case I can help.
>
> And here I am again... The frozen workqueue is in fact pm_wq, and the
> work item that is waited for is pm_runtime_work. The block layer calls
> pm_request_resume() on the device to resume the queue.
If it called pm_runtime_resume() instead, this might work.
> I bet the queue should not be resumed this early. In fact, it seems
> that this is somewhat known to the ATA developers, because
> ata_scsi_dev_rescan() contains this beautiful comment and code:
>
> /*
> * If the rescan work was scheduled because of a resume
> * event, the port is already fully resumed, but the
> * SCSI device may not yet be fully resumed. In such
> * case, executing scsi_rescan_device() may cause a
> * deadlock with the PM code on device_lock(). Prevent
> * this by giving up and retrying rescan after a short
> * delay.
> */
> delay_rescan = sdev->sdev_gendev.power.is_suspended;
> if (delay_rescan) {
> scsi_device_put(sdev);
> break;
> }
>
> It just doesn't seem to work as expected, at least not in my case.
Well, calling pm_request_resume() and waiting for the resume to
actually happen is problematic regardless. It is just better to call
pm_runtime_resume() to synchronously resume the device instead.
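The failure mode can be mimicked in userspace with a toy model: an
asynchronous "resume request" lands on a frozen workqueue, so anyone who
blocks waiting for it will wait forever, while a synchronous call runs in
the caller's context and does not depend on the queue at all. This is
purely an analogy, not kernel code; FreezableWorkqueue and resume_device
are illustrative names, not real APIs:

```python
import queue
import threading
import time

class FreezableWorkqueue:
    """Toy stand-in for a WQ_FREEZABLE workqueue like pm_wq."""
    def __init__(self):
        self._items = queue.Queue()
        self._frozen = threading.Event()
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            fn = self._items.get()
            while self._frozen.is_set():  # frozen: sit on the item
                time.sleep(0.01)
            fn()

    def queue_work(self, fn):
        self._items.put(fn)

    def freeze(self):
        self._frozen.set()

def resume_device(done_event):
    done_event.set()

pm_wq = FreezableWorkqueue()
pm_wq.freeze()                                 # suspend in progress

# Async path (pm_request_resume analogue): the work sits on the frozen
# queue, so waiting for it here can never succeed.
resumed = threading.Event()
pm_wq.queue_work(lambda: resume_device(resumed))
deadlocked = not resumed.wait(timeout=0.2)

# Sync path (pm_runtime_resume analogue): runs in the caller's context,
# independent of the frozen workqueue.
resumed_sync = threading.Event()
resume_device(resumed_sync)

print(deadlocked, resumed_sync.is_set())  # True True
```

In the real kernel the waiter also holds dev->mutex while stuck, which is
what turns this stall into the system-wide deadlock analyzed above.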
^ permalink raw reply [flat|nested] 21+ messages in thread

* Re: Thinkpad E595 system deadlock on resume from S3
2023-10-03 16:48 ` Rafael J. Wysocki
@ 2023-10-03 17:19 ` Petr Tesařík
2023-10-03 20:07 ` Petr Tesařík
1 sibling, 0 replies; 21+ messages in thread
From: Petr Tesařík @ 2023-10-03 17:19 UTC (permalink / raw)
To: Rafael J. Wysocki; +Cc: Damien Le Moal, linux-pm, linux-ide
On Tue, 3 Oct 2023 18:48:27 +0200
"Rafael J. Wysocki" <rafael@kernel.org> wrote:
> On Tue, Oct 3, 2023 at 5:18 PM Petr Tesařík <petr@tesarici.cz> wrote:
> >
> > On Tue, 3 Oct 2023 14:57:46 +0200
> > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> >
> > > On Tue, Oct 3, 2023 at 2:51 PM Petr Tesařík <petr@tesarici.cz> wrote:
> > > >
> > > > On Tue, 3 Oct 2023 14:48:13 +0200
> > > > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> > > >
> > > > > On Tue, Oct 3, 2023 at 2:40 PM Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > >
> > > > > > On Tue, 3 Oct 2023 14:34:56 +0200
> > > > > > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> > > > > >
> > > > > > > On Tue, Oct 3, 2023 at 1:02 PM Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > > > >
> > > > > > > > On Tue, 3 Oct 2023 12:15:10 +0200
> > > > > > > > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> > > > > > > >
> > > > > > > > > On Tue, Oct 3, 2023 at 11:31 AM Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > > > > > >
> > > > > > > > > > Hi again (adding more recipients),
> > > > > > > > > >
> > > > > > > > > > On Sat, 30 Sep 2023 12:20:54 +0200
> > > > > > > > > > Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi all,
> > > > > > > > > > >
> > > > > > > > > > > this time no patch (yet). In short, my Thinkpad running v6.6-rc3 fails
> > > > > > > > > > > to resume from S3. It also fails the same way with Tumbleweed v6.5
> > > > > > > > > > > kernel. I was able to capture a crash dump of the v6.5 kernel, and
> > > > > > > > > > > here's my analysis:
> > > > > > > > > > >
> > > > > > > > > > > The system never gets to waking up my SATA SSD disk:
> > > > > > > > > > >
> > > > > > > > > > > [0:0:0:0] disk ATA KINGSTON SEDC600 H5.1 /dev/sda
> > > > > > > > > > >
> > > > > > > > > > > There is a pending resume work for kworker/u32:12 (PID 11032), but this
> > > > > > > > > > > worker is stuck in 'D' state:
> > > > > > > > > > >
> > > > > > > > > > > >>> prog.stack_trace(11032)
> > > > > > > > > > > #0 context_switch (../kernel/sched/core.c:5381:2)
> > > > > > > > > > > #1 __schedule (../kernel/sched/core.c:6710:8)
> > > > > > > > > > > #2 schedule (../kernel/sched/core.c:6786:3)
> > > > > > > > > > > #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> > > > > > > > > > > #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> > > > > > > > > > > #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> > > > > > > > > > > #6 acpi_device_hotplug (../drivers/acpi/scan.c:382:2)
> > > > > > > > > > > #7 acpi_hotplug_work_fn (../drivers/acpi/osl.c:1162:2)
> > > > > > > > > > > #8 process_one_work (../kernel/workqueue.c:2600:2)
> > > > > > > > > > > #9 worker_thread (../kernel/workqueue.c:2751:4)
> > > > > > > > > > > #10 kthread (../kernel/kthread.c:389:9)
> > > > > > > > > > > #11 ret_from_fork (../arch/x86/kernel/process.c:145:3)
> > > > > > > > > > > #12 ret_from_fork_asm+0x1b/0x20 (../arch/x86/entry/entry_64.S:304)
> > > > > > > > > > >
> > > > > > > > > > > acpi_device_hotplug() tries to acquire acpi_scan_lock, which is held by
> > > > > > > > > > > systemd-sleep (PID 11002). This task is also in 'D' state:
> > > > > > > > > > >
> > > > > > > > > > > >>> prog.stack_trace(11002)
> > > > > > > > > > > #0 context_switch (../kernel/sched/core.c:5381:2)
> > > > > > > > > > > #1 __schedule (../kernel/sched/core.c:6710:8)
> > > > > > > > > > > #2 schedule (../kernel/sched/core.c:6786:3)
> > > > > > > > > > > #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> > > > > > > > > > > #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> > > > > > > > > > > #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> > > > > > > > > > > #6 device_lock (../include/linux/device.h:958:2)
> > > > > > > > > > > #7 device_complete (../drivers/base/power/main.c:1063:2)
> > > > > > > > > > > #8 dpm_complete (../drivers/base/power/main.c:1121:3)
> > > > > > > > > > > #9 suspend_devices_and_enter (../kernel/power/suspend.c:516:2)
> > > > > > > > > >
> > > > > > > > > > I believe the issue must be somewhere here. The whole suspend and
> > > > > > > > > > resume logic in suspend_devices_and_enter() is framed by
> > > > > > > > > > platform_suspend_begin() and platform_resume_end().
> > > > > > > > > >
> > > > > > > > > > My system is an ACPI system, so suspend_ops contains:
> > > > > > > > > >
> > > > > > > > > > .begin = acpi_suspend_begin,
> > > > > > > > > > .end = acpi_pm_end,
> > > > > > > > > >
> > > > > > > > > > Now, acpi_suspend_begin() acquires acpi_scan_lock through
> > > > > > > > > > acpi_pm_start(), and the lock is not released until acpi_pm_end().
> > > > > > > > > > Since dpm_complete() waits for the completion of a work that tries to
> > > > > > > > > > acquire acpi_scan_lock, the system will deadlock.
> > > > > > > > >
> > > > > > > > > So acpi_scan_lock is held across suspend-resume basically to
> > > > > > > > > prevent hotplug from taking place in the meantime, IIRC.
> > > > > > > > >
> > > > > > > > > > AFAICS either:
> > > > > > > > > >
> > > > > > > > > > a. the ACPI lock cannot be held while dpm_complete() runs, or
> > > > > > > > > > b. ata_scsi_dev_rescan() must not be scheduled before the system is
> > > > > > > > > > resumed, or
> > > > > > > > > > c. acpi_device_hotplug() must be implemented without taking dev->mutex.
> > > > > > > > > >
> > > > > > > > > > My gut feeling is that b. is the right answer.
> > > > > > > > >
> > > > > > > > > It's been a while since I last looked at that code, but then it
> > > > > > > > > hasn't changed for quite some time either.
> > > > > > > > >
> > > > > > > > > It looks like the acpi_device_hotplug() path attempts to acquire
> > > > > > > > > acpi_scan_lock() while holding dev->mutex which is kind of silly. I
> > > > > > > > > need to check that, though.
> > > > > > > >
> > > > > > > > Thanks for being willing to look into this. Well, it's not quite
> > > > > > > > what you describe. If it were a simple ABBA deadlock, lockdep
> > > > > > > > would have reported it. No, it's more complicated:
> > > > > > > >
> > > > > > > > 1. suspend_devices_and_enter() holds acpi_scan_lock,
> > > > > > > > 2. an ACPI hotplug work runs, but acpi_device_hotplug() goes to sleep
> > > > > > > > when it gets to acquiring acpi_scan_lock,
> > > > > > > > 3. ata_scsi_dev_rescan() submits a SCSI command and waits for its
> > > > > > > > completion while holding dev->mutex,
> > > > > > > > 4. the SCSI completion work happens to be put on the same workqueue as
> > > > > > > > the ACPI hotplug work in step 2,
> > > > > > > > ^^^--- THIS is how the two events are serialized!
> > > > > > >
> > > > > > > Which is unexpected.
> > > > > > >
> > > > > > > And quite honestly I'm not sure how this can happen, because
> > > > > > > acpi_hotplug_schedule() uses a dedicated workqueue and it is called
> > > > > > > from (a) the "eject" sysfs attribute (which cannot happen while system
> > > > > > > suspend-resume is in progress) and (b) acpi_bus_notify() which has
> > > > > > > nothing to do with SCSI.
> > > > > >
> > > > > > Oh, you're right, and I was too quick. They cannot be on the same
> > > > > > queue...
> > > > > >
> > > > > > > Maybe the workqueue used for the SCSI completion is freezable?
> > > > > >
> > > > > > Yes, that's it:
> > > > > >
> > > > > > *(struct workqueue_struct *)0xffff97d240b2fe00 = {
> > > > > > /* ... */
> > > > > > .flags = (unsigned int)4,
> > > > > > /* WQ_FREEZABLE = 1 << 2 */
> > > > > >
> > > > > > Good. But if this workqueue is frozen, the system still cannot make
> > > > > > progress.
> > > > >
> > > > > The problem seems to be that dev->mutex is held while the work item
> > > > > goes to a freezable workqueue and is waited for, which is an almost
> > > > > guaranteed deadlock scenario.
> > > >
> > > > Ah. Thanks for the explanation and direction! I'm going to dive into the
> > > > block layer and/or SCSI code and bug other people with my findings.
> > >
> > > Please feel free to CC me on that in case I can help.
> >
> > And here I am again... The frozen workqueue is in fact pm_wq, and the
> > work item that is waited for is pm_runtime_work. The block layer calls
> > pm_request_resume() on the device to resume the queue.
>
> If it called pm_runtime_resume() instead, this might work.
>
> > I bet the queue should not be resumed this early. In fact, it seems
> > that this is somewhat known to the ATA developers, because
> > ata_scsi_dev_rescan() contains this beautiful comment and code:
> >
> > /*
> > * If the rescan work was scheduled because of a resume
> > * event, the port is already fully resumed, but the
> > * SCSI device may not yet be fully resumed. In such
> > * case, executing scsi_rescan_device() may cause a
> > * deadlock with the PM code on device_lock(). Prevent
> > * this by giving up and retrying rescan after a short
> > * delay.
> > */
> > delay_rescan = sdev->sdev_gendev.power.is_suspended;
> > if (delay_rescan) {
> > scsi_device_put(sdev);
> > break;
> > }
> >
> > It just doesn't seem to work as expected, at least not in my case.
>
> Well, calling pm_request_resume() and waiting for the resume to
> actually happen is problematic regardless. It is just better to call
> pm_runtime_resume() to synchronously resume the device instead.
That code is in blk_pm_resume_queue(), which is pretty deep in the
block layer. I can imagine that pm_runtime_resume() cannot always be
used there. We may have to add an alternative API just for resume.
OTOH this may in fact be the correct fix.
But I want to try out a few other things before proposing a solution.
Petr T
^ permalink raw reply [flat|nested] 21+ messages in thread

* Re: Thinkpad E595 system deadlock on resume from S3
2023-10-03 16:48 ` Rafael J. Wysocki
2023-10-03 17:19 ` Petr Tesařík
@ 2023-10-03 20:07 ` Petr Tesařík
2023-10-04 1:25 ` Damien Le Moal
1 sibling, 1 reply; 21+ messages in thread
From: Petr Tesařík @ 2023-10-03 20:07 UTC (permalink / raw)
To: Rafael J. Wysocki; +Cc: Damien Le Moal, linux-pm, linux-ide
On Tue, 3 Oct 2023 18:48:27 +0200
"Rafael J. Wysocki" <rafael@kernel.org> wrote:
> On Tue, Oct 3, 2023 at 5:18 PM Petr Tesařík <petr@tesarici.cz> wrote:
> >
> > On Tue, 3 Oct 2023 14:57:46 +0200
> > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> >
> > > On Tue, Oct 3, 2023 at 2:51 PM Petr Tesařík <petr@tesarici.cz> wrote:
> > > >
> > > > On Tue, 3 Oct 2023 14:48:13 +0200
> > > > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> > > >
> > > > > On Tue, Oct 3, 2023 at 2:40 PM Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > >
> > > > > > On Tue, 3 Oct 2023 14:34:56 +0200
> > > > > > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> > > > > >
> > > > > > > On Tue, Oct 3, 2023 at 1:02 PM Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > > > >
> > > > > > > > On Tue, 3 Oct 2023 12:15:10 +0200
> > > > > > > > "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> > > > > > > >
> > > > > > > > > On Tue, Oct 3, 2023 at 11:31 AM Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > > > > > >
> > > > > > > > > > Hi again (adding more recipients),
> > > > > > > > > >
> > > > > > > > > > On Sat, 30 Sep 2023 12:20:54 +0200
> > > > > > > > > > Petr Tesařík <petr@tesarici.cz> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi all,
> > > > > > > > > > >
> > > > > > > > > > > this time no patch (yet). In short, my Thinkpad running v6.6-rc3 fails
> > > > > > > > > > > to resume from S3. It also fails the same way with Tumbleweed v6.5
> > > > > > > > > > > kernel. I was able to capture a crash dump of the v6.5 kernel, and
> > > > > > > > > > > here's my analysis:
> > > > > > > > > > >
> > > > > > > > > > > The system never gets to waking up my SATA SSD disk:
> > > > > > > > > > >
> > > > > > > > > > > [0:0:0:0] disk ATA KINGSTON SEDC600 H5.1 /dev/sda
> > > > > > > > > > >
> > > > > > > > > > > There is a pending resume work for kworker/u32:12 (PID 11032), but this
> > > > > > > > > > > worker is stuck in 'D' state:
> > > > > > > > > > >
> > > > > > > > > > > >>> prog.stack_trace(11032)
> > > > > > > > > > > #0 context_switch (../kernel/sched/core.c:5381:2)
> > > > > > > > > > > #1 __schedule (../kernel/sched/core.c:6710:8)
> > > > > > > > > > > #2 schedule (../kernel/sched/core.c:6786:3)
> > > > > > > > > > > #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> > > > > > > > > > > #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> > > > > > > > > > > #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> > > > > > > > > > > #6 acpi_device_hotplug (../drivers/acpi/scan.c:382:2)
> > > > > > > > > > > #7 acpi_hotplug_work_fn (../drivers/acpi/osl.c:1162:2)
> > > > > > > > > > > #8 process_one_work (../kernel/workqueue.c:2600:2)
> > > > > > > > > > > #9 worker_thread (../kernel/workqueue.c:2751:4)
> > > > > > > > > > > #10 kthread (../kernel/kthread.c:389:9)
> > > > > > > > > > > #11 ret_from_fork (../arch/x86/kernel/process.c:145:3)
> > > > > > > > > > > #12 ret_from_fork_asm+0x1b/0x20 (../arch/x86/entry/entry_64.S:304)
> > > > > > > > > > >
> > > > > > > > > > > acpi_device_hotplug() tries to acquire acpi_scan_lock, which is held by
> > > > > > > > > > > systemd-sleep (PID 11002). This task is also in 'D' state:
> > > > > > > > > > >
> > > > > > > > > > > >>> prog.stack_trace(11002)
> > > > > > > > > > > #0 context_switch (../kernel/sched/core.c:5381:2)
> > > > > > > > > > > #1 __schedule (../kernel/sched/core.c:6710:8)
> > > > > > > > > > > #2 schedule (../kernel/sched/core.c:6786:3)
> > > > > > > > > > > #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> > > > > > > > > > > #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> > > > > > > > > > > #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> > > > > > > > > > > #6 device_lock (../include/linux/device.h:958:2)
> > > > > > > > > > > #7 device_complete (../drivers/base/power/main.c:1063:2)
> > > > > > > > > > > #8 dpm_complete (../drivers/base/power/main.c:1121:3)
> > > > > > > > > > > #9 suspend_devices_and_enter (../kernel/power/suspend.c:516:2)
> > > > > > > > > >
> > > > > > > > > > I believe the issue must be somewhere here. The whole suspend and
> > > > > > > > > > resume logic in suspend_devices_and_enter() is framed by
> > > > > > > > > > platform_suspend_begin() and platform_resume_end().
> > > > > > > > > >
> > > > > > > > > > My system is an ACPI system, so suspend_ops contains:
> > > > > > > > > >
> > > > > > > > > > .begin = acpi_suspend_begin,
> > > > > > > > > > .end = acpi_pm_end,
> > > > > > > > > >
> > > > > > > > > > Now, acpi_suspend_begin() acquires acpi_scan_lock through
> > > > > > > > > > acpi_pm_start(), and the lock is not released until acpi_pm_end().
> > > > > > > > > > Since dpm_complete() waits for the completion of a work that tries to
> > > > > > > > > > acquire acpi_scan_lock, the system will deadlock.
> > > > > > > > >
> > > > > > > > > So holding acpi_scan_lock across suspend-resume is basically to
> > > > > > > > > prevent the hotplug from taking place then IIRC.
> > > > > > > > >
> > > > > > > > > > AFAICS either:
> > > > > > > > > >
> > > > > > > > > > a. the ACPI lock cannot be held while dpm_complete() runs, or
> > > > > > > > > > b. ata_scsi_dev_rescan() must not be scheduled before the system is
> > > > > > > > > > resumed, or
> > > > > > > > > > c. acpi_device_hotplug() must be implemented without taking dev->mutex.
> > > > > > > > > >
> > > > > > > > > > My gut feeling is that b. is the right answer.
> > > > > > > > >
> > > > > > > > > It's been a while since I looked at that code last time, but then it
> > > > > > > > > has not changed for quite some time too.
> > > > > > > > >
> > > > > > > > > It looks like the acpi_device_hotplug() path attempts to acquire
> > > > > > > > > acpi_scan_lock() while holding dev->mutex which is kind of silly. I
> > > > > > > > > need to check that, though.
> > > > > > > >
> > > > > > > > Thanks for your willingness. Well, it's not quite what you describe. If
> > > > > > > > it was a simple ABBA deadlock, then it would be reported by lockdep.
> > > > > > > > No, it's more complicated:
> > > > > > > >
> > > > > > > > 1. suspend_devices_and_enter() holds acpi_scan_lock,
> > > > > > > > 2. an ACPI hotplug work runs, but acpi_device_hotplug() goes to sleep
> > > > > > > > when it gets to acquiring acpi_scan_lock,
> > > > > > > > 3. ata_scsi_dev_rescan() submits a SCSI command and waits for its
> > > > > > > > completion while holding dev->mutex,
> > > > > > > > 4. the SCSI completion work happens to be put on the same workqueue as
> > > > > > > > the ACPI hotplug work in step 2,
> > > > > > > > ^^^--- THIS is how the two events are serialized!
> > > > > > >
> > > > > > > Which is unexpected.
> > > > > > >
> > > > > > > And quite honestly I'm not sure how this can happen, because
> > > > > > > acpi_hotplug_schedule() uses a dedicated workqueue and it is called
> > > > > > > from (a) the "eject" sysfs attribute (which cannot happen while system
> > > > > > > suspend-resume is in progress) and (b) acpi_bus_notify() which has
> > > > > > > nothing to do with SCSI.
> > > > > >
> > > > > > Oh, you're right, and I was too quick. They cannot be on the same
> > > > > > queue...
> > > > > >
> > > > > > > Maybe the workqueue used for the SCSI completion is freezable?
> > > > > >
> > > > > > Yes, that's it:
> > > > > >
> > > > > > *(struct workqueue_struct *)0xffff97d240b2fe00 = {
> > > > > > /* ... */
> > > > > > .flags = (unsigned int)4,
> > > > > > /* WQ_FREEZABLE = 1 << 2 */
> > > > > >
> > > > > > Good. But if this workqueue is frozen, the system still cannot make
> > > > > > progress.
> > > > >
> > > > > The problem seems to be that dev->mutex is held while the work item
> > > > > goes to a freezable workqueue and is waited for, which is an almost
> > > > > guaranteed deadlock scenario.
> > > >
> > > > Ah. Thanks for explanation and direction! I'm going to dive into the
> > > > block layer and/or SCSI code and bug other people with my findings.
> > >
> > > Please feel free to CC me on that in case I can help.
> >
> > And here I am again... The frozen workqueue is in fact pm_wq, and the
> > work item that is waited for is pm_runtime_work. The block layer calls
> > pm_request_resume() on the device to resume the queue.
>
> If it called pm_runtime_resume() instead, this might work.
>
> > I bet the queue should not be resumed this early. In fact, it seems
> > that this is somewhat known to the ATA developers, because
> > ata_scsi_dev_rescan() contains this beautiful comment and code:
> >
> > /*
> > * If the rescan work was scheduled because of a resume
> > * event, the port is already fully resumed, but the
> > * SCSI device may not yet be fully resumed. In such
> > * case, executing scsi_rescan_device() may cause a
> > * deadlock with the PM code on device_lock(). Prevent
> > * this by giving up and retrying rescan after a short
> > * delay.
> > */
> > delay_rescan = sdev->sdev_gendev.power.is_suspended;
> > if (delay_rescan) {
> > scsi_device_put(sdev);
> > break;
> > }
> >
> > It just doesn't seem to work as expected, at least not in my case.
>
> Well, calling pm_request_resume() and waiting for the resume to
> actually happen is problematic regardless. It is just better to call
> pm_runtime_resume() to synchronously resume the device instead.
I just want to confirm that my understanding of the issue is correct
now. After applying the patch below, my laptop has just survived half a
dozen suspend/resume cycles with autosuspend enabled for the SATA SSD
disk. Without the patch, it usually freezes on first attempt.
diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index a371b497035e..87000f89319e 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -4768,6 +4768,14 @@ void ata_scsi_dev_rescan(struct work_struct *work)
ata_for_each_dev(dev, link, ENABLED) {
struct scsi_device *sdev = dev->sdev;
+ /*
+ * If the queue accepts only PM, bail out.
+ */
+ if (blk_queue_pm_only(sdev->request_queue)) {
+ ret = -EAGAIN;
+ goto unlock;
+ }
+
/*
* If the port was suspended before this was scheduled,
* bail out.
Petr T
^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: Thinkpad E595 system deadlock on resume from S3
2023-10-03 20:07 ` Petr Tesařík
@ 2023-10-04 1:25 ` Damien Le Moal
2023-10-04 6:43 ` Petr Tesařík
0 siblings, 1 reply; 21+ messages in thread
From: Damien Le Moal @ 2023-10-04 1:25 UTC (permalink / raw)
To: Petr Tesařík, Rafael J. Wysocki; +Cc: linux-pm, linux-ide
On 10/4/23 05:07, Petr Tesařík wrote:
> I just want to confirm that my understanding of the issue is correct
> now. After applying the patch below, my laptop has just survived half a
> dozen suspend/resume cycles with autosuspend enabled for the SATA SSD
> disk. Without the patch, it usually freezes on first attempt.
>
> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
> index a371b497035e..87000f89319e 100644
> --- a/drivers/ata/libata-scsi.c
> +++ b/drivers/ata/libata-scsi.c
> @@ -4768,6 +4768,14 @@ void ata_scsi_dev_rescan(struct work_struct *work)
> ata_for_each_dev(dev, link, ENABLED) {
> struct scsi_device *sdev = dev->sdev;
>
> + /*
> + * If the queue accepts only PM, bail out.
> + */
> + if (blk_queue_pm_only(sdev->request_queue)) {
> + ret = -EAGAIN;
> + goto unlock;
> + }
> +
> /*
> * If the port was suspended before this was scheduled,
> * bail out.
This seems sensible to me: scsi_rescan_device() only checks that the device is
running, without checking that the device queue is also resumed. So the right
place for this check is scsi_rescan_device():
diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index 3db4d31a03a1..b05a55f498a2 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -1627,12 +1627,13 @@ int scsi_rescan_device(struct scsi_device *sdev)
device_lock(dev);
/*
- * Bail out if the device is not running. Otherwise, the rescan may
- * block waiting for commands to be executed, with us holding the
- * device lock. This can result in a potential deadlock in the power
- * management core code when system resume is on-going.
+ * Bail out if the device or its queue are not running. Otherwise,
+ * the rescan may block waiting for commands to be executed, with us
+ * holding the device lock. This can result in a potential deadlock
+ * in the power management core code when system resume is on-going.
*/
- if (sdev->sdev_state != SDEV_RUNNING) {
+ if (sdev->sdev_state != SDEV_RUNNING ||
+ blk_queue_pm_only(sdev->request_queue)) {
ret = -EWOULDBLOCK;
goto unlock;
}
Can you try the above to confirm it works for you? It should, as that is
essentially the same check as yours.
--
Damien Le Moal
Western Digital Research
^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: Thinkpad E595 system deadlock on resume from S3
2023-10-04 1:25 ` Damien Le Moal
@ 2023-10-04 6:43 ` Petr Tesařík
2023-10-04 7:38 ` Damien Le Moal
0 siblings, 1 reply; 21+ messages in thread
From: Petr Tesařík @ 2023-10-04 6:43 UTC (permalink / raw)
To: Damien Le Moal; +Cc: Rafael J. Wysocki, linux-pm, linux-ide, Hannes Reinecke
On Wed, 4 Oct 2023 10:25:47 +0900
Damien Le Moal <dlemoal@kernel.org> wrote:
> On 10/4/23 05:07, Petr Tesařík wrote:
> > I just want to confirm that my understanding of the issue is correct
> > now. After applying the patch below, my laptop has just survived half a
> > dozen suspend/resume cycles with autosuspend enabled for the SATA SSD
> > disk. Without the patch, it usually freezes on first attempt.
> >
> > diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
> > index a371b497035e..87000f89319e 100644
> > --- a/drivers/ata/libata-scsi.c
> > +++ b/drivers/ata/libata-scsi.c
> > @@ -4768,6 +4768,14 @@ void ata_scsi_dev_rescan(struct work_struct *work)
> > ata_for_each_dev(dev, link, ENABLED) {
> > struct scsi_device *sdev = dev->sdev;
> >
> > + /*
> > + * If the queue accepts only PM, bail out.
> > + */
> > + if (blk_queue_pm_only(sdev->request_queue)) {
> > + ret = -EAGAIN;
> > + goto unlock;
> > + }
> > +
> > /*
> > * If the port was suspended before this was scheduled,
> > * bail out.
>
> This seems sensible to me: scsi_rescan_device() only checks that the device is
> running, without checking that the device queue is also resumed. So the right
> place for this check is scsi_rescan_device():
>
> diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
> index 3db4d31a03a1..b05a55f498a2 100644
> --- a/drivers/scsi/scsi_scan.c
> +++ b/drivers/scsi/scsi_scan.c
> @@ -1627,12 +1627,13 @@ int scsi_rescan_device(struct scsi_device *sdev)
> device_lock(dev);
>
> /*
> - * Bail out if the device is not running. Otherwise, the rescan may
> - * block waiting for commands to be executed, with us holding the
> - * device lock. This can result in a potential deadlock in the power
> - * management core code when system resume is on-going.
> + * Bail out if the device or its queue are not running. Otherwise,
> + * the rescan may block waiting for commands to be executed, with us
> + * holding the device lock. This can result in a potential deadlock
> + * in the power management core code when system resume is on-going.
> */
> - if (sdev->sdev_state != SDEV_RUNNING) {
> + if (sdev->sdev_state != SDEV_RUNNING ||
> + blk_queue_pm_only(sdev->request_queue)) {
> ret = -EWOULDBLOCK;
> goto unlock;
> }
>
> Can you try the above to confirm it works for you ? It should, as that is
> essentially the same as you did.
After reverting my patch and applying yours, my system can still resume
from S3. Congrats!
Tested-by: Petr Tesarik <petr@tesarici.cz>
Petr T
^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Thinkpad E595 system deadlock on resume from S3
2023-10-04 6:43 ` Petr Tesařík
@ 2023-10-04 7:38 ` Damien Le Moal
0 siblings, 0 replies; 21+ messages in thread
From: Damien Le Moal @ 2023-10-04 7:38 UTC (permalink / raw)
To: Petr Tesařík
Cc: Rafael J. Wysocki, linux-pm, linux-ide, Hannes Reinecke
On 10/4/23 15:43, Petr Tesařík wrote:
> On Wed, 4 Oct 2023 10:25:47 +0900
> Damien Le Moal <dlemoal@kernel.org> wrote:
>
>> On 10/4/23 05:07, Petr Tesařík wrote:
>>> I just want to confirm that my understanding of the issue is correct
>>> now. After applying the patch below, my laptop has just survived half a
>>> dozen suspend/resume cycles with autosuspend enabled for the SATA SSD
>>> disk. Without the patch, it usually freezes on first attempt.
>>>
>>> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
>>> index a371b497035e..87000f89319e 100644
>>> --- a/drivers/ata/libata-scsi.c
>>> +++ b/drivers/ata/libata-scsi.c
>>> @@ -4768,6 +4768,14 @@ void ata_scsi_dev_rescan(struct work_struct *work)
>>> ata_for_each_dev(dev, link, ENABLED) {
>>> struct scsi_device *sdev = dev->sdev;
>>>
>>> + /*
>>> + * If the queue accepts only PM, bail out.
>>> + */
>>> + if (blk_queue_pm_only(sdev->request_queue)) {
>>> + ret = -EAGAIN;
>>> + goto unlock;
>>> + }
>>> +
>>> /*
>>> * If the port was suspended before this was scheduled,
>>> * bail out.
>>
>> This seems sensible to me: scsi_rescan_device() only checks that the device is
>> running, without checking that the device queue is also resumed. So the right
>> place for this check is scsi_rescan_device():
>>
>> diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
>> index 3db4d31a03a1..b05a55f498a2 100644
>> --- a/drivers/scsi/scsi_scan.c
>> +++ b/drivers/scsi/scsi_scan.c
>> @@ -1627,12 +1627,13 @@ int scsi_rescan_device(struct scsi_device *sdev)
>> device_lock(dev);
>>
>> /*
>> - * Bail out if the device is not running. Otherwise, the rescan may
>> - * block waiting for commands to be executed, with us holding the
>> - * device lock. This can result in a potential deadlock in the power
>> - * management core code when system resume is on-going.
>> + * Bail out if the device or its queue are not running. Otherwise,
>> + * the rescan may block waiting for commands to be executed, with us
>> + * holding the device lock. This can result in a potential deadlock
>> + * in the power management core code when system resume is on-going.
>> */
>> - if (sdev->sdev_state != SDEV_RUNNING) {
>> + if (sdev->sdev_state != SDEV_RUNNING ||
>> + blk_queue_pm_only(sdev->request_queue)) {
>> ret = -EWOULDBLOCK;
>> goto unlock;
>> }
>>
>> Can you try the above to confirm it works for you ? It should, as that is
>> essentially the same as you did.
>
> After revertung my patch and applying yours, my system can still resume
> from S3. Congrats!
>
> Tested-by: Petr Tesarik <petr@tesarici.cz>
Thanks. I will send a proper patch asap.
>
> Petr T
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Thinkpad E595 system deadlock on resume from S3
2023-10-03 15:18 ` Petr Tesařík
2023-10-03 16:16 ` Petr Tesařík
2023-10-03 16:48 ` Rafael J. Wysocki
@ 2023-10-04 6:13 ` Hannes Reinecke
2023-10-04 6:18 ` Damien Le Moal
2 siblings, 1 reply; 21+ messages in thread
From: Hannes Reinecke @ 2023-10-04 6:13 UTC (permalink / raw)
To: Petr Tesařík, Rafael J. Wysocki, Damien Le Moal
Cc: linux-pm, linux-ide
On 10/3/23 17:18, Petr Tesařík wrote:
> On Tue, 3 Oct 2023 14:57:46 +0200
> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
>
>> On Tue, Oct 3, 2023 at 2:51 PM Petr Tesařík <petr@tesarici.cz> wrote:
>>>
>>> On Tue, 3 Oct 2023 14:48:13 +0200
>>> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
>>>
>>>> On Tue, Oct 3, 2023 at 2:40 PM Petr Tesařík <petr@tesarici.cz> wrote:
>>>>>
>>>>> On Tue, 3 Oct 2023 14:34:56 +0200
>>>>> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
>>>>>
>>>>>> On Tue, Oct 3, 2023 at 1:02 PM Petr Tesařík <petr@tesarici.cz> wrote:
>>>>>>>
>>>>>>> On Tue, 3 Oct 2023 12:15:10 +0200
>>>>>>> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
>>>>>>>
>>>>>>>> On Tue, Oct 3, 2023 at 11:31 AM Petr Tesařík <petr@tesarici.cz> wrote:
>>>>>>>>>
>>>>>>>>> Hi again (adding more recipients),
>>>>>>>>>
>>>>>>>>> On Sat, 30 Sep 2023 12:20:54 +0200
>>>>>>>>> Petr Tesařík <petr@tesarici.cz> wrote:
>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> this time no patch (yet). In short, my Thinkpad running v6.6-rc3 fails
>>>>>>>>>> to resume from S3. It also fails the same way with Tumbleweed v6.5
>>>>>>>>>> kernel. I was able to capture a crash dump of the v6.5 kernel, and
>>>>>>>>>> here's my analysis:
>>>>>>>>>>
>>>>>>>>>> The system never gets to waking up my SATA SSD disk:
>>>>>>>>>>
>>>>>>>>>> [0:0:0:0] disk ATA KINGSTON SEDC600 H5.1 /dev/sda
>>>>>>>>>>
>>>>>>>>>> There is a pending resume work for kworker/u32:12 (PID 11032), but this
>>>>>>>>>> worker is stuck in 'D' state:
>>>>>>>>>>
>>>>>>>>>>>>> prog.stack_trace(11032)
>>>>>>>>>> #0 context_switch (../kernel/sched/core.c:5381:2)
>>>>>>>>>> #1 __schedule (../kernel/sched/core.c:6710:8)
>>>>>>>>>> #2 schedule (../kernel/sched/core.c:6786:3)
>>>>>>>>>> #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
>>>>>>>>>> #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
>>>>>>>>>> #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
>>>>>>>>>> #6 acpi_device_hotplug (../drivers/acpi/scan.c:382:2)
>>>>>>>>>> #7 acpi_hotplug_work_fn (../drivers/acpi/osl.c:1162:2)
>>>>>>>>>> #8 process_one_work (../kernel/workqueue.c:2600:2)
>>>>>>>>>> #9 worker_thread (../kernel/workqueue.c:2751:4)
>>>>>>>>>> #10 kthread (../kernel/kthread.c:389:9)
>>>>>>>>>> #11 ret_from_fork (../arch/x86/kernel/process.c:145:3)
>>>>>>>>>> #12 ret_from_fork_asm+0x1b/0x20 (../arch/x86/entry/entry_64.S:304)
>>>>>>>>>>
>>>>>>>>>> acpi_device_hotplug() tries to acquire acpi_scan_lock, which is held by
>>>>>>>>>> systemd-sleep (PID 11002). This task is also in 'D' state:
>>>>>>>>>>
>>>>>>>>>>>>> prog.stack_trace(11002)
>>>>>>>>>> #0 context_switch (../kernel/sched/core.c:5381:2)
>>>>>>>>>> #1 __schedule (../kernel/sched/core.c:6710:8)
>>>>>>>>>> #2 schedule (../kernel/sched/core.c:6786:3)
>>>>>>>>>> #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
>>>>>>>>>> #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
>>>>>>>>>> #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
>>>>>>>>>> #6 device_lock (../include/linux/device.h:958:2)
>>>>>>>>>> #7 device_complete (../drivers/base/power/main.c:1063:2)
>>>>>>>>>> #8 dpm_complete (../drivers/base/power/main.c:1121:3)
>>>>>>>>>> #9 suspend_devices_and_enter (../kernel/power/suspend.c:516:2)
>>>>>>>>>
>>>>>>>>> I believe the issue must be somewhere here. The whole suspend and
>>>>>>>>> resume logic in suspend_devices_and_enter() is framed by
>>>>>>>>> platform_suspend_begin() and platform_resume_end().
>>>>>>>>>
>>>>>>>>> My system is an ACPI system, so suspend_ops contains:
>>>>>>>>>
>>>>>>>>> .begin = acpi_suspend_begin,
>>>>>>>>> .end = acpi_pm_end,
>>>>>>>>>
>>>>>>>>> Now, acpi_suspend_begin() acquires acpi_scan_lock through
>>>>>>>>> acpi_pm_start(), and the lock is not released until acpi_pm_end().
>>>>>>>>> Since dpm_complete() waits for the completion of a work that tries to
>>>>>>>>> acquire acpi_scan_lock, the system will deadlock.
>>>>>>>>
>>>>>>>> So holding acpi_scan_lock across suspend-resume is basically to
>>>>>>>> prevent the hotplug from taking place then IIRC.
>>>>>>>>
>>>>>>>>> AFAICS either:
>>>>>>>>>
>>>>>>>>> a. the ACPI lock cannot be held while dpm_complete() runs, or
>>>>>>>>> b. ata_scsi_dev_rescan() must not be scheduled before the system is
>>>>>>>>> resumed, or
>>>>>>>>> c. acpi_device_hotplug() must be implemented without taking dev->mutex.
>>>>>>>>>
>>>>>>>>> My gut feeling is that b. is the right answer.
>>>>>>>>
>>>>>>>> It's been a while since I looked at that code last time, but then it
>>>>>>>> has not changed for quite some time too.
>>>>>>>>
>>>>>>>> It looks like the acpi_device_hotplug() path attempts to acquire
>>>>>>>> acpi_scan_lock() while holding dev->mutex which is kind of silly. I
>>>>>>>> need to check that, though.
>>>>>>>
>>>>>>> Thanks for your willingness. Well, it's not quite what you describe. If
>>>>>>> it was a simple ABBA deadlock, then it would be reported by lockdep.
>>>>>>> No, it's more complicated:
>>>>>>>
>>>>>>> 1. suspend_devices_and_enter() holds acpi_scan_lock,
>>>>>>> 2. an ACPI hotplug work runs, but acpi_device_hotplug() goes to sleep
>>>>>>> when it gets to acquiring acpi_scan_lock,
>>>>>>> 3. ata_scsi_dev_rescan() submits a SCSI command and waits for its
>>>>>>> completion while holding dev->mutex,
>>>>>>> 4. the SCSI completion work happens to be put on the same workqueue as
>>>>>>> the ACPI hotplug work in step 2,
>>>>>>> ^^^--- THIS is how the two events are serialized!
>>>>>>
>>>>>> Which is unexpected.
>>>>>>
>>>>>> And quite honestly I'm not sure how this can happen, because
>>>>>> acpi_hotplug_schedule() uses a dedicated workqueue and it is called
>>>>>> from (a) the "eject" sysfs attribute (which cannot happen while system
>>>>>> suspend-resume is in progress) and (b) acpi_bus_notify() which has
>>>>>> nothing to do with SCSI.
>>>>>
>>>>> Oh, you're right, and I was too quick. They cannot be on the same
>>>>> queue...
>>>>>
>>>>>> Maybe the workqueue used for the SCSI completion is freezable?
>>>>>
>>>>> Yes, that's it:
>>>>>
>>>>> *(struct workqueue_struct *)0xffff97d240b2fe00 = {
>>>>> /* ... */
>>>>> .flags = (unsigned int)4,
>>>>> /* WQ_FREEZABLE = 1 << 2 */
>>>>>
>>>>> Good. But if this workqueue is frozen, the system still cannot make
>>>>> progress.
>>>>
>>>> The problem seems to be that dev->mutex is held while the work item
>>>> goes to a freezable workqueue and is waited for, which is an almost
>>>> guaranteed deadlock scenario.
>>>
>>> Ah. Thanks for explanation and direction! I'm going to dive into the
>>> block layer and/or SCSI code and bug other people with my findings.
>>
>> Please feel free to CC me on that in case I can help.
>
> And here I am again... The frozen workqueue is in fact pm_wq, and the
> work item that is waited for is pm_runtime_work. The block layer calls
> pm_request_resume() on the device to resume the queue.
>
> I bet the queue should not be resumed this early. In fact, it seems
> that this is somewhat known to the ATA developers, because
> ata_scsi_dev_rescan() contains this beautiful comment and code:
>
> /*
> * If the rescan work was scheduled because of a resume
> * event, the port is already fully resumed, but the
> * SCSI device may not yet be fully resumed. In such
> * case, executing scsi_rescan_device() may cause a
> * deadlock with the PM code on device_lock(). Prevent
> * this by giving up and retrying rescan after a short
> * delay.
> */
> delay_rescan = sdev->sdev_gendev.power.is_suspended;
> if (delay_rescan) {
> scsi_device_put(sdev);
> break;
> }
>
> It just doesn't seem to work as expected, at least not in my case.
>
Did you test the libata suspend/resume patches from Damien?
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman
^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Thinkpad E595 system deadlock on resume from S3
2023-10-04 6:13 ` Hannes Reinecke
@ 2023-10-04 6:18 ` Damien Le Moal
2023-10-04 6:42 ` Petr Tesařík
0 siblings, 1 reply; 21+ messages in thread
From: Damien Le Moal @ 2023-10-04 6:18 UTC (permalink / raw)
To: Hannes Reinecke, Petr Tesařík, Rafael J. Wysocki
Cc: linux-pm, linux-ide
On 10/4/23 15:13, Hannes Reinecke wrote:
> On 10/3/23 17:18, Petr Tesařík wrote:
>> On Tue, 3 Oct 2023 14:57:46 +0200
>> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
>>
>>> On Tue, Oct 3, 2023 at 2:51 PM Petr Tesařík <petr@tesarici.cz> wrote:
>>>>
>>>> On Tue, 3 Oct 2023 14:48:13 +0200
>>>> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
>>>>
>>>>> On Tue, Oct 3, 2023 at 2:40 PM Petr Tesařík <petr@tesarici.cz> wrote:
>>>>>>
>>>>>> On Tue, 3 Oct 2023 14:34:56 +0200
>>>>>> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
>>>>>>
>>>>>>> On Tue, Oct 3, 2023 at 1:02 PM Petr Tesařík <petr@tesarici.cz> wrote:
>>>>>>>>
>>>>>>>> On Tue, 3 Oct 2023 12:15:10 +0200
>>>>>>>> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
>>>>>>>>
>>>>>>>>> On Tue, Oct 3, 2023 at 11:31 AM Petr Tesařík <petr@tesarici.cz> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi again (adding more recipients),
>>>>>>>>>>
>>>>>>>>>> On Sat, 30 Sep 2023 12:20:54 +0200
>>>>>>>>>> Petr Tesařík <petr@tesarici.cz> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> this time no patch (yet). In short, my Thinkpad running v6.6-rc3 fails
>>>>>>>>>>> to resume from S3. It also fails the same way with Tumbleweed v6.5
>>>>>>>>>>> kernel. I was able to capture a crash dump of the v6.5 kernel, and
>>>>>>>>>>> here's my analysis:
>>>>>>>>>>>
>>>>>>>>>>> The system never gets to waking up my SATA SSD disk:
>>>>>>>>>>>
>>>>>>>>>>> [0:0:0:0] disk ATA KINGSTON SEDC600 H5.1 /dev/sda
>>>>>>>>>>>
>>>>>>>>>>> There is a pending resume work for kworker/u32:12 (PID 11032), but this
>>>>>>>>>>> worker is stuck in 'D' state:
>>>>>>>>>>>
>>>>>>>>>>>>>> prog.stack_trace(11032)
>>>>>>>>>>> #0 context_switch (../kernel/sched/core.c:5381:2)
>>>>>>>>>>> #1 __schedule (../kernel/sched/core.c:6710:8)
>>>>>>>>>>> #2 schedule (../kernel/sched/core.c:6786:3)
>>>>>>>>>>> #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
>>>>>>>>>>> #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
>>>>>>>>>>> #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
>>>>>>>>>>> #6 acpi_device_hotplug (../drivers/acpi/scan.c:382:2)
>>>>>>>>>>> #7 acpi_hotplug_work_fn (../drivers/acpi/osl.c:1162:2)
>>>>>>>>>>> #8 process_one_work (../kernel/workqueue.c:2600:2)
>>>>>>>>>>> #9 worker_thread (../kernel/workqueue.c:2751:4)
>>>>>>>>>>> #10 kthread (../kernel/kthread.c:389:9)
>>>>>>>>>>> #11 ret_from_fork (../arch/x86/kernel/process.c:145:3)
>>>>>>>>>>> #12 ret_from_fork_asm+0x1b/0x20 (../arch/x86/entry/entry_64.S:304)
>>>>>>>>>>>
>>>>>>>>>>> acpi_device_hotplug() tries to acquire acpi_scan_lock, which is held by
>>>>>>>>>>> systemd-sleep (PID 11002). This task is also in 'D' state:
>>>>>>>>>>>
>>>>>>>>>>>>>> prog.stack_trace(11002)
>>>>>>>>>>> #0 context_switch (../kernel/sched/core.c:5381:2)
>>>>>>>>>>> #1 __schedule (../kernel/sched/core.c:6710:8)
>>>>>>>>>>> #2 schedule (../kernel/sched/core.c:6786:3)
>>>>>>>>>>> #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
>>>>>>>>>>> #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
>>>>>>>>>>> #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
>>>>>>>>>>> #6 device_lock (../include/linux/device.h:958:2)
>>>>>>>>>>> #7 device_complete (../drivers/base/power/main.c:1063:2)
>>>>>>>>>>> #8 dpm_complete (../drivers/base/power/main.c:1121:3)
>>>>>>>>>>> #9 suspend_devices_and_enter (../kernel/power/suspend.c:516:2)
>>>>>>>>>>
>>>>>>>>>> I believe the issue must be somewhere here. The whole suspend and
>>>>>>>>>> resume logic in suspend_devices_and_enter() is framed by
>>>>>>>>>> platform_suspend_begin() and platform_resume_end().
>>>>>>>>>>
>>>>>>>>>> My system is an ACPI system, so suspend_ops contains:
>>>>>>>>>>
>>>>>>>>>> .begin = acpi_suspend_begin,
>>>>>>>>>> .end = acpi_pm_end,
>>>>>>>>>>
>>>>>>>>>> Now, acpi_suspend_begin() acquires acpi_scan_lock through
>>>>>>>>>> acpi_pm_start(), and the lock is not released until acpi_pm_end().
>>>>>>>>>> Since dpm_complete() waits for the completion of a work that tries to
>>>>>>>>>> acquire acpi_scan_lock, the system will deadlock.
>>>>>>>>>
>>>>>>>>> So holding acpi_scan_lock across suspend-resume is basically to
>>>>>>>>> prevent the hotplug from taking place then IIRC.
>>>>>>>>>
>>>>>>>>>> AFAICS either:
>>>>>>>>>>
>>>>>>>>>> a. the ACPI lock cannot be held while dpm_complete() runs, or
>>>>>>>>>> b. ata_scsi_dev_rescan() must not be scheduled before the system is
>>>>>>>>>> resumed, or
>>>>>>>>>> c. acpi_device_hotplug() must be implemented without taking dev->mutex.
>>>>>>>>>>
>>>>>>>>>> My gut feeling is that b. is the right answer.
>>>>>>>>>
>>>>>>>>> It's been a while since I looked at that code last time, but then it
>>>>>>>>> has not changed for quite some time too.
>>>>>>>>>
>>>>>>>>> It looks like the acpi_device_hotplug() path attempts to acquire
>>>>>>>>> acpi_scan_lock() while holding dev->mutex which is kind of silly. I
>>>>>>>>> need to check that, though.
>>>>>>>>
>>>>>>>> Thanks for your willingness. Well, it's not quite what you describe. If
>>>>>>>> it was a simple ABBA deadlock, then it would be reported by lockdep.
>>>>>>>> No, it's more complicated:
>>>>>>>>
>>>>>>>> 1. suspend_devices_and_enter() holds acpi_scan_lock,
>>>>>>>> 2. an ACPI hotplug work runs, but acpi_device_hotplug() goes to sleep
>>>>>>>> when it gets to acquiring acpi_scan_lock,
>>>>>>>> 3. ata_scsi_dev_rescan() submits a SCSI command and waits for its
>>>>>>>> completion while holding dev->mutex,
>>>>>>>> 4. the SCSI completion work happens to be put on the same workqueue as
>>>>>>>> the ACPI hotplug work in step 2,
>>>>>>>> ^^^--- THIS is how the two events are serialized!
>>>>>>>
>>>>>>> Which is unexpected.
>>>>>>>
>>>>>>> And quite honestly I'm not sure how this can happen, because
>>>>>>> acpi_hotplug_schedule() uses a dedicated workqueue and it is called
>>>>>>> from (a) the "eject" sysfs attribute (which cannot happen while system
>>>>>>> suspend-resume is in progress) and (b) acpi_bus_notify() which has
>>>>>>> nothing to do with SCSI.
>>>>>>
>>>>>> Oh, you're right, and I was too quick. They cannot be on the same
>>>>>> queue...
>>>>>>
>>>>>>> Maybe the workqueue used for the SCSI completion is freezable?
>>>>>>
>>>>>> Yes, that's it:
>>>>>>
>>>>>> *(struct workqueue_struct *)0xffff97d240b2fe00 = {
>>>>>> /* ... */
>>>>>> .flags = (unsigned int)4,
>>>>>> /* WQ_FREEZABLE = 1 << 2 */
>>>>>>
>>>>>> Good. But if this workqueue is frozen, the system still cannot make
>>>>>> progress.
>>>>>
>>>>> The problem seems to be that dev->mutex is held while the work item
>>>>> goes to a freezable workqueue and is waited for, which is an almost
>>>>> guaranteed deadlock scenario.
>>>>
>>>> Ah. Thanks for explanation and direction! I'm going to dive into the
>>>> block layer and/or SCSI code and bug other people with my findings.
>>>
>>> Please feel free to CC me on that in case I can help.
>>
>> And here I am again... The frozen workqueue is in fact pm_wq, and the
>> work item that is waited for is pm_runtime_work. The block layer calls
>> pm_request_resume() on the device to resume the queue.
>>
>> I bet the queue should not be resumed this early. In fact, it seems
>> that this is somewhat known to the ATA developers, because
>> ata_scsi_dev_rescan() contains this beautiful comment and code:
>>
>> /*
>> * If the rescan work was scheduled because of a resume
>> * event, the port is already fully resumed, but the
>> * SCSI device may not yet be fully resumed. In such
>> * case, executing scsi_rescan_device() may cause a
>> * deadlock with the PM code on device_lock(). Prevent
>> * this by giving up and retrying rescan after a short
>> * delay.
>> */
>> delay_rescan = sdev->sdev_gendev.power.is_suspended;
>> if (delay_rescan) {
>> scsi_device_put(sdev);
>> break;
>> }
>>
>> It just doesn't seem to work as expected, at least not in my case.
>>
> Did you test the libata suspend/resume patches from Damien?
I think so, since the hang was reproduced on v6.6-rc4, as mentioned in another
email in this thread.
>
> Cheers,
>
> Hannes
--
Damien Le Moal
Western Digital Research
^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Thinkpad E595 system deadlock on resume from S3
2023-10-04 6:18 ` Damien Le Moal
@ 2023-10-04 6:42 ` Petr Tesařík
0 siblings, 0 replies; 21+ messages in thread
From: Petr Tesařík @ 2023-10-04 6:42 UTC (permalink / raw)
To: Damien Le Moal, Hannes Reinecke; +Cc: Rafael J. Wysocki, linux-pm, linux-ide
On Wed, 4 Oct 2023 15:18:34 +0900
Damien Le Moal <dlemoal@kernel.org> wrote:
> On 10/4/23 15:13, Hannes Reinecke wrote:
> > On 10/3/23 17:18, Petr Tesařík wrote:
> >> On Tue, 3 Oct 2023 14:57:46 +0200
> >> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> >>
> >>> On Tue, Oct 3, 2023 at 2:51 PM Petr Tesařík <petr@tesarici.cz> wrote:
> >>>>
> >>>> On Tue, 3 Oct 2023 14:48:13 +0200
> >>>> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> >>>>
> >>>>> On Tue, Oct 3, 2023 at 2:40 PM Petr Tesařík <petr@tesarici.cz> wrote:
> >>>>>>
> >>>>>> On Tue, 3 Oct 2023 14:34:56 +0200
> >>>>>> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> >>>>>>
> >>>>>>> On Tue, Oct 3, 2023 at 1:02 PM Petr Tesařík <petr@tesarici.cz> wrote:
> >>>>>>>>
> >>>>>>>> On Tue, 3 Oct 2023 12:15:10 +0200
> >>>>>>>> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> >>>>>>>>
> >>>>>>>>> On Tue, Oct 3, 2023 at 11:31 AM Petr Tesařík <petr@tesarici.cz> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi again (adding more recipients),
> >>>>>>>>>>
> >>>>>>>>>> On Sat, 30 Sep 2023 12:20:54 +0200
> >>>>>>>>>> Petr Tesařík <petr@tesarici.cz> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi all,
> >>>>>>>>>>>
> >>>>>>>>>>> this time no patch (yet). In short, my Thinkpad running v6.6-rc3 fails
> >>>>>>>>>>> to resume from S3. It also fails the same way with Tumbleweed v6.5
> >>>>>>>>>>> kernel. I was able to capture a crash dump of the v6.5 kernel, and
> >>>>>>>>>>> here's my analysis:
> >>>>>>>>>>>
> >>>>>>>>>>> The system never gets to waking up my SATA SSD disk:
> >>>>>>>>>>>
> >>>>>>>>>>> [0:0:0:0] disk ATA KINGSTON SEDC600 H5.1 /dev/sda
> >>>>>>>>>>>
> >>>>>>>>>>> There is a pending resume work for kworker/u32:12 (PID 11032), but this
> >>>>>>>>>>> worker is stuck in 'D' state:
> >>>>>>>>>>>
> >>>>>>>>>>>>>> prog.stack_trace(11032)
> >>>>>>>>>>> #0 context_switch (../kernel/sched/core.c:5381:2)
> >>>>>>>>>>> #1 __schedule (../kernel/sched/core.c:6710:8)
> >>>>>>>>>>> #2 schedule (../kernel/sched/core.c:6786:3)
> >>>>>>>>>>> #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> >>>>>>>>>>> #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> >>>>>>>>>>> #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> >>>>>>>>>>> #6 acpi_device_hotplug (../drivers/acpi/scan.c:382:2)
> >>>>>>>>>>> #7 acpi_hotplug_work_fn (../drivers/acpi/osl.c:1162:2)
> >>>>>>>>>>> #8 process_one_work (../kernel/workqueue.c:2600:2)
> >>>>>>>>>>> #9 worker_thread (../kernel/workqueue.c:2751:4)
> >>>>>>>>>>> #10 kthread (../kernel/kthread.c:389:9)
> >>>>>>>>>>> #11 ret_from_fork (../arch/x86/kernel/process.c:145:3)
> >>>>>>>>>>> #12 ret_from_fork_asm+0x1b/0x20 (../arch/x86/entry/entry_64.S:304)
> >>>>>>>>>>>
> >>>>>>>>>>> acpi_device_hotplug() tries to acquire acpi_scan_lock, which is held by
> >>>>>>>>>>> systemd-sleep (PID 11002). This task is also in 'D' state:
> >>>>>>>>>>>
> >>>>>>>>>>>>>> prog.stack_trace(11002)
> >>>>>>>>>>> #0 context_switch (../kernel/sched/core.c:5381:2)
> >>>>>>>>>>> #1 __schedule (../kernel/sched/core.c:6710:8)
> >>>>>>>>>>> #2 schedule (../kernel/sched/core.c:6786:3)
> >>>>>>>>>>> #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> >>>>>>>>>>> #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> >>>>>>>>>>> #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> >>>>>>>>>>> #6 device_lock (../include/linux/device.h:958:2)
> >>>>>>>>>>> #7 device_complete (../drivers/base/power/main.c:1063:2)
> >>>>>>>>>>> #8 dpm_complete (../drivers/base/power/main.c:1121:3)
> >>>>>>>>>>> #9 suspend_devices_and_enter (../kernel/power/suspend.c:516:2)
> >>>>>>>>>>
> >>>>>>>>>> I believe the issue must be somewhere here. The whole suspend and
> >>>>>>>>>> resume logic in suspend_devices_and_enter() is framed by
> >>>>>>>>>> platform_suspend_begin() and platform_resume_end().
> >>>>>>>>>>
> >>>>>>>>>> My system is an ACPI system, so suspend_ops contains:
> >>>>>>>>>>
> >>>>>>>>>> .begin = acpi_suspend_begin,
> >>>>>>>>>> .end = acpi_pm_end,
> >>>>>>>>>>
> >>>>>>>>>> Now, acpi_suspend_begin() acquires acpi_scan_lock through
> >>>>>>>>>> acpi_pm_start(), and the lock is not released until acpi_pm_end().
> >>>>>>>>>> Since dpm_complete() waits for the completion of a work that tries to
> >>>>>>>>>> acquire acpi_scan_lock, the system will deadlock.
> >>>>>>>>>
> >>>>>>>>> So holding acpi_scan_lock across suspend-resume is basically to
> >>>>>>>>> prevent the hotplug from taking place then IIRC.
> >>>>>>>>>
> >>>>>>>>>> AFAICS either:
> >>>>>>>>>>
> >>>>>>>>>> a. the ACPI lock cannot be held while dpm_complete() runs, or
> >>>>>>>>>> b. ata_scsi_dev_rescan() must not be scheduled before the system is
> >>>>>>>>>> resumed, or
> >>>>>>>>>> c. acpi_device_hotplug() must be implemented without taking dev->mutex.
> >>>>>>>>>>
> >>>>>>>>>> My gut feeling is that b. is the right answer.
> >>>>>>>>>
> >>>>>>>>> It's been a while since I looked at that code last time, but then it
> >>>>>>>>> has not changed for quite some time too.
> >>>>>>>>>
> >>>>>>>>> It looks like the acpi_device_hotplug() path attempts to acquire
> >>>>>>>>> acpi_scan_lock() while holding dev->mutex which is kind of silly. I
> >>>>>>>>> need to check that, though.
> >>>>>>>>
> >>>>>>>> Thanks for your willingness to help. Well, it's not quite what you describe. If
> >>>>>>>> it were a simple ABBA deadlock, it would be reported by lockdep.
> >>>>>>>> No, it's more complicated:
> >>>>>>>>
> >>>>>>>> 1. suspend_devices_and_enter() holds acpi_scan_lock,
> >>>>>>>> 2. an ACPI hotplug work runs, but acpi_device_hotplug() goes to sleep
> >>>>>>>> when it gets to acquiring acpi_scan_lock,
> >>>>>>>> 3. ata_scsi_dev_rescan() submits a SCSI command and waits for its
> >>>>>>>> completion while holding dev->mutex,
> >>>>>>>> 4. the SCSI completion work happens to be put on the same workqueue as
> >>>>>>>> the ACPI hotplug work in step 2,
> >>>>>>>> ^^^--- THIS is how the two events are serialized!
> >>>>>>>
> >>>>>>> Which is unexpected.
> >>>>>>>
> >>>>>>> And quite honestly I'm not sure how this can happen, because
> >>>>>>> acpi_hotplug_schedule() uses a dedicated workqueue and it is called
> >>>>>>> from (a) the "eject" sysfs attribute (which cannot happen while system
> >>>>>>> suspend-resume is in progress) and (b) acpi_bus_notify() which has
> >>>>>>> nothing to do with SCSI.
> >>>>>>
> >>>>>> Oh, you're right, and I was too quick. They cannot be on the same
> >>>>>> queue...
> >>>>>>
> >>>>>>> Maybe the workqueue used for the SCSI completion is freezable?
> >>>>>>
> >>>>>> Yes, that's it:
> >>>>>>
> >>>>>> *(struct workqueue_struct *)0xffff97d240b2fe00 = {
> >>>>>> /* ... */
> >>>>>> .flags = (unsigned int)4,
> >>>>>> /* WQ_FREEZABLE = 1 << 2 */
> >>>>>>
> >>>>>> Good. But if this workqueue is frozen, the system still cannot make
> >>>>>> progress.
> >>>>>
> >>>>> The problem seems to be that dev->mutex is held while the work item
> >>>>> goes to a freezable workqueue and is waited for, which is an almost
> >>>>> guaranteed deadlock scenario.
> >>>>
> >>>> Ah. Thanks for the explanation and direction! I'm going to dive into the
> >>>> block layer and/or SCSI code and bug other people with my findings.
> >>>
> >>> Please feel free to CC me on that in case I can help.
> >>
> >> And here I am again... The frozen workqueue is in fact pm_wq, and the
> >> work item that is waited for is pm_runtime_work. The block layer calls
> >> pm_request_resume() on the device to resume the queue.
> >>
> >> I bet the queue should not be resumed this early. In fact, it seems
> >> that this is somewhat known to the ATA developers, because
> >> ata_scsi_dev_rescan() contains this beautiful comment and code:
> >>
> >>		/*
> >>		 * If the rescan work was scheduled because of a resume
> >>		 * event, the port is already fully resumed, but the
> >>		 * SCSI device may not yet be fully resumed. In such
> >>		 * case, executing scsi_rescan_device() may cause a
> >>		 * deadlock with the PM code on device_lock(). Prevent
> >>		 * this by giving up and retrying rescan after a short
> >>		 * delay.
> >>		 */
> >>		delay_rescan = sdev->sdev_gendev.power.is_suspended;
> >>		if (delay_rescan) {
> >>			scsi_device_put(sdev);
> >>			break;
> >>		}
> >>
> >> It just doesn't seem to work as expected, at least not in my case.
> >>
> > Did you test the libata suspend/resume patches from Damien?
>
> I think so, since the hang was reproduced on 6.6-rc4, as mentioned in another email
> in this thread.
Indeed. I originally hit the hang on 6.5. I captured a coredump and did
most of my analysis on that version (hence the quoted code). But I
tested again with 6.6-rc4 just before sending my email, and the hang is
still there. The rescan work runs after dpm_resume(), but before
dpm_complete(), and I believe that's yet another issue.
Petr T
^ permalink raw reply [flat|nested] 21+ messages in thread