public inbox for linux-pm@vger.kernel.org
 help / color / mirror / Atom feed
From: "Petr Tesařík" <petr@tesarici.cz>
To: Damien Le Moal <dlemoal@kernel.org>, Hannes Reinecke <hare@suse.de>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>,
	linux-pm@vger.kernel.org, linux-ide@vger.kernel.org
Subject: Re: Thinkpad E595 system deadlock on resume from S3
Date: Wed, 4 Oct 2023 08:42:29 +0200	[thread overview]
Message-ID: <20231004084229.34601732@meshulam.tesarici.cz> (raw)
In-Reply-To: <5c6bf031-52cd-3b81-fc7f-8a514c19d9d0@kernel.org>

On Wed, 4 Oct 2023 15:18:34 +0900
Damien Le Moal <dlemoal@kernel.org> wrote:

> On 10/4/23 15:13, Hannes Reinecke wrote:
> > On 10/3/23 17:18, Petr Tesařík wrote:  
> >> On Tue, 3 Oct 2023 14:57:46 +0200
> >> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> >>  
> >>> On Tue, Oct 3, 2023 at 2:51 PM Petr Tesařík <petr@tesarici.cz> wrote:  
> >>>>
> >>>> On Tue, 3 Oct 2023 14:48:13 +0200
> >>>> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> >>>>     
> >>>>> On Tue, Oct 3, 2023 at 2:40 PM Petr Tesařík <petr@tesarici.cz> wrote:  
> >>>>>>
> >>>>>> On Tue, 3 Oct 2023 14:34:56 +0200
> >>>>>> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> >>>>>>     
> >>>>>>> On Tue, Oct 3, 2023 at 1:02 PM Petr Tesařík <petr@tesarici.cz> wrote:  
> >>>>>>>>
> >>>>>>>> On Tue, 3 Oct 2023 12:15:10 +0200
> >>>>>>>> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
> >>>>>>>>     
> >>>>>>>>> On Tue, Oct 3, 2023 at 11:31 AM Petr Tesařík <petr@tesarici.cz> wrote:  
> >>>>>>>>>>
> >>>>>>>>>> Hi again (adding more recipients),
> >>>>>>>>>>
> >>>>>>>>>> On Sat, 30 Sep 2023 12:20:54 +0200
> >>>>>>>>>> Petr Tesařík <petr@tesarici.cz> wrote:
> >>>>>>>>>>     
> >>>>>>>>>>> Hi all,
> >>>>>>>>>>>
> >>>>>>>>>>> this time no patch (yet). In short, my Thinkpad running v6.6-rc3 fails
> >>>>>>>>>>> to resume from S3. It also fails the same way with Tumbleweed v6.5
> >>>>>>>>>>> kernel. I was able to capture a crash dump of the v6.5 kernel, and
> >>>>>>>>>>> here's my analysis:
> >>>>>>>>>>>
> >>>>>>>>>>> The system never gets to waking up my SATA SSD disk:
> >>>>>>>>>>>
> >>>>>>>>>>> [0:0:0:0]    disk    ATA      KINGSTON SEDC600 H5.1  /dev/sda
> >>>>>>>>>>>
> >>>>>>>>>>> There is a pending resume work for kworker/u32:12 (PID 11032), but this
> >>>>>>>>>>> worker is stuck in 'D' state:
> >>>>>>>>>>>     
> >>>>>>>>>>>>>> prog.stack_trace(11032)  
> >>>>>>>>>>> #0  context_switch (../kernel/sched/core.c:5381:2)
> >>>>>>>>>>> #1  __schedule (../kernel/sched/core.c:6710:8)
> >>>>>>>>>>> #2  schedule (../kernel/sched/core.c:6786:3)
> >>>>>>>>>>> #3  schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> >>>>>>>>>>> #4  __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> >>>>>>>>>>> #5  __mutex_lock (../kernel/locking/mutex.c:747:9)
> >>>>>>>>>>> #6  acpi_device_hotplug (../drivers/acpi/scan.c:382:2)
> >>>>>>>>>>> #7  acpi_hotplug_work_fn (../drivers/acpi/osl.c:1162:2)
> >>>>>>>>>>> #8  process_one_work (../kernel/workqueue.c:2600:2)
> >>>>>>>>>>> #9  worker_thread (../kernel/workqueue.c:2751:4)
> >>>>>>>>>>> #10 kthread (../kernel/kthread.c:389:9)
> >>>>>>>>>>> #11 ret_from_fork (../arch/x86/kernel/process.c:145:3)
> >>>>>>>>>>> #12 ret_from_fork_asm+0x1b/0x20 (../arch/x86/entry/entry_64.S:304)
> >>>>>>>>>>>
> >>>>>>>>>>> acpi_device_hotplug() tries to acquire acpi_scan_lock, which is held by
> >>>>>>>>>>> systemd-sleep (PID 11002). This task is also in 'D' state:
> >>>>>>>>>>>     
> >>>>>>>>>>>>>> prog.stack_trace(11002)  
> >>>>>>>>>>> #0  context_switch (../kernel/sched/core.c:5381:2)
> >>>>>>>>>>> #1  __schedule (../kernel/sched/core.c:6710:8)
> >>>>>>>>>>> #2  schedule (../kernel/sched/core.c:6786:3)
> >>>>>>>>>>> #3  schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> >>>>>>>>>>> #4  __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> >>>>>>>>>>> #5  __mutex_lock (../kernel/locking/mutex.c:747:9)
> >>>>>>>>>>> #6  device_lock (../include/linux/device.h:958:2)
> >>>>>>>>>>> #7  device_complete (../drivers/base/power/main.c:1063:2)
> >>>>>>>>>>> #8  dpm_complete (../drivers/base/power/main.c:1121:3)
> >>>>>>>>>>> #9  suspend_devices_and_enter (../kernel/power/suspend.c:516:2)  
> >>>>>>>>>>
> >>>>>>>>>> I believe the issue must be somewhere here. The whole suspend and
> >>>>>>>>>> resume logic in suspend_devices_and_enter() is framed by
> >>>>>>>>>> platform_suspend_begin() and platform_resume_end().
> >>>>>>>>>>
> >>>>>>>>>> My system is an ACPI system, so suspend_ops contains:
> >>>>>>>>>>
> >>>>>>>>>>          .begin = acpi_suspend_begin,
> >>>>>>>>>>          .end = acpi_pm_end,
> >>>>>>>>>>
> >>>>>>>>>> Now, acpi_suspend_begin() acquires acpi_scan_lock through
> >>>>>>>>>> acpi_pm_start(), and the lock is not released until acpi_pm_end().
> >>>>>>>>>> Since dpm_complete() waits for the completion of a work that tries to
> >>>>>>>>>> acquire acpi_scan_lock, the system will deadlock.  
> >>>>>>>>>
> >>>>>>>>> So holding acpi_scan_lock across suspend-resume is basically to
> >>>>>>>>> prevent the hotplug from taking place then IIRC.
> >>>>>>>>>     
> >>>>>>>>>> AFAICS either:
> >>>>>>>>>>
> >>>>>>>>>> a. the ACPI lock cannot be held while dpm_complete() runs, or
> >>>>>>>>>> b. ata_scsi_dev_rescan() must not be scheduled before the system is
> >>>>>>>>>> resumed, or
> >>>>>>>>>> c. acpi_device_hotplug() must be implemented without taking dev->mutex.
> >>>>>>>>>>
> >>>>>>>>>> My gut feeling is that b. is the right answer.  
> >>>>>>>>>
> >>>>>>>>> It's been a while since I looked at that code last time, but then it
> >>>>>>>>> has not changed for quite some time too.
> >>>>>>>>>
> >>>>>>>>> It looks like the acpi_device_hotplug() path attempts to acquire
> >>>>>>>>> acpi_scan_lock() while holding dev->mutex which is kind of silly.  I
> >>>>>>>>> need to check that, though.  
> >>>>>>>>
> >>>>>>>> Thanks for your willingness. Well, it's not quite what you describe. If
> >>>>>>>> it was a simple ABBA deadlock, then it would be reported by lockdep.
> >>>>>>>> No, it's more complicated:
> >>>>>>>>
> >>>>>>>> 1. suspend_devices_and_enter() holds acpi_scan_lock,
> >>>>>>>> 2. an ACPI hotplug work runs, but acpi_device_hotplug() goes to sleep
> >>>>>>>>     when it gets to acquiring acpi_scan_lock,
> >>>>>>>> 3. ata_scsi_dev_rescan() submits a SCSI command and waits for its
> >>>>>>>>     completion while holding dev->mutex,
> >>>>>>>> 4. the SCSI completion work happens to be put on the same workqueue as
> >>>>>>>>     the ACPI hotplug work in step 2,
> >>>>>>>>     ^^^--- THIS is how the two events are serialized!  
> >>>>>>>
> >>>>>>> Which is unexpected.
> >>>>>>>
> >>>>>>> And quite honestly I'm not sure how this can happen, because
> >>>>>>> acpi_hotplug_schedule() uses a dedicated workqueue and it is called
> >>>>>>> from (a) the "eject" sysfs attribute (which cannot happen while system
> >>>>>>> suspend-resume is in progress) and (b) acpi_bus_notify() which has
> >>>>>>> nothing to do with SCSI.  
> >>>>>>
> >>>>>> Oh, you're right, and I was too quick. They cannot be on the same
> >>>>>> queue...
> >>>>>>     
> >>>>>>> Maybe the workqueue used for the SCSI completion is freezable?  
> >>>>>>
> >>>>>> Yes, that's it:
> >>>>>>
> >>>>>> *(struct workqueue_struct *)0xffff97d240b2fe00 = {
> >>>>>> /* ... */
> >>>>>>          .flags = (unsigned int)4,
> >>>>>> /* WQ_FREEZABLE            = 1 << 2 */
> >>>>>>
> >>>>>> Good. But if this workqueue is frozen, the system still cannot make
> >>>>>> progress.  
> >>>>>
> >>>>> The problem seems to be that dev->mutex is held while the work item
> >>>>> goes to a freezable workqueue and is waited for, which is an almost
> >>>>> guaranteed deadlock scenario.  
> >>>>
> >>>> Ah. Thanks for explanation and direction! I'm going to dive into the
> >>>> block layer and/or SCSI code and bug other people with my findings.  
> >>>
> >>> Please feel free to CC me on that in case I can help.  
> >>
> >> And here I am again... The frozen workqueue is in fact pm_wq, and the
> >> work item that is waited for is pm_runtime_work. The block layer calls
> >> pm_request_resume() on the device to resume the queue.
> >>
> >> I bet the queue should not be resumed this early. In fact, it seems
> >> that this is somewhat known to the ATA developers, because
> >> ata_scsi_dev_rescan() contains this beautiful comment and code:
> >>
> >> 			/*
> >> 			 * If the rescan work was scheduled because of a resume
> >> 			 * event, the port is already fully resumed, but the
> >> 			 * SCSI device may not yet be fully resumed. In such
> >> 			 * case, executing scsi_rescan_device() may cause a
> >> 			 * deadlock with the PM code on device_lock(). Prevent
> >> 			 * this by giving up and retrying rescan after a short
> >> 			 * delay.
> >> 			 */
> >> 			delay_rescan = sdev->sdev_gendev.power.is_suspended;
> >> 			if (delay_rescan) {
> >> 				scsi_device_put(sdev);
> >> 				break;
> >> 			}
> >>
> >> It just doesn't seem to work as expected, at least not in my case.
> >>  
> > Did you test the libata suspend/resume patches from Damien?  
> 
> I think so since the hang was repeated on 6.6-rc4, as mentioned in another email
> in this thread.

Indeed. I originally hit the hang on 6.5. I captured a coredump and did
most of my analysis on that version (hence the quoted code). But I
tested again with 6.6-rc4 just before sending my email, and the hang is
still there. The rescan work runs after dpm_resume(), but before
dpm_complete(), and I believe that's yet another issue.

Petr T

  reply	other threads:[~2023-10-04  6:42 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-09-30 10:20 Thinkpad E595 system deadlock on resume from S3 Petr Tesařík
2023-10-03  9:31 ` Petr Tesařík
2023-10-03 10:15   ` Rafael J. Wysocki
2023-10-03 11:02     ` Petr Tesařík
2023-10-03 12:34       ` Rafael J. Wysocki
2023-10-03 12:40         ` Petr Tesařík
2023-10-03 12:48           ` Rafael J. Wysocki
2023-10-03 12:51             ` Petr Tesařík
2023-10-03 12:57               ` Rafael J. Wysocki
2023-10-03 15:18                 ` Petr Tesařík
2023-10-03 16:16                   ` Petr Tesařík
2023-10-03 16:48                   ` Rafael J. Wysocki
2023-10-03 17:19                     ` Petr Tesařík
2023-10-03 20:07                     ` Petr Tesařík
2023-10-04  1:25                       ` Damien Le Moal
2023-10-04  6:43                         ` Petr Tesařík
2023-10-04  7:38                           ` Damien Le Moal
2023-10-04  6:13                   ` Hannes Reinecke
2023-10-04  6:18                     ` Damien Le Moal
2023-10-04  6:42                       ` Petr Tesařík [this message]
2023-10-03 12:35       ` Petr Tesařík

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20231004084229.34601732@meshulam.tesarici.cz \
    --to=petr@tesarici.cz \
    --cc=dlemoal@kernel.org \
    --cc=hare@suse.de \
    --cc=linux-ide@vger.kernel.org \
    --cc=linux-pm@vger.kernel.org \
    --cc=rafael@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox