From: Damien Le Moal <dlemoal@kernel.org>
To: Alan Stern <stern@rowland.harvard.edu>, Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>,
Joe Breuer <linux-kernel@jmbreuer.net>,
Bart Van Assche <bvanassche@acm.org>,
Bagas Sanjaya <bagasdotme@gmail.com>, Pavel Machek <pavel@ucw.cz>,
"Rafael J. Wysocki" <rafael@kernel.org>,
Len Brown <len.brown@intel.com>,
Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
Kees Cook <keescook@chromium.org>,
Tony Luck <tony.luck@intel.com>,
"Guilherme G. Piccoli" <gpiccoli@igalia.com>,
Thorsten Leemhuis <linux@leemhuis.info>,
"James E.J. Bottomley" <jejb@linux.ibm.com>,
"Martin K. Petersen" <martin.petersen@oracle.com>,
Phillip Potter <phil@philpotter.co.uk>,
Linux Power Management <linux-pm@vger.kernel.org>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
Linux Hardening <linux-hardening@vger.kernel.org>,
Linux Regressions <regressions@lists.linux.dev>,
Linux SCSI <linux-scsi@vger.kernel.org>,
Dan Williams <dan.j.williams@intel.com>,
Hannes Reinecke <hare@suse.com>,
Adrian Hunter <adrian.hunter@intel.com>,
Martin Kepplinger <martin.kepplinger@puri.sm>,
Kai-Heng Feng <kai.heng.feng@canonical.com>
Subject: Re: Fwd: Waking up from resume locks up on sr device
Date: Thu, 15 Jun 2023 09:10:28 +0900
Message-ID: <41b069c7-8723-4507-3e5a-1d1878db9002@kernel.org>
In-Reply-To: <859f0eda-4984-4489-9851-c9f6ec454a88@rowland.harvard.edu>
On 6/14/23 23:26, Alan Stern wrote:
> On Wed, Jun 14, 2023 at 04:35:50PM +0900, Damien Le Moal wrote:
>> On 6/14/23 15:57, Hannes Reinecke wrote:
>>> On 6/14/23 06:49, Damien Le Moal wrote:
>>>> On 6/11/23 18:05, Joe Breuer wrote:
>>>>> I'm the reporter of this issue.
>>>>>
>>>>> I just tried this patch against 6.3.4, and it completely fixes my
>>>>> suspend/resume issue.
>>>>>
>>>>> The optical drive stays usable after resume, even suspending/resuming
>>>>> during playback of CDDA content works flawlessly and playback resumes
>>>>> seamlessly after system resume.
>>>>>
>>>>> So, from my perspective: Good one!
>>>>
>>>> In place of Bart's fix, could you please try this patch?
>>>>
>>>> diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
>>>> index b80e68000dd3..a81eb4f882ab 100644
>>>> --- a/drivers/ata/libata-eh.c
>>>> +++ b/drivers/ata/libata-eh.c
>>>> @@ -4006,9 +4006,32 @@ static void ata_eh_handle_port_resume(struct ata_port *ap)
>>>>  	/* tell ACPI that we're resuming */
>>>>  	ata_acpi_on_resume(ap);
>>>>
>>>> -	/* update the flags */
>>>>  	spin_lock_irqsave(ap->lock, flags);
>>>> +
>>>> +	/* Update the flags */
>>>>  	ap->pflags &= ~(ATA_PFLAG_PM_PENDING | ATA_PFLAG_SUSPENDED);
>>>> +
>>>> +	/*
>>>> +	 * Resuming the port will trigger a rescan of the ATA device(s)
>>>> +	 * connected to it. Before scheduling the rescan, make sure that
>>>> +	 * the associated scsi device(s) are fully resumed as well.
>>>> +	 */
>>>> +	ata_for_each_link(link, ap, HOST_FIRST) {
>>>> +		ata_for_each_dev(dev, link, ENABLED) {
>>>> +			struct scsi_device *sdev = dev->sdev;
>>>> +
>>>> +			if (!sdev)
>>>> +				continue;
>>>> +			if (scsi_device_get(sdev))
>>>> +				continue;
>>>> +
>>>> +			spin_unlock_irqrestore(ap->lock, flags);
>>>> +			device_pm_wait_for_dev(&ap->tdev,
>>>> +					       &sdev->sdev_gendev);
>>>> +			scsi_device_put(sdev);
>>>> +			spin_lock_irqsave(ap->lock, flags);
>>>> +		}
>>>> +	}
>>>>  	spin_unlock_irqrestore(ap->lock, flags);
>>>>  }
>>>>  #endif /* CONFIG_PM */
>>>>
>>>> Thanks !
>>>>
>>> Well; not sure if that'll work out.
>>> The whole reason why we initiate a rescan is that we need to check if the
>>> ports are still connected, and whether the devices react.
>>> So we can't iterate the ports here as this is the very thing which gets
>>> checked during EH.
>>
>> Hmmm... Right. So we need to move that loop into ata_scsi_dev_rescan(),
>> which itself already loops over the port devices anyway.
>>
>>> We really should claim resume to be finished as soon as we can talk with
>>> the HBA, and kick off EH asynchronously to let it finish the job after
>>> resume has completed.
>>
>> That is what's done already:
>>
>> static int ata_port_pm_resume(struct device *dev)
>> {
>> 	ata_port_resume_async(to_ata_port(dev), PMSG_RESUME);
>> 	pm_runtime_disable(dev);
>> 	pm_runtime_set_active(dev);
>> 	pm_runtime_enable(dev);
>> 	return 0;
>> }
>>
>> EH is kicked by ata_port_resume_async() -> ata_port_request_pm() and it
>> is async. There is no synchronization in EH with the PM side though. We
>> probably should have EH check that the port resume is done first, which
>> can be done in ata_eh_handle_port_resume() since that is the first thing
>> done when entering EH.
>>
>> The problem remains though that we *must* wait for the scsi device
>> resume to be done before calling scsi_rescan_device(), which is done
>> asynchronously from EH, as a different work. So that one needs to wait
>> for the scsi side resume to be done.
>>
>> I also thought of triggering the rescan from the scsi side, but since
>> the resume may be asynchronous, we could end up triggering it with the
>> ata side not yet resumed... That would only turn the problem around
>> instead of solving it.
>
> The order in which devices get resumed isn't arbitrary. If the system
> is set up not to use async suspends/resumes then the order is always the
> same as the order in which the devices were originally registered (for
> resume, that is -- suspend obviously takes place in the reverse order).
>
> So if you're trying to perform an action that requires two devices to be
> active, you must not do it in the resume handler for the device that was
> registered first. I don't know how the ATA and SCSI pieces interact
> here, but regardless, this is a pretty strict requirement.
>
> It should be okay to perform the action in the resume handler for the
> device that was registered second. But if the two devices aren't in an
> ancestor-descendant relationship then you also have to call
> device_pm_wait_for_dev() (or use device links as Rafael mentioned) to
> handle the async case properly.
>
>> Or... Why the heck is scsi_rescan_device() calling device_lock()? This
>> is the only place in scsi code I can see that takes this lock. I suspect
>> this is to serialize either rescans, or serialize with resume, or both.
>> For serializing rescans, we can use another lock. For serializing with
>> PM, we should wait for PM transitions...
>> Something is not right here.
>
> Here's what commit e27829dc92e5 ("scsi: serialize ->rescan against
> ->remove", written by Christoph Hellwig) says:
>
> Lock the device embedded in the scsi_device to protect against
> concurrent calls to ->remove.
>
> That's the commit which added the device_lock() call.

Thanks for the information.

+Christoph

Why is adding the device_lock() call needed? Couldn't we just use a
scsi_device_get()/scsi_device_put() pair to serialize against remove?

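To make the question concrete, here is a minimal userspace model of the
get/put pattern asked about above. It is only a sketch, not kernel code:
struct model_sdev, model_sdev_get(), model_sdev_put() and model_rescan() are
hypothetical stand-ins, and the only kernel behaviour assumed is that
scsi_device_get() returns an error for a device that is already being
deleted. The model shows that a rescan is skipped once removal has started;
it does not show whether the pair also excludes a ->remove call that starts
while the rescan is already running, which is the open part of the question.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct scsi_device; not the kernel type. */
struct model_sdev {
	int refcount;	/* references currently held on the device */
	bool deleted;	/* models a device whose removal has started */
	int rescans;	/* number of rescans actually performed */
};

/* Models scsi_device_get(): refuse a device that is being removed. */
static int model_sdev_get(struct model_sdev *sdev)
{
	if (sdev->deleted)
		return -ENXIO;
	sdev->refcount++;
	return 0;
}

/* Models scsi_device_put(): drop the reference taken by the get. */
static void model_sdev_put(struct model_sdev *sdev)
{
	assert(sdev->refcount > 0);
	sdev->refcount--;
}

/* Rescan guarded by a get/put pair instead of device_lock(). */
static int model_rescan(struct model_sdev *sdev)
{
	int ret = model_sdev_get(sdev);

	if (ret)
		return ret;	/* device is going away: skip the rescan */

	sdev->rescans++;	/* stands in for the driver ->rescan() call */
	model_sdev_put(sdev);
	return 0;
}
```

In this model a rescan of a live device succeeds and leaves the reference
count balanced, while a rescan attempted after deletion has started returns
-ENXIO without touching the device.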
>
> Alan Stern
--
Damien Le Moal
Western Digital Research