All of lore.kernel.org
 help / color / mirror / Atom feed
From: Damien Le Moal <dlemoal@kernel.org>
To: Bjorn Helgaas <helgaas@kernel.org>
Cc: Yihang Li <liyihang9@huawei.com>,
	cassel@kernel.org, James.Bottomley@hansenpartnership.com,
	martin.petersen@oracle.com, john.g.garry@oracle.com,
	yanaijie@huawei.com, linux-kernel@vger.kernel.org,
	linux-scsi@vger.kernel.org, linuxarm@huawei.com,
	chenxiang66@hisilicon.com, prime.zeng@huawei.com,
	"linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>,
	Bjorn Helgaas <bhelgaas@google.com>,
	Alex Williamson <alex.williamson@redhat.com>
Subject: Re: [bug report] scsi: SATA devices missing after FLR is triggered during HBA suspended
Date: Tue, 2 Jul 2024 11:38:23 +0900	[thread overview]
Message-ID: <93bd907d-c298-497d-ada2-88706233232e@kernel.org> (raw)
In-Reply-To: <20240701203903.GA16142@bhelgaas>

On 7/2/24 05:39, Bjorn Helgaas wrote:
> [+cc Alex]
> 
> On Thu, Jun 27, 2024 at 09:56:02AM +0900, Damien Le Moal wrote:
>> On 6/27/24 00:15, Bjorn Helgaas wrote:
>>>>> Yes, I am talking about the PCI "Function Level Reset"
>>>>>
>>>>>> FLR and disk/controller suspend execution timing are unrelated.
>>>>>> FLR can be triggered at any time through sysfs. So please give
>>>>>> details here. Why is FLR done when the system is being
>>>>>> suspended ?
>>>>>
>>>>> Yes, it is because FLR can be triggered at any time that we are
>>>>> testing the reliability of executing FLR commands after
>>>>> disk/controller suspended.
>>>>
>>>> "can be triggered" ? FLR is not a random asynchronous event. It
>>>> is an action that is *issued* by a user with sys admin rights.
>>>> And such users can do a lot of things that can break a machine...
>>>>
>>>> I fail to see the point of doing a function reset while the
>>>> device is suspended. But granted, I guess the device should
>>>> comeback up in such case, though I would like to hear what the
>>>> PCI guys have to say about this.
>>>>
>>>> Bjorn,
>>>>
>>>> Is reseting a suspended PCI device something that should be/is
>>>> supported ?
>>>
>>> I doubt it.  The PCI core should be preserving all the generic PCI
>>> state across suspend/resume.  The driver should only need to
>>> save/restore device-specific things the PCI core doesn't know about.
>>>
>>> A reset will clear out most state, and the driver doesn't know the
>>> reset happened, so it will expect most device state to have been
>>> preserved.
>>
>> That is what I suspected. However, checking the code, reset_store() in
>> pci-sysfs.c does:
>>
>> 	pm_runtime_get_sync(dev);
>> 	result = pci_reset_function(pdev);
>> 	pm_runtime_put(dev);
>>
>> and pm_runtime_get_sync() calls __pm_runtime_resume() which will
>> resume a suspended device.
>>
>> So while I still think it is not a good idea to reset a suspended
>> device, things should still work as execpected and not cause any
>> problem with the device state, right ?
> 
> The reset will clear almost all state, including both the generic PCI
> part that pci_reset_function() saves/restores *and* any
> device-specific state the PCI core doesn't know about.
> 
> That device-specific state isn't saved and restored anywhere in the
> sysfs reset path, and the driver doesn't know this reset happened, so
> I think all bets are off and we shouldn't expect the driver to work
> afterwards.
> 
> A user-space reset might make sense if there's no driver bound to the
> device, but I don't think it does if there is a driver (except maybe a
> trivial stub driver that doesn't actually operate the device).

OK, makes sense.

I amstill looking into this though because I did find a nasty issue: if the HBA
is reset while all the drives connected to it are suspended (spun down), the
drives are never woken up and the drive re-scan trigerred by the PCI reset fails
with command timeouts. And even worse, I hit a deadlock when unloading the
driver after that happens.

All of that should not be happening: the HBA reset should simply result in
either all drives coming back or the drives (scsi devices) being dropped and
re-scan creating new ones. But that I think is not a PCI issue but rather a HBA
driver issue, or a problem with libsas/scsi/libata power management.

Thanks for the comments.

-- 
Damien Le Moal
Western Digital Research


      reply	other threads:[~2024-07-02  2:38 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-06-18 13:29 [bug report] scsi: SATA devices missing after FLR is triggered during HBA suspended Yihang Li
2024-06-18 23:11 ` Damien Le Moal
2024-06-22  3:31   ` Yihang Li
2024-06-22 11:25     ` Niklas Cassel
2024-06-24  0:10     ` Damien Le Moal
2024-06-24 12:10       ` Yihang Li
2024-07-01  3:03         ` Damien Le Moal
2024-07-02 11:20           ` Yihang Li
2024-06-26 15:15       ` Bjorn Helgaas
2024-06-27  0:56         ` Damien Le Moal
2024-06-27  8:19           ` Yihang Li
2024-07-01 20:39           ` Bjorn Helgaas
2024-07-02  2:38             ` Damien Le Moal [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=93bd907d-c298-497d-ada2-88706233232e@kernel.org \
    --to=dlemoal@kernel.org \
    --cc=James.Bottomley@hansenpartnership.com \
    --cc=alex.williamson@redhat.com \
    --cc=bhelgaas@google.com \
    --cc=cassel@kernel.org \
    --cc=chenxiang66@hisilicon.com \
    --cc=helgaas@kernel.org \
    --cc=john.g.garry@oracle.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=linux-scsi@vger.kernel.org \
    --cc=linuxarm@huawei.com \
    --cc=liyihang9@huawei.com \
    --cc=martin.petersen@oracle.com \
    --cc=prime.zeng@huawei.com \
    --cc=yanaijie@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.