From: Damien Le Moal <dlemoal@kernel.org>
To: Yihang Li <liyihang9@huawei.com>, cassel@kernel.org
Cc: James.Bottomley@HansenPartnership.com,
martin.petersen@oracle.com, john.g.garry@oracle.com,
yanaijie@huawei.com, linux-kernel@vger.kernel.org,
linux-scsi@vger.kernel.org, linuxarm@huawei.com,
chenxiang66@hisilicon.com, prime.zeng@huawei.com,
"linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>,
Bjorn Helgaas <bhelgaas@google.com>
Subject: Re: [bug report] scsi: SATA devices missing after FLR is triggered during HBA suspended
Date: Mon, 1 Jul 2024 12:03:02 +0900
Message-ID: <85cebcb9-ce97-43f2-8da5-01c3a745fe2c@kernel.org>
In-Reply-To: <0d9bce26-c45b-5ce1-93c0-ca8af50547ae@huawei.com>
On 6/24/24 21:10, Yihang Li wrote:
>> Thank you for the explanation, but as Niklas said, it would be a lot easier for
>> me to recreate the issue if you send the exact commands you execute to trigger
>> the issue. E.g. "suspend all disks" in step a can have a lot of different
>> meanings depending on which type of suspend you are using... So please send the
>> exact commands you use.
>> is what exactly ? autosuspend ? or something else ?
I am failing to recreate the exact same issue. I do see a lot of bad things
happening though, but they do not look like what you reported. I do end up with
the 4 drives connected to my HBA being disabled by libata as revalidate/IDENTIFY
fails. And even worse: I hit a deadlock on dev->mutex when I try to do "rmmod
pm80xx" after running your test.
I am using a pm80xx adapter as that is the only libsas adapter I have.
I think your test just opened a big can of worms... There seem to be a lot of
things going wrong, but I now need to sort out if the problems are with the
pm80xx driver, libsas, libata or sd. Probably a combination of all.
ATA device suspend/resume has been a constant source of issues since the SCSI
layer switched to doing PM operations asynchronously. Your issue is the latest one.
This will take a while to debug.
> In step a, I suspend all disks by issuing the following command to all disks
> attached to the SAS controller 0000:b4:02.0:
> [root@localhost ~]# echo auto > /sys/devices/pci0000:b4/0000:b4:02.0/host6/port-6:0/end_device-6:0/target6:0:0/6:0:0:0/power/control
> [root@localhost ~]# echo 5000 > /sys/devices/pci0000:b4/0000:b4:02.0/host6/port-6:0/end_device-6:0/target6:0:0/6:0:0:0/power/autosuspend_delay_ms
> ...
> [root@localhost ~]# echo auto > /sys/devices/pci0000:b4/0000:b4:02.0/host6/port-6:6/end_device-6:6/target6:0:6/6:0:6:0/power/control
> [root@localhost ~]# echo 5000 > /sys/devices/pci0000:b4/0000:b4:02.0/host6/port-6:6/end_device-6:6/target6:0:6/6:0:6:0/power/autosuspend_delay_ms
This works as expected on my system and I see my drives going to sleep after 5s.
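The per-disk commands quoted above can be applied to every disk on one host
with a small loop. This is only a sketch: the function name
enable_autosuspend, the SYSFS_ROOT variable and the use of
/sys/class/scsi_device to find the disks are assumptions made for
illustration and are not taken from the report. Writing these attributes
requires root.

```shell
#!/bin/sh
# Sketch: enable runtime autosuspend for every SCSI device on one host.
# SYSFS_ROOT is a hypothetical override so the loop can be tried on a
# scratch directory instead of the real /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

enable_autosuspend() {
    host="$1"       # SCSI host number, e.g. 6 for host6
    delay_ms="$2"   # idle time before suspend, in milliseconds
    for dev in "$SYSFS_ROOT"/class/scsi_device/"$host":*/device; do
        # Skip unmatched globs and devices without a power directory.
        [ -d "$dev/power" ] || continue
        echo auto > "$dev/power/control"
        echo "$delay_ms" > "$dev/power/autosuspend_delay_ms"
        echo "autosuspend enabled: $dev"
    done
}
```

For the setup described in the report, this would be called as
"enable_autosuspend 6 5000".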
> Step b, Suspend the SAS controller:
> [root@localhost ~]# echo auto > /sys/devices/pci0000:b4/0000:b4:02.0/power/control
This has no effect for me. Can you confirm that your controller is actually
sleeping ? I.e., what do the following show ?
cat /sys/devices/pci0000:b4/0000:b4:02.0/power/runtime_active_kids
cat /sys/devices/pci0000:b4/0000:b4:02.0/power/runtime_status
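To collect the runtime PM state of the controller in one go, something like
the following could be used. It is a sketch only: show_runtime_pm is a
made-up helper name, and as far as I know runtime_active_kids and
runtime_usage are only present when the kernel is built with
CONFIG_PM_ADVANCED_DEBUG, so missing attributes are reported rather than
treated as errors.

```shell
#!/bin/sh
# Sketch: print the runtime PM attributes of one device.
# The argument is the sysfs directory of the device.
show_runtime_pm() {
    dev="$1"
    for attr in control runtime_status runtime_active_kids runtime_usage; do
        f="$dev/power/$attr"
        if [ -r "$f" ]; then
            printf '%s: %s\n' "$attr" "$(cat "$f")"
        else
            # Attribute not exposed by this kernel configuration.
            printf '%s: <not available>\n' "$attr"
        fi
    done
}
```

Example: "show_runtime_pm /sys/devices/pci0000:b4/0000:b4:02.0", run once
before and once after step b.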
> At this point, the SAS controller is suspended. Next step c is trigger PCI FLR.
> [root@localhost ~]# echo 1 > /sys/bus/pci/devices/0000:b4:02.0/reset
What does
cat /sys/bus/pci/devices/0000:b4:02.0/reset_method
show on your system ?
Mine is "bus" only.
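Since the reset_method attribute lists the supported methods separated by
spaces, a quick way to check whether FLR is actually among them is a helper
like this. has_reset_method is a hypothetical name used for illustration
only.

```shell
#!/bin/sh
# Sketch: return 0 if the given method is listed in a reset_method file.
has_reset_method() {
    file="$1"     # path to the reset_method sysfs attribute
    method="$2"   # method name to look for, e.g. flr or bus
    # Unquoted on purpose: split the file content into words.
    for m in $(cat "$file" 2>/dev/null); do
        [ "$m" = "$method" ] && return 0
    done
    return 1
}
```

Example:
"has_reset_method /sys/bus/pci/devices/0000:b4:02.0/reset_method flr && echo FLR available".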
>>> The issue 2:
>>> a. Suspend all disks on controller B.
>>> b. Suspend controller B.
>>> c. Resuming all disks on controller B.
>>> d. Run the "lsmod" command to check the driver reference counting.
What is the reference count before you do step (a), after you run step (b) and
at step (d) ?
For my system using the pm80xx driver, I get:
pm80xx 352256 0
libsas 155648 1 pm80xx
before and after, and that is all normal. But there is the difference that
suspending the pm80xx controller does not seem to be supported and does nothing.
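To compare the reference counts at steps (a), (b) and (d), the "Used by"
count can be extracted from the lsmod output at each step. A minimal sketch,
where refcount_of is a made-up helper name:

```shell
#!/bin/sh
# Sketch: read lsmod-style output on stdin and print the use count
# (third column) of the named module.
refcount_of() {
    awk -v mod="$1" '$1 == mod { print $3 }'
}
```

Example: "lsmod | refcount_of pm80xx", run before step (a), after step (b)
and at step (d), then compare the three values.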
--
Damien Le Moal
Western Digital Research