From: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
To: Lukas Wunner <lukas@wunner.de>
Cc: Bjorn Helgaas <helgaas@kernel.org>,
linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org,
Mahesh J Salgaonkar <mahesh@linux.ibm.com>,
Yazen Ghannam <yazen.ghannam@amd.com>,
Ilpo Jarvinen <ilpo.jarvinen@linux.intel.com>,
Bowman Terry <terry.bowman@amd.com>,
Hagan Billy <billy.hagan@amd.com>,
Simon Guinot <simon.guinot@seagate.com>,
"Maciej W . Rozycki" <macro@orcam.me.uk>,
Kuppuswamy Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com>
Subject: Re: [PATCH v2] PCI: pciehp: Clear LBMS on hot-remove to prevent link speed reduction
Date: Tue, 25 Jun 2024 13:20:10 -0700 [thread overview]
Message-ID: <73fd7b2d-9256-9eba-70be-d69ea336fd67@amd.com> (raw)
In-Reply-To: <ZnKNJxJwdtWRphgX@wunner.de>
Hi all,
Sorry for the delay here. Took some time to find a system to run
experiments. Comments inline.
On 6/19/2024 12:47 AM, Lukas Wunner wrote:
> On Tue, Jun 18, 2024 at 02:23:21PM -0700, Smita Koralahalli wrote:
>> On 6/18/2024 11:51 AM, Smita Koralahalli wrote:
>>>>>> But IIUC LBMS is set by hardware but never cleared by hardware, so if
>>>>>> we remove a device and power off the slot, it doesn't seem like LBMS
>>>>>> could be telling us anything useful (what could we do in response to
>>>>>> LBMS when the slot is empty?), so it makes sense to me to clear it.
>>>>>>
>>>>>> It seems like pciehp_unconfigure_device() does sort of PCI core and
>>>>>> driver-related things and possibly could be something shared by all
>>>>>> hotplug drivers, while remove_board() does things more specific to the
>>>>>> hotplug model (pciehp, shpchp, etc).
>>>>>>
>>>>>> From that perspective, clearing LBMS might fit better in
>>>>>> remove_board(). In that case, I wonder whether it should be done
>>>>>> after turning off slot power? This patch clears is *before* turning
>>>>>> off the power, so I wonder if hardware could possibly set it again
>>>>>> before the poweroff?
>>
>> While clearing LBMS in remove_board() here:
>>
>> if (POWER_CTRL(ctrl)) {
>> pciehp_power_off_slot(ctrl);
>> + pcie_capability_write_word(ctrl->pcie->port, PCI_EXP_LNKSTA,
>> PCI_EXP_LNKSTA_LBMS);
>>
>> /*
>> * After turning power off, we must wait for at least 1 second
>> * before taking any action that relies on power having been
>> * removed from the slot/adapter.
>> */
>> msleep(1000);
>>
>> /* Ignore link or presence changes caused by power off */
>> atomic_and(~(PCI_EXP_SLTSTA_DLLSC | PCI_EXP_SLTSTA_PDC),
>> &ctrl->pending_events);
>> }
>>
>> This can happen too right? I.e Just after the slot poweroff and before LBMS
>> clearing the PDC/PDSC could be fired. Then
>> pciehp_handle_presence_or_link_change() would hit case "OFF_STATE" and
>> proceed with pciehp_enable_slot() ....pcie_failed_link_retrain() and
>> ultimately link speed drops..
>>
>> So, I added clearing just before turning off the slot.. Let me know if I'm
>> thinking it right.
I guess I should have experimented before putting this comment out.
After talking to the HW/FW teams, I understood that, none of our CRBs
support power controller for NVMe devices, which means the "Power
Controller Present" in Slot_Cap is always false. That's what makes it a
"surprise removal." If the OS was notified beforehand and there was a
power controller attached, the OS would turn off the power with
SLOT_CNTL. That's an "orderly" removal. So essentially, the entire block
from "if (POWER_CTRL(ctrl))" will never be executed for surprise removal
for us.
There could be board designs outside of us, with power controllers for
the NVME devices, which I'm not aware of.
>
> This was added by 3943af9d01e9 ("PCI: pciehp: Ignore Link State Changes
> after powering off a slot"). You can try reproducing it by writing "0"
> to the slot's "power" file in sysfs, but your hardware needs to support
> slot power.
>
> Basically the idea is that after waiting for 1 sec, chances are very low
> that any DLLSC or PDSC events caused by removing slot power may still
> occur.
PDSC events occurring in our case aren't by removing slot power. It
should/will always happen on a surprise removal along with DLLSC for us.
But this PDSC is been delayed and happens after DLLSC is invoked and
ctrl->state = OFF_STATE in pciehp_disable_slot(). So the PDSC is mistook
to enable slot in pciehp_enable_slot() inside
pciehp_handle_presence_or_link_change().
>
> Arguably the same applies to LBMS changes, so I'd recommend to likewise
> clear stale LBMS after the msleep(1000).
>
> pciehp_ctrl.c only contains the state machine and higher-level logic of
> the hotplug controller and all the actual register accesses are in helpers
> in pciehp_hpc.c. So if you want to do it picture-perfectly, add a helper
> in pciehp_hpc.c to clear LBMS and call that from remove_board().
>
> That all being said, I'm wondering how this plays together with Ilpo's
> bandwidth control driver?
>
> https://lore.kernel.org/all/20240516093222.1684-1-ilpo.jarvinen@linux.intel.com/
I need to yet do a thorough reading of Ilpo's bandwidth control driver.
Ilpo please correct me if I misspeak something as I don't have a
thorough understanding.
Ilpo's bandwidth controller also checks for lbms count to be greater
than zero to bring down link speeds if CONFIG_PCIE_BWCTRL is true. If
false, it follows the default path to check LBMS bit in link status
register. So if, CONFIG_PCIE_BWCTRL is disabled by default we continue
to see link speed drops. Even, if BWCTRL is enabled, LBMS count is
incremented to 1 in pcie_bwnotif_enable() so likely pcie_lbms_seen()
might return true thereby bringing down speeds here as well if DLLLA is
clear?
>
> IIUC, the bandwidth control driver will be in charge of handling LBMS
> changes. So clearing LBMS behind the bandwidth control driver's back
> might be problematic. Ilpo?
>
> Also, since you've confirmed that this issue is fallout from
> a89c82249c37 ("PCI: Work around PCIe link training failures"),
> I'm wondering if the logic introduced by that commit can be
> changed so that the quirk is applied more narrowly, i.e. *not*
> applied to unaffected hardware, such as AMD's hotplug ports.
> That would avoid the need to undo the effect of the quirk and
> work around the downtraining you're seeing.
>
> Maciej, any ideas?
Yeah I'm okay to go down to that approach as well. Any ideas would be
helpful here.
Thanks
Smita
>
> Thanks,
>
> Lukas
>
next prev parent reply other threads:[~2024-06-25 20:20 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-05-16 20:47 [PATCH v2] PCI: pciehp: Clear LBMS on hot-remove to prevent link speed reduction Smita Koralahalli
2024-06-17 20:09 ` Bjorn Helgaas
2024-06-17 22:51 ` Smita Koralahalli
2024-06-17 23:18 ` Bjorn Helgaas
2024-06-18 18:51 ` Smita Koralahalli
2024-06-18 21:23 ` Smita Koralahalli
2024-06-19 7:47 ` Lukas Wunner
2024-06-25 20:20 ` Smita Koralahalli [this message]
2024-07-09 10:52 ` Ilpo Järvinen
2024-07-17 21:14 ` Smita Koralahalli
2024-07-18 7:47 ` Ilpo Järvinen
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=73fd7b2d-9256-9eba-70be-d69ea336fd67@amd.com \
--to=smita.koralahallichannabasappa@amd.com \
--cc=billy.hagan@amd.com \
--cc=helgaas@kernel.org \
--cc=ilpo.jarvinen@linux.intel.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-pci@vger.kernel.org \
--cc=lukas@wunner.de \
--cc=macro@orcam.me.uk \
--cc=mahesh@linux.ibm.com \
--cc=sathyanarayanan.kuppuswamy@linux.intel.com \
--cc=simon.guinot@seagate.com \
--cc=terry.bowman@amd.com \
--cc=yazen.ghannam@amd.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.