* [PATCH v2] PCI: Add quirk to disable ASPM L1 for Sandisk SN740 NVMe SSDs
@ 2025-11-20 16:12 Manivannan Sadhasivam
2025-11-24 11:42 ` Konrad Dybcio
2025-11-24 23:53 ` Bjorn Helgaas
0 siblings, 2 replies; 11+ messages in thread
From: Manivannan Sadhasivam @ 2025-11-20 16:12 UTC (permalink / raw)
To: bhelgaas; +Cc: linux-pci, linux-kernel, Manivannan Sadhasivam, Konrad Dybcio
The Sandisk SN740 NVMe SSDs cause the AER errors below on the upstream Root
Port of the PCIe controller in Microsoft Surface Laptop 7 when ASPM L1 is
enabled:
pcieport 0006:00:00.0: AER: Correctable error message received from 0006:01:00.0
nvme 0006:01:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
nvme 0006:01:00.0: device [15b7:5015] error status/mask=00000001/0000e000
nvme 0006:01:00.0: [ 0] RxErr
Hence, add a quirk to disable L1 by removing the ASPM_L1 CAP for this SSD.
Reported-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Signed-off-by: Manivannan Sadhasivam <mani@kernel.org>
---
Changes in v2:
* Fixed the laptop name
* Rebased on top of v6.18-rc6 for pcie_aspm_remove_cap()
drivers/pci/quirks.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index b9c252aa6fe0..adc54533df7f 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -2527,6 +2527,17 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_FREESCALE, 0x0451, quirk_disable_aspm_l0s
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_PASEMI, 0xa002, quirk_disable_aspm_l0s_l1);
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_HUAWEI, 0x1105, quirk_disable_aspm_l0s_l1);

+static void quirk_disable_aspm_l1(struct pci_dev *dev)
+{
+	pcie_aspm_remove_cap(dev, PCI_EXP_LNKCAP_ASPM_L1);
+}
+
+/*
+ * Sandisk SN740 NVMe SSDs cause AER timeout errors on the upstream PCIe Root
+ * Port when ASPM L1 is enabled.
+ */
+DECLARE_PCI_FIXUP_HEADER(0x15b7, 0x5015, quirk_disable_aspm_l1);
+
 /*
  * Some Pericom PCIe-to-PCI bridges in reverse mode need the PCIe Retrain
  * Link bit cleared after starting the link retrain process to allow this
--
2.48.1
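For reference, the status/mask pair in the AER log above can be decoded by
hand. The following is a minimal userspace sketch, not kernel code: the helper
names aer_cor_bit_name() and aer_cor_unmasked() are made up for illustration,
and the table is assumed to follow the bit layout of the AER Correctable Error
Status register with the short names the kernel prints in such logs.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Short names for the bits of the AER Correctable Error Status register,
 * indexed by bit position. Bits not listed here are reserved and stay NULL.
 */
static const char *const aer_cor_bit_names[32] = {
	[0]  = "RxErr",       /* Receiver Error */
	[6]  = "BadTLP",      /* Bad TLP */
	[7]  = "BadDLLP",     /* Bad DLLP */
	[8]  = "Rollover",    /* REPLAY_NUM Rollover */
	[12] = "Timeout",     /* Replay Timer Timeout */
	[13] = "NonFatalErr", /* Advisory Non-Fatal Error */
	[14] = "CorrIntErr",  /* Corrected Internal Error */
	[15] = "HeaderOF",    /* Header Log Overflow */
};

/* Hypothetical helper: name of a status bit, or NULL if the bit is reserved. */
const char *aer_cor_bit_name(unsigned int bit)
{
	return bit < 32 ? aer_cor_bit_names[bit] : NULL;
}

/* Hypothetical helper: status bits that are set and not masked. */
uint32_t aer_cor_unmasked(uint32_t status, uint32_t mask)
{
	return status & ~mask;
}
```

With the values from the log, status 0x00000001 and mask 0x0000e000, only
bit 0 (RxErr) is set, and the mask covers bits 13 to 15 only, so the Receiver
Error is reported.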
* Re: [PATCH v2] PCI: Add quirk to disable ASPM L1 for Sandisk SN740 NVMe SSDs
2025-11-20 16:12 [PATCH v2] PCI: Add quirk to disable ASPM L1 for Sandisk SN740 NVMe SSDs Manivannan Sadhasivam
@ 2025-11-24 11:42 ` Konrad Dybcio
2025-11-24 23:53 ` Bjorn Helgaas
1 sibling, 0 replies; 11+ messages in thread
From: Konrad Dybcio @ 2025-11-24 11:42 UTC (permalink / raw)
To: Manivannan Sadhasivam, bhelgaas
Cc: linux-pci, linux-kernel, Manivannan Sadhasivam
On 11/20/25 5:12 PM, Manivannan Sadhasivam wrote:
> The Sandisk SN740 NVMe SSDs cause below AER errors on the upstream Root
> Port of PCIe controller in Microsoft Surface Laptop 7, when ASPM L1 is
> enabled:
>
> pcieport 0006:00:00.0: AER: Correctable error message received from 0006:01:00.0
> nvme 0006:01:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
> nvme 0006:01:00.0: device [15b7:5015] error status/mask=00000001/0000e000
> nvme 0006:01:00.0: [ 0] RxErr
>
> Hence, add a quirk to disable L1 by removing the ASPM_L1 CAP for this SSD.
>
> Reported-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
> Signed-off-by: Manivannan Sadhasivam <mani@kernel.org>
> ---
This revision also works for me, thank you
Tested-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com> # X1E80100 Romulus
Konrad
* Re: [PATCH v2] PCI: Add quirk to disable ASPM L1 for Sandisk SN740 NVMe SSDs
2025-11-20 16:12 [PATCH v2] PCI: Add quirk to disable ASPM L1 for Sandisk SN740 NVMe SSDs Manivannan Sadhasivam
2025-11-24 11:42 ` Konrad Dybcio
@ 2025-11-24 23:53 ` Bjorn Helgaas
2025-11-25 5:21 ` Manivannan Sadhasivam
1 sibling, 1 reply; 11+ messages in thread
From: Bjorn Helgaas @ 2025-11-24 23:53 UTC (permalink / raw)
To: Manivannan Sadhasivam
Cc: bhelgaas, linux-pci, linux-kernel, Manivannan Sadhasivam,
Konrad Dybcio, Alexey Bogoslavsky, Jeffrey Lien, Avinash M N
[+cc Alexey, Jeffrey, Avinash]
On Thu, Nov 20, 2025 at 09:42:53PM +0530, Manivannan Sadhasivam wrote:
> The Sandisk SN740 NVMe SSDs cause below AER errors on the upstream Root
> Port of PCIe controller in Microsoft Surface Laptop 7, when ASPM L1 is
> enabled:
>
> pcieport 0006:00:00.0: AER: Correctable error message received from 0006:01:00.0
> nvme 0006:01:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
> nvme 0006:01:00.0: device [15b7:5015] error status/mask=00000001/0000e000
> nvme 0006:01:00.0: [ 0] RxErr
Do we have any information about whether this error happens with the
SN740 on platforms other than the Surface Laptop 7? Or whether it
happens on the Surface with other endpoints?
I'm a little hesitant about quirking devices and claiming they are
defective without a solid root cause.
Sandisk folks, do you have any insight into this? Any known errata or
possibility of looking into this with an analyzer?
> Hence, add a quirk to disable L1 by removing the ASPM_L1 CAP for this SSD.
>
> Reported-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
> Signed-off-by: Manivannan Sadhasivam <mani@kernel.org>
> ---
>
> Changes in v2:
>
> * Fixed the laptop name
> * Rebased on top of v6.18-rc6 for pcie_aspm_remove_cap()
>
> drivers/pci/quirks.c | 11 +++++++++++
> 1 file changed, 11 insertions(+)
>
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index b9c252aa6fe0..adc54533df7f 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -2527,6 +2527,17 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_FREESCALE, 0x0451, quirk_disable_aspm_l0s
> DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_PASEMI, 0xa002, quirk_disable_aspm_l0s_l1);
> DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_HUAWEI, 0x1105, quirk_disable_aspm_l0s_l1);
>
> +static void quirk_disable_aspm_l1(struct pci_dev *dev)
> +{
> + pcie_aspm_remove_cap(dev, PCI_EXP_LNKCAP_ASPM_L1);
> +}
> +
> +/*
> + * Sandisk SN740 NVMe SSDs cause AER timeout errors on the upstream PCIe Root
> + * Port when ASPM L1 is enabled.
> + */
> +DECLARE_PCI_FIXUP_HEADER(0x15b7, 0x5015, quirk_disable_aspm_l1);
* Re: [PATCH v2] PCI: Add quirk to disable ASPM L1 for Sandisk SN740 NVMe SSDs
2025-11-24 23:53 ` Bjorn Helgaas
@ 2025-11-25 5:21 ` Manivannan Sadhasivam
2025-11-25 15:30 ` Alexey Bogoslavsky
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: Manivannan Sadhasivam @ 2025-11-25 5:21 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: Manivannan Sadhasivam, bhelgaas, linux-pci, linux-kernel,
Konrad Dybcio, Alexey Bogoslavsky, Jeffrey Lien, Avinash M N
On Mon, Nov 24, 2025 at 05:53:07PM -0600, Bjorn Helgaas wrote:
> [+cc Alexey, Jeffrey, Avinash]
>
> On Thu, Nov 20, 2025 at 09:42:53PM +0530, Manivannan Sadhasivam wrote:
> > The Sandisk SN740 NVMe SSDs cause below AER errors on the upstream Root
> > Port of PCIe controller in Microsoft Surface Laptop 7, when ASPM L1 is
> > enabled:
> >
> > pcieport 0006:00:00.0: AER: Correctable error message received from 0006:01:00.0
> > nvme 0006:01:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
> > nvme 0006:01:00.0: device [15b7:5015] error status/mask=00000001/0000e000
> > nvme 0006:01:00.0: [ 0] RxErr
>
> Do we have any information about whether this error happens with the
> SN740 on platforms other than the Surface Laptop 7? Or whether it
> happens on the Surface with other endpoints?
>
This device comes pre-installed with the Surface Laptop 7, I believe. It is not
very convenient to replace the NVMe in a laptop for testing.
> I'm a little hesitant about quirking devices and claiming they are
> defective without a solid root cause.
>
A few points convinced me:
* Other X1E laptops are working fine with ASPM L1.
* This laptop has WCN785x WiFi/BT combo card connected to the other controller
instance and L1 is working fine for it.
* There is no known issue with ASPM L1 in X1E chipsets.
Because of these, I was certain that the NVMe is at fault here.
> Sandisk folks, do you have any insight into this? Any known errata or
> possibility of looking into this with an analyzer?
>
I don't think Konrad has access to an analyzer, and neither do any of us.
If you are still hesitant, I'd suggest adding a platform check so that this
quirk is limited to the Surface Laptop 7:
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index adc54533df7f..1655757ba66a 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -29,6 +29,7 @@
 #include <linux/ktime.h>
 #include <linux/mm.h>
 #include <linux/nvme.h>
+#include <linux/of.h>
 #include <linux/platform_data/x86/apple.h>
 #include <linux/pm_runtime.h>
 #include <linux/sizes.h>
@@ -2527,15 +2528,19 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_FREESCALE, 0x0451, quirk_disable_aspm_l0s
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_PASEMI, 0xa002, quirk_disable_aspm_l0s_l1);
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_HUAWEI, 0x1105, quirk_disable_aspm_l0s_l1);

+/*
+ * Sandisk SN740 NVMe SSDs in Microsoft Surface Laptop 7 cause AER timeout
+ * errors on the upstream PCIe Root Port when ASPM L1 is enabled.
+ */
 static void quirk_disable_aspm_l1(struct pci_dev *dev)
 {
-	pcie_aspm_remove_cap(dev, PCI_EXP_LNKCAP_ASPM_L1);
+	struct device_node *root __free(device_node) = of_find_node_by_path("/");
+	const char *model = of_get_property(root, "compatible", NULL);
+
+	if (!strcmp(model, "microsoft,romulus13"))
+		pcie_aspm_remove_cap(dev, PCI_EXP_LNKCAP_ASPM_L1);
 }

-/*
- * Sandisk SN740 NVMe SSDs cause AER timeout errors on the upstream PCIe Root
- * Port when ASPM L1 is enabled.
- */
 DECLARE_PCI_FIXUP_HEADER(0x15b7, 0x5015, quirk_disable_aspm_l1);

 /*
This check is similar to the DMI checks we currently have for non-DT platforms.
In fact, we could also use DMI checks on this laptop, as it comes with SMBIOS.
Note: I'm not sure if Konrad's laptop is based on "microsoft,romulus13" or
"microsoft,romulus15".
- Mani
--
மணிவண்ணன் சதாசிவம்
* RE: [PATCH v2] PCI: Add quirk to disable ASPM L1 for Sandisk SN740 NVMe SSDs
2025-11-25 5:21 ` Manivannan Sadhasivam
@ 2025-11-25 15:30 ` Alexey Bogoslavsky
2025-11-27 18:41 ` Konrad Dybcio
2025-11-27 18:40 ` Konrad Dybcio
2025-12-01 6:48 ` Val Packett
2 siblings, 1 reply; 11+ messages in thread
From: Alexey Bogoslavsky @ 2025-11-25 15:30 UTC (permalink / raw)
To: Manivannan Sadhasivam, Bjorn Helgaas
Cc: Manivannan Sadhasivam, bhelgaas@google.com,
linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org,
Konrad Dybcio, Jeffrey Lien, Avinash M N
On Mon, Nov 24, 2025 at 05:53:07PM -0600, Bjorn Helgaas wrote:
> > [+cc Alexey, Jeffrey, Avinash]
> >
> > On Thu, Nov 20, 2025 at 09:42:53PM +0530, Manivannan Sadhasivam wrote:
> > > The Sandisk SN740 NVMe SSDs cause below AER errors on the upstream
> > > Root Port of PCIe controller in Microsoft Surface Laptop 7, when
> > > ASPM L1 is
> > > enabled:
> > >
> > > pcieport 0006:00:00.0: AER: Correctable error message received from 0006:01:00.0
> > > nvme 0006:01:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
> > > nvme 0006:01:00.0: device [15b7:5015] error status/mask=00000001/0000e000
> > > nvme 0006:01:00.0: [ 0] RxErr
> >
> > Do we have any information about whether this error happens with the
> > SN740 on platforms other than the Surface Laptop 7? Or whether it
> > happens on the Surface with other endpoints?
> >
> This device comes pre installed with the Surface Laptop 7 I believe. It is not very convenient to replace the NVMe in a laptop for testing.
> > I'm a little hesitant about quirking devices and claiming they are
> > defective without a solid root cause.
> >
> There are a couple of points that made me convince myself:
> * Other X1E laptops are working fine with ASPM L1.
> * This laptop has WCN785x WiFi/BT combo card connected to the other controller instance and L1 is working fine for it.
> * There is no known issue with ASPM L1 in X1E chipsets.
> Because of these, I was so certain that the NVMe is the fault here.
> > Sandisk folks, do you have any insight into this? Any known errata or
> > possibility of looking into this with an analyzer?
> >
> I don't think Konrad has access to the analyzer, neither any of us.
> If you are still hesitant, I'd suggest adding the platform check so that this quirk is only limited to the Surface Laptop 7:
We at Sandisk are currently checking and double-checking the issue on several platforms and with several devices.
We haven't reached final conclusions yet, but what's clear is that quirking out the SN740 unconditionally, for all platforms,
is definitely overkill. The device performs normally on the vast majority of platforms. Applying the quirk for the combination
of SN740 and Surface Laptop 7, as you suggested, is definitely a better choice. Still, we'd like to check a few more
platform / device combinations so we cover everything that needs to be covered while sparing the rest of the combinations.
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index adc54533df7f..1655757ba66a 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -29,6 +29,7 @@
>  #include <linux/ktime.h>
>  #include <linux/mm.h>
>  #include <linux/nvme.h>
> +#include <linux/of.h>
>  #include <linux/platform_data/x86/apple.h>
>  #include <linux/pm_runtime.h>
>  #include <linux/sizes.h>
> @@ -2527,15 +2528,19 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_FREESCALE, 0x0451, quirk_disable_aspm_l0s
>  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_PASEMI, 0xa002, quirk_disable_aspm_l0s_l1);
>  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_HUAWEI, 0x1105, quirk_disable_aspm_l0s_l1);
>
> +/*
> + * Sandisk SN740 NVMe SSDs in Microsoft Surface Laptop 7 cause AER timeout
> + * errors on the upstream PCIe Root Port when ASPM L1 is enabled.
> + */
>  static void quirk_disable_aspm_l1(struct pci_dev *dev)
>  {
> -	pcie_aspm_remove_cap(dev, PCI_EXP_LNKCAP_ASPM_L1);
> +	struct device_node *root __free(device_node) = of_find_node_by_path("/");
> +	const char *model = of_get_property(root, "compatible", NULL);
> +
> +	if (!strcmp(model, "microsoft,romulus13"))
> +		pcie_aspm_remove_cap(dev, PCI_EXP_LNKCAP_ASPM_L1);
>  }
>
> -/*
> - * Sandisk SN740 NVMe SSDs cause AER timeout errors on the upstream PCIe Root
> - * Port when ASPM L1 is enabled.
> - */
>  DECLARE_PCI_FIXUP_HEADER(0x15b7, 0x5015, quirk_disable_aspm_l1);
>
>  /*
>
> This check is similar to the DMI checks we have currently non-DT platforms.
> Infact, we have also use the DMI checks on this laptop as it comes with SMBIOS.
>
> Note: I'm not sure if Konrad's lapto is based on "microsoft,romulus13" or "microsoft,romulus15".
>
> - Mani
>
> --
> மணிவண்ணன் சதாசிவம்
* Re: [PATCH v2] PCI: Add quirk to disable ASPM L1 for Sandisk SN740 NVMe SSDs
2025-11-25 5:21 ` Manivannan Sadhasivam
2025-11-25 15:30 ` Alexey Bogoslavsky
@ 2025-11-27 18:40 ` Konrad Dybcio
2025-12-01 6:48 ` Val Packett
2 siblings, 0 replies; 11+ messages in thread
From: Konrad Dybcio @ 2025-11-27 18:40 UTC (permalink / raw)
To: Manivannan Sadhasivam, Bjorn Helgaas
Cc: Manivannan Sadhasivam, bhelgaas, linux-pci, linux-kernel,
Alexey Bogoslavsky, Jeffrey Lien, Avinash M N
On 11/25/25 6:21 AM, Manivannan Sadhasivam wrote:
> On Mon, Nov 24, 2025 at 05:53:07PM -0600, Bjorn Helgaas wrote:
>> [+cc Alexey, Jeffrey, Avinash]
>>
>> On Thu, Nov 20, 2025 at 09:42:53PM +0530, Manivannan Sadhasivam wrote:
>>> The Sandisk SN740 NVMe SSDs cause below AER errors on the upstream Root
>>> Port of PCIe controller in Microsoft Surface Laptop 7, when ASPM L1 is
>>> enabled:
>>>
>>> pcieport 0006:00:00.0: AER: Correctable error message received from 0006:01:00.0
>>> nvme 0006:01:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
>>> nvme 0006:01:00.0: device [15b7:5015] error status/mask=00000001/0000e000
>>> nvme 0006:01:00.0: [ 0] RxErr
>>
>> Do we have any information about whether this error happens with the
>> SN740 on platforms other than the Surface Laptop 7? Or whether it
>> happens on the Surface with other endpoints?
[...]
> -/*
> - * Sandisk SN740 NVMe SSDs cause AER timeout errors on the upstream PCIe Root
> - * Port when ASPM L1 is enabled.
> - */
> DECLARE_PCI_FIXUP_HEADER(0x15b7, 0x5015, quirk_disable_aspm_l1);
>
> /*
>
> This check is similar to the DMI checks we have currently non-DT platforms.
> Infact, we have also use the DMI checks on this laptop as it comes with SMBIOS.
>
> Note: I'm not sure if Konrad's lapto is based on "microsoft,romulus13" or
> "microsoft,romulus15".
15, but they're otherwise identical hardware. Please quirk off both if you
go this route.
Konrad
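A note on matching both boards: a devicetree "compatible" property is a list
of NUL-terminated strings, so a single strcmp() against the raw property only
compares its first entry. Below is a userspace sketch of a list-aware check
that accepts either board string. The helpers compat_list_has() and
is_surface_laptop7() are hypothetical names used only for illustration; in the
kernel, of_machine_is_compatible() does this kind of match against the root
node and would be the more likely tool.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: walk a "compatible"-style property (a sequence of
 * NUL-terminated strings, `len` bytes in total) and report whether any
 * entry equals `want`.
 */
bool compat_list_has(const char *prop, size_t len, const char *want)
{
	const char *p = prop;
	const char *end = prop + len;

	while (p < end) {
		const char *nul = memchr(p, '\0', (size_t)(end - p));

		if (!nul)	/* malformed: last entry is not terminated */
			break;
		if (strcmp(p, want) == 0)
			return true;
		p = nul + 1;	/* step past the terminator to the next entry */
	}
	return false;
}

/* Hypothetical helper: true for either Surface Laptop 7 board string. */
bool is_surface_laptop7(const char *prop, size_t len)
{
	return compat_list_has(prop, len, "microsoft,romulus13") ||
	       compat_list_has(prop, len, "microsoft,romulus15");
}
```

The check does not depend on where the board string sits in the list, and a
missing property can be handled by the caller before the walk starts.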
* Re: [PATCH v2] PCI: Add quirk to disable ASPM L1 for Sandisk SN740 NVMe SSDs
2025-11-25 15:30 ` Alexey Bogoslavsky
@ 2025-11-27 18:41 ` Konrad Dybcio
0 siblings, 0 replies; 11+ messages in thread
From: Konrad Dybcio @ 2025-11-27 18:41 UTC (permalink / raw)
To: Alexey Bogoslavsky, Manivannan Sadhasivam, Bjorn Helgaas
Cc: Manivannan Sadhasivam, bhelgaas@google.com,
linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org,
Jeffrey Lien, Avinash M N
On 11/25/25 4:30 PM, Alexey Bogoslavsky wrote:
> On Mon, Nov 24, 2025 at 05:53:07PM -0600, Bjorn Helgaas wrote:
>>> [+cc Alexey, Jeffrey, Avinash]
>>>
>>> On Thu, Nov 20, 2025 at 09:42:53PM +0530, Manivannan Sadhasivam wrote:
>>>> The Sandisk SN740 NVMe SSDs cause below AER errors on the upstream
>>>> Root Port of PCIe controller in Microsoft Surface Laptop 7, when
>>>> ASPM L1 is
>>>> enabled:
>>>>
>>>> pcieport 0006:00:00.0: AER: Correctable error message received from 0006:01:00.0
>>>> nvme 0006:01:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
>>>> nvme 0006:01:00.0: device [15b7:5015] error status/mask=00000001/0000e000
>>>> nvme 0006:01:00.0: [ 0] RxErr
>>>
>>> Do we have any information about whether this error happens with the
>>> SN740 on platforms other than the Surface Laptop 7? Or whether it
>>> happens on the Surface with other endpoints?
>>>
>
>> This device comes pre installed with the Surface Laptop 7 I believe. It is not very convenient to replace the NVMe in a laptop for testing.
>
>>> I'm a little hesitant about quirking devices and claiming they are
>>> defective without a solid root cause.
>>>
>
>> There are a couple of points that made me convince myself:
>
>> * Other X1E laptops are working fine with ASPM L1.
>> * This laptop has WCN785x WiFi/BT combo card connected to the other controller instance and L1 is working fine for it.
>> * There is no known issue with ASPM L1 in X1E chipsets.
>
>> Because of these, I was so certain that the NVMe is the fault here.
>
>>> Sandisk folks, do you have any insight into this? Any known errata or
>>> possibility of looking into this with an analyzer?
>>>
>
>> I don't think Konrad has access to the analyzer, neither any of us.
>
>> If you are still hesitant, I'd suggest adding the platform check so that this quirk is only limited to the Surface Laptop 7:
>
> We at Sandisk are currently checking and double-checking the issue on several platforms and with several devices.
> We haven't reached final conclusions yet, but what's clear is that quirking out SN740 unconditionally, for all platforms,
> is definitely an overkill. The device performs normally on vast majority of platforms. Applying the quirk for the combination
> of SN740 and Surface Laptop 7, as you suggested, is definitely a better choice. Still, we'd like to check a few more
> platform / device combinations so we cover everything that needs to be covered while sparing the rest of combinations.
Can we interpret that as "SN740 doesn't have L1 issues on at least one
other platform"?
In that case Mani, is there a chance we omitted some related tunable in
pci-qcom that only seems to matter for some PCI devices and not others?
Konrad
* Re: [PATCH v2] PCI: Add quirk to disable ASPM L1 for Sandisk SN740 NVMe SSDs
2025-11-25 5:21 ` Manivannan Sadhasivam
2025-11-25 15:30 ` Alexey Bogoslavsky
2025-11-27 18:40 ` Konrad Dybcio
@ 2025-12-01 6:48 ` Val Packett
2025-12-01 10:37 ` Manivannan Sadhasivam
2025-12-04 12:51 ` Konrad Dybcio
2 siblings, 2 replies; 11+ messages in thread
From: Val Packett @ 2025-12-01 6:48 UTC (permalink / raw)
To: Manivannan Sadhasivam, Bjorn Helgaas
Cc: Manivannan Sadhasivam, bhelgaas, linux-pci, linux-kernel,
Konrad Dybcio, Alexey Bogoslavsky, Jeffrey Lien, Avinash M N
On 11/25/25 2:21 AM, Manivannan Sadhasivam wrote:
> [..]
> There are a couple of points that made me convince myself:
>
> * Other X1E laptops are working fine with ASPM L1.
> * This laptop has WCN785x WiFi/BT combo card connected to the other controller
> instance and L1 is working fine for it.
> * There is no known issue with ASPM L1 in X1E chipsets.
>
> Because of these, I was so certain that the NVMe is the fault here.
There is *a* known issue with ASPM L1 on X1E, reported by maaaany users
on #aarch64-laptops, that we discussed in another thread..
But it is a full system freeze, **not** a correctable AER message, and
it definitely happens with a bunch of various SSDs on various laptops. I
personally have had it happen both with the SN740 and an SK Hynix drive,
on a Latitude 7455. It's an SSD-only issue (disabling ASPM just for the
drive, but keeping it on for the WiFi, was enough to get to month-long
uptime) but not specific to any SSD model.
One bit of news I have about it is that I recently started using EL2
(slbounce), and I did see something that looked like that hang.. but
unlike in EL1, right before the reboot the panic LED did start blinking.
So if that was indeed from the same issue, I should now be able to catch
it into pstore (if pstore works.. trying blk with sdhc instead of efi
now 0.o) Maybe QHEE was eating the fault and itself crashing, since it
"owns" the PCIe IOMMU when it's running.. (???)
~val
* Re: [PATCH v2] PCI: Add quirk to disable ASPM L1 for Sandisk SN740 NVMe SSDs
2025-12-01 6:48 ` Val Packett
@ 2025-12-01 10:37 ` Manivannan Sadhasivam
2025-12-04 12:51 ` Konrad Dybcio
1 sibling, 0 replies; 11+ messages in thread
From: Manivannan Sadhasivam @ 2025-12-01 10:37 UTC (permalink / raw)
To: Val Packett
Cc: Bjorn Helgaas, Manivannan Sadhasivam, bhelgaas, linux-pci,
linux-kernel, Konrad Dybcio, Alexey Bogoslavsky, Jeffrey Lien,
Avinash M N
On Mon, Dec 01, 2025 at 03:48:13AM -0300, Val Packett wrote:
>
> On 11/25/25 2:21 AM, Manivannan Sadhasivam wrote:
> > [..]
> > There are a couple of points that made me convince myself:
> >
> > * Other X1E laptops are working fine with ASPM L1.
> > * This laptop has WCN785x WiFi/BT combo card connected to the other controller
> > instance and L1 is working fine for it.
> > * There is no known issue with ASPM L1 in X1E chipsets.
> >
> > Because of these, I was so certain that the NVMe is the fault here.
>
> There is *a* known issue with ASPM L1 on X1E, reported by maaaany users on
> #aarch64-laptops, that we discussed in another thread..
>
The other thread you are referring to is this one I believe:
https://lore.kernel.org/linux-pci/21398de7-3dd9-4c43-97d9-7c3002c401e5@packett.cool/
From this, I cannot conclude that the controller was at fault. At least, not
so far.
> But it is a full system freeze, **not** a correctable AER message, and it
> definitely happens with a bunch of various SSDs on various laptops. I
> personally have had it happen both with the SN740 and an SK Hynix drive, on
> a Latitude 7455. It's an SSD-only issue (disabling ASPM just for the drive,
> but keeping it on for the WiFi, was enough to get to month-long uptime) but
> not specific to any SSD model.
>
Please confirm whether you disabled all ASPM states (L0s, L1 and L1ss) or just
L1ss for the controller instance where the SSD is connected. Starting from
v6.18-rc3, only L0s and L1 will be enabled by default without any
cmdline/Kconfig changes.
> One bit of news I have about it is that I recently started using EL2
> (slbounce), and I did see something that looked like that hang.. but unlike
> in EL1, right before the reboot the panic LED did start blinking. So if that
> was indeed from the same issue, I should now be able to catch it into pstore
> (if pstore works.. trying blk with sdhc instead of efi now 0.o)
That would be helpful. I guess Abel did it on XPS13, but need to check more.
> Maybe QHEE
> was eating the fault and itself crashing, since it "owns" the PCIe IOMMU
> when it's running.. (???)
>
Yes, they all are captured by QHEE for post mortem analysis that could only be
performed using Qcom tools and on non-production devices. I don't know how to
capture those logs on production laptops.
Anyhow, to isolate this issue to ASPM L1 on the X1E PCIe controller, please
disable all ASPM states by selecting CONFIG_PCIEASPM_PERFORMANCE in Kconfig and
let it run. If you do not see the crash at all for some time (or days), then the
crash was related to an ASPM issue in the controller (since you said the crash was
reproducible with other SSDs as well). If not, there is something else going wrong.
- Mani
--
மணிவண்ணன் சதாசிவம்
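To confirm which ASPM policy a running kernel actually uses after such a
Kconfig change, the policy line exposed at
/sys/module/pcie_aspm/parameters/policy can be inspected; the active entry is
the one in square brackets. The sketch below only parses such a line. The
function name aspm_active_policy() is hypothetical, and reading the file is
left to the caller.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy the bracketed (active) entry of an ASPM policy
 * line, e.g. "default [performance] powersave powersupersave", into `out`.
 * Returns 0 on success, -1 if no bracketed entry is found or it does not fit.
 */
int aspm_active_policy(const char *line, char *out, size_t outlen)
{
	const char *open = strchr(line, '[');
	const char *close = open ? strchr(open, ']') : NULL;
	size_t n;

	if (!close)
		return -1;

	n = (size_t)(close - open) - 1;	/* characters between '[' and ']' */
	if (n + 1 > outlen)		/* need room for the terminator */
		return -1;

	memcpy(out, open + 1, n);
	out[n] = '\0';
	return 0;
}
```

If the reported policy is "performance", ASPM states are expected to be
disabled by the kernel, which is the condition the test run described above
relies on.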
* Re: [PATCH v2] PCI: Add quirk to disable ASPM L1 for Sandisk SN740 NVMe SSDs
2025-12-01 6:48 ` Val Packett
2025-12-01 10:37 ` Manivannan Sadhasivam
@ 2025-12-04 12:51 ` Konrad Dybcio
2025-12-04 21:28 ` Val Packett
1 sibling, 1 reply; 11+ messages in thread
From: Konrad Dybcio @ 2025-12-04 12:51 UTC (permalink / raw)
To: Val Packett, Manivannan Sadhasivam, Bjorn Helgaas
Cc: Manivannan Sadhasivam, bhelgaas, linux-pci, linux-kernel,
Alexey Bogoslavsky, Jeffrey Lien, Avinash M N
On 12/1/25 7:48 AM, Val Packett wrote:
>
> On 11/25/25 2:21 AM, Manivannan Sadhasivam wrote:
>> [..]
>> There are a couple of points that made me convince myself:
>>
>> * Other X1E laptops are working fine with ASPM L1.
>> * This laptop has WCN785x WiFi/BT combo card connected to the other controller
>> instance and L1 is working fine for it.
>> * There is no known issue with ASPM L1 in X1E chipsets.
>>
>> Because of these, I was so certain that the NVMe is the fault here.
>
> There is *a* known issue with ASPM L1 on X1E, reported by maaaany users on #aarch64-laptops, that we discussed in another thread..
>
> But it is a full system freeze, **not** a correctable AER message, and it definitely happens with a bunch of various SSDs on various laptops. I personally have had it happen both with the SN740 and an SK Hynix drive, on a Latitude 7455. It's an SSD-only issue (disabling ASPM just for the drive, but keeping it on for the WiFi, was enough to get to month-long uptime) but not specific to any SSD model.
Are the steps to reproduce roughly
* boot without disabling ASPM
* wait
* system reboots on its own (or just freezes?)
?
Konrad
* Re: [PATCH v2] PCI: Add quirk to disable ASPM L1 for Sandisk SN740 NVMe SSDs
2025-12-04 12:51 ` Konrad Dybcio
@ 2025-12-04 21:28 ` Val Packett
0 siblings, 0 replies; 11+ messages in thread
From: Val Packett @ 2025-12-04 21:28 UTC (permalink / raw)
To: Konrad Dybcio, Manivannan Sadhasivam, Bjorn Helgaas
Cc: Manivannan Sadhasivam, bhelgaas, linux-pci, linux-kernel
On 12/4/25 9:51 AM, Konrad Dybcio wrote:
> On 12/1/25 7:48 AM, Val Packett wrote:
>> On 11/25/25 2:21 AM, Manivannan Sadhasivam wrote:
>>> [..]
>>> There are a couple of points that made me convince myself:
>>>
>>> * Other X1E laptops are working fine with ASPM L1.
>>> * This laptop has WCN785x WiFi/BT combo card connected to the other controller
>>> instance and L1 is working fine for it.
>>> * There is no known issue with ASPM L1 in X1E chipsets.
>>>
>>> Because of these, I was so certain that the NVMe is the fault here.
>> There is *a* known issue with ASPM L1 on X1E, reported by maaaany users on #aarch64-laptops, that we discussed in another thread..
>>
>> But it is a full system freeze, **not** a correctable AER message, and it definitely happens with a bunch of various SSDs on various laptops. I personally have had it happen both with the SN740 and an SK Hynix drive, on a Latitude 7455. It's an SSD-only issue (disabling ASPM just for the drive, but keeping it on for the WiFi, was enough to get to month-long uptime) but not specific to any SSD model.
> Are the steps to reproduce roughly
>
> * boot without disabling ASPM
> * wait
> * system reboots on its own (or just freezes?)
>
> ?
Yeah.
Wait can be anywhere from minutes to days, it seems completely random
and "luck based".
In EL1, the system freezes for a minute and gets rebooted by the watchdog.
In EL2 as I have just now discovered, some cores can still be running
(presumably those that haven't tried accessing the drive) as others
hang, and we can get a proper panic, I got this logged to efi_pstore:
<0>[ 1500.017790] watchdog: CPU3: Watchdog detected hard LOCKUP on cpu 4
<4>[ 1500.017801] Modules linked in: [..]
<6>[ 1500.017937] Sending NMI from CPU 3 to CPUs 4:
<0>[ 1510.017956] Kernel panic - not syncing: Hard LOCKUP
<4>[ 1510.017970] Call trace: [one with watchdog_hardlockup_check, from
CPU3]
<2>[ 1510.018062] SMP: stopping secondary CPUs
<4>[ 1511.085450] SMP: failed to stop secondary CPUs 4-11
No traces from the frozen cores are logged as they don't respond to NMI.
They are *completely* wedged.
~val
end of thread, other threads:[~2025-12-04 21:28 UTC | newest]
Thread overview: 11+ messages
2025-11-20 16:12 [PATCH v2] PCI: Add quirk to disable ASPM L1 for Sandisk SN740 NVMe SSDs Manivannan Sadhasivam
2025-11-24 11:42 ` Konrad Dybcio
2025-11-24 23:53 ` Bjorn Helgaas
2025-11-25 5:21 ` Manivannan Sadhasivam
2025-11-25 15:30 ` Alexey Bogoslavsky
2025-11-27 18:41 ` Konrad Dybcio
2025-11-27 18:40 ` Konrad Dybcio
2025-12-01 6:48 ` Val Packett
2025-12-01 10:37 ` Manivannan Sadhasivam
2025-12-04 12:51 ` Konrad Dybcio
2025-12-04 21:28 ` Val Packett