* Are AER corrected errors worrying?
@ 2021-01-01 22:40 Samuel Thibault
2021-01-04 18:44 ` Keith Busch
0 siblings, 1 reply; 8+ messages in thread
From: Samuel Thibault @ 2021-01-01 22:40 UTC
To: linux-nvme
Hello,
Our lab has bought a new Dell Latitude 5410 laptop. I installed Debian
bullseye on it with kernel 5.9.0-5-amd64, but it spits out these
errors now and then (sometimes a dozen per minute):
Jan 1 23:30:53 begin kernel: [ 46.675818] pcieport 0000:00:1d.0: AER: Corrected error received: 0000:02:00.0
Jan 1 23:30:53 begin kernel: [ 46.675933] nvme 0000:02:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Jan 1 23:30:53 begin kernel: [ 46.676048] nvme 0000:02:00.0: device [15b7:5006] error status/mask=00000001/0000e000
Jan 1 23:30:53 begin kernel: [ 46.676140] nvme 0000:02:00.0: [ 0] RxErr
Since the error is corrected, it is not an actual failure, but how worrying
is it to see such errors on new hardware? Documentation/PCI/pcieaer-howto.rst
does not say whether we should expect to see some of them. I
see forum posts suggesting pci=noaer to stop the error logging, but is
that really the right thing to do?
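The status/mask pair in the log above can be read bit by bit. What follows is a minimal Python sketch, assuming the bit layout of the AER Correctable Error Status and Mask registers from the PCIe specification. The helper names (decode, parse_status_mask) and the short labels for bits 14 and 15 are illustrative and not taken from the kernel.

```python
# Bit positions of the AER Correctable Error Status/Mask registers
# (assumed from the PCIe specification; labels mostly follow lspci's CESta line,
# the labels for bits 14 and 15 are illustrative).
CORRECTABLE_BITS = {
    0: "RxErr",            # Receiver Error
    6: "BadTLP",           # Bad TLP
    7: "BadDLLP",          # Bad DLLP
    8: "Rollover",         # REPLAY_NUM Rollover
    12: "Timeout",         # Replay Timer Timeout
    13: "AdvNonFatalErr",  # Advisory Non-Fatal Error
    14: "CorrIntErr",      # Corrected Internal Error
    15: "HeaderOF",        # Header Log Overflow
}


def decode(value):
    """Return the names of the correctable-error bits set in value."""
    return [name for bit, name in sorted(CORRECTABLE_BITS.items())
            if value & (1 << bit)]


def parse_status_mask(line):
    """Extract (status, mask) from a '... error status/mask=X/Y' log line."""
    field = line.rsplit("status/mask=", 1)[1].split()[0]
    status, mask = (int(part, 16) for part in field.split("/"))
    return status, mask


line = ("nvme 0000:02:00.0: device [15b7:5006] "
        "error status/mask=00000001/0000e000")
status, mask = parse_status_mask(line)
print("reported:", decode(status))
print("masked:  ", decode(mask))
```

Under these assumptions, the only reported bit is RxErr, matching the "[ 0] RxErr" line, and the mask covers only the three highest defined bits, so RxErr itself is not masked.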
Samuel
_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme
* Re: Are AER corrected errors worrying?
2021-01-01 22:40 Are AER corrected errors worrying? Samuel Thibault
@ 2021-01-04 18:44 ` Keith Busch
2021-01-04 20:12 ` Samuel Thibault
0 siblings, 1 reply; 8+ messages in thread
From: Keith Busch @ 2021-01-04 18:44 UTC
To: Samuel Thibault; +Cc: linux-nvme
On Fri, Jan 01, 2021 at 11:40:28PM +0100, Samuel Thibault wrote:
> Hello,
>
> Our lab has bought a new Dell Latitude 5410 laptop. I installed Debian
> bullseye on it with kernel 5.9.0-5-amd64, but it spits out these
> errors now and then (sometimes a dozen per minute):
>
> Jan 1 23:30:53 begin kernel: [ 46.675818] pcieport 0000:00:1d.0: AER: Corrected error received: 0000:02:00.0
> Jan 1 23:30:53 begin kernel: [ 46.675933] nvme 0000:02:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
> Jan 1 23:30:53 begin kernel: [ 46.676048] nvme 0000:02:00.0: device [15b7:5006] error status/mask=00000001/0000e000
> Jan 1 23:30:53 begin kernel: [ 46.676140] nvme 0000:02:00.0: [ 0] RxErr
>
> Since the error is corrected, it is not an actual failure, but how worrying
> is it to see such errors on new hardware? Documentation/PCI/pcieaer-howto.rst
> does not say whether we should expect to see some of them. I
> see forum posts suggesting pci=noaer to stop the error logging, but is
> that really the right thing to do?
Additional work has to happen to correct a receiver error, so it's
possible you're getting degraded performance. You may not notice worse
performance if these are infrequent enough, though.
Sometimes these types of errors occur from low power settings, so you
can try disabling the automatic management of these (assuming the
hardware supports it). To disable nvme specific power state transitions,
the kernel parameter is "nvme_core.default_ps_max_latency_us=0". PCI
also has automatic link power savings that you can disable with
parameter "pcie_aspm=off". It might be worth seeing if either of those
changes your observation.
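One way to confirm that the two parameters actually reached the kernel after a reboot is to look at /proc/cmdline. A minimal Python sketch follows, assuming simple space-separated name=value tokens with no quoting. The helper names and the fallback command line are made up for illustration.

```python
from pathlib import Path

# The two parameters suggested above, with the values that disable the feature.
SUGGESTED = {
    "nvme_core.default_ps_max_latency_us": "0",
    "pcie_aspm": "off",
}


def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def check(cmdline):
    """Return {parameter: bool} telling whether each suggestion is present."""
    params = parse_cmdline(cmdline)
    return {name: params.get(name) == value
            for name, value in SUGGESTED.items()}


if __name__ == "__main__":
    proc = Path("/proc/cmdline")
    # Fall back to a hypothetical command line when /proc is unavailable.
    cmdline = (proc.read_text() if proc.exists()
               else "root=/dev/nvme0n1p2 ro quiet pcie_aspm=off")
    for name, present in check(cmdline).items():
        print(f"{name}: {'set' if present else 'not set'}")
```

Testing one parameter at a time, as suggested above, keeps it clear which of the two power-saving mechanisms is involved.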
* Re: Are AER corrected errors worrying?
2021-01-04 18:44 ` Keith Busch
@ 2021-01-04 20:12 ` Samuel Thibault
2021-01-04 21:36 ` Samuel Thibault
0 siblings, 1 reply; 8+ messages in thread
From: Samuel Thibault @ 2021-01-04 20:12 UTC
To: Keith Busch, Vidya Sagar; +Cc: linux-pci, linux-nvme
Hello,
Vidya Sagar wrote:
> Since this is a laptop, I'm suspecting that ASPM states might have
> been enabled which could be causing these errors.
Keith Busch, on Mon 04 Jan 2021 10:44:35 -0800, wrote:
> Sometimes these types of errors occur from low power settings, so you
> can try disabling the automatic management of these (assuming the
> hardware supports it). To disable nvme specific power state transitions,
> the kernel parameter is "nvme_core.default_ps_max_latency_us=0".
I have tried to add it, and this one line changed in lspci -vv:
02:00.0 Non-Volatile memory controller: Sandisk Corp WD Black SN750 / PC SN730 NVMe SSD (prog-if 02 [NVM Express])
[...]
Capabilities: [c0] Express (v2) Endpoint, MSI 00
[...]
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
changed to
DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
That last value happens to be what I was seeing for that line with the
manufacturer-provided Ubuntu kernel.
So far (30 minutes of uptime) there has been no corrected error report. I'll
watch over the coming hours/days to see whether that avoids the issue. I
wasn't able to trigger such corrected errors by loading the machine, so
possibly I should have been trying the opposite: letting it go into low power :)
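Comparing lspci -vv flag lines by eye is error-prone, so here is a small Python sketch that diffs two such lines. It assumes the "Label: Flag+ Flag-" layout shown above. parse_flags and changed_flags are hypothetical helper names, not part of any lspci tooling.

```python
def parse_flags(line):
    """Parse 'DevSta: CorrErr+ NonFatalErr-' into ('DevSta', {name: bool})."""
    label, _, rest = line.strip().partition(":")
    flags = {}
    for token in rest.split():
        # Only tokens ending in '+' or '-' are flags; others (e.g. 'MSI 00') are skipped.
        if token[-1] in "+-":
            flags[token[:-1]] = token.endswith("+")
    return label, flags


def changed_flags(before, after):
    """Return {flag: (old, new)} for flags whose state differs between two lines."""
    label_before, flags_before = parse_flags(before)
    label_after, flags_after = parse_flags(after)
    if label_before != label_after:
        raise ValueError(f"cannot compare {label_before!r} with {label_after!r}")
    return {name: (flags_before[name], flags_after[name])
            for name in flags_before
            if name in flags_after and flags_before[name] != flags_after[name]}


before = "DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-"
after = "DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-"
print(changed_flags(before, after))
```

For the two DevSta lines quoted above, the flags that differ are CorrErr and UnsupReq, both going from clear to set.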
> PCI also has automatic link power savings that you can disable with
> parameter "pcie_aspm=off".
I'll try that if I still see errors with the nvme_core parameter.
Samuel
* Re: Are AER corrected errors worrying?
2021-01-04 20:12 ` Samuel Thibault
@ 2021-01-04 21:36 ` Samuel Thibault
2021-01-04 22:33 ` Samuel Thibault
2021-01-06 20:28 ` Samuel Thibault
0 siblings, 2 replies; 8+ messages in thread
From: Samuel Thibault @ 2021-01-04 21:36 UTC
To: Keith Busch, Vidya Sagar, linux-nvme, linux-pci
Samuel Thibault, on Mon 04 Jan 2021 21:12:47 +0100, wrote:
> Vidya Sagar wrote:
> > Since this is a laptop, I'm suspecting that ASPM states might have
> > been enabled which could be causing these errors.
>
> Keith Busch, on Mon 04 Jan 2021 10:44:35 -0800, wrote:
> > Sometimes these types of errors occur from low power settings, so you
> > can try disabling the automatic management of these (assuming the
> > hardware supports it). To disable nvme specific power state transitions,
> > the kernel parameter is "nvme_core.default_ps_max_latency_us=0".
>
> I have tried to add it,
>
> I'll watch in the coming
> hours/days to see if that avoided the issue.
I did get one
Jan 4 22:34:53 begin kernel: [ 7165.207562] pcieport 0000:00:1d.0: AER: Corrected error received: 0000:02:00.0
Jan 4 22:34:53 begin kernel: [ 7165.213891] nvme 0000:02:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Jan 4 22:34:53 begin kernel: [ 7165.216949] nvme 0000:02:00.0: device [15b7:5006] error status/mask=00000001/0000e000
Jan 4 22:34:53 begin kernel: [ 7165.219995] nvme 0000:02:00.0: [ 0] RxErr
> > PCI also has automatic link power savings that you can disable with
> > parameter "pcie_aspm=off".
>
> I'll try that if I still see errors with the nvme_core parameter.
I'm on it.
Samuel
* Re: Are AER corrected errors worrying?
2021-01-04 21:36 ` Samuel Thibault
@ 2021-01-04 22:33 ` Samuel Thibault
2021-01-06 20:28 ` Samuel Thibault
1 sibling, 0 replies; 8+ messages in thread
From: Samuel Thibault @ 2021-01-04 22:33 UTC
To: Keith Busch, Vidya Sagar, linux-nvme, linux-pci
Samuel Thibault, on Mon 04 Jan 2021 22:36:48 +0100, wrote:
> Samuel Thibault, on Mon 04 Jan 2021 21:12:47 +0100, wrote:
> > Keith Busch, on Mon 04 Jan 2021 10:44:35 -0800, wrote:
> > > PCI also has automatic link power savings that you can disable with
> > > parameter "pcie_aspm=off".
> >
> > I'll try that if I still see errors with the nvme_core parameter.
>
> I'm on it.
(For the record, pcie_aspm=off changed these lines in lspci -vv:)
02:00.0 Non-Volatile memory controller: Sandisk Corp WD Black SN750 / PC SN730 NVMe SSD (prog-if 02 [NVM Express])
[...]
Capabilities: [c0] Express (v2) Endpoint, MSI 00
[...]
DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
-> DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
-> DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
[...]
Capabilities: [100 v2] Advanced Error Reporting
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
-> CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
[...]
Capabilities: [900 v1] L1 PM Substates
[...]
L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
T_CommonMode=0us LTR1.2_Threshold=163840ns
-> T_CommonMode=0us LTR1.2_Threshold=81920ns
Samuel
* Re: Are AER corrected errors worrying?
2021-01-04 21:36 ` Samuel Thibault
2021-01-04 22:33 ` Samuel Thibault
@ 2021-01-06 20:28 ` Samuel Thibault
2021-01-06 21:48 ` Keith Busch
1 sibling, 1 reply; 8+ messages in thread
From: Samuel Thibault @ 2021-01-06 20:28 UTC
To: Keith Busch, Vidya Sagar, linux-nvme, linux-pci
Samuel Thibault, on Mon 04 Jan 2021 22:36:48 +0100, wrote:
> Samuel Thibault, on Mon 04 Jan 2021 21:12:47 +0100, wrote:
> > Vidya Sagar wrote:
> > > Since this is a laptop, I'm suspecting that ASPM states might have
> > > been enabled which could be causing these errors.
> >
> > Keith Busch, on Mon 04 Jan 2021 10:44:35 -0800, wrote:
> > > Sometimes these types of errors occur from low power settings, so you
> > > can try disabling the automatic management of these (assuming the
> > > hardware supports it). To disable nvme specific power state transitions,
> > > the kernel parameter is "nvme_core.default_ps_max_latency_us=0".
> >
> > I have tried to add it,
> >
> > I'll watch in the coming
> > hours/days to see if that avoided the issue.
>
> I did get one
>
> Jan 4 22:34:53 begin kernel: [ 7165.207562] pcieport 0000:00:1d.0: AER: Corrected error received: 0000:02:00.0
> Jan 4 22:34:53 begin kernel: [ 7165.213891] nvme 0000:02:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
> Jan 4 22:34:53 begin kernel: [ 7165.216949] nvme 0000:02:00.0: device [15b7:5006] error status/mask=00000001/0000e000
> Jan 4 22:34:53 begin kernel: [ 7165.219995] nvme 0000:02:00.0: [ 0] RxErr
>
> > > PCI also has automatic link power savings that you can disable with
> > > parameter "pcie_aspm=off".
> >
> > I'll try that if I still see errors with the nvme_core parameter.
>
> I'm on it.
I set the machine up to do nothing but run apt-get update every 10 minutes
for 24 hours.
With pcie_aspm=off, I didn't get any corrected errors.
Without it, I got 39 corrected errors.
So that seems very relevant :)
Is there more I can provide to investigate whether this can somehow be fixed
in the driver? I guess I can safely use the system with pcie_aspm=off?
(The energy saving seems negligible.)
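To make the with/without comparison repeatable, the corrected errors can be counted straight from the kernel log. This is a minimal Python sketch, assuming log lines in the format quoted earlier in this thread. count_corrected and per_hour are illustrative names, and the sample lines are copied from the log above.

```python
import re
from collections import Counter

# Matches the root-port message and captures the address of the reporting device.
PATTERN = re.compile(r"AER: Corrected error received: (\S+)")


def count_corrected(lines):
    """Count 'Corrected error received' messages per reporting device address."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def per_hour(count, hours):
    """Average number of events per hour over the observation window."""
    return count / hours


sample = [
    "Jan 4 22:34:53 begin kernel: [ 7165.207562] pcieport 0000:00:1d.0: "
    "AER: Corrected error received: 0000:02:00.0",
    "Jan 4 22:34:53 begin kernel: [ 7165.213891] nvme 0000:02:00.0: "
    "PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)",
]
print(count_corrected(sample))
# Rate for the 24-hour run without pcie_aspm=off (39 errors).
print(per_hour(39, 24))
```

Only the pcieport line is counted, so each error event is counted once even though the kernel prints four lines for it.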
Samuel
* Re: Are AER corrected errors worrying?
2021-01-06 20:28 ` Samuel Thibault
@ 2021-01-06 21:48 ` Keith Busch
2021-01-06 22:40 ` Samuel Thibault
0 siblings, 1 reply; 8+ messages in thread
From: Keith Busch @ 2021-01-06 21:48 UTC
To: Samuel Thibault, Vidya Sagar, linux-nvme, linux-pci
On Wed, Jan 06, 2021 at 09:28:23PM +0100, Samuel Thibault wrote:
> Samuel Thibault, on Mon 04 Jan 2021 22:36:48 +0100, wrote:
> > Samuel Thibault, on Mon 04 Jan 2021 21:12:47 +0100, wrote:
> > > Vidya Sagar wrote:
> > > > Since this is a laptop, I'm suspecting that ASPM states might have
> > > > been enabled which could be causing these errors.
> > >
> > > Keith Busch, on Mon 04 Jan 2021 10:44:35 -0800, wrote:
> > > > Sometimes these types of errors occur from low power settings, so you
> > > > can try disabling the automatic management of these (assuming the
> > > > hardware supports it). To disable nvme specific power state transitions,
> > > > the kernel parameter is "nvme_core.default_ps_max_latency_us=0".
> > >
> > > I have tried to add it,
> > >
> > > I'll watch in the coming
> > > hours/days to see if that avoided the issue.
> >
> > I did get one
> >
> > Jan 4 22:34:53 begin kernel: [ 7165.207562] pcieport 0000:00:1d.0: AER: Corrected error received: 0000:02:00.0
> > Jan 4 22:34:53 begin kernel: [ 7165.213891] nvme 0000:02:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
> > Jan 4 22:34:53 begin kernel: [ 7165.216949] nvme 0000:02:00.0: device [15b7:5006] error status/mask=00000001/0000e000
> > Jan 4 22:34:53 begin kernel: [ 7165.219995] nvme 0000:02:00.0: [ 0] RxErr
> >
> > > > PCI also has automatic link power savings that you can disable with
> > > > parameter "pcie_aspm=off".
> > >
> > > I'll try that if I still see errors with the nvme_core parameter.
> >
> > I'm on it.
>
> I set the machine up to do nothing but run apt-get update every 10 minutes
> for 24 hours.
>
> With pcie_aspm=off, I didn't get any corrected errors.
> Without it, I got 39 corrected errors.
>
> So that seems very relevant :)
>
> Is there more I can provide to investigate whether this can somehow be fixed
> in the driver? I guess I can safely use the system with pcie_aspm=off?
> (The energy saving seems negligible.)
I don't think there's more to do from the kernel or driver beyond
disabling usage of the problematic feature. I think a proper fix would
have to come from the hardware vendor.
* Re: Are AER corrected errors worrying?
2021-01-06 21:48 ` Keith Busch
@ 2021-01-06 22:40 ` Samuel Thibault
0 siblings, 0 replies; 8+ messages in thread
From: Samuel Thibault @ 2021-01-06 22:40 UTC
To: Keith Busch; +Cc: linux-pci, linux-nvme, Vidya Sagar
Keith Busch, on Wed 06 Jan 2021 13:48:08 -0800, wrote:
> On Wed, Jan 06, 2021 at 09:28:23PM +0100, Samuel Thibault wrote:
> > Is there more I can provide to investigate whether this can somehow be fixed
> > in the driver? I guess I can safely use the system with pcie_aspm=off?
> > (The energy saving seems negligible.)
>
> I don't think there's more to do from the kernel or driver beyond
> disabling usage of the problematic feature. I think a proper fix would
> have to come from the hardware vendor.
Ok, thanks!
Samuel