Re: 6.1.0-17: nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI

public inbox for linux-nvme@lists.infradead.org

* Re: 6.1.0-17: nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
       [not found] <CAO9zADza=73GsuzAcuyH-YfhS34qjkDtuJjGBReVGpfE6KN_ow@mail.gmail.com>
@ 2024-10-21 19:14 ` Bjorn Helgaas
  2024-10-21 19:32   ` Justin Piszcz
  0 siblings, 1 reply; 2+ messages in thread
From: Bjorn Helgaas @ 2024-10-21 19:14 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: LKML, linux-nvme

On Fri, Jan 05, 2024 at 09:49:58AM -0500, Justin Piszcz wrote:
> Hello,
> 
> Distribution: Debian Stable x86-64
> Kernel: 6.1.0-17
> 
> Reporting this as requested from the kernel message, I have now
> appended the recommended kernel boot parameters
> nvme_core.default_ps_max_latency_us=0 pcie_aspm=off and will see if
> this recurs.

Hi Justin, did anything ever come of this report?  Is it reproducible?
Did it seem to be related to suspend/resume?
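
For anyone wanting to confirm that the two parameters quoted above
actually took effect after a reboot, here is a minimal sketch in Python.
It assumes the running kernel's command line is available as a string
(on Linux, normally the contents of /proc/cmdline). The helper name
missing_params is hypothetical and not part of any tool mentioned in
this thread.

```python
# Hypothetical helper: report which of the recommended parameters are
# absent from a kernel command line string.
RECOMMENDED = (
    "nvme_core.default_ps_max_latency_us=0",
    "pcie_aspm=off",
)


def missing_params(cmdline, wanted=RECOMMENDED):
    """Return the entries of `wanted` that do not appear as whole tokens."""
    present = set(cmdline.split())
    return [param for param in wanted if param not in present]
```

Passing it the text read from /proc/cmdline should give an empty list
once both parameters are in effect. The check is exact-token only, so a
different value such as pcie_aspm=force would be reported as missing.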

> Jan  5 06:18:52 atom kernel: [295306.524933] pcieport 0000:00:06.0:
> AER: Corrected error received: 0000:00:06.0
> Jan  5 06:18:52 atom kernel: [295306.524979] pcieport 0000:00:06.0:
> PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
> Jan  5 06:18:52 atom kernel: [295306.525004] pcieport 0000:00:06.0:
> device [8086:a74d] error status/mask=00000001/00002000
> Jan  5 06:18:52 atom kernel: [295306.525027] pcieport 0000:00:06.0:
> [ 0] RxErr                  (First)
> Jan  5 06:19:22 atom kernel: [295336.554420] nvme nvme0: controller is
> down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
> Jan  5 06:19:22 atom kernel: [295336.554469] nvme nvme0: Does your
> device have a faulty power saving mode enabled?
> Jan  5 06:19:22 atom kernel: [295336.554489] nvme nvme0: Try
> "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
> Jan  5 06:19:22 atom kernel: [295336.614521] nvme 0000:03:00.0: Unable
> to change power state from D3cold to D0, device inaccessible
> Jan  5 06:19:22 atom kernel: [295336.614898] nvme nvme0: Removing
> after probe failure status: -19
> Jan  5 06:19:22 atom kernel: [295336.630497] nvme0n1: detected
> capacity change from 7814037168 to 0
> Jan  5 06:19:22 atom kernel: [295336.630502] BTRFS error (device
> nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 1, rd 0, flush 0, corrupt 0,
> gen 0
> Jan  5 06:19:22 atom kernel: [295336.630513] BTRFS error (device
> nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 2, rd 0, flush 0, corrupt 0,
> gen 0
> Jan  5 06:19:22 atom kernel: [295336.630542] BTRFS error (device
> nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 3, rd 0, flush 0, corrupt 0,
> gen 0
> 
> Regards,
> Justin
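
As a reading aid for the log above, here is a small Python sketch that
decodes the "error status/mask" pair from the AER line and flags
all-ones register reads. The bit table reflects my understanding of the
PCIe Correctable Error Status register layout and the short names the
Linux AER driver prints, so treat it as an assumption to check against
the spec. The function names are hypothetical.

```python
# Correctable Error Status bit positions (assumed from the PCIe AER
# capability), labelled with the short names seen in kernel AER output.
AER_CORRECTABLE_BITS = {
    0: "RxErr",
    6: "BadTLP",
    7: "BadDLLP",
    8: "Rollover",
    12: "Timeout",
    13: "NonFatalErr",
    14: "CorrIntErr",
    15: "HeaderOF",
}


def parse_status_mask(field):
    """Split a 'status/mask' hex pair such as '00000001/00002000'."""
    status, mask = field.split("/")
    return int(status, 16), int(mask, 16)


def decode_correctable(value):
    """Return the names of all known bits set in a correctable-error value."""
    return [
        name
        for bit, name in sorted(AER_CORRECTABLE_BITS.items())
        if value & (1 << bit)
    ]


def looks_unreachable(value, width_bits):
    """True if every bit of a register read is 1, as in CSTS=0xffffffff."""
    return value == (1 << width_bits) - 1
```

With the values from the log, decode_correctable on the status word
0x00000001 gives only RxErr, which matches the "[ 0] RxErr" line, and
the mask 0x00002000 corresponds to NonFatalErr. Both CSTS=0xffffffff
and PCI_STATUS=0xffff are all-ones reads, which usually means the
device had stopped responding on the bus rather than reporting real
register contents.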



* RE: 6.1.0-17: nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
  2024-10-21 19:14 ` 6.1.0-17: nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff Bjorn Helgaas
@ 2024-10-21 19:32   ` Justin Piszcz
  0 siblings, 0 replies; 2+ messages in thread
From: Justin Piszcz @ 2024-10-21 19:32 UTC (permalink / raw)
  To: 'Bjorn Helgaas'; +Cc: 'LKML', linux-nvme



-----Original Message-----
From: Bjorn Helgaas <helgaas@kernel.org> 
Sent: Monday, October 21, 2024 3:14 PM
To: Justin Piszcz <jpiszcz@lucidpixels.com>
Cc: LKML <linux-kernel@vger.kernel.org>; linux-nvme@lists.infradead.org
Subject: Re: 6.1.0-17: nvme nvme0: controller is down; will reset:
CSTS=0xffffffff, PCI_STATUS=0xffff

On Fri, Jan 05, 2024 at 09:49:58AM -0500, Justin Piszcz wrote:
> Hello,
> 
> Distribution: Debian Stable x86-64
> Kernel: 6.1.0-17
> 
> Reporting this as requested from the kernel message, I have now
> appended the recommended kernel boot parameters
> nvme_core.default_ps_max_latency_us=0 pcie_aspm=off and will see if
> this recurs.

Hi Justin, did anything ever come of this report?  Is it reproducible?
Did it seem to be related to suspend/resume?

[ ..] 

> 
> Regards,
> Justin

Yes, this turned out to be the result of the known Intel i9-13900K/14900K
CPU failure issue.  After working with Intel to replace the failing
i9-14900K, everything has been rock solid with no issues so far, running
the new 0x12b microcode as of 10/21/2024.  This is the first time I have
seen stack smashing and NVMe errors caused by a failing CPU, but that is
what it was: I have not seen a single instance of this error since
replacing the CPU.

There is a more detailed account of the issue that I posted here:
https://forum.level1techs.com/t/debian-linux-stable-on-pro-ws-w680-ace-ipmi-application-segfaults-kernel-panic/212854/8

Regards,
Justin





