public inbox for linux-kernel@vger.kernel.org
* "controller is down; will reset" on SK Hynix NVMe drive in Lenovo IdeaPad Pro 5
@ 2025-11-17 13:39 Thomas ten Cate
  2025-11-18 20:49 ` Keith Busch
  0 siblings, 1 reply; 7+ messages in thread
From: Thomas ten Cate @ 2025-11-17 13:39 UTC (permalink / raw)
  To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg
  Cc: linux-nvme, linux-kernel

Dear kernel heroes,

I'm encountering errors with the NVMe drive in my laptop, which appear
to be related to power saving modes (search keywords: APST, ASPM). It
got more serious in some recent kernel version, but seems to have been
present before.

After just booting, starting `dmesg -w` and waiting a bit, the log says:

[   43.710561] could not locate request for tag 0x0
[   43.710585] nvme nvme0: invalid id 0 completed on queue 1
[   43.710593] could not locate request for tag 0x0
[   43.710598] nvme nvme0: invalid id 0 completed on queue 1
[   43.710603] could not locate request for tag 0x0
[   43.710607] nvme nvme0: invalid id 0 completed on queue 1
[   43.710611] could not locate request for tag 0x0
[   43.710615] nvme nvme0: invalid id 0 completed on queue 1
[   73.744791] nvme nvme0: I/O tag 129 (4081) opcode 0x1 (Write) QID 1
timeout, aborting req_op:WRITE(1) size:32768
[   73.744862] nvme nvme0: I/O tag 130 (a082) opcode 0x1 (Write) QID 1
timeout, aborting req_op:WRITE(1) size:36864
[   73.744875] nvme nvme0: I/O tag 131 (8083) opcode 0x1 (Write) QID 1
timeout, aborting req_op:WRITE(1) size:4096
[   73.744886] nvme nvme0: I/O tag 133 (5085) opcode 0x1 (Write) QID 1
timeout, aborting req_op:WRITE(1) size:12288
[   73.756694] nvme nvme0: Abort status: 0x0
[   73.757641] nvme nvme0: Abort status: 0x0
[   73.758533] nvme nvme0: Abort status: 0x0
[   73.759422] nvme nvme0: Abort status: 0x0
[  103.824976] nvme nvme0: I/O tag 129 (4081) opcode 0x1 (Write) QID 1
timeout, reset controller
[  103.966268] nvme nvme0: 16/0/0 default/read/poll queues

Notice the 30-second delays. This problem has been present since
6.12.40 stable or maybe earlier, but has gone unnoticed until now
because things apparently recovered. Full log of a similar occasion:
https://gist.github.com/ttencate/9f2c4739d9e8a4c0142fd8246b56a7d6

More recently, since 6.12.56 or maybe earlier, I'm also sometimes getting these:

[  336.613637] nvme nvme0: request 0x0 genctr mismatch (got 0x0 expected 0x9)
[  336.613659] nvme nvme0: invalid id 0 completed on queue 8
[  366.657750] nvme nvme0: controller is down; will reset:
CSTS=0xffffffff, PCI_STATUS=0x10
[  366.657768] nvme nvme0: Does your device have a faulty power saving
mode enabled?
[  366.657773] nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0
pcie_aspm=off pcie_port_pm=off" and report a bug
[  366.761391] nvme 0000:03:00.0: enabling device (0000 -> 0002)
[  366.761842] nvme nvme0: Disabling device after reset failure: -19

In this case, the messages are followed by a slew of btrfs errors,
btrfs switches to read-only mode, and the drive becomes entirely
inaccessible until a reboot.
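
For context on the CSTS=0xffffffff value in the log above: an all-ones read from a PCIe register usually means the endpoint has stopped responding, so the status bits cannot be trusted. A minimal sketch of that interpretation follows. The helper name `decode_csts` is hypothetical and not part of the kernel or nvme-cli; the bit positions are taken from the NVMe CSTS register layout.

```python
# Sketch only: decode_csts is a hypothetical helper, not a kernel or
# nvme-cli API. Bit positions follow the NVMe CSTS register layout.

def decode_csts(csts: int) -> dict:
    """Split a 32-bit CSTS value into its commonly inspected fields."""
    if csts == 0xFFFFFFFF:
        # All ones is what a PCIe read typically returns when the device
        # no longer answers, so the individual bits carry no meaning.
        return {"device_responding": False}
    return {
        "device_responding": True,
        "ready": bool(csts & 0x1),             # RDY, bit 0
        "fatal_status": bool(csts & 0x2),      # CFS, bit 1
        "shutdown_status": (csts >> 2) & 0x3,  # SHST, bits 3:2
    }


print(decode_csts(0xFFFFFFFF))  # the value from the log above
print(decode_csts(0x1))         # a healthy, ready controller
```

Read this way, the log line says less about a controller-reported fault and more about the drive having dropped off the bus, which fits the later "Disabling device after reset failure: -19".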

The log suggests adding the kernel arguments
"nvme_core.default_ps_max_latency_us=0 pcie_aspm=off
pcie_port_pm=off", which indeed makes all issues go away.

I haven't found a reliable way to trigger the latter error
specifically, though doing something I/O heavy like compiling a kernel
seems to make it more likely. This makes bisect difficult to do, but
it's clear that something was going on in previous versions as well,
so I wouldn't necessarily call this a regression. Either way, the
issue is still present in mainline 6.17.8.

Since it happens only after some idle time, and disabling PM fixes it,
this seems related to power states. But of course, I cannot completely
rule out faulty hardware either.

Machine: Lenovo IdeaPad Pro 5 16APH8
Architecture: x86_64
NVMe drive: SK Hynix HFS001TEJ4X112N
Full lshw output:
https://gist.github.com/ttencate/5540c81454bbe1fa679955effba65eba

Distribution: Arch Linux
Kernel version: 6.17.8 (vanilla from commit 8ac42a6)
Kernel configuration:
https://gitlab.archlinux.org/archlinux/packaging/packages/linux-lts/-/blob/b0cac6a69041703bbe1aba4a2a269585d77b108b/config
(plus `make olddefconfig`)
GCC version: 15.2.1

This is my first kernel bug report, so I hope I didn't miss anything;
if I did, please let me know. I'd be happy to experiment or try out
patches.

Kind regards,

Thomas ten Cate


* Re: "controller is down; will reset" on SK Hynix NVMe drive in Lenovo IdeaPad Pro 5
  2025-11-17 13:39 "controller is down; will reset" on SK Hynix NVMe drive in Lenovo IdeaPad Pro 5 Thomas ten Cate
@ 2025-11-18 20:49 ` Keith Busch
  2025-11-20 13:52   ` Thomas ten Cate
  0 siblings, 1 reply; 7+ messages in thread
From: Keith Busch @ 2025-11-18 20:49 UTC (permalink / raw)
  To: Thomas ten Cate
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme,
	linux-kernel

On Mon, Nov 17, 2025 at 02:39:17PM +0100, Thomas ten Cate wrote:
> The log suggests adding the kernel arguments
> "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off
> pcie_port_pm=off", which indeed makes all issues go away.
> 
> I haven't found a reliable way to trigger the latter error
> specifically, though doing something I/O heavy like compiling a kernel
> seems to make it more likely. This makes bisect difficult to do, but
> it's clear that something was going on in previous versions as well,
> so I wouldn't necessarily call this a regression. Either way, the
> issue is still present in mainline 6.17.8.
> 
> Since it happens only after some idle time, and disabling PM fixes it,
> this seems related to power states. But of course, I cannot completely
> rule out faulty hardware either.
> 
> Machine: Lenovo IdeaPad Pro 5 16APH8
> Architecture: x86_64
> NVMe drive: SK Hynix HFS001TEJ4X112N
> Full lshw output:
> https://gist.github.com/ttencate/5540c81454bbe1fa679955effba65eba
> 
> Distribution: Arch Linux
> Kernel version: 6.17.8 (vanilla from commit 8ac42a6)
> Kernel configuration:
> https://gitlab.archlinux.org/archlinux/packaging/packages/linux-lts/-/blob/b0cac6a69041703bbe1aba4a2a269585d77b108b/config
> (plus `make olddefconfig`)
> GCC version: 15.2.1
> 
> This is my first kernel bug report, so I hope I didn't miss anything;
> if I did, please let me know. I'd be happy to experiment or try out
> patches.

The "report a bug" message was originally pointed at hardware vendors
rather than kernel. Something is wrong with the SSD, the PCIe slot, or
both if the power features cause the endpoint to drop off the bus. The
only recourse we have in the nvme driver is a quirk to disable APST for
the device. The driver doesn't control the PCIe ASPM settings though, so
that would have to be a different quirk if it's really necessary. Do you
need all three of those parameters, or is disabling the nvme driver's
apst sufficient on its own? These parameters do have a negative impact
on your machine's power consumption, so you'd usually want to hone in if
it's just the deepest power state or if every power saving feature
really needs to be disabled.
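
When testing the three parameters one at a time, it helps to confirm which of them the running kernel was actually booted with. A minimal sketch, assuming the command line is passed in as a string (on a live system it can be read from /proc/cmdline); the function name and the example root= argument are hypothetical:

```python
# Sketch only: reports which of the three power-management overrides from
# the kernel's hint appear in a kernel command line string.

PM_PARAMS = (
    "nvme_core.default_ps_max_latency_us",
    "pcie_aspm",
    "pcie_port_pm",
)


def active_pm_overrides(cmdline: str) -> dict:
    """Return {parameter: value} for each PM override found in cmdline."""
    found = {}
    for token in cmdline.split():
        key, _, value = token.partition("=")
        if key in PM_PARAMS:
            found[key] = value
    return found


# Hypothetical command line with only the APST override set.
example = "root=/dev/nvme0n1p2 rw nvme_core.default_ps_max_latency_us=0 quiet"
print(active_pm_overrides(example))
```

An empty result means none of the overrides are active, so any remaining stability would have to come from something else.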


* Re: "controller is down; will reset" on SK Hynix NVMe drive in Lenovo IdeaPad Pro 5
  2025-11-18 20:49 ` Keith Busch
@ 2025-11-20 13:52   ` Thomas ten Cate
  2025-11-20 21:04     ` Keith Busch
  0 siblings, 1 reply; 7+ messages in thread
From: Thomas ten Cate @ 2025-11-20 13:52 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme,
	linux-kernel

On Tue, Nov 18, 2025 at 9:49 PM Keith Busch <kbusch@kernel.org> wrote:
>
> The "report a bug" message was originally pointed at hardware vendors
> rather than kernel. Something is wrong with the SSD, the PCIe slot, or
> both if the power features cause the endpoint to drop off the bus. The
> only recourse we have in the nvme driver is a quirk to disable APST for
> the device. The driver doesn't control the PCIe ASPM settings though, so
> that would have to be a different quirk if it's really necessary. Do you
> need all three of those parameters, or is disabling the nvme driver's
> apst sufficient on its own? These parameters do have a negative impact
> on your machine's power consumption, so you'd usually want to hone in if
> it's just the deepest power state or if every power saving feature
> really needs to be disabled.

Thanks for your reply!

Just `nvme_core.default_ps_max_latency_us=0` appears to be sufficient,
so presumably it's the drive, not the bus. I wouldn't even know where
to begin reporting a bug to the manufacturer.

Still, it's odd that the problem manifested differently, and much less
severely, on an older kernel. ¯\_(ツ)_/¯

I've linked to this thread on the Arch wiki [1] to let others chime
in, in case a single report is not enough evidence to add a quirk to
the driver.

[1] https://wiki.archlinux.org/title/Solid_state_drive/NVMe#Controller_failure_due_to_broken_APST_support


* Re: "controller is down; will reset" on SK Hynix NVMe drive in Lenovo IdeaPad Pro 5
  2025-11-20 13:52   ` Thomas ten Cate
@ 2025-11-20 21:04     ` Keith Busch
  2025-11-21 13:06       ` Thomas ten Cate
  0 siblings, 1 reply; 7+ messages in thread
From: Keith Busch @ 2025-11-20 21:04 UTC (permalink / raw)
  To: Thomas ten Cate
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme,
	linux-kernel

On Thu, Nov 20, 2025 at 02:52:22PM +0100, Thomas ten Cate wrote:
> On Tue, Nov 18, 2025 at 9:49 PM Keith Busch <kbusch@kernel.org> wrote:
> >
> > The "report a bug" message was originally pointed at hardware vendors
> > rather than kernel. Something is wrong with the SSD, the PCIe slot, or
> > both if the power features cause the endpoint to drop off the bus. The
> > only recourse we have in the nvme driver is a quirk to disable APST for
> > the device. The driver doesn't control the PCIe ASPM settings though, so
> > that would have to be a different quirk if it's really necessary. Do you
> > need all three of those parameters, or is disabling the nvme driver's
> > apst sufficient on its own? These parameters do have a negative impact
> > on your machine's power consumption, so you'd usually want to hone in if
> > it's just the deepest power state or if every power saving feature
> > really needs to be disabled.
> 
> Thanks for your reply!
> 
> Just `nvme_core.default_ps_max_latency_us=0` appears to be sufficient,
> so presumably it's the drive, not the bus. I wouldn't even know where
> to begin reporting a bug to the manufacturer.

I assume vendor 0x1c5c for SK Hynix, but we also need the device id to
make a quirk. You can get that info from sysfs, for example if your
device is enumerated as "nvme0":

  # cat /sys/class/nvme/nvme0/device/device

There's two quirks we can try: NVME_QUIRK_NO_DEEPEST_PS first, then
NVME_QUIRK_NO_APST if the first one wasn't successful.
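
The vendor ID sits next to the device ID in the same sysfs directory, so both can be collected in one go. A minimal sketch, assuming the controller is enumerated as "nvme0"; the function name and the `sysfs_root` parameter are hypothetical and exist only so the lookup can be pointed at a different directory:

```python
# Sketch only: reads the PCI vendor and device IDs that a PCI-table quirk
# would be keyed on. Assumes the usual sysfs layout under /sys/class/nvme.
from pathlib import Path


def read_pci_ids(ctrl: str = "nvme0", sysfs_root: str = "/sys/class/nvme") -> tuple:
    """Return (vendor_id, device_id) as integers for the given controller."""
    dev_dir = Path(sysfs_root) / ctrl / "device"
    # Both files hold a hex string such as "0x1c5c" followed by a newline.
    vendor = int((dev_dir / "vendor").read_text().strip(), 16)
    device = int((dev_dir / "device").read_text().strip(), 16)
    return vendor, device


if __name__ == "__main__":
    if (Path("/sys/class/nvme") / "nvme0" / "device" / "vendor").exists():
        vid, did = read_pci_ids()
        print(f"vendor={vid:#06x} device={did:#06x}")
    else:
        print("no nvme0 controller found in sysfs")
```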


* Re: "controller is down; will reset" on SK Hynix NVMe drive in Lenovo IdeaPad Pro 5
  2025-11-20 21:04     ` Keith Busch
@ 2025-11-21 13:06       ` Thomas ten Cate
  2025-11-24 19:52         ` Keith Busch
  0 siblings, 1 reply; 7+ messages in thread
From: Thomas ten Cate @ 2025-11-21 13:06 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme,
	linux-kernel

On Thu, Nov 20, 2025 at 10:04 PM Keith Busch <kbusch@kernel.org> wrote:
>
> I assume vendor 0x1c5c for SK Hynix,

Correct.

> but we also need the device id to
> make a quirk. You can get that info from sysfs, for example if your
> device is enumerated as "nvme0":
>
>   # cat /sys/class/nvme/nvme0/device/device

0x1d59

> There's two quirks we can try: NVME_QUIRK_NO_DEEPEST_PS first, then
> NVME_QUIRK_NO_APST if the first one wasn't successful.

I tried adding the following to linux/drivers/nvme/host/core.c, in the
core_quirks[] array. It takes a model name, not a device id, which I
took from /sys/class/nvme/nvme0/model.

{
  .vid = 0x1c5c,
  .mn = "SKHynix_HFS001TEJ4X112N",
  .quirks = NVME_QUIRK_NO_APST,
}

With NVME_QUIRK_NO_DEEPEST_PS, it still hung almost immediately. With
NVME_QUIRK_NO_APST, it appears to be stable.

Would you like me to send a patch?


* Re: "controller is down; will reset" on SK Hynix NVMe drive in Lenovo IdeaPad Pro 5
  2025-11-21 13:06       ` Thomas ten Cate
@ 2025-11-24 19:52         ` Keith Busch
  2025-12-15 12:15           ` Thomas ten Cate
  0 siblings, 1 reply; 7+ messages in thread
From: Keith Busch @ 2025-11-24 19:52 UTC (permalink / raw)
  To: Thomas ten Cate
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme,
	linux-kernel

On Fri, Nov 21, 2025 at 02:06:23PM +0100, Thomas ten Cate wrote:
> On Thu, Nov 20, 2025 at 10:04 PM Keith Busch <kbusch@kernel.org> wrote:
> 
> I tried adding the following to linux/drivers/nvme/host/core.c, in the
> core_quirks[] array. It takes a model name, not a device id, which I
> took from /sys/class/nvme/nvme0/model.
> 
> {
>   .vid = 0x1c5c,
>   .mn = "SKHynix_HFS001TEJ4X112N",
>   .quirks = NVME_QUIRK_NO_APST,
> }
> 
> With NVME_QUIRK_NO_DEEPEST_PS, it still hung almost immediately. With
> NVME_QUIRK_NO_APST, it appears to be stable.
> 
> Would you like me to send a patch?

Sure, but I think it'd be more universal to apply the quirk to the
nvme-pci driver's pci device table than to make it based on the model.
It's common for a model name to be different for various capacities, but
I suspect the power behavior is more generic than that, which should be
common with the device ID.
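
The difference between the two matching strategies can be sketched with a toy model. This is not the kernel's implementation: the tables, function names, and the second model string below are hypothetical. Only the vendor ID, device ID, and the first model name come from this thread.

```python
# Toy model only: contrasts a quirk keyed on (vendor, model name) with one
# keyed on (vendor, device id). The kernel's real tables are C arrays.

NO_APST = "NVME_QUIRK_NO_APST"  # stands in for the kernel quirk flag

# Keyed on the model string, as in the core_quirks[] experiment.
MODEL_QUIRKS = {(0x1C5C, "SKHynix_HFS001TEJ4X112N"): {NO_APST}}

# Keyed on the PCI IDs, as a PCI device table entry would be.
PCI_QUIRKS = {(0x1C5C, 0x1D59): {NO_APST}}


def quirks_by_model(vid: int, model: str) -> set:
    """Look up quirks by vendor ID and exact model name."""
    return MODEL_QUIRKS.get((vid, model), set())


def quirks_by_pci_id(vid: int, did: int) -> set:
    """Look up quirks by vendor ID and PCI device ID."""
    return PCI_QUIRKS.get((vid, did), set())


# Hypothetical sibling drive: same PCI IDs, different capacity and name.
sibling_model = "HYPOTHETICAL_OTHER_CAPACITY_MODEL"
print(quirks_by_model(0x1C5C, sibling_model))  # model match misses it
print(quirks_by_pci_id(0x1C5C, 0x1D59))        # ID match still applies
```

If other capacities of the same drive family share the device ID, only the ID-keyed entry would cover them, which is the argument for the PCI table.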


* Re: "controller is down; will reset" on SK Hynix NVMe drive in Lenovo IdeaPad Pro 5
  2025-11-24 19:52         ` Keith Busch
@ 2025-12-15 12:15           ` Thomas ten Cate
  0 siblings, 0 replies; 7+ messages in thread
From: Thomas ten Cate @ 2025-12-15 12:15 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1804 bytes --]

Sorry it took a while to get back to this.

Interestingly, the device ID is already in the table under the name
"BC901", with NVME_QUIRK_DISABLE_WRITE_ZEROES. From
https://product.skhynix.com/products/ssd/cssd/pc801_bc901.go I gather
that HFS001TEJ4X112N is *probably* an incarnation of the BC901 series,
although that exact model name is not listed there. The BC901 series
is supposed to be extra low-power... not waking up from sleep is
certainly a great way to accomplish that ;-)

I've attached a patch that works for me. But, like you said, this does
affect power usage, so I'm not sure it should be applied at all... at
least not until it's confirmed that I'm not the only one with this
problem. I'll leave that decision up to you.




On Mon, Nov 24, 2025 at 8:52 PM Keith Busch <kbusch@kernel.org> wrote:
>
> On Fri, Nov 21, 2025 at 02:06:23PM +0100, Thomas ten Cate wrote:
> > On Thu, Nov 20, 2025 at 10:04 PM Keith Busch <kbusch@kernel.org> wrote:
> >
> > I tried adding the following to linux/drivers/nvme/host/core.c, in the
> > core_quirks[] array. It takes a model name, not a device id, which I
> > took from /sys/class/nvme/nvme0/model.
> >
> > {
> >   .vid = 0x1c5c,
> >   .mn = "SKHynix_HFS001TEJ4X112N",
> >   .quirks = NVME_QUIRK_NO_APST,
> > }
> >
> > With NVME_QUIRK_NO_DEEPEST_PS, it still hung almost immediately. With
> > NVME_QUIRK_NO_APST, it appears to be stable.
> >
> > Would you like me to send a patch?
>
> Sure, but I think it'd be more universal to apply the quirk to the
> nvme-pci driver's pci device table than to make it based on the model.
> It's common for a model name to be different for various capacities, but
> I suspect the power behavior is more generic than that, which should be
> common with the device ID.

[-- Attachment #2: skhynix_hfs001tej4x112n.patch --]
[-- Type: application/x-patch, Size: 679 bytes --]
