[BUG] Lenovo 83JL / WD SN7100S: intermittent loss of secondary NVMe after s2idle resume, root port 00:02.1 retraining fails
From: Jacopo Labardi @ 2026-04-09 14:52 UTC
  To: linux-nvme; +Cc: linux-pci, linux-pm

Hello,

I am reporting an intermittent suspend/resume failure on a Lenovo IdeaPad Pro 5
14AKP10 (machine type 83JL) where the secondary NVMe drive can become unusable
after s2idle resume on Linux.

I am CCing linux-nvme, linux-pci and linux-pm because the visible failure path
starts with PCIe root-port link retraining on 0000:00:02.1 during resume and
ends with nvme reset failure on the downstream device at 0000:bf:00.0.

Summary
- The problem is intermittent. It does not always happen on a fixed cycle.
- On this machine it may happen on the first resume or only after several
  suspend/resume cycles.
- The clean reproducer below failed on the third suspend/resume cycle, but that
  cycle count is not stable and should not be interpreted as deterministic.
- When it fails, the secondary/data NVMe is effectively lost until a full
  reboot.
- The system NVMe on 0000:c2:00.0 survives.

Hardware
- Laptop: Lenovo IdeaPad Pro 5 14AKP10 (83JL)
- BIOS: LENOVO QKCN29WW, release date 2025-12-23
- Platform: AMD Krackan / Ryzen AI 350
- Root port for failing device: 0000:00:02.1, AMD [1022:1126]
- System NVMe: Lexar NM790 2TB at 0000:c2:00.0
- Secondary/data NVMe: SanDisk/WD PC SN7100S M.2 2242 NVMe SSD (DRAM-less)
  [15b7:5044] at 0000:bf:00.0
- PCIe topology:
  0000:00:02.1 -> 0000:bf:00.0

Software
- Kernel for the clean reproducer: 6.19.11-arch1-1
- Kernel taint during reproducer: 0
- Sleep mode exposed by the platform on Linux: [s2idle]
- acpi_call-dkms is installed on disk, but the acpi_call module was not loaded
  during the reproducer and the kernel was not tainted
- Boot command line used for the clean reproducer:
  quiet nowatchdog rw rootflags=subvol=/@ rootfstype=btrfs
  root=UUID=<redacted> amd_pstate=active iommu=pt i8042.nopnp loglevel=3
  8250.nr_uarts=0 tpm_tis.interrupts=0 random.trust_cpu=on
  snd_hda_intel.power_save=10 snd_hda_intel.power_save_controller=Y

At boot on this kernel, both NVMe controllers log:
- nvme 0000:c2:00.0: platform quirk: setting simple suspend
- nvme 0000:bf:00.0: platform quirk: setting simple suspend
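
For reference, I confirmed these preconditions before the clean run with
roughly the following (nothing exotic, just the standard interfaces):

  cat /sys/power/mem_sleep           # expect "[s2idle]"
  cat /proc/sys/kernel/tainted       # expect 0
  lsmod | grep -w acpi_call          # expect no output
  journalctl -k -b | grep "setting simple suspend"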

Clean reproduction used for this report
- No NVMe-specific udev overrides were active
- No custom NVMe-related systemd sleep hooks or suspend services were active
- No NVMe-specific kernel parameters were active

Reproducer
1. Boot the machine into 6.19.11-arch1-1.
2. Confirm tainted=0 and that both NVMe devices are present.
3. Suspend to s2idle and resume.
4. Repeat suspend/resume until the failure occurs.
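
For anyone who wants to automate steps 3 and 4, a loop along these lines
should be equivalent (sketch only; for the clean run I suspended with
systemctl suspend and woke the machine manually, but the pass/fail check is
the same; it assumes the WD controller enumerates as nvme1):

  #!/bin/sh
  # Cycle s2idle until the WD namespace disappears.
  i=0
  while [ -b /dev/nvme1n1 ]; do
      i=$((i + 1))
      echo "=== cycle $i ==="
      rtcwake -m freeze -s 30 || break   # enter s2idle, wake after ~30 s
      sleep 15                           # let resume settle before checking
      dmesg | grep -E "0000:00:02\.1|nvme nvme1" | tail -n 10
  done
  echo "WD namespace gone after cycle $i"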

Observed behavior
- In one clean run used for this report:
  - first cycle resumed successfully
  - second cycle resumed successfully
  - third cycle resumed with the WD drive lost
- In prior testing on the same machine, the failure sometimes happened on the
  first cycle and sometimes only after several cycles.

Relevant kernel log excerpt from the failing run
  Apr 09 16:24:34 kernel: PM: suspend entry (s2idle)
  Apr 09 16:24:54 kernel: pcieport 0000:00:02.1: broken device, retraining non-functional downstream link at 2.5GT/s
  Apr 09 16:24:54 kernel: pcieport 0000:00:02.1: retraining failed
  Apr 09 16:24:54 kernel: pcieport 0000:00:02.1: Data Link Layer Link Active not set in 100 msec
  Apr 09 16:24:54 kernel: nvme nvme1: Disabling device after reset failure: -19
  Apr 09 16:24:54 kernel: PM: suspend exit
  Apr 09 16:25:04 kernel: ntfs3(nvme1n1p5): failed to read volume at offset 0x10c000
  Apr 09 16:25:26 kernel: ntfs3: 107 callbacks suppressed
  Apr 09 16:25:40 kernel: nvme nvme1: Identify namespace failed (-5)

State after the failure
- nvme list shows only the system Lexar drive; the WD is no longer listed
- lspci -nnvv -s bf:00.0 still shows the device, but with:
  - !!! Unknown header type 7f
  - Kernel driver in use: nvme
- sysfs state at collection time:
  - /sys/bus/pci/devices/0000:bf:00.0/power_state = D3cold
  - /sys/bus/pci/devices/0000:bf:00.0/power/runtime_status = active
  - /sys/bus/pci/devices/0000:bf:00.0/d3cold_allowed = 1
  - /sys/bus/pci/devices/0000:bf:00.0/power/control = on
  - /sys/bus/pci/devices/0000:00:02.1/power_state = D0
  - /sys/bus/pci/devices/0000:00:02.1/power/runtime_status = active
  - /sys/bus/pci/devices/0000:00:02.1/d3cold_allowed = 1
  - /sys/bus/pci/devices/0000:00:02.1/power/control = auto
- smartctl -x /dev/nvme1 failed with "Resource temporarily unavailable"
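
For reference, the sysfs snapshot above was collected (as root, shortly after
the failing resume) with roughly:

  for dev in 0000:00:02.1 0000:bf:00.0; do
      for attr in power_state power/runtime_status d3cold_allowed power/control; do
          printf '%s %s = %s\n' "$dev" "$attr" \
              "$(cat /sys/bus/pci/devices/$dev/$attr)"
      done
  done
  lspci -nnvv -s bf:00.0
  nvme list
  smartctl -x /dev/nvme1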

Expected behavior
- The secondary WD NVMe should resume normally and remain usable after s2idle.

Previous mitigation/debug attempts on the same machine
These are not part of the clean reproducer above; they are prior experiments
done to narrow the failure mode.

- The issue also reproduced while the machine was configured with
  pcie_aspm.policy=performance; removing that option did not eliminate it
- nvme_core.default_ps_max_latency_us=2000
  - no fix; failure still reproduced
- Force power/control=on and d3cold_allowed=0 on both 0000:00:02.1 and
  0000:bf:00.0 (see the sketch after this list)
  - no fix; in failing runs the WD could still end up inaccessible / in D3cold
- pm_async=off
  - same failure mode
- SuspendState=freeze via systemd sleep configuration
  - no change, as expected: "freeze" selects the same s2idle path this platform
    already uses, and the issue remained
- nvme.noacpi=1
  - removed the "platform quirk: setting simple suspend" message, but resume
    often degraded into a black screen / forced reboot instead of fixing the WD
- pcie_port_pm=off
  - no usable fix; often resulted in black-screen resume / forced reboot
- pcie_ports=compat
  - no usable fix; often resulted in black-screen resume / forced reboot
- User-space suspend/resume hooks that tried unbind/bind/remove/rescan around
  the WD path
  - no reliable recovery; they only lengthened resume and still ended in reset
    failure / missing device
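
For completeness, the sysfs forcing and the hook-based remove/rescan attempts
above were done along these lines (simplified sketch of what the hooks did,
run as root; the hooks also tried driver unbind/bind, with the same result):

  # Keep the root port and the WD endpoint out of runtime PM / D3cold
  # (did not help):
  for dev in 0000:00:02.1 0000:bf:00.0; do
      echo on > /sys/bus/pci/devices/$dev/power/control
      echo 0  > /sys/bus/pci/devices/$dev/d3cold_allowed
  done

  # Post-resume recovery attempt (did not recover the device; the link never
  # retrained, so the rescan found nothing usable behind 00:02.1):
  echo 1 > /sys/bus/pci/devices/0000:bf:00.0/remove
  echo 1 > /sys/bus/pci/devices/0000:00:02.1/rescan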

Additional context
- Linux on this machine exposes only s2idle.
- The issue was reproduced on multiple Linux distributions, not only one
  userspace/kernel packaging combination.
- Windows on the same hardware resumes correctly.
- I do not currently have a known-good Linux kernel version on this machine, so
  I am not claiming this is a regression in a specific upstream release.

I am not sure whether the underlying bug lies primarily in the NVMe driver, in
PCIe power management, or in a platform/firmware interaction. The first
failing messages in the clean reproducer come from the root-port retraining
path, which is why I am CCing the PCI and PM lists in addition to NVMe.

I collected a local bundle containing:
- full kernel log from the reproducer boot
- systemd suspend log
- lspci -nnvv for 0000:00:02.1 and 0000:bf:00.0
- sysfs power-state snapshots
- dmidecode output

The raw bundle contains machine serial/UUID, so I am not attaching it publicly
as-is. I can provide redacted logs or specific files immediately if requested.

If useful, I can also test additional debug options or a current mainline/-rc
kernel.

Thanks.
