public inbox for linux-pci@vger.kernel.org
* [PROBLEM] c5.metal on AWS fails to kexec after "PCI: Explicitly put devices into D0 when initializing"
@ 2025-09-19  3:52 Matthew Ruffell
From: Matthew Ruffell @ 2025-09-19  3:52 UTC (permalink / raw)
  To: mario.limonciello, bhelgaas@google.com; +Cc: linux-pci, lkml, Jay Vosburgh

Hi Mario, Bjorn,

I am debugging a kexec regression, and I could use some help please.

The AWS "c5.metal" instance type fails to kexec into another kernel. It gets
stuck during boot while mounting the rootfs from the NVMe drive, then proceeds
at a glacial pace, with repeated stalls of roughly 30 seconds, and never
finishes booting:

[   79.172085] EXT4-fs (nvme0n1p1): orphan cleanup on readonly fs
[   79.193407] EXT4-fs (nvme0n1p1): mounted filesystem a4f7c460-5723-4ed1-9e86-04496bd66119 ro with ordered data mode. Quota mode: none.
[  109.606598] systemd[1]: Inserted module 'autofs4'
[  139.786021] systemd[1]: systemd 257.9-0ubuntu1 running in system mode (+PAM +AUDIT +SELINUX +APPARMOR +IMA +IPE +SMACK +SECCOMP +GCRYPT -GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP +LIBCRYPTSETUP_PLUGINS +LIBFDISK +PCRE2 +PWQUALITY +P11KIT +QRENCODE +TPM2 +BZIP2 +LZ4 +XZ +ZLIB +ZSTD +BPF_FRAMEWORK +BTF -XKBCOMMON -UTMP +SYSVINIT +LIBARCHIVE)
[  139.943485] systemd[1]: Detected architecture x86-64.
[  169.994695] systemd[1]: Hostname set to <ip-172-31-48-167>.
[  170.102479] systemd[1]: bpf-restrict-fs: BPF LSM hook not enabled in the kernel, BPF LSM not supported.
[  200.503000] systemd[1]: Queued start job for default target graphical.target.
[  200.550056] systemd[1]: Created slice system-modprobe.slice - Slice /system/modprobe.
[  230.922947] systemd[1]: Created slice system-serial\x2dgetty.slice - Slice /system/serial-getty.
[  261.131318] systemd[1]: Created slice system-systemd\x2dfsck.slice - Slice /system/systemd-fsck.
[  291.338906] systemd[1]: Created slice user.slice - User and Session Slice.
[  321.546200] systemd[1]: Started systemd-ask-password-wall.path - Forward Password Requests to Wall Directory Watch.

I bisected the issue, and the behaviour starts with:

commit 4d4c10f763d7808fbade28d83d237411603bca05
Author: Mario Limonciello <mario.limonciello@amd.com>
Date:  Wed Apr 23 23:31:32 2025 -0500
Subject: PCI: Explicitly put devices into D0 when initializing
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4d4c10f763d7808fbade28d83d237411603bca05

I also tried the follow-up commit:

commit 907a7a2e5bf40c6a359b2f6cc53d6fdca04009e0
Author: Mario Limonciello <mario.limonciello@amd.com>
Date:  Wed Jun 11 18:31:16 2025 -0500
Subject: PCI/PM: Set up runtime PM even for devices without PCI PM
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=907a7a2e5bf40c6a359b2f6cc53d6fdca04009e0

and the behaviour still exists.

If I revert both commits from 6.17-rc3, or from the downstream Ubuntu stable
kernels, the system kexecs successfully as normal.

lspci -vvv as root (nvme device)
https://paste.ubuntu.com/p/x7Zyjp8Brr/

lspci -vvv as root (full output)
https://paste.ubuntu.com/p/NTdbByTqjR/

Strangely, the outcome depends only on the kernel I kexec from, not the kernel
I kexec into (first kernel -> second kernel):

Kernel without 4d4c10f76 -> kernel without 4d4c10f76 = success
Kernel without 4d4c10f76 -> kernel with 4d4c10f76 = success
Kernel with 4d4c10f76 -> kernel without 4d4c10f76 = failure
Kernel with 4d4c10f76 -> kernel with 4d4c10f76 = failure

Steps to reproduce:
1) On AWS, launch a c5.metal instance.
2) Install a kernel with 4d4c10f76. Note it might need AWS-specific patches,
so perhaps try a recent downstream distro kernel such as 6.17.0-1001-aws in
Ubuntu Questing with AMI ami-069b93def587ece0f
(ubuntu/images-testing/hvm-ssd-gp3/ubuntu-questing-daily-amd64-server-20250822)
with a full apt update && apt upgrade.
3) sudo reboot, to get a fresh full boot. Note, this takes approx 17 minutes.
4) sudo apt install kexec-tools
5) kernel=6.17.0-1001-aws
   kexec -l -t bzImage /boot/vmlinuz-$kernel \
       --initrd=/boot/initrd.img-$kernel --reuse-cmdline
   kexec -e
6) On EC2 console, Actions > Monitor and troubleshoot > EC2 serial console,
and watch progress.
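
For convenience, steps 4 and 5 could be wrapped in a small script along these
lines. This is only a sketch: the build_kexec_cmd function and the DRY_RUN and
KERNEL variables are names made up for illustration, not part of kexec-tools,
and it assumes the kernel and initrd are in /boot as in step 5. By default it
only prints the commands; set DRY_RUN=0 and run it as root to actually kexec.

```shell
#!/bin/sh
# Sketch of steps 4-5. build_kexec_cmd, DRY_RUN and KERNEL are illustrative
# names, not part of kexec-tools.
kernel="${KERNEL:-6.17.0-1001-aws}"

# Print the kexec load command for a given kernel version.
build_kexec_cmd() {
    printf 'kexec -l -t bzImage /boot/vmlinuz-%s --initrd=/boot/initrd.img-%s --reuse-cmdline\n' "$1" "$1"
}

if [ "${DRY_RUN:-1}" = "1" ]; then
    # Default: only show what would be run.
    build_kexec_cmd "$kernel"
    echo "kexec -e"
else
    # Load the kernel and initrd, then jump into the new kernel (needs root).
    $(build_kexec_cmd "$kernel") && kexec -e
fi
```

Running it with no arguments prints the two commands from step 5 for the
default kernel version, so they can be checked before rebooting into the
failure case.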

I am more than happy to try any patches / debug printk's etc.

Thanks,
Matthew


Thread overview: 15+ messages
2025-09-19  3:52 [PROBLEM] c5.metal on AWS fails to kexec after "PCI: Explicitly put devices into D0 when initializing" Matthew Ruffell
2025-09-19  5:02 ` Mario Limonciello
2025-12-04  5:04   ` Matthew Ruffell
2025-12-04  5:29     ` Mario Limonciello
2025-12-05  3:06       ` Matthew Ruffell
2025-12-05  3:10         ` Matthew Ruffell
2025-12-05  5:31           ` Mario Limonciello
2026-01-06  6:06             ` Mario Limonciello
2026-02-13  5:54               ` Matthew Ruffell
2026-02-13 19:26                 ` Bjorn Helgaas
2026-02-17 14:36                   ` Mario Limonciello
2026-02-23  6:04                     ` Mario Limonciello
2026-02-25  5:21                       ` Matthew Ruffell
2026-02-25  5:42                         ` Mario Limonciello
2025-12-05  5:28         ` Mario Limonciello
