public inbox for linux-pci@vger.kernel.org
 help / color / mirror / Atom feed
From: Mario Limonciello <mario.limonciello@amd.com>
To: Bjorn Helgaas <helgaas@kernel.org>,
	Matthew Ruffell <matthew.ruffell@canonical.com>
Cc: "bhelgaas@google.com" <bhelgaas@google.com>,
	linux-pci@vger.kernel.org, lkml <linux-kernel@vger.kernel.org>,
	Jay Vosburgh <jay.vosburgh@canonical.com>,
	kengyu@lexical.tw, Brian Norris <briannorris@chromium.org>
Subject: Re: [PROBLEM] c5.metal on AWS fails to kexec after "PCI: Explicitly put devices into D0 when initializing"
Date: Tue, 17 Feb 2026 08:36:38 -0600	[thread overview]
Message-ID: <a7d03703-04fa-4c26-987a-a34d2e96a4a7@amd.com> (raw)
In-Reply-To: <20260213192607.GA3167656@bhelgaas>

On 2/13/26 1:26 PM, Bjorn Helgaas wrote:
> [+cc Brian, author of 51c0996dadae ("PCI/PM: Prevent runtime suspend
> until devices are fully initialized") (though it's not clear to me
> whether this makes any difference)]
> 
> On Fri, Feb 13, 2026 at 06:54:13PM +1300, Matthew Ruffell wrote:
>> Hi Mario,
>>
>> I have been doing some experiments with c5.metal on AWS with an
>> Ubuntu resolute userspace and a 7.0 merge window kernel.
>>
>> Amazon AMI:
>> ami-04080dce0ccaa893d
>>
>> Linux HEAD 582a1ef360a05bff4350bbf6e383f61d26b804f0 (Linus's tree).
>>
>> $ sudo lspci
>> https://paste.ubuntu.com/p/mtgbn2HJtW/
>>
>> $ sudo lspci -vvv -t
>> https://paste.ubuntu.com/p/s6NhBz5FZy/
>>
>> I noticed that nvme_pci_enable() calls pci_set_master() with the pci
>> device, and I wondered, is this called with the NVMe device, so I
>> went and had a look.
>>
>> I added your patch:
>>
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index 595ec12d85df..bd116dccd897 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -1300,6 +1300,7 @@ int pci_power_up(struct pci_dev *dev)
>>       bool need_restore;
>>       pci_power_t state;
>>       u16 pmcsr;
>> +    u16 old_cmd;
>>
>>       platform_pci_set_power_state(dev, PCI_D0);
>>
>> @@ -1347,6 +1348,10 @@ int pci_power_up(struct pci_dev *dev)
>>           udelay(PCI_PM_D2_DELAY);
>>
>>   end:
>> +        pci_read_config_word(dev, PCI_COMMAND, &old_cmd);
>> +        pci_info(dev, "Bus mastering bit is %sabled in D0\n",
>> +                (old_cmd & PCI_COMMAND_MASTER) ? "en" : "dis");
>> +
>>       dev->current_state = PCI_D0;
>>       if (need_restore)
>>           return 1;
>>
>> as well as some additional logging in nvme_pci_enable:
>>
>> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
>> index 80df992d1ae8..2d1898016d40 100644
>> --- a/drivers/nvme/host/pci.c
>> +++ b/drivers/nvme/host/pci.c
>> @@ -2992,6 +2992,7 @@ static bool nvme_pci_update_nr_queues(struct
>> nvme_dev *dev)
>>
>>   static int nvme_pci_enable(struct nvme_dev *dev)
>>   {
>> +    dev_info(dev->ctrl.device, "mruffell: nvme_pci_enable() start/n");
>>       int result = -ENOMEM;
>>       struct pci_dev *pdev = to_pci_dev(dev->dev);
>>       unsigned int flags = PCI_IRQ_ALL_TYPES;
>> @@ -2999,6 +3000,7 @@ static int nvme_pci_enable(struct nvme_dev *dev)
>>       if (pci_enable_device_mem(pdev))
>>           return result;
>>
>> +    pci_info(pdev, "mruffell: nvme_pci_enable() set master\n");
>>       pci_set_master(pdev);
>>
>>       if (readl(dev->bar + NVME_REG_CSTS) == -1) {
>> @@ -3064,11 +3066,13 @@ static int nvme_pci_enable(struct nvme_dev *dev)
>>       result = nvme_pci_configure_admin_queue(dev);
>>       if (result)
>>           goto free_irq;
>> +    pci_info(pdev, "mruffell: nvme_pci_enable() end\n");
>>       return result;
>>
>>    free_irq:
>>       pci_free_irq_vectors(pdev);
>>    disable:
>> +    pci_info(pdev, "mruffell: nvme_pci_enable() disable\n");
>>       pci_disable_device(pdev);
>>       return result;
>>   }
>>
>> So, booting the system on a clean reboot, we see nvme_pci_enable
>> does infact get called with the nvme device, and all PCI devices in
>> D0 with the bus master bit enabled.
>>
>> https://paste.ubuntu.com/p/zp3khwfMg2/
>>
>> Then I did a kexec.
>>
>> https://paste.ubuntu.com/p/SWB5jz6x4g/
>>
>> nvme_pci_enable() still gets called, but it doesn't seem to change
>> anything.  The system still fails to issue I/O to the NVMe and all
>> pci devices say the bus mastering bit is disabled in D0.
>>
>> I then reverted
>>
>> commit 51c0996dadaea20d73eb0495aeda9cb0422243e8
>> Author: Brian Norris <briannorris@chromium.org>
>> Date:   Thu Jan 22 09:48:15 2026 -0800
>> Subject: PCI/PM: Prevent runtime suspend until devices are fully initialized
>>
>> commit 907a7a2e5bf40c6a359b2f6cc53d6fdca04009e0
>> Author: Mario Limonciello <mario.limonciello@amd.com>
>> Date:   Wed Jun 11 18:31:16 2025 -0500
>> Subject: PCI/PM: Set up runtime PM even for devices without PCI PM
>>
>> commit 4d4c10f763d7808fbade28d83d237411603bca05
>> Author: Mario Limonciello <mario.limonciello@amd.com>
>> Date:   Wed Apr 23 23:31:32 2025 -0500
>> Subject: PCI: Explicitly put devices into D0 when initializing
>>
>> and repeated the tests.
>>
>> This time, on a clean reboot I get:
>>
>> https://paste.ubuntu.com/p/qm6ZBWNqG5/
>>
>> I get no messages about bus mastering at all. I did a kexec reboot
>> and the system comes up correctly, again with no messages about bus
>> mastering. So we must return early before the printout.
>>
>> I then added some more logging, to see if we can see what state
>> these devices are in:
>>
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index bd116dccd897..deed439af6e4 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -1302,10 +1302,16 @@ int pci_power_up(struct pci_dev *dev)
>>       u16 pmcsr;
>>       u16 old_cmd;
>>
>> +    state = platform_pci_get_power_state(dev);
>> +    pci_read_config_word(dev, PCI_COMMAND, &old_cmd);
>> +        pci_info(dev, "Initial: Bus mastering bit is %sabled in %d\n",
>> +                (old_cmd & PCI_COMMAND_MASTER) ? "en" : "dis", state);
> 
> If you include "%s", paired with "__func__", in the printf string,
> we'll get the locations directly.
> 
>>       platform_pci_set_power_state(dev, PCI_D0);
>>
>>       if (!dev->pm_cap) {
>>           state = platform_pci_get_power_state(dev);
>> +        pci_info(dev, "mruffell: pci power state is %d\n", state);
>>           if (state == PCI_UNKNOWN)
>>               dev->current_state = PCI_D0;
>>           else
>> @@ -1349,7 +1355,7 @@ int pci_power_up(struct pci_dev *dev)
>>
>>   end:
>>           pci_read_config_word(dev, PCI_COMMAND, &old_cmd);
>> -        pci_info(dev, "Bus mastering bit is %sabled in D0\n",
>> +        pci_info(dev, "mruffell: final: Bus mastering bit is %sabled in D0\n",
>>                   (old_cmd & PCI_COMMAND_MASTER) ? "en" : "dis");
>>
>>       dev->current_state = PCI_D0;
>>
>> The result for a clean reboot is:
>>
>> https://paste.ubuntu.com/p/qh84c8T9ND/
>>
>> We seem to skip pci_power_up() entirely for most devices, including
>> the NVMe drive. Those that enter pci_power_up() are in state 5, or
>> PCI_UNKNOWN.
>>
>> I did a kexec, and we get:
>>
>> https://paste.ubuntu.com/p/D2GhjKxDG3/
>>
>> Pretty much identical results, and the system comes up correctly.
>> Each time the bus mastering printout at the very end of
>> pci_power_up() is never called.
>>
>> The only conclusions I can seem to draw from this is when "PCI:
>> Explicitly put devices into D0 when initializing" is present, all
>> devices get their bus mastering bit, and are placed into D0, and on
>> kexec, all devices lose their bus mastering bit and never get it
>> back again.
>>
>> If we revert "PCI: Explicitly put devices into D0 when
>> initializing", devices that work bypass pci_power_up(), get their
>> bus mastering bits set correctly, and kexec works.
> 
> Thanks for all your instrumentation and debugging!
> 
> I don't think it's safe for the PCI core to indiscriminately enable
> bus mastering during boot.  That allows devices to DMA to/from system
> memory even if there's no driver to operate the device.  Bus mastering
> should only be enabled by a driver.
> 

It seems to me the problem is a mismatch with the device and what the 
kernel did in the context of kexec.

I wonder if instead we should do the inverse - explicitly turn off bus 
mastering before we put it into D0 at startup.  Because as it stands 
today there appears to be some devices that the BIOS will have bus 
mastering enabled, kexec might have turned on bus mastering and we are 
just "getting lucky" there hasn't been a problem from that.

This would then also align the kernel behavior to the design pattern of 
bus mastering should always be off until a driver asks for it.

> Per D2GhjKxDG3, nvme_pci_enable() is being called after the kexec, as
> it should be.  We need to figure out why it isn't enabling bus
> mastering.
> 
> It looks like something is wrong before we even get to
> nvme_pci_enable() because it took 80+ seconds to get there after
> kexec (D2GhjKxDG3), vs 2 seconds for the clean reboot (qh84c8T9ND).
> 
> Can you please put the entire dmesg log in a pastebin, possibly also
> with a note about the circumstances and a pointer to another pastebin
> with the "git diff" from upstream to the kernel?  It's hard for me to
> pull all the pieces together from the excerpts and keep track of which
> logs correspond to reboots vs kexecs, etc.
> 
> Bjorn


  reply	other threads:[~2026-02-17 14:36 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-09-19  3:52 [PROBLEM] c5.metal on AWS fails to kexec after "PCI: Explicitly put devices into D0 when initializing" Matthew Ruffell
2025-09-19  5:02 ` Mario Limonciello
2025-12-04  5:04   ` Matthew Ruffell
2025-12-04  5:29     ` Mario Limonciello
2025-12-05  3:06       ` Matthew Ruffell
2025-12-05  3:10         ` Matthew Ruffell
2025-12-05  5:31           ` Mario Limonciello
2026-01-06  6:06             ` Mario Limonciello
2026-02-13  5:54               ` Matthew Ruffell
2026-02-13 19:26                 ` Bjorn Helgaas
2026-02-17 14:36                   ` Mario Limonciello [this message]
2026-02-23  6:04                     ` Mario Limonciello
2026-02-25  5:21                       ` Matthew Ruffell
2026-02-25  5:42                         ` Mario Limonciello
2025-12-05  5:28         ` Mario Limonciello

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=a7d03703-04fa-4c26-987a-a34d2e96a4a7@amd.com \
    --to=mario.limonciello@amd.com \
    --cc=bhelgaas@google.com \
    --cc=briannorris@chromium.org \
    --cc=helgaas@kernel.org \
    --cc=jay.vosburgh@canonical.com \
    --cc=kengyu@lexical.tw \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=matthew.ruffell@canonical.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox