* [Regression] kdump fails to get DHCP address unless booting with pci=nomsi or without nr_cpus=1
@ 2025-08-11 3:23 Coiby Xu
2025-08-11 13:02 ` Thomas Gleixner
0 siblings, 1 reply; 13+ messages in thread
From: Coiby Xu @ 2025-08-11 3:23 UTC (permalink / raw)
To: Thomas Gleixner; +Cc: linux-arm-kernel, linux-pci, kexec
Hi Thomas,
Recently I hit an issue where, on certain virtual machines, the kdump
kernel fails to get a DHCP IP address most of the time, starting from
6.11-rc2. git bisect shows that commit b5712bf89b4b ("irqchip/gic-v3-its:
Provide MSI parent for PCI/MSI[-X]") is the first bad commit:
# good: [7d189c77106ed6df09829f7a419e35ada67b2bd0] PCI/MSI: Provide
# MSI_FLAG_PCI_MSI_MASK_PARENT
git bisect good 7d189c77106ed6df09829f7a419e35ada67b2bd0
# good: [48f71d56e2b87839052d2a2ec32fc97a79c3e264] irqchip/gic-v3-its:
# Provide MSI parent infrastructure
git bisect good 48f71d56e2b87839052d2a2ec32fc97a79c3e264
# good: [8c41ccec839c622b2d1be769a95405e4e9a4cb20] irqchip/irq-msi-lib:
# Prepare for PCI MSI/MSIX
git bisect good 8c41ccec839c622b2d1be769a95405e4e9a4cb20
# first bad commit: [b5712bf89b4bbc5bcc9ebde8753ad222f1f68296]
# irqchip/gic-v3-its: Provide MSI parent for PCI/MSI[-X]
I'd appreciate any suggestions on fixing this issue. I'd also be glad
to test any idea, as this issue seems to be reliably reproducible only
on certain machines.
The aarch64 kdump kernel on Fedora is booted with "irqpoll nr_cpus=1
reset_devices cgroup_disable=memory udev.children-max=2 panic=10
swiotlb=noforce novmcoredd cma=0 hugetlb_cma=0". Removing "nr_cpus=1" or
adding "pci=nomsi" makes this issue disappear.
Note I've also reported this issue to
https://bugzilla.kernel.org/show_bug.cgi?id=220328
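When comparing working and failing boots, it may help to confirm which
of the two parameters mentioned above a given kernel was actually booted
with. The following is only a minimal sketch for a POSIX shell; the
function names has_param and report_kdump_params are hypothetical and
not part of any existing tool.

```shell
#!/bin/sh
# has_param CMDLINE PARAM
# Succeeds if PARAM appears as a whole space-separated word in CMDLINE.
has_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
    esac
    return 1
}

# report_kdump_params CMDLINE
# Prints whether each of the two parameters relevant to this report is set.
report_kdump_params() {
    for p in nr_cpus=1 pci=nomsi; do
        if has_param "$1" "$p"; then
            echo "$p: present"
        else
            echo "$p: absent"
        fi
    done
}

# Usage inside the kdump kernel (or the first kernel):
#   report_kdump_params "$(cat /proc/cmdline)"
```

Matching whole words avoids mistaking e.g. "nr_cpus=16" for "nr_cpus=1".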
--
Best regards,
Coiby
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Regression] kdump fails to get DHCP address unless booting with pci=nomsi or without nr_cpus=1
2025-08-11 3:23 [Regression] kdump fails to get DHCP address unless booting with pci=nomsi or without nr_cpus=1 Coiby Xu
@ 2025-08-11 13:02 ` Thomas Gleixner
2025-08-11 13:03 ` Thomas Gleixner
2025-08-12 3:29 ` Coiby Xu
0 siblings, 2 replies; 13+ messages in thread
From: Thomas Gleixner @ 2025-08-11 13:02 UTC (permalink / raw)
To: Coiby Xu; +Cc: linux-arm-kernel, linux-pci, kexec
On Mon, Aug 11 2025 at 11:23, Coiby Xu wrote:
> Recently I met an issue that on certain virtual machines, the kdump
> kernel fails to get DHCP IP address most of times starting from
> 6.11-rc2. git bisection shows commit b5712bf89b4b ("irqchip/gic-v3-its:
> Provide MSI parent for PCI/MSI[-X]") is the 1st bad commit,
>
> # good: [7d189c77106ed6df09829f7a419e35ada67b2bd0] PCI/MSI: Provide
> # MSI_FLAG_PCI_MSI_MASK_PARENT
> git bisect good 7d189c77106ed6df09829f7a419e35ada67b2bd0
> # good: [48f71d56e2b87839052d2a2ec32fc97a79c3e264] irqchip/gic-v3-its:
> # Provide MSI parent infrastructure
> git bisect good 48f71d56e2b87839052d2a2ec32fc97a79c3e264
> # good: [8c41ccec839c622b2d1be769a95405e4e9a4cb20] irqchip/irq-msi-lib:
> # Prepare for PCI MSI/MSIX
> git bisect good 8c41ccec839c622b2d1be769a95405e4e9a4cb20
> # first bad commit: [b5712bf89b4bbc5bcc9ebde8753ad222f1f68296]
> # irqchip/gic-v3-its: Provide MSI parent for PCI/MSI[-X]
There were follow up fixes on this, so isolating this one is not really
conclusive.
Is the problem still there on v6.16 and v6.17-rc1?
Thanks,
tglx
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Regression] kdump fails to get DHCP address unless booting with pci=nomsi or without nr_cpus=1
2025-08-11 13:02 ` Thomas Gleixner
@ 2025-08-11 13:03 ` Thomas Gleixner
2025-08-11 14:52 ` Marc Zyngier
2025-08-12 3:29 ` Coiby Xu
1 sibling, 1 reply; 13+ messages in thread
From: Thomas Gleixner @ 2025-08-11 13:03 UTC (permalink / raw)
To: Coiby Xu; +Cc: linux-arm-kernel, linux-pci, kexec, Marc Zyngier
On Mon, Aug 11 2025 at 15:02, Thomas Gleixner wrote:
CC+ Marc
> On Mon, Aug 11 2025 at 11:23, Coiby Xu wrote:
>> Recently I met an issue that on certain virtual machines, the kdump
>> kernel fails to get DHCP IP address most of times starting from
>> 6.11-rc2. git bisection shows commit b5712bf89b4b ("irqchip/gic-v3-its:
>> Provide MSI parent for PCI/MSI[-X]") is the 1st bad commit,
>>
>> # good: [7d189c77106ed6df09829f7a419e35ada67b2bd0] PCI/MSI: Provide
>> # MSI_FLAG_PCI_MSI_MASK_PARENT
>> git bisect good 7d189c77106ed6df09829f7a419e35ada67b2bd0
>> # good: [48f71d56e2b87839052d2a2ec32fc97a79c3e264] irqchip/gic-v3-its:
>> # Provide MSI parent infrastructure
>> git bisect good 48f71d56e2b87839052d2a2ec32fc97a79c3e264
>> # good: [8c41ccec839c622b2d1be769a95405e4e9a4cb20] irqchip/irq-msi-lib:
>> # Prepare for PCI MSI/MSIX
>> git bisect good 8c41ccec839c622b2d1be769a95405e4e9a4cb20
>> # first bad commit: [b5712bf89b4bbc5bcc9ebde8753ad222f1f68296]
>> # irqchip/gic-v3-its: Provide MSI parent for PCI/MSI[-X]
>
> There were follow up fixes on this, so isolating this one is not really
> conclusive.
>
> Is the problem still there on v6.16 and v6.17-rc1?
>
> Thanks,
>
> tglx
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Regression] kdump fails to get DHCP address unless booting with pci=nomsi or without nr_cpus=1
2025-08-11 13:03 ` Thomas Gleixner
@ 2025-08-11 14:52 ` Marc Zyngier
2025-08-12 10:09 ` Coiby Xu
0 siblings, 1 reply; 13+ messages in thread
From: Marc Zyngier @ 2025-08-11 14:52 UTC (permalink / raw)
To: Thomas Gleixner, Coiby Xu; +Cc: linux-arm-kernel, linux-pci, kexec
On Mon, 11 Aug 2025 14:03:21 +0100,
Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Mon, Aug 11 2025 at 15:02, Thomas Gleixner wrote:
>
> CC+ Marc
>
> > On Mon, Aug 11 2025 at 11:23, Coiby Xu wrote:
> >> Recently I met an issue that on certain virtual machines, the kdump
> >> kernel fails to get DHCP IP address most of times starting from
> >> 6.11-rc2. git bisection shows commit b5712bf89b4b ("irqchip/gic-v3-its:
> >> Provide MSI parent for PCI/MSI[-X]") is the 1st bad commit,
> >>
> >> # good: [7d189c77106ed6df09829f7a419e35ada67b2bd0] PCI/MSI: Provide
> >> # MSI_FLAG_PCI_MSI_MASK_PARENT
> >> git bisect good 7d189c77106ed6df09829f7a419e35ada67b2bd0
> >> # good: [48f71d56e2b87839052d2a2ec32fc97a79c3e264] irqchip/gic-v3-its:
> >> # Provide MSI parent infrastructure
> >> git bisect good 48f71d56e2b87839052d2a2ec32fc97a79c3e264
> >> # good: [8c41ccec839c622b2d1be769a95405e4e9a4cb20] irqchip/irq-msi-lib:
> >> # Prepare for PCI MSI/MSIX
> >> git bisect good 8c41ccec839c622b2d1be769a95405e4e9a4cb20
> >> # first bad commit: [b5712bf89b4bbc5bcc9ebde8753ad222f1f68296]
> >> # irqchip/gic-v3-its: Provide MSI parent for PCI/MSI[-X]
> >
> > There were follow up fixes on this, so isolating this one is not really
> > conclusive.
> >
> > Is the problem still there on v6.16 and v6.17-rc1?
Yeah, there are way too many things that have been addressed since.
kdump is also a particularly nasty case, as it tends to rely on the
redistributor tables programmed by the previous kernel.
Also, this says "virtual machines". What's the hypervisor? How hard is
it to reproduce?
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Regression] kdump fails to get DHCP address unless booting with pci=nomsi or without nr_cpus=1
2025-08-11 13:02 ` Thomas Gleixner
2025-08-11 13:03 ` Thomas Gleixner
@ 2025-08-12 3:29 ` Coiby Xu
1 sibling, 0 replies; 13+ messages in thread
From: Coiby Xu @ 2025-08-12 3:29 UTC (permalink / raw)
To: Thomas Gleixner; +Cc: linux-arm-kernel, linux-pci, kexec
On Mon, Aug 11, 2025 at 03:02:33PM +0200, Thomas Gleixner wrote:
>On Mon, Aug 11 2025 at 11:23, Coiby Xu wrote:
>> Recently I met an issue that on certain virtual machines, the kdump
>> kernel fails to get DHCP IP address most of times starting from
>> 6.11-rc2. git bisection shows commit b5712bf89b4b ("irqchip/gic-v3-its:
>> Provide MSI parent for PCI/MSI[-X]") is the 1st bad commit,
>>
>> # good: [7d189c77106ed6df09829f7a419e35ada67b2bd0] PCI/MSI: Provide
>> # MSI_FLAG_PCI_MSI_MASK_PARENT
>> git bisect good 7d189c77106ed6df09829f7a419e35ada67b2bd0
>> # good: [48f71d56e2b87839052d2a2ec32fc97a79c3e264] irqchip/gic-v3-its:
>> # Provide MSI parent infrastructure
>> git bisect good 48f71d56e2b87839052d2a2ec32fc97a79c3e264
>> # good: [8c41ccec839c622b2d1be769a95405e4e9a4cb20] irqchip/irq-msi-lib:
>> # Prepare for PCI MSI/MSIX
>> git bisect good 8c41ccec839c622b2d1be769a95405e4e9a4cb20
>> # first bad commit: [b5712bf89b4bbc5bcc9ebde8753ad222f1f68296]
>> # irqchip/gic-v3-its: Provide MSI parent for PCI/MSI[-X]
>
>There were follow up fixes on this, so isolating this one is not really
>conclusive.
>
>Is the problem still there on v6.16 and v6.17-rc1?
Thanks for the reply! Yes, I can confirm it still happens with
6.16.0-200.fc42.aarch64 and 6.17.0-0.rc1.17.fc43.aarch64.
>
>Thanks,
>
> tglx
>
--
Best regards,
Coiby
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Regression] kdump fails to get DHCP address unless booting with pci=nomsi or without nr_cpus=1
2025-08-11 14:52 ` Marc Zyngier
@ 2025-08-12 10:09 ` Coiby Xu
2025-08-12 10:17 ` Marc Zyngier
0 siblings, 1 reply; 13+ messages in thread
From: Coiby Xu @ 2025-08-12 10:09 UTC (permalink / raw)
To: Marc Zyngier; +Cc: Thomas Gleixner, linux-arm-kernel, linux-pci, kexec
On Mon, Aug 11, 2025 at 03:52:04PM +0100, Marc Zyngier wrote:
>On Mon, 11 Aug 2025 14:03:21 +0100,
>Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> On Mon, Aug 11 2025 at 15:02, Thomas Gleixner wrote:
>>
>> CC+ Marc
>>
>> > On Mon, Aug 11 2025 at 11:23, Coiby Xu wrote:
>> >> Recently I met an issue that on certain virtual machines, the kdump
>> >> kernel fails to get DHCP IP address most of times starting from
>> >> 6.11-rc2. git bisection shows commit b5712bf89b4b ("irqchip/gic-v3-its:
>> >> Provide MSI parent for PCI/MSI[-X]") is the 1st bad commit,
>> >>
>> >> # good: [7d189c77106ed6df09829f7a419e35ada67b2bd0] PCI/MSI: Provide
>> >> # MSI_FLAG_PCI_MSI_MASK_PARENT
>> >> git bisect good 7d189c77106ed6df09829f7a419e35ada67b2bd0
>> >> # good: [48f71d56e2b87839052d2a2ec32fc97a79c3e264] irqchip/gic-v3-its:
>> >> # Provide MSI parent infrastructure
>> >> git bisect good 48f71d56e2b87839052d2a2ec32fc97a79c3e264
>> >> # good: [8c41ccec839c622b2d1be769a95405e4e9a4cb20] irqchip/irq-msi-lib:
>> >> # Prepare for PCI MSI/MSIX
>> >> git bisect good 8c41ccec839c622b2d1be769a95405e4e9a4cb20
>> >> # first bad commit: [b5712bf89b4bbc5bcc9ebde8753ad222f1f68296]
>> >> # irqchip/gic-v3-its: Provide MSI parent for PCI/MSI[-X]
>> >
>> > There were follow up fixes on this, so isolating this one is not really
>> > conclusive.
>> >
>> > Is the problem still there on v6.16 and v6.17-rc1?
>
>Yeah, there are way too many things that have been addressed since.
>kdump is also a particularly nasty case, as it tends to rely on the
>redistributor tables programmed by the previous kernel.
Thanks for providing a clue. This may also explain why I fail to
reproduce this issue on the 1st kernel even with the same cmdline as
the kdump kernel.
>
>Also, this says "virtual machines". What's the hypervisor?
I'll contact the lab administrator. What kind of info should I collect
to help you narrow down the issue?
> How hard is it to reproduce?
It can be reproduced reliably on certain machines. But as of this
writing I haven't reproduced it on other KVM virtual machines on three
different host machines.
>
>Thanks,
>
> M.
>
>--
>Without deviation from the norm, progress is not possible.
>
--
Best regards,
Coiby
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Regression] kdump fails to get DHCP address unless booting with pci=nomsi or without nr_cpus=1
2025-08-12 10:09 ` Coiby Xu
@ 2025-08-12 10:17 ` Marc Zyngier
2025-08-12 11:07 ` Coiby Xu
0 siblings, 1 reply; 13+ messages in thread
From: Marc Zyngier @ 2025-08-12 10:17 UTC (permalink / raw)
To: Coiby Xu; +Cc: Thomas Gleixner, linux-arm-kernel, linux-pci, kexec
On Tue, 12 Aug 2025 11:09:12 +0100,
Coiby Xu <coxu@redhat.com> wrote:
>
> On Mon, Aug 11, 2025 at 03:52:04PM +0100, Marc Zyngier wrote:
> > On Mon, 11 Aug 2025 14:03:21 +0100,
> > Thomas Gleixner <tglx@linutronix.de> wrote:
> >>
> >> On Mon, Aug 11 2025 at 15:02, Thomas Gleixner wrote:
> >>
> >> CC+ Marc
> >>
> >> > On Mon, Aug 11 2025 at 11:23, Coiby Xu wrote:
> >> >> Recently I met an issue that on certain virtual machines, the kdump
> >> >> kernel fails to get DHCP IP address most of times starting from
> >> >> 6.11-rc2. git bisection shows commit b5712bf89b4b ("irqchip/gic-v3-its:
> >> >> Provide MSI parent for PCI/MSI[-X]") is the 1st bad commit,
> >> >>
> >> >> # good: [7d189c77106ed6df09829f7a419e35ada67b2bd0] PCI/MSI: Provide
> >> >> # MSI_FLAG_PCI_MSI_MASK_PARENT
> >> >> git bisect good 7d189c77106ed6df09829f7a419e35ada67b2bd0
> >> >> # good: [48f71d56e2b87839052d2a2ec32fc97a79c3e264] irqchip/gic-v3-its:
> >> >> # Provide MSI parent infrastructure
> >> >> git bisect good 48f71d56e2b87839052d2a2ec32fc97a79c3e264
> >> >> # good: [8c41ccec839c622b2d1be769a95405e4e9a4cb20] irqchip/irq-msi-lib:
> >> >> # Prepare for PCI MSI/MSIX
> >> >> git bisect good 8c41ccec839c622b2d1be769a95405e4e9a4cb20
> >> >> # first bad commit: [b5712bf89b4bbc5bcc9ebde8753ad222f1f68296]
> >> >> # irqchip/gic-v3-its: Provide MSI parent for PCI/MSI[-X]
> >> >
> >> > There were follow up fixes on this, so isolating this one is not really
> >> > conclusive.
> >> >
> >> > Is the problem still there on v6.16 and v6.17-rc1?
> >
> > Yeah, there are way too many things that have been addressed since.
> > kdump is also a particularly nasty case, as it tends to rely on the
> > redistributor tables programmed by the previous kernel.
>
> Thanks for providing a clue. This may also explain explain why I fails
> to reproduce this issue against 1st kernel even with the same cmdline of
> the kdump kernel.
I'm not sure that's a clue. It's only an indication that things are
not necessarily easy to spot.
Has it ever been reproduced on bare metal? Have you tried v6.16 as
instructed?
>
> >
> > Also, this says "virtual machines". What's the hypervisor?
>
> I'll contact the lab administrator. What kinds of info I should collect
> to help you narrow down the issue?
Surely you know what hypervisor you're running on, right?
>
> > How hard is it to reproduce?
>
> It can be reproduced reliably on certain machines. But as of writing I
> haven't reproduced it on other KVM virtual machines on three different
> host machines.
Which machines? I'm sorry, but if you want help on this, you'll have
to provide actual information.
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Regression] kdump fails to get DHCP address unless booting with pci=nomsi or without nr_cpus=1
2025-08-12 10:17 ` Marc Zyngier
@ 2025-08-12 11:07 ` Coiby Xu
2025-08-12 13:14 ` Marc Zyngier
0 siblings, 1 reply; 13+ messages in thread
From: Coiby Xu @ 2025-08-12 11:07 UTC (permalink / raw)
To: Marc Zyngier; +Cc: Thomas Gleixner, linux-arm-kernel, linux-pci, kexec
On Tue, Aug 12, 2025 at 11:17:04AM +0100, Marc Zyngier wrote:
>On Tue, 12 Aug 2025 11:09:12 +0100,
>Coiby Xu <coxu@redhat.com> wrote:
>>
>> On Mon, Aug 11, 2025 at 03:52:04PM +0100, Marc Zyngier wrote:
>> > On Mon, 11 Aug 2025 14:03:21 +0100,
>> > Thomas Gleixner <tglx@linutronix.de> wrote:
>> >>
>> >> On Mon, Aug 11 2025 at 15:02, Thomas Gleixner wrote:
>> >>
>> >> CC+ Marc
>> >>
>> >> > On Mon, Aug 11 2025 at 11:23, Coiby Xu wrote:
>> >> >> Recently I met an issue that on certain virtual machines, the kdump
>> >> >> kernel fails to get DHCP IP address most of times starting from
>> >> >> 6.11-rc2. git bisection shows commit b5712bf89b4b ("irqchip/gic-v3-its:
>> >> >> Provide MSI parent for PCI/MSI[-X]") is the 1st bad commit,
>> >> >>
>> >> >> # good: [7d189c77106ed6df09829f7a419e35ada67b2bd0] PCI/MSI: Provide
>> >> >> # MSI_FLAG_PCI_MSI_MASK_PARENT
>> >> >> git bisect good 7d189c77106ed6df09829f7a419e35ada67b2bd0
>> >> >> # good: [48f71d56e2b87839052d2a2ec32fc97a79c3e264] irqchip/gic-v3-its:
>> >> >> # Provide MSI parent infrastructure
>> >> >> git bisect good 48f71d56e2b87839052d2a2ec32fc97a79c3e264
>> >> >> # good: [8c41ccec839c622b2d1be769a95405e4e9a4cb20] irqchip/irq-msi-lib:
>> >> >> # Prepare for PCI MSI/MSIX
>> >> >> git bisect good 8c41ccec839c622b2d1be769a95405e4e9a4cb20
>> >> >> # first bad commit: [b5712bf89b4bbc5bcc9ebde8753ad222f1f68296]
>> >> >> # irqchip/gic-v3-its: Provide MSI parent for PCI/MSI[-X]
>> >> >
>> >> > There were follow up fixes on this, so isolating this one is not really
>> >> > conclusive.
>> >> >
>> >> > Is the problem still there on v6.16 and v6.17-rc1?
>> >
>> > Yeah, there are way too many things that have been addressed since.
>> > kdump is also a particularly nasty case, as it tends to rely on the
>> > redistributor tables programmed by the previous kernel.
>>
>> Thanks for providing a clue. This may also explain explain why I fails
>> to reproduce this issue against 1st kernel even with the same cmdline of
>> the kdump kernel.
>
>I'm not sure that's a clue. It's only an indication that things are
>not necessarily easy to spot.
>
>Has it ever been reproduced on bare metal? Have you tried v6.16 as
>instructed?
Thanks for replying so quickly!
No, I haven't reproduced it on a bare metal machine and our QE engineers
haven't noticed this issue on any bare metal machine either.
And I can confirm this issue still happens with 6.16.0-200.fc42.aarch64
and 6.17.0-0.rc1.17.fc43.aarch64 on the type of KVM VMs (QEMU PnP device
PNP0c02) where the issue was found.
>
>>
>> >
>> > Also, this says "virtual machines". What's the hypervisor?
>>
>> I'll contact the lab administrator. What kinds of info I should collect
>> to help you narrow down the issue?
>
>Surely you know what hypervisor you're running on, right?
Yes, the hypervisor is KVM. Sorry, I thought merely providing the
hypervisor info wouldn't be sufficient and also misunderstood your
request as asking for more details on the host machine.
>
>>
>> > How hard is it to reproduce?
>>
>> It can be reproduced reliably on certain machines. But as of writing I
>> haven't reproduced it on other KVM virtual machines on three different
>> host machines.
>
>Which machines? I'm sorry, but if you want help on this, you'll have
>to provide actual information.
Sorry, I didn't mean to be vague. I thought your question was about how
reproducible this issue is, so there was no need to provide details on
the machines where I can't reproduce it. Since you explicitly request
it, I'll be glad to share the details.
I just grabbed three arbitrary bare metal machines with Fedora 42
installed and launched some KVM VMs to see if this issue can be
reproduced easily. Two host machines are as follows (sorry, I can't find
the info of the 3rd one):
- GIGABYTE PnP device PNP0c02, ARMv8 (M128-30)
- LTHPCSR112 (01234567890123456789AB), ARMv8 (Q80-30)
The virtual machine image is downloaded from
https://download.fedoraproject.org/pub/fedora/linux/releases/42/Cloud/aarch64/images/Fedora-Cloud-Base-Generic-42-1.1.aarch64.qcow2.
I tried different vCPUs (2, 4), different RAM (4G, 35G) and also two
different UEFI firmware (the default one and one from edk2-experimental
package) but haven't reproduced this issue so far.
>
>Thanks,
>
> M.
>
>--
>Without deviation from the norm, progress is not possible.
>
--
Best regards,
Coiby
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Regression] kdump fails to get DHCP address unless booting with pci=nomsi or without nr_cpus=1
2025-08-12 11:07 ` Coiby Xu
@ 2025-08-12 13:14 ` Marc Zyngier
[not found] ` <yweverlt7onyse3rbm7phxzwrwfk4pq2dipzdjenrx4onrak6r@dsm4ra3x3gv6>
0 siblings, 1 reply; 13+ messages in thread
From: Marc Zyngier @ 2025-08-12 13:14 UTC (permalink / raw)
To: Coiby Xu; +Cc: Thomas Gleixner, linux-arm-kernel, linux-pci, kexec
On Tue, 12 Aug 2025 12:07:56 +0100,
Coiby Xu <coxu@redhat.com> wrote:
>
> On Tue, Aug 12, 2025 at 11:17:04AM +0100, Marc Zyngier wrote:
> > On Tue, 12 Aug 2025 11:09:12 +0100,
> > Coiby Xu <coxu@redhat.com> wrote:
> >>
> >> On Mon, Aug 11, 2025 at 03:52:04PM +0100, Marc Zyngier wrote:
> >> > On Mon, 11 Aug 2025 14:03:21 +0100,
> >> > Thomas Gleixner <tglx@linutronix.de> wrote:
> >> >>
> >> >> On Mon, Aug 11 2025 at 15:02, Thomas Gleixner wrote:
> >> >>
> >> >> CC+ Marc
> >> >>
> >> >> > On Mon, Aug 11 2025 at 11:23, Coiby Xu wrote:
> >> >> >> Recently I met an issue that on certain virtual machines, the kdump
> >> >> >> kernel fails to get DHCP IP address most of times starting from
> >> >> >> 6.11-rc2. git bisection shows commit b5712bf89b4b ("irqchip/gic-v3-its:
> >> >> >> Provide MSI parent for PCI/MSI[-X]") is the 1st bad commit,
> >> >> >>
> >> >> >> # good: [7d189c77106ed6df09829f7a419e35ada67b2bd0] PCI/MSI: Provide
> >> >> >> # MSI_FLAG_PCI_MSI_MASK_PARENT
> >> >> >> git bisect good 7d189c77106ed6df09829f7a419e35ada67b2bd0
> >> >> >> # good: [48f71d56e2b87839052d2a2ec32fc97a79c3e264] irqchip/gic-v3-its:
> >> >> >> # Provide MSI parent infrastructure
> >> >> >> git bisect good 48f71d56e2b87839052d2a2ec32fc97a79c3e264
> >> >> >> # good: [8c41ccec839c622b2d1be769a95405e4e9a4cb20] irqchip/irq-msi-lib:
> >> >> >> # Prepare for PCI MSI/MSIX
> >> >> >> git bisect good 8c41ccec839c622b2d1be769a95405e4e9a4cb20
> >> >> >> # first bad commit: [b5712bf89b4bbc5bcc9ebde8753ad222f1f68296]
> >> >> >> # irqchip/gic-v3-its: Provide MSI parent for PCI/MSI[-X]
> >> >> >
> >> >> > There were follow up fixes on this, so isolating this one is not really
> >> >> > conclusive.
> >> >> >
> >> >> > Is the problem still there on v6.16 and v6.17-rc1?
> >> >
> >> > Yeah, there are way too many things that have been addressed since.
> >> > kdump is also a particularly nasty case, as it tends to rely on the
> >> > redistributor tables programmed by the previous kernel.
> >>
> >> Thanks for providing a clue. This may also explain explain why I fails
> >> to reproduce this issue against 1st kernel even with the same cmdline of
> >> the kdump kernel.
> >
> > I'm not sure that's a clue. It's only an indication that things are
> > not necessarily easy to spot.
> >
> > Has it ever been reproduced on bare metal? Have you tried v6.16 as
> > instructed?
>
> Thanks for replying so quickly!
>
> No, I haven't reproduced it on a bare metal machine and our QE engineers
> haven't noticed this issue on any bare metal machine either.
> And I can confirm this issue still happens to 6.16.0-200.fc42.aarch64
> and 6.17.0-0.rc1.17.fc43.aarch64 on the type of KVM VMS (QEMU PnP device
> PNP0c02) where the issue was found.
What is that device? Is that the emulated PCI bridge?
> >> > Also, this says "virtual machines". What's the hypervisor?
> >>
> >> I'll contact the lab administrator. What kinds of info I should collect
> >> to help you narrow down the issue?
> >
> > Surely you know what hypervisor you're running on, right?
>
> Yes, the hypervisor is KVM. Sorry, I thought merely providing the
> hypervisor info isn't sufficient and also misunderstood your request as
> providing more details on the host machine.
Well, knowing that it is KVM is definitely relevant, given that this
is my own turf.
> >> > How hard is it to reproduce?
> >>
> >> It can be reproduced reliably on certain machines. But as of writing I
> >> haven't reproduced it on other KVM virtual machines on three different
> >> host machines.
> >
> > Which machines? I'm sorry, but if you want help on this, you'll have
> > to provide actual information.
>
> Sorry, I didn't mean to be vague. I thought you question is on how
> reproducible this issue is and there is no need to provide the details
> on the machines where I can't reproduce this issue. Since you explicitly
> request it, I'll be glad to share the details.
>
> I just grabbed three arbitrary bare metal machines having Fedora-42
> installed and launched some KVM VMs to see if this issue can be
> reproduced easily. Two host machines are as follows (sorry I can't find
> the info of the 3rd one)
> - GIGABYTE PnP device PNP0c02, ARMv8 (M128-30)
> - LTHPCSR112 (01234567890123456789AB), ARMv8 (Q80-30)
Are these both Ampere Altra boxes?
> The virtual machine image is downloaded from
> https://download.fedoraproject.org/pub/fedora/linux/releases/42/Cloud/aarch64/images/Fedora-Cloud-Base-Generic-42-1.1.aarch64.qcow2.
> I tried different vCPUs (2, 4), different RAM (4G, 35G) and also two
> different UEFI firmware (the default one and one from edk2-experimental
> package) but haven't reproduced this issue so far.
Hold on. Above, you say that you have reproduced it with
6.16.0-200.fc42.aarch64. So have you, or have you not reproduced it?
Can you at the very least share:
- the boot log of the guest on its first kernel
- the boot log of the guest running kdump
- the content of /sys/kernel/debug/kvm/$PID-xx/vgic*state* when
running both kernels
- the QEMU command-line to get to run the whole thing
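For the vgic state item, a capture could look roughly like the sketch
below. It assumes debugfs is mounted at /sys/kernel/debug on the host,
that it runs as root, and that the QEMU PID is known; the function name
collect_vgic_state and the output naming are hypothetical.

```shell
#!/bin/sh
# collect_vgic_state QEMU_PID LABEL [DEBUGFS_KVM_DIR] [OUT_DIR]
# Copies every vgic*state* file of the VM owned by QEMU_PID into OUT_DIR,
# appending _LABEL to each file name. Returns non-zero if nothing was found.
collect_vgic_state() {
    pid=$1
    label=$2
    kvm_dir=${3:-/sys/kernel/debug/kvm}
    out_dir=${4:-.}
    found=1
    for f in "$kvm_dir/$pid"-*/vgic*state*; do
        [ -f "$f" ] || continue
        cat "$f" > "$out_dir/$(basename "$f")_$label" || return 1
        found=0
    done
    return $found
}

# Usage on the host, once per guest kernel:
#   collect_vgic_state "$QEMU_PID" 1st_kernel   # guest running its first kernel
#   collect_vgic_state "$QEMU_PID" 2nd_kernel   # guest running the kdump kernel
```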
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Regression] kdump fails to get DHCP address unless booting with pci=nomsi or without nr_cpus=1
[not found] ` <yweverlt7onyse3rbm7phxzwrwfk4pq2dipzdjenrx4onrak6r@dsm4ra3x3gv6>
@ 2025-08-19 23:30 ` Coiby Xu
2025-08-20 8:56 ` Marc Zyngier
0 siblings, 1 reply; 13+ messages in thread
From: Coiby Xu @ 2025-08-19 23:30 UTC (permalink / raw)
To: Marc Zyngier; +Cc: Thomas Gleixner, linux-arm-kernel, linux-pci, kexec
[-- Attachment #1: Type: text/plain, Size: 948 bytes --]
On Wed, Aug 13, 2025 at 08:08:28PM +0800, Coiby Xu wrote:
>On Tue, Aug 12, 2025 at 02:14:25PM +0100, Marc Zyngier wrote:
[...]
>>
>>Can you at the very least share:
Thanks for your patience! I've attached a zip file with the info you
need. Additionally I've included the dmidecode of guest
(dmidecode_guest), host machine (dmidecode_host) and the domain info
of guest (libvirt.xml) in case they may be helpful. If you need further
info or any experiment I need to do, feel free to let me know! Now I
have access to the host machine so I can respond much faster.
>>
>>- the boot log of the guest on its first kernel
Please check file boot_log_1st_kernel
>>
>>- the boot log of the guest running kdump
boot_log_2nd_kernel
>>
>>- the content of /sys/kernel/debug/kvm/$PID-xx/vgic*state* when
>> running both kernels
vgic-state_{1st,2nd}_kernel
>>
>>- the QEMU command-line to get to run the whole thing
qemu_cmdline
--
Best regards,
Coiby
[-- Attachment #2: debug_info_ampere-mtsnow-altramax-02-vm-11.zip --]
[-- Type: application/zip, Size: 54871 bytes --]
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Regression] kdump fails to get DHCP address unless booting with pci=nomsi or without nr_cpus=1
2025-08-19 23:30 ` Coiby Xu
@ 2025-08-20 8:56 ` Marc Zyngier
2025-08-23 3:00 ` Coiby Xu
0 siblings, 1 reply; 13+ messages in thread
From: Marc Zyngier @ 2025-08-20 8:56 UTC (permalink / raw)
To: Coiby Xu; +Cc: Thomas Gleixner, linux-arm-kernel, linux-pci, kexec
On Wed, 20 Aug 2025 00:30:12 +0100,
Coiby Xu <coxu@redhat.com> wrote:
>
> On Wed, Aug 13, 2025 at 08:08:28PM +0800, Coiby Xu wrote:
> > On Tue, Aug 12, 2025 at 02:14:25PM +0100, Marc Zyngier wrote:
> [...]
> >>
> >> Can you at the very least share:
>
> Thanks for your patience! I've attached a zip file with the info you
> need. Additionally I've included the dmidecode of guest
> (dmidecode_guest), host machine (dmidecode_host) and the domain info
> of guest (libvirt.xml) in case they may be helpful. If you need further
> info or any experiment I need to do, feel free to let me know! Now I
> have access to the host machine so I can respond much faster.
>
> >>
> >> - the boot log of the guest on its first kernel
>
> Please check file boot_log_1st_kernel
Old kernel. It would have been better to use a vanilla v6.16, so that
we know exactly what you are running. I have zero interest in finding
out what 6.15.9-201.fc42.aarch64 corresponds to in real life.
> >> - the boot log of the guest running kdump
>
> boot_log_2nd_kernel
Same thing.
>
> >>
> >> - the content of /sys/kernel/debug/kvm/$PID-xx/vgic*state* when
> >> running both kernels
>
> vgic-state_{1st,2nd}_kernel
What is the host running? It also looks like a pre-6.16 kernel, which
lacks important information.
>
> >>
> >> - the QEMU command-line to get to run the whole thing
>
> qemu_cmdline
I'm sorry, but that doesn't look like a command line as I know it. I
certainly cannot feed this to QEMU and reproduce your findings.
Now, there is *one* thing that is interesting:
The second vgic_state dump indicates that LPI 8225 is routed to
vcpu-3. Given that your guest boots into the second kernel on vcpu-0,
and that this is the only online vcpu at this stage, the LPI will
never be presented to the CPU (and the vgic has it as pending, which
is what I'd expect).
I'd suggest you instrument the second kernel to try and see why this
affinity is not changed.
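As a first, non-invasive step before instrumenting the kernel, the
guest's own view of interrupt affinity can be listed from procfs. This
is only a sketch: the function name irqs_not_on is hypothetical, it
assumes /proc/irq/N/effective_affinity_list is available in the guest
kernel, and the guest's view may well look correct even when the
host-side vgic state still shows the old target.

```shell
#!/bin/sh
# irqs_not_on CPULIST [PROC_IRQ_DIR]
# Prints every IRQ whose effective affinity list is not exactly CPULIST.
# IRQs without a readable or non-empty effective_affinity_list are skipped.
irqs_not_on() {
    want=$1
    irq_dir=${2:-/proc/irq}
    for d in "$irq_dir"/[0-9]*; do
        f=$d/effective_affinity_list
        [ -r "$f" ] || continue
        got=$(cat "$f")
        [ -n "$got" ] || continue
        [ "$got" = "$want" ] || echo "irq $(basename "$d"): effective affinity $got"
    done
    return 0
}

# Usage inside the kdump kernel booted with nr_cpus=1,
# where only CPU 0 is expected to be online:
#   irqs_not_on 0
```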
Thanks,
M.
--
Jazz isn't dead. It just smells funny.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Regression] kdump fails to get DHCP address unless booting with pci=nomsi or without nr_cpus=1
2025-08-20 8:56 ` Marc Zyngier
@ 2025-08-23 3:00 ` Coiby Xu
2025-08-27 8:17 ` Coiby Xu
0 siblings, 1 reply; 13+ messages in thread
From: Coiby Xu @ 2025-08-23 3:00 UTC (permalink / raw)
To: Marc Zyngier
Cc: Thomas Gleixner, linux-arm-kernel, linux-pci, kexec, jbastian,
cmirabil
[-- Attachment #1: Type: text/plain, Size: 6196 bytes --]
Hi Marc,
If I understand correctly, you want to reproduce the issue yourself.
I finally managed to reproduce this issue by playing with the setup
shared by my colleague. Here are the five prerequisites to reproduce the
bug:
1. Guest kernel
Newer than commit b5712bf89b4b ("irqchip/gic-v3-its: Provide MSI parent for PCI/MSI[-X]")
2. Host kernel
Relatively older ones like v6.10.0. Newer ones like v6.12.0 and
v6.17.0 don't have this issue.
3. QEMU <= v6.2
4. Specific host machines
I'm not familiar with the hardware, so I haven't yet figured out
what hardware factor makes the issue reproducible. I've attached
dmidecode outputs of four machines (files inside the dmidecode_host folder).
Two systems (dmidecode_not_work*) can reproduce this issue and the
other two systems (dmidecode_work*) can't, despite all having the same
product name R152-P31-00, CPU model ARMv8 (M128-30) and SKU
01234567890123456789AB. One difference that doesn't seem to be found in
the dmidecode output is that the two machines that can't reproduce the
issue have the model name "PnP device PNP0c02" whereas the problematic
machines have "R152-P31-00 (01234567890123456789AB)" according to our
internal web pages that show the hardware info.
5. The guest needs to be bridged to a physical host interface.
Bridging the guest to a tun interface doesn't reproduce the issue (for
example, the default bridge (virbr0) created by libvirtd uses a tun
interface).
With the above conditions met, I can reproduce the issue simply with
the Fedora Cloud Base 42 image:
1. Start the VM
qemu-system-aarch64 -cpu host -machine virt \
-device virtio-net-pci,netdev=hn0,id=nic1,mac=00:16:3e:3d:5f:b8 \
-netdev bridge,id=hn0,br=br0,helper=/usr/local/libexec/qemu-bridge-helper \
-hda /var/lib/libvirt/images/f42_1.qcow2 \
-accel kvm -boot d \
-drive if=pflash,format=raw,readonly,file=/usr/share/edk2/aarch64/QEMU_EFI-silent-pflash.raw \
-m 35840 -serial stdio -smp 16
2. Set up kdump to dump vmcore to a remote NFS server
dnf install kdump-utils nfs-utils -y
echo nfs NFS_SERVER:EXPORT_PATH >> /etc/kdump.conf
systemctl enable kdump
kdumpctl reset-crashkernel
systemctl reboot
3. After rebooting, trigger 1st kernel crash
If kdump works, i.e. DHCP works, you will need to trigger a kernel crash
again until it doesn't. In my experience, repeating this step 6
consecutive times will surely produce at least one run where DHCP
doesn't work.
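One common way to trigger the crash in step 3 is the sysrq interface.
The sketch below assumes it runs as root in the guest with kdump
already armed; the function name trigger_crash is hypothetical, and the
optional path arguments exist only so the function can be exercised
without crashing a machine.

```shell
#!/bin/sh
# trigger_crash [SYSRQ_CTL] [SYSRQ_TRIGGER]
# Enables all sysrq functions, then requests an immediate kernel crash.
# With the default paths this panics the running kernel right away.
trigger_crash() {
    sysrq=${1:-/proc/sys/kernel/sysrq}
    trigger=${2:-/proc/sysrq-trigger}
    echo 1 > "$sysrq" && echo c > "$trigger"
}

# Usage in the guest (the machine panics and boots the kdump kernel):
#   trigger_crash
```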
Note f42_1.qcow2 was created from Fedora Cloud Base 42 image
https://download.fedoraproject.org/pub/fedora/linux/releases/42/Cloud/aarch64/images/Fedora-Cloud-Base-Generic-42-1.1.aarch64.qcow2
Considering QEMU 6.2 was released about 4 years ago, do you think there
is a need to dig further into this problem to find out how the five
prerequisite conditions interplay with each other to create the bug? If
you think it's worth the effort, I'll do a bisection against QEMU to
find out the 1st bad commit and also provide any other debugging info you
need.
On Wed, Aug 20, 2025 at 09:56:50AM +0100, Marc Zyngier wrote:
>On Wed, 20 Aug 2025 00:30:12 +0100,
>Coiby Xu <coxu@redhat.com> wrote:
>>
>> On Wed, Aug 13, 2025 at 08:08:28PM +0800, Coiby Xu wrote:
>> > On Tue, Aug 12, 2025 at 02:14:25PM +0100, Marc Zyngier wrote:
>> [...]
>> >>
>> >> Can you at the very least share:
>>
>> Thanks for your patience! I've attached a zip file with the info you
>> need. Additionally I've included the dmidecode of guest
>> (dmidecode_guest), host machine (dmidecode_host) and the domain info
>> of guest (libvirt.xml) in case they may be helpful. If you need further
>> info or any experiment I need to do, feel free to let me know! Now I
>> have access to the host machine so I can respond much faster.
>>
>> >>
>> >> - the boot log of the guest on its first kernel
>>
>> Please check file boot_log_1st_kernel
>
>Old kernel. It would have been better to use a vanilla v6.16, so that
>we know exactly what you are running. I have zero interest in finding
>out what 6.15.9-201.fc42.aarch64 corresponds to in real life.
Thanks for the suggestion! I've built v6.16 and attached the logs.
Please check 04_not_work/boot_log_{1st,2nd}_kernel.
Btw, I'm curious to know why you want a vanilla v6.16. Is it because you
are worried a Fedora kernel can be so different from a vanilla v6.16
that it can obscure the problem?
>
>> >> - the boot log of the guest running kdump
>>
>> boot_log_2nd_kernel
>
>Same thing.
>
>>
>> >>
>> >> - the content of /sys/kernel/debug/kvm/$PID-xx/vgic*state* when
>> >> running both kernels
>>
>> vgic-state_{1st,2nd}_kernel
>
>What is the host running? It also looks like a pre-6.16 kernel, which
>lacks important information.
The host is running RHEL8.6. I can confirm that the Fedora kernel
6.10.0-64.fc41.aarch64 can also reproduce the issue, but the latest ones
like 6.17.0-0.rc2.24.fc43.aarch64 can't.
>
>>
>> >>
>> >> - the QEMU command-line to get to run the whole thing
>>
>> qemu_cmdline
>
>I'm sorry, but that doesn't look like a command line as I know it. I
>certainly cannot feed this to QEMU and reproduce your findings.
Sorry I didn't realize you want to reproduce the issue. Previously I
hadn't reproduced the issue and thought it's not easy to reproduce it. Thus I
merely shared the cmdline generated by libvirt/virt-install so you may
find something suspicious.
>
>Now, there is *one* thing that is interesting:
>
>The second vgic_state dump indicates that LPI 8225 is routed to
>vcpu-3. Given that your guest boots into the second kernel on vcpu-0,
>and that this is the only online vcpu at this stage, the LPI will
>never be presented to the CPU (and the vgic has it as pending, which
>is what I'd expect).
>
>I'd suggest you instrument the second kernel to try and see why this
>affinity is not changed.
Currently, I'm not familiar with interrupts. But I notice that in the 2nd
kernel, all /proc/irq/*/smp_affinity files have the same value 1 and
/proc/interrupts only lists one CPU. If you want me to try other things,
please let me know.
>
>Thanks,
>
> M.
>
>--
>Jazz isn't dead. It just smells funny.
>
--
Best regards,
Coiby
[-- Attachment #2: debug_info_VGICv3_not_work_for_kdump.zip --]
[-- Type: application/zip, Size: 111284 bytes --]
* Re: [Regression] kdump fails to get DHCP address unless booting with pci=nomsi or without nr_cpus=1
2025-08-23 3:00 ` Coiby Xu
@ 2025-08-27 8:17 ` Coiby Xu
0 siblings, 0 replies; 13+ messages in thread
From: Coiby Xu @ 2025-08-27 8:17 UTC (permalink / raw)
To: Marc Zyngier
Cc: Thomas Gleixner, linux-arm-kernel, linux-pci, kexec, jbastian,
cmirabil
On Sat, Aug 23, 2025 at 11:00:11AM +0800, Coiby Xu wrote:
>Hi Marc,
>
>If I understand correctly, you want to reproduce the issue by yourself.
>I finally managed to reproduce this issue by playing with the setup
>shared by my colleague. Here are the five prerequisites to reproduce the
>bug,
Hi Marc,
It turns out the host kernel and host machine are not absolute
prerequisites to reproduce the problem. But they matter because they can
make it much more difficult to reproduce. I also did a bisection against
QEMU to find out which commit makes the issue go away. For details, please
check the following inline comments.
>
>1. Guest kernel
> Newer than commit b5712bf89b4b
> ("irqchip/gic-v3-its: Provide MSI parent for PCI/MSI[-X]")
>
>2. Host kernel
> Relatively older ones like v6.10.0. Newer ones like v6.12.0 and
> v6.17.0 don't have this issue.
It turns out that, with the other conditions met, the latest host kernel
(6.17.0-0.rc3) can still reproduce the issue but it's much more
difficult to do so. For example, with RHEL8 kernel
4.18.0-372.9.1.el8.aarch64, I need to trigger a kernel crash at most 3
times to reproduce it. But with Fedora rawhide kernel
6.17.0-0.rc3.31.fc43.aarch64, in 3 out of 10 trials I couldn't reproduce
the issue even after triggering a kernel crash 60 times in a row. For
comparison, here is the number of kernel crashes needed to reproduce
the issue in each of 10 trials:
RHEL8: 2 1 1 1 1 1 2 1 3 2
Fedora rawhide: 43 60 47 60 12 56 60 45 49 18
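To put the two rows side by side, a small sketch that averages the counts listed above. The mean helper is illustrative only. Since 60 marks a trial that was stopped without reproducing the issue, the Fedora rawhide average is a lower bound rather than a true mean.

```shell
# Average of the numbers given as arguments, printed with one decimal.
mean() {
    printf '%s\n' "$@" | awk '{ s += $1 } END { if (NR) printf "%.1f\n", s / NR }'
}

mean 2 1 1 1 1 1 2 1 3 2            # RHEL8 host kernel: prints 1.5
mean 43 60 47 60 12 56 60 45 49 18  # Fedora rawhide host kernel: prints 45.0
```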
>
>3. QEMU <= v6.2
I did a bisection and it shows the issue is gone with QEMU commit
f39b7d2b96e3e73c01bb678cd096f7baf0b9ab39 ("kvm: Atomic memslot updates"),
which is the last (3rd) patch of the patch set "KVM: allow listener to
stop all vcpus before"
https://lists.nongnu.org/archive/html/qemu-devel/2022-11/msg02172.html
Note this commit only appears in QEMU releases after 7.2, so QEMU <= v7.2.0
can also reproduce this issue.
>
>4. Specific host machines
> I'm not familiar with the hardware so currently I haven't figured out
> what hardware factor makes the issue reproducible. I've attached
> dmidecode outputs of four machines (files inside the dmidecode_host folder).
> Two systems (dmidecode_not_work*) can reproduce this issue and the
> other two systems (dmidecode_work*) can't, even though all of them have
> the same product name R152-P31-00, CPU model ARMv8 (M128-30) and SKU
> 01234567890123456789AB. One difference, which doesn't seem to be found in
> the dmidecode output, is that the two machines that can't reproduce the
> issue have the model name "PnP device PNP0c02" whereas the problematic
> machines have "R152-P31-00 (01234567890123456789AB)" according to our
> internal web pages that show the hardware info.
It turns out all four machines can reproduce the issue. I ran 10 trials
on each model and counted the kernel crashes needed to reproduce the
issue; here's a comparison:
R152-P31-00: 2 1 1 1 1 1 2 1 3 2
PnP device PNP0c02: 8 3 5 15 11 18 2 5 12 4
>
>5. The Guest needs to be bridged to a physical host interface.
>Bridging the guest to tun interface can't reproduce the issue (for
> example, the default bridge (virbr0) created by libvirtd uses tun
> interface)
In one trial with virbr0, I triggered a kernel crash 100 times in a row
but couldn't reproduce the issue. So I think bridging the guest to a
physical network interface is still a must.
[...]
--
Best regards,
Coiby
Thread overview: 13+ messages
2025-08-11 3:23 [Regression] kdump fails to get DHCP address unless booting with pci=nomsi or without nr_cpus=1 Coiby Xu
2025-08-11 13:02 ` Thomas Gleixner
2025-08-11 13:03 ` Thomas Gleixner
2025-08-11 14:52 ` Marc Zyngier
2025-08-12 10:09 ` Coiby Xu
2025-08-12 10:17 ` Marc Zyngier
2025-08-12 11:07 ` Coiby Xu
2025-08-12 13:14 ` Marc Zyngier
[not found] ` <yweverlt7onyse3rbm7phxzwrwfk4pq2dipzdjenrx4onrak6r@dsm4ra3x3gv6>
2025-08-19 23:30 ` Coiby Xu
2025-08-20 8:56 ` Marc Zyngier
2025-08-23 3:00 ` Coiby Xu
2025-08-27 8:17 ` Coiby Xu
2025-08-12 3:29 ` Coiby Xu