From: Marc Zyngier <maz@kernel.org>
To: Shanker Donthineni <sdonthineni@nvidia.com>
Cc: James Morse <james.morse@arm.com>,
Catalin Marinas <catalin.marinas@arm.com>,
Will Deacon <will@kernel.org>,
linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org, Vikram Sethi <vsethi@nvidia.com>,
Zenghui Yu <yuzenghui@huawei.com>,
Oliver Upton <oliver.upton@linux.dev>,
Suzuki K Poulose <suzuki.poulose@arm.com>,
Ard Biesheuvel <ardb@kernel.org>
Subject: Re: [PATCH] KVM: arm64: vgic: Fix soft lockup during VM teardown
Date: Fri, 20 Jan 2023 12:00:01 +0000 [thread overview]
Message-ID: <86r0vpmn5q.wl-maz@kernel.org> (raw)
In-Reply-To: <b3bf3a46-410b-a756-dfd9-ee74d5dc31e0@nvidia.com>
On Fri, 20 Jan 2023 05:02:15 +0000,
Shanker Donthineni <sdonthineni@nvidia.com> wrote:
>
> [1 <text/plain; UTF-8 (7bit)>]
> Hi Marc,
>
> On 1/19/23 08:01, Marc Zyngier wrote:
> >> On 1/19/23 01:11, Marc Zyngier wrote:
> >>> So you can see the VM being torn down while the vgic save sequence is
> >>> still in progress?
> >>>
> >>> If you can actually see that, then this is a much bigger bug than the
> >>> simple race you are describing, and we're missing a reference on the
> >>> kvm structure. This would be a*MAJOR* bug.
> >>>
> >> How do we know vGIC save sequence is in progress while VM is being
> >> teardown? I'm launching/terminating ~32 VMs in a loop to reproduce
> >> the issue.
> > Errr...*you* know when you are issuing the save ioctl, right? You
> > also know when you are terminating the VM (closing its fd or killing
> > the VMM).
> >
>
> Added debug statements to trace the code patch, and tagged each log message
> with 'struct kvm *'. Attached the complete kernel log messages including
> debug messages.
>
> All 32 VMs launched, time 258s to 291s
> [ 258.519837] kvm_create_vm(1236) called kvm=ffff8000303e0000 --> 1st VM
> ...
> [ 291.801179] kvm_create_vm(1236) called kvm=ffff800057a60000 --> 32nd VM
>
> Test script inside VM issues poweroff command after sleeping 200sec.
>
> Working case kvm=ffff8000303e0000:
>
> $ cat gicv4-debug.txt | grep ffff8000303e0000
> [ 258.519837] kvm_create_vm(1236) called kvm=ffff8000303e0000
> [ 258.667101] vgic_v4_init(267) called kvm=ffff8000303e0000 doorbell=140(64)
> [ 517.942167] vgic_set_common_attr(263) called kvm=ffff8000303e0000
> [ 517.948415] vgic_v3_save_pending_tables(397) called kvm=ffff8000303e0000
> [ 517.955602] vgic_v3_save_pending_tables(448) called kvm=ffff8000303e0000
> [ 518.099696] kvm_vm_release(1374) called kvm=ffff8000303e0000
> [ 518.126833] vgic_v4_teardown(323) started kvm=ffff8000303e0000 doorbell=140(64)
> [ 518.134677] vgic_v4_teardown(333) finished kvm=ffff8000303e0000 doorbell=140(64)
>
> Not working case kvm=ffff80001e0a0000:
>
> $ cat gicv4-debug.txt | grep ffff80001e0a0000
> [ 277.684981] kvm_create_vm(1236) called kvm=ffff80001e0a0000
> [ 278.158511] vgic_v4_init(267) called kvm=ffff80001e0a0000 doorbell=20812(64)
> [ 545.079117] vgic_set_common_attr(263) called kvm=ffff80001e0a0000
> [ 545.085358] vgic_v3_save_pending_tables(397) called kvm=ffff80001e0a0000
> [ 545.092580] vgic_v3_save_pending_tables(448) called kvm=ffff80001e0a0000
> [ 545.099562] irq: irqd_set_activated: CPU49 IRQ20821 lost IRQD_IRQ_INPROGRESS old=0x10401400 new=0x10401600, expected=0x10441600 kvm=ffff80001e0a0000^M
> [ 545.113177] irq: irqd_set_activated: IRQD_IRQ_INPROGRESS set time [545.099561]^M
> [ 545.121454] irq: irqd_set_activated: IRQD_IRQ_INPROGRESS clr time [545.099562]^M
> [ 545.129755] irq: irqd_set_activated: CPU49 IRQ20826 lost IRQD_IRQ_INPROGRESS old=0x10441400 new=0x10441600, expected=0x10401600 kvm=ffff80001e0a0000^M
> [ 545.143365] irq: irqd_set_activated: IRQD_IRQ_INPROGRESS set time [545.129754]^M
> [ 545.151654] irq: irqd_set_activated: IRQD_IRQ_INPROGRESS clr time [545.129755]^M
> [ 545.163250] kvm_vm_release(1374) called kvm=ffff80001e0a0000
> [ 545.169204] vgic_v4_teardown(323) started kvm=ffff80001e0a0000 doorbell=20812(64)
>
> IRQD_IRQ_INPROGRESS is corrupted before calling kvm_vm_release(),
You keep missing my point. Yes, we have established the interrupt race
and have a way to fix it, let's move on...
What I am asking agin is: is there any overlap between any vgic ioctl
and the teardown of the VM? Do you ever see kvm_vm_release() being
called before kvm_device_release()?
Because that's the overlap I've been talking all along.
M.
--
Without deviation from the norm, progress is not possible.
WARNING: multiple messages have this Message-ID (diff)
From: Marc Zyngier <maz@kernel.org>
To: Shanker Donthineni <sdonthineni@nvidia.com>
Cc: James Morse <james.morse@arm.com>,
Catalin Marinas <catalin.marinas@arm.com>,
Will Deacon <will@kernel.org>,
linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org, Vikram Sethi <vsethi@nvidia.com>,
Zenghui Yu <yuzenghui@huawei.com>,
Oliver Upton <oliver.upton@linux.dev>,
Suzuki K Poulose <suzuki.poulose@arm.com>,
Ard Biesheuvel <ardb@kernel.org>
Subject: Re: [PATCH] KVM: arm64: vgic: Fix soft lockup during VM teardown
Date: Fri, 20 Jan 2023 12:00:01 +0000 [thread overview]
Message-ID: <86r0vpmn5q.wl-maz@kernel.org> (raw)
In-Reply-To: <b3bf3a46-410b-a756-dfd9-ee74d5dc31e0@nvidia.com>
On Fri, 20 Jan 2023 05:02:15 +0000,
Shanker Donthineni <sdonthineni@nvidia.com> wrote:
>
> [1 <text/plain; UTF-8 (7bit)>]
> Hi Marc,
>
> On 1/19/23 08:01, Marc Zyngier wrote:
> >> On 1/19/23 01:11, Marc Zyngier wrote:
> >>> So you can see the VM being torn down while the vgic save sequence is
> >>> still in progress?
> >>>
> >>> If you can actually see that, then this is a much bigger bug than the
> >>> simple race you are describing, and we're missing a reference on the
> >>> kvm structure. This would be a*MAJOR* bug.
> >>>
> >> How do we know vGIC save sequence is in progress while VM is being
> >> teardown? I'm launching/terminating ~32 VMs in a loop to reproduce
> >> the issue.
> > Errr...*you* know when you are issuing the save ioctl, right? You
> > also know when you are terminating the VM (closing its fd or killing
> > the VMM).
> >
>
> Added debug statements to trace the code patch, and tagged each log message
> with 'struct kvm *'. Attached the complete kernel log messages including
> debug messages.
>
> All 32 VMs launched, time 258s to 291s
> [ 258.519837] kvm_create_vm(1236) called kvm=ffff8000303e0000 --> 1st VM
> ...
> [ 291.801179] kvm_create_vm(1236) called kvm=ffff800057a60000 --> 32nd VM
>
> Test script inside VM issues poweroff command after sleeping 200sec.
>
> Working case kvm=ffff8000303e0000:
>
> $ cat gicv4-debug.txt | grep ffff8000303e0000
> [ 258.519837] kvm_create_vm(1236) called kvm=ffff8000303e0000
> [ 258.667101] vgic_v4_init(267) called kvm=ffff8000303e0000 doorbell=140(64)
> [ 517.942167] vgic_set_common_attr(263) called kvm=ffff8000303e0000
> [ 517.948415] vgic_v3_save_pending_tables(397) called kvm=ffff8000303e0000
> [ 517.955602] vgic_v3_save_pending_tables(448) called kvm=ffff8000303e0000
> [ 518.099696] kvm_vm_release(1374) called kvm=ffff8000303e0000
> [ 518.126833] vgic_v4_teardown(323) started kvm=ffff8000303e0000 doorbell=140(64)
> [ 518.134677] vgic_v4_teardown(333) finished kvm=ffff8000303e0000 doorbell=140(64)
>
> Not working case kvm=ffff80001e0a0000:
>
> $ cat gicv4-debug.txt | grep ffff80001e0a0000
> [ 277.684981] kvm_create_vm(1236) called kvm=ffff80001e0a0000
> [ 278.158511] vgic_v4_init(267) called kvm=ffff80001e0a0000 doorbell=20812(64)
> [ 545.079117] vgic_set_common_attr(263) called kvm=ffff80001e0a0000
> [ 545.085358] vgic_v3_save_pending_tables(397) called kvm=ffff80001e0a0000
> [ 545.092580] vgic_v3_save_pending_tables(448) called kvm=ffff80001e0a0000
> [ 545.099562] irq: irqd_set_activated: CPU49 IRQ20821 lost IRQD_IRQ_INPROGRESS old=0x10401400 new=0x10401600, expected=0x10441600 kvm=ffff80001e0a0000^M
> [ 545.113177] irq: irqd_set_activated: IRQD_IRQ_INPROGRESS set time [545.099561]^M
> [ 545.121454] irq: irqd_set_activated: IRQD_IRQ_INPROGRESS clr time [545.099562]^M
> [ 545.129755] irq: irqd_set_activated: CPU49 IRQ20826 lost IRQD_IRQ_INPROGRESS old=0x10441400 new=0x10441600, expected=0x10401600 kvm=ffff80001e0a0000^M
> [ 545.143365] irq: irqd_set_activated: IRQD_IRQ_INPROGRESS set time [545.129754]^M
> [ 545.151654] irq: irqd_set_activated: IRQD_IRQ_INPROGRESS clr time [545.129755]^M
> [ 545.163250] kvm_vm_release(1374) called kvm=ffff80001e0a0000
> [ 545.169204] vgic_v4_teardown(323) started kvm=ffff80001e0a0000 doorbell=20812(64)
>
> IRQD_IRQ_INPROGRESS is corrupted before calling kvm_vm_release(),
You keep missing my point. Yes, we have established the interrupt race
and have a way to fix it, let's move on...
What I am asking agin is: is there any overlap between any vgic ioctl
and the teardown of the VM? Do you ever see kvm_vm_release() being
called before kvm_device_release()?
Because that's the overlap I've been talking all along.
M.
--
Without deviation from the norm, progress is not possible.
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
next prev parent reply other threads:[~2023-01-20 12:00 UTC|newest]
Thread overview: 24+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-01-18 2:23 [PATCH] KVM: arm64: vgic: Fix soft lockup during VM teardown Shanker Donthineni
2023-01-18 2:23 ` Shanker Donthineni
2023-01-18 11:54 ` Marc Zyngier
2023-01-18 11:54 ` Marc Zyngier
2023-01-18 19:24 ` Shanker Donthineni
2023-01-18 19:24 ` Shanker Donthineni
2023-01-19 7:11 ` Marc Zyngier
2023-01-19 7:11 ` Marc Zyngier
2023-01-19 13:00 ` Shanker Donthineni
2023-01-19 13:00 ` Shanker Donthineni
2023-01-19 14:01 ` Marc Zyngier
2023-01-19 14:01 ` Marc Zyngier
2023-01-19 14:16 ` Shanker Donthineni
2023-01-19 14:16 ` Shanker Donthineni
2023-01-20 3:55 ` Shanker Donthineni
2023-01-20 5:02 ` Shanker Donthineni
2023-01-20 12:00 ` Marc Zyngier [this message]
2023-01-20 12:00 ` Marc Zyngier
2023-01-21 15:28 ` Shanker Donthineni
2023-01-21 15:28 ` Shanker Donthineni
2023-01-21 15:35 ` Shanker Donthineni
2023-01-21 15:35 ` Shanker Donthineni
2023-01-23 11:23 ` Marc Zyngier
2023-01-23 11:23 ` Marc Zyngier
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=86r0vpmn5q.wl-maz@kernel.org \
--to=maz@kernel.org \
--cc=ardb@kernel.org \
--cc=catalin.marinas@arm.com \
--cc=james.morse@arm.com \
--cc=kvmarm@lists.linux.dev \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=linux-kernel@vger.kernel.org \
--cc=oliver.upton@linux.dev \
--cc=sdonthineni@nvidia.com \
--cc=suzuki.poulose@arm.com \
--cc=vsethi@nvidia.com \
--cc=will@kernel.org \
--cc=yuzenghui@huawei.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.