From: Shanker Donthineni <sdonthineni@nvidia.com>
To: Marc Zyngier <maz@kernel.org>
Cc: James Morse <james.morse@arm.com>,
Catalin Marinas <catalin.marinas@arm.com>,
Will Deacon <will@kernel.org>,
linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org, Vikram Sethi <vsethi@nvidia.com>,
Zenghui Yu <yuzenghui@huawei.com>,
Oliver Upton <oliver.upton@linux.dev>,
Suzuki K Poulose <suzuki.poulose@arm.com>,
Ard Biesheuvel <ardb@kernel.org>
Subject: Re: [PATCH] KVM: arm64: vgic: Fix soft lockup during VM teardown
Date: Thu, 19 Jan 2023 08:16:59 -0600
Message-ID: <5e70b4e9-94e4-e673-db37-03dcb5551a62@nvidia.com>
In-Reply-To: <86wn5imxm9.wl-maz@kernel.org>
Hi Marc,
On 1/19/23 08:01, Marc Zyngier wrote:
>
>
> On Thu, 19 Jan 2023 13:00:49 +0000,
> Shanker Donthineni <sdonthineni@nvidia.com> wrote:
>>
>>
>>
>> On 1/19/23 01:11, Marc Zyngier wrote:
>>> So you can see the VM being torn down while the vgic save sequence is
>>> still in progress?
>>>
>>> If you can actually see that, then this is a much bigger bug than the
>>> simple race you are describing, and we're missing a reference on the
>>> kvm structure. This would be a *MAJOR* bug.
>>>
>> How do we know the vGIC save sequence is in progress while the VM is
>> being torn down? I'm launching/terminating ~32 VMs in a loop to
>> reproduce the issue.
>
> Errr... *you* know when you are issuing the save ioctl, right? You
> also know when you are terminating the VM (closing its fd or killing
> the VMM).
>
>>
>>> Please post the full traces, not snippets. The absolutely full kernel
>>> log, the configuration, what you run, how you run it, *EVERYTHING*. I
>>> need to be able to reproduce this.
>> Sure, I'll share the complete boot log of the host kernel on the next run.
>>
>>>
>>>>
>>>>>>
>>>>>> irqreturn_t handle_irq_event(struct irq_desc *desc)
>>>>>> {
>>>>>>         irqreturn_t ret;
>>>>>>
>>>>>>         irqd_set(&desc->irq_data, IRQD_IRQ_INPROGRESS);
>>>>>>         raw_spin_unlock(&desc->lock);
>>>>>>
>>>>>>         ret = handle_irq_event_percpu(desc);
>>>>>>
>>>>>>         raw_spin_lock(&desc->lock);
>>>>>>         irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS);
>>>>>>         return ret;
>>>>>> }
>>>>>
>>>>> How is that relevant to this trace? Do you see this function running
>>>>> concurrently with the teardown? If it matters here, it must be a VPE
>>>>> doorbell, right? But you claim that this is on a GICv4 platform, while
>>>>> this would only affect GICv4.1... Or are you using GICv4.1?
>>>>>
>>>> handle_irq_event() is running concurrently with irq_domain_activate_irq(),
>>>> which happens before free_irq() is called. The corruption occurs at
>>>> [78.983544] and the teardown starts at [87.360891].
>>>
>>> But that doesn't match the description you made of concurrent
>>> events. Does it take more than 9 seconds for the vgic state to be
>>> saved to memory?
>>
>> Is there any way the IRQD_IRQ_INPROGRESS state bit could be corrupted
>> other than by concurrent accesses?
>
> Forget about this bit. You said that we could see the VM teardown
> happening *at the same time* as the vgic state saving, despite the
> vgic device holding a reference on the kvm structure. If that's the
> case, this bit is the least of our worries. Think of the consequences
> for a second...
>
> [...]
>
>> I'm using the steps below to launch/terminate 32 VMs in a loop. The
>> failure is intermittent. The same issue is reproducible with KVMTOOL
>> as well.
>
> kvmtool never issues a KVM_DEV_ARM_VGIC_GRP_CTRL with the
> KVM_DEV_ARM_ITS_SAVE_TABLES argument, so the code path we discussed is
> never used. What is the exact problem you're observing with kvmtool
> as the VMM?
>
Unfortunately I didn't capture the log messages of the kvmtool-based VM
boot; I saw a kernel crash once. I'm now trying to reproduce the issue
with QEMU and will share the complete log messages.
> M.
>
> --
> Without deviation from the norm, progress is not possible.
Thread overview: 11+ messages
2023-01-18 2:23 [PATCH] KVM: arm64: vgic: Fix soft lockup during VM teardown Shanker Donthineni
2023-01-18 11:54 ` Marc Zyngier
2023-01-18 19:24 ` Shanker Donthineni
2023-01-19 7:11 ` Marc Zyngier
2023-01-19 13:00 ` Shanker Donthineni
2023-01-19 14:01 ` Marc Zyngier
2023-01-19 14:16 ` Shanker Donthineni [this message]
[not found] ` <b3bf3a46-410b-a756-dfd9-ee74d5dc31e0@nvidia.com>
2023-01-20 12:00 ` Marc Zyngier
2023-01-21 15:28 ` Shanker Donthineni
2023-01-21 15:35 ` Shanker Donthineni
2023-01-23 11:23 ` Marc Zyngier