From: Marc Zyngier <maz@kernel.org>
To: Gavin Shan <gshan@redhat.com>
Cc: kvmarm@lists.linux.dev, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, pbonzini@redhat.com,
seanjc@google.com, oliver.upton@linux.dev, hshuai@redhat.com,
zhenyzha@redhat.com, shan.gavin@gmail.com,
Andrea Arcangeli <aarcange@redhat.com>,
Peter Xu <peterx@redhat.com>,
David Hildenbrand <david@redhat.com>
Subject: Re: [PATCH] KVM: Avoid illegal stage2 mapping on invalid memory slot
Date: Fri, 09 Jun 2023 09:06:36 +0100 [thread overview]
Message-ID: <877csdnjoz.wl-maz@kernel.org> (raw)
In-Reply-To: <695fe5a2-a295-4105-b31b-4cc92b089659@redhat.com>
On Fri, 09 Jun 2023 00:17:34 +0100,
Gavin Shan <gshan@redhat.com> wrote:
>
> Hi Marc,
>
> [Cc Andrea/David/Peter Xu]
>
> On 6/9/23 12:31 AM, Marc Zyngier wrote:
> > On Thu, 08 Jun 2023 10:03:48 +0100,
> > Gavin Shan <gshan@redhat.com> wrote:
> >>
> >> We run into guest hang in edk2 firmware when KSM is kept as running
> >> on the host. The edk2 firmware is waiting for status 0x80 from QEMU's
> >> pflash device (TYPE_PFLASH_CFI01) during the operation for sector
> >> erasing or buffered write. The status is returned by reading the
> >> memory region of the pflash device and the read request should
> >> have been forwarded to QEMU and emulated by it. Unfortunately, the
> >> read request is covered by an illegal stage2 mapping when the guest
> >> hang issue occurs. The read request is completed with QEMU bypassed and
> >> wrong status is fetched.
> >>
> >> The illegal stage2 mapping is populated due to same page mering by
> >> KSM at (C) even the associated memory slot has been marked as invalid
> >> at (B).
> >>
> >> CPU-A CPU-B
> >> ----- -----
> >> ioctl(kvm_fd, KVM_SET_USER_MEMORY_REGION)
> >> kvm_vm_ioctl_set_memory_region
> >> kvm_set_memory_region
> >> __kvm_set_memory_region
> >> kvm_set_memslot(kvm, old, NULL, KVM_MR_DELETE)
> >> kvm_invalidate_memslot
> >> kvm_copy_memslot
> >> kvm_replace_memslot
> >> kvm_swap_active_memslots (A)
> >> kvm_arch_flush_shadow_memslot (B)
> >> same page merging by KSM
> >> kvm_mmu_notifier_change_pte
> >> kvm_handle_hva_range
> >> __kvm_handle_hva_range (C)
> >>
> >> Fix the issue by skipping the invalid memory slot at (C) to avoid the
> >> illegal stage2 mapping. Without the illegal stage2 mapping, the read
> >> request for the pflash's status is forwarded to QEMU and emulated by
> >> it. The correct pflash's status can be returned from QEMU to break
> >> the infinite wait in edk2 firmware.
> >
> > Huh, nice one :-(.
> >
>
> Yeah, it's a sneaky one :)
>
> >>
> >> Cc: stable@vger.kernel.org # v5.13+
> >> Fixes: 3039bcc74498 ("KVM: Move x86's MMU notifier memslot walkers to generic code")
> >> Reported-by: Shuai Hu <hshuai@redhat.com>
> >> Reported-by: Zhenyu Zhang <zhenyzha@redhat.com>
> >> Signed-off-by: Gavin Shan <gshan@redhat.com>
> >> ---
> >> virt/kvm/kvm_main.c | 3 +++
> >> 1 file changed, 3 insertions(+)
> >>
> >> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> >> index 479802a892d4..7f81a3a209b6 100644
> >> --- a/virt/kvm/kvm_main.c
> >> +++ b/virt/kvm/kvm_main.c
> >> @@ -598,6 +598,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
> >> unsigned long hva_start, hva_end;
> >> slot = container_of(node, struct
> >> kvm_memory_slot, hva_node[slots->node_idx]);
> >> + if (slot->flags & KVM_MEMSLOT_INVALID)
> >> + continue;
> >> +
> >> hva_start = max(range->start, slot->userspace_addr);
> >> hva_end = min(range->end, slot->userspace_addr +
> >> (slot->npages << PAGE_SHIFT));
> >
> > I don't immediately see what makes it safer. If we're not holding one
> > of slots_{,arch_}lock in the notifier, we can still race against the
> > update, can't we? I don't think holding the srcu lock helps us here.
> >
[...]
> change_pte() is always surrounded by invalidate_range_start and
> invalidate_range_end(). It means kvm->mn_active_invalidate_count is always
> larger than zero when change_pte() is called. With this condition
> (kvm->mn_active_invalidate_count > 0), The swapping between the inactive
> and active memory slots by kvm_swap_active_memslots() can't be done.
> So there are two cases for one memory slot when change_pte() is called:
> (a) it has been marked as KVM_MEMSLOT_INVALID in the active memory slots
> by kvm_invalidate_memslot(), called before invalidate_range_start();
> (b) the memory slot has been deleted from the active memory slots. We're
> only concerned by (a) when the active memory slots are iterated in
> __kvm_handle_hva_range().
OK, so to sum it up:
- the memslot cannot be swapped while we're walking the active
memslots because kvm->mn_active_invalidate_count is elevated, and
kvm_swap_active_memslots() will busy loop until this has reached 0
again
- holding the srcu read_lock prevents an overlapping memslot update
from being published at the wrong time (synchronize_srcu_expedited()
in kvm_swap_active_memslots())
If the above holds, then I agree the fix looks correct. I'd definitely
want to see some of this rationale captured in the commit message
though.
Thanks,
M.
>
> static void kvm_mmu_notifier_change_pte(...)
> {
> :
> /*
> * .change_pte() must be surrounded by .invalidate_range_{start,end}().
> * If mmu_invalidate_in_progress is zero, then no in-progress
> * invalidations, including this one, found a relevant memslot at
> * start(); rechecking memslots here is unnecessary. Note, a false
> * positive (count elevated by a different invalidation) is sub-optimal
> * but functionally ok.
> */
> WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
> if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
> return;
> :
> }
>
>
> The srcu lock in __kvm_handle_hva_range() prevents the swapping of
> the active and inactive memory slots by kvm_swap_active_memslots(). For
> this particular case, it's not relevant because the swapping between
> the inactive and active memory slots has been done for once, before
> invalidate_range_start() is called.
--
Without deviation from the norm, progress is not possible.
next prev parent reply other threads:[~2023-06-09 8:06 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-06-08 9:03 [PATCH] KVM: Avoid illegal stage2 mapping on invalid memory slot Gavin Shan
2023-06-08 14:31 ` Marc Zyngier
2023-06-08 23:17 ` Gavin Shan
2023-06-09 8:06 ` Marc Zyngier [this message]
2023-06-09 9:02 ` Gavin Shan
2023-06-12 14:41 ` Sean Christopherson
2023-06-13 1:06 ` Gavin Shan
2023-06-13 14:58 ` Sean Christopherson
2023-06-14 0:01 ` Gavin Shan
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=877csdnjoz.wl-maz@kernel.org \
--to=maz@kernel.org \
--cc=aarcange@redhat.com \
--cc=david@redhat.com \
--cc=gshan@redhat.com \
--cc=hshuai@redhat.com \
--cc=kvm@vger.kernel.org \
--cc=kvmarm@lists.linux.dev \
--cc=linux-kernel@vger.kernel.org \
--cc=oliver.upton@linux.dev \
--cc=pbonzini@redhat.com \
--cc=peterx@redhat.com \
--cc=seanjc@google.com \
--cc=shan.gavin@gmail.com \
--cc=zhenyzha@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.