public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed
From: Marc Zyngier <maz@kernel.org>
To: Gavin Shan <gshan@redhat.com>
Cc: kvmarm@lists.linux.dev, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, pbonzini@redhat.com,
	seanjc@google.com, oliver.upton@linux.dev, hshuai@redhat.com,
	zhenyzha@redhat.com, shan.gavin@gmail.com,
	Andrea Arcangeli <aarcange@redhat.com>,
	Peter Xu <peterx@redhat.com>,
	David Hildenbrand <david@redhat.com>
Subject: Re: [PATCH] KVM: Avoid illegal stage2 mapping on invalid memory slot
Date: Fri, 09 Jun 2023 09:06:36 +0100	[thread overview]
Message-ID: <877csdnjoz.wl-maz@kernel.org> (raw)
In-Reply-To: <695fe5a2-a295-4105-b31b-4cc92b089659@redhat.com>

On Fri, 09 Jun 2023 00:17:34 +0100,
Gavin Shan <gshan@redhat.com> wrote:
> 
> Hi Marc,
> 
> [Cc Andrea/David/Peter Xu]
> 
> On 6/9/23 12:31 AM, Marc Zyngier wrote:
> > On Thu, 08 Jun 2023 10:03:48 +0100,
> > Gavin Shan <gshan@redhat.com> wrote:
> >> 
> >> We run into guest hang in edk2 firmware when KSM is kept as running
> >> on the host. The edk2 firmware is waiting for status 0x80 from QEMU's
> >> pflash device (TYPE_PFLASH_CFI01) during the operation for sector
> >> erasing or buffered write. The status is returned by reading the
> >> memory region of the pflash device and the read request should
> >> have been forwarded to QEMU and emulated by it. Unfortunately, the
> >> read request is covered by an illegal stage2 mapping when the guest
> >> hang issue occurs. The read request is completed with QEMU bypassed and
> >> wrong status is fetched.
> >> 
> >> The illegal stage2 mapping is populated due to same page mering by
> >> KSM at (C) even the associated memory slot has been marked as invalid
> >> at (B).
> >> 
> >>    CPU-A                    CPU-B
> >>    -----                    -----
> >>                             ioctl(kvm_fd, KVM_SET_USER_MEMORY_REGION)
> >>                             kvm_vm_ioctl_set_memory_region
> >>                             kvm_set_memory_region
> >>                             __kvm_set_memory_region
> >>                             kvm_set_memslot(kvm, old, NULL, KVM_MR_DELETE)
> >>                               kvm_invalidate_memslot
> >>                                 kvm_copy_memslot
> >>                                 kvm_replace_memslot
> >>                                 kvm_swap_active_memslots        (A)
> >>                                 kvm_arch_flush_shadow_memslot   (B)
> >>    same page merging by KSM
> >>    kvm_mmu_notifier_change_pte
> >>    kvm_handle_hva_range
> >>    __kvm_handle_hva_range       (C)
> >> 
> >> Fix the issue by skipping the invalid memory slot at (C) to avoid the
> >> illegal stage2 mapping. Without the illegal stage2 mapping, the read
> >> request for the pflash's status is forwarded to QEMU and emulated by
> >> it. The correct pflash's status can be returned from QEMU to break
> >> the infinite wait in edk2 firmware.
> > 
> > Huh, nice one :-(.
> > 
> 
> Yeah, it's a sneaky one :)
> 
> >> 
> >> Cc: stable@vger.kernel.org # v5.13+
> >> Fixes: 3039bcc74498 ("KVM: Move x86's MMU notifier memslot walkers to generic code")
> >> Reported-by: Shuai Hu <hshuai@redhat.com>
> >> Reported-by: Zhenyu Zhang <zhenyzha@redhat.com>
> >> Signed-off-by: Gavin Shan <gshan@redhat.com>
> >> ---
> >>   virt/kvm/kvm_main.c | 3 +++
> >>   1 file changed, 3 insertions(+)
> >> 
> >> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> >> index 479802a892d4..7f81a3a209b6 100644
> >> --- a/virt/kvm/kvm_main.c
> >> +++ b/virt/kvm/kvm_main.c
> >> @@ -598,6 +598,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
> >>   			unsigned long hva_start, hva_end;
> >>     			slot = container_of(node, struct
> >> kvm_memory_slot, hva_node[slots->node_idx]);
> >> +			if (slot->flags & KVM_MEMSLOT_INVALID)
> >> +				continue;
> >> +
> >>   			hva_start = max(range->start, slot->userspace_addr);
> >>   			hva_end = min(range->end, slot->userspace_addr +
> >>   						  (slot->npages << PAGE_SHIFT));
> > 
> > I don't immediately see what makes it safer. If we're not holding one
> > of slots_{,arch_}lock in the notifier, we can still race against the
> > update, can't we?  I don't think holding the srcu lock helps us here.
> > 

[...]

> change_pte() is always surrounded by invalidate_range_start and
> invalidate_range_end(). It means kvm->mn_active_invalidate_count is always
> larger than zero when change_pte() is called. With this condition
> (kvm->mn_active_invalidate_count > 0), The swapping between the inactive
> and active memory slots by kvm_swap_active_memslots() can't be done.
> So there are two cases for one memory slot when change_pte() is called:
> (a) it has been marked as KVM_MEMSLOT_INVALID in the active memory slots
> by kvm_invalidate_memslot(), called before invalidate_range_start();
> (b) the memory slot has been deleted from the active memory slots. We're
> only concerned by (a) when the active memory slots are iterated in
> __kvm_handle_hva_range().

OK, so to sum it up:

- the memslot cannot be swapped while we're walking the active
  memslots because kvm->mn_active_invalidate_count is elevated, and
  kvm_swap_active_memslots() will busy loop until this has reached 0
  again

- holding the srcu read_lock prevents an overlapping memslot update
  from being published at the wrong time (synchronize_srcu_expedited()
  in kvm_swap_active_memslots())

If the above holds, then I agree the fix looks correct. I'd definitely
want to see some of this rationale captured in the commit message
though.

Thanks,

	M.

> 
> static void kvm_mmu_notifier_change_pte(...)
> {
>         :
>         /*
>          * .change_pte() must be surrounded by .invalidate_range_{start,end}().
>          * If mmu_invalidate_in_progress is zero, then no in-progress
>          * invalidations, including this one, found a relevant memslot at
>          * start(); rechecking memslots here is unnecessary.  Note, a false
>          * positive (count elevated by a different invalidation) is sub-optimal
>          * but functionally ok.
>          */
>         WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
>         if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
>                 return;
>         :
> }
> 
> 
> The srcu lock in __kvm_handle_hva_range() prevents the swapping of
> the active and inactive memory slots by kvm_swap_active_memslots(). For
> this particular case, it's not relevant because the swapping between
> the inactive and active memory slots has been done for once, before
> invalidate_range_start() is called.


-- 
Without deviation from the norm, progress is not possible.

  reply	other threads:[~2023-06-09  8:07 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-06-08  9:03 [PATCH] KVM: Avoid illegal stage2 mapping on invalid memory slot Gavin Shan
2023-06-08 14:31 ` Marc Zyngier
2023-06-08 23:17   ` Gavin Shan
2023-06-09  8:06     ` Marc Zyngier [this message]
2023-06-09  9:02       ` Gavin Shan
2023-06-12 14:41 ` Sean Christopherson
2023-06-13  1:06   ` Gavin Shan
2023-06-13 14:58     ` Sean Christopherson
2023-06-14  0:01       ` Gavin Shan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=877csdnjoz.wl-maz@kernel.org \
    --to=maz@kernel.org \
    --cc=aarcange@redhat.com \
    --cc=david@redhat.com \
    --cc=gshan@redhat.com \
    --cc=hshuai@redhat.com \
    --cc=kvm@vger.kernel.org \
    --cc=kvmarm@lists.linux.dev \
    --cc=linux-kernel@vger.kernel.org \
    --cc=oliver.upton@linux.dev \
    --cc=pbonzini@redhat.com \
    --cc=peterx@redhat.com \
    --cc=seanjc@google.com \
    --cc=shan.gavin@gmail.com \
    --cc=zhenyzha@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox